Databricks CI/CD: Azure DevOps Pipeline + DABs

  • Published Jan 5, 2025

COMMENTS •

  • @luiscarcamo8421
    17 days ago +1

    Thanks, Dustin! This helped me a lot with a production pipeline!

  • @gangadharneelam3107
    4 months ago +2

    Hey Dustin,
    We're currently exploring DABs, and it feels like this was made just for us!😅
    Thanks a lot for sharing it!

  • @BrianMurrays
    3 days ago

    Thanks for the video; I had been looking into how to set this up for a while, and this video finally got me to a working process. I have almost all of this set up in my environment, but the most recent issue I'm running into is that if I develop locally and run a DLT pipeline from VS Code, it sets everything up with my credentials. When I merge to my dev branch, that triggers the CI/CD pipeline (running as the service principal), and the step that runs the job throws an error that the tables defined in the DLT pipeline are managed by another pipeline (the one with my credentials). If I use DLT, do I just never test from VS Code, or do I need to go clean those up each time? Is there a better way to manage this?

  • @אופיראוחיון-ס8י
    1 month ago +1

    Thank you!
    I have a few processes that are not related to each other. Do I need to create a separate DAB for each one? How can I make the process more dynamic?

    • @DustinVannoy
      16 days ago +1

      The general guidance is: if you want to deploy together and the code can be versioned together, put it in the same bundle (all using the same databricks.yml). If you want to keep things separate, it's fine to have separate bundles, and you can either deploy them in separate CD pipelines or in the same one by calling `databricks bundle deploy` multiple times, once from each directory with a databricks.yml.
      For making it more dynamic I suggest variables, especially complex variables, but usually that is just to change values based on the target environment (see the sketch below). Using the SDK to create workflows is an alternative to DABs, and other things have been discussed which might eventually be more of a blend between the two options.
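
      For example, a minimal sketch of a complex variable in databricks.yml (the variable name, node types, and worker counts here are just placeholders):

      variables:
        cluster_config:
          description: Cluster settings that change per target
          type: complex
          default:
            spark_version: 15.4.x-scala2.12
            node_type_id: Standard_D4ds_v5
            num_workers: 1

      targets:
        prod:
          variables:
            cluster_config:
              spark_version: 15.4.x-scala2.12
              node_type_id: Standard_D8ds_v5
              num_workers: 4

      A job cluster can then reference the whole object with `new_cluster: ${var.cluster_config}`, so only the variable value changes per target.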

    • @אופיראוחיון-ס8י
      16 days ago

      Thank you very much!

  • @moncefansseti1907
    2 months ago +1

    Hey Dustin, if we want to add more resources like ADLS bronze, silver, and gold storage, do we need to add them to the environment variables?

    • @DustinVannoy
      16 days ago

      You can deploy schemas within Unity Catalog, but for external storage locations or volumes I would expect those to either happen from Terraform or as notebooks/scripts that you run in the deploy pipeline. Jobs to populate the storage would be defined in DABs, but not the creation of the storage itself, unless it's built into a job you trigger with bundle run.
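
      For example, a schema can be declared as a bundle resource; a minimal sketch (the catalog and schema names are placeholders):

      resources:
        schemas:
          bronze_schema:
            catalog_name: main
            name: bronze
            comment: Raw data managed by this bundle

      The schema is then created or updated as part of `databricks bundle deploy`.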

  • @benjamingeyer8907
    4 months ago

    Now do it in Terraform ;)
    Great video as always!

    • @DustinVannoy
      4 months ago +1

      🤣🤣 it may happen one day, but not today. I would probably need help from build5nines.com

  • @unilmittakola
    3 months ago

    Hey Dustin,
    We're currently implementing Databricks Asset Bundles using Azure DevOps to deploy workflows. The bundles we are using are stored in GitHub. Can you please help me with the YAML script for it?

  • @thusharr7787
    4 months ago

    Thanks. One question: I have some metadata files in the project folder and I need to copy them to a volume in Unity Catalog. Is that possible through this deploy process?

    • @DustinVannoy
      4 months ago

      Using the Databricks CLI, you can run a command that copies data up to a volume. Replace all the curly-brace { } parts with your own values.
      databricks fs cp --overwrite {local_path} dbfs:/Volumes/{catalog}/{schema}/{volume_name}/{filename}
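
      For example, as a step in an Azure DevOps pipeline (the paths, catalog, schema, and volume names below are placeholders, and DATABRICKS_HOST/DATABRICKS_TOKEN are assumed to be available as pipeline variables):

      - script: |
          databricks fs cp --overwrite metadata/config.json dbfs:/Volumes/main/ops/metadata/config.json
        displayName: Copy metadata file to Unity Catalog volume
        env:
          DATABRICKS_HOST: $(DATABRICKS_HOST)
          DATABRICKS_TOKEN: $(DATABRICKS_TOKEN)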

  • @albertwang1134
    3 months ago

    Hi Dustin, have you tried to configure and deploy a single node cluster by using Databricks Bundle?

    • @DustinVannoy
      3 months ago

      Yes, it is possible. It looks something like this:
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: m6gd.xlarge
            num_workers: 0
            data_security_mode: SINGLE_USER
            spark_conf:
              spark.master: local[*, 4]
              spark.databricks.cluster.profile: singleNode
            custom_tags: {"ResourceClass": "SingleNode"}

    • @albertwang1134
      3 months ago

      @DustinVannoy Thanks a lot! This cannot be found in the Databricks documentation.

  • @albertwang1134
    4 months ago

    I am learning DABs at this moment. So lucky that I found this video. Thank you, @DustinVannoy. Do you mind if I ask a couple of questions?

    • @DustinVannoy
      4 months ago

      Yes, ask away. I'll answer what I can.

    • @albertwang1134
      4 months ago

      Thank you, @DustinVannoy. I wonder whether the following development process makes sense, and whether there is anything we could improve.
      Background:
      (1) We have two Azure Databricks workspaces, one for development and one for production.
      (2) I am the only Data Engineer on our team, and we don't have a dedicated QA. I am responsible for development and testing. Those who consume the data will do UAT.
      (3) We use Azure DevOps (repository and pipelines).
      Process:
      (1) Initialization
      (1.1) Create a new project by using `databricks bundle init`
      (1.2) Push the new project to Azure DevOps
      (1.3) On the development DBR workspace, create a Git folder under `/Users/myname/` and link it to the Azure DevOps repository
      (2) Development
      (2.1) Create a feature branch on the DBR workspace
      (2.2) Do my development and manual testing
      (2.3) Create a unit test job and the scheduled daily job
      (2.4) Create a pull request from the feature branch to the main branch on the DBR workspace
      (3) CI
      (3.1) An Azure CI pipeline (build pipeline) will be triggered after the pull request is created
      (3.2) The CI pipeline will check out the feature branch and do `databricks bundle deploy` and `databricks bundle run --job the_unit_test_job` on the development DBR workspace using a Service Principal (see the pipeline sketch after this comment)
      (3.3) The test result will show on the pull request
      (4) CD
      (4.1) If everything looks good, the pull request will be approved
      (4.2) Manually trigger an Azure CD pipeline (release pipeline). Check out the main branch and do `databricks bundle deploy` to the production DBR workspace using a Service Principal
      Explanation:
      (1) Because we are a small team and I am the only person who works on this, we do not have a `release` branch, to simplify the process
      (2) For the same reason, we also do not have a staging DBR workspace
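
      For context, a minimal sketch of what the CI (build) pipeline in step (3) could look like; the variable group name is a placeholder and is assumed to expose DATABRICKS_HOST and DATABRICKS_TOKEN for the service principal:

      trigger: none   # run via PR validation / branch policy instead of every push

      pool:
        vmImage: ubuntu-latest

      variables:
        - group: databricks-dev   # placeholder variable group with DATABRICKS_HOST / DATABRICKS_TOKEN

      steps:
        - checkout: self

        - script: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
          displayName: Install Databricks CLI

        - script: |
            databricks bundle deploy -t dev
            databricks bundle run -t dev the_unit_test_job
          displayName: Deploy bundle and run unit test job
          env:
            DATABRICKS_HOST: $(DATABRICKS_HOST)
            DATABRICKS_TOKEN: $(DATABRICKS_TOKEN)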

    • @DustinVannoy
      4 months ago +1

      Overall process is good. It's typical not to have a separate QA person. I try to use a YAML pipeline for the release step so the code looks pretty similar to what you use to automate deploys to dev (a sketch of that is below). I recommend having unit tests you can easily run as you build, which is why I try to use Databricks Connect to run a few specific unit tests at a time. But running workflows on all-purpose or serverless compute isn't too bad an option for quick testing as you develop.
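
      For the release step, a minimal sketch of a YAML deployment stage, assuming a prod target exists in databricks.yml; the environment and variable group names are placeholders:

      stages:
        - stage: deploy_prod
          jobs:
            - deployment: deploy_bundle
              environment: production   # placeholder Azure DevOps environment, can require approvals
              pool:
                vmImage: ubuntu-latest
              variables:
                - group: databricks-prod   # placeholder group with the service principal's credentials
              strategy:
                runOnce:
                  deploy:
                    steps:
                      - checkout: self
                      - script: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
                        displayName: Install Databricks CLI
                      - script: databricks bundle deploy -t prod
                        displayName: Deploy bundle to production
                        env:
                          DATABRICKS_HOST: $(DATABRICKS_HOST)
                          DATABRICKS_TOKEN: $(DATABRICKS_TOKEN)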

  • @fb-gu2er
    3 months ago

    Now do AWS 😂

    • @DustinVannoy
      3 months ago

      Meaning AWS account with GitHub Actions? If not, what combo of tools are you curious about for the deployment?

  • @anindyabanerjee5733
    16 days ago

    @DustinVannoy Will this work with a Databricks personal access token instead of a Service Connection/Service Principal?

    • @DustinVannoy
      16 days ago

      Yes, but for deploying DABs to staging/prod you want to use the same user every time so that it is consistently the owner. For GitHub Actions I use a token stored in a secret. I think you could pull it from Key Vault in a DevOps pipeline; I'm not positive on the best practice there (one possible sketch below).
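
      For illustration, one possible approach (not necessarily best practice) is a variable group linked to Azure Key Vault, with the secret mapped into the CLI step; the group and secret names below are placeholders:

      variables:
        - group: databricks-keyvault   # placeholder group linked to Azure Key Vault, exposing databricks-pat

      steps:
        - script: databricks bundle deploy -t prod
          displayName: Deploy bundle using a personal access token
          env:
            DATABRICKS_HOST: $(DATABRICKS_HOST)
            DATABRICKS_TOKEN: $(databricks-pat)   # secret pulled from Key Vault via the variable group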