Thanks, Dustin! You've helped me a lot with my production pipeline!
Hey Dustin,
We're currently exploring DABs, and it feels like this was made just for us!😅
Thanks a lot for sharing it!
Thanks for the video; I had been looking into how to set this up for a while, and this video finally got me to a working process. I just about have all of this set up in my environment, but the most recent issue I'm running into is that if I develop locally and run a DLT pipeline from VS Code, it sets everything up with my credentials. When I merge to my dev branch, that triggers the CI/CD pipeline (running as the service principal), and the step that runs the job throws an error that the tables defined in the DLT pipeline are managed by another pipeline (the one created with my credentials). If I use DLT, do I just never test from VS Code, or do I need to go clean those up each time? Is there a better way to manage this?
Thank you!
I have a few processes that are not related to each other. Do I need to create a separate DAB for each one? How can I make the process more dynamic?
The general guidance is: if you want to deploy together and the code can be versioned together, put it in the same bundle (all using the same databricks.yml). If you want to keep things separate, it's fine to have separate bundles, and you can either deploy them in separate CD pipelines or in the same one by calling `databricks bundle deploy` multiple times, once from each directory that has a databricks.yml.
For making it more dynamic I suggest variables, especially complex variables, but usually those are just for changing values based on the target environment. Using the SDK to create workflows is an alternative to DABs, and other things have been discussed that might eventually be more of a blend between the two options.
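For illustration, here is a minimal sketch of a complex variable in databricks.yml with a per-target override (the variable name, node types, and target names are just placeholders):

variables:
  cluster_config:
    description: Cluster settings that differ per environment
    type: complex
    default:
      spark_version: 14.3.x-scala2.12
      node_type_id: Standard_DS3_v2
      num_workers: 1

targets:
  prod:
    variables:
      cluster_config:
        spark_version: 14.3.x-scala2.12
        node_type_id: Standard_DS4_v2
        num_workers: 4

A job cluster can then reference the whole block with new_cluster: ${var.cluster_config}.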
Thank you very much!
Hey Dustin, if we want to add more resources, like ADLS bronze, silver, and gold storage, do we need to add them to the environment variables?
You can deploy schemas within Unity Catalog, but for external storage locations or volumes I would expect those to either happen from Terraform or as notebooks/scripts that you run in the deploy pipeline. Jobs to populate the storage would be defined in DABs, but not the creation of the storage itself, unless it's built into a job you trigger with bundle run.
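As a sketch of the schema piece, a bundle can declare a Unity Catalog schema directly in databricks.yml (the catalog and schema names below are placeholders):

resources:
  schemas:
    bronze:
      catalog_name: my_dev_catalog
      name: bronze
      comment: Raw ingested data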
Now do it in Terraform ;)
Great video as always!
🤣🤣 it may happen one day, but not today. I would probably need help from build5nines.com
Hey Dustin,
We're currently implementing Databricks Asset Bundles using Azure DevOps to deploy workflows. The bundles we are using are stored in GitHub. Can you please help me with the YAML script for this?
Thanks, one question: I have some metadata files in the project folder that I need to copy to a volume in Unity Catalog. Is that possible through this deploy process?
Using the Databricks CLI, you can add a command that copies the files up to a volume. Replace all the curly brace { } parts with your own values.
databricks fs cp --overwrite {local_path} dbfs:/Volumes/{catalog}/{schema}/{volume_name}/{filename}
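For example, with made-up paths and names, this could run as a step in the release pipeline right after databricks bundle deploy:
databricks fs cp --overwrite ./metadata/table_mappings.json dbfs:/Volumes/dev_catalog/config/metadata_files/table_mappings.json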
Hi Dustin, have you tried to configure and deploy a single-node cluster using Databricks Asset Bundles?
Yes, it is possible. It looks something like this:
job_clusters:
  - job_cluster_key: job_cluster
    new_cluster:
      spark_version: 14.3.x-scala2.12
      node_type_id: m6gd.xlarge
      num_workers: 0
      data_security_mode: SINGLE_USER
      spark_conf:
        spark.master: local[*, 4]
        spark.databricks.cluster.profile: singleNode
      custom_tags: {"ResourceClass": "SingleNode"}
@DustinVannoy Thanks a lot! I couldn't find this in the Databricks documentation.
I am learning DABs at the moment, so I'm lucky that I found this video. Thank you, @DustinVannoy. Do you mind if I ask a couple of questions?
Yes, ask away. I'll answer what I can.
Thank you, @DustinVannoy. I wonder whether the following development process makes sense, and whether there is anything we could improve.
Background:
(1) We have two Azure Databricks workspaces, one is for development, one is for production.
(2) I am the only Data Engineer on our team, and we don't have a dedicated QA. I am responsible for development and testing. Those who consume the data will do UAT.
(3) We use Azure DevOps (repository and pipelines).
Process:
(1) Initialization
(1.1) Create a new project by using `databricks bundle init`
(1.2) Push the new project to Azure DevOps
(1.3) On the development DBR workspace, create a Git folder under `/Users/myname/` and link it to the Azure DevOps repository
(2) Development
(2.1) Create a feature branch on the DBR workspace
(2.2) Do my development and manual testing
(2.3) Create a unit test job and the scheduled daily job
(2.4) Create a pull request from the feature branch to the main branch on DBR workspace
(3) CI
(3.1) An Azure CI pipeline (build pipeline) will be triggered after the pull request is created
(3.2) The CI pipeline will check out the feature branch and run `databricks bundle deploy` and `databricks bundle run the_unit_test_job` on the development DBR workspace using a Service Principal.
(3.3) The test result will show on the pull request
(4) CD
(4.1) If everything looks good, the pull request will be approved
(4.2) Manually trigger an Azure CD pipeline (release pipeline). Check out the main branch and run `databricks bundle deploy` against the production DBR workspace using a Service Principal
Explanation:
(1) Because we are a small team and I am the only person who works on this, we do not have a `release` branch, to keep the process simple
(2) For the same reason, we also do not have a staging DBR workspace
Overall the process is good. It's typical not to have a separate QA person. I try to use a YAML pipeline for the release step so the code looks pretty similar to what you use to automate deploys to dev. I recommend having unit tests you can easily run as you build, which is why I try to use Databricks Connect to run a few specific unit tests at a time. But running workflows on all-purpose or serverless compute isn't too bad an option for quick testing as you develop.
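As a rough sketch only (the target name and the Service Principal pipeline variables are placeholders for your own setup), the release step in Azure Pipelines YAML could look like this; the dev deploy step would be the same with a different target:

steps:
  - script: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
    displayName: Install Databricks CLI
  - script: |
      databricks bundle validate -t prod
      databricks bundle deploy -t prod
    displayName: Deploy bundle to production
    env:
      DATABRICKS_HOST: $(DATABRICKS_HOST)
      ARM_TENANT_ID: $(ARM_TENANT_ID)
      ARM_CLIENT_ID: $(ARM_CLIENT_ID)
      ARM_CLIENT_SECRET: $(ARM_CLIENT_SECRET)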
Now do AWS 😂
Meaning AWS account with GitHub Actions? If not, what combo of tools are you curious about for the deployment?
@DustinVannoy Will this work with a Databricks Personal Access Token instead of a Service Connection/Service Principal?
Yes, but for deploying DABs to Staging/Prod you want to use the same user every time so they are consistently the owner. For GitHub Actions I use a token stored in a secret. I think you could pull it from Key Vault in a DevOps pipeline; I'm not positive on the best practice there.
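For reference, a minimal GitHub Actions sketch using a PAT stored as a repository secret (the secret names and target are placeholders):

name: deploy-bundle
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle deploy -t prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}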