Exciting stuff! Will definitely be trying to implement this in my future work!
Thanks a lot Dustin... Really appreciate it :)
Thanks for the video. It helped me a lot in my YT channel.
Great video, learned a lot!
I do have a question: would it make sense to define a base environment for serverless notebooks and jobs, and reference that default environment throughout the bundle? Ideally it would be in one spot, so upgrading the package versions would be simple and easy to test. This way developers could be sure that any package they get used to is available across the whole bundle.
The idea makes sense, but the way environments interact with workflows is still different depending on which task type you use. Plus, you can't use them with standard clusters at this point. So it depends on how much variety you have in your jobs, which is why I don't really include that in my repo yet.
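For anyone who wants to experiment anyway, here is a rough sketch of the pattern (not something from my repo — job and package names are placeholders, and it assumes the serverless environments spec with client and dependencies). A YAML anchor lets you keep the spec in one spot, but only within a single file:
resources:
  jobs:
    job_a:
      name: job_a
      environments:
        - environment_key: base
          spec: &base_env            # shared spec, upgrade package versions here only
            client: "1"
            dependencies:
              - pandas==2.2.2
              - requests==2.32.3
      tasks:
        - task_key: main
          environment_key: base
          python_wheel_task:
            package_name: my_package
            entry_point: main
    job_b:
      name: job_b
      environments:
        - environment_key: base
          spec: *base_env            # reuse the anchored spec instead of copying it
      tasks:
        - task_key: main
          environment_key: base
          python_wheel_task:
            package_name: my_package
            entry_point: other_entry_point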
Loving bundles so far. The only issue I've had so far is that the Databricks VS Code extension seems to be modifying my bundle yml file behind the scenes. For example, when I attach to a cluster in the extension, it will override my job cluster to use that attached cluster when I deploy to the dev target in development mode.
Which version of the extension are you on, 1.3.0?
@@DustinVannoy Yup, I did have it on a pre-release, which I thought was the issue, but I switched back to 1.3.0 and the "feature" persisted.
Thanks Dustin for the video.
Is there a way to specify that a subset of resources (workflows, DLT pipelines) should run in a specific environment?
For example, I would like to deploy only the unit test job in DEV and not in the PROD environment.
You would need to define the job in the targets section of only the targets you want it in. If it needs to go to more than one environment, use a YAML anchor to avoid code duplication. I would normally just let a testing job get deployed to prod without a schedule, but others can't allow that or prefer not to do it that way.
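Rough sketch of what I mean, with placeholder hosts and paths (the anchor only works because all targets live in the same YAML file):
targets:
  dev:
    mode: development
    workspace:
      host: https://adb-1111111111111111.1.azuredatabricks.net
    resources:
      jobs:
        unit_test_job: &unit_test_job    # anchor so another target can reuse the definition
          name: unit_test_job
          tasks:
            - task_key: run_tests
              notebook_task:
                notebook_path: ../tests/run_unit_tests.ipynb
  test:
    workspace:
      host: https://adb-2222222222222222.2.azuredatabricks.net
    resources:
      jobs:
        unit_test_job: *unit_test_job    # same job, no copy/paste
  prod:
    mode: production
    workspace:
      host: https://adb-3333333333333333.3.azuredatabricks.net
    # no unit_test_job here, so it never gets deployed to prod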
Is there a way to define policies as a resource and deploy them? I have some 15 to 20 policies, and my jobs can use any of them. If there is a way to manage these policies so that policy changes get applied, it would be very convenient.
Great video Dustin! Especially on the advanced configuration of the databricks.yaml.
I'd like to hear your opinion on the /src in the root of the folder. If your team/organisation is used to working with a mono repo, it would be great to have all common packages in the root; however, if you're more of a polyrepo kind of team/organisation, building and hosting the packages remotely (e.g. Nexus or something) could be a better approach in my opinion. Or am I missing something?
How would you deal with a job where task 1 and task 2 have source code with conflicting dependencies?
Is there a way for Python wheel tasks to keep the functionality we had without serverless, where you could use:
libraries:
  - whl: ../dist/*.whl
so that the wheel gets deployed automatically when using serverless? When I try to include environments for serverless, I can no longer specify libraries for the wheel task (and therefore it is not deployed automatically), and I also need to hardcode the path to the wheel in the workspace.
I could not find an example for that so far.
All the best,
Thomas
Are you trying to install the wheel in a notebook task, so you are required to install it with %pip install?
If you include the artifacts section it should build and upload the wheel regardless of usage in a task. You can predict the path within the .bundle deploy if you aren't setting mode: development, but I've been uploading it to a specific workspace or volume location.
As environments for serverless evolve I may come back with more examples of how those should be used.
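Here is the kind of thing I mean, just as a sketch (the build command, paths, and package name are placeholders; newer CLI versions may rewrite the local wheel path in the environment spec for you, older ones need the uploaded workspace or volume path instead):
artifacts:
  my_wheel:
    type: whl
    build: python -m build --wheel     # or poetry build, etc.
    path: .

resources:
  jobs:
    wheel_job:
      name: wheel_job
      environments:
        - environment_key: default
          spec:
            client: "1"
            dependencies:
              # if your CLI version does not translate this local path,
              # point it at the workspace or volume location you upload to
              - ../dist/*.whl
      tasks:
        - task_key: main
          environment_key: default
          python_wheel_task:
            package_name: my_package
            entry_point: main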
Great content!! I am trying to deploy the same job into different environments (DEV/QA/PRD). I want to override parameters passed to the job from a variable group defined in the Azure DevOps portal. Can you please suggest how to proceed with this?
The part that references the variable group PrdVariables shows how to set different variables and values depending on the target environment:
- stage: toProduction
  variables:
    - group: PrdVariables
  condition: |
    eq(variables['Build.SourceBranch'], 'refs/heads/main')
In the part where you deploy the bundle, you can pass in variable values. See the docs for how that can be set. docs.databricks.com/en/dev-tools/bundles/settings.html#set-a-variables-value
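As a sketch, the deploy step can pass the variable group's values straight into the bundle (the variable and secret names here are placeholders):
- script: |
    databricks bundle deploy -t prod --var="catalog_name=$(catalog_name)"
  displayName: Deploy bundle to prod
  env:
    DATABRICKS_HOST: $(DATABRICKS_HOST)
    DATABRICKS_CLIENT_ID: $(DATABRICKS_CLIENT_ID)
    DATABRICKS_CLIENT_SECRET: $(DATABRICKS_CLIENT_SECRET)
Setting an environment variable named BUNDLE_VAR_catalog_name should work as an alternative to --var.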
Thanks a lot, @DustinVannoy, for this great presentation! I have a question: which is the better approach for structuring a project: one bundle yml config file for all my sub-projects, or each sub-project having its own databricks bundle yml file? Thanks again :)
Great video! Is there a way to override variables defined in the databricks.yml in each job yml definition so that the variable has a different value for that job only?
If the value is the same for a job across all targets, you wouldn't use a variable. To override job values, you would set those in the targets section, which I always include in databricks.yml.
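Something like this, as a sketch (job and parameter names are placeholders):
resources:
  jobs:
    ingest_job:
      name: ingest_job
      parameters:
        - name: output_path
          default: /tmp/dev_output      # used unless a target overrides it

targets:
  prod:
    mode: production
    resources:
      jobs:
        ingest_job:
          parameters:
            - name: output_path
              default: abfss://prod@mystorage.dfs.core.windows.net/output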
Any way to see a plan like you would with terraform?
Not really; using databricks bundle validate is the best way to see things. There are some options to view debug output, but I haven't found something that works quite like Terraform plan. When you run destroy it does show what will be destroyed before you confirm.
How do you change the catalog name specific to an environment?
I would use a bundle variable and set it in the target overrides, then reference it anywhere you need it.
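Rough sketch of that pattern (catalog names are placeholders):
variables:
  catalog_name:
    description: Catalog to read from and write to
    default: dev_catalog

targets:
  dev:
    mode: development        # picks up the default value above
  prod:
    mode: production
    variables:
      catalog_name: prod_catalog

resources:
  jobs:
    load_job:
      name: load_job
      parameters:
        - name: catalog
          default: ${var.catalog_name}   # reference the variable anywhere you need it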
Can we integrate Azure Pipelines + DAB for a CI/CD implementation?
Are you referring to Azure DevOps CI pipelines? You can do that and I am considering a video on that since it has been requested a few times.
@@DustinVannoy yes, thank you!
@@DustinVannoy Please, can you do that? hahaha
Video showing Azure DevOps Pipeline is published!
ua-cam.com/video/ZuQzIbRoFC4/v-deo.html
How do I remove the [dev my_user_name] prefix? Please suggest.
Change from mode: development to mode: production (or just remove that line). This will remove the prefix and change the default destination. However, for the dev target I recommend you keep the prefix if multiple developers will be working in the same workspace. The production target is best deployed as a service principal from a CI/CD pipeline (like an Azure DevOps pipeline) to avoid different people deploying the same bundle and having conflicts over resource owner and code version.
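A sketch of the two targets side by side (hosts, the root path, and the service principal ID are placeholders):
targets:
  dev:
    default: true
    mode: development        # adds the [dev your_user_name] prefix and pauses schedules
    workspace:
      host: https://adb-1111111111111111.1.azuredatabricks.net
  prod:
    mode: production         # no prefix, schedules stay as defined
    workspace:
      host: https://adb-3333333333333333.3.azuredatabricks.net
      root_path: /Shared/.bundle/prod/${bundle.name}
    run_as:
      service_principal_name: 00000000-0000-0000-0000-000000000000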
@@DustinVannoy Thank you Vannoy!! Worked fine now !!
Once the code is deployed it gets uploaded to the Shared folder. Can't we store it somewhere else, like an artifact feed or a storage account? There is a chance that someone may delete that bundle from the Shared folder. It has always been like this with Databricks deployments, both before and after asset bundles.
You can set permissions on the workspace folder and I recommend also having it all checked into version control such as GitHub in case you ever need to recover an older version.
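If it helps, bundles also support a top-level permissions block, something like this sketch (group names and the ID are placeholders); as far as I know it applies to the deployed resources, so check the docs for whether your CLI version also applies it to the target folder:
permissions:
  - level: CAN_MANAGE
    group_name: data-platform-admins
  - level: CAN_VIEW
    group_name: data-engineers
  - level: CAN_RUN
    service_principal_name: 00000000-0000-0000-0000-000000000000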