Hi Dave, I'm currently working on a CD pipeline that needs to install libraries in the Databricks cluster, which apparently needs more settings. I'd also like to know if there is a DataOps 1 video, because on the playlist I can only see videos 2, 3 and 4. Excellent channel! Regards from Mexico!
Hi @@jaxo116, thanks for the comment. Libraries should be the same process to upload, you can find details in the Databricks API reference on their site. I believe you have the choice to upload libraries to the workspace or to the cluster, which has some effect on the management of the libraries but works in basically the same way for either. If you have specifics let me know and I can look into it in more detail. Yes the naming of the series was bad, sorry! I had intended to upload an introduction to data ops video and never got around to it. There is a video about environments which serves as a bit of an intro. I will be putting out more videos soon explaining the wider DevOps stuff for data people. I've been running internal workshops on this so think I have a good handle on where data folk need the most pointers and explanation. There's a bunch of DevOps info out there, the problem is it's all aimed at coders so I've been trying to translate that for data people :)
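For anyone wondering roughly what the library step looks like, here is a minimal Python sketch (not the PowerShell from the demo). It assumes an earlier pipeline step has already built the wheel and copied it to DBFS, and it only builds the request for the Databricks Libraries API (`/api/2.0/libraries/install`). The workspace URL, cluster ID and wheel path are placeholders, and `build_library_install_request` is a hypothetical helper name.

```python
import json


def build_library_install_request(workspace_url, cluster_id, wheel_dbfs_path):
    """Return (url, body) for a Libraries API install call.

    All three arguments are placeholders for values you would normally
    keep in a variable group per environment.
    """
    url = workspace_url.rstrip("/") + "/api/2.0/libraries/install"
    body = {
        "cluster_id": cluster_id,
        # One entry per library; "whl" points at a wheel already on DBFS.
        "libraries": [{"whl": wheel_dbfs_path}],
    }
    return url, json.dumps(body)


if __name__ == "__main__":
    url, body = build_library_install_request(
        "https://example.azuredatabricks.net",  # hypothetical workspace URL
        "0000-000000-example",                  # hypothetical cluster ID
        "dbfs:/FileStore/wheels/mylib-0.1.0-py3-none-any.whl",  # hypothetical path
    )
    print(url)
    print(body)
```

The body would be sent as a POST with the same `Authorization: Bearer ...` header the notebook script uses.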
In my workspace, when I go under User Settings - Git Integration and select Azure DevOps Services as Git Provider, it gives an error pop up that says Error: Uncaught TypeError: Cannot read property 'value' of undefined Reload the page and try again. If the error persists, contact support. Reference error code: d2f8c3b5119a49cb9e6e854e9b336725 Any idea on how I can resolve this or what is this error related to?
Thank you for the demo, but that way only the feature branch of the first notebook will be versioned and not the preprod or prod ones. What is the right way to promote code from one notebook to another while keeping them versioned and merging branches step by step? If I overwrite the notebook content with the workspace import API, that will replace the notebook and the Git versioning link will break.
Production is versioned within the object store of Azure DevOps. Humans won't have access to the production environment so there won't be any changes in that environment anyway, since all changes are made and submitted via the development environment. Hope that makes sense?
@@DaveDoesDemos thank you! Yes, but what if you use GitHub, which is supported by Databricks, and have some "middle-way" environments such as Int, QA, Staging, etc.? If you want to keep all these envs versioned, should you link Git notebook by notebook, or is there an automation mechanism that allows it?
@@HarryStyle93 if you're using GitHub the process is identical. You write the code in your dev environment and then use DevOps to copy that into an artifact (the deliverable code). This code is then pushed into the various other environments, which you're free to configure however you like. In real life I'd recommend Dev (where you code and unit test), then Test (where you run integration testing against fake data with known answers), then QA/Preprod (where you run against real data but place the results somewhere else for checking), and finally production. These can be the same environment/cluster if you really want to work that way or need to save money, but ideally should be different. When doing this, the only environment linked to GitHub is the development environment; all of the others get notebooks and other stuff delivered by DevOps as artifacts and should not have human interaction.
Thanks for the video Dave. It has been very helpful for me. There isn't much out there about Databricks CI/CD. After adapting to your stream-of-consciousness style, it seems the presentation of ideas vs the actions in the video are totally out of sync from a scripting perspective. If viewers have some experience with the CI/CD process in Azure DevOps already, this probably is not a blocker, but it could be a little difficult if no experience (the target audience?) or if English is not your first language.
Hi Benjamin thanks for the comment and feedback. It's a difficult subject to cover well as most data people see CI/CD as scripted deployment, which is very easy. I wanted to cover it in the way it's intended which required a little more understanding of the collaborative nature of CI/CD and Git, and DevOps in general. I'm working on a bunch of new content in Microsoft UK around collaborative DevOps, testing and more agile data architectures with this stuff in mind and hopefully will translate these to some more up to date videos later in the year. This is borne out of seeing large mature customers hitting operational scale issues as data pros work in traditional ways. It's a long road though!
Hi Dave, thanks for this demo. I am actually trying to deploy only those notebooks which were most recently checked in to the master branch. Is there a way to achieve this? FYI: I have found a bash script at the link below which gives the latest committed files, but without their paths. levelup.gitconnected.com/continuous-integration-and-delivery-in-azure-databricks-1ba56da3db45 Let me know if you have any ideas about it.
Hi thanks for the question. You should be deploying as a whole asset rather than picking and choosing, that way your deployment asset is a complete solution. Don't think of this as just version control, it's much wider than that. Having said that, you can do this with some scripting if you really need to, but it would add complexity. Is there a reason you don't want to grab everything in the folder? Finally, Databricks have announced Workspace 2 which will change the way this works for the better. Your whole project in Databricks will be a repo so you won't need to check files in one by one. I have no info on when this is coming, but as soon as I get access I will make a demo of the new functionality.
@@DaveDoesDemos Thanks for your detailed information. In this request, I actually just wanted to deploy to PROD only those notebooks which changed, rather than deploying all of the folders and notebooks again and again. I have found a script to do this; my next challenge is to overwrite the existing file on PROD if present, rather than putting it in the 'Deployment' folder and then moving it manually to its actual location.
@@yogeshjain5549 keep in mind that this introduces a risk that your code will not be in a known consistent state, and will be much harder to replicate the environment using your release pipelines. This is the reason we deploy as one artifact for the project, and it eventually enables you to use ephemeral environments for continuous integration and deployment. You may also want to encapsulate some of your code into libraries which will be deployed to the cluster separately, making the notebooks smaller. There are no hard rules though so as long as you know what you're doing it should work fine.
@@DaveDoesDemos I feel like I have to be missing something. This appears to create a pipeline for a single file, but you're saying we should grab an entire folder (all notebooks), which is what I'm attempting to do. Do I need to wrap your PowerShell script in a for loop to copy all my artifacts that were created in the pipeline step?
@@GuyBehindTheScreen1 No you're not missing anything. When I made this video the API only supported single file copy so you'd have to copy them in a loop. The Databricks interface now supports whole repos so the method is slightly different but the concept is the same.
Awesome demo Dave, thanks a lot - I have replicated this and works ok with one notebook in the same environment - the file name is hardcoded - $fileName = "$(System.DefaultWorkingDirectory)/_Build Notebook Artifact/NotebooksArtifact/DemoNotebookSept.py", how can I generalise this for all the files and folders in the main branch and what happens to $newNotebookName in this case?
Hi glad you enjoyed the demo. I'd recommend looking at using the newer Databricks methods which I've not had a chance to demo yet. These allow you to open a whole project at a time. For my older method you'd want to list out the contents of the folder and iterate through an array of filenames. In theory since you'll want your deploy script to be explicit you could even list them in the script using copy and paste, although this may get frustrating in a busy environment.
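Since several people asked what "iterate through an array of filenames" might look like, here is a rough Python sketch (again, not the PowerShell from the demo). It walks the artifact folder recursively and builds one request body per notebook for the `/api/2.0/workspace/import` endpoint mentioned in the thread. The helper name `build_import_payloads`, the extension-to-language map and the target folder are assumptions for illustration. Setting `overwrite` to true means an existing notebook at the same path gets replaced.

```python
import base64
from pathlib import Path

# Assumed mapping from file extension to the language name the API expects.
LANGUAGES = {".py": "PYTHON", ".sql": "SQL", ".scala": "SCALA", ".r": "R"}


def build_import_payloads(artifact_dir, workspace_root):
    """Build one workspace import body per notebook source file."""
    artifact_dir = Path(artifact_dir)
    payloads = []
    for file in sorted(artifact_dir.rglob("*")):
        language = LANGUAGES.get(file.suffix.lower())
        if not file.is_file() or language is None:
            continue  # skip folders and anything that is not notebook source
        # Keep the sub-folder structure, drop the file extension.
        relative = file.relative_to(artifact_dir).with_suffix("")
        payloads.append({
            "path": workspace_root.rstrip("/") + "/" + relative.as_posix(),
            "format": "SOURCE",
            "language": language,
            "overwrite": True,
            # The API expects the file content base64 encoded.
            "content": base64.b64encode(file.read_bytes()).decode("ascii"),
        })
    return payloads
```

Each body would then be sent as a POST with the Bearer token header, one call per notebook. As far as I know the import call does not create missing parent folders, so nested folders may need a `/api/2.0/workspace/mkdirs` call first.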
Thanks for sharing such a valuable piece of information. Quick question, I'm wondering what if my workspace is not accessible over Public Network and my Azure DevOps is using a Microsoft Self Hosted Pipeline? Any thoughts?
In that case you'd need to set up private networking with vnets. The method would be the same, you just have a headache getting the network working. Usually there's no reason to do this though, I would recommend using cloud native networking, otherwise you're just adding operational cost for no benefit (unless you work for the NSA or a nuclear power facility...).
Yeah! We are facing that scenario (customer requirement). Basically, the Azure DevOps Microsoft-hosted agent (and because of that the release pipeline), wherever it gets deployed on demand, needs to be able to reach our private Databricks cluster URL through our Azure firewall. So far I haven't got any strategy working for this. I'd appreciate it if you know of some documentation I could take a look at. Thanks for answering. New subscriber!
@@AlejoBohorquez960307 Sorry I missed the hosted agent part. Unfortunately I think you need to use a self hosted agent on your vnet to do this, or reconfigure the Databricks to use a public endpoint. It's very normal to use public endpoints on Databricks, we didn't even support private connections until last year and many large global businesses used it quite happily. I often argue that hooking it up to your corporate network poses more of a risk since attacks would then be targeted rather than random (assuming you didn't make your url identifiable, of course).
Thanks, this question was asked in an interview. I have implemented the same and am able to deploy the notebook with a DevOps pipeline. Great video. Thank you Dave.
Delivered the purpose … you are rocking
Could you please suggest how to sync libraries to an Azure repo like notebooks?
Hi thanks for the comment. This demo describes how to deploy notebooks, but the process is very similar for libraries. In the case of a library your build process would usually use a Linux shell to build the library (wheel/egg etc.) and your release would then push these into the cluster.
Please don't think of the repo as storage to sync with, the repo is your source for development purposes only, and the code will be taken from this and built into a deployment artefact by the DevOps process. It's the artefact that will be deployed to production, and production should never connect to your repository.
Thank you so much , keep growing 😊
Thanks for this video, just what I needed!
Glad you enjoyed it!
I am trying to deploy all files from the folder but it's giving the error: Exception calling "ReadAllBytes" with "1" argument(s): "Could not find a part of the path". I can deploy one file, though. Please help: how can I deploy all files from the folder to Databricks?
Hi thanks for the comment. A few people had this issue so I added the update at ua-cam.com/video/l35MBEJiUgk/v-deo.html which deals with path issues. I also updated the instructions on GitHub which are now accurate. Sorry for the confusion, I wasn't able to edit this video but did put the link in the description. Let me know if you still have problems and I'll try to help out
@@DaveDoesDemos Thanks :)
It was really a great demo. Can you please do a demo of deploying a data factory that has a Databricks notebook activity running?
Hi Rahul, when doing this sort of thing you just need to create a build/test process for each deliverable. A Python library would have a build and test pipeline, as would the notebooks, and finally the ADF environment would have its own. Each build will produce an artifact in DevOps which you can tag as current, ready for deployment etc. and then you'd have a release pipeline which brings all three artifacts together and deploys to a test environment for integration testing before deploying to production and setting up triggers. Hope that makes sense? I'm currently working on a more full featured ADF demo to show more of the governance side of DevOps, hopefully that will also be useful as it highlights that scripted deployment is the easy part of DevOps!
@@DaveDoesDemos thank you Dave, looking forward to seeing another session from you soon.
A simple diagram or flow chart would really help to keep things in mind for longer, Dave, so I hope that will help.
I am facing an issue during the execution of the script for deployment. Need your help on this, Dave.
$Secrets = "Bearer " + "$(Databricks)"
It's showing: the term 'Databricks' is not recognised as the name of a cmdlet, function, script file or operable program.
Hi there, did you set up the variable group in Azure DevOps pointing to your Key Vault with the Databricks secret in it?
Hopefully this video will help resolve the issue ua-cam.com/video/l35MBEJiUgk/v-deo.html
I've also updated the documents to correct the two issues we found (including variable groups). Thanks again for reaching out, it really helps to make sure I've not missed anything.
@dave - Can you send the code for the foreach loop, for how to deploy multiple files from a folder? It would be a great help.
Nice demo. How can we add unit tests when deploying notebooks using Azure DevOps?
Thanks for the question. Azure DevOps can consume the output of most test suites, so you'd just run a test using something like PyTest and then import the output into your build or release. Take a look at github.com/davedoesdemos/DataDevOps/tree/master/PythonTesting for instructions for Python - this isn't in Databricks but it shows the technique you need.
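As a rough illustration of the PyTest route, here is a minimal test file. The function `add_vat` is a made-up stand-in for whatever logic you pull out of a notebook into testable Python; it is not from the linked repo.

```python
# test_transform.py
def add_vat(net_amount, rate=0.2):
    """Hypothetical transformation: return the gross amount for a net price."""
    return round(net_amount * (1 + rate), 2)


def test_add_vat_default_rate():
    # 100 net at the default 20% rate should give 120 gross.
    assert add_vat(100.0) == 120.0


def test_add_vat_zero_rate():
    # A zero rate should leave the amount unchanged.
    assert add_vat(50.0, rate=0.0) == 50.0
```

Running `pytest --junitxml=test-results.xml` in the build writes a JUnit-style results file, which the Publish Test Results task in Azure DevOps can then pick up so the results show against the build or release.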
Could you please explain more about the binary contents in the PowerShell script?
Thanks for sharing such a wonderful demo. Can you please create a demo on how to create a CI/CD pipeline for an Azure AD access token with a service principal?
Great presentation!
Is Build pipeline really required? Can we include Publish Build Artifacts task as a part of Release pipeline and not have build pipeline at all ?
Thanks for the comment, and great question! If you are viewing Azure DevOps as a pure automation tool then no, technically there is nothing stopping you doing things that way and it will probably work for the immediate requirement. This is the same as the argument on Azure Data Factory where some people choose to use the Git repo directly. It's not good practice though since you're missing out on the whole purpose of the DevOps tooling, which is governance. To use the governance side of the tool properly then yes, you should ideally perform a build to create an artefact and then release to deploy and test it. I believe there are changes coming which will allow you to add "stages" in DevOps in a single pipeline, and although that appears to do what you're asking, ultimately you'd still have build and release with an immutable artefact in the object store. For data projects this is often a stumbling block because the initial requirement always looks like a simple automation one. I guarantee though if your project is large enough you'll eventually work out why you need the governance that goes with this, so I'd recommend using it from the start :)
@@DaveDoesDemos Thanks Dave for a very quick revert. Appreciate your in depth answer. It makes sense 👍
Keep posting videos. They are really nice ones.
@@tejashavele Thanks for the feedback, more coming soon around IoT scenarios :)
Can you please let me know if you can have a Databricks notebook in ADF that can be moved from dev to QA?
Hi thanks for the question. The notebook would be separate to ADF. The Notebook and ADF pipelines would each use their own release mechanisms and you must create your processes to make sure these line up. If they are developed together then you would deploy both to your testing, QA, production together so that any structured integration testing can test them together. If they are not developed together then they would be deployed independently and your processes need to ensure you have some kind of "contract" in place to ensure the functionality of one is understood by the other.
Mate, why are you not making CI/CD pipeline using Snowflake?
Hi, thanks for the comment! I don't use Snowflake, so unfortunately can't help you there. I'm sure the process would be similar if they have a code first approach and either API access or some other way to push the artifacts into the environment.
Hi @Dave Does Demos
Thanks for it.
How can we find the PowerShell script?
Hi the powershell is in the instructions linked to in the description github.com/davedoesdemos/DataDevOps/blob/master/Databricks/DatabricksDevOps.md
@@DaveDoesDemos Thanks alot
Getting git sync error as there’s no master anymore in git, it’s now main. Is there any work around?
The "master" and "Main" are just labels so won't be the cause of your issue. Do you have it configured to use main? If you really think it's that you can rename main to master in Git.
@@DaveDoesDemos thanks for your prompt reply. But, I do not think the sync works as I tried using azure DevOps git integration and it failed at git sync. When I tried the same with git integration with GitHub it did sync; could you try to connect a notebook via Azure DevOps git integration? probably you’d understand where I’m stuck. I would need to build the CI/CD and so I’d need to overcome this issue.
@@rajkiranboggala9722 Hi, I just tested this on Azure DevOps using both an existing and a new repo, and with Master and Main naming. All of these scenarios sync fine on my setup. I suggest you first copy your file somewhere for backup and then unlink in the Databricks interface by clicking the git button at the top of revision history. Click unlink and then save. Next, click the button again and click link - make sure you set your url correctly using dev.azure.com/yourorg/yourproject/_git/yourrepo and then choose your correct branch. If the branch doesn't show then save anyway, wait for it to fail, then try again and it should load. Also make sure you don't have a policy on your main branch in your org preventing commits. This would usually be the case in any well set up organisation, since you should not be committing to main/master but to a feature branch instead and then doing a pull request. If none of the above works, it may be a support ticket as there's something weird happening. Hope that helps!
@@DaveDoesDemos I did follow everything as mentioned in the video, but, getting error: Error linking with Git: Response from Azure DevOps Services: status code: 500, reason phrase: ?{"$id":"1","innerException":null,"message":"Unable to complete authentication for user due to looping logins","typeName":"Microsoft.VisualStudio.Services.Identity.IdentityLoopingLoginException, Microsoft.VisualStudio.Services.WebApi","typeKey":"IdentityLoopingLoginException","errorCode":0,"eventId":4207}
I do not know what to do!!! But, thanks for checking
@@rajkiranboggala7085 500 would be server error which I would guess is incorrect url for the repo, are you certain that's right?
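If you want a quick sanity check on the URL shape Dave describes, something like this small Python sketch would do it. It only checks for the dev.azure.com/yourorg/yourproject/_git/yourrepo pattern, so it will not recognise older visualstudio.com style URLs, and it cannot tell you anything about the authentication error itself. `is_azure_repos_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def is_azure_repos_url(url):
    """True if url looks like https://dev.azure.com/<org>/<project>/_git/<repo>."""
    if "://" not in url:
        url = "https://" + url  # tolerate URLs pasted without a scheme
    parts = urlsplit(url)
    # Non-empty path segments: [org, project, "_git", repo]
    segments = [s for s in parts.path.split("/") if s]
    return (
        parts.hostname == "dev.azure.com"
        and len(segments) == 4
        and segments[2] == "_git"
    )
```

For example, `is_azure_repos_url("dev.azure.com/yourorg/yourproject/_git/yourrepo")` passes, while a URL that stops at the project level does not.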
Hi Dave, how do I configure the release pipeline for another environment (e.g. production) which is under another subscription?
Hi thanks for the comment. Subscription wouldn't make much difference here, you'd just need to use variables for the connection to the cluster in each deployment pipeline and it will connect using the credentials you provide. Use a variable group for each deployment and store the variable values in Key Vault as in the demo. You could use several keyvaults in different subscriptions if necessary, but for simplicity you might want to choose a subscription for management and store the vault there. If you get stuck let me know and I'll see what I can do.
@@DaveDoesDemos Thanks Dave, can you make a short video on how to configure pipeline to move notebook from one environment to another. That would be more helpful.
Unfortunately I'm pretty busy at the moment but will certainly try to make something specific to that in the future. The video will be almost exactly the same though. The only thing that you'd change to send the code to a different environment is the script, specifically the $(Databricks) variable with a different secret, and the $URI might be for a different cluster. Both of these can be set up in variable groups such that you have one variable group for test and a different one for production. Hope that's helpful, I'll try to get a guide written as soon as I can.
$Secret = "Bearer " + "$(Databricks)"
# Set the URI of the workspace and the API endpoint
$Uri = ".azuredatabricks.net/api/2.0/workspace/import"
I do have another Databricks code promotion video planned which will use an upcoming feature for integration with Github. As soon as I get access to the new code and permission to post about it I'll get that up too :)
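To make the "one variable group per environment" idea a bit more concrete, here is a small Python sketch of the same two values the PowerShell sets (the Bearer header and the import URI), chosen per environment. The workspace URLs and the `ENVIRONMENTS` dict are made up for illustration; in the demo these values come from variable groups backed by Key Vault. The check for a literal `$(...)` reflects how Azure DevOps leaves an undefined macro variable unreplaced in the script text.

```python
import re

# Hypothetical per-environment values; in the demo these would come from
# variable groups backed by Key Vault, not from a dict in source control.
ENVIRONMENTS = {
    "test": {"workspace": "https://test-example.azuredatabricks.net"},
    "production": {"workspace": "https://prod-example.azuredatabricks.net"},
}

# Matches a pipeline macro such as "$(Databricks)" that was never replaced.
UNSUBSTITUTED = re.compile(r"^\$\(.+\)$")


def build_import_request(environment, token):
    """Return (uri, headers) for the workspace import call in one environment."""
    if not token or UNSUBSTITUTED.match(token):
        # A literal "$(Databricks)" means the pipeline variable was never
        # replaced, e.g. the variable group is not linked to the release.
        raise ValueError("Databricks token is missing or was not substituted")
    workspace = ENVIRONMENTS[environment]["workspace"].rstrip("/")
    uri = workspace + "/api/2.0/workspace/import"
    headers = {"Authorization": "Bearer " + token}
    return uri, headers
```

The deployment logic stays identical between test and production; only the values passed in differ.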
@@DaveDoesDemos Thank you Dave
Can you please make a video on "Databricks Code Promotion using DevOps CI/CD" using the Pipeline Artifact YAML method?
Hi, the methods would be identical using YAML so in theory you should be able to Google for the code examples. I have a strong preference against YAML for data pipelines in any team that doesn't have a dedicated pipeline engineer. Data teams simply don't need the stress of learning yet another markup language just to achieve something there's a perfectly good GUI for. Data deployment pipelines don't change often enough to make YAML worthwhile in my opinion. The time is better spent doing data transformation and modelling work.
hi, this tutorial was very helpful to me.
I would like to know if you have any demo about deploying a cluster through DevOps?
Hi, no I don't have a demo for that yet, but you could use the same techniques of ARM templates and API to achieve this I think.
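As a rough illustration of the API route, the Databricks Clusters API accepts a JSON body on POST /api/2.0/clusters/create. The Python sketch below only builds that body. The function name is hypothetical, and the runtime version and node type must be supplied by the caller because valid values depend on the workspace.

```python
import json


def build_cluster_create_body(name, spark_version, node_type_id, num_workers=2):
    """Return the JSON body for POST /api/2.0/clusters/create.

    spark_version and node_type_id are workspace-specific strings; this
    sketch does not validate them.
    """
    return json.dumps(
        {
            "cluster_name": name,
            "spark_version": spark_version,
            "node_type_id": node_type_id,
            "num_workers": num_workers,
        }
    )
```

The body would be posted with the same bearer token header used for notebook imports.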
thank you Dave for this demo, it shows all the steps needed. I suggest something if you can add "CI-CD" in your video title to make it easy for others to find your video, for me I spent hours scrolling down on youtube to find yours, I guess yours should be on the top result :), thanks again David (y)
Thanks for the suggestion, I'll add that in :)
If we give the file path up to a folder location instead of a file, would it deploy all files in that folder? If not, could you guide how we can make it recursive for all files, or even child folders as well?
Hi Ashok, check out the newer version of Databricks which works in a much more normal way. There you can open a project (aka whole repo!) in the Databricks interface. While the theory is the same their implementation is much nicer. Unfortunately I've not had time to make a new video on this yet. Making artifacts and deploying them will be pretty much the same, just that you now check in the whole repo at once.
@@DaveDoesDemos thanks for quick reply. I could just use your code with foreach loop to deploy all files.
@@guptaashok121 yes that definitely works
@@guptaashok121 Can you tell me what component you use to make that loop and where it implements it? It is not very clear to me how to make CICD with all the files in a folder
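One way to picture the loop, as a Python sketch rather than the PowerShell used in the video: walk the artifact folder recursively and pair each file with the workspace path it should be imported to. The function name, the .py filter and the target root are assumptions for illustration.

```python
from pathlib import Path


def plan_notebook_imports(artifact_dir, workspace_root):
    """Map every .py file under artifact_dir to a workspace notebook path.

    Subfolders are preserved and the .py suffix is dropped, since workspace
    notebook paths do not carry a file extension.
    """
    artifact_dir = Path(artifact_dir)
    plan = []
    for source in sorted(artifact_dir.rglob("*.py")):
        relative = source.relative_to(artifact_dir).with_suffix("")
        target = f"{workspace_root.rstrip('/')}/{relative.as_posix()}"
        plan.append((source, target))
    return plan
```

Each (source, target) pair then gets one call to the import API, so the target path takes over the role of a hardcoded notebook name.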
@@DaveDoesDemos Hi, have you created a video for moving all the notebooks in a particular folder, please?
thank you so much
Have you tried out this demo? Maybe you're already doing DataOps and have some advice for others. Leave a comment here to start the discussion!
Hi Dave, how do I configure the release pipeline for another environment (e.g. production) which is under another subscription?
Hi Dave, I'm currently working on a CD pipeline that needs to install libraries in the Databricks cluster, which apparently needs more settings. I'd also like to know if there is a DataOps 1 video, because on the playlist I can only see videos 2, 3 and 4.
Excellent channel!
Regards from Mexico!
Hi @@jaxo116, thanks for the comment. Libraries should be the same process to upload, you can find details in the Databricks API reference on their site. I believe you have the choice to upload libraries to the workspace or to the cluster, which has some effect on the management of the libraries but works in basically the same way for either. If you have specifics let me know and I can look into it in more detail.
Yes the naming of the series was bad, sorry! I had intended to upload an introduction to data ops video and never got around to it. There is a video about environments which serves as a bit of an intro. I will be putting out more videos soon explaining the wider DevOps stuff for data people. I've been running internal workshops on this so think I have a good handle on where data folk need the most pointers and explanation. There's a bunch of DevOps info out there, the problem is it's all aimed at coders so I've been trying to translate that for data people :)
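On the library question, a small Python sketch of the cluster-scoped route: the Databricks Libraries API takes a cluster ID and a list of library specs on POST /api/2.0/libraries/install. The function name is hypothetical, and the wheel is assumed to have been uploaded already to a location the cluster can read.

```python
import json


def build_library_install_body(cluster_id, wheel_paths):
    """Return the JSON body for POST /api/2.0/libraries/install.

    Each entry in wheel_paths is the storage location of a built wheel,
    for example a dbfs:/ path produced by an earlier release step.
    """
    libraries = [{"whl": path} for path in wheel_paths]
    return json.dumps({"cluster_id": cluster_id, "libraries": libraries})
```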
In my workspace, when I go under User Settings - Git Integration and select Azure DevOps Services as Git Provider, it gives an error pop up that says
Error: Uncaught TypeError: Cannot read property 'value' of undefined Reload the page and try again. If the error persists, contact support. Reference error code: d2f8c3b5119a49cb9e6e854e9b336725
Any idea on how I can resolve this or what is this error related to?
That sounds like a support issue so I'd contact support and raise a ticket. Sorry I can't be more help
Thank you for the demo, but done that way only the feature branch of the first notebook will be versioned and not the preprod or prod ones. What is the right way to promote code from one notebook to another while keeping them versioned and merging branches step by step? If I overwrite the notebook content through the workspace import API, that will replace the notebook and the Git versioning link will be lost.
Production is versioned within the object store of Azure DevOps. Humans won't have access to the production environment so there won't be any changes in that environment anyway, since all changes are made and submitted via the development environment. Hope that makes sense?
@@DaveDoesDemos thank you! yes but what if you use Github which is supported by Databricks and the use case where you can have some "middle-way" environments such as Int, QA, Staging, etc ? if you want to keep all these envs versioned you should link git notebook per notebook or is there an automation mechanism that allows it?
@@HarryStyle93 if you're using GitHub the process is identical. You write the code in your dev environment and then use DevOps to copy that into an artifact (the deliverable code). This code is then pushed into the various other environments, which you're free to configure however you like. In real life I'd recommend Dev (where you code and unit test), then Test (where you run integration testing against fake data with known answers), then QA/Preprod (where you run against real data but place the results somewhere else for checking), and finally Production. These can be the same environment/cluster if you really want to work that way or need to save money, but ideally should be different.
When doing this, the only environment linked to GitHub is the development environment, all of the others get notebooks and other stuff delivered by DevOps as artifacts and should not have human interaction.
@@DaveDoesDemos ok I see your point. Thank you
thank you!
Thanks for the video Dave. It has been very helpful for me. There isn't much out there about Databricks CI/CD. After adapting to your stream-of-consciousness style, it seems the presentation of ideas vs the actions in the video are totally out of sync from a scripting perspective. If viewers have some experience with the CI/CD process in Azure DevOps already, this probably is not a blocker, but it could be a little difficult if no experience (the target audience?) or if English is not your first language.
Hi Benjamin thanks for the comment and feedback. It's a difficult subject to cover well as most data people see CI/CD as scripted deployment, which is very easy. I wanted to cover it in the way it's intended which required a little more understanding of the collaborative nature of CI/CD and Git, and DevOps in general. I'm working on a bunch of new content in Microsoft UK around collaborative DevOps, testing and more agile data architectures with this stuff in mind and hopefully will translate these to some more up to date videos later in the year. This is borne out of seeing large mature customers hitting operational scale issues as data pros work in traditional ways. It's a long road though!
Hi Dave,
Thanks for this demo.
I am actually trying to deploy only those notebooks which were most recently checked in to the master branch.
Is there a way to achieve this?
FYI: I have found a bash script at the link below which gives the latest committed files, but without their paths.
levelup.gitconnected.com/continuous-integration-and-delivery-in-azure-databricks-1ba56da3db45
Let me know if you have any idea about it.
Hi thanks for the question. You should be deploying as a whole asset rather than picking and choosing, that way your deployment asset is a complete solution. Don't think of this as just version control, it's much wider than that.
Having said that, you can do this with some scripting if you really need to, but it would add complexity. Is there a reason you don't want to grab everything in the folder?
Finally, Databricks have announced Workspace 2 which will change the way this works for the better. Your whole project in Databricks will be a repo so you won't need to check files in one by one. I have no info on when this is coming, but as soon as I get access I will make a demo of the new functionality.
@@DaveDoesDemos
Thanks for your detailed information.
In this request, I actually just wanted to deploy to PROD only those notebooks which changed, rather than deploying all of the folders and notebooks again and again.
I have found a script to do this; my next challenge is to overwrite the existing file on PROD if present, rather than putting it in the 'Deployment' folder and then moving it manually to its actual location.
@@yogeshjain5549 keep in mind that this introduces a risk that your code will not be in a known consistent state, and will be much harder to replicate the environment using your release pipelines. This is the reason we deploy as one artifact for the project, and it eventually enables you to use ephemeral environments for continuous integration and deployment. You may also want to encapsulate some of your code into libraries which will be deployed to the cluster separately, making the notebooks smaller. There are no hard rules though so as long as you know what you're doing it should work fine.
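If you do go the changed-files route despite that risk, a Python sketch of the filtering step might look like this. `git diff --name-only` prints paths relative to the repository root, which keeps the folder information. The folder name and suffix defaults are assumptions for illustration.

```python
import subprocess


def changed_notebooks(diff_output, folder="notebooks/", suffix=".py"):
    """Filter `git diff --name-only` output down to notebooks in one folder."""
    paths = [line.strip() for line in diff_output.splitlines() if line.strip()]
    return [p for p in paths if p.startswith(folder) and p.endswith(suffix)]


def git_changed_files(old_rev, new_rev):
    """Return the raw name-only diff between two revisions (runs git)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", old_rev, new_rev],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

A release step could call `changed_notebooks(git_changed_files(previous, current))`, where the two revisions are whatever the pipeline records as the last deployed commit and the new one.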
@@DaveDoesDemos I feel like I have to be missing something. This appears to create a pipeline for a single file, but you're saying we should grab an entire folder (all notebooks), which is what I'm attempting to do. Do I need to wrap your PowerShell script in a for loop to copy all my artifacts that were created in the pipeline step?
@@GuyBehindTheScreen1 No you're not missing anything. When I made this video the API only supported single file copy so you'd have to copy them in a loop. The Databricks interface now supports whole repos so the method is slightly different but the concept is the same.
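Each pass of such a loop sends one file to the workspace import endpoint. Here is a hedged Python sketch of a single pass, using the JSON form of the request with base64 content rather than the exact request built in the video. The `overwrite` field is what lets an import replace an existing notebook in place. Function names are hypothetical.

```python
import base64
import json
import urllib.request


def build_import_body(source_text, workspace_path, overwrite=True):
    """Return the JSON body for POST /api/2.0/workspace/import."""
    body = {
        "path": workspace_path,
        "format": "SOURCE",
        "language": "PYTHON",
        "content": base64.b64encode(source_text.encode("utf-8")).decode("ascii"),
        "overwrite": overwrite,
    }
    return json.dumps(body)


def post_import(url, token, body):
    """Send one import request and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```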
18K + views but 900 subscribers why? if you are watching the content, no harm to subscribe right?
Awesome demo Dave, thanks a lot - I have replicated this and works ok with one notebook in the same environment - the file name is hardcoded - $fileName = "$(System.DefaultWorkingDirectory)/_Build Notebook Artifact/NotebooksArtifact/DemoNotebookSept.py", how can I generalise this for all the files and folders in the main branch and what happens to $newNotebookName in this case?
Hi glad you enjoyed the demo. I'd recommend looking at using the newer Databricks methods which I've not had a chance to demo yet. These allow you to open a whole project at a time. For my older method you'd want to list out the contents of the folder and iterate through an array of filenames. In theory since you'll want your deploy script to be explicit you could even list them in the script using copy and paste, although this may get frustrating in a busy environment.
Thanks for sharing such a valuable piece of information. Quick question, I'm wondering what if my workspace is not accessible over Public Network and my Azure DevOps is using a Microsoft Self Hosted Pipeline? Any thoughts?
In that case you'd need to set up private networking with vnets. The method would be the same, you just have a headache getting the network working. Usually there's no reason to do this though, I would recommend using cloud native networking, otherwise you're just adding operational cost for no benefit (unless you work for the NSA or a nuclear power facility...).
Yeah! We are facing that scenario (customer requirement). Basically, the Azure DevOps Microsoft hosted agent (and because of that the release pipeline), wherever it gets deployed on demand, needs to be able to reach our private Databricks cluster URL, passing through our Azure firewall. So far I haven't got any strategy working on this. Would appreciate it if you know of some documentation to take a look at. Thanks for answering. New subscriber!
@@AlejoBohorquez960307 Sorry I missed the hosted agent part. Unfortunately I think you need to use a self hosted agent on your vnet to do this, or reconfigure the Databricks to use a public endpoint. It's very normal to use public endpoints on Databricks, we didn't even support private connections until last year and many large global businesses used it quite happily. I often argue that hooking it up to your corporate network poses more of a risk since attacks would then be targeted rather than random (assuming you didn't make your url identifiable, of course).