You are the god of Databricks!! Enjoying watching and learning))
I like the idea of keeping the code in wheel structure so that we can build the wheels for unit testing and possibly integration testing. It's the best of both worlds. Nice!😃
Thanks a lot for putting this video, it is going to save my life
This is great, and I think it has solved an issue we have been battling with - how multiple people can develop at the same time. I guess for external dependencies you will need to make sure they are already installed on the cluster - so you lose the nice aspect of using pip install on your whl to automatically download them.
thanks Simon!!!!! just great video , great content
Thanks
This is really cool. Thanks for showing this.
I'm brand new to Databricks. I want to get a team of data engineers to do proper testing with CI/CD. Using wheels seemed to create as many problems as it solved. Local repo Python modules that are test-driven seem like a good baby step to ensuring quality while allowing rapid, sustainable development.
This was awesome, thanks! Definitely thinking I want to change to use this feature rather than the wheels
Good stuff. Let me know how you get on
Thanks for this perspective on using packages in notebooks. The dbx init command creates a package structure, but there is no clear documentation on how the package can be built/used within notebooks. It makes more sense now.
Hi Simon, thanks for the video. Regarding your concern at minute 22 (that in prod you'd prefer a concrete version of the whl file), I think it can be solved this way. The application repo builds a whl file and uploads it to S3 with a specific version. The versioned package is loaded when the cluster spins up. It can be the same for the dev and test envs too. To enable parallel development, the approach you show in this video can be used, but with different versions of the whl files (set in setup.py), so that one developer does not overwrite the master package for another in the dev and test envs. That way, as you explain here with the Repos import-module functionality, many developers can make changes and test them in a workspace notebook, while the prod env runs a fixed version (a regular whl file) and the prod version of the whl file stays under control. I did a similar process and it works fine.
Best
Akif
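A minimal sketch of the per-developer versioning idea above, assuming the version string is composed in Python before being handed to `setup(version=...)`. The helper name `dev_version` and the example values are hypothetical; the `+label` suffix follows the PEP 440 local version format (lowercase letters, digits and dots).

```python
import re


def dev_version(base_version, developer):
    """Append a per-developer local label so dev builds don't collide."""
    # Keep only characters allowed in a PEP 440 local label.
    label = re.sub(r"[^a-z0-9]+", ".", developer.lower()).strip(".")
    if not label:
        raise ValueError("developer name has no usable characters")
    return f"{base_version}+{label}"


print(dev_version("1.4.0", "Akif"))      # hypothetical base version and name
print(dev_version("1.4.0", "Jane Doe"))
```

The prod build would keep the plain base version, so the fixed whl on the prod cluster is never confused with a developer build.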
Coool!
Thanks for this great video!
Thanks
Good video. How can I read a Delta table inside a UDF? Please suggest.
I really love this; it answers our questions about simplifying code.
niiice!!! but, what if I need to use an exclusive databricks command, like "dbutils" in a function of this library? A .py file wouldn't run, what would be another option?
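One possible option, offered as a sketch rather than something shown in the video: let the library function take the notebook's `dbutils` object as an argument, so the .py module never has to find it on its own. The names `get_api_key` and `make_fake_dbutils`, and the scope/key values, are hypothetical; the only real Databricks call assumed is `dbutils.secrets.get(scope=..., key=...)`.

```python
from types import SimpleNamespace


def get_api_key(dbutils, scope="my-scope", key="api-key"):
    """Read a secret through whatever dbutils-like object the caller passes in."""
    return dbutils.secrets.get(scope=scope, key=key)


def make_fake_dbutils(secrets):
    """Hypothetical stand-in for dbutils, for local unit tests only."""
    def _get(scope, key):
        return secrets[(scope, key)]
    return SimpleNamespace(secrets=SimpleNamespace(get=_get))


# Outside Databricks, the function can be exercised with the fake object.
fake = make_fake_dbutils({("my-scope", "api-key"): "s3cr3t"})
print(get_api_key(fake))
```

In a notebook the call would simply be `get_api_key(dbutils)`, which keeps the module importable and testable off-cluster.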
Hey Simon! How does this work for a job cluster when launching Databricks operators from Airflow?
@AdvancingAnalytics Hi Simon, I finally found a tutorial related to my use case. Thanks a lot for this. I do have a question, if you could please help me with it. I have an application running on a Linux server that has shell scripts, a conda env setup, pip_req.txt, and Python files for the ETL process. We also use SQLite for metadata management. If I have to move this to Databricks with minimal code changes, how should I design it? That is, for the shell script calls, the SQLite DB and the conda env setup, what would the alternatives be in Databricks, or will they work as they are?
Hi, I have one question. How can I expose my Databricks notebook as an endpoint for front-end applications? I think creating workflows and running a job will take some time to produce the output. We want something in real time. Can you suggest something?
The only way to expose "a notebook" is by having your app call the Jobs API and trigger that notebook - it can take input parameters, provide outputs etc., so it would mimic a web service, but you won't get over the latency problem of calling a Spark job. If you are trying to return the results of a SQL query, use the SQL endpoint & Serverless instead; if you're trying to do inference with machine learning, use model serving endpoints. Those are pretty much your options!
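As a rough sketch of the Jobs API route described above, assuming the notebook is already wrapped in a job: the app would POST to the Jobs API 2.1 `run-now` endpoint with a `job_id` and `notebook_params`. The workspace URL, token, job id and parameter names below are placeholders, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_run_now_request(host, token, job_id, notebook_params):
    """Build (but do not send) a Jobs API 'run-now' request for an existing job."""
    body = json.dumps({"job_id": job_id, "notebook_params": notebook_params})
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_run_now_request(
    host="https://example.cloud.databricks.com",  # placeholder workspace URL
    token="dapi-placeholder",                      # placeholder token
    job_id=123,                                    # placeholder job id
    notebook_params={"customer_id": "42"},         # placeholder parameter
)
print(req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```

The response carries a run id that the app then has to poll for completion, which is where the latency mentioned above comes from.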
Thanks Simon for sharing your knowledge. Question around managing code for multiple entities. Would you create multiple git repos per databricks entity or 1 repo with all the entities in databricks cicd-labs sort of folder structure?
Great question. This is a big debate: mono vs multi repo. We always prefer multi repo to make CI/CD less complicated.
If you want to use Poetry as the dependency manager, how do you solve this? How do you install the dependencies when working the Repos way?
Hi, nice video.
I would like to know if there is any way to do the same in R?
I want to version the library I developed and reference it in a script from the repository. The material for R is too poor.
Hi, in case I create a FunctionTest notebook that sits outside of DBXRepos and comes from a different repo altogether, can I still import using "from Library.hydr8.audit.lineage import *" and call addLineage()? (Using the 2nd approach you showed?)
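One way this could work, sketched under assumptions rather than confirmed for this setup: append the other repo's root folder to `sys.path` before importing. The temporary directory below stands in for that root (in a workspace it would be a Repos path, which is left as a placeholder here), and the body of `addLineage` is invented purely so the example runs.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the root of the *other* repo; in a workspace this would be
# something like /Workspace/Repos/<user>/<repo> (placeholder, an assumption).
other_repo_root = Path(tempfile.mkdtemp())

# Recreate a tiny version of the package layout mentioned in the question.
pkg_dir = other_repo_root / "Library" / "hydr8" / "audit"
pkg_dir.mkdir(parents=True)
for folder in (other_repo_root / "Library", pkg_dir.parent, pkg_dir):
    (folder / "__init__.py").write_text("")
(pkg_dir / "lineage.py").write_text(
    "def addLineage(record):\n"
    "    return {**record, 'lineage': 'added'}\n"
)

# Make the other repo importable from this notebook/session.
sys.path.append(str(other_repo_root))
importlib.invalidate_caches()

from Library.hydr8.audit.lineage import addLineage

print(addLineage({"id": 1}))
```

A notebook inside the same repo gets the repo root on its path for free; one outside it would need the explicit `sys.path.append`.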
Simon, would this change your thinking / architecture for Hydro to put Config metadata in JSON files and not bother with a database at all?
Nope, not on its own. It's certainly convenient to reference config directly, but you still need something to manage searching, queueing etc. Definitely a nice pattern for local config, but I'm not convinced I want production systems relying on having the right branch synced to a repo!
@@AdvancingAnalytics I understand what you're saying, but how is that any different from making sure production is built and deployed from the right branch?
Furthermore, it seems to me it would eliminate complexity and artifacts that could be rationalized as driven by legacy limitations.
@@darryll127 very true, much of it is "it feels wrong" not "it is wrong". As long as we can still factor in the relevant code quality checks, linting/formatting, syntax checks, testing etc, then there's no real reason that it's "bad".
From a wider architecture view, keeping metadata purely in the repo means it's not accessible to other tools, so orchestration planes wouldn't see it, for example. If you're building an entirely Databricks-based architecture, it might be suitable?
@@AdvancingAnalytics you raise a good point, however I could see a process whereby you have the option in your framework to read the JSON and store it in Delta tables (taking advantage of things like AutoLoader, schema evolution, complex types) and then take advantage of the Delta readers, which are not dependent on Databricks per se, as a means of exposing the Config data to external environments.
Hello
Firstly, love the channel, Simon and the team! I have been using %run "../shared/xxxxx" as a technique to consume functions from a notebook stored within a git repo. Are there any downsides to this option? - Many thanks!
Great question. Wheels are transferable and testable. Notebooks are not as easy to test. Wheels give you a more robust deployment option.
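To illustrate the testability point, here is a small sketch of what a plain Python function plus a unit test can look like once the code lives in a module rather than in a %run notebook. The function `clean_column_name` is hypothetical and only stands in for real library code.

```python
import unittest


def clean_column_name(name):
    """Hypothetical library function: normalise a column name."""
    return name.strip().lower().replace(" ", "_")


class CleanColumnNameTest(unittest.TestCase):
    def test_spaces_and_case(self):
        self.assertEqual(clean_column_name("  Order Date "), "order_date")


# Run the tests without exiting the interpreter (also works in a notebook cell).
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanColumnNameTest)
result = unittest.TextTestRunner().run(suite)
```

The same test file can run locally or in a CI pipeline, which is hard to do for functions that only exist inside a notebook.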