This is helpful, Dustin, thank you.
I have a question for you. This is kind of related to your problem, as I'm trying to figure out testing as well as the branching + deployment strategy (e.g. CI vs. abbreviated trunk-based dev + PR + CI for automated testing), but I have perhaps 3 different types of "work" that are related but different, possibly in the same repo. So setting up CI/CD pipelines and testing is downstream of the repo structure, or rather of whether there are multiple repos.
So that said, I'm trying to decide on repo structure, and I'm a bit torn on whether there should be a single repository or separate repositories.
I'm working as an embedded end-to-end ML/AI/Data engineer on a product team, and we have 3 primary "functions of work", as I like to call them:
1. Automated ETL Pipelines (e.g. Databricks workflows)
2. LLM Tools, Utilities, etc -- we use or plan to use a lot of LLMs
3. POCs for businesses using GenAI/ML/etc. that may NEVER need an updated data set, so no DBX Workflow is needed
I don't think these should all be in the same repo. The main question is whether #3, which likely also does not have unit tests or integration tests, should live in the same repo as #1 (DBX Workflows), perhaps in a separate top-level directory called "Sandbox", or whether it should be its own GitHub repository. And if/when any one of these POCs gets the go-ahead for regular data ingestion and/or "PROD usage", perhaps it is then ported over to "Workflows" (either the directory, if it lives in the same repo, or the Workflows repo, if it's separate).
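To make the single-repo option concrete, I'm picturing something roughly like this (directory names are just placeholders):

```
repo-root/
├── workflows/    # 1. automated ETL pipelines (DBX Workflows) -- tested, full CI/CD
├── llm_utils/    # 2. shared LLM tools/utilities -- unit tested, reusable
└── sandbox/      # 3. POCs -- no tests, excluded from CI, promoted into workflows/ if productionized
```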
I feel this video applies very well to #1, and perhaps to #2, and that I should keep #3 to itself.
I have another question about CI/CD as it pertains to doing both a Medallion Architecture and DEV/TEST/PROD environments. There seems to be a ton of redundancy if I am using a CI pipeline to test my code in DEV on a pull request in order to merge it to MAIN in the DEV environment. At that point the code is "tested", but environment-wise it still lives in DEV and has not yet been pushed up through UAT or PROD.
Excellent material
Really good stuff! Thank you so much for these posts! It is very inspiring and I have some work to do to reach this level.
We rely heavily on SQL code for transformations, using temporary views and CTEs. Is that a bad strategy in the sense that it makes things really hard to test?
So, for example, instead of having a CASE statement you would use a UDF that is more easily testable?
How do you test big SQL transformations?
The downside of using SQL directly is that it makes testing harder, but there are a few different options. Certain ways of doing ETL with SQL, like dbt or DLT, have testing capabilities built in. Are you using SQL files that are then orchestrated via Databricks Workflows? If so, you could follow similar ideas to what I show in the integration testing.
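As a very rough sketch of that idea (the file name, view name, and columns below are made up, and it assumes the SQL file holds a single SELECT over an `orders` view):

```python
# Hypothetical integration-style test for a SQL transformation file.
# Assumes transform_orders.sql is a single SELECT that reads from a view named `orders`.
from pathlib import Path

from pyspark.sql import SparkSession


def test_orders_transformation(spark: SparkSession):
    # Arrange: build a tiny, known input and expose it under the name the SQL expects
    input_df = spark.createDataFrame(
        [(1, "2024-01-01", 10.0), (2, "2024-01-02", -5.0)],
        ["order_id", "order_date", "amount"],
    )
    input_df.createOrReplaceTempView("orders")

    # Act: run the same SQL the Workflow task would run
    sql_text = Path("transform_orders.sql").read_text()
    result = spark.sql(sql_text)

    # Assert: check business rules on the output (e.g. negative amounts were filtered out)
    assert result.filter("amount < 0").count() == 0
```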
Using UDFs can make some of your unit testing easier but can have performance implications, so it isn't always a best practice. I'm hoping to share more about SQL-based ETL options this year, but please let me know more about your design so it can influence some of the examples I build on this topic.
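As a small made-up illustration of that trade-off, where the rule lives in a plain Python function that is easy to unit test but runs as a Python UDF at execution time:

```python
# Made-up example: the rule that would otherwise be a CASE expression
# becomes a plain Python function that can be unit tested without Spark.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType


def risk_bucket(amount):
    """Pure business rule -- easy to test in isolation."""
    if amount is None:
        return "unknown"
    return "high" if amount >= 1000 else "low"


# Wrapping it as a UDF lets DataFrame/SQL code call it, but Python UDFs are
# generally slower than built-in expressions -- that's the performance trade-off.
risk_bucket_udf = F.udf(risk_bucket, StringType())


def test_risk_bucket():
    assert risk_bucket(1500.0) == "high"
    assert risk_bucket(10.0) == "low"
    assert risk_bucket(None) == "unknown"
```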