Dustin Vannoy
Developer Best Practices on Databricks: Git, Tests, and Automated Deployment
Data engineers and data scientists benefit from using best practices learned from years of software development. This video walks through 3 of the most important practices to build quality analytics solutions. It is meant to be an overview of what following these practices looks like for a Databricks developer.
This video covers:
- Version control basics and demo of Git integration with Databricks workspace
- Automated tests with pytest for unit testing and Databricks Workflows for integration testing
- CI/CD including running tests prior to deployment with GitHub Actions
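A minimal sketch of what that CI step can look like as a GitHub Actions workflow, assuming a hypothetical .github/workflows/ci.yml, a bundle target named dev, and DATABRICKS_HOST/DATABRICKS_TOKEN secrets (all names here are illustrative, not taken from the video):

name: ci-and-deploy
on:
  push:
    branches: [main]

jobs:
  test-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Run unit tests first so a failing test blocks the deployment
      - run: |
          pip install pytest
          pytest tests/
      # Install the Databricks CLI, then deploy the bundle to the dev target
      - uses: databricks/setup-cli@main
      - run: databricks bundle deploy -t dev
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}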
* All thoughts and opinions are my own, though for this video influenced by Databricks SMEs *
Intro video that discusses development process and full list of best practices is available here: ua-cam.com/video/IWS2AzkTKl0/v-deo.html
Blog post for Developer Best Practices on Databricks: dustinvannoy.com/2025/01/05/best-practices-for-data-engineers-on-databricks/
More from Dustin:
Website: dustinvannoy.com
LinkedIn: www.linkedin.com/in/dustinvannoy
Github: github.com/datakickstart
CHAPTERS
0:00 Intro
0:31 Version Control (Git)
7:57 Unit Tests + Integration Tests
28:00 Automated Deploy
36:35 Outro
Views: 340

Videos

7 Best Practices for Development and CICD on Databricks
Views: 605 • 14 days ago
In this video I share why developer experience and best practices are important and why I think Databricks offers the best developer experience for a data platform. I'll cover the high-level developer lifecycle and 7 ways to improve your team's development process, with the goal of better quality and reliability. Stay tuned for follow-up videos that cover some of the key topics discussed here. Blog po...
Databricks VS Code: Multiple Projects In VS Code Workspace
Views: 529 • 2 months ago
In this video I cover a specific question about working with the Databricks Visual Studio Code extension: what if I have many project folders, each as their own bundle, but I want to work in the same VS Code workspace? I talk through a couple of ways to handle this and show how to switch the active project folder in order to run files from different bundles. You may need this if: - VS Code is only opening on...
Databricks VS Code Extension v2: Upgrade steps
Views: 372 • 3 months ago
In this short video I show you how to upgrade a project from using Databricks Visual Studio Code version 1 to using the new version. There are a few key setup steps included and a quick glimpse at the new Databricks run button. For a more complete view of using the Databricks Visual Studio Code extension, see this video: ua-cam.com/video/o4qMWHgT1zM/v-deo.html * All thoughts and opinions are my...
Databricks VS Code Extension v2: Setup and Feature Demo
Views: 2.8K • 3 months ago
Databricks Visual Studio Code Extension v2, the next major release, is now generally available. In this video I walk through the initial setup and the main ways you will run code and deploy resources using this extension. I also provide some key tips to make sure you don't get stuck along the way. * All thoughts and opinions are my own * References: Databricks blog: www.databricks.com/blog/simp...
Databricks CI/CD: Azure DevOps Pipeline + DABs
Views: 7K • 4 months ago
Many organizations choose Azure DevOps for automated deployments on Azure. When deploying to Databricks you can reuse deploy pipeline code similar to what you use for other projects, combined with Databricks Asset Bundles. This video shows most of the steps involved in setting this up, following along with a blog post that shares example code and steps. * All thoughts and opinions are my own * B...
Databricks Asset Bundles: Advanced Examples
Views: 8K • 6 months ago
Databricks Asset Bundles are now GA (generally available). As more Databricks users start to rely on Databricks Asset Bundles (DABs) for their development and deployment workflows, let's look at some advanced patterns people have been asking for, with examples to help them get started. Blog post with these examples: dustinvannoy.com/2024/06/25/databricks-asset-bundles-advanced Intro post: dustinvannoy...
Introducing DBRX Open LLM - Data Engineering San Diego (May 2024)
Views: 295 • 7 months ago
A special event presented by Data Engineering San Diego, Databricks User Group, and San Diego Software Engineers. Presentation: Introducing DBRX - Open LLM by Databricks. By: Vitaliy Chiley, Head of LLM Pretraining for Mosaic at Databricks. DBRX is an open-source LLM from Databricks that, when recently released, outperformed established open-source models on a set of standard benchmarks. Join us to ...
Monitoring Databricks with System Tables
Views: 3.5K • 10 months ago
In this video I focus on a different side of monitoring: What do the Databricks system tables offer me for monitoring? How much does this overlap with the application logs and Spark metrics? Databricks System Tables are a public preview feature that can be enabled if you have Unity Catalog on your workspace. I introduce the concept in the first 3 minutes then summarize where this is most helpfu...
Databricks Monitoring with Log Analytics - Updated for DBR 11.3+
Views: 3.9K • 11 months ago
In this video I show the latest way to set up and use Log Analytics for storing and querying your Databricks logs. My prior video covered the steps for earlier Databricks Runtime versions (prior to 11.0). This video covers using the updated code for Databricks Runtime 11.3, 12.2, or 13.3. There are various options for monitoring Databricks, but since Log Analytics provides a way to easily query l...
Databricks CI/CD: Intro to Databricks Asset Bundles (DABs)
Views: 20K • 1 year ago
Databricks Asset Bundles provide a way to use the command line to deploy and run a set of Databricks assets - like notebooks, Python code, Delta Live Tables pipelines, and workflows. This is useful both for running jobs that are being developed locally and for automating CI/CD processes that will deploy and test code changes. In this video I explain why Databricks Asset Bundles are a good optio...
Data + AI Summit 2023: Key Takeaways
Views: 615 • 1 year ago
Data + AI Summit key takeaways from a data engineer's perspective. Which features coming to Apache Spark and to Databricks are most exciting for data engineering? I cover that plus a decent amount of AI and LLM talk in this informal video. See the blog post for more thought-out summaries and links to many of the keynote demos related to the features I am excited about. Blog post: dustinvanno...
PySpark Kickstart - Read and Write Data with Apache Spark
Views: 898 • 1 year ago
Every Spark pipeline involves reading data from a data source or table and often ends with writing data. In this video we walk through some of the most common formats and cloud storage used for reading and writing with Spark. Includes some guidance on authenticating to ADLS, OneLake, S3, Google Cloud Storage, Azure SQL Database, and Snowflake. Once you have watched this tutorial, go find a free...
Spark SQL Kickstart: Your first Spark SQL application
Views: 968 • 1 year ago
Get hands on with Spark SQL to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset which can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed along with this tutorial, go find a free dataset and try to write your own Spark application. * All t...
PySpark Kickstart - Your first Apache Spark data pipeline
Views: 4.2K • 1 year ago
PySpark Kickstart - Your first Apache Spark data pipeline
Spark Environment - Azure Databricks Trial
Views: 491 • 1 year ago
Spark Environment - Azure Databricks Trial
Spark Environment - Databricks Community Edition
Views: 1.1K • 1 year ago
Spark Environment - Databricks Community Edition
Apache Spark DataKickstart - Introduction to Spark
Views: 1.6K • 1 year ago
Apache Spark DataKickstart - Introduction to Spark
Unity Catalog setup for Azure Databricks
Views: 16K • 1 year ago
Unity Catalog setup for Azure Databricks
Visual Studio Code Extension for Databricks
Views: 17K • 1 year ago
Visual Studio Code Extension for Databricks
Parallel Load in Spark Notebook - Questions Answered
Views: 2.4K • 1 year ago
Parallel Load in Spark Notebook - Questions Answered
Delta Change Feed and Delta Merge pipeline (extended demo)
Views: 2.2K • 2 years ago
Delta Change Feed and Delta Merge pipeline (extended demo)
Data Engineering SD: Rise of Immediate Intelligence - Apache Druid
Views: 251 • 2 years ago
Data Engineering SD: Rise of Immediate Intelligence - Apache Druid
Azure Synapse integration with Microsoft Purview data catalog
Views: 2.2K • 2 years ago
Azure Synapse integration with Microsoft Purview data catalog
Adi Polak - Chaos Engineering - Managing Stages in a Complex Data Flow - Data Engineering SD
Views: 195 • 2 years ago
Adi Polak - Chaos Engineering - Managing Stages in a Complex Data Flow - Data Engineering SD
Azure Synapse Spark Monitoring with Log Analytics
Views: 5K • 2 years ago
Azure Synapse Spark Monitoring with Log Analytics
Parallel table ingestion with a Spark Notebook (PySpark + Threading)
Views: 14K • 2 years ago
Parallel table ingestion with a Spark Notebook (PySpark + Threading)
SQL Server On Docker + deploy DB to Azure
Views: 4.7K • 2 years ago
SQL Server On Docker + deploy DB to Azure
Michael Kennedy - 10 tips for developers and data scientists - Data Engineering SD
Views: 216 • 2 years ago
Michael Kennedy - 10 tips for developers and data scientists - Data Engineering SD
Synapse Kickstart: Part 5 - Manage Hub
Views: 89 • 2 years ago
Synapse Kickstart: Part 5 - Manage Hub

COMMENTS

  • @felipeporto4396
    @felipeporto4396 6 hours ago

    Excellent material

  • @ExplainedbyAI-q2n
    @ExplainedbyAI-q2n 8 hours ago

    At 2:35 you mention getting into some new databricks features like 'jump to code/definition' in a different video. Could you add a link to that video? The option to see where code is defined, especially 'intellisense-like-behaviour' is something I miss a lot, most of all when using the magic %run command to import functions from different notebooks.

  • @perer232
    @perer232 11 hours ago

    Really good stuff! Thank you so much for these posts! It is very inspiring and I have some work to do to reach this level. We are very heavy on using SQL code for transformations, using temporary views and CTEs. Is that a bad strategy in the sense that it makes it really hard to test? So for example, instead of having a CASE statement you would instead use a UDF that is more easily testable? How do you test big SQL transformations?

  • @BrianMurrays
    @BrianMurrays 4 days ago

    Thanks for the video; I had been looking into how to set this up for a while and this video finally got me to a place of having a working process. I just about have all of this set up in my environment, but the most recent issue I'm running into is: if I develop locally and run a DLT pipeline from VS Code, it sets everything up with my credentials. When I merge to my dev branch, that triggers the CI/CD pipeline (running as the service principal), and the step that runs the job throws an error that the tables defined in the DLT pipeline are managed by another pipeline (the one with my credentials). If I use DLT, do I just never test from VS Code, or do I need to go clean those up each time? Is there a better way to manage this?

  • @stefanjelic3318
    @stefanjelic3318 17 days ago

    Great content.

  • @anindyabanerjee5733
    @anindyabanerjee5733 18 days ago

    @DustinVannoy Will this work with a Databricks Personal Access Token instead of a Service Connection/Service Principal?

    • @DustinVannoy
      @DustinVannoy 18 days ago

      Yes, but for deploying DABs to staging/prod you want to use the same user every time so they are consistently the owner. For GitHub Actions I use a token in a secret. I think you could pull from Key Vault in a DevOps pipeline; I'm not positive on the best practice there.

  • @perer232
    @perer232 18 days ago

    Hi! Thanks for the content! Can you describe in more detail how you run automated tests? What do you test, etc.? Could be a topic for a future video, with real examples. Thanks again

    • @DustinVannoy
      @DustinVannoy 18 days ago

      Yes, editing that one to release in January

  • @luiscarcamo8421
    @luiscarcamo8421 18 days ago

    Thanks, Dustin! You helped me a lot with a production pipeline!

  • @indreshsingh3410
    @indreshsingh3410 27 days ago

    Still confused about how exactly to populate your Bundle Resources explorer?

    • @DustinVannoy
      @DustinVannoy 14 days ago

      So you need to have a databricks.yml file in the root folder and it has to have some workflows or pipelines defined. Check out my videos on intro to Databricks Asset Bundles if you aren't sure how to get jobs created. The short answer is you can create a starting project using `databricks bundle init` or find examples like what I have in the resources folder and modify as needed for your project.
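      For illustration, a minimal databricks.yml along those lines (bundle name, host, job, and notebook path are placeholders, not from this reply) gives the Bundle Resources explorer something to show:

      bundle:
        name: my_project

      targets:
        dev:
          mode: development
          default: true
          workspace:
            host: https://adb-1234567890123456.7.azuredatabricks.net

      resources:
        jobs:
          nightly_job:
            name: nightly_job
            tasks:
              - task_key: main
                notebook_task:
                  notebook_path: ./src/main_notebook.py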

  • @אופיראוחיון-ס8י
    @אופיראוחיון-ס8י a month ago

    Thank you! I have a few processes that are not related to each other. Do I need to create a separate DAB for each one? How can I make the process more dynamic?

    • @DustinVannoy
      @DustinVannoy 18 days ago

      The general guidance is: if you want to deploy together and the code can be versioned together, then put it in the same bundle (all using the same databricks.yml). If you want to keep things separate then it's fine to have separate bundles, and you can either deploy in separate CD pipelines or in the same one by calling `databricks bundle deploy` multiple times, once from each directory with a databricks.yml. For making it more dynamic I suggest variables, especially complex variables, but usually that is just to change values based on the target environment. Using the SDK to create workflows is an alternative to DABs, and other things have been discussed which might eventually be more of a blend between the two options.
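      As a sketch of the complex-variable idea mentioned above (the variable name, node types, and worker counts are invented for illustration):

      variables:
        job_cluster_spec:
          description: Cluster settings shared by several jobs
          type: complex
          default:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_D4ds_v5
            num_workers: 2

      targets:
        prod:
          variables:
            job_cluster_spec:
              spark_version: 14.3.x-scala2.12
              node_type_id: Standard_D8ds_v5
              num_workers: 8

      # referenced from a job definition, e.g. new_cluster: ${var.job_cluster_spec}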

    • @אופיראוחיון-ס8י
      @אופיראוחיון-ס8י 18 days ago

      @ Thank you very much!

  • @TenMinuteKQL
    @TenMinuteKQL a month ago

    Great video!

  • @GaneshKrishnamurthy-i9l
    @GaneshKrishnamurthy-i9l a month ago

    Is there a way to define policies as a resource and deploy them? I have some 15 to 20 policies, and my jobs can use any of them. If there were a way to manage these policies and apply policy changes, it would be very convenient.

  • @swapnilmd7616
    @swapnilmd7616 a month ago

    Is it possible to use DAB with a Standard Databricks cluster?

    • @DustinVannoy
      @DustinVannoy 14 days ago

      Yes, meaning not a job cluster but an all-purpose cluster? You can either reference one with existing_cluster_id or define one in resources under the `clusters` section. docs.databricks.com/en/dev-tools/bundles/settings.html#specification
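      A hedged sketch of both options (the cluster spec, job name, and cluster ID below are placeholders, not from this reply):

      resources:
        # Option A: define an all-purpose cluster in the bundle itself
        clusters:
          dev_cluster:
            cluster_name: dev_cluster
            spark_version: 15.4.x-scala2.12
            node_type_id: Standard_D4ds_v5
            num_workers: 1

        jobs:
          my_job:
            name: my_job
            tasks:
              # Option B: point a task at a cluster that already exists in the workspace
              - task_key: main
                existing_cluster_id: 0123-456789-abcdefgh
                notebook_task:
                  notebook_path: ./src/main_notebook.py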

  • @norbertczulewicz1695
    @norbertczulewicz1695 a month ago

    I tried to test this extension and Databricks Connect, but when I run a *.py file with Databricks Connect the Spark session variable is not initialized. I got an error: pyspark.errors.exceptions.base.PySparkRuntimeError: [CONNECT_URL_NOT_SET] Cannot create a Spark Connect session because the Spark Connect remote URL has not been set. Please define the remote URL by setting either the 'spark.remote' option or the 'SPARK_REMOTE' environment variable. I didn't configure SPARK_REMOTE, but I added explicit session creation: config = Config(host=dbs_host, token=access_token, cluster_id=cluster_id); spark = DatabricksSession.builder.sdkConfig(config).getOrCreate(). I use the Profile auth type. Databricks Connect is enabled. Upload-and-run-file works. Databricks Runtime is 15.4.x, Databricks Connect 15.4.3.

  • @maeklund86
    @maeklund86 a month ago

    Great video, learned a lot! I do have a question; would it make sense to define a base environment for serverless notebooks and jobs, and in the bundle reference said default environment? Ideally it would be in one spot, so upgrading the package versions would be simple and easy to test. This way developers could be sure that any package they get used to, is available across the whole bundle.

    • @DustinVannoy
      @DustinVannoy 14 days ago

      The idea makes sense but the way environments interact with workflows is still different depending on what task type you use. Plus you can't use them with standard clusters at this point. So it depends on how much variety you have in your jobs which is why I don't really include that in my repo yet.

  • @derkachm
    @derkachm a month ago

    Hi Dustin, nice video! Any plans to do the same but for Microsoft Fabric?

    • @DustinVannoy
      @DustinVannoy a month ago

      @derkachm No, I am not doing enough with Fabric yet to add anything new there.

  • @KaioPedroza
    @KaioPedroza a month ago

    Just great!!! I'm still using only the workspace UI and setup, but I really want to start using this VS Code extension. I'm going to test some features and run some basic commands. But anyway, just great! Thank you very much.

  • @vinayakmishra1837
    @vinayakmishra1837 a month ago

    Can custom logs not be written via Diagnostic Settings? Is that the reason for using spark-monitoring?

  • @thevision-y1b
    @thevision-y1b 2 months ago

    Is the spill memory bad? @3:48

    • @DustinVannoy
      @DustinVannoy 2 months ago

      @thevision-y1b Yes, it's not ideal. It indicates I either want to 1) change to a worker VM type with more memory per core or 2) split into more tasks, since the input size for my median and max tasks is a bit too high. By the way, these days that input size is usually OK for me, but I use different node types now.

  • @collinsm8263
    @collinsm8263 2 months ago

    Thank you for the video 👌. I have a question: how can I convert existing Databricks jobs (ML, Python, SQL, etc.) that were initiated manually in the past to start running through the pipeline? That is, our data engineering team was running these jobs manually, but now we want to use DABs and Azure DevOps to run the jobs automatically. Thank you

  • @kamalkunjapur5383
    @kamalkunjapur5383 2 months ago

    Great video!! Much appreciate the effort put in to add join section, Dustin.

  • @moncefansseti1907
    @moncefansseti1907 2 months ago

    Hey Dustin, if we want to add more resources like ADLS bronze, silver, and gold storage, do we need to add them to the environment variables?

    • @DustinVannoy
      @DustinVannoy 18 days ago

      You can deploy schemas within Unity Catalog, but for external storage locations or volumes I would expect those to either happen from Terraform or as notebooks/scripts that you run in the deploy pipeline. Jobs to populate the storage would be defined in DABs, but not the creation of the storage itself, unless it's built into a job you trigger with bundle run.
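      For the schema part, a minimal sketch of a Unity Catalog schema deployed through a bundle (catalog and schema names are placeholders):

      resources:
        schemas:
          bronze_schema:
            catalog_name: main
            name: bronze
            comment: Raw ingested data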

  • @gangadharneelam3107
    @gangadharneelam3107 2 months ago

    Very helpful. Thanks for sharing!!

  • @mananbhimani8024
    @mananbhimani8024 3 months ago

    My cluster is taking so much time to deploy, any ideas?

    • @DustinVannoy
      @DustinVannoy 3 months ago

      If you view the event log you might see some things. Sometimes a message will show that compute couldn't be retrieved from Azure which may be a quota limit (very common in trial accounts). If you added init scripts or libraries that can slow it down. Otherwise you can try posting more details (like event log info) in Databricks community. If you are really stuck and that doesn't help, message me more details through LinkedIn.

  • @fb-gu2er
    @fb-gu2er 3 months ago

    Now do AWS 😂

    • @DustinVannoy
      @DustinVannoy 3 months ago

      Meaning AWS account with GitHub Actions? If not, what combo of tools are you curious about for the deployment?

  • @fb-gu2er
    @fb-gu2er 3 months ago

    Any way to see a plan like you would with terraform?

    • @DustinVannoy
      @DustinVannoy 3 months ago

      Not really; using `databricks bundle validate` is the best way to see things. There are some options to view debug output, but I haven't found something that works quite like Terraform plan. When you run destroy it does show what will be destroyed before you confirm.

  • @gardnmi
    @gardnmi 3 months ago

    Still needs work. Issues I found so far:
    1. The authentication can still clobber your CLI auth, causing lots of confusion.
    2. The file sync needs a full refresh option. The only way to currently do so is to delete the sync folder in the .databricks folder.
    3. Sync needs to be two-way. Databricks/Spark Connect is still not feature complete, so you unfortunately have to use the notebook in some cases.
    4. The overwrite job cluster feature installs your Python whl onto your all-purpose cluster, but if you make any changes to the package it doesn't remove the old whl and update it with a new whl with your changes, causing confusing errors.

    • @DustinVannoy
      @DustinVannoy 3 months ago

      For number 2, I agree. For number 3, I disagree; I think using a git provider to push/pull from various environments is the right way to handle it. This is based on my belief that it's too confusing to sync two ways without git, and often a team of people may be working together anyway. For number 4, if you append an incrementing version number or timestamp it will update on an all-purpose cluster that already has it installed. Not really an IDE thing, but it is all sort of related.

  • @benjamingeyer8907
    @benjamingeyer8907 3 months ago

    We need PyCharm and DataGrip support!

    • @DustinVannoy
      @DustinVannoy 3 months ago

      blog.jetbrains.com/pycharm/2024/08/introducing-the-pycharm-databricks-integration/

  • @unilmittakola
    @unilmittakola 3 months ago

    Hey Dustin, we're currently implementing Databricks Asset Bundles using Azure DevOps to deploy workflows. The bundles we are using are stored in GitHub. Can you please help me with the YAML script for it?

  • @praveenreddy177
    @praveenreddy177 3 months ago

    How to remove the [dev my_user_name] prefix? Please suggest.

    • @DustinVannoy
      @DustinVannoy 3 months ago

      Change from mode: development to mode: production (or just remove that line). This will remove the prefix and change the default destination. However, for the dev target I recommend you keep the prefix if multiple developers will be working in the same workspace. The production target is best deployed as a service principal from a CI/CD pipeline (like an Azure DevOps pipeline) to avoid different people deploying the same bundle and having conflicts with resource owner and code version.
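      Roughly what that targets section can look like, as a sketch with placeholder hosts (dev keeps the prefix, prod drops it and is deployed by a service principal):

      targets:
        dev:
          default: true
          mode: development   # adds the [dev my_user_name] prefix
          workspace:
            host: https://adb-1111111111111111.1.azuredatabricks.net

        prod:
          mode: production    # no prefix; deploy from CI/CD as a service principal
          workspace:
            host: https://adb-2222222222222222.2.azuredatabricks.net
            root_path: /Shared/.bundle/prod/${bundle.name}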

    • @praveenreddy177
      @praveenreddy177 3 months ago

      @DustinVannoy Thank you, Vannoy!! Worked fine now!!

  • @AbhijitIngale-h6b
    @AbhijitIngale-h6b 3 months ago

    Hi Dustin, a basic question: how is this method different from configuring Azure portal -> Databricks workspace home page -> Diagnostic Settings -> exporting logs to Log Analytics?

    • @DustinVannoy
      @DustinVannoy 3 months ago

      The things that are logged are different. I've never written it up but we had some logs enabled that way plus we used this. There are other options to get logs, of course, but I found this one to be useful in the past for Azure focused environments.

  • @houssemlahmar6409
    @houssemlahmar6409 3 months ago

    Thanks Dustin for the video. Is there a way I can specify a subset of resources (workflows, DLT pipelines) to run in a specific environment? For example, I would like to deploy only the unit test job in the DEV environment and not in PROD.

    • @DustinVannoy
      @DustinVannoy 3 months ago

      You would need to define the job in the targets section of only the targets you want it in. If it needs to go to more than one environment, use a YAML anchor to avoid code duplication. I would normally just let a testing job get deployed to prod without a schedule, but others can't allow that or prefer not to do it that way.
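      A sketch of scoping a job to a single target (job and notebook names are invented): because unit_test_job is defined only under the dev target, it never gets deployed to prod.

      targets:
        dev:
          resources:
            jobs:
              unit_test_job:
                name: unit_test_job
                tasks:
                  - task_key: run_tests
                    notebook_task:
                      notebook_path: ./tests/run_unit_tests.py
        prod:
          mode: production
          # unit_test_job is intentionally not defined here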

  • @albertwang1134
    @albertwang1134 4 months ago

    Hi Dustin, have you tried to configure and deploy a single node cluster by using Databricks Bundle?

    • @DustinVannoy
      @DustinVannoy 3 months ago

      Yes, it is possible. It looks something like this:
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: m6gd.xlarge
            num_workers: 0
            data_security_mode: SINGLE_USER
            spark_conf:
              spark.master: local[*, 4]
              spark.databricks.cluster.profile: singleNode
            custom_tags: {"ResourceClass": "SingleNode"}

    • @albertwang1134
      @albertwang1134 3 months ago

      @DustinVannoy Thanks a lot! This cannot be found in the Databricks documentation.

  • @lavenderliu7833
    @lavenderliu7833 4 months ago

    Hi Dustin, is there any way to monitor the compute event log from Log Analytics?

  • @gangadharneelam3107
    @gangadharneelam3107 4 months ago

    Hey Dustin, We're currently exploring DABs, and it feels like this was made just for us!😅 Thanks a lot for sharing it!

  • @gangadharneelam3107
    @gangadharneelam3107 4 months ago

    Hey Dustin, Thanks for the amazing explanation! DABs are sure to be adopted by every dev team!

  • @thusharr7787
    @thusharr7787 4 months ago

    Thanks, one question: I have some metadata files in the project folder and I need to copy them to a volume in Unity Catalog. Is that possible through this deploy process?

    • @DustinVannoy
      @DustinVannoy 4 months ago

      Using the Databricks CLI, you can add a command that copies data up to the volume. Replace all the curly brace { } parts with your own values:
      databricks fs cp --overwrite {local_path} dbfs:/Volumes/{catalog}/{schema}/{volume_name}/{filename}

  • @saipremikak5049
    @saipremikak5049 4 months ago

    Wonderful tutorial, Thank you! This approach works effectively for running multiple tables in parallel when using spark.read and spark.write to a table. However, if the process involves reading with spark.read and then merging the data into a table based on a condition, one thread interferes with another, leading to thread failure. Is there any workaround for this?

    • @VishalMishra-d3f
      @VishalMishra-d3f 3 months ago

      Nice observation. I am also facing this issue. Could you figure out a solution? How do you know "merging the data into a table based on a condition" is the issue?

    • @DustinVannoy
      @DustinVannoy 3 months ago

      I don't think I follow. Is there a code example you can send along? For Databricks I sometimes just set this up as separate parallel workflow tasks, but you may be describing other challenges. If there is an error message you encounter, please share it.

  • @deepakpatil5059
    @deepakpatil5059 4 months ago

    Great content!! I am trying to deploy the same job into different environments (DEV/QA/PRD). I want to override parameters passed to the job from a variable group defined in the Azure DevOps portal. Can you please suggest how to proceed?

    • @DustinVannoy
      @DustinVannoy 4 months ago

      The part that references the variable group PrdVariables shows how you set different variables and values depending on the target environment:
      - stage: toProduction
        variables:
          - group: PrdVariables
        condition: |
          eq(variables['Build.SourceBranch'], 'refs/heads/main')
      In the part where you deploy the bundle, you can pass in variable values. See the docs for how that can be set. docs.databricks.com/en/dev-tools/bundles/settings.html#set-a-variables-value

  • @albertwang1134
    @albertwang1134 4 months ago

    I am learning DABs at this moment. So lucky that I found this video. Thank you, @DustinVannoy. Do you mind if I ask a couple of questions?

    • @DustinVannoy
      @DustinVannoy 4 months ago

      Yes, ask away. I'll answer what I can.

    • @albertwang1134
      @albertwang1134 4 months ago

      Thank you, @DustinVannoy. I wonder whether the following development process makes sense, and whether there is anything we could improve.
      Background: (1) We have two Azure Databricks workspaces, one for development and one for production. (2) I am the only data engineer in our team, and we don't have dedicated QA. I am responsible for development and testing. Those who consume the data will do UAT. (3) We use Azure DevOps (repository and pipelines).
      Process:
      (1) Initialization: (1.1) Create a new project by using `databricks bundle init`. (1.2) Push the new project to Azure DevOps. (1.3) On the development DBR workspace, create a Git folder under `/Users/myname/` and link it to the Azure DevOps repository.
      (2) Development: (2.1) Create a feature branch on the DBR workspace. (2.2) Do my development and hand testing. (2.3) Create a unit test job and the scheduled daily job. (2.4) Create a pull request from the feature branch to the main branch on the DBR workspace.
      (3) CI: (3.1) An Azure CI pipeline (build pipeline) will be triggered after the pull request is created. (3.2) The CI pipeline will check out the feature branch and do `databricks bundle deploy` and `databricks bundle run --job the_unit_test_job` on the development DBR workspace using a Service Principal. (3.3) The test result will show on the pull request.
      (4) CD: (4.1) If everything looks good, the pull request will be approved. (4.2) Manually trigger an Azure CD pipeline (release pipeline). Check out the main branch and do `databricks bundle deploy` to the production DBR workspace using a Service Principal.
      Explanation: (1) Because we are a small team and I am the only person who works on this, we do not have a `release` branch, to simplify the process. (2) For the same reason, we also do not have a staging DBR workspace.

    • @DustinVannoy
      @DustinVannoy 4 months ago

      Overall process is good. It's typical not to have a separate QA person. I try to use a YAML pipeline for the release step so the code would look pretty similar to what you use to automate deploys to dev. I recommend having unit tests you can easily run as you build, which is why I try to use Databricks Connect to run a few specific unit tests at a time. But running workflows on all-purpose or serverless compute isn't too bad an option for quick testing as you develop.

  • @benjamingeyer8907
    @benjamingeyer8907 4 months ago

    Now do it in Terraform ;) Great video as always!

    • @DustinVannoy
      @DustinVannoy 4 months ago

      🤣🤣 it may happen one day, but not today. I would probably need help from build5nines.com

  • @asuretril867
    @asuretril867 4 months ago

    Thanks a lot Dustin... Really appreciate it :)

  • @pytalista
    @pytalista 4 months ago

    Thanks for the video. It helped me a lot in my YT channel.

  • @bartsimons6325
    @bartsimons6325 4 months ago

    Great video Dustin! Especially on the advanced configuration of the databricks.yaml. I'd like to hear your opinion on the /src in the root of the folder. If your team/organisation is used to working with a monorepo, it would be great to have all common packages in the root; however, if you're more of a polyrepo kind of team/organisation, building and hosting the packages remotely (i.e. Nexus or something) could be a better approach in my opinion. Or am I missing something? How would you deal with a job where task 1 and task 2 have source code with conflicting dependencies?

  • @DataMyselfAI
    @DataMyselfAI 4 months ago

    Is there a way for Python wheel tasks to combine the functionality we had without serverless, i.e. using:
    libraries:
      - whl: ../dist/*.whl
    so that the wheel gets deployed automatically when using serverless? If I try to include environments for serverless I can no longer specify libraries for the wheel task (and therefore it is not deployed automatically), and I also need to hardcode my path for the wheel in the workspace. Could not find an example for that so far. All the best, Thomas

    • @DustinVannoy
      @DustinVannoy 4 months ago

      Are you trying to install the wheel in a notebook task, so you are required to install with %pip install? If you include the artifact section it should build and upload the wheel regardless of usage in a task. You can predict the path within the .bundle deploy if you aren't setting mode: development, but I've been uploading it to a specific workspace or volume location. As environments for serverless evolve I may come back with more examples of how those should be used.
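      For reference, an artifact section along the lines described here might look like this sketch (the package path and build command are assumptions):

      artifacts:
        default:
          type: whl
          build: python -m build --wheel
          path: ./my_package

      # then reference the built wheel from a task, e.g.:
      #   libraries:
      #     - whl: ./my_package/dist/*.whl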

  • @HughVert
    @HughVert 4 months ago

    Hey, thanks for the video! I was wondering if you know whether those audit logs still exist even if audit log delivery is not configured? I mean, will events still be written in the back end, and once it is enabled (via system tables), can they be consumed?

  • @usmanrahat2913
    @usmanrahat2913 5 months ago

    How do you enable intellisense?

  • @dreamsinfinite83
    @dreamsinfinite83 5 months ago

    How do you change the catalog name specific to an environment?

    • @DustinVannoy
      @DustinVannoy 4 months ago

      I would use a bundle variable and set it in the target overrides, then reference it anywhere you need it.
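      A minimal sketch of that pattern (variable and catalog names are placeholders):

      variables:
        catalog:
          description: Unity Catalog name used by this bundle
          default: dev_catalog

      targets:
        prod:
          variables:
            catalog: prod_catalog

      # referenced anywhere as ${var.catalog}, for example in a job or task parameter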

  • @dhananjaypunekar5853
    @dhananjaypunekar5853 6 months ago

    Thanks for the explanation! Is there any way to view exported DBC files in VS Code?

    • @DustinVannoy
      @DustinVannoy 4 months ago

      You should export as source files instead of dbc files if you want to view and edit in VS Code.

  • @NoahPitts713
    @NoahPitts713 6 months ago

    Exciting stuff! Will definitely be trying to implement this in my future work!