Advancing Spark - Developing Python Libraries with Databricks Repos

  • Published 11 Nov 2021
  • The addition of Databricks Repos changed a lot of our working processes around maintaining notebooks, but the process for building out our own python libraries hasn't changed much over the years. With "Files for Databricks Repos", we suddenly see a massive shift in how we can structure our library development, with some huge productivity boosts in there.
    In this video, Simon talks through the process from the ground up - taking a simple dataframe transformation, turning it into a function, building that function into a wheel then replacing it with a direct reference inside Databricks Repos!
    For more info on the new additions to Databricks Repos, check out docs.databrick...
    As always, if you need help with your Data Lakehouse journey, stop by www.advancinganalytics.co.uk to see if we can help.
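
    For illustration, a minimal sketch of the workflow described above, assuming a hypothetical module my_library/transforms.py kept inside the repo (the names are not the ones used in the video):

        # my_library/transforms.py -- a simple DataFrame transformation kept as a plain function
        from pyspark.sql import DataFrame
        from pyspark.sql import functions as F

        def add_load_date(df: DataFrame) -> DataFrame:
            """Stamp each row with the date it was loaded."""
            return df.withColumn("load_date", F.current_date())

        # In a notebook inside the same Databricks Repo the function can be imported directly,
        # with no wheel build/install step, once files in Repos are available:
        #   from my_library.transforms import add_load_date
        #   display(add_load_date(spark.range(5)))   # `spark` is provided by the notebook runtime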

COMMENTS • 34

  • @dmitryanoshin8004 · 2 years ago · +5

    You are the god of Databricks!! Enjoying watching and learning))

  • @lackshubalasubramaniam7311 · 1 year ago

    I like the idea of keeping the code in a wheel structure so that we can build the wheels for unit testing and possibly integration testing. It's the best of both worlds. Nice! 😃
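
    For illustration, a unit test along these lines (the file name and the function under test are assumptions, reusing the hypothetical my_library.transforms sketch above) runs the same whether the package is imported from Repos during development or installed from the built wheel in CI:

        # tests/test_transforms.py -- hypothetical test for the add_load_date sketch
        import pytest
        from pyspark.sql import SparkSession
        from my_library.transforms import add_load_date

        @pytest.fixture(scope="session")
        def spark():
            # Local Spark session so the test also runs on a laptop or build agent
            return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

        def test_add_load_date_adds_column(spark):
            result = add_load_date(spark.range(3))
            assert "load_date" in result.columns
            assert result.count() == 3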

  • @toddflanders8155 · 2 years ago · +2

    I'm brand new to Databricks. I want to get a team of data engineers to do proper testing with CI/CD. Using wheels seemed to create as many problems as it solved. Local repo Python modules that are test-driven seem like a good baby step to ensuring quality while allowing rapid, sustainable development.

  • @niallferguson8019 · 2 years ago · +1

    This is great, and I think it has solved an issue we have been battling with: how multiple people can develop at the same time. I guess for external dependencies you will need to make sure they are already installed on the cluster, so you lose the nice aspect of using pip install on your whl to automatically download them.
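
    One way to cover that gap (not shown in the video; the package pin and the requirements path below are illustrative placeholders) is to install the library's external dependencies explicitly, for example notebook-scoped:

        %pip install requests==2.31.0
        %pip install -r /Workspace/Repos/<user>/<repo>/requirements.txt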

  • @goat4real262 · 6 months ago

    Thanks a lot for posting this video, it is going to save my life.

  • @mrakifcakir · 2 years ago · +1

    Hi Simon, thanks for the video. Regarding your concern around minute 22 (in prod, preferring a concrete version of the whl file), I think it can be solved this way: the application repo builds a whl file and uploads it to S3 with a specific version, and that versioned package is loaded while the cluster spins up. It can be the same for dev and test envs too. To enable parallel development, the approach you show in this video can be used, but giving the whl files different versions (in setup.py) so that the master package is not overwritten (i.e. not updated by a different developer) in the dev/test envs. So while, as you explain here with the Repos import-module functionality, many developers can make changes and test them in the workspace notebook, the prod env can run a fixed version (as a regular whl file) and control the prod version of the whl file. I did a similar process and it works fine.
    Best
    Akif
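
    A minimal sketch of the versioning idea above, assuming a setuptools-based setup.py (the package name and version scheme are illustrative):

        # setup.py -- give each dev/test build its own version so the shared wheel in S3
        # is never overwritten; prod pins one released version when the cluster spins up.
        from setuptools import setup, find_packages

        setup(
            name="my_transforms",        # hypothetical package name
            version="0.3.1.dev2",        # dev/test builds bump the .devN suffix; prod uses 0.3.1
            packages=find_packages(),
        )

        # Built with e.g. `python setup.py bdist_wheel` (or `python -m build`); the resulting
        # dist/*.whl is then uploaded to S3 and referenced by version at cluster start-up.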

  • @sumukhghodke7566 · 2 years ago

    Thanks for this perspective on using packages in notebooks. The dbx init command creates a package structure, but there is no clear documentation on how the package can be built/used within notebooks. It makes more sense now.

  • @briancuster7355 · 2 years ago

    This is really cool. Thanks for showing this.

  • @julsgranados6861 · 2 years ago · +1

    Thanks Simon!!!!! Just a great video, great content.

  • @almarey5533 · 2 years ago

    This was awesome, thanks! Definitely thinking I want to change to use this feature rather than the wheels

  • @taglud · 2 years ago

    I really love it; it solves our questions around simplifying code.

  • @aradhanachaturvedi3352 · 2 months ago

    Hi, I have one question: how can I expose my Databricks notebook as an endpoint for front-end applications? I think creating workflows and running a job will take some time to produce the output, and we want something in real time. Can you suggest anything?

    • @AdvancingAnalytics · 1 month ago

      The only way to expose "a notebook" is by having your app call the Jobs API and trigger that notebook - it can take input parameters, provide outputs etc., so it would mimic a web service, but you won't get over the latency problem of calling a Spark job. If you are trying to return the results of a SQL query, use the SQL endpoint & Serverless instead; if you're trying to do inference with machine learning, then use model serving endpoints. Those are pretty much your options!
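
      A rough sketch of that Jobs API route (the workspace URL, token and job_id are placeholders, and the latency caveat above still applies):

          import requests

          HOST = "https://<workspace-instance>"                          # placeholder workspace URL
          HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # placeholder credential

          # Trigger an existing job that wraps the notebook, passing parameters like a request body.
          run = requests.post(
              f"{HOST}/api/2.1/jobs/run-now",
              headers=HEADERS,
              json={"job_id": 123, "notebook_params": {"customer_id": "42"}},  # illustrative values
          ).json()

          # Poll the run until it finishes, then read whatever the notebook returned.
          status = requests.get(
              f"{HOST}/api/2.1/jobs/runs/get",
              headers=HEADERS,
              params={"run_id": run["run_id"]},
          ).json()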

  • @marcocaviezel2672 · 2 years ago

    Coool!
    Thanks for this great video!

  • @deepanjandatta4622 · 1 year ago

    Hi, in case I create a FunctionTest notebook which is outside of the DBX Repos and comes from a different repo altogether, can I import using "from Library.hydr8.audit.lineage import *" and call addLineage()? (Using the 2nd approach you showed?)
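
    A common workaround for importing across repos (not from the video; the checkout path below is a placeholder and the module path is the commenter's) is to put the other repo's root on sys.path first:

        import sys
        sys.path.append("/Workspace/Repos/<user>/<other-repo>")   # placeholder checkout path

        # The package then resolves like any other module on the path:
        from Library.hydr8.audit.lineage import addLineage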

  • @dankepovoa · 1 year ago

    Niiice!!! But what if I need to use an exclusive Databricks command, like "dbutils", in a function of this library? A plain .py file wouldn't run it; what would be another option?
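
    One commonly used pattern for that (not covered in the video; treat it as a sketch) is to resolve dbutils inside the module rather than relying on the notebook global:

        def get_dbutils(spark):
            """Return a dbutils handle when running on Databricks, from a plain .py module."""
            try:
                from pyspark.dbutils import DBUtils   # available on Databricks clusters
                return DBUtils(spark)
            except ImportError:
                import IPython                        # fall back to the notebook-injected object
                return IPython.get_ipython().user_ns["dbutils"]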

  • @gass8 · 2 years ago

    Hi, nice video.
    Would like to know if there is any way to do the same in R?
    I want to version my developed lib and reference it in the script from the repository. The material for R is very sparse.

  • @chandandey572 · 11 months ago

    @AdvancingAnalytics Hi Simon, I finally found a tutorial related to my use case. Thanks a lot for this. I have a query though, if you can please help me with it. I have an application running on a Linux server which has shell scripts, a conda env setup, a pip_req.txt, and Python files for the ETL process. We have SQLite as well for metadata management. If I have to move to Databricks with minimum code changes, how should I design this? I mean, in Databricks, what should be the alternative for the shell script calls, the SQLite db and the conda env setup, or will they work as they are?

  • @chobblegobbler6671 · 1 year ago

    Hey Simon! How does this work for a job cluster when launching Databricks operators from Airflow?

  • @penter1992 · 2 years ago

    If you want to use Poetry as the dependency manager, how do you solve this? How do you install the dependencies with this Repos approach?
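
    One possible approach (not from the video; the paths are illustrative, and older Poetry versions bundle the export command while newer ones need the export plugin): keep dependencies in pyproject.toml, have Poetry export them to a requirements file committed to the repo, and install that on the cluster or notebook, since importing straight from Repos skips the wheel's dependency resolution:

        # On the build machine:
        #   poetry export -f requirements.txt --output requirements.txt --without-hashes
        #
        # In the Databricks notebook (illustrative path):
        #   %pip install -r /Workspace/Repos/<user>/<repo>/requirements.txt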

  • @penchalaiahnarakatla9396 · 2 years ago

    Good video. How do you read a Delta table inside a UDF? Please suggest.

  • @MicrosoftFabric · 2 years ago

    Thanks Simon for sharing your knowledge. A question around managing code for multiple entities: would you create multiple git repos, one per Databricks entity, or one repo with all the entities in a databricks cicd-labs sort of folder structure?

    • @AdvancingAnalytics · 2 years ago

      Great question. This is a big debate: mono vs multi repo. We always prefer multi repo to make CI/CD less complicated.

  • @darryll127 · 2 years ago

    Simon, would this change your thinking / architecture for Hydro to put Config metadata in JSON files and not bother with a database at all?

    • @AdvancingAnalytics · 2 years ago

      Nope, not on its own. It's certainly convenient to reference config directly, but you still need something to manage searching, queueing etc. Definitely a nice pattern for local config, but I'm not convinced I want production systems relying on having the right branch synced to a repo!

    • @darryll127 · 2 years ago · +1

      @@AdvancingAnalytics I understand what you're saying, but how is that any different from making sure production is built and deployed from the right branch?
      Furthermore, it seems to me it would eliminate complexity and artifacts that could be rationalized as driven by legacy limitations.

    • @AdvancingAnalytics · 2 years ago · +1

      @@darryll127 very true, much of it is "it feels wrong" not "it is wrong". As long as we can still factor in the relevant code quality checks, linting/formatting, syntax checks, testing etc, then there's no real reason that it's "bad".
      From a wider architecture perspective, keeping metadata purely in the repo means it's not accessible to other tools, so orchestration planes wouldn't see it, for example. If you're building an entirely Databricks-based architecture, it might be suitable?

    • @darryll127 · 2 years ago · +1

      @@AdvancingAnalytics You raise a good point; however, I could see a process whereby you have the option in your framework to read the JSON and store it in Delta tables (taking advantage of things like Auto Loader, schema evolution and complex types) and then take advantage of the Delta readers, which are not dependent on Databricks per se, as a means of exposing the config data to external environments.
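
      For illustration, a sketch of that flow (the paths and table name are assumptions): land the JSON config with Auto Loader into a Delta table that external Delta readers can also query:

          # Runs in a Databricks notebook/job where `spark` is provided.
          config_stream = (
              spark.readStream.format("cloudFiles")                        # Auto Loader
              .option("cloudFiles.format", "json")
              .option("cloudFiles.schemaLocation", "/mnt/config/_schema")  # enables schema evolution
              .load("/mnt/config/json/")                                   # illustrative landing path
          )

          (config_stream.writeStream
              .option("checkpointLocation", "/mnt/config/_checkpoint")
              .trigger(availableNow=True)                                  # batch-style incremental load
              .toTable("metadata.pipeline_config"))                        # illustrative Delta table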

  • @allieubisse316 · 2 years ago

    Hello

  • @paulnilandbbq · 2 years ago

    Firstly, love the channel, Simon and team! I have been using %run "../shared/xxxxx" as a technique to consume functions from a notebook stored within a git repo. Are there any downsides to this option? - Many thanks!

    • @AdvancingAnalytics · 2 years ago · +1

      Great question. Wheels are transferable and testable. Notebooks are not as easy to test. Wheels give you a more robust deployment option.
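
      For illustration (the module and function names reuse the hypothetical sketch near the top), the practical difference is that %run only works inside a notebook session, while a module import can also be exercised by pytest on a build agent:

          # Notebook-only: %run "../shared/xxxxx" injects another notebook's functions into
          # this notebook's global scope, so they can't be imported or unit-tested off-cluster.

          # Repo module / wheel: the same import works in a notebook, in CI and locally.
          from my_library.transforms import add_load_date
          result = add_load_date(spark.range(10))   # `spark` provided by the notebook runtime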