DBT Core on Cloud Run Job

  • Published 2 Oct 2024

COMMENTS • 20

  • @bugtank
    @bugtank 5 days ago +1

    Great overview. I taught myself both dbt and Cloud Run, and I'm now configuring my Cloud Run dbt job and needed a bit of confirmation that it was the right way to go for me. This video helped with that and backed it up with the proper knowledge. You also avoided all the rabbit holes and tangents. I like your style. Subscribed.

  • @fbnz742
    @fbnz742 28 days ago +1

    Hi Richard, thank you so much for sharing this. This is exactly what I wanted. I have a few questions:
    1. Do you have any example of how to orchestrate it using Composer? I mean the DAG code.
    2. I am quite new to DBT. I used DBT Cloud before and I could run everything (upstream + downstream jobs) or just upstream, just downstream, etc. Can I do that using DBT Core + Cloud Run?
    3. This is quite off-topic for the video, but I wanted to ask: DBT Cloud offers a VERY nice visualization of the full chain of dependencies. Is there any way to get that outside of DBT Cloud?
    Thanks again!

    • @practicalgcp2780
      @practicalgcp2780  25 days ago +1

      No worries, happy it helped.
      DBT Cloud is indeed a very good platform for running DBT Core. There is nothing wrong with using DBT Cloud; in fact many businesses still use it today.
      However, that doesn't mean it's the only solution, or the best solution for every DBT use case. For example, DBT Cloud even today does not have a "deployment" concept: it calls out to a version-control SaaS for each run instead of using a local copy. Some caching has been implemented to keep runs going when that SaaS is down, but it's more of a workaround than a solution. This means that for mission-critical applications with a strict SLA, it may be better not to use DBT Cloud.
      Many companies also can't use a SaaS like this at all, due to data privacy concerns or the operating cost growing too high as the number of users grows. That leaves you with DBT Core.
      You are right that this approach does not give you an easy way to rerun a DBT job partially. However, I think you can do this via the Airflow DAG run parameters (the config JSON), which can be passed from the UI and then on to the Cloud Run job itself, giving you a way to handle ad-hoc administrative tasks (there is a rough DAG sketch below). The thing I like about Cloud Run, compared to the Kubernetes operators (which are another way to run DBT Core from Airflow), is that it's a Google SDK and serverless, which makes it much easier to test and control.
      If that isn't an option you like, I recently came across github.com/astronomer/astronomer-cosmos. I haven't tried it, but it looks quite promising. My concerns are mainly what happens when the DBT and Airflow versions change, how accurate the mapping is, and whether it has compatibility issues with the default Composer packages, which gave me a lot of trouble in the past and is why the k8s executor solution became more popular.
      One thing worth mentioning: in my view Composer seems to be heading in a serverless, decentralised direction, and it makes no sense to centralise everything on one single cluster anymore. That means running Cosmos on a dedicated cluster might be a better option. But again, I haven't tried it yet.
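      On the DAG code specifically, here is a minimal sketch of the Composer side, assuming the container is already deployed as a Cloud Run job called "dbt-job" and that the installed apache-airflow-providers-google package ships CloudRunExecuteJobOperator; the project, region, job name and dbt selector are placeholders, not anything from the video:

        from airflow import DAG
        from airflow.providers.google.cloud.operators.cloud_run import (
            CloudRunExecuteJobOperator,
        )
        import pendulum

        with DAG(
            dag_id="dbt_cloud_run",
            start_date=pendulum.datetime(2024, 10, 1, tz="UTC"),
            schedule="@daily",
            catchup=False,
        ) as dag:
            # Execute the pre-built dbt container as a Cloud Run job.
            # The args override changes what dbt runs without rebuilding the
            # image; a dag_run.conf value could be wired in here for ad-hoc
            # partial reruns if "overrides" is templated in your provider version.
            run_dbt = CloudRunExecuteJobOperator(
                task_id="run_dbt",
                project_id="PROJECT_ID",
                region="europe-west2",
                job_name="dbt-job",
                overrides={
                    "container_overrides": [
                        {"args": ["build", "--select", "tag:daily"]}
                    ]
                },
            )

      On your second question, selecting "my_model+" instead of a tag would run that model plus everything downstream, and "+my_model" everything upstream, so the upstream/downstream style of reruns you get in DBT Cloud is covered by dbt's own selector syntax.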

    • @fbnz742
      @fbnz742 23 days ago

      @@practicalgcp2780 Thanks for your reply!
      I honestly tried to set up DBT with Composer using another of your videos, but it looked sketchy to me, so I found the Cloud Run option to be way better.
      While looking into it, almost all the tutorials pointed me to Cosmos, which really looks like a great option. However, in my view it collides with Composer, meaning you can use one or the other, and Composer has the advantage of being fully hosted (plus it's what's used in my org :D so I can't really change that).
      On my 3rd question, I found that you can actually generate and serve dbt docs, but with Cloud Run I don't know exactly how to handle files that are generated on the filesystem. I believe this would be a great way to get HTML documentation. Did you try it by any chance, or handle files on the filesystem in general?
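      What I had in mind is roughly this (untested, and the bucket name is just a placeholder): generate the docs inside the Cloud Run job, then push the static files to a GCS bucket instead of relying on the job's ephemeral filesystem.

        # inside the Cloud Run job, after the models have run
        import subprocess
        from google.cloud import storage

        # dbt writes index.html, manifest.json and catalog.json into ./target
        subprocess.run(["dbt", "docs", "generate"], check=True)

        bucket = storage.Client().bucket("my-dbt-docs-bucket")  # placeholder
        for name in ("index.html", "manifest.json", "catalog.json"):
            bucket.blob(f"docs/{name}").upload_from_filename(f"target/{name}")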

  • @HARDselection
    @HARDselection 3 months ago +2

    As a member of a very small data team managing a complex orchestration workload, this is exactly what I was looking for. Thanks!

  • @adeolamorren2678
    @adeolamorren2678 2 months ago

    With this approach, is it possible to add environment variables that are isolated for each run? I basically want to pass environment variables per run when I invoke the Cloud Run job.

    • @practicalgcp2780
      @practicalgcp2780  2 months ago +2

      Environment variables are typically not designed to be changed at runtime for every run; they are normally set per environment and stick to a deployment, not to a run.
      But it looks like both options are possible. I would stick to passing command line arguments, because they are more appropriate to override per run than environment variables. This article explains how to do it well: chrlschn.medium.com/programmatically-invoke-cloud-run-jobs-with-runtime-overrides-4de96cbd158c
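      In Python the override call looks roughly like this; it's a sketch assuming the google-cloud-run client library (run_v2), and the project, region, job name, selector and env var are placeholders:

        from google.cloud import run_v2

        client = run_v2.JobsClient()
        request = run_v2.RunJobRequest(
            name="projects/PROJECT_ID/locations/europe-west2/jobs/dbt-job",
            # per-execution overrides: both args and env vars can be set, but
            # args are the more natural fit for "what should this run do"
            overrides=run_v2.RunJobRequest.Overrides(
                container_overrides=[
                    run_v2.RunJobRequest.Overrides.ContainerOverride(
                        args=["build", "--select", "my_model+"],
                        env=[run_v2.EnvVar(name="DBT_TARGET", value="prod")],
                    )
                ]
            ),
        )
        operation = client.run_job(request=request)
        operation.result()  # wait for the execution to finish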

  • @agss
    @agss 3 months ago +1

    Thank you for the very insightful video!
    What is your take on using Dataform instead of DBT, in terms of both tools' capabilities and the ease of deploying and managing them?

    • @practicalgcp2780
      @practicalgcp2780  3 months ago +2

      Thank you, and spot-on question, I was wondering who would ask this first 🙌 I am actually making a Dataform video in the background, but I don't want to publish it until I am 100% sure I am saying something useful.
      Based on my current findings, you could use either, and depending on what you need both can be a good fit. Dataform is a lot easier to get up and running, but it's quite new and I wouldn't recommend it for anything too critical at this stage. It is also missing some key features like Jinja templating (I don't really like the JavaScript templating system: it's built on TypeScript, which almost no one uses for this kind of work, so you'd be locked in to something with little support, which in my view is quite dangerous). But it is much easier to get up and running natively in GCP.
      DBT is still the go-to choice in my view, because it is built in Python and has a strong open source community. For mission-critical data modelling work, I still think DBT is much better.

    • @agss
      @agss 3 months ago

      @@practicalgcp2780 you brought up exactly what I was worrying about.
      I highly appreciate your insight!

    • @strmanlt
      @strmanlt 1 month ago +1

      Our team was debating migrating from dbt to Dataform. Dataform is actually a pretty decent tool, but the main issue for us was the 1000-node limit per repository. If you have fairly simple models that don't need a lot of nodes it would work fine, but for us long-term scalability was the deciding factor.

    • @practicalgcp2780
      @practicalgcp2780  1 month ago

      @@strmanlt Thanks for the input on this! Can I ask what the 1000-node limit you are referring to is? Could you share the docs on it? Is it a limit on the number of steps / SQL files you can write?

    • @fbnz742
      @fbnz742 28 days ago

      Just wanted to share my thoughts here: I used Dataform for an entire project and it worked quite well. My data model was not that complex, and I learned how to integrate its logs with Airflow, so I could set up alerts to Slack pointing at the log file of the failed job, etc. However, I agree that Dataform templating is very strange. I personally don't have expertise in JavaScript so I struggled with some things, but I was able to do pretty much everything I wanted. I also struggled to find answers on the internet, and DBT is the exact opposite: you can find tons of content online. I would go with DBT.

  • @adeolamorren2678
    @adeolamorren2678 2 months ago

    One separate question: since it's a serverless environment, if we have dependencies, should we add the dbt deps command to the Dockerfile args, or to the runtime override args?

    • @practicalgcp2780
      @practicalgcp2780  2 months ago

      No, I don't think that is the right way to do it. In a serverless environment you can still package up dependencies, and this is something you typically do at build time, not run time, i.e. while you are packaging the container in your CI pipeline (rough sketch below). DBT can generate a lock file which ensures the package versions stay consistent, so you don't end up with different versions each time you run the build. See docs.getdbt.com/reference/commands/deps
      The other reason you don't want to do it at run time is that installing dependencies on every run can be very slow, since they have to be downloaded each time, and in some setups you may not want internet access in a production environment for security reasons, so doing it at build time makes a lot more sense.
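      As a rough sketch of the build-time approach (the base image and the BigQuery adapter are just assumptions, swap in whatever your project uses):

        # dependencies are baked into the image by CI, not installed per run
        FROM python:3.11-slim

        RUN pip install --no-cache-dir dbt-bigquery

        WORKDIR /app
        COPY . .

        # resolve packages.yml once at build time; committing package-lock.yml
        # keeps the resolved versions identical between builds
        RUN dbt deps

        ENTRYPOINT ["dbt"]
        CMD ["build"]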

  • @10xApe
    @10xApe 4 months ago

    Can Cloud Run be used for the Power BI data refresh gateway?

    • @practicalgcp2780
      @practicalgcp2780  4 months ago

      I haven't used Power BI, so I googled what the data refresh gateway is. According to learn.microsoft.com/en-us/power-bi/connect-data/refresh-scheduled-refresh it looks like it's some sort of service that lets you control refreshes on a schedule? Unless it has some kind of API you can trigger from the Google Cloud ecosystem, I am not sure you can use it. I assume you are thinking of triggering a DBT job first and then refreshing the dashboard?