Implement SCD type 2 in data build tool (dbt) in 10 mins | Tutorial for beginners
Вставка
- Опубліковано 13 гру 2024
- In our next video in this dbt series we look at how easy it is to implement a Type 2 SCD in dbt on top of Snowflake.
In this series I'll be producing a series of short videos which are based are dbt's excellent, free training modules found here:
courses.getdbt...
❄️Want to SUPERCHARGE your career and become an EXPERT in Snowflake??❄️
Mastering Snowflake is accepting applications now to work with us in a small group. Serious inquiries only pls.
forms.gle/WBqa...
❄️Order my book: Mastering Snowflake Solutions HERE: ❄️
Amazon UK - www.amazon.co....
Amazon US - www.amazon.com...
Amazon AUS - www.amazon.com...
Amazon IND - www.amazon.in/...
❄️Get my Free SnowPro core guide HERE: ❄️
program.master...
❄️Become a student on my course: ❄️
Snowflake Practice Questions - SnowPro Core Certified Udemy Course www.udemy.com/...
❄️Connect with me on Linkedin:❄️
/ adammorton121
Normally I think of orders as a fact, and the changing status might be modelled as an accumulating snapshot. This makes querying duration in status very easy for reporting and BI tools. The source you have is a common format for an order history table. You may have just selected this because it was a clear example for this dbt functionality, but it is not a good example for dimensional modelling. Usually with type 2 scd's we are examining certain attributes of a dimension (Customer, product, sales channel for example could be dimensions that apply to orders) to see if they changed and making decisions to capture that change if it is important. Status is not really an attribute of an order dimension, status is a dimension related to an order fact.
I don’t disagree with anything you say here. The example was from dbt’s own free training which I reference in the video description. This was the only reason I worked with this particular example. Cheers
Thanks again Adam. This series has been a great tool for understanding the full functionality of dbt. In this example, you're putting your snapshot object in a separate schema. If this was an example focusing on traditional dimensions (Like one of the other commenters pointed out E.g. customers, products, date) you would still want your dimension table in the same schema, even though they are treated as "snapshot" objects in dbt. Am I missing something? Also, what if we only want to consider certain attributes of an entity when determining if a change needs to be made, for instance, we don't care if a customer's hair color changes (sorry, couldn't think of a better example). Is there a setting to exclude certain attributes?
Hey Rob, the short answers are yes and yes. Look up the check strategy for snapshots on the dbt docs for more details docs.getdbt.com/docs/build/snapshots
HI Adam,
So I am trying to create an SCD from a table that already contains changes say for example you have on row with the order status as shipped for order AAA and another for the same order where the status is delivered. When I use your method, on the second run, I get the error that I can only update one row per time, how can I create an SCD from this type of table using dbt.
Thanks Adam.
Would be nice if you could mention how easily the complexity arises when you are to actually join one more such tables to build a dimension.
Will be interesting to hear your thoughts on how you solve that part 🙏
Hi, thanks for your question. Of course there's a lot of factors which could influence the approach here. My initial strategy would be to consider using a view to join multiple tables together. This view then would act as the source to the Matillion job which populates the dimension as the target table.
I mean the join itself between those snapshot tables is quite complex and I was more looking for how would you solve it in dbt. We dont use Matillion.
I would like to create the dimension itself in dbt
Sure the same approach applies in dbt as it’s SQL based. So you can break the logic down into an intermediate step to with the join and then use that as a source for the dimension table. It would also allow for other downstream models to leverage the same intermediate table if required.
Hi adam, I have one query facing in my development, there is a duplicate records in snowflake database and when running snapshot in dbt getting error as duplicate records how to overcome that error while running dbt snapshot for model , please revert it help
It sounds like you probably need to add some logic to filter out the duplicates first
you mentioned about course structure which contains 10 modules, how to access that? can you please share ?
Hello the link to apply to join the Mastering Snowflake program is in the video description. Thank you!
How can we achieve a type 2 SCD in dbt if the data source has a historical data? The historical data source has records for each day.
There’s a few ways you can do this depending on your specific requirements. Recently I used a incremental materialization.
This included a delete + insert strategy.
Creating a hash value of the primary key and the date updated field allowed us to identify the changes in the historical records as well as defining the row end date.
If you’d like a video on this let me know as I’m sure many ppl have the same challenge.
@@mastering_snowflake I would love to have a video on this ☺
Adam can you please tell me about how to pull the data from mysql and load into snowflake usig dbt tool
You need to get your data into Snowflake first before you can use dbt
Lesch Junctions