I've read the whole "The data warehouse toolkit" book by Kimball twice and this video explains the most important part of the 500 page book as clear as possible. Well done!
Thanks for this video. I have a few questions: 1) in the star schema, we won't do any joins between the dimension tables right, but why did you create a table by joining all the dimension tables to flatten all the dimensions in a single table? 2) since we are creating the mart tables by joining with other tables, How these tables get refreshed because those are not views or materialized views?
@@KahanDataSolutions Hi, it would be even more helpful if u could make a video about tutorial on how to add time and date dimension in star or snowflake schema
Thanks! Great and simple overview for someone who is pretty new to this. What I would recommend is to explain pitfalls like - how to load dimensions for values there to be unique, why you have to care about having unique values in dimensions, what will happen if you left join to the dimension which has duplicate records.
Thank you for the great and to the point presentation, especially this is great that you showed a simple live code to execute it since most tutorials just repeat the books and present the theory and don't show how to code it. It would be great though to cover SCD implementation in the dimension tables and what would happen to the natural keys when we will have to use surrogate keys for the dimension and fact tables, because in almost all real-world DWH examples there is always need for keeping the history of dimensions inside the dim tables, which of course the OLTP primary keys will not be applicable.
Hi, love your channel, I'm learning a lot. What are your thoughts on Star-schema vs One Big Table (OBT)? Would you make a video comparing pros and cons of each other?
Hey, thanks for the video! It gives a good overview how to model a star schema. But how could new staging data be ingested in the tables of a star schema? For example, an easy but inefficient approach would be to create the tables on a daily basis. But to be more optimal, you would need a process to ingest new data into the tables. Do you have an idea how that could be done in modern warehouses like Snowflake? Or some resources on that? I think it would be helpful to add some technical columns to the raw data layer, like a load timestamp, to track the records that need to be ingested. Furthermore, a valid_to and valid_from timestamp in dimension tables could be added where changes can occur (changed address of a customer).
Typo on the chapter header #7. Great video tho on a hard to teach and under-covered topic. I strongly agree about the usability case for Star schemas even in a modern stack. It creates a highly ordered, easy to reason about, junction point between the chaos of sources, ingestion, data lakes, etc and the complexity of ~infinite data consumer use cases. The payoff in downstream development efficiency is huge.
Ah, dang. Good catch on the typo. Unfortunately I can't edit that part after posting. Really appreciate your feedback on the video too. As you can probably tell, I'm on the same page as you and think it's still a great strategy.
Is it necessary to join in orders and customers in the transaction fact table when the transaction table already has customer_id and order_id assuming the source tables have referential integrity constraints?
Hi, could you kindly make a video on how to load the fact table incrementally? What if we have an OLTP system and the dimension tables get big really quickly
great video, i'm doing almost the same thing but i'm using views for marts instead of tables because views are updated automatically i would rather use tables instead of views but creating a table from select statement doesn't update automatically, is there a way to do it without using triggers?
Thanks for the video. Quick Question: This might be subjective and vary from business problem to problem but is it a good practice to create Fact Tables before Dimension? I mean isn't it necessary to understand what dimensions exist and what are its keys before creating a Fact Table.
How did you insert json data into snowflake table? Is it a direct load for m local using file format? I am using apache nifi to insert json data into table from azure blob.(I was unsuccessful to insert json object so i was convert json data to string then parse json into another column in dbt SQL script) later work on? Let me know is there way to insert json object data directly into SNOWFLAKE table have column with data type VARIANT.
Great! videos as always ... is there a way though to edit or re-create some of your videos including this where you would zoom in/out parts of it specially when highlighting stuff 🙂
@Kahan Data Solutions, I wish you covered the concepts of Surrogate Keys with SCD Type 2,, while in this video you have conveniently skipped that and made it look like it is a simple task,, by joining multiple entities which Ralph Kimball strongly advocates to Avoid. I really want to see your approach for some of the most difficult questions, when there are many to many relationships in the real world.
Agreed, I was shocked to see he was apparently (?) using natural keys from the raw source as the dimensional keys, instead of of surrogates. This is a pretty basic no-no, and this model will break once it encounters common complications like "late arriving" dimensional data, need to combine data from multiple sources (either parallel systems, or migrations across source systems over time) as well as SCD you bring up. As I type I see he does mention surrogate keys at the very end, but only as an *alternative* to the natural keys of the source system, not as standard practice. So I guess he would advocate using natural keys as dimension/fact foreign keys in some situations, then switch to surrogate keys only when there are dimensions without natural keys he likes, or (as you are pointing out) as soon as he needs to move beyond a Type 1 SCD for something like a customer or employee? Yuck. Just use surrogate keys consistently everywhere, as Kimball strongly advocates.
Great tutorial. I have trouble understanding the difference between data warehouse and database. The thing you made in this video is DWH, but what would be DB regarding this example?
Great question! This is something that took me a while to understand as well. The way I think about it is that a data warehouse is just a way to describe HOW you are using a database (or multiple databases). For example, in this scenario, the two databases I'm using are "RAW" and "ANALYTICS_100". And I'm creating new tables within the "ANALYTICS_100" database in a way that resembles what we would call a data warehouse design. But if you strip away the term data warehouse, it's still just a database with tables (or views). Using a star schema (facts & dimensions) is just one way to create your tables in an intentional way that's capable of operating in a way that we all agree to call a "data warehouse". I also just thought of this example - it's almost like saying you can build a physical "house" or an "office building" or an "apartment". They have different purposes and terminology but underneath it all they have the same components (windows, floors, roof, etc.). Just designed in different ways. Maybe not the best example but hopefully that helps!
@@KahanDataSolutions After numerous confusing articles on topic of "DB vs DWH", I think I finnaly got it. DWH is just bunch of tables within DB that are designed in specific way which supports analytics. Your example in this video and your comment made it crystal clear. Thank you!
Hi! I am enjoying your content as a new subscriber. I am a business analyst trying to become a data engineer, and would love to begin my own projects building warehouses and pipelines, etc. I have been researching about using Snowflake as an individual and how much it would realistically cost me to use it for a project (to showcase on my resume), and cannot get a clear answer. Hoping you can assist me with this. Thanks!
Great content bro! However, you haven’t done justice to the subject: a Data model is a representation of business objects, their corresponding attribute, the relationships amongst them and other business semantics - usually represented through an Entity-relationship diagram (ERD) and, sometimes, class diagrams. What you’ve done here isn’t Data Modeling. It is, however, an explanation of how a Data Model is used/leveraged. Great resource, nonetheless.
ERDs are used as a modeling tool in both OLTP and OLAP systems, but their specific application and usage differ between the two. In OLTP, ERDs are used for database design and representation of operational data structures, while in OLAP, ERDs provide a conceptual foundation for building the data model used for analytical processing and data warehousing.
Looking for help with your team's data strategy? → www.kahandatasolutions.com
Looking to improve your data engineering skillset?→ bit.ly/more-kds
Thank you for posting this video. It is a great video.
I've read the whole "The data warehouse toolkit" book by Kimball twice and this video explains the most important part of the 500 page book as clear as possible.
Well done!
Love to hear that! Thanks for your feedback
So if someone watched the vidio its not nesessary to read that 500 book thats quite optimistic statement.
This is a very clear explanation with a precise example of the star schema. Thanks a lot for the good video.
Please make a video like this on Conformed Dimensions please!
Also how do you handle the lack of primary key and foreign key constraints in Snowflake?
I would agree with joshi. It would be cool for a future video to see how take data from conformed layer then create a star schema using dbt.
Noted! Thanks for the feedback
What’s conformed layer?
Thanks for this video. I have a few questions:
1) in the star schema, we won't do any joins between the dimension tables right, but why did you create a table by joining all the dimension tables to flatten all the dimensions in a single table?
2) since we are creating the mart tables by joining with other tables, How these tables get refreshed because those are not views or materialized views?
Brilliant video
So helpful in basic understanding of star schema design
Keep up the great work bro
Glad it helped!
@@KahanDataSolutions Hi, it would be even more helpful if u could make a video about tutorial on how to add time and date dimension in star or snowflake schema
Thanks! Great and simple overview for someone who is pretty new to this. What I would recommend is to explain pitfalls like - how to load dimensions for values there to be unique, why you have to care about having unique values in dimensions, what will happen if you left join to the dimension which has duplicate records.
This can help beginners to avoid mistakes on early stages
Thank you for the great and to the point presentation, especially this is great that you showed a simple live code to execute it since most tutorials just repeat the books and present the theory and don't show how to code it. It would be great though to cover SCD implementation in the dimension tables and what would happen to the natural keys when we will have to use surrogate keys for the dimension and fact tables, because in almost all real-world DWH examples there is always need for keeping the history of dimensions inside the dim tables, which of course the OLTP primary keys will not be applicable.
Good in-depth data engineering video for professionals!
Such a good explanation !
Very succinct, and practical.
Great video!
Thank you!
Such a cool data concept man! Thanks for introducing me to it. Cheers!
Hi, love your channel, I'm learning a lot. What are your thoughts on Star-schema vs One Big Table (OBT)? Would you make a video comparing pros and cons of each other?
Really great video!!!! Thank you!!! Hope to see new video concerning SCD, indexing!!!!
Thanks for watching!
Not related to the topic.. But, @Kahan, what do you think about Meltano? Will you add it to your Modern Data Stack?
Wow..very beautifully explained. Loved it ❤
Makes me explore more of your content.
Thank you.
Hey, thanks for the video! It gives a good overview how to model a star schema. But how could new staging data be ingested in the tables of a star schema? For example, an easy but inefficient approach would be to create the tables on a daily basis. But to be more optimal, you would need a process to ingest new data into the tables. Do you have an idea how that could be done in modern warehouses like Snowflake? Or some resources on that? I think it would be helpful to add some technical columns to the raw data layer, like a load timestamp, to track the records that need to be ingested. Furthermore, a valid_to and valid_from timestamp in dimension tables could be added where changes can occur (changed address of a customer).
Great tutorial, Thanks for putting this together.
You're very welcome! Hope it was helpful.
Typo on the chapter header #7. Great video tho on a hard to teach and under-covered topic. I strongly agree about the usability case for Star schemas even in a modern stack. It creates a highly ordered, easy to reason about, junction point between the chaos of sources, ingestion, data lakes, etc and the complexity of ~infinite data consumer use cases. The payoff in downstream development efficiency is huge.
Ah, dang. Good catch on the typo. Unfortunately I can't edit that part after posting.
Really appreciate your feedback on the video too. As you can probably tell, I'm on the same page as you and think it's still a great strategy.
@Kahan What tool are you using to show JSON along with other sources while querying? Thanks much.
Very crisp !!
Great video. One recommendation would be to change the font color of the text that is commented out, it is nearly impossible to read.
Ah good catch. Sorry about that but will take note for next time!
Is it necessary to join in orders and customers in the transaction fact table when the transaction table already has customer_id and order_id assuming the source tables have referential integrity constraints?
question is would you even need this in DWH such as gcp bq - since its already uses dremel architecture.
This is fantastic - keep it up.
Thanks, will do!
Hi, could you kindly make a video on how to load the fact table incrementally? What if we have an OLTP system and the dimension tables get big really quickly
great video, i'm doing almost the same thing but i'm using views for marts instead of tables because views are updated automatically
i would rather use tables instead of views but creating a table from select statement doesn't update automatically, is there a way to do it without using triggers?
Million thanks
Excellent
Thanks for the video. Quick Question: This might be subjective and vary from business problem to problem but is it a good practice to create Fact Tables before Dimension? I mean isn't it necessary to understand what dimensions exist and what are its keys before creating a Fact Table.
First Create dimension tables then go for fact table
Was planning to commenting same, you should create first dimension tables as fact table is pointing to them using fk
Hi,
I loved your video.
Is there anyway to get any sample databse with raw tables just to try and practice put it together on our own?
Take one dataset and create own dimension and fact tables
How did you insert json data into snowflake table? Is it a direct load for m local using file format? I am using apache nifi to insert json data into table from azure blob.(I was unsuccessful to insert json object so i was convert json data to string then parse json into another column in dbt SQL script) later work on? Let me know is there way to insert json object data directly into SNOWFLAKE table have column with data type VARIANT.
Great! videos as always ... is there a way though to edit or re-create some of your videos including this where you would zoom in/out parts of it specially when highlighting stuff 🙂
Thanks Jeff! Regarding the editing - unfortunately there's not much that can be done after it's initially uploaded (other than minor trimming).
@Kahan Data Solutions, I wish you covered the concepts of Surrogate Keys with SCD Type 2,, while in this video you have conveniently skipped that and made it look like it is a simple task,, by joining multiple entities which Ralph Kimball strongly advocates to Avoid. I really want to see your approach for some of the most difficult questions, when there are many to many relationships in the real world.
Agreed, I was shocked to see he was apparently (?) using natural keys from the raw source as the dimensional keys, instead of of surrogates. This is a pretty basic no-no, and this model will break once it encounters common complications like "late arriving" dimensional data, need to combine data from multiple sources (either parallel systems, or migrations across source systems over time) as well as SCD you bring up. As I type I see he does mention surrogate keys at the very end, but only as an *alternative* to the natural keys of the source system, not as standard practice. So I guess he would advocate using natural keys as dimension/fact foreign keys in some situations, then switch to surrogate keys only when there are dimensions without natural keys he likes, or (as you are pointing out) as soon as he needs to move beyond a Type 1 SCD for something like a customer or employee? Yuck. Just use surrogate keys consistently everywhere, as Kimball strongly advocates.
Great tutorial. I have trouble understanding the difference between data warehouse and database. The thing you made in this video is DWH, but what would be DB regarding this example?
Great question! This is something that took me a while to understand as well.
The way I think about it is that a data warehouse is just a way to describe HOW you are using a database (or multiple databases).
For example, in this scenario, the two databases I'm using are "RAW" and "ANALYTICS_100". And I'm creating new tables within the "ANALYTICS_100" database in a way that resembles what we would call a data warehouse design. But if you strip away the term data warehouse, it's still just a database with tables (or views).
Using a star schema (facts & dimensions) is just one way to create your tables in an intentional way that's capable of operating in a way that we all agree to call a "data warehouse".
I also just thought of this example - it's almost like saying you can build a physical "house" or an "office building" or an "apartment". They have different purposes and terminology but underneath it all they have the same components (windows, floors, roof, etc.). Just designed in different ways.
Maybe not the best example but hopefully that helps!
@@KahanDataSolutions After numerous confusing articles on topic of "DB vs DWH", I think I finnaly got it. DWH is just bunch of tables within DB that are designed in specific way which supports analytics.
Your example in this video and your comment made it crystal clear. Thank you!
@@mitchconnor7066 You got it!
Hi! I am enjoying your content as a new subscriber. I am a business analyst trying to become a data engineer, and would love to begin my own projects building warehouses and pipelines, etc. I have been researching about using Snowflake as an individual and how much it would realistically cost me to use it for a project (to showcase on my resume), and cannot get a clear answer. Hoping you can assist me with this. Thanks!
You are great
Great content bro! However, you haven’t done justice to the subject: a Data model is a representation of business objects, their corresponding attribute, the relationships amongst them and other business semantics - usually represented through an Entity-relationship diagram (ERD) and, sometimes, class diagrams.
What you’ve done here isn’t Data Modeling. It is, however, an explanation of how a Data Model is used/leveraged.
Great resource, nonetheless.
ERDs are used as a modeling tool in both OLTP and OLAP systems, but their specific application and usage differ between the two. In OLTP, ERDs are used for database design and representation of operational data structures, while in OLAP, ERDs provide a conceptual foundation for building the data model used for analytical processing and data warehousing.
Crap, I thought it was how to date a model...
Classic
Too much bass in youre voice in the recordings
Bro what does this have to do with the information 😂
@@ktg8742 I guess I should ask. Can you turn the bass down a bit in your voice. It's hard to focus on the listening experience is all.