Please upload more of these long form videos that go in depth
Wow. What a video. Randomly popped in feed and never completed a random 40 min video in one go.
Glad you liked it!
@@EcZachly_ loved it. Planning to take your course soon. Need to brush up the prerequisites:)
Literally this is the ONLY video on UA-cam on how to data model with a hands-on, real-world example. All other videos talk about concepts, which sounds very abstract.
BTW, this guy seems high but how come he can be this sharp at the same time? amazing
When he is high, he chooses the right data model. When he is not, he simply dumps anything in a DB and asks analytics to figure it out
This isn't the only video like this, just the only video YOU found. LOL!
He has ADHD
This is one of the best dimensional data modelling primers out there. Using it for my Meta DE onsite, and it has helped me clarify some concepts that I thought I "kind of" understood; with this video I now clearly see the reasoning and the real applications. Thank you!
😮 amazed by this OBT concept. I am definitely going to use this. Thanks Zach for this. You are an amazing engineer who teaches us real world scenarios 😍
Glad you were able to get a version out with sound. Thanks.
Thanks! I felt so bad that the stream was silent
@@EcZachly_no worries. You’re doing a wonderful job for the DE community. Much appreciated
Hi Zach, please keep making these, your videos are incredibly insightful.
I'm in the middle of the video and just had to stop it and come here to say: Wow! This is one of the best DE videos on UA-cam.
Once again you killed it Zach! You just made my data engineering journey easy.
Amazinggg…. Very in-depth, kept rewinding for parts I didn't understand…. Got to know about this from your Twitter page… Please more of this🚀
Bro, when prod data isn't modeled correctly, asking a data engineer to address all data quality issues is like asking your plumber to get you clean water. That all said, the right integration of ETL Process, Data Quality, and Business metadata (and competent boss that can support you in doing so) can help solve many DQ problems. Populating that metadata repo requires lots of end user interaction and, if your company is mature, data governance.
Wow, thanks for sharing your knowledge. I feel one day I will be investing in one of your courses when I get better with data analysis. Thanks Zach! Keep up the value and content. You're the best DE I've seen!
Zach, I'm learning so much. Your videos are amazing, please add more for us, the ones that can not afford to pay your course 🙏🙏
My course is
@@EcZachly_ Sorry Zach, I admire your content, but not everybody lives in the US and has a US-dollar salary. The first thing I see in the course is a one-time payment of $2000, which is impossible for people who have a non-US salary and kids and a house to pay for. Then there is the $125/month (billed annually), which is still $1500 and pretty impossible too. Then comes the $200 per month, which is more affordable, but there's just no way to consume all that content within a month when you have a full-time job, and kids, and a life, so it's $200 for maybe 3 months = $600, which is... well... hard to pay.
@@mayravaldes89 If you use code EARLYBIRD you can make it 30% off that $200. So it'd be $140/month.
No one can explain it better than you. Great work 🥳
Thank you very much Zach for explaining these concepts with real life examples. This content is gold mine and extremely helpful. Please keep making more and more such informative and insightful videos 😃
Very insightful content! Especially the examples from your work like Facebook really help drill in the use of the various functions. Really appreciate the content :)
Thanks, Zach! Your content has really helped me gain a better understanding of data modeling.
Don't know if it matters much but even as a person in data world for over a decade, I still learn something new every time I read your blog or listen to your videos. Great work, as always
Wow thank you this was super digestible and really helped me get a deeper understanding of modeling. This is really going to help me on the data science side of things.
Wonderful content, Zach! God bless you!
An absolute gem of a video!
This is the exact video I am looking for! kudos for sharing great content.
One Big Table (OBT)
Pros: Simple querying, no joins needed.
Cons: High storage cost, data redundancy, slower on large datasets.
Kimball (Star Schema)
Pros: Fast querying, optimized for analytics, minimal joins.
Cons: Requires initial design, moderate complexity when scaling.
Relational Modeling (3NF)
Pros: Efficient storage, reduces redundancy, good for write-heavy tasks.
Cons: Complex joins, slower for complex read queries.
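To make the OBT-vs-Kimball contrast above concrete, here's a minimal sketch with illustrative table names (assumptions on my part, not the video's exact schema):

-- OBT: one row per post, likes carried on the row as an array; no join
SELECT post_id,
       CARDINALITY(likes) AS total_likes
FROM posts_obt;

-- Star schema: a fact table joined to a dimension, then aggregated
SELECT d.post_id,
       COUNT(*) AS total_likes
FROM fact_post_actions f
JOIN dim_posts d ON f.post_id = d.post_id
WHERE f.action = 'like'
GROUP BY d.post_id;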
Hi Zach, thank you for sharing your real-world experience. I have a question: what is production data and what is events data? What's the difference? I couldn't get a clear understanding. Please give me some examples.
41 mins of data modeling concepts, absolutely well explained.
😮💨
Amazing class, thank you Zach
This video is priceless, thank you for sharing
Thanks Zach for amazing content
Thank you so much for sharing such excellent video!!!
hi Zach, I have been working with SCDs for a long time, but I didn't understand the partition_date column in your example. Is it the date of data ingestion into the SCD table or something else?
For finding the number of deleted posts in your example, why would someone use a partition_date WHERE clause? Once a post is deleted it can't be undeleted. So why not simply: SELECT COUNT(*) FROM table WHERE text = 'Sql_sucks' AND is_deleted = TRUE.
And if I want to know how many posts were deleted on a specific day, why would I again use partition_date? I would leverage the start_date and end_date columns:
SELECT COUNT(*) FROM table WHERE text = 'Sql_sucks' AND is_deleted = TRUE AND start_date = current_date. That is the purpose of keeping these start_date and end_date columns in the first place.
Data lakes pull the data forward though.
Partition_date here still has all the data since SCD type 2 is cumulative.
@@EcZachly_ Thanks, this clarifies my concern. So the SCD will still de-duplicate the data but hold the last partition_date of that record.
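For anyone following along, a minimal sketch of what that read looks like in practice (table and column names are assumed from the video, not an exact schema). Because the SCD type 2 table is cumulative, every partition_date holds the full history, so you pin exactly one run:

SELECT COUNT(*)
FROM posts_scd
WHERE partition_date = CURRENT_DATE    -- pick one run of the cumulative pipeline
  AND text = 'Sql_sucks'
  AND is_deleted = TRUE;

Without the partition_date filter you would count every post once per daily run.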
Great content thank you Zach! Is there a reason why it cuts out at the end seemingly abruptly?
His mention of fixing normalization in the relational layer is crucial.
Awesome explanations. Thanks!
thanks for the great content Zach!
This "sql sucks" example hits hard bruv.
Zach- one of the best tutorials I have seen giving a real world example.
Quick question: 1) What does the partition_date for the posts_obt signify? Is it the post creation date?
2) At 33:48, why do we have a filter in the query for partition_date? Here we are simply trying to find the number of likes on a given post (using post_id). I can't think of the role of a date column (the same column as in my first question).
1. No. Partition date is the date of the run of the pipeline
2. You need to pick a version of your OBT. Is it today's or yesterday's? You don't overwrite here since that's risky. Keeping 5-6 days of OBT data is advised.
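A minimal sketch of what "picking a version" means, assuming a posts_obt with an array column of like events (illustrative names, not the video's exact schema):

SELECT post_id,
       CARDINALITY(likes) AS total_likes   -- no join needed, the array lives on the row
FROM posts_obt
WHERE partition_date = CURRENT_DATE - INTERVAL '1' DAY  -- yesterday's version, in case today's run failed
  AND post_id = 12345;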
What is the point of saving partition_date in Kimball? It is going to be equal to either the start_date (if the record is not current) or the end_date (if it is).
Love your videos man ! Can you please give me your insights / point of view regarding Event Driven Architecture?
thanks for this, zach
The case against OBT is that you'd have to deal with complex data types, while the Kimball model uses simple data types, though it requires full table scans. At Meta, designing in OBT is a very common approach. You can also highlight the datelist fields, which power recency metrics.
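For readers who haven't seen datelist fields, a hedged sketch of the idea with illustrative names (this is not Meta's actual schema): the user's activity history is stored as an array of dates on the row, so recency metrics need no fact-table scan:

SELECT user_id,
       CARDINALITY(
         FILTER(activity_datelist, d -> d >= CURRENT_DATE - INTERVAL '7' DAY)
       ) AS active_days_l7   -- days active in the last week, computed from one row
FROM users_obt
WHERE partition_date = CURRENT_DATE;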
20:55 what if the post contains PII? Does Facebook anonymise post content as well?
Yep scrubs it of PII
Where does the is_deleted column come from? How is it generated? Is there a custom logic behind it or some tooling that takes care of it?
Hi Zach, clear and practical sharing! May I ask about transaction consistency across the snapshot tables? I wonder if the snapshots are in an inconsistent state when they are being selected and loaded into the data lake. For example, selecting table "A" at 5:00am and selecting table "B" at 5:05am after new updates: a join of "A" and "B" would give inconsistent results.
Big tech generally doesn't care about that. The way to avoid it is to minimize the time between related table snapshots by dumping them in a single run.
Hey Zach! I really liked the video, really nice content that you are sharing with us here! At 33:49, maybe there could be a scenario where you would have to consider reverted likes. In that case, would adding a CASE WHEN expression subtracting 1 on 's' be a solution for getting the total likes on the deleted post?
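A minimal sketch of the idea described above, with illustrative table and column names (not the video's exact query): net each unlike against a like with a signed CASE expression:

SELECT post_id,
       SUM(CASE WHEN action = 'like'   THEN 1
                WHEN action = 'unlike' THEN -1
                ELSE 0 END) AS net_likes   -- reverted likes cancel out
FROM post_actions
GROUP BY post_id;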
Why do you have field partition_date in user_scd and posts_scd?
thank you for the content, bring more Kimball please!
Thanks for the informative video on such an important (more so now than ever) topic, Zach! Question: why not just keep only the posts_obt table and upsert it when there are updates? Then you don't need the join.
Love the video. Hits some important concepts. I wonder how you consider your dimensional models when using a semantic layer like LookML?
Really all it comes down to is that all LookML views should be select * from dim/fact tables.
Inspiring content! What do you think about Data Vault?
I just love your videos. Are there more in the funnel?
This is fantastic content. Do you cover OBT in your data engineering boot camp? Where can I find the lab and lab demonstration you referred to in the video?
We spend a week on SCDs and OBT yep!
Not sure how I got lucky and your video was served up, but instant sub ❤❤ do you have any videos or presentations for folks who want to support data maturation in their org but are not DEs?
great content, thank you!
Hey! Great content as always, thank you very much ❤
So you and the community use the term "relational data model" to implicitly mean an "at least 3NF-compliant relational data model", right? (Strictly speaking, both dimensional models and OBTs are relational models too, right?)
23:30 instead of snapshots why not do change data capture?
CDC isn't as reliable
@@EcZachly_ and it sounds like most companies don't need the added granularity…
@@EcZachly_ ? care to explain
I liked the video very much. It gave me a complete understanding of a denormalised table. I have one query: let's say I have to create a big denormalised table (not a fact table) that contains some thousands of columns and is created by joining 80 tables (30 large tables and 50 very small lookups). What is the best way to make it real-time?
If these aren’t facts, you should use CDC to make the mutations real time. The lookup tables are easy side inputs
this was super helpful, learned a lot
Glad you liked it! Any other topics you think I could cover?
@@EcZachly_ a video like this around big data would be very helpful, as there are so many tools and approaches out there that it's a bit difficult to enter the big data space
Amazing content.
Really appreciate your content man
Great content, Zach. Thanks. One question: why can't (or shouldn't) we do CDC on the source relational tables (maybe via a changed-datetime stamp) to bring in only new and updated source records, rather than daily snapshots, to build the SCD type 2 in the DW layer?
Every big tech company does daily snapshots. It’s duplicative but much simpler than CDC
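For contrast, a minimal sketch of the daily-snapshot pattern being described (schema and table names are assumptions, not anyone's production setup): each run dumps the whole source table stamped with the run date:

INSERT INTO dw.users_snapshot
SELECT u.*,
       CURRENT_DATE AS partition_date   -- stamp the dump with the run date
FROM prod_replica.users u;

Duplicative, as noted, but there's no log parsing, no ordering concerns, and a bad run is fixed by re-running one day.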
@36:28 is it a typo to have s.is_current = TRUE and s.is_deleted = TRUE?
Nah. is_current means it's the most recent state; is_deleted = TRUE on that record means the post is deleted in its most recent state.
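A minimal sketch of that 36:28-style filter, with names assumed from the video rather than copied from it:

SELECT COUNT(*)
FROM posts_scd s
WHERE s.partition_date = CURRENT_DATE  -- one run of the cumulative table
  AND s.is_current = TRUE              -- the latest version of each post...
  AND s.is_deleted = TRUE;             -- ...where that latest version is deleted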
Hi Zach, great content! I am wondering how you are appending data to the OBT. Are you using UPSERT or APPEND mode on your table? Is it on a daily or hourly basis?
Depends on the use case
Great talk! However, I have a couple of questions. Even though posts_scd will be comparably small, wouldn't its partitions grow infinitely? Or are there going to be rules to retain posts only for the last n periods? I was also wondering: what are the drawbacks of modelling posts as a dim table? Probably I'm missing something, but it reminds me more of a factless fact table.
Yeah you only keep the last week of partitions around. Correct!
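A hedged sketch of that retention rule (engine-dependent; on a table partitioned by partition_date this prunes whole partitions):

DELETE FROM posts_scd
WHERE partition_date < CURRENT_DATE - INTERVAL '7' DAY;  -- keep roughly the last week of runs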
Hey Zack… generally speaking, with the snapshot_ts on daily dimensions, do we need the same snapshot_ts on the fact tables as well, but keep all the partitions no more than a week? Thanks!!
Zach, what is your opinion on Bitemporal Modelling?
Hey Zach, thanks for your guidance, it was super interesting and now I fully understand the advantage of OBTs. Do you have a tip on how to translate this query into Snowflake?
They have similar array functions in snowflake. ChatGPT is your friend
@@EcZachly_ haha :D
chatgpt hates me and always produces GROUP BY queries but I will proceed :)
@@EcZachly_ yo, data engineer working with Snowflake here. It seems that Snowflake does not handle array map and reduce very well; it does not have built-in array functions for map and reduce.
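One hedged workaround sketch (Snowflake dialect; table and column names are assumed): un-nest the array with LATERAL FLATTEN and aggregate instead of reducing it in place:

SELECT post_id,
       COUNT_IF(f.value:action::string = 'like') AS total_likes  -- aggregation replaces the array reduce
FROM posts_obt,
     LATERAL FLATTEN(input => actions) f   -- one output row per array element
WHERE partition_date = CURRENT_DATE
GROUP BY post_id;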
Hey Zach! Great video, I really enjoyed it! In your experience, what's the best way to overcome the limit on maximum array size in some SQL engines? For example, in Amazon Athena (which uses Trino) the maximum array size is 254 (due to a Java limitation). In your example, a post could easily contain more actions than that. One simple approach would be to start a new array column every time you hit the limit, but I don't think that's very scalable. What's your opinion?
Trino doesn’t have a 254 limit. I used trino at Facebook. The limit is 65k I thought? I definitely used Trino and 365 item arrays at Facebook.
Maybe it’s an Athena thing?
3:55: Kimball
what is the difference between Kimball and relational? Relational is extremely normalized tables, like Inmon, while Kimball is a star schema with facts and dims?
Nailed it
Hi Zach, I couldn't find your GitHub. Can you please share it? I'd like to learn from it, as you mentioned in the video.
The GitHub I use is www.github.com/DataEngineer-io
There is a lot of content on www.DataEngineer.io that isn't on UA-cam as well, if you make a free account.
Sure Zach, thanks for your support
You're gold!
Thanks for this video! It was very informative and the examples really illustrated the different use cases. Can you post a follow up or link to help clarify why the partitions are needed on SCD and OBT tables?
I had the same question. With partition_date in the SCD, the number of records doesn't really reduce.
Couldn't all 1-to-1 relations to the SCD in reality just reside in the OBT? For example here, all the posts table data could really just be another column of arrays.
Glad this got you thinking
Can you explain more, or put up a YouTube short or a video about the enrollment: what it covers, whom it will benefit, etc.
Full-access Live Boot Camp starting May 6th ($2000)
Analytics-Focused Live Boot Camp starting May 6th ($1650)
Self-paced Data Engineering Course V4 Combined ($1750)
Infrastructure-Focused Live Boot Camp starting May 6th ($1650)
I am amazed by your YouTube content, couldn't stop watching your shorts.
FACEPAD - Funnel, Aggregation, Clustering, Experimentation, Prediction, Accumulation, Derivative - you are an amazing teacher and mentor
Sorry, I might have a different point of view. Admittedly, Kimball's method might not be the one-size-fits-all solution. But in this case, it seems you are making a mistake Kimball calls out in chapter 6 of his book, the header/line pattern to avoid: using a header/line id (in this case, the post_id) as a dimension with such high cardinality. Please correct me if I am wrong.
superrrr thanks!
thank you sir
Oo great topic
This is good shit
Appreciate the content Zach, but the feed just spontaneously cuts out.
This was a live stream that, I learned, had no audio.
So basically, in the end, are you saying that the interactions of millions of users on apps such as Facebook or Instagram (I repeat, millions of users) are stored in one big table? Imagine the number of rows in that table 😧😧
Export it to excel and watch your computer melt before your eyes
I like Kimball a lot, but I thought it was pretty funny that the group published "facing the re-keying crisis" near the time of his retirement and NOT long before that. That said, I thought that group was the most humble and accepting of other methodologies.
Is One Big Excel File an option? (joke)
So many business people think so!
Facebook privacy policies are “fairly lax” nice 😮
Yes, I agree, data scientists do sometimes ask "how many stoned people liked SQL sucks posts" at 4 am
Freakin
very confusing example
Can you ever slow down and chill?
"Give me all the likes that happened between 4:20PM & 4:25PM each day to catch the stoned people..."
One big table followed by one big joint
@@EcZachly_ lol
It's like this guy really likes to say "like"... he's definitely high on something. Why else would his voice be so obnoxious and annoying?
Like, I’m just built different okay
there are people who hate SQL?👀
With a passion!