Data Modeling: One Big Table vs Kimball vs Relational for data engineers

Поділитися
Вставка
  • Опубліковано 23 січ 2025

КОМЕНТАРІ • 138

  • @scottbrwnng
    @scottbrwnng 10 місяців тому +33

    Please upload more of these long form videos that go in depth

  • @shubhamsharma-ne2ke
    @shubhamsharma-ne2ke 9 місяців тому +26

    Wow. What a video. Randomly popped in feed and never completed a random 40 min video in one go.

    • @EcZachly_
      @EcZachly_  9 місяців тому +3

      Glad you liked it!

    • @shubhamsharma-ne2ke
      @shubhamsharma-ne2ke 9 місяців тому

      @@EcZachly_ loved it. Planning to take your course soon. Need to brush up the prerequisites:)

  • @sf-spark129
    @sf-spark129 8 місяців тому +55

    Literally this is the ONLY video on UA-cam on how to data model with a hands-on, real-world example. All other videos talk about concepts, which sounds very abstract.
    BTW, this guy seems high but how come he can be this sharp at the same time? amazing

    • @chandanjha3205
      @chandanjha3205 8 місяців тому +4

      When he is high, he chooses right data model. When he is not, he simply dumps anything in a DB and ask analytics to figure out

    • @jay_wright_thats_right
      @jay_wright_thats_right 2 місяці тому +1

      This isn't the only video like this just the only video YOU found. LOL!

    • @syedkamran4121
      @syedkamran4121 2 місяці тому

      He has ADHD

  • @pcl1923
    @pcl1923 9 місяців тому +7

    This is one of the best dimensional data modelling primers out there. Using it for my Meta DE onsite, and it has helped me to clearify some concepts, that I thought I "kind of" understood but with this video now I clearly see the reasonign and real applications. Thank you!

  • @anandahs6078
    @anandahs6078 10 місяців тому +5

    😮 amazed by this OBT concept. I am definitely going to use this. Thanks Zach for this. You are an amazing engineer who teaches us real world scenarios 😍

  • @Milhouse77BS
    @Milhouse77BS 10 місяців тому +6

    Glad you were able to get a version out with sound. Thanks.

    • @EcZachly_
      @EcZachly_  10 місяців тому

      Thanks! I felt so bad that the stream was silent

    • @murugesanrajasekaran5032
      @murugesanrajasekaran5032 10 місяців тому +6

      @@EcZachly_no worries. You’re doing a wonderful job for the DE community. Much appreciated

  • @shraddhathapa9446
    @shraddhathapa9446 9 місяців тому +8

    Hi Zach, please keep making these, your videos are incredibly insightful.

  • @paulolimaudia
    @paulolimaudia Місяць тому +1

    I'm in middle of the video and just had to stop it and come here to say: Wow! This is one of the bests DE videos on UA-cam.

  • @TechyBuild
    @TechyBuild 10 місяців тому +4

    Once again you killed it Zach! You just made my data engineering journey easy.

  • @oyindamolavictor
    @oyindamolavictor 9 місяців тому +2

    Amazinggg…. Very in-depth, kept rewinding for parts I didn’t understand…. Got to know about this from your Twitter page…Please more or this🚀

  • @NeumsFor9
    @NeumsFor9 10 місяців тому +11

    Bro, when prod data isn't modeled correctly, asking a data engineer to address all data quality issues is like asking your plumber to get you clean water. That all said, the right integration of ETL Process, Data Quality, and Business metadata (and competent boss that can support you in doing so) can help solve many DQ problems. Populating that metadata repo requires lots of end user interaction and, if your company is mature, data governance.

  • @LIV-FREE-VET
    @LIV-FREE-VET Місяць тому +1

    Wow, thanks for sharing your knowledge. I feel one day I will be investing in one of your courses when I get better with data analysis. Thanks Zach! Keep up the value and content. You're the best DE I've seen!

  • @mayravaldes89
    @mayravaldes89 4 місяці тому +4

    Zach, I'm learning so much. Your videos are amazing, please add more for us, the ones that can not afford to pay your course 🙏🙏

    • @EcZachly_
      @EcZachly_  4 місяці тому

      My course is

    • @mayravaldes89
      @mayravaldes89 4 місяці тому +1

      @@EcZachly_ Sorry Zach, I admire your content, but not eveybody lives in the US and has a US dollar salary. The first thing I see in the course is a one time payment of $2000 which is impossible for people that have a non US salary and has kids and house to pay. Then there is the $125/month (billed annually) which is still $1500 which is pretty impossible too. Then it comes the $200 per month which is more affordable but there's just no way to consume all that content within a month when you have a full time job, and kids and a life, so it's $200 per maybe 3 months = $600.. which is.. well.. hard to pay.

    • @EcZachly_
      @EcZachly_  4 місяці тому

      ​@@mayravaldes89 If you use code EARLYBIRD you can make it 30% off that $200. So it'd be $140/month.

  • @rachelannballesteros3055
    @rachelannballesteros3055 4 місяці тому +2

    No one can explain it better than you. Great work 🥳

  • @arpittapwal4651
    @arpittapwal4651 8 місяців тому +3

    Thank you very much Zach for explaining these concepts with real life examples. This content is gold mine and extremely helpful. Please keep making more and more such informative and insightful videos 😃

  • @muhammadraza3290
    @muhammadraza3290 10 місяців тому +2

    Very insightful content! Especially the examples from your work like Facebook really help drill in the use of the various functions. Really appreciate the content :)

  • @mosaliu8495
    @mosaliu8495 3 місяці тому +1

    Thanks, Zach! Your content has really helped me gain a better understanding of data modeling.

  • @dalailama5757
    @dalailama5757 3 місяці тому

    Don't know if it matters much but even as a person in data world for over a decade, I still learn something new every time I read your blog or listen to your videos. Great work, as always

  • @Tyf13
    @Tyf13 8 місяців тому +1

    Wow thank you this was super digestible and really helped me get a deeper understanding of modeling. This is really going to help me on the data science side of things.

  • @manuelalejandropereztrujil2391
    @manuelalejandropereztrujil2391 10 місяців тому +3

    Wonderful content, Zach! God bless you!

  • @rishadm1771
    @rishadm1771 8 місяців тому +2

    An absolute gem of a video!

  • @medon910
    @medon910 9 місяців тому +2

    This is the exact video I am looking for! kudos for sharing great content.

  • @PreetMehtaUSA
    @PreetMehtaUSA 2 місяці тому +2

    One Big Table (OBT)
    Pros: Simple querying, no joins needed.
    Cons: High storage cost, data redundancy, slower on large datasets.
    Kimball (Star Schema)
    Pros: Fast querying, optimized for analytics, minimal joins.
    Cons: Requires initial design, moderate complexity when scaling.
    Relational Modeling (3NF)
    Pros: Efficient storage, reduces redundancy, good for write-heavy tasks.
    Cons: Complex joins, slower for complex read queries.

  • @gurumoorthysivakolunthu9878
    @gurumoorthysivakolunthu9878 День тому

    Hi Zach...
    Thank you for sharing your real time experience...
    I have a question -
    What is production data and what is events data... What's the difference...
    Couldn't get clear understanding...
    Please give me some examples...

  • @nileshk611
    @nileshk611 9 місяців тому +26

    41 mins of absolute data modeling concepts explained.

  • @fernandostahelin2972
    @fernandostahelin2972 6 місяців тому +2

    Amazing class, thank you Zach

  • @raphaeldayan
    @raphaeldayan 7 місяців тому +2

    This video is priceless, thank you for sharing

  • @srinubathina7191
    @srinubathina7191 10 місяців тому +2

    Thanks Zach for amazing content

  • @Victor-yn6lv
    @Victor-yn6lv 5 місяців тому +1

    Thank you so much for sharing such excellent video!!!

  • @ManishJindalmanisism
    @ManishJindalmanisism 10 місяців тому +7

    hi Zach, I have been working on SCD sfrom long, but I didnt understand partition_date column in your example. Is it the date of data ingestion in SCD table or sth else?
    For finding deleted number of posts in your example why would someone use partition_date where clause. Once a post is deleted it cant be undeleted. So why not simply Select count(*) from table where text = 'Sql_sucks' and is_deleted= True.
    And if i want to know how many posts where deleted on a specific day why would I again use partition_date, I will leverage start_date and end_date columns.
    Select count(*) from table where text = 'Sql_sucks' and is_deleted= True and start_date = current_date. That is the purpose of keeping these startdate and enddate columns in first place.

    • @EcZachly_
      @EcZachly_  10 місяців тому +2

      Data lakes pull the data forward though.
      Partition_date here still has all the data since SCD type 2 is cumulative.

    • @mikekenneth4773
      @mikekenneth4773 10 місяців тому +2

      ​@@EcZachly_Thanks, this clarifies my concern.So, the SCD will still de-duplicate the Data but hold the last partition_date of that record.

  • @strawhatSEO
    @strawhatSEO 10 місяців тому +1

    Great content thank you Zach! Is there a reason why it cuts out at the end seemingly abruptly?

  • @meryplays8952
    @meryplays8952 3 місяці тому

    His mention of fixing normalization in the relational layer is very crucial.

  • @bradj229
    @bradj229 4 місяці тому +1

    Awesome explanations. Thanks!

  • @fzrbigman
    @fzrbigman 10 місяців тому +2

    thanks for the great content Zach!

  • @BasicoftheBasilisk
    @BasicoftheBasilisk 10 місяців тому +5

    This "sql sucks" example hits hard bruv.

  • @chandanjha3205
    @chandanjha3205 8 місяців тому

    Zach- one of the best tutorials I have seen giving a real world example.
    Quick question: 1) What does the partition_date for the posts_obt signify? Is it post creation date?
    2) At 33:48 why do we have a filter in the query for partition_date. Here we are simply trying to find number of likes on a given post(using post_id). I am not able to think the role of a date column (same column in my first question)

    • @EcZachly_
      @EcZachly_  8 місяців тому +2

      1. No. Partition date is the date of the run of the pipeline
      2. You need to pick a version of your OBT. Is it todays or yesterdays? You don’t overwrite here since that’s risky. Keeping 5-6 days of OBT data is advised.

    • @Escrotazo66
      @Escrotazo66 8 місяців тому

      What is the point of saving partition date in kimpball? It is going to be equal either the start date if it is not true or the end date it it is true

  • @michelabifadel1474
    @michelabifadel1474 6 місяців тому +2

    Love your videos man ! Can you please give me your insights / point of view regarding Event Driven Architecture?

  • @farahiyahsyarafina2183
    @farahiyahsyarafina2183 10 місяців тому +2

    thanks for this, zach

  • @satz611
    @satz611 3 місяці тому

    Case Against OBT is that you’d have to deal with complex data types while Kimball model uses simple data types, though it requires full table scan. At Meta, designing in OBT is a very common approach. You can also highlight the Datelist fields, which powers recency metrics..

  • @keremen
    @keremen 10 місяців тому +2

    20:55 what if the post contains PII? Does Facebook anonymise post content as well?

    • @EcZachly_
      @EcZachly_  10 місяців тому +2

      Yep scrubs it of PII

  • @muhammadraza3290
    @muhammadraza3290 10 місяців тому +1

    Where does the is_deleted column come from? How is it generated? Is there a custom logic behind it or some tooling that takes care of it?

  • @tonyyeung9481
    @tonyyeung9481 10 місяців тому +1

    Hi Zach, clear and practical sharing! May I know about transaction consistency across the snapshot tables? I wonder if the snapshots are in an inconsistent state when they are being selected & loaded to the data lake. For example, selecting "A" table at 5:00am, and selecting "B" table at 5:05 am with new updates. A join of 'A' & 'B' table would cause inconsistent result.

    • @EcZachly_
      @EcZachly_  10 місяців тому +1

      Big tech generally doesn’t care about that. The way to avoid is minimize the time in between related table snapshots by dumping them in a single run

  • @marcelonobrega355
    @marcelonobrega355 3 місяці тому +1

    Hey Zach! I really liked the video, really nice content that you are sharing with us here! In 33:49, maybe there could be a scenario that you would have to consider the action of reverted likes. In that case, would adding a Case When expression subtracting 1 on 's' be a solution for getting total likes on the deleted post?

  • @AsifKhan-jq8zg
    @AsifKhan-jq8zg 7 місяців тому +1

    Why do you have field partition_date in user_scd and posts_scd?

  • @Levy957
    @Levy957 10 місяців тому +1

    thank you for the content, bring more Kimball please!

  • @Joey-vy1iu
    @Joey-vy1iu 10 місяців тому +1

    Thanks for the informative video on such a important (more so now than ever before) topic Zach! Question: why not just only keep the posts_obt table and upsert that when there's updates? Then you don't need the join

  • @WiktorJurek
    @WiktorJurek 4 місяці тому +1

    Love the video. Hits some important concepts. I wonder how you consider your dimensional models when using a semantic layer like LookML?

    • @BenDoremus
      @BenDoremus 2 місяці тому

      Really all it comes down to is that all LookML views should be select * from dim/fact tables.

  • @OrcaChess
    @OrcaChess Місяць тому +1

    Inspiring content! What do you think about Data Vault?

  • @DeepanshGoyal-s9l
    @DeepanshGoyal-s9l 2 місяці тому +1

    I just love your videos. Are there more in funnel?

  • @kentmaxwell1976
    @kentmaxwell1976 10 місяців тому +1

    This is fantastic content. Do you cover OBT in your data engineering boot camp? Where can I find the lab and lab demonstration you referred to in the video?

    • @EcZachly_
      @EcZachly_  10 місяців тому +1

      We spend a week on SCDs and OBT yep!

  • @twentytwentyeight
    @twentytwentyeight 3 місяці тому +1

    Not sure how I got lucky and your video was served up, but instant sub ❤❤ do you have any videos or presentations for folks who want to support data maturation in their org who are not DES?

  • @MrMe77-u4y
    @MrMe77-u4y Місяць тому +1

    great content, thank you!

  • @Enzo-jj5te
    @Enzo-jj5te 10 місяців тому +2

    Hey ! Great content as always, thank you very much ❤
    So you and the community use the term "relational data model" to refer implicitly to "at least 3NF-compliant relational data model", right ? (strictly speaking, both dimensional models and OBTs are relational models too, right ?)

  • @berellevy2
    @berellevy2 9 місяців тому +1

    23:30 instead of snapshots why not do change data capture?

    • @EcZachly_
      @EcZachly_  9 місяців тому +1

      CDC isn't as reliable

    • @berellevy2
      @berellevy2 9 місяців тому

      @@EcZachly_ and it sounds like most companies dont need the added granularity…

    • @TheEjaay
      @TheEjaay 8 місяців тому

      @@EcZachly_ ? care to explain

  • @Vaibhav19921
    @Vaibhav19921 7 місяців тому +1

    I liked the video very much. It gave me complete understanding of a denormalised table. I have one query, if lets say I have to create a big denormalised table(not being a fact table), which contains some 1000s of columns which gets created by joining 80 tables(30 large tables and 50 very small lookups ), What is the best way to make it real-time ?

    • @EcZachly_
      @EcZachly_  6 місяців тому

      If these aren’t facts, you should use CDC to make the mutations real time. The lookup tables are easy side inputs

  • @arashriazi9597
    @arashriazi9597 10 місяців тому +1

    this is was super helpful, learned a lot

    • @EcZachly_
      @EcZachly_  10 місяців тому +1

      Glad you liked it! Any other topics you think I could cover?

    • @arashriazi9597
      @arashriazi9597 9 місяців тому

      @@EcZachly_ a video like this around big data would be very helpful as there are so many tools and approaches out there which makes it a bit difficult to enter to the big data space

  • @ronaldmwenda
    @ronaldmwenda 2 місяці тому +1

    Amazing content.

  • @mridulbhardwaj3116
    @mridulbhardwaj3116 10 місяців тому +1

    Really appreciate your content man

  • @princebansal7500
    @princebansal7500 10 місяців тому +1

    Great Content Zach. Thanks. one question is to why we can't or shouldn't do CDC on source relational tables(May be by changed datetimestamp) to bring only new and updated source records rather than daily snapshots to build SCD type2 in DW layer?

    • @EcZachly_
      @EcZachly_  10 місяців тому

      Every big tech company does daily snapshots. It’s duplicative but much simpler than CDC

  • @davidoh0905
    @davidoh0905 6 місяців тому

    @36:28 is it a typo to have s.is_current = TRUE and s.is_deleted = TRUE?

    • @EcZachly_
      @EcZachly_  6 місяців тому

      Nah. Is_current means the most recent state. Is_deleted being the most recent state.

  • @krzysztoflinke3870
    @krzysztoflinke3870 10 місяців тому +1

    Hi Zach great content! I am wondering how are you appending data in OBT. Are you using UPSERT or APPEND mode in your table? Is it on daily on hourly basis?

    • @EcZachly_
      @EcZachly_  10 місяців тому

      Depends on the use case

  • @KPYCHIK
    @KPYCHIK 10 місяців тому +1

    Great talk! However I have couple questions. Even though partition_scd will be comparebly small, but wouldn’t its partitions grow infinitely? Or there going to be some rules to retain posts for the last n periods? And also I was wondering what are drawbacks of modelling posts as dim table? Probably I’m missing something but It more reminds me factless fact table.

    • @EcZachly_
      @EcZachly_  10 місяців тому

      Yeah you only keep the last week of partitions around. Correct!

    • @matiasmicenmacher1564
      @matiasmicenmacher1564 6 місяців тому

      Hey Zack … general speaking with the snapshot_ts on daily dimensions, Do we need to have the same snapshot_ts for the facts tables as well? But keep all the partitions no more than a week ? Thanks!!

  • @AnnChu-tb4hp
    @AnnChu-tb4hp 7 місяців тому +1

    Zach, what is your opinion on Bitemporal Modelling?

  • @YannickSacherer
    @YannickSacherer 10 місяців тому +1

    Hey zach thanks for your guidance it was super interesting and now I fully understand the advantage of obt's. do you have a tip how to translate this query into snowflake?

    • @EcZachly_
      @EcZachly_  10 місяців тому

      They have similar array functions in snowflake. ChatGPT is your friend

    • @YannickSacherer
      @YannickSacherer 10 місяців тому

      @@EcZachly_ haha :D
      chatgpt hates me and always produces grouped by queries but I will proceed :)

    • @tobelesa
      @tobelesa 7 місяців тому

      ​@@EcZachly_ yo, data engineer working with snowflake here. And it seems that snowflake does not handle array map and reduce very well. It does not have built-in array functions for array map and reduce

  • @Sfgarcia
    @Sfgarcia 8 місяців тому +1

    Hey zach! Great video, I really enjoyed it! In your experience, what's the best way to overcome the limitation of the maximum array size in some SQL engines? For example, in Amazon Athena (which uses Trino) the maximum array size is 254 (due to a Java limitation). In your example, a post could probably contain more thans these number of actions. One simple approach would be to make a new array column every time you hit the limit, but I don't think that is very scalable. What's your opinion?

    • @EcZachly_
      @EcZachly_  8 місяців тому +1

      Trino doesn’t have a 254 limit. I used trino at Facebook. The limit is 65k I thought? I definitely used Trino and 365 item arrays at Facebook.
      Maybe it’s an Athena thing?

  • @Milhouse77BS
    @Milhouse77BS 10 місяців тому +1

    3:55: Kimball

  • @vasilis7076
    @vasilis7076 2 місяці тому +1

    what is the difference between kimball and relational? relational is extremely normalized tables like inmon? while kimbal is star schema with fact and dims?

  • @srinubathina7191
    @srinubathina7191 10 місяців тому +1

    Hi Zach i couldn't find your git hub things can you please provide that i will learn the things from that as you mentioned in the video

    • @EcZachly_
      @EcZachly_  10 місяців тому +1

      The GitHub I use is www.github.com/DataEngineer-io
      There is alot of content on www.DataEngineer.io that you can watch that isn’t on UA-cam as well if you make a free account.

    • @srinubathina7191
      @srinubathina7191 10 місяців тому +1

      Sure Zach Thanks for your support

  • @bestechdeals4539
    @bestechdeals4539 Місяць тому +1

    You're gold!

  • @tadbronson
    @tadbronson 10 місяців тому +1

    Thanks for this video! It was very informative and the examples really illustrated the different use cases. Can you post a follow up or link to help clarify why the partitions are needed on SCD and OBT tables?

    • @mikekenneth4773
      @mikekenneth4773 10 місяців тому

      I had same question. Because with partition_date in SCD, the number of records doesn't really reduce

  • @TheEjaay
    @TheEjaay 8 місяців тому +1

    Couldn't all 1 to 1 relations to scd in reality just reside in the OBT? For example here. All posts table data could really just be another column of arrays.

    • @EcZachly_
      @EcZachly_  8 місяців тому

      Glad this got you thinking

  • @babuganesh2000
    @babuganesh2000 10 місяців тому +1

    Can you explain more and put a youtube shorts or a video about the enrollment and what it covers it will benefit whom e.t.c
    Full-access Live Boot Camp starting May 6th ($2000)
    Analytics-Focused Live Boot Camp starting May 6th ($1650)
    Self-paced Data Engineering Course V4 Combined ($1750)
    Infrastructure-Focused Live Boot Camp starting May 6th ($1650)

    • @babuganesh2000
      @babuganesh2000 10 місяців тому

      I am amazed about your youtube content, couldn't stop watching your shots
      FACEPAD - Funnel, Aggregatioon, Clustering, Experimentation,Prediction,Accumulation, Derivative - you are an amazing teacher mentor

  • @jacobli2676
    @jacobli2676 3 місяці тому

    Sorry, I might have a different point of view. In terms of Kimball's method, it is admittedly might not be the one-fit-all solution. But in this case, it seems you are making a mistake specified in Kimball's book in his book chapter 6, header/line pattern to avoid, which is, using a header/line (in this case, the post_id) as a dimension table with such a high cardinality. Please correct me if I am wrong.

  • @thiffanyp
    @thiffanyp 9 місяців тому +1

    superrrr thanks!

  • @sanooosai
    @sanooosai 5 місяців тому

    thank you sir

  • @jameshizon4861
    @jameshizon4861 10 місяців тому +1

    Oo great topic

  • @ondacharts
    @ondacharts 8 місяців тому +1

    This is good shit

  • @mitchtherighteous
    @mitchtherighteous 10 місяців тому +1

    appreciate the content Zach but the feed just spontaneously cuts out.

    • @EcZachly_
      @EcZachly_  10 місяців тому

      This was a live that I learned had no audio.

  • @mihirit7137
    @mihirit7137 7 місяців тому +1

    so basically in the end are you saying that the interactions of millions of users on apps such as facebook or instagram , i am repeating again millions of users are stored in one big table ? imagine the number of rows on that table 😧😧

    • @EcZachly_
      @EcZachly_  7 місяців тому

      Export it to excel and watch your computer melt before your eyes

  • @NeumsFor9
    @NeumsFor9 10 місяців тому +5

    I like Kimball a lot, but I thought it was pretty funny when the group published "facing the re-keying crisis" near the time of his retirement and NOT a long time before that. That all said, I thought that group was the most humble and accepting of other methodologies.

  • @richardduncan3403
    @richardduncan3403 2 місяці тому +1

    Is One Big Excel File and option? (joke)

    • @EcZachly_
      @EcZachly_  2 місяці тому

      So many business people think so!

  • @alecryan8220
    @alecryan8220 3 місяці тому +1

    Facebook privacy policies are “fairly lax” nice 😮

  • @rachelryan5231
    @rachelryan5231 6 місяців тому +1

    Yes, i agree data scientists do ask sometimes "how many stoned people liked SQL Suck posts" at 4 am

  • @alecryan8220
    @alecryan8220 3 місяці тому +1

    Freakin

  • @everythingalevels6645
    @everythingalevels6645 2 місяці тому +1

    very confusing example

  • @LifeIsAwesome-d5h
    @LifeIsAwesome-d5h 3 місяці тому +2

    Can you ever slow down and chill?

  • @jameshizon4861
    @jameshizon4861 10 місяців тому +2

    "Give me all the likes that happened between 4:20PM & 4:25PM each day to catch the stoned people..."

    • @EcZachly_
      @EcZachly_  10 місяців тому +3

      One big table followed by one big joint

    • @jameshizon4861
      @jameshizon4861 10 місяців тому +1

      @@EcZachly_ lol

  • @2account134
    @2account134 4 місяці тому +1

    it's like this guy really likes to say like... he's definitely high on something, why else would his voice be so obnoxious and annoying

    • @EcZachly_
      @EcZachly_  4 місяці тому

      Like, I’m just built different okay

  • @mx953
    @mx953 8 місяців тому +1

    there are people who hate SQL?👀

    • @abacusyou
      @abacusyou 8 місяців тому +1

      With a passion!