Data Lake Fundamentals, Apache Iceberg and Parquet in 60 minutes on DataExpert.io

  • Published Feb 5, 2025
  • We'll be covering data lakes, the Parquet file format, data compression, and shuffle!
    Make sure to have a www.DataExpert.io account here so you can get the most out of this lab!

COMMENTS • 58

  • @alonzo_go
    @alonzo_go 10 months ago +33

    This channel is gold for any young data engineer. I wish I could pay you but you're probably already swimming in enough data :D

    • @jay_wright_thats_right
      @jay_wright_thats_right 7 months ago

      How do you know that? Did you get a job from what you learned on this channel? Are you actually a data engineer?

    • @alonzo_go
      @alonzo_go 7 months ago +8

      @jay_wright_thats_right yes, I'm actually a data engineer. I've been a data engineer for many years now, so no, I didn't get a job because of the channel. But I can confirm that he teaches important concepts that are very useful and sometimes not readily available to a beginner engineer.

    • @SheeceGardazi
      @SheeceGardazi 3 months ago

      @jay_wright_thats_right 100% I'm going to pitch some of these practices ... this is gold

  • @nobodyinparticula100
    @nobodyinparticula100 1 year ago +6

    Zach! We just started our project where we will be transferring our data to a data lake in Parquet! This is a very timely video. Awesome job, as always!

  • @justinwilkinson6300
    @justinwilkinson6300 1 year ago +4

    Great lesson Zach! I have always wondered what the hell a Data Lake is. Great explanations and super easy to understand!

  • @vivekjha9952
    @vivekjha9952 11 months ago +2

    Zach, I watched this while going to the office, and I loved the way you explained things. Learnt a hell of a lot. Thanks for it!

  • @rohitdeshmukh197
    @rohitdeshmukh197 1 year ago +6

    Great video Zach, awesome content, I learnt a lot. Can you please make a video or share some content about why we should avoid shuffling, common shuffling issues, and ways to fix them?

  • @stiffer_do
    @stiffer_do 24 days ago +1

    Amazing video! Nice that you didn't cut the video when the unsorted and sorted sizes weren't what you expected!

  • @theloniusmonkey5138
    @theloniusmonkey5138 1 year ago +2

    Great and insightful lessons Zach, just high-quality content! Your community of loyal DEs is growing :) Keep it up!

  • @onzurkthewarrior2822
    @onzurkthewarrior2822 1 month ago +1

    Best data engineer in the world 🚀

  • @andydataguy
    @andydataguy 1 year ago +2

    Awesome video man! Just discovered your channel and excited to see more like this

  • @qculryq43
    @qculryq43 10 months ago +1

    Wow - I learned so much from this video - Amazing! Thank you for sharing.

  • @princegaurav
    @princegaurav 2 months ago +1

    Nice video Zach ... learnt something new. Thanks 👍

  • @murilloandradef
    @murilloandradef 1 year ago +2

    Amazing class Zach! Keep going, thxxx

  • @muhammadzakiahmad8069
    @muhammadzakiahmad8069 1 year ago +2

    Need more of these videos, beginner friendly💡

  • @SheeceGardazi
    @SheeceGardazi 3 months ago +1

    Thanks for the hands on lab!

  • @rocioradu8636
    @rocioradu8636 2 months ago +1

    I love your videos! So useful for the day-to-day job.

  • @fbokovikov
    @fbokovikov 5 months ago +1

    Excellent lesson! Thank you Zach!

  • @vivekjha9952
    @vivekjha9952 11 months ago +1

    It's a great video Zach, thoroughly enjoyed it.

  • @srinubathina7191
    @srinubathina7191 10 months ago +1

    Wow, amazing content Zach
    Thank you so much

  • @ManishJindalmanisism
    @ManishJindalmanisism 11 months ago +2

    Thanks Zach, the practical you showed helped me learn a lot. Can you please tell me: if I do daily sorted inserts into my Iceberg table from my OLTP system using an ETL pipeline, will Iceberg treat each insert as its own set of data files and compress it on its own, or will it also look at common columns in existing data files and compress across them?

  • @papalaplace
    @papalaplace 1 year ago +1

    Great as always 🎉

  • @adolfo1981
    @adolfo1981 4 months ago +1

    This guy knows how to explain

  • @zwartepeat3552
    @zwartepeat3552 10 months ago +2

    Casually ending the gender debate 😂 good video sir! Very informative

  • @atifiu
    @atifiu 11 months ago +2

    @zach Thanks for this informative video. I have one question. You mentioned sorting the data on low-cardinality columns first and then moving towards high cardinality for better RLE, which makes sense for getting more compressed data. But on the read side, taking Iceberg as an example, we generally try to filter data on high-cardinality columns, and so we want those columns in the sort order so that predicate pushdown really helps us read only a small subset of the data. These two goals contradict each other: on one side we get smaller data on disk, on the other side we want the sort on high-cardinality columns for reads.

    • @EcZachly_
      @EcZachly_  11 months ago

      Yep it’s an art! It all depends on what columns are the most likely to be filtered on!
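
A minimal sketch of the compression effect discussed in this thread, assuming pandas and PyArrow rather than the lab's Trino tables; the column names and random data below are made up for illustration. It writes the same rows once unsorted and once sorted from lowest to highest cardinality, then compares the Parquet file sizes.

    # Hypothetical illustration: unsorted vs. low-to-high-cardinality sorted Parquet sizes.
    import os
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    n = 1_000_000
    df = pd.DataFrame({
        "team": rng.choice([f"team_{i}" for i in range(30)], n),            # ~30 distinct values (lowest cardinality)
        "points": rng.integers(0, 60, n),                                   # ~60 distinct values
        "player_name": rng.choice([f"player_{i}" for i in range(500)], n),  # ~500 distinct values (highest cardinality)
    })

    # Same rows, two layouts on disk.
    df.to_parquet("unsorted.parquet", index=False)
    df.sort_values(["team", "points", "player_name"]).to_parquet("sorted.parquet", index=False)

    print("unsorted:", os.path.getsize("unsorted.parquet"))
    print("sorted:  ", os.path.getsize("sorted.parquet"))  # run-length encoding usually makes this noticeably smaller

Whether the smaller file is worth it still depends on which columns downstream queries filter on, as the reply above notes.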

  • @alecryan8220
    @alecryan8220 3 months ago

    Hey! Super late to this video but glad I found it. One thing you didn’t touch on is the tradeoff in compute for sorting the data on write. I’m wondering if you found this to be negligible in your past experience. Thanks again for the great content!

  • @anthonyanalytics
    @anthonyanalytics 7 months ago +1

    Wow this is amazing!

  • @JP-zz6ql
    @JP-zz6ql 1 year ago +1

    Wow, the way people push vc is creative now. Good video.

  • @thoughtfulsd
    @thoughtfulsd 8 months ago

    This is amazing. You are a fabulous teacher. Had a question on replication. Is the replication factor not a requirement any more in modern cloud data lakes?

    • @EcZachly_
      @EcZachly_  8 months ago

      Nope. Hadoop is dead fam

  • @pauladataanalyst
    @pauladataanalyst 11 months ago +1

    Hello Zach, thanks for the content. After May, when is the next bootcamp?

    • @EcZachly_
      @EcZachly_  11 months ago

      There's a 100% chance May is the last one where I'm teaching a majority (~75%) of the content.
      September/October would be the next one. I'll be teaching like… 30-40%.

    • @MichaelCheung-z2v
      @MichaelCheung-z2v 6 months ago

      Wanna join!

  • @YEM_
    @YEM_ 1 year ago +1

    What SQL syntax is that? (So I can Google it to research more about what options are available to create a table.)

    • @EcZachly_
      @EcZachly_  1 year ago +1

      Trino, which has nearly identical syntax to Postgres

  • @ondacharts
    @ondacharts 8 months ago +1

    Another heat vid

  • @YEM_
    @YEM_ 1 year ago +1

    The tables you are using for your sources... are those Iceberg tables, which are really just files and folders in S3 under the hood, placed there before the training? I'm just confused about where the raw data is coming from and what it looks like.

  • @LMGaming0
    @LMGaming0 8 months ago +1

    I have a question: during the whole video you've been dealing with historical data and moving it, but what about newly received data? How do you deal with it? Do you insert it into some random table and then update your Iceberg table using cron jobs, or do you insert it directly into Iceberg, and how?

    • @EcZachly_
      @EcZachly_  8 months ago +2

      Collect a daily batch in Kafka then dump it to iceberg
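
A minimal sketch of that "dump the daily batch to Iceberg" step using PyIceberg, under stated assumptions: a REST catalog reachable at the URI shown, a pre-created table named bootcamp.nba_events whose schema matches the batch, and an in-memory Arrow table standing in for data already drained from a Kafka consumer. All of those names and settings are hypothetical.

    # Hypothetical sketch: append a daily batch (already collected from Kafka) to an Iceberg table.
    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    # Stand-in for the day's batch drained from a Kafka consumer.
    daily_batch = pa.table({
        "game_date": ["2025-02-05", "2025-02-05", "2025-02-05"],
        "player_name": ["player_a", "player_b", "player_c"],
        "points": [12, 31, 7],
    })

    # Assumed REST catalog and table; adjust to your environment.
    catalog = load_catalog("default", **{"type": "rest", "uri": "http://localhost:8181"})
    table = catalog.load_table("bootcamp.nba_events")
    table.append(daily_batch)  # writes the batch as new Parquet data files in a new snapshot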

  • @sreesanjeev84
    @sreesanjeev84 1 year ago +1

    Is it necessary to sort the dataset? Say, what if the compute and time for sorting >>>> storage consumed? Even if the storage is very large it is cheaper, right? What is the good tipping point here?

    • @EcZachly_
      @EcZachly_  1 year ago

      Depends on downstream consumption and volume.
      If the data set isn't used a ton, sorting probably isn't worth it.

  • @amankapoor3563
    @amankapoor3563 1 year ago +1

    Apart from reducing the size, does sorted_by help reads in any other way? Are ORDER BY queries more efficient with sorted_by?

    • @EcZachly_
      @EcZachly_  1 year ago

      You get skipping and stuff like that too when you filter on the sorted_by columns, so it's more efficient there too
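
A small sketch of what that skipping looks like at the Parquet level, assuming PyArrow on a local file; the file name, columns, and row-group size are made up. Each row group stores min/max statistics on the sorted column, so a filter on it only has to read the row groups whose range can match.

    # Hypothetical illustration of row-group statistics and skipping on a sorted column.
    import pandas as pd
    import pyarrow.parquet as pq

    df = pd.DataFrame({"points": range(100_000), "player_name": "someone"})
    df.to_parquet("sorted_points.parquet", index=False, row_group_size=10_000)

    meta = pq.ParquetFile("sorted_points.parquet").metadata
    for i in range(meta.num_row_groups):
        stats = meta.row_group(i).column(0).statistics
        print(i, stats.min, stats.max)  # min/max of "points" per row group

    # Only row groups whose min/max range overlaps the predicate are read; the rest are skipped.
    table = pq.read_table("sorted_points.parquet", filters=[("points", ">", 95_000)])
    print(table.num_rows)  # 4999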

  • @LMGaming0
    @LMGaming0 10 months ago +1

    Amazing video! + 1 follower :D

  • @eduardogaitanescalante9698
    @eduardogaitanescalante9698 4 months ago +1

    Love you you beautiful engineer

  • @-es2bf
    @-es2bf 5 months ago +1

    I don't understand the reason for Parquet?? So you are saying "Parquet is amazing, it partitions the data, you don't have to select all columns". Well, why not just SELECT (needed columns) instead of SELECT *? This is SQL 101...

    • @TheBigWazowski
      @TheBigWazowski 5 months ago +1

      It has to do with how the data is laid out on disk, and the fact that reading from disk happens in contiguous chunks. As an example, take a table that has 3 fields, A, B, and C, that are all 1 byte in size, and say you want to query only field A. Also, suppose your disk transfers happen in 4 byte chunks.
      With a row oriented format, 4 rows of data are laid out like:
      A B C A B C A B C A B C
      To perform the query, you would need to read all 12 bytes from disk. The 3x4 byte chunks being:
      A B C A
      B C A B
      C A B C
      That took 3 transfers, and since you’re only interested in field A, you end up only using 4 of the 12 bytes (33.3% efficiency).
      With a column oriented format the data is laid out like:
      A A A A B B B B C C C C
      This time you only need a single 4 byte transfer of the first chunk to get the data you want. That’s a third of the transfers and 100% efficiency

    • @TheBigWazowski
      @TheBigWazowski 5 months ago +1

      Oftentimes there will be an additional benefit to having all the field A data stored contiguously in memory, because your CPU also transfers contiguous chunks from RAM to the CPU cache. If you were reading from a row-oriented format, you could rearrange the data to use a column-oriented in-memory layout, like Arrow, but that rearrangement would be costly.
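
A tiny sketch of the same point, assuming PyArrow and made-up field names: reading one column out of a Parquet file only touches that column's chunks, which a row-oriented file can't offer no matter how the SELECT is written.

    # Hypothetical illustration: column pruning when reading Parquet with PyArrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Three small fields, mirroring the A/B/C example above.
    table = pa.table({
        "A": pa.array([1, 2, 3, 4], type=pa.int8()),
        "B": pa.array([5, 6, 7, 8], type=pa.int8()),
        "C": pa.array([9, 10, 11, 12], type=pa.int8()),
    })
    pq.write_table(table, "abc.parquet")

    # Only column A's chunks are read; columns B and C are never touched.
    only_a = pq.read_table("abc.parquet", columns=["A"])
    print(only_a)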

  •  3 months ago +1

    Your face getting full screen, we are not interested in your face >>

  • @FabianJEvans
    @FabianJEvans 1 month ago +1

    If you want to sort from lowest cardinality to highest cardinality, then to get an estimate of the cardinality of each of the table columns we can look at the following values:
    - The most steals in a game is 11
    - The number of teams in the NBA is 30
    - The most assists ever in a game is 30
    - The most rebounds in a game is 55
    - The most points in a game is 100
    - The number of players in the NBA is 500-600
    This would imply that sorting by player name is actually one of the worst options. Is it better to sort by player name rather than points since compressing the strings reduces more bytes than compressing the numeric values?
    Also, since having two players on different teams with the same name is very unlikely, wouldn’t you get even better compression by sorting first by team and then by player? This way, you’d maintain the same level of compression for the player column while also improving compression for the team column.

    • @EcZachly_
      @EcZachly_  1 month ago

      You’re totally right
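
A quick way to sanity-check that intuition, sketched with pandas and PyArrow on made-up data (the real lab data lives in the DataExpert.io Trino tables, so exact numbers will differ): write the same rows sorted by player only versus team-then-player and compare the resulting file sizes.

    # Hypothetical check: sort by player only vs. sort by team, then player.
    import os
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 1_000_000
    players = [f"player_{i:03d}" for i in range(550)]
    teams = [f"team_{i:02d}" for i in range(30)]
    # Each player belongs to one fixed team, then we sample game rows.
    player_team = dict(zip(players, rng.choice(teams, len(players))))

    df = pd.DataFrame({"player_name": rng.choice(players, n)})
    df["team"] = df["player_name"].map(player_team)
    df["points"] = rng.integers(0, 60, n)

    df.sort_values("player_name").to_parquet("by_player.parquet", index=False)
    df.sort_values(["team", "player_name"]).to_parquet("by_team_player.parquet", index=False)

    print("player only:     ", os.path.getsize("by_player.parquet"))
    print("team then player:", os.path.getsize("by_team_player.parquet"))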