Modern Data Lake Storage Layers

  • Published 22 Jul 2024
  • An overview of Apache Hudi, Apache Iceberg, and Delta Lake.
    In this video, we talk about the basics of how Hudi, Iceberg, and Delta Lake work. You'll see how to insert, update, and delete data in your data lake and how each of these frameworks works behind the scenes.
    Blog post: dacort.dev/posts/modern-data-...
    GitHub Repo with CloudFormation and Notebooks: github.com/dacort/modern-data...
    Table of Contents:
    00:00 - Intro
    03:21 - Environment Setup
    05:23 - Apache Hudi - Writes
    07:38 - Apache Hudi - Updates
    12:23 - Apache Hudi - Deletes
    14:43 - Apache Hudi - Time Travel
    17:19 - Apache Iceberg - Writes
    23:48 - Apache Iceberg - Updates
    26:49 - Apache Iceberg - Deletes
    28:46 - Apache Iceberg - Time Travel
    30:14 - Delta Lake - Writes
    31:57 - Delta Lake - Updates
    34:23 - Delta Lake - Deletes
    35:35 - Delta Lake - Time Travel
    36:47 - Wrapup
  • Howto & Style

COMMENTS • 18

  • @anjim7877
    @anjim7877 2 years ago +1

    Amazing job, thank you, Dacort.

  • @VikasGK
    @VikasGK 1 year ago

    That's an excellent demonstration.

  • @joemo2782
    @joemo2782 2 years ago +1

    Brilliant work, made this so easy to understand. Great overview!

  • @sshks10
    @sshks10 2 years ago

    Nice speech and manner! Clear mind!

  • @NM-jq3sv
    @NM-jq3sv 1 year ago +1

    Great tutorial.

  • @marcosluis2186
    @marcosluis2186 2 years ago

    Amazing job, Damon.

    • @dacort
      @dacort 2 years ago +1

      Thank you!

  • @woliveiras
    @woliveiras 25 days ago

    Amazing job. Thank you!! What is the best way to read these Delta tables now? Data Catalog and then Athena? I would like to see this data in our QuickSight.

  • @davidciudad3149
    @davidciudad3149 2 years ago +3

    Great job! Your channel reminds me of your colleague Julien Simon's.
    1- Could you show how to use AWS Lake Formation's governed tables in Amazon EMR? What is the difference between governed tables and Apache Iceberg/Apache Hudi/Delta Lake?
    2- Could you do a demo of Pandas-on-Spark once Apache Spark 3.2 is available in Amazon EMR? I'm interested to know whether it is possible to run Pandas code on Amazon EMR without big changes.
    3- Could you talk about the book about Amazon EMR that your colleague Sakti Mishra will publish soon? I would like to know if it can help me to prepare for the AWS Data Analytics Certification.

    • @marcosluis2186
      @marcosluis2186 2 years ago +1

      1- AWS Lake Formation's governed tables come with some limitations you need to know about (docs.aws.amazon.com/lake-formation/latest/dg/governed-table-restrictions.html); from my perspective, those are the main differences between them and Apache Iceberg/Apache Hudi/Delta Lake.
      2- The latest version of EMR is based on Spark 3.1.2: docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-6.5.0.html
      3- You can find Sakti Mishra's book here: amzn.to/3sRare1

  • @Major.Tom.1973
    @Major.Tom.1973 1 year ago

    Do any of these have a "vacuum" equivalent, or how do you do housekeeping / maintenance on these incremental data lakes?

    • @dacort
      @dacort 7 months ago

      Both Hudi and Iceberg have "maintenance" operations you can run, including compaction. See the docs for Iceberg ( iceberg.apache.org/docs/1.2.0/maintenance/#compact-data-files ) and Hudi ( hudi.apache.org/docs/compaction/ ).
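
As a rough illustration of the Iceberg side of that reply, the sketch below builds the Spark SQL statements for two maintenance procedures Iceberg documents, `rewrite_data_files` (compaction) and `expire_snapshots`. The catalog name `glue_catalog`, the table `db.events`, and the 7-day retention are hypothetical placeholders, and the helper function itself is an assumption for illustration, not something from the video. Running the statements requires a Spark session with the Iceberg SQL extensions enabled.

```python
from datetime import datetime, timedelta, timezone


def iceberg_maintenance_sql(catalog: str, table: str, keep_days: int, now: datetime) -> list:
    """Build Spark SQL statements for routine Iceberg table maintenance.

    The catalog and table names are supplied by the caller; nothing here
    talks to Spark, it only formats the statements.
    """
    cutoff = (now - timedelta(days=keep_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        # Compact small data files into larger ones.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # Drop snapshots older than the cutoff; this also limits how far back time travel can go.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')",
    ]


# Hypothetical catalog/table names and a fixed "now" so the output is reproducible.
statements = iceberg_maintenance_sql(
    "glue_catalog", "db.events", 7, datetime(2024, 7, 22, tzinfo=timezone.utc)
)
for statement in statements:
    print(statement)  # In a configured Spark session: spark.sql(statement)
```

Expiring snapshots trades away older time-travel points for reclaimed storage, so the retention window is a judgment call per table.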

  • @arjunshah8763
    @arjunshah8763 2 years ago

    Hi Damon. Thank you, but we are trying to implement the Iceberg format using Glue. Do you have any idea if Glue Spark will support Iceberg?

    • @dacort
      @dacort 2 years ago +1

      I haven't used it personally, but it looks like there is an Iceberg connector you can subscribe to from Glue Studio. Dremio has a pretty good tutorial about it here: www.dremio.com/resources/tutorials/getting-started-with-apache-iceberg-using-aws-glue-and-dremio/

    • @arjunshah8763
      @arjunshah8763 2 years ago

      @dacort Yes, I saw that one, but it doesn't do the CDC part. The one you have on your channel is what we are looking for. Hope we can replicate the same using the Glue connector for Iceberg. So far no luck, but we will work with support if the connector does not work.

  • @arjunshah8763
    @arjunshah8763 2 years ago

    Does this same approach work with Spark in a Glue job? I'm trying it, but with no luck so far.

    • @dacort
      @dacort 2 years ago

      Hi Arjun - Glue does have the ability to connect to Hudi tables, but there are some different steps to set it up. You can find more details here: aws.amazon.com/blogs/big-data/writing-to-apache-hudi-tables-using-aws-glue-connector/

    • @arun.ayilliath
      @arun.ayilliath 2 years ago

      It should work provided you supply the necessary dependencies and Spark configs. I have done basic CRUD on all of these table formats in Glue without using connectors.
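
To make "the necessary Spark configs" a little more concrete, here is a minimal sketch of the session settings that each project's documentation commonly lists. It is not the commenter's actual setup: the function name, the catalog name `glue_catalog`, and the warehouse path `s3://example-bucket/warehouse/` are hypothetical, and the matching JARs for each format still have to be supplied to the job separately.

```python
def table_format_conf(fmt: str, catalog: str = "glue_catalog",
                      warehouse: str = "s3://example-bucket/warehouse/") -> dict:
    """Return Spark session settings commonly documented for a table format.

    `catalog` and `warehouse` are placeholders and only apply to Iceberg here.
    """
    if fmt == "iceberg":
        prefix = f"spark.sql.catalog.{catalog}"
        return {
            "spark.sql.extensions":
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            prefix: "org.apache.iceberg.spark.SparkCatalog",
            # Use the AWS Glue Data Catalog as the Iceberg catalog backend.
            f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
            f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
            f"{prefix}.warehouse": warehouse,
        }
    if fmt == "delta":
        return {
            "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
            "spark.sql.catalog.spark_catalog":
                "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        }
    if fmt == "hudi":
        return {
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
            "spark.sql.hive.convertMetastoreParquet": "false",
        }
    raise ValueError(f"unknown table format: {fmt}")


# Print the settings for one format as key=value pairs.
for key, value in table_format_conf("delta").items():
    print(f"{key}={value}")
```

Each pair can be passed to `SparkSession.builder.config(key, value)` before the session is created; how the settings and JARs are handed to a Glue job depends on the Glue version and is not covered here.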