How to MERGE your Database into a Data Lake on AWS | Change Data Capture | Apache Iceberg

  • Published 19 Nov 2024

COMMENTS • 9

  • @raghuerumal
    @raghuerumal 7 months ago +1

    Good job Thomas ... liked your demo and explanation. Please share a blog post with code snippets for the Lambda and Glue job. Thank you

    • @DataMyselfAI
      @DataMyselfAI 7 months ago +3

      Thank you for the positive feedback :) You can find the blog post with all the code shown here: bit.ly/4aONz1M
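
      For readers who want the gist without opening the link: the core of the approach is a Spark SQL MERGE of the incremental CDC records into the Iceberg table. A minimal sketch, not the exact blog code; the catalog, table, and column names are hypothetical, and the CDC records are assumed to carry an op flag ('I'/'U'/'D'):

          # Register the incremental CDC records as a temp view (hypothetical name).
          hostsIncrementalInputDF.createOrReplaceTempView("hosts_updates")

          # Upsert into the Iceberg target: deletes drop matched rows,
          # updates overwrite them, and new non-delete rows are inserted.
          spark.sql("""
              MERGE INTO glue_catalog.silver.hosts AS t
              USING hosts_updates AS s
                ON t.host_id = s.host_id
              WHEN MATCHED AND s.op = 'D' THEN DELETE
              WHEN MATCHED THEN UPDATE SET *
              WHEN NOT MATCHED AND s.op != 'D' THEN INSERT *
          """)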

  • @gatorpika
    @gatorpika 2 months ago

    Watched your first two videos, liked and subscribed. Great stuff! I have never tried CDC as I am old-school batch, but the thing that always freaked me out was having to go back and reload from bronze because something happened to the related target in silver; it seems I would always have to reload from the beginning, starting with the first full load. With batch I could identify the time period that was effed up and just reload that. Is that a correct assumption, and if so, how is that normally handled in practice to avoid huge multi-year reloads? I am assuming the source data is gone due to shorter retention.

    • @DataMyselfAI
      @DataMyselfAI 2 months ago

      Thanks 🙏 Yeah, you're right: production-ready, robust implementations of CDC can be a headache. That's why there are reliable, ready-to-use solutions like Delta Live Tables in Databricks that can handle it efficiently; a rough sketch of that API follows below.
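
      In Databricks, the Delta Live Tables apply_changes API expresses such a CDC flow declaratively. A minimal sketch, assuming a bronze CDC feed with op and updated_at columns (all names here are hypothetical):

          import dlt
          from pyspark.sql.functions import col, expr

          # Streaming target table that DLT keeps in sync with the CDC feed.
          dlt.create_streaming_table("silver_hosts")

          # Apply inserts, updates, and deletes from the bronze CDC feed,
          # keyed on host_id and ordered by updated_at (SCD type 1).
          dlt.apply_changes(
              target="silver_hosts",
              source="bronze_hosts_cdc",        # hypothetical bronze CDC view
              keys=["host_id"],
              sequence_by=col("updated_at"),
              apply_as_deletes=expr("op = 'D'"),
              stored_as_scd_type=1,
          )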

  • @PhaniBhushan-f5w
    @PhaniBhushan-f5w 2 months ago

    Can you please make a video on "Use a reusable ETL framework in your AWS lake house architecture" ?

    • @DataMyselfAI
      @DataMyselfAI 2 months ago

      I will put it on my list. You could use dbt for that, or are you interested in an AWS-native solution? :)

    • @PhaniBhushan-f5w
      @PhaniBhushan-f5w 2 months ago

      @@DataMyselfAI , here is the reference link : aws.amazon.com/blogs/architecture/use-a-reusable-etl-framework-in-your-aws-lake-house-architecture/

  • @ManishJindalmanisism
    @ManishJindalmanisism 3 months ago

    Hi Thomas, I have one question on this. When you create hostsIncrementalInputDF in Glue, you read the full bronze table every time and then clean/transform it. Won't that be a waste of resources as the table grows over time?
    Shouldn't this DataFrame pick up and process only those records from the bronze table that have changed or are new since the last run?

    • @DataMyselfAI
      @DataMyselfAI 3 months ago

      Hi Manish, you are absolutely correct that this would be a waste of resources and incur unnecessary transformations.
      That's why I activated Glue job bookmarks for the job, so that only files added since the last run are picked up. Also, this is more of a proof of concept; in a real scenario, we would need a more robust setup to ensure that everything works correctly even if the job fails. A minimal sketch of the bookmark setup is below.
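
      A minimal sketch of a Glue job script with bookmarks in play (the database and table names are hypothetical, and the job-bookmark option must also be enabled in the job's configuration):

          import sys

          from awsglue.context import GlueContext
          from awsglue.job import Job
          from awsglue.utils import getResolvedOptions
          from pyspark.context import SparkContext

          args = getResolvedOptions(sys.argv, ["JOB_NAME"])
          glueContext = GlueContext(SparkContext())
          job = Job(glueContext)
          job.init(args["JOB_NAME"], args)  # bookmark state is tracked per job

          # With bookmarks enabled, this read only sees files added to the
          # bronze table since the last successful run; transformation_ctx
          # names the bookmark state for this particular read.
          hosts_dyf = glueContext.create_dynamic_frame.from_catalog(
              database="bronze",            # hypothetical database name
              table_name="hosts",           # hypothetical table name
              transformation_ctx="hosts_bronze_read",
          )
          hostsIncrementalInputDF = hosts_dyf.toDF()

          # ... clean / transform / MERGE into the silver table here ...

          job.commit()  # advances the bookmark only after a successful run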