Watched your first two videos, liked and subscribed. Great stuff! I have never tried CDC as I am old skool batch, but the thing that always freaked me out was having to go back and reload from bronze because something happened to the related target in silver: it seems I would always have to reload from the beginning, starting with the first full load. With batch I could identify the time period that was effed up and just reload that. Is that a correct assumption, and if so, how is that normally handled in practice to avoid huge multi-year reloads? I am assuming the source data is gone due to shorter retention.
Thanks 🙏 Yeah you're right, production-ready, robust implementations of CDC can be a headache. That's why there are reliable, ready-to-use solutions like Delta Live Tables in Databricks that can handle it efficiently.
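Just to give a rough idea of what that looks like, here is a minimal sketch using the DLT Python API. The table and column names (host_cdc_raw, hosts, host_id, updated_at, operation) are placeholders I made up for the example, not the ones from the video:

```python
import dlt
from pyspark.sql.functions import expr

# Bronze: raw CDC records as landed from the source.
# In a DLT pipeline, `spark` is provided by the runtime.
@dlt.view
def host_changes():
    return spark.readStream.table("bronze.host_cdc_raw")

# Silver: target table that DLT keeps in sync with the change feed.
dlt.create_streaming_table("hosts")

dlt.apply_changes(
    target="hosts",
    source="host_changes",
    keys=["host_id"],                                # primary key of each record
    sequence_by="updated_at",                        # ordering column to resolve out-of-order events
    apply_as_deletes=expr("operation = 'DELETE'"),   # rows flagged as deletes remove the key
    except_column_list=["operation"],                # drop the CDC metadata column from the target
    stored_as_scd_type=1,                            # keep only the latest version per key
)
```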
Hi Thomas, I have one question on this. When you are creating hostsIncrementalInputDF in Glue, every time you will read the full bronze table and then do clean/transformation over it. Won't that be a waste of resources as the table grows over time? Shouldn't this data frame pick up and process only those records from the bronze table that have changed or are new since the last run?
Hi Manish, you are absolutely correct that this would be a waste of resources and incur unnecessary transformations. That's why I activated Glue job bookmarks for the job, so that only new files are picked up compared to the last run. Also, this is more of a proof of concept. In a real scenario, we would need a more robust setup to ensure that everything works correctly, even if the job fails.
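For reference, the bookmark-enabled read in the Glue job looks roughly like this; the database and table names below are placeholders, not the exact ones from the demo:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # bookmark state is tracked per job

# With job bookmarks enabled, this read returns only files added since the
# last successful run; transformation_ctx is the key Glue uses to store
# the bookmark state for this particular read.
hostsIncrementalInputDyf = glueContext.create_dynamic_frame.from_catalog(
    database="bronze_db",        # placeholder database name
    table_name="hosts_cdc",      # placeholder table name
    transformation_ctx="hostsIncrementalInputDyf",
)

hostsIncrementalInputDF = hostsIncrementalInputDyf.toDF()
# ... clean/transform only the new records and write them to silver ...

job.commit()  # advances the bookmark so the next run skips these files
```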
Good job Thomas ... liked your demo and explanation. Please share the blog with code snippets for the Lambda and Glue job. Thank you
Thank you for the positive feedback :) You can find the blog post with all code shown here: bit.ly/4aONz1M
Can you please make a video on "Use a reusable ETL framework in your AWS lake house architecture"?
I will put it on my list. You could use dbt for that, or are you interested in an AWS-native solution? :)
@DataMyselfAI Here is the reference link: aws.amazon.com/blogs/architecture/use-a-reusable-etl-framework-in-your-aws-lake-house-architecture/
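For anyone curious what that could look like in practice: such a framework usually boils down to a config-driven job where sources, transforms, and targets are parameters instead of being hard-coded. A very rough, hypothetical sketch (none of these names or settings come from the linked AWS blog post):

```python
import json

from pyspark.sql import DataFrame, SparkSession

# Hypothetical per-dataset config: everything that varies between tables
# lives here, so the same job code can be reused for every dataset.
CONFIG = json.loads("""
{
  "source_path": "s3://my-bronze-bucket/hosts/",
  "target_path": "s3://my-silver-bucket/hosts/",
  "format": "parquet",
  "drop_columns": ["_raw_payload"],
  "dedupe_keys": ["host_id"]
}
""")

def run_etl(spark: SparkSession, cfg: dict) -> None:
    # Generic read -> clean -> write pipeline driven entirely by the config.
    df: DataFrame = spark.read.format(cfg["format"]).load(cfg["source_path"])
    df = df.drop(*cfg.get("drop_columns", []))
    if cfg.get("dedupe_keys"):
        df = df.dropDuplicates(cfg["dedupe_keys"])
    df.write.mode("append").format(cfg["format"]).save(cfg["target_path"])

if __name__ == "__main__":
    run_etl(SparkSession.builder.getOrCreate(), CONFIG)
```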