Data Engineering Made Easy: Build Datalake on S3 with Apache Hudi & Glue Hands-on Labs for Beginners

  • Published 20 Dec 2024

COMMENTS • 41

  • @SoumilShah
    @SoumilShah 1 year ago

    FYI for people who see an error:
    Make sure you remove the "a" from the S3 path.
    Instead of s3a:// use s3://
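The fix above can be sketched as a small helper that rewrites the URI scheme before the path is passed to the job. This is only an illustration: the function name `to_s3_scheme` and the bucket name are hypothetical and do not come from the video.

```python
def to_s3_scheme(path: str) -> str:
    """Rewrite an s3a:// URI to the plain s3:// scheme; leave other paths untouched."""
    prefix = "s3a://"
    if path.startswith(prefix):
        return "s3://" + path[len(prefix):]
    return path


# Hypothetical bucket and prefix, used only to show the rewrite.
print(to_s3_scheme("s3a://my-example-bucket/hudi/my_table/"))
```

Paths that already use s3:// pass through unchanged, so the helper is safe to apply to every path.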

  • @finedinerest
    @finedinerest 1 year ago

    Great video, comprehensively covering all the major Hudi features a beginner should know. Thanks, Soumil, for taking the time to make such videos.

  • @KushwanthK
    @KushwanthK 10 months ago

    OMG, I enjoyed watching it... great job 👏 It brushed up all my concepts for interviews. It was so detailed, and you made it easy :)

  • @sandeepmodaliar6980
    @sandeepmodaliar6980 2 years ago +3

    Great video!!!! Always a lot of learning through your channel!!!!

  • @ayushnauty4489
    @ayushnauty4489 1 year ago

    You're doing great work. I really appreciate the knowledge you shared. 🤝🏻

  • @ashokjangam7329
    @ashokjangam7329 5 months ago

    @soumilshah Thanks for your informative video, but the link to the PDF files in the description is not working. Could you please update it with the right URL?

  • @JiyuKim-sr1mi
    @JiyuKim-sr1mi 1 year ago

    Thank you for the video. I have a question about the precombine field. Why is the precombine field required for upsert and delete operations even if there is a primary key?
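As background to this question: the record key only identifies a row, while the precombine field decides which version wins when several incoming records share the same key. The plain-Python sketch below illustrates that idea only. It is not Hudi's implementation, and the function name, field names, and tie-breaking rule (the later record wins on equal values) are assumptions made for the example.

```python
from typing import Any, Dict, Iterable, List


def dedupe_by_precombine(
    records: Iterable[Dict[str, Any]],
    key_field: str,
    precombine_field: str,
) -> List[Dict[str, Any]]:
    """Keep, for each key, the record with the largest precombine value."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for rec in records:
        key = rec[key_field]
        current = latest.get(key)
        # Assumption for this sketch: on a tie, the later record wins.
        if current is None or rec[precombine_field] >= current[precombine_field]:
            latest[key] = rec
    return list(latest.values())


# Two versions of id=1 arrive in the same batch; "ts" breaks the tie.
batch = [
    {"id": 1, "name": "old", "ts": 100},
    {"id": 1, "name": "new", "ts": 200},
    {"id": 2, "name": "only", "ts": 150},
]
result = dedupe_by_precombine(batch, key_field="id", precombine_field="ts")
print(result)
```

Without an ordering field such as `ts`, there would be no deterministic way to choose between the two versions of `id=1`.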

  • @josemanuelmartinezsegura2815

    The code is not working for me; it tells me that a database has to be created. Can you help me?

  • @anshmishra9079
    @anshmishra9079 1 year ago

    Soumil, can you please share the link to the slides you were showing in the video?

  • @sandeepmodaliar6980
    @sandeepmodaliar6980 2 years ago +1

    One doubt, though: why do we have two tables for MoR and only one table for CoW? What are the use cases for each of these modes? Do they both support all CRUD operations?

    • @SoumilShah
      @SoumilShah 2 years ago +1

      MOR gives you two tables: read optimized (RO) and real time (RT). Remember, in MOR the RO table gets updated after a set number of commits, and as I mentioned, the merge happens while reading.
      With COW you only have one table, and the merge happens while you write the data.
      Again, they are both meant for specific use cases, depending on what your application needs:
      read or write latency. Based on that, you may decide to go with either.
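The choice described above comes down to one write option. The sketch below builds the option dictionary for either table type. The helper name `hudi_write_options` and the default field names `id` and `ts` are hypothetical; the `hoodie.*` keys are standard Hudi write configs. The compaction value of 5 is illustrative, so check the Hudi configuration docs for the default in your version.

```python
from typing import Dict


def hudi_write_options(
    table_name: str,
    table_type: str,
    record_key: str = "id",      # hypothetical column name
    precombine: str = "ts",      # hypothetical column name
) -> Dict[str, str]:
    """Build Hudi write options for a COPY_ON_WRITE or MERGE_ON_READ table."""
    if table_type not in ("COPY_ON_WRITE", "MERGE_ON_READ"):
        raise ValueError(f"unsupported table type: {table_type}")
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
    }
    if table_type == "MERGE_ON_READ":
        # Compaction merges log files into base files; the read-optimized
        # view only reflects updates after it runs. The value is illustrative.
        options["hoodie.compact.inline"] = "true"
        options["hoodie.compact.inline.max.delta.commits"] = "5"
    return options


mor_options = hudi_write_options("my_table", "MERGE_ON_READ")
cow_options = hudi_write_options("my_table", "COPY_ON_WRITE")
# In a Spark job these would be passed to the writer, for example:
# df.write.format("hudi").options(**mor_options).mode("append").save(path)
```

When a MERGE_ON_READ table is synced to the catalog, the two views usually appear as tables with the `_ro` and `_rt` suffixes, which is why two tables show up for MOR and one for COW.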

  • @sureshkumavat8300
    @sureshkumavat8300 11 months ago

    I just started. I was trying to download the PDF with the steps to perform, but it is not accessible. Can you please share the new link?

  • @bhavanisudhas
    @bhavanisudhas 2 years ago

    Great video. Love the hands on lab material!

  • @janocamachovicente4645
    @janocamachovicente4645 1 year ago

    Hi Soumil! Congrats on an excellent video. I have a question: do you know what permissions are required in the IAM policy instead of using admin permissions?

  • @wuerikehenriquedasilvacava928

    You're doing an amazing job! I have just one doubt: why was the connector not necessary in this video, when I saw it used in the other videos?

  • @dulcemendes2939
    @dulcemendes2939 1 year ago

    Hi, great videos, but do you have any video on Lambda Python code to bulk insert and upsert data into a data lake in a Hudi table in Parquet format? It is crazy, I cannot find anything. PySpark is giving me a lot of Java errors :( Thanks in advance :)

    • @sanjaybedwal2385
      @sanjaybedwal2385 1 year ago

      But why do you want to use Lambda for ETL kind of work? Lambda is supposed to be used for tiny jobs, not bulk inserts and updates.

  • @swetapandey14
    @swetapandey14 9 months ago

    PDF is not available 😢

  • @RandomG677
    @RandomG677 1 year ago

    Thanks, Soumil. Great starter! I have some questions:
    How big is the dataset? The select after the update took about 8 secs and the delete took 8 mins. Is that normal? It sounds quite slow.
    How does Hudi compare to Iceberg and Delta Lake in terms of performance, in your experience?

    • @SoumilShah
      @SoumilShah 1 year ago

      It depends; there are a lot of parameters that you can tune.
      Are you using indexes?
      Are you using partitions?
      Did you use the right table type for your application?
      Are you using the cleaner utility?
      Are you using clustering?
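The checklist above maps onto a handful of Hudi write options. The sketch below collects them in one dictionary. The helper name `hudi_tuning_options` and the partition column `country` are hypothetical, and every value is an illustrative starting point under those assumptions, not a recommendation from the video.

```python
from typing import Dict


def hudi_tuning_options(partition_field: str) -> Dict[str, str]:
    """Collect the tuning knobs from the checklist into one options dict."""
    return {
        # Index used to locate existing records during upserts and deletes.
        "hoodie.index.type": "BLOOM",
        # Partition column, so writes and queries can prune by partition.
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Cleaner: remove old file versions automatically after commits.
        "hoodie.clean.automatic": "true",
        "hoodie.cleaner.commits.retained": "10",
        # Clustering: periodically rewrite small files into larger ones.
        "hoodie.clustering.inline": "true",
        "hoodie.clustering.inline.max.commits": "4",
    }


tuning = hudi_tuning_options("country")  # "country" is a hypothetical column
for key, value in tuning.items():
    print(f"{key} = {value}")
```

These options can be merged with the base write options of a job; which ones help depends on the data size and access pattern, as the reply says.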

  • @shravyaR-z3p
    @shravyaR-z3p 1 year ago

    Hi Soumil, great video. I have a question: you mentioned that RO will perform the merge depending on a condition. So what is the default condition when we don't specify any? Will it never update the changes?

    • @SoumilShah
      @SoumilShah 1 year ago

      It happens after a set number of commits.
      I don't know the exact number; you need to refer to the docs.

  • @bigDataBala
    @bigDataBala 7 months ago

    The gdrive link is not working; please check.

  • @pragattiwari5530
    @pragattiwari5530 1 year ago

    I am getting an error that it cannot sync using the meta sync class HiveSyncTool. Can you please tell me why this error occurs?

    • @SoumilShah
      @SoumilShah 1 year ago +1

      Did you use the code I provided?

    • @pragattiwari5530
      @pragattiwari5530 1 year ago

      @SoumilShah Thanks, the issue got fixed.

    • @abdulhalimaziz5220
      @abdulhalimaziz5220 1 year ago

      Hi @pragattiwari5530, how did you solve this? I'm still getting this error.

    • @bharathballamudi136
      @bharathballamudi136 6 months ago

      I'm getting the same error even though I have literally copied the code by @SoumilShah. @pragattiwari5530, how did you fix it?

    • @bharathballamudi136
      @bharathballamudi136 6 months ago

      I face this issue now. Can you help me overcome it?

  • @nandkarthik
    @nandkarthik 4 months ago

    Man, you are awesome

  • @selmarhn
    @selmarhn 1 year ago

    Thank you so much! Such great material!

  • @arifurrahman0014
    @arifurrahman0014 1 year ago

    I need your help

  • @ramanmama
    @ramanmama 1 year ago

    Thanks so much for doing this!

  • @CHiRaStar1
    @CHiRaStar1 1 year ago

    Hi Soumil, I have sent a connect invite from your blog. Hope I will get a response.

    • @SoumilShah
      @SoumilShah 1 year ago

      I didn't receive it.
      Can you maybe send an email?
      shahsoumil519@gmail.com

  • @labcodePython
    @labcodePython 1 year ago

    Has any of you faced an issue with the delete command? After running it I can no longer query the RT table, and the asset is not being removed.
    Athena complains about GENERIC_INTERNAL_ERROR: org/objenesis/strategy/InstantiatorStrategy

    • @SoumilShah
      @SoumilShah 1 year ago +1

      Thanks, yes, there is an open ticket.
      This is a known issue on the AWS side. Here is the ticket where you can watch the status:
      github.com/apache/hudi/issues/7430#issuecomment-1373626282
      Please note this only occurs for MOR tables.