How Salting Can Reduce Data Skew By 99%

  • Published 24 Dec 2024

COMMENTS • 50

  • @gabriells9074
    @gabriells9074 1 year ago +4

    Hi Afaque, thank you for another great explanation. I have a question: since AQE splits skewed partitions into smaller ones, is salting still useful when AQE is enabled?

  • @afzalandthedreams
    @afzalandthedreams 6 days ago

    Hi Ahmad, I see that df_skew has 999,990 rows of value = 0 in partition 0 and df_uniform has 1,000,000 unique values evenly distributed across 3 partitions (i.e. approx. 333,333 in each partition). Can you help me understand how the count 1000005 appeared after the join at @14:42?

  • @Wonderscope1
    @Wonderscope1 11 months ago +6

    Thanks for the great content. You should have used the Salt Bae gesture when you said salting :)
    Is salting still a good approach if the join is between two large datasets with hundreds of millions of rows? Explode will increase the number of rows for one dataset. Let's say 100,000,000 * 200 (Salt_Number) = 20,000,000,000 rows.
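
The row blow-up asked about above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from the video: `SALT_NUMBER`, `salt_skewed`, `explode_other` and `hash_join` are hypothetical names, and lists of tuples stand in for DataFrames. The skewed side gets one random salt per row, while the other side is replicated once per salt value, so only that side grows by a factor of the salt number.

```python
import random

SALT_NUMBER = 4  # hypothetical value; the comment above uses 200


def salt_skewed(rows, salt_number, seed=0):
    """Attach one random salt in [0, salt_number) to every row of the skewed side."""
    rng = random.Random(seed)
    return [((key, rng.randrange(salt_number)), payload) for key, payload in rows]


def explode_other(rows, salt_number):
    """Replicate every row of the other side once per salt value."""
    return [((key, salt), payload) for key, payload in rows for salt in range(salt_number)]


def hash_join(left, right):
    """Plain inner join on the first element of each tuple."""
    index = {}
    for key, payload in right:
        index.setdefault(key, []).append(payload)
    return [(key, lp, rp) for key, lp in left for rp in index.get(key, [])]


# Toy data: key 0 is heavily over-represented on the left side.
skewed = [(0, f"fact{i}") for i in range(10)] + [(1, "fact_a"), (2, "fact_b")]
other = [(0, "dim0"), (1, "dim1"), (2, "dim2")]

salted_left = salt_skewed(skewed, SALT_NUMBER)
exploded_right = explode_other(other, SALT_NUMBER)

joined = hash_join(salted_left, exploded_right)
plain = hash_join(skewed, other)

# Dropping the salt from the join key gives back the unsalted join result.
unsalted = sorted((key[0], lp, rp) for key, lp, rp in joined)

# The arithmetic from the comment: rows on the exploded side.
commenter_rows = 100_000_000 * 200
```

The join output is unchanged, but the exploded side is `SALT_NUMBER` times larger, which is why a smaller salt number, or salting only the hot keys, is usually worth considering when both sides are large.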

  • @BabaiChakraborty-ss8pt
    @BabaiChakraborty-ss8pt 29 days ago

    Man, your Spark tutorials are awesome. Can you also do some warehouse scalability stuff?

    • @afaqueahmad7117
      @afaqueahmad7117  28 days ago +1

      Thanks man :) I've some stuff coming up on Data Modeling

  • @dhavaldalasaniya
    @dhavaldalasaniya 5 months ago +2

    These are excellent Spark videos. It is a perfect explanation of Spark performance concepts.

    • @afaqueahmad7117
      @afaqueahmad7117  5 months ago

      Many thanks @dhavaldalasaniya, this means a lot, appreciate it :)

  • @sasadsasadsad
    @sasadsasadsad 6 months ago +2

    Precious 30 minutes, quality content

    • @afaqueahmad7117
      @afaqueahmad7117  6 months ago

      Thank you @sasadsasadsad, appreciate it :)

  • @HamsAnsari
    @HamsAnsari 1 year ago

    I have read and watched many things related to salting but this visual explanation just makes it really easy to comprehend it, plus really well articulated. Waiting for more videos to learn from :)
    Also could you recommend some books or other resources that have enabled you to attain this level of knowledge, Thanks!

    • @afaqueahmad7117
      @afaqueahmad7117  1 year ago +1

      Hey @user-nz7uh1qo5o, many thanks for the kind words, it means a lot to me, and I'm glad to know that the video was helpful. Most of the content is based on my work experience plus good ad-hoc content on Medium that I could relate to. My only humble suggestion is to be ruthless, get your hands dirty, question everything that's happening and search the internet if anything doesn't make sense :)

  • @mertboschbosch
    @mertboschbosch 14 days ago

    If we have the default of 200 shuffle partitions, we will have a 200 times bigger df.
    E.g. for value 0 we will have 200 different salt values. Do you think setting salt_number = spark.sql.shuffle.partitions would be efficient?

  • @kreativeaman7688
    @kreativeaman7688 1 month ago

    What notebook / sketch app are you using? Btw great playlist, clearly explained concepts👍

    • @afaqueahmad7117
      @afaqueahmad7117  1 month ago

      Appreciate it @kreativeaman7688 :)
      Notion for the notes, Nebo (on iPad) for sketch

  • @janb4637
    @janb4637 3 months ago

    I have never seen such a detailed explanation. Thank you very much @afaque Ahmad. Is there any way we can get the document?

    • @afaqueahmad7117
      @afaqueahmad7117  3 months ago

      Appreciate it @janb4637, let me try and put it on GitHub :)

  • @dib4027
    @dib4027 2 months ago

    Hi, I saw in the Spark UI that AQE sometimes handles the data skew. Can enabling AQE be a solution to the data skew problem?

    • @afaqueahmad7117
      @afaqueahmad7117  1 month ago

      Hey @dib4027, it can indeed be a great solution to data skew. However, keep in mind that AQE may not resolve extreme skew: while it is good at splitting skewed partitions and switching to broadcast joins, extreme skew may still cause executor OOM errors and spill to disk that are difficult to manage and resolve. In such cases you will need to resort to salting and other approaches and do the tuning on your own.
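
For context on the reply above, here is a rough sketch of the rule AQE uses to flag a partition as skewed, written in plain Python so it runs without a cluster. The four configuration keys are real Spark SQL settings; the values shown are the defaults as I recall them from the Spark documentation, so check them against your version. `skewed_partitions` is a hypothetical helper that only mimics the documented rule (larger than factor times the median partition size and also larger than the byte threshold).

```python
from statistics import median

# Spark SQL settings that control AQE skew-join handling.
# Values are assumed defaults; verify them for your Spark version.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "256MB",
}

MB = 1024 * 1024


def skewed_partitions(sizes_in_bytes, factor=5, threshold_bytes=256 * MB):
    """Return indexes of partitions that exceed both the relative and absolute limits."""
    mid = median(sizes_in_bytes)
    return [
        i
        for i, size in enumerate(sizes_in_bytes)
        if size > factor * mid and size > threshold_bytes
    ]


# One partition dwarfs the rest, so it is flagged.
flagged = skewed_partitions([40 * MB, 50 * MB, 45 * MB, 2000 * MB])

# Large relative to the median, but below the byte threshold, so nothing is flagged.
not_flagged = skewed_partitions([40 * MB, 50 * MB, 45 * MB, 240 * MB])
```

Partitions that slip under either limit are left alone, which is one case where manual tuning or salting may still be needed.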

  • @anubhavrastogi7463
    @anubhavrastogi7463 8 months ago

    Hi, can you please help me understand why we are considering a salt number of 3 or 4? Should this be equal to the number of shuffle partitions we have, or to the number of distinct values in our dataset? Please explain.

  • @SHUBHAM_707
    @SHUBHAM_707 6 months ago

    What if the values are unique, as in a 1-to-1 join? Will it create skew?

  • @hanhtran167
    @hanhtran167 1 month ago

    OMG! Can I call this the best explanation?

    • @afaqueahmad7117
      @afaqueahmad7117  1 month ago

      Appreciate it @hanhtran167, glad to hear that :)

  • @Sandeep-bl9ji
    @Sandeep-bl9ji 10 months ago

    Nice explanation

  • @arghyakundu8558
    @arghyakundu8558 4 months ago

    Excellent Content..!! Loved It. Such detailed explanation on Salting Technique with Graphical Representation.

  • @meditation_in_nature
    @meditation_in_nature 25 days ago

    Shouldn't each partition hold unique values? After salting the 1s, values with 1 + salt and 2 are landing on the same partition.

  • @akshaybaura
    @akshaybaura 1 year ago

    Can you show us whether salting in aggregations is really worth it? I'm skeptical: the extra shuffles that salting introduces may hurt performance.

    • @afaqueahmad7117
      @afaqueahmad7117  1 year ago

      Hey @akshaybaura, there will indeed be a performance dip due to shuffles when using Salting, but, without Salting you're at the risk of either:
      a. Getting OOM (out of memory) errors.
      b. Your jobs running 5-10x slower because fewer resources (cores and memory) are being used while the others remain underutilised.
      However, even when using Salting, the performance largely depends on factors like the size of dataset and the correct use of Salt Number.
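
The extra shuffle discussed in this exchange comes from aggregating twice. The following is a minimal plain-Python sketch of that two-stage pattern, assuming a simple sum; `salted_sum` and its parameters are hypothetical names and this is not the video's code. Stage one groups on (key, salt) so that a hot key is spread over several groups, and stage two drops the salt and combines the partial results.

```python
import random
from collections import defaultdict


def salted_sum(rows, salt_number, seed=0):
    """Sum values per key in two stages: by (key, salt), then by key alone."""
    rng = random.Random(seed)

    # Stage 1: partial sums per (key, salt); a hot key is split across salts.
    partial = defaultdict(int)
    for key, value in rows:
        partial[(key, rng.randrange(salt_number))] += value

    # Stage 2: drop the salt and combine the partial sums per key.
    final = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal

    return dict(final), len(partial)


# Toy data: key 0 dominates.
rows = [(0, 1)] * 1000 + [(1, 5), (2, 7)]
totals, partial_groups = salted_sum(rows, salt_number=4)

# Reference result from a single-stage aggregation.
direct = defaultdict(int)
for key, value in rows:
    direct[key] += value
```

The totals match a single-stage aggregation; the cost is the second grouping step, which is the performance dip mentioned in the reply.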

  • @RaviSingh-dp6xc
    @RaviSingh-dp6xc 1 month ago

    again!! Perfectly explained 🤟

  • @rgv5966
    @rgv5966 5 months ago

    Hey @Afaque, great content as usual, but I thought this video could be a little concise, great work anyways!

    • @afaqueahmad7117
      @afaqueahmad7117  4 months ago

      Thank you @rgv5966 for the appreciation. Tried my best to keep it concise, but will take your feedback :)

  • @alokranjan7323
    @alokranjan7323 1 year ago

    How do you calculate hash(1, 0) % 3?

    • @vinothvk2711
      @vinothvk2711 11 months ago

      0%3

    • @afaqueahmad7117
      @afaqueahmad7117  11 months ago +1

      @vinothvk2711 is right. As outlined in the video, we're assuming h(1, 0) = 0, so it's equal to 0 % 3 = 0
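
The calculation in this exchange can be sketched as follows. This is an illustration only: `partition_for` is a hypothetical helper, and Python's built-in `hash` is a stand-in, since Spark uses its own hash function and the actual values will differ. The first call plugs in the video's assumption that h(1, 0) = 0.

```python
NUM_PARTITIONS = 3


def partition_for(key, salt, num_partitions=NUM_PARTITIONS, hash_fn=hash):
    """Map a (key, salt) pair to a partition index via hash modulo partition count."""
    return hash_fn((key, salt)) % num_partitions


def assumed_hash(pair):
    """The video's simplifying assumption: h(1, 0) = 0."""
    return 0


# With h(1, 0) assumed to be 0, the partition is 0 % 3 = 0.
partition_in_video = partition_for(1, 0, hash_fn=assumed_hash)

# With a real hash function, different salts for the same key can land on
# different partitions, which is what spreads a hot key out.
spread = {salt: partition_for(1, salt) for salt in range(3)}
```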

  • @sonlh81
    @sonlh81 4 months ago +1

    Not easy to understand, but it's great

  • @ATHARVA89
    @ATHARVA89 12 days ago

    awesome!

  • @MuhammadAhmad-do1sk
    @MuhammadAhmad-do1sk 7 months ago +1

    Thanks for this. Love from 🇵🇰

    • @afaqueahmad7117
      @afaqueahmad7117  7 months ago

      Appreciate it @MuhammadAhmad-do1sk, Love from India :)

  • @9figurelifestyle790
    @9figurelifestyle790 1 year ago +1

    @afaqueahmad7117 - Great topic and amazing explanation - Looking forward to learning more from you. One suggestion is to create more videos related to designing idempotent data pipelines, backfilling missed window data, simulating different production failures and how to approach them, coz I see more people are doing interview focused videos. These topics will mentor both entry level and mid level Data engineers to gain confidence in Data Engineering field

    • @afaqueahmad7117
      @afaqueahmad7117  1 year ago

      Glad you liked the video and the explanation!
      Really appreciate your feedback. Yes, all of that is in the roadmap, but for the upcoming year. The initial plan is to cover all aspects related to Performance Tuning + Foundations.

  • @shivoham5939
    @shivoham5939 2 months ago

    Bro, are you chewing paan while talking? The explanation is super.

    • @afaqueahmad7117
      @afaqueahmad7117  2 months ago +1

      Yes brother @shivoham5939, I was chewing paan :)

    • @shivoham5939
      @shivoham5939 2 months ago

      @@afaqueahmad7117 awesome 👌

    • @shivoham5939
      @shivoham5939 2 months ago

      @@afaqueahmad7117 bro i need one on one call with you i will pay as much you want plzzz

    • @shivoham5939
      @shivoham5939 2 months ago

      @@afaqueahmad7117 I have interview I need guidance

  • @gudiatoka
    @gudiatoka 7 months ago

    After Spark 3.0, salting is not useful

    • @afaqueahmad7117
      @afaqueahmad7117  7 months ago +1

      Hey @gudiatoka, I wish it was so, but just in case you're referring to AQE as the solution, it isn't always very helpful, so you still need to resort to salting.

    • @gudiatoka
      @gudiatoka 7 months ago

      @@afaqueahmad7117 Yes, AQE and partitioning are useful. With a larger DataFrame, when the salting key is applied to the smaller df it duplicates records and makes it more skewed, so the concept of salting isn't valid, at least for me... maybe it serves a different purpose.