Data Skew Drama? Not Anymore With Broadcast Joins & AQE

  • Published 29 Aug 2024

COMMENTS • 31

  • @cantcatchme8368
    @cantcatchme8368 7 days ago

    Excellent, keep going!

  • @Fullon2
    @Fullon2 11 months ago +2

    Incredible series, thank you Afaque Ahmad. Looking forward to the next videos.😃

    • @afaqueahmad7117
      @afaqueahmad7117  11 months ago +1

      Many thanks @Fullon2 for the kind words, really appreciate it! :)

  • @miguelruiz9772
    @miguelruiz9772 11 months ago +3

    Hi Afaque, great video and content :). It may be worth noting in the video the limitations of broadcast joins: the broadcast dataset needs to fit in both driver and executor memory, and if you have many executors, the broadcast may take longer than a shuffle sort-merge join; it could in fact time out.
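
    For reference, both limits are tunable. A minimal sketch of the relevant settings (real Spark configs; the values shown are the Spark 3.x defaults):

    ```python
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("broadcast-limits")
        # Tables at or below this size (bytes) are auto-broadcast; -1 disables it.
        .config("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
        # How long (seconds) the broadcast may take before the join times out.
        .config("spark.sql.broadcastTimeout", 300)
        .getOrCreate()
    )
    ```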

    • @afaqueahmad7117
      @afaqueahmad7117  10 months ago +1

      Thanks @miguelruiz9772, for the kind words, and for the feedback, makes sense! :)

  • @roksig3823
    @roksig3823 9 months ago

    Great explanation! I can understand SMJ and BCJ much better now. Thanks heaps!

  • @CoolGuy
    @CoolGuy 10 months ago

    Learnt about AQE today. Thanks for the video.

  • @RohanKumar-mh3pt
    @RohanKumar-mh3pt 11 months ago

    Amazing, you explain things in great depth in each video.

  • @OmairaParveen-uy7qt
    @OmairaParveen-uy7qt 11 months ago

    Amazing content!! Explained so well!

  • @snehitvaddi
    @snehitvaddi 10 days ago

    This is helpful, but I still have a few doubts.
    1. If a broadcast join is immune to skew, why does the salting technique exist?
    2. In the broadcast join example, the customer dataset appeared to sit outside of any executor. Where is it actually stored? How can we specify its storage location?

    • @shaifalipal9415
      @shaifalipal9415 10 days ago

      Broadcast is only possible if the other table is small enough to be replicated to every executor.
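
      To make this concrete, a minimal sketch using the explicit broadcast hint (all table and column names are hypothetical):

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import broadcast

      spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

      # Tiny dimension table vs. a larger fact table.
      customers = spark.createDataFrame(
          [(1, "Alice"), (2, "Bob")], ["customer_id", "name"])
      transactions = spark.createDataFrame(
          [(1, 100.0), (1, 250.0), (2, 80.0)], ["customer_id", "amount"])

      # broadcast() ships the small table to every executor, so the large
      # table is never shuffled by the join key - which is exactly why the
      # broadcasted side must be small enough to replicate.
      joined = transactions.join(broadcast(customers), "customer_id")
      joined.explain()  # the physical plan shows BroadcastHashJoin
      ```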

  • @ManaviVideos
    @ManaviVideos 11 months ago

    Thank you, Afaque Ahmad! 👍

  • @anandchandrashekhar2933
    @anandchandrashekhar2933 2 months ago

    Really great content, all of your videos. Thank you!! Just had a question out of curiosity: does AQE only coalesce shuffle partitions, or can it also, depending on the need, increase the shuffle partitions beyond 200?

    • @afaqueahmad7117
      @afaqueahmad7117  2 months ago +1

      Hey @anandchandrashekhar2933, appreciate the kind words. Yes, AQE can do both - increase (split) and decrease (coalesce) the number of shuffle partitions. A clear example is in the Spark DAGs video, where 1 skewed partition was split into 12. Refer here: ua-cam.com/video/O_45zAz1OGk/v-deo.html
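
      As an illustration, a sketch of the settings behind both directions (real Spark 3.2+ configs, defaults shown; assumes an existing `spark` session):

      ```python
      spark.conf.set("spark.sql.adaptive.enabled", "true")
      # Coalesce: merge many small shuffle partitions into fewer, bigger ones.
      spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
      # Split: break up a partition that is more than `factor` times the
      # median partition size AND larger than the byte threshold below.
      spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
      spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
      spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
      ```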

    • @anandchandrashekhar2933
      @anandchandrashekhar2933 2 months ago

      @@afaqueahmad7117 Ah, thank you, that really made it very clear. For some reason, I couldn't replicate the same when I ran your notebook on Databricks; even though I disabled broadcast hash join, it still ended up using broadcast instead of the AQE coalesce followed by sort merge. Maybe it's something specific to the Spark version I'm currently on. But that's all right. Thank you again :)

  • @vamsikrishnabhadragiri402
    @vamsikrishnabhadragiri402 5 months ago +1

    Hello Afaque, thanks for the informative video. What does "partition by join key" mean at 18:38?

    • @afaqueahmad7117
      @afaqueahmad7117  5 months ago

      Hey @vamsikrishnabhadragiri402, I'm referring to partitioning by `Customers.id`, basically doing a `.partitionBy("id")`. If you were to partition by `Customers.id`, there could be data skew because some customers have more transactions than others. So some `Customers.id` partition files will have many rows, while others will have very few.
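
      A hypothetical sketch of that skew (made-up data; assumes an existing `spark` session):

      ```python
      # Customer 1 is "hot": ~1000 transactions vs. 3 for customer 2.
      rows = [(1, 10.0)] * 1000 + [(2, 5.0)] * 3
      transactions = spark.createDataFrame(rows, ["id", "amount"])

      # Writing partitioned by the join key gives each id its own directory;
      # id=1 ends up with ~1000 rows while id=2 gets 3 - heavily skewed files.
      transactions.write.partitionBy("id").mode("overwrite").parquet("/tmp/tx_by_id")

      # Quick way to eyeball the skew before writing:
      transactions.groupBy("id").count().orderBy("count", ascending=False).show()
      ```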

  • @cantcatchme8368
    @cantcatchme8368 7 days ago

    I'm not able to see spill details in the Spark 3.5.2 UI?

  • @retenim28
    @retenim28 2 months ago

    Hi sir, great content as always. Just a question on the last part of the video: if I understood correctly, you said to repartition(3) the big table so that rows are evenly distributed across the 3 executors, and then apply the broadcast join. But in the code you only performed a broadcast join, without repartition(3). Why is that? I'm a little bit confused about that part. Thanks a lot.

    • @afaqueahmad7117
      @afaqueahmad7117  2 months ago

      Hey @retenim28, thank you, appreciate it.
      On the question - you're correct that I mentioned doing a `repartition(3)` when the table is big so that the rows get evenly partitioned. The reason I don't do a `repartition(3)` in the code is that the sample transactions table I'm using (supposedly the bigger table) isn't very big, hence repartitioning to even out the data is not needed. Hope that clarifies :)
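
      In code, the pattern described above looks roughly like this (a sketch reusing the hypothetical `transactions`/`customers` frames from the earlier example):

      ```python
      from pyspark.sql.functions import broadcast

      # When the big table is genuinely large and unevenly distributed,
      # spread its rows evenly across executors first...
      evened = transactions.repartition(3)
      # ...then broadcast-join the small table, so no shuffle-by-key happens.
      result = evened.join(broadcast(customers), "customer_id")
      ```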

    • @retenim28
      @retenim28 2 months ago

      @@afaqueahmad7117 This clarifies a lot, thank you. Another question: `repartition(3)` involves a shuffle, so theoretically it would be better to avoid it and only use the broadcast join, as you did in the video. So it seems to me there are two possible situations:
      1. Do `repartition(3)` and then the broadcast join: this involves a shuffle (bad) of the big table, but the data skew problem is solved, so each core processes the same amount of data.
      2. Avoid `repartition(3)` and just broadcast join: there is no shuffle (good) of the big table, but one core is forced to work with a huge amount of data compared to the remaining two.
      Which is the best path?
      In your code I tried both options and it looks like it's better to avoid `repartition(3)`. Am I missing something on this point? Sorry about the long message.

  • @adityeshchaturvedi6553
    @adityeshchaturvedi6553 10 months ago

    Hi Afaque, loving your videos. Great content. Just one doubt: isn't AQE automatically enabled from Spark 3.x? If yes, why do we explicitly need to set the two mentioned properties to true?

    • @afaqueahmad7117
      @afaqueahmad7117  10 months ago +2

      Hey @adityeshchaturvedi6553, thanks for the kind words. To answer your question, AQE (spark.sql.adaptive.enabled) defaults to false in Spark 3.0.0 (reference here: spark.apache.org/docs/3.0.0/sql-performance-tuning.html#adaptive-query-execution), while it defaults to true from 3.2.0 onwards.
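
      A quick way to check what a given runtime actually uses (assumes an existing `spark` session):

      ```python
      print(spark.version)
      # "false" on 3.0.x-3.1.x, "true" on 3.2.0+
      print(spark.conf.get("spark.sql.adaptive.enabled"))
      ```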

    • @adityeshchaturvedi6553
      @adityeshchaturvedi6553 10 months ago

      @@afaqueahmad7117 thanks a lot.
      Really appreciate your efforts!

  • @suruchijha3914
    @suruchijha3914 10 months ago

    Hi @afaque, could you please let me know what you mean by 15 distinct keys in a join?

    • @afaqueahmad7117
      @afaqueahmad7117  10 months ago

      Hey @suruchijha3914, by 15 distinct keys in a join, I'm referring to 15 unique values in the join column. For example, say you're joining sales data with product promotions data on `product_id`, with only 15 unique products / product_ids - that means only 15 distinct keys in the join.
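
      For illustration, a hypothetical check of join-key cardinality (made-up `sales` table; assumes a `spark` session):

      ```python
      from pyspark.sql.functions import countDistinct

      sales = spark.createDataFrame(
          [(1, "p1"), (2, "p2"), (3, "p1")], ["sale_id", "product_id"])

      sales.select(countDistinct("product_id").alias("n_keys")).show()
      # With only 15 distinct keys, at most 15 shuffle partitions can ever
      # receive data in a hash-partitioned join, no matter how high
      # spark.sql.shuffle.partitions is set.
      ```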

  • @miguelruiz9772
    @miguelruiz9772 11 months ago

    Also, have you actually seen real-life performance improvements from AQE in a pipeline? I always end up setting it to false to avoid unpredictable behavior.

    • @afaqueahmad7117
      @afaqueahmad7117  10 months ago

      Well, some of its functionalities work pretty well, like converting a sort merge join to a broadcast join and coalescing the number of partitions, but I do agree, others like optimizing skewed joins are hard to predict and understand.
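
      One middle ground, sketched below: keep the well-behaved AQE features on and opt out of only the skew-join handling (real Spark 3.x settings; assumes a `spark` session):

      ```python
      spark.conf.set("spark.sql.adaptive.enabled", "true")
      spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
      # Disable just the hard-to-predict piece instead of all of AQE.
      spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "false")
      ```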

  • @gohyuchen7465
    @gohyuchen7465 11 months ago

    Just one piece of feedback: at 11:10, your eyes keep darting from your script to your camera. Since this is a recorded video, it's perfectly fine to keep looking at your script throughout. Having your eyes constantly changing focus is slightly distracting.

    • @afaqueahmad7117
      @afaqueahmad7117  11 months ago +1

      I love your attention to detail; however, there's no script, it's just beginner me trying to adjust to the camera. Appreciate your feedback :)