Nice video. Please also include some pictorial representations to help visualize it better.
Step 1 is the shuffle, but at 11:49 you mention "There will be no shuffling if the data is colocated in the same partition".
How can data from the two tables to be merged be co-located in the same partition without any shuffling?
If it is already on the same node, then there is no need to shuffle.
I think he wants to say that if the partitions of both tables holding the same join key reside in the same executor, then there will be no shuffling.
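To make the replies above concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark itself, and every name in it (`partition_of`, `partition_table`, `local_join`, the sample rows, the partition count) is made up for illustration. It assumes both tables were already partitioned by the same function on the join key with the same number of partitions, so partition i of one table can only match partition i of the other and no rows need to move.

```python
# Toy model (not Spark): two tables that are already partitioned the same way.
NUM_PARTITIONS = 4  # assumed value, for illustration only


def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Deterministic stand-in for a hash partitioner; Spark uses its own hash.
    return key % num_partitions


def partition_table(rows, num_partitions=NUM_PARTITIONS):
    # Place each (key, value) row in the partition chosen by its key.
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


def local_join(left_parts, right_parts):
    # Join partition i of the left side only with partition i of the right
    # side. No row ever crosses a partition boundary, i.e. no shuffle.
    out = []
    for left, right in zip(left_parts, right_parts):
        lookup = {}
        for key, value in right:
            lookup.setdefault(key, []).append(value)
        for key, value in left:
            for other in lookup.get(key, []):
                out.append((key, value, other))
    return out


orders = [(1, "order-a"), (2, "order-b"), (5, "order-c"), (6, "order-d")]
customers = [(1, "alice"), (2, "bob"), (5, "carol"), (7, "dave")]

joined = sorted(local_join(partition_table(orders), partition_table(customers)))
print(joined)
```

In Spark itself, this situation usually arises when both tables are bucketed on the join key with the same number of buckets, or when both sides were already repartitioned on that key earlier in the job.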
Nice content; the only issue is that the voice was very low. You can boost the volume after recording.
Thanks:)
Great explanation. Thanks for the valuable video :)
Hello,
I find the content very interesting, especially the part on when the hash join is better than the sort merge join. Could you please tell me where you found the documentation on that?
Hi,
I like your Spark videos. Please create a dedicated video for top 100 most frequently used Spark Commands.
- Pankaj C
As per the documentation for RDBMSs, a hash join is faster than a sort merge join. I am assuming that in Spark as well, the first step for both is a shuffle in which rows with the same key value end up in the same partition, and only the per-partition join step after that differs. Why is sort merge mostly preferred in Spark?
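A rough sketch of what this question describes, in plain Python rather than Spark: both strategies share the same shuffle step and differ only in how each partition is joined afterwards. All function names and sample rows are hypothetical. One commonly cited reason for Spark's default is that the hash variant must hold the build side of each partition in memory, while the sort merge variant can spill sorted data to disk; the setting `spark.sql.join.preferSortMergeJoin` defaults to `true`.

```python
def shuffle(rows, num_partitions):
    # Shared first step: rows with the same key land in the same partition.
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[key % num_partitions].append((key, value))
    return parts


def hash_join_partition(left, right):
    # Build an in-memory hash table on the right side, then probe with left.
    table = {}
    for key, value in right:
        table.setdefault(key, []).append(value)
    return [(k, v, r) for k, v in left for r in table.get(k, [])]


def sort_merge_join_partition(left, right):
    # Sort both sides by key, then walk them together with two cursors.
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for _, r_value in right[j:j_end]:
                    out.append((lk, left[i][1], r_value))
                i += 1
            j = j_end
    return out


def join(left_rows, right_rows, per_partition_join, num_partitions=3):
    # Same shuffle for both strategies; only the per-partition step varies.
    left_parts = shuffle(left_rows, num_partitions)
    right_parts = shuffle(right_rows, num_partitions)
    out = []
    for l_part, r_part in zip(left_parts, right_parts):
        out.extend(per_partition_join(l_part, r_part))
    return sorted(out)


left = [(3, "l3"), (1, "l1"), (2, "l2a"), (2, "l2b")]
right = [(2, "r2a"), (4, "r4"), (1, "r1"), (2, "r2b")]

hash_result = join(left, right, hash_join_partition)
merge_result = join(left, right, sort_merge_join_partition)
print(hash_result == merge_result)
```

Both strategies return the same rows here; the difference in practice lies in memory use and sorting cost, which this toy example does not measure.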
Hi Viresh, the video has a great explanation. Thanks!! I am not sure how to determine the size limit for the smaller table to fit in memory (the Shuffle Hash Join case). Please help me with it.
+1
Try 1% of your executor memory.
What is the difference between a broadcast join and a map-side join? What was the need for the broadcast join when the map-side join was already available? Could you please explain if you have any idea about this?
So Shuffle Hash Join and Sort Merge Join have the same shuffle phase? Why not call it Shuffle Sort Merge Join? The current name makes it sound like there is no shuffle.
I felt like you are talking to yourself
Confusing :(
Feel free to post your questions/doubts
Voice and explanation not clear!
Please improve your speech clarity and accent. You skip some syllables.