Spark Join Without Shuffle | Spark Interview Question
- Published Jan 16, 2025
- #Spark #Join #Internals #Performance #Optimization #DeepDive #Shuffle: In this video, we discuss how to perform a join without a shuffle.
Please join my channel as a member to get additional benefits like materials on Big Data and Data Science, live streams for members, and more.
Click here to subscribe : / @techwithviresh
About us:
We are a technology consulting and training provider, specializing in areas such as Machine Learning, AI, Spark, Big Data, NoSQL, graph databases, Cassandra, and the Hadoop ecosystem.
Mastering Spark : • Spark Scenario Based I...
Mastering Hive : • Mastering Hive Tutoria...
Spark Interview Questions : • Cache vs Persist | Spa...
Mastering Hadoop : • Hadoop Tutorial | Map ...
Visit us :
Email: techwithviresh@gmail.com
Facebook : / tech-greens
Thanks for watching
Please Subscribe!!! Like, share and comment!!!!
That's what I was looking for. It's a great help, Viresh.
I'm really happy with your deep dive into Spark. Thank you.
small2 is not defined. Also, why is the shuffle cost of partitioning the two RDDs separately lower than the shuffle cost of joining them directly? They are basically doing the same thing: moving data with the same join key to the same executor.
I feel the title is misleading; repartitioning the two RDDs still involves a shuffle.
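For readers asking what the video actually does, here is a minimal sketch of the technique under discussion, with toy data; the names small, large, smallP, and largeP are illustrative, and spark is assumed to be a live SparkSession (as in spark-shell or a notebook):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Toy pair RDDs keyed by the join column (names and data are illustrative).
val small = spark.sparkContext.parallelize(Seq((1, "a"), (2, "b")))
val large = spark.sparkContext.parallelize(Seq((1, "x"), (2, "y"), (1, "z")))

// Pay the shuffle cost once, up front: co-partition both RDDs with the
// SAME partitioner instance and persist the partitioned layout.
val part = new HashPartitioner(2)
val smallP = small.partitionBy(part).persist(StorageLevel.MEMORY_AND_DISK)
val largeP = large.partitionBy(part).persist(StorageLevel.MEMORY_AND_DISK)

// Both sides now share a partitioner, so the join is planned as a
// narrow dependency: no additional shuffle happens at join time.
val joined = largeP.join(smallP)
joined.collect().foreach(println)
```

The shuffle is not eliminated outright: partitionBy pays it once, up front. Because both persisted RDDs then share the same partitioner, the join itself runs shuffle-free.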
Let's say you have two large DataFrames; how would you optimize the join then? And why are you using RDDs, given they are much slower than DataFrames?
Any suggestions for DataFrames?
Why are you converting the DataFrame to an RDD? That's bad practice in terms of performance.
What's the benefit of persisting the 2 RDDs?
Thanks Viresh. RDDs are used here; how would you do the same using a Dataset/DataFrame? And where did "small2" come from?
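Several comments ask for a Dataset/DataFrame equivalent. The closest analogue is bucketing, sketched here under the assumption of a SparkSession with table support; the table, column, and DataFrame names (largeDf, smallDf, id) are hypothetical, not from the video:

```scala
// Write both sides bucketed on the join key with the same bucket count.
largeDf.write.bucketBy(8, "id").sortBy("id").saveAsTable("large_bkt")
smallDf.write.bucketBy(8, "id").sortBy("id").saveAsTable("small_bkt")

// The sort-merge join can now read matching buckets directly;
// the plan should contain no Exchange before the join.
val joined = spark.table("large_bkt").join(spark.table("small_bkt"), "id")
joined.explain()
```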
Really nice, thanks bro. In line 14, should it be "small.partitioner.get" instead of "small2.partitioner.get"? And why is shuffle.partitions set to only 2?
Otherwise the remaining 198 partitions would be empty.
@TechWithViresh Is it "otherwise" or "in other words"? Do we want to keep 198 partitions empty?
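On the shuffle.partitions point, the likely reasoning (my inference, not a quote from the video) is that the demo data has only two distinct keys, so the default of 200 shuffle partitions would leave 198 empty:

```scala
// Default is 200; for a two-key toy dataset, 198 partitions would hold
// nothing and just add task-scheduling overhead.
spark.conf.set("spark.sql.shuffle.partitions", "2")
```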
I was asked: if I have two tables with the same volume of data, but one has 10 columns and the other has 3, how do you optimize that join?
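One standard answer to that interview question (a suggestion on my part, not from the video) is to prune unused columns before the join, so every shuffled record is smaller; wideDf, narrowDf, and the column names are hypothetical:

```scala
// Keep only the columns the downstream job actually needs from the
// wide table, so less data moves per record during the shuffle.
val wideSlim = wideDf.select("id", "col_a", "col_b")
val joined = wideSlim.join(narrowDf, "id")
```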
What's the book/picture in the video?
I have been following the series; it's pretty good, but this video is not at all clear. You should make another one on the same question.
The concept is really worth testing. The code is incomplete in places; I took time to fill the gaps. The last line is display(); will it work in Scala Spark? 🙄
This code will run fine on Azure Databricks.
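To expand on that reply: display() is a Databricks notebook utility, not part of the open-source Spark API. A portable alternative, assuming joined is the pair RDD from the sketch above:

```scala
import spark.implicits._

// Outside Databricks, convert the RDD to a DataFrame and use show().
joined.toDF("key", "pair").show()
```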
Where did small2 come from? There are typos; can you please update it?
Shuffle during the join, or repartitioning before the join: you are saying the second one is better, right? What's the difference? You haven't explained why it is better. Someone has to take care of the partitioning either way: either the join shuffles, or we repartition ourselves. Please let us know why this approach is better.
Even with repartitioning we have to move data to different partitions, which causes a shuffle, doesn't it?
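Yes, partitionBy itself shuffles; the point is that this shuffle is paid once and the result is persisted, so later joins against the same RDD stay narrow. A sketch continuing from the co-partitioned RDDs above (the extra RDD other is hypothetical):

```scala
// The first action materializes the one-time shuffle while persisting
// smallP and largeP.
largeP.join(smallP).count()

// A later, different join against the SAME persisted largeP reuses its
// partitioning; co-partitioning the new side with the same instance of
// `part` keeps the join itself shuffle-free.
val other = spark.sparkContext
  .parallelize(Seq((1, 10), (2, 20)))
  .partitionBy(part)
largeP.join(other).count()
```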
partitioner is None for largeRDD at line no. 14, so partitioner.get fails.
Getting an error on line number 14:
error: value partitioner is not a member of org.apache.spark.sql.DataFrame
Kindly suggest.
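That error is expected: partitioner is a member of RDD, not of DataFrame. A sketch of the distinction, assuming a DataFrame df whose first column is an integer join key (an assumption for illustration):

```scala
import org.apache.spark.HashPartitioner

// DataFrames expose no .partitioner; drop to the RDD API first.
val pairRdd = df.rdd.map(row => (row.getInt(0), row))
val partitioned = pairRdd.partitionBy(new HashPartitioner(2))

partitioned.partitioner // Some(org.apache.spark.HashPartitioner@...)
df.rdd.partitioner      // None: a freshly converted RDD has no partitioner
```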
Very nice content on this channel, thanks for that. Q: can range partitioning work here?
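In principle, yes: any Partitioner co-locates equal keys, as long as both sides use the same instance. A sketch with RangePartitioner, reusing the small and large RDDs from the earlier example:

```scala
import org.apache.spark.RangePartitioner

// RangePartitioner samples one RDD to pick key boundaries; equal keys
// still land in the same partition, which is all the join needs.
val rp = new RangePartitioner(2, large)
val largeR = large.partitionBy(rp).persist()
val smallR = small.partitionBy(rp).persist()

// Same partitioner on both sides, so again no shuffle at join time.
largeR.join(smallR).count()
```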
How to connect with you?
TechWithViresh@gmail.com
The video recommendations at the end are blocking the content...
Your example is not up to the mark. What you describe in the lecture is not understandable; it feels like the video was made just for the sake of making a video. I did not get your point about how the join happens and what goes on during it. Please explain it in a much more understandable manner.