How do I join a small table with a big table, but fetch all the data in the small table? The small table is 100k records and the large table is 1 million records. df = smalldf.join(largedf, smalldf.id == largedf.id, how='left_outer') — it runs out of memory, and I can't broadcast the small df, I don't know why. What is the best approach here? Please help.
Ideally the broadcast join kicks in automatically for the smaller-sized df (which should be less than or equal to 10MB by default), so if you are getting an error, change your spark-submit config — adjust the broadcast size and it might work. Also, your code doesn't broadcast the smaller df explicitly — it should be like df.join(broadcast(smallerdf), smallerdf.id == df.id, "left_outer"). One catch: Spark only broadcasts the non-preserved side of an outer join, so with smalldf on the left of a left_outer join it can't be broadcast — put the large df on the left and use a right_outer join instead if you want to keep all rows of the small table. You can also increase spark.sql.autoBroadcastJoinThreshold (by default it's 10MB) so that a BroadcastHashJoin is performed; it only needs to be raised enough to cover the small df, not the big table.
Nice
Thank you so much! Subscribe for more content 😊
Where is the code? Can you at least show a demo?