Brother, I haven't found this much information even in paid courses. Thank you so much.
This is so true. I admire your hard work!
What a lesson, brother!! Absolutely brilliant.
Good work, bro... In a hash join, building the hash table takes O(N), N being the number of records being hashed. So a hash join takes O(N) vs a sort-based join, which is O(N log N).
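To make that complexity claim concrete, here is a minimal plain-Python sketch of the build/probe idea behind a hash join (illustrative only, not Spark's actual implementation; `hash_join` and the sample rows are made up for the example):

```python
# Minimal sketch of a hash join (illustrative; not Spark internals).
def hash_join(build_side, probe_side):
    # Build phase: O(N) -- hash every row of the (smaller) build side.
    table = {}
    for key, value in build_side:
        table.setdefault(key, []).append(value)
    # Probe phase: O(M) -- one average-O(1) dict lookup per probe-side row.
    return [(key, bv, pv)
            for key, pv in probe_side
            for bv in table.get(key, [])]

rows1 = [(1, "a"), (2, "b"), (5, "c")]
rows2 = [(1, "x"), (5, "y"), (5, "z")]
print(hash_join(rows1, rows2))  # [(1, 'a', 'x'), (5, 'c', 'y'), (5, 'c', 'z')]
```

The sort-based alternative pays O(N log N) up front to sort both sides, which is the trade-off being described.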
I have not come across a more detailed video than this in my entire career.
Bro, I have never seen a detailed video like this.
I follow both of your Spark series. They are really valuable for me 🎉 thanks.
Thank you sir for the detailed series! Your clear explanations have been incredibly helpful in my learning journey.
Great in-depth concepts. Thoroughly enjoyed it. You are a genius. Thanks a lot. Keep up the great work you're doing for the community.
Excellent explanation, and that too free of cost ☺
18:55 Trade-off between CPU usage (shuffle sort-merge join) and memory usage (shuffle hash join).
Hi Manish Bhaiya, here we perform the join based on key = id, which is an integer, so we can see that id % 200 gives the partition number where the data will go. But if the key is a string, how does it work then? Or in that case, does Spark internally create a key for each column?
Murmur3 hashing is applied for strings. If you want to know more, look up how Murmur3 works.
A MILLION SALUTATIONS to you, GURU JI!
Please bring playlists on Apache Airflow and Apache Kafka.
I'm sure they would be the best resource on YouTube.
Thank you bro for providing quality content for free
Awesome explanation
Hi Manish, your content is amazing, keep it up.
Thank you so much for the detailed explanation. However, I am confused about one point. Could you please clarify my question?
Let's say we don't have the color coding as blue and red. Now, executor-1 has 200 partitions and executor-2 also has 200 partitions.
If we consider id=102, then 102/200 = 102. How does Spark determine whether record 102 should go to executor-1 or executor-2?
This is discussed at the 10:56 timestamp.
Thanks!
Brother, first tell me how 102/200 = 102? Do you even know maths?
@@rohitsharma-mg7hd we are taking the remainder
@@shivaog007 Yes brother, he did explain it; I had just misunderstood.
Executor 1 is taking partitions 1 to 100, and executor 2 takes 101 to 200.
The colour is showing the two different tables (df1, df2) present on each executor.
Hey Manish, your videos are amazing!! 👏 Love the way you explain each and every detail. Thank you for sharing your knowledge, and keep it up. ✨️
Hi Manish,
Excellent explanation. Thanks for the informative video.
Brother, in a shuffle hash join, is the hash table created per individual partition or for the entire DataFrame?
Partition level
One doubt: you explained joining on the basis of the id column, where you showed that 1/200 gives remainder 1, so you placed the record in executor 1 in P1. Similarly, 109/200 gives remainder 109, so you placed that record in P109. But now assume that instead of joining records on an integer column, we are joining on a string (CHAR or VARCHAR) column. How will this work then?
So if in an interview the recruiter asks what kind of join we are performing, should we say that we first need to analyze the data to decide which join would be appropriate, or should we say that Spark does the optimization internally?
You can talk about the types of join strategies and then compare two of them using some example DataFrame sizes. Only if the interviewer probes further should you explain in detail.
Is it the case that every DataFrame is split into 200 partitions before shuffling (based on the number of shuffle partitions set)? Or is it that if we have 2 DataFrames to join, each gets only 100 shuffle partitions?
No, it's not that every DataFrame gets 100. Based on the joining condition, 200 partitions get created. You can think of it as 200 buckets, where every bucket holds the records with the same joining keys. Say df1's id 5 lands in bucket 5; then id 5 from df2 will also come to bucket 5, and bucket 5 is then self-sufficient to perform the join.
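A small PySpark sketch of that bucket behaviour, under a few assumptions: Spark 3.x, AQE disabled so the configured partition count is not coalesced, and broadcast joins disabled so a real shuffle happens:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("shuffle-partitions-demo")
         .config("spark.sql.shuffle.partitions", "200")        # the default is also 200
         .config("spark.sql.adaptive.enabled", "false")        # keep exactly 200 partitions
         .config("spark.sql.autoBroadcastJoinThreshold", "-1") # force a shuffle-based join
         .getOrCreate())

df1 = spark.range(1000).selectExpr("id", "id * 2 AS value1")
df2 = spark.range(1000).selectExpr("id", "id * 3 AS value2")

# Both sides are shuffled into the same 200 "buckets" by join key,
# so every id meets its match inside one self-sufficient partition.
joined = df1.join(df2, "id")
print(joined.rdd.getNumPartitions())  # 200
```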
Hi Manish, as you said, sorting is O(N log N). But what about combining the data? Suppose P1 of one table has id 1 and P2 has ids 1, 1; if we combine them, two for-loops are required, so doesn't the complexity become O(n²)? Does it actually work that way?
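On the merge step in the comment above: after sorting, no nested scan over everything is needed; the merge is a single forward pass with two pointers, so it adds only O(N + M) on top of the O(N log N) sort. The work grows beyond that only with the number of matched pairs, which any join must output anyway. A rough plain-Python sketch (not Spark's actual code):

```python
# Two-pointer merge of two ALREADY-SORTED (key, value) lists (illustrative).
def merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1                       # left key too small: advance left
        elif left[i][0] > right[j][0]:
            j += 1                       # right key too small: advance right
        else:
            # Keys match: pair this left row with the whole run of equal right keys.
            k = j
            while k < len(right) and right[k][0] == left[i][0]:
                out.append((left[i][0], left[i][1], right[k][1]))
                k += 1
            i += 1                       # a duplicate left key re-scans the same right run
    return out

print(merge_join([(1, "a"), (1, "b")], [(1, "x"), (1, "y")]))
# [(1, 'a', 'x'), (1, 'a', 'y'), (1, 'b', 'x'), (1, 'b', 'y')]
```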
Bro, can you please tell me which book you follow for Spark?
Hi Manish, I have a few questions:
1. We are applying the join on partitions, right, and not on the DFs? Because the DFs are already divided into 4 partitions each.
2. Now each join will make 200 new partitions, so if we join RP1 and BP3, will it create 200 more partitions? And this way, if we join each partition in Red with every partition in Blue, will we have 3200 partitions in total?
3. In the video you said there are not 200 partitions per executor, yet each executor does have 200 partitions: 100 for Red and 100 for Blue.
Hey, this is my understanding; my answers might help you:
1. We apply the join on the DFs; yes, we have 4 partitions for each DF. When we apply the join, those 4 partitions are reshuffled into 200 partitions.
3. There are 200 partitions for each DF, so each executor has 100 partitions of DF1 and 100 partitions of DF2.
Your intro sounds just like Ravish Kumar's "Namaskar, main Ravish Kumar" to me 👍👌🔥
Will you please make a video on O(n²)?
What actually is it?
Could you please make a video on class and case class? It is being asked in most interviews.
Hi Manish
If we follow the approach mentioned at the 9:28 timestamp, then to which partition will the data go if the remainder is 0?
E.g., if we have an id of 200 or a multiple of 200.
2nd
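For what it's worth, under the simplified id % 200 rule from the video, a remainder of 0 simply means the record goes to partition 0 (in reality Spark applies pmod to a Murmur3 hash of the key, as discussed elsewhere in this thread). A tiny illustration:

```python
# Remainder rule from the video: partition = id % 200.
for id_ in (200, 400, 600, 199):
    print(id_, "->", id_ % 200)  # 200 -> 0, 400 -> 0, 600 -> 0, 199 -> 199
```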
How are these 200 partitions split across 2 executors? And what if there are 3 or 4 executors; how will the split of the 200 partitions happen then?
Then the partitions will be distributed over the 4 executors.
Hi, you said there are 100 partitions in each executor, but in one executor you demonstrated blue and red together, which counts to 200 in that executor. Could you please elaborate on that? Thank you.
Hi Manish, one more question. You say "in-memory" for the hash table, but as we know, data is first loaded into executor memory and the logical operations are performed there. In a shuffle sort join, everything also happens in memory, so why don't we call the shuffle sort join "in-memory" too, given that both partitions for the same key have to be loaded in memory before the join operation is performed?
We can't say that, because shuffle sort-merge also uses disk (it can spill), while shuffle hash join relies heavily on the hash table fitting entirely in memory.
Thank you great explanation 🙏
Brother, which join is better, shuffle hash or sort merge? And how does Spark decide which join it needs to use?
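For reference, since Spark 3.0 you can also steer the choice yourself with join-strategy hints and then check the physical plan to see what the planner actually picked. A minimal sketch (df1 and df2 are placeholder DataFrames sharing an id column):

```python
# Spark 3.0+ join-strategy hints; df1/df2 are any DataFrames with an "id" column.
broadcast_hash = df1.join(df2.hint("broadcast"), "id")     # broadcast hash join
shuffle_hash   = df1.join(df2.hint("shuffle_hash"), "id")  # shuffle hash join
sort_merge     = df1.join(df2.hint("merge"), "id")         # shuffle sort-merge join

sort_merge.explain()  # the physical plan names the strategy actually chosen
```

Without a hint, Spark mainly looks at the estimated table sizes (small enough to broadcast?) and whether the join keys are sortable when choosing between these strategies.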
Your hair got messed up xD. Nice work btw, these videos are really helpful.
Brother, have you covered optimization techniques in any video?
Brother, how did those 4 partitions get converted into 200 partitions? Does the conversion to 200 happen after the join or before the join?
And if it happens before the join, then what happened to those 4 partitions that were created earlier?
First understand shuffling; then this will make sense.
Hi Manish, I am following all your videos. Thanks for your great contribution in explaining everything in detail. As you said, records get segregated into partitions as per the remainder we get from dividing the id value by 200 (the number of partitions). What if the join is done on the name column instead of id? How does the division take place to segregate the name column across partitions? Please clarify.
I also have the same question. Please answer this.
@@amritranjannayak2705 He replied on another comment: Murmur3 hashing is used when joining on strings!
Nice video, sir, but please say "modulus operation"; "divide" is a little confusing.
Look at how the modulus operator works.
@@manish_kumar_1 You are indeed just taking the remainder of dividing by 200.
Thanks Manish Bhai!!
Really exemplary 🎉
Thank you for explaining.
An executor can have only one partition at a time...is this not correct?
Sir, please make a video on Spark Streaming.
great sir thank you
well explained
How do we find out a DataFrame's size?
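One option on Spark 3.0+ is the "cost" explain mode, which prints the optimizer's size estimate, the same statistic Spark consults when picking a join strategy (the parquet path below is a placeholder):

```python
# "cost" mode prints Statistics(sizeInBytes=...) for each optimized-plan node.
df = spark.read.parquet("/path/to/data")  # placeholder path
df.explain(mode="cost")
```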
Which memory pool is utilized to create the hash table during a shuffle hash join?
Which executors are those partitions on after shuffling?
Amazing video, bhai!
If 200 partitions are created, does that mean 200 cores are also needed? Only then can 200 partitions be created, right? And what if we don't have 200 cores?
It will still run. The whole point of distributed computing is to run your job even with limited resources. To understand Spark fully, though, you will have to watch the whole playlist.
It will run in 200/n waves, where n = the number of cores!
When the salary table is less than 10 MB and the first table is so much bigger, how will both end up with the same number of partitions?
Amazing video
I am working as an Operations Executive in a warehouse, but I started learning Sqoop, Hive, MySQL, MongoDB, HBase, NiFi, Kafka, Spark, and AWS services. It is a completely non-IT role. I cleared two interviews. How do I get an experience certificate for working on the above technologies?
Tell them that you don't have experience and that you have done all the projects on your own. If you cleared the interviews, it means you are a good fit for the role.
@@manish_kumar_1 Recruiters ask for experience even after clearing the L2 discussion.
Will it be division? I think it should be modulus.
Yes, it is modulus.
day 4 done👍
In your video it looks like 200 partitions are being created per executor, but you are saying 200 won't be created per executor, only 200 partitions in total. Please explain this part, and also why 200 are created by default.
what if the id is not numeric?
Actually, the formula used in reality is hash(id) % num_of_partitions. For example, here internally hash(7) % 200 will be computed, and the partition will be assigned accordingly. The hash we are talking about is Murmur3 hashing, so it doesn't matter whether your id is numeric, a string, or any other datatype; you eventually take its hash() value.
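You can see this directly in PySpark: pyspark.sql.functions.hash is Spark's Murmur3 hash and works on any column type, and pmod(hash(key), 200) reproduces the partition-assignment idea (a minimal sketch; the sample rows are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, hash as spark_hash

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(7, "alice"), (200, "bob"), (102, "carol")], ["id", "name"]
)

df.select(
    "id",
    spark_hash(col("id")).alias("murmur3_of_id"),      # hashing an integer...
    spark_hash(col("name")).alias("murmur3_of_name"),  # ...or a string works the same way
    expr("pmod(hash(id), 200)").alias("partition_no"), # hash(id) % 200, kept non-negative
).show()
```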
Sir, I have done Python, basic SQL, Linux commands, and all the DBMS concepts. Can I learn Spark now, or is there any prerequisite for Spark?
No prerequisite. If you know a bit of SQL, you will grasp the concepts quickly.
Thank you sir I will follow your series
What is a partition?
Whenever you are working with Spark, the data is divided into parts; those parts of the data are called partitions.
How is 7/200 = 7?
The remainder will be 7. A pmod function is applied there.
Brother, since when does 102/200 = 102?
They probably meant 102 % 200; "divide" must have been said by mistake.
@@manish_kumar_1 OK, thank you. You explain things really well. Thanks a lot.
Can you explain this topic more clearly?
improve your video quality
You ramble too much... stay straight to the point...
thanks
Brother, can I have a one-to-one meeting? I have some doubts.
Sure, you can book a session on Topmate.