Directly connect with me on:- topmate.io/manish_kumar25
When the last action was hit, a job was created for collect too, right? That would also create a stage, so why wasn't it included?
@@Adity-t4d So does it create a total of 205 tasks and 5 stages then?
Hi Manish, in your example I also see a count action, which means it should trigger a job, but while explaining you didn't consider it as a job? Can you explain this if possible?
In Apache Spark, spark.read.csv() is neither a transformation nor an action; it is a method used to initiate the reading of CSV data into a Spark DataFrame, and it's part of the data loading phase in Spark's processing model. The actual reading and processing of the data occur later, driven by Spark's lazy evaluation model.
Then how do actions turn into jobs? I mean, if actions equal jobs, what is a better way to find that out?
If you don't explicitly provide a schema, Spark will read a portion of the data to infer it. This triggers a job to read data for schema inference.
@@roshankumargupta46 Wow, this explains a lot of the issues encountered while following his previous videos as well! Where do you study from?
If you don't explicitly provide a schema, Spark will read a portion of the data to infer it. This triggers a job to read data for schema inference. If you disable schema inference and provide your own schema, you can avoid the job triggered by schema inference.
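A minimal PySpark sketch of the difference, assuming a hypothetical employee.csv and made-up column names (not the exact files from the video):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("schema_demo").getOrCreate()

# inferSchema=True makes Spark scan the data up front, which shows up as an extra job in the UI
inferred_df = spark.read.csv("employee.csv", header=True, inferSchema=True)

# Supplying a manually created schema avoids that extra scan/job
manual_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", IntegerType(), True),
])
explicit_df = spark.read.csv("employee.csv", header=True, schema=manual_schema)
```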
Thanks for your clarification
thanks bro
This is so helpful! where do you study from?
bro next level ka explanation tha... thanks for sharing your great knowledge. keep up the good work. Thanks
How many times will I have to say this: he is the GOAT at explaining concepts.
Liquid Gold. :)
😃😃
I used to look at this job and task information after running a notebook, but after watching your video I got to know the actual workflow. Thanks for the best knowledge sharing.
One of the BEST videos in this series. Thank you Manish Bhai
Great Manish. I am grateful to you for making such rare content with so much depth. You are doing a tremendous job by contributing towards the Community. Please keep up the good work and stay motivated. We are always here to support you.
Really awesome explanation! You won't find an explanation like this anywhere else. Thank you so much.
Wow, what a clear explanation. For the first time I understood it in one go.
I have no words for your efforts. Great Great, You are really great. Thanks a lot
One of the best videos ever . Thank you for this . Really helpful.
Nice explanation; you make each and every concept clear. Keep it up.
One question:
After groupBy, by default 200 partitions will be created, where each partition will hold the data for individual keys.
What happens if there are fewer keys, like 100? Will it lead to the formation of only 100 partitions instead of 200?
AND
What happens if there are more than 200 individual keys? Will it create more than 200 partitions?
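With the classic behaviour (AQE disabled), the answer in both cases is still 200: keys are hash-partitioned into spark.sql.shuffle.partitions buckets, so several keys can share a partition and some partitions stay empty. A small sketch to see this, with made-up data:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("shuffle_partition_demo")
         .config("spark.sql.adaptive.enabled", "false")   # keep the fixed-200 behaviour
         .getOrCreate())

# 1000 rows but only 3 distinct ages
df = spark.range(1000).withColumn("age", (F.col("id") % 3).cast("int"))

grouped = df.groupBy("age").count()

# The 3 keys are hashed into spark.sql.shuffle.partitions buckets (200 by default),
# so most of the 200 shuffle partitions are simply empty here.
print(grouped.rdd.getNumPartitions())   # 200 with the defaults
```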
Didn't find such a detailed explanation, Kudos
I really liked this video....nobody explained at this level
Absolutely brilliant explanation.
Explained so well that too bit- by- bit 👏🏻
Around 6:35 it is wrong: read can avoid being an action if you pass a manually created schema containing a list of all the columns. Refer to this for a practical: ua-cam.com/video/VLi9WS8SJFY/v-deo.html
Very good content. Please make detailed videos on Spark job optimization.
Very detailed lecture. Worth it.
Thanks Manish Bhai... Please keep making your videos.
very nice explanation sir.
1 job for read,
1 job for print 1
1 job for print 2
1 job for count
1 job for collect
Total 5 jobs according to me, but I have not run the code, so I'm not sure.
Yes I also got 5 jobs. Not sure how Manish got only 2
@@piyushzope10 In Databricks? I got 5 jobs, 5 stages and 6 tasks in Databricks!
@@younevano Yes, in Databricks.
@@piyushzope10 Did you figure out why it differs so much with Manish's theory?
@@younevano No, I wasn't able to figure it out, but I think it might be due to some updates... I will explore it more this week... Will update here if I find anything...
I have one doubt: there are 3 actions here (read, count, and collect), so why is it creating only 2 jobs?
In Apache Spark, the read operation is not considered an action; it is a transformation.
@@tejathunder It is neither an action nor a transformation.
Here read() does trigger a job because we are not explicitly providing a schema! But count() in the code comes right after the groupBy aggregation within a transformation chain, so it does not create an extra job.
So just 2 jobs: read and collect.
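A minimal sketch of that difference (the tiny employee_df here is made-up illustrative data, not the dataset from the video):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count_demo").getOrCreate()
employee_df = spark.createDataFrame(
    [(1, "Amit", 30), (2, "Neha", 30), (3, "Ravi", 40)], ["id", "name", "age"])

# count() on a plain DataFrame is an action: it returns a Python int and triggers a job
total_rows = employee_df.count()

# groupBy(...).count() is a transformation: it returns a new DataFrame and nothing runs
# until an action such as collect() or show() is called on it
age_counts = employee_df.groupBy("age").count()
result = age_counts.collect()   # this collect() is the action that triggers the job
```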
I had a question about the order in which things should be written. I mean, if we are doing filter/select/partition/groupBy/distinct/count or anything else, what should we write first?
Bro, if you want to write it in an optimized way, then first apply the filter and then apply the other transformations.
For example: suppose I have data for 100 employees and I only need the employees with a salary greater than 90000, and I want to promote all of those employees, i.e. increase everyone's salary. In this case, if you apply the transformation first (the salary increment) and then the filter, the data of all 100 employees will be scanned. But if you apply the filter first, and suppose only 2 people have a salary above 90000, then we only need to scan those 2 employees' data.
Sorry, the example was a bit rough, but hopefully the concept is clear.
By the way, even if you apply the filter after the transformation, there is no problem, because Spark is internally designed to run in an optimized way: once your whole operation is defined and you trigger the job, it will first do the filter and then apply the other transformations on top of it (see the sketch below).
Spark is very intelligent and will do it in an optimized way.
I hope this answers your question.
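A minimal sketch of the two orderings, with made-up data and column names; Spark's Catalyst optimizer will typically push the filter down either way, which you can check with explain():

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("filter_order_demo").getOrCreate()
employees = spark.createDataFrame(
    [(1, "Amit", 95000), (2, "Neha", 60000), (3, "Ravi", 120000)],
    ["id", "name", "salary"])

# Filter first, then transform: only the matching rows flow through the new column
promoted = (employees
            .filter(F.col("salary") > 90000)
            .withColumn("bonus", F.col("salary") * 0.10))

# Transform first, then filter: logically touches every row first, but Catalyst
# usually pushes the filter below the projection, giving the same physical plan
promoted_alt = (employees
                .withColumn("bonus", F.col("salary") * 0.10)
                .filter(F.col("salary") > 90000))

promoted.explain()       # compare the physical plans of both versions
promoted_alt.explain()
```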
Bhai, you explain things really well.
This was so beautifully explained.
Num Jobs = 2
Num Stages = 4 (job1 = 1, job2 = 3)
Num Tasks = 204 (job1 = 1, job2 = 203)
Amazing explanation sir jii
Your channel will grow immensely bro. Keep it up❤
Given the actions identified in the code, let's summarize how many jobs will be created:
load():
This triggers 1 job to read the CSV file into a DataFrame.
First getNumPartitions():
This triggers 1 additional job to determine the number of partitions in the DataFrame after loading the data.
Second getNumPartitions():
This triggers 1 more job to check the number of partitions after the DataFrame has been repartitioned.
collect():
This triggers 1 job to execute the entire DataFrame pipeline, which includes the filter, select, groupBy, and count transformations.
Total Jobs:
1 (load) + 1 (first getNumPartitions) + 1 (second getNumPartitions) + 1 (collect) = 4 jobs
So, a total of 4 Spark jobs will be created by your code.
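For reference, a rough sketch of the kind of pipeline this thread is discussing; the file path, column names and filter condition are assumptions, not the exact code from the video:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job_count_demo").getOrCreate()

# load() triggers a job here because the schema is being inferred
employee_df = (spark.read
               .format("csv")
               .option("header", "true")
               .option("inferSchema", "true")
               .load("employee.csv"))

print(employee_df.rdd.getNumPartitions())   # first partition check

employee_df = employee_df.repartition(3)    # wide transformation (shuffle)

print(employee_df.rdd.getNumPartitions())   # second partition check

result = (employee_df
          .filter("salary > 90000")         # narrow
          .select("id", "name", "age")      # narrow
          .groupBy("age")                   # wide: 200 shuffle partitions by default
          .count()                          # aggregation here, not an action
          .collect())                       # action that actually triggers the job
```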
Ideally yes, this should be it! Can you elaborate on how many stages and tasks you believe will be created?
@manish_kumar, why is a new stage created when a filter operation is applied?
Bhai, I was eagerly waiting for your videos.
@manish_kumar_1: correction - job 2, stage 2 runs till the groupBy, and job 2, stage 3 runs till collect.
Yeah this is what chatGPT explains too!
Sir, please make playlists for other data engineering tools also.
great job bro, you are doing well.
Hello Manish, isn't the count() method an action? Also, once it returns the count, employee_df would hold an integer value, and you later perform a collect() operation; wouldn't that throw an exception?
Hi bhaiya. Why haven't we considered collect() as a job creator here in the program you discussed?
Hi Manish, I ran the same code in a Databricks notebook but it created 5 jobs instead of 2.
Same here: 5 jobs, 5 stages and 6 tasks! Did you figure out why?
Very well explained bhai.
Excellent explanation. One question: with groupBy, 200 tasks are created, but most of these tasks are useless, right? How do we avoid such scenarios? Because it will take extra effort for Spark to schedule such empty-partition tasks, right?
You can repartition it to a smaller number of partitions, or you can tweak the spark.sql.shuffle.partitions config by setting it to a desirable number.
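A quick sketch of both options, continuing with the spark session and employee_df from the sketch above (8 is just an illustrative value):

```python
# Option 1: lower the shuffle partition count for this session before the groupBy
spark.conf.set("spark.sql.shuffle.partitions", "8")
age_counts = employee_df.groupBy("age").count()   # now shuffles into 8 partitions

# Option 2: keep the default 200, but coalesce the result afterwards
age_counts = employee_df.groupBy("age").count().coalesce(8)
```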
count() is also an action. Why didn't we consider it as a new job?
Hi Manish, in Databricks too, when groupBy() is invoked, does it create 200 tasks by default? How can we reduce the 200 tasks when using groupBy(), to optimize the Spark job?
There is a configuration that can be set. Just google how to set a fixed number of partitions after a join.
count() used after a groupBy is treated as a transformation, not an action.
I had read that 'tasks = number of cores'. If there are 2 cores, how many tasks will it create: 2 or 200?
Very good explanation, bro.
Hi Manish,
In the second job there were 203 tasks and in the 1st job there was 1, so are there 204 tasks in total in the complete application? I am a bit confused between 203 and 204.
Kindly clarify..
Thank you so much Manish
Is the read operation considered an action?
Sir, in line 14 we have .groupBy and .count.
.count is an action, right? Not sure if you missed it by mistake or if it doesn't count as an action? 🙁
I had the same doubt. Did you get the answer to this question? As per the UI also, only 2 jobs are shown, whereas count should be an action :(
Functions like sum(), count(), etc. in Spark behave differently in different contexts!
Here count() is used along with other aggregation functions in a transformation chain, where it acts like a transformation instead of an action!
great explanation :)
I remember the wide dependency you explained with shuffling.
You're looking smart today, bhaiya.
Start a playlist with guided projects, so that we can apply these things in real life.
Found any sources/channels for DE follow-along projects?
Great Explanation ❤
After repartition(3), will the 200 default partitions still show in the DAG, Sir?
In the video you made 4 stages, but how come only 3 stages were created in the Spark UI? One stage was created in the read itself, right?
Can an executor have 2 partitions, or does having 2 partitions mean they will be on two different machines?
An executor can run multiple tasks (meaning it can hold multiple partitions) depending on the number of cores it has!
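A small sketch of how that is usually configured; the numbers are arbitrary and are honoured by a cluster manager such as YARN or Kubernetes, not in local mode. With these settings, each executor can run up to 4 tasks, i.e. work on 4 partitions, at the same time:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("executor_demo")
         .config("spark.executor.instances", "2")   # 2 executors requested
         .config("spark.executor.cores", "4")       # each can run 4 tasks concurrently
         .config("spark.executor.memory", "4g")
         .getOrCreate())
```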
VERY VERY HELPFUL
When I ran it in a notebook, it gave 5 jobs as below, and not only 2, for this snippet of code. Can you explain?
Job 80 View(Stages: 1/1)
Stage 95: 1/1
Job 81 View(Stages: 1/1)
Stage 96: 1/1
Job 82 View(Stages: 1/1)
Stage 97: 1/1
Job 83 View(Stages: 1/1, 1 skipped)
Stage 98: 0/1 skipped
Stage 99: 2/2
Job 84 View(Stages: 1/1, 2 skipped)
Stage 100: 0/1 skipped
Stage 101: 0/2 skipped
Stage 102: 1/1
Same here, did you figure out why?
df.rdd.getNumPartitions(): Output = 1
df.repartition(5)
df.rdd.getNumPartitions(): Output = 1
Using Community Databricks, sir.
Yes, I don't see any issue. If you don't assign your repartitioned df to some variable, then you will get the same result.
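In other words, a minimal sketch (assuming df is the DataFrame from the notebook above): DataFrames are immutable, so repartition() returns a new DataFrame that has to be captured.

```python
df.repartition(5)                    # returns a NEW DataFrame; the result is discarded
print(df.rdd.getNumPartitions())     # still the old value, e.g. 1

df = df.repartition(5)               # capture the repartitioned DataFrame
print(df.rdd.getNumPartitions())     # now 5
```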
glad to see you brother
at 6:40, .count() is also an action right?
Functions like sum(), count(), etc. in Spark behave differently in different contexts!
Here count() is used along with other aggregation functions in a transformation chain, where it acts like a transformation instead of an action!
Where have you shared that data, sir? Please tell us.
Which link is it? Please provide it.
Why was count() not counted as an action while counting the jobs?
You will understand it in the upcoming lectures.
Functions like sum(), count(), etc. in Spark behave differently in different contexts!
Here count() is used along with other aggregation functions in a transformation chain, where it acts like a transformation instead of an action!
.count on a grouped dataset is a transformation, not an action, correct? If it were like employee_df.count(), then it would be an action.
yep!
Since filter is a narrow transformation, why is a new stage being created after the repartition?
same doubt
7:40
print is an action, so there should be 4 jobs in the given code, right? Correct me if I am wrong.
No, print() is not an action in Spark. But I believe getNumPartitions() is triggering a job though!
Bhai, job 2 will start from collect, right? So from after the read up to collect, it will all run as job 1, right?
Bhai, I'm using Spark 3.4.1, and when I group data using groupBy (I have 15 records in a dummy dataset) it creates 4 jobs to process 200 partitions. Why? Is this a recent enhancement? And not only in the latest version; I observed the same thing in Spark 3.2.1. Could you please explain this?
Bhai, a job is created on an action, but the number of actions != the number of jobs, because a job is created whenever a new RDD is needed. In groupBy we need to shuffle the data, and since RDDs are immutable, a new RDD has to be created after the shuffle. So whenever a new RDD needs to be created, a job is created. In your case: 1 job for the read, 1 job for the schema, 1 job for the shuffling, and 1 job for the display.
@@villagers_01 If that's the case, then shouldn't it also create a job after each of the repartition, filter, and select phases in Manish's code, since each of those creates a new RDD / new set of partitions?
Nicely explained
Can I know more about executors?
How many executors will there be in a worker node?
And
Does the number of executors depend on the number of cores in the worker node?
nice explanation.
Are there 200 default tasks in a groupBy even if there are only 3 distinct ages? If so, what will be in the rest of the 197 tasks (which age group will be there)?
Empty partitions/tasks??
Glad to see you, Manish... Bhai, any update on the project details?
Manish bhai, how many videos in total will there be in the theory and practical series?
thank you sir..
What would happen if there were one more narrow transformation after the filter, like filter --> flatMap --> select? How many tasks would that create?
No additional tasks, I believe!
In the code snippet shown at the beginning, count is also a job, right?
No, you will find out why in the upcoming lectures.
Functions like sum(), count(), etc. in Spark behave differently in different contexts!
Here count() is used along with other aggregation functions in a transformation chain, where it acts like a transformation instead of an action!
I have a question: if one job has 2 consecutive wide-dependency transformations, then 1 narrow dependency, and again 1 wide dependency, how many stages will be created? Suppose repartition, after that groupBy, then filter and then join; how many stages will this create?
same question
As he said in the video, each wide-dependency transformation creates a new stage, so I am guessing in your case it will create 3 stages!
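A sketch of such a chain, assuming two hypothetical DataFrames df(dept_id, salary) and dept_df(dept_id, dept_name) and a shuffle join rather than a broadcast join; each wide transformation (repartition, groupBy, join) introduces a shuffle, and every shuffle marks a stage boundary within the job:

```python
from pyspark.sql import functions as F

result = (df
          .repartition(10, "dept_id")               # wide: shuffle -> stage boundary
          .groupBy("dept_id")                       # wide: shuffle -> stage boundary
          .agg(F.sum("salary").alias("total_salary"))
          .filter(F.col("total_salary") > 100000)   # narrow: stays in the same stage
          .join(dept_df, on="dept_id")              # wide: shuffle -> stage boundary
          .collect())                               # action that triggers the job
```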
So if there is any groupBy, do we have to consider 200 tasks?
Yes, if AQE is disabled. If it is enabled, then the count depends on the data volume and the default parallelism.
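A quick sketch of the relevant settings, assuming an active SparkSession named spark:

```python
# With AQE disabled, a groupBy shuffle always uses the fixed count
# from spark.sql.shuffle.partitions (200 by default)
spark.conf.set("spark.sql.adaptive.enabled", "false")

# With AQE enabled, Spark can coalesce small shuffle partitions at runtime,
# so the number of post-shuffle tasks depends on the actual data volume
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```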
count would also be an action, right? So that means 3 jobs were created?
Functions like sum(), count(), etc. in Spark behave differently in different contexts!
Here count() is used along with other aggregation functions in a transformation chain, where it acts like a transformation instead of an action!
good job man
Bhaiya, you are great.
What if spark.sql.shuffle.partitions is set to some value? In that case, what will be the number of tasks in the groupBy stage?
Your set value, I believe!
Bro amazing explanation
excellent!!
The number of actions is equal to the number of jobs. In the mentioned code snippet there were three actions (read, count, collect). As per the theory, three job IDs should be created, but in the Spark UI only two jobs are created. Can you help me with this?
Why were three job IDs not created?
count is both a transformation and an action. In the given example it is working as a transformation, not as an action.
I will be uploading the aggregation video soon. There you will get to know more about count's behavior.
@@manish_kumar_1 Thanks for the prompt response. Sure, eagerly waiting for your new video.
Functions like sum(), count(), etc. in Spark behave differently in different contexts!
Here count() is used along with other aggregation functions in a transformation chain, where it acts like a transformation instead of an action!
Hi sir, I have one doubt: will collect create a task and a stage or not? Because you mentioned 203 tasks.
200 tasks read the shuffled data, and the same 200 tasks perform the count() aggregation. There are no additional tasks, as collect() just uses the results from the aggregation tasks. Basically, it is 200 tasks for everything (the groupBy aggregation and collect).
Do you have an English version of the videos?
Could you please upload the required files? I just want to run it and see for myself.
I have been waiting for your video. How many more days will it take for the Spark series to complete?
Hi @manish_kumar_1,
I have one question: in the wide transformation part you said that in the groupBy stage (stage 3) there will be 200 tasks corresponding to the 200 partitions. But can you tell me why these 200 partitions happen in the first place?
Any wide transformation (which involves shuffling) causes 200 partitions by default in Spark, as he said in the video!
No problem bhaiya, just please finish this series completely, because no one on YouTube has explained it in this much detail.
Hi Manish, why doesn't the collect() method create a new stage (stage 4) in Job 2, since it needs to send the data from the 200 partitions to the driver node?
Collect is an action, not a transformation.
@@manish_kumar_1 Thanks for the reply, Manish. What happens after the groupBy in this case? Spark transfers the data in the 200 partitions to the driver, right? Don't we need any tasks for that process? Thanks in advance.
@@venumyneni6696 I think you are missing some key points about the driver and executors. Please clear your basics: read multiple blogs or watch my videos in sequence.
Correct, but you said that for every action 1 stage will be created, so the total stages should be 5, @@manish_kumar_1
@@AAMIRKHAN-qy2cl correct! Have you confirmed this?
Where do I get the file's data from? Where have you provided the data?
Bro, how many more days will the Spark series take, and will you make a complete DE project with Spark at the end?
BTW, I have watched and implemented all your theory and practical videos. Great sharing ❤
Wow!
Hi Manish, count is also an action, and you have written count just after groupBy in the code snippet. Why is count not considered a job here?
You will get the explanation for this in the upcoming videos. count is both an action and a transformation; when it behaves as which one is explained in detail in the later videos.
Functions like sum(), count(), etc. in Spark behave differently in different contexts!
Here count() is used along with other aggregation functions in a transformation chain, where it acts like a transformation instead of an action!