Directly connect with me on:- topmate.io/manish_kumar25
When the last action was hit, a job was created for collect too, right? That would also create a stage, so why wasn't it included?
@@Adity-t4d So does it create a total of 205 tasks and 5 stages then?
Hi Manish, in your example I also see a count action, which means it should trigger a job, but while explaining you didn't consider it as a job? Can you explain this if possible?
In Apache Spark, spark.read.csv() is neither a transformation nor an action; it is a method used to initiate the reading of CSV data into a Spark DataFrame, and it's part of the data loading phase in Spark's processing model. The actual reading and processing of the data occur later, driven by Spark's lazy evaluation model.
Then how do actions turn into jobs? I mean, if actions equal jobs, what is a better way to find that out?
If you don't explicitly provide a schema, Spark will read a portion of the data to infer it. This triggers a job to read data for schema inference.
@@roshankumargupta46 Wow, this explains a lot of the issues encountered while following his previous videos as well! Where do you study from?
If you don't explicitly provide a schema, Spark will read a portion of the data to infer it. This triggers a job to read data for schema inference. If you disable schema inference and provide your own schema, you can avoid the job triggered by schema inference.
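A minimal PySpark sketch of the difference, assuming a hypothetical employee.csv and made-up column names (not the exact files from the video):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("schema_demo").getOrCreate()

# inferSchema=True makes Spark scan the data up front, which shows up as an extra job in the UI
inferred_df = spark.read.csv("employee.csv", header=True, inferSchema=True)

# Supplying a manually created schema avoids that extra scan/job
manual_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", IntegerType(), True),
])
explicit_df = spark.read.csv("employee.csv", header=True, schema=manual_schema)
```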
Thanks for your clarification
thanks bro
This is so helpful! where do you study from?
bro next level ka explanation tha... thanks for sharing your great knowledge. keep up the good work. Thanks
How many times will I have to say this: he is the GOAT at explaining concepts.
Liquid Gold. :)
😃😃
I used to look at this job and task information after running a notebook, but after watching your video I got to know the actual workflow. Thanks for the best knowledge sharing.
One of the BEST videos in this series. Thank you Manish Bhai
Great Manish. I am grateful to you for making such rare content with so much depth. You are doing a tremendous job by contributing towards the Community. Please keep up the good work and stay motivated. We are always here to support you.
Really awesome explanation! You won't find an explanation like this anywhere else. Thank you so much.
Wow, what a clear explanation. For the first time I understood it in one go.
I have no words for your efforts. Great Great, You are really great. Thanks a lot
One of the best videos ever . Thank you for this . Really helpful.
Nice explanation; you make each and every concept clear. Keep it up.
One question:
After groupBy, by default 200 partitions will be created, where each partition will hold the data for individual keys.
What happens if there are fewer keys, like 100? Will it lead to the formation of only 100 partitions instead of 200?
AND
What happens if there are more than 200 individual keys? Will it create more than 200 partitions?
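With the classic behaviour (AQE disabled), the answer in both cases is still 200: keys are hash-partitioned into spark.sql.shuffle.partitions buckets, so several keys can share a partition and some partitions stay empty. A small sketch to see this, with made-up data:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("shuffle_partition_demo")
         .config("spark.sql.adaptive.enabled", "false")   # keep the fixed-200 behaviour
         .getOrCreate())

# 1000 rows but only 3 distinct ages
df = spark.range(1000).withColumn("age", (F.col("id") % 3).cast("int"))

grouped = df.groupBy("age").count()

# The 3 keys are hashed into spark.sql.shuffle.partitions buckets (200 by default),
# so most of the 200 shuffle partitions are simply empty here.
print(grouped.rdd.getNumPartitions())   # 200 with the defaults
```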
Didn't find such a detailed explanation, Kudos
I really liked this video....nobody explained at this level
Absolutely brilliant explanation.
Explained so well that too bit- by- bit 👏🏻
Around 6:35 it is wrong: read can avoid being an action if you pass a manually created schema containing a list of all the columns. Refer to this for a practical: ua-cam.com/video/VLi9WS8SJFY/v-deo.html
Very good content. Please make detailed videos on Spark job optimization.
Very detailed lecture. Worth it.
Thanks Manish Bhai... Please keep making your videos.
very nice explanation sir.
1 job for read,
1 job for print 1
1 job for print 2
1 job for count
1 job for collect
Total 5 jobs according to me, but I have not run the code, so I'm not sure.
Yes I also got 5 jobs. Not sure how Manish got only 2
@@piyushzope10 In Databricks? I got 5 jobs, 5 stages and 6 tasks in Databricks!
@@younevano Yes, in Databricks.
@@piyushzope10 Did you figure out why it differs so much with Manish's theory?
@@younevano No, I wasn't able to figure it out, but I think it might be due to some updates... I will explore it more this week... Will update here if I find anything...
I have one doubt: there are 3 actions here (read, count, and collect), so why is it creating only 2 jobs?
In Apache Spark, the read operation is not considered an action; it is a transformation.
@@tejathunder It is neither an action nor a transformation.
Here read() does trigger a job because we are not explicitly providing a schema! But count() in the code comes right after the groupBy aggregation within a transformation chain, so it does not create an extra job.
So just 2 jobs: read and collect.
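A minimal sketch of that difference (the tiny employee_df here is made-up illustrative data, not the dataset from the video):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count_demo").getOrCreate()
employee_df = spark.createDataFrame(
    [(1, "Amit", 30), (2, "Neha", 30), (3, "Ravi", 40)], ["id", "name", "age"])

# count() on a plain DataFrame is an action: it returns a Python int and triggers a job
total_rows = employee_df.count()

# groupBy(...).count() is a transformation: it returns a new DataFrame and nothing runs
# until an action such as collect() or show() is called on it
age_counts = employee_df.groupBy("age").count()
result = age_counts.collect()   # this collect() is the action that triggers the job
```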
I had a question about the order in which things should be written. I mean, if we are doing filter/select/partition/groupBy/distinct/count or anything else, what should we write first?
Bro, if you want to write it in an optimized way, then first apply the filter and then apply the other transformations.
For example: suppose I have data for 100 employees and I only need the employees with a salary greater than 90000, and I want to promote all of those employees, i.e. increase everyone's salary. In this case, if you apply the transformation first (the salary increment) and then the filter, the data of all 100 employees will be scanned. But if you apply the filter first, and suppose only 2 people have a salary above 90000, then we only need to scan those 2 employees' data.
Sorry, the example was a bit rough, but hopefully the concept is clear.
By the way, even if you apply the filter after the transformation, there is no problem, because Spark is internally designed to run in an optimized way: once your whole operation is defined and you trigger the job, it will first do the filter and then apply the other transformations on top of it (see the sketch below).
Spark is very intelligent and will do it in an optimized way.
I hope this answers your question.
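A minimal sketch of the two orderings, with made-up data and column names; Spark's Catalyst optimizer will typically push the filter down either way, which you can check with explain():

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("filter_order_demo").getOrCreate()
employees = spark.createDataFrame(
    [(1, "Amit", 95000), (2, "Neha", 60000), (3, "Ravi", 120000)],
    ["id", "name", "salary"])

# Filter first, then transform: only the matching rows flow through the new column
promoted = (employees
            .filter(F.col("salary") > 90000)
            .withColumn("bonus", F.col("salary") * 0.10))

# Transform first, then filter: logically touches every row first, but Catalyst
# usually pushes the filter below the projection, giving the same physical plan
promoted_alt = (employees
                .withColumn("bonus", F.col("salary") * 0.10)
                .filter(F.col("salary") > 90000))

promoted.explain()       # compare the physical plans of both versions
promoted_alt.explain()
```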
Bhai, you explain things really well.
This was so beautifully explained.
Num Jobs = 2
Num Stages = 4 (job1 = 1, job2 = 3)
Num Tasks = 204 (job1 = 1, job2 = 203)
Amazing explanation sir jii
Your channel will grow immensely bro. Keep it up❤
Given the actions identified in the code, let's summarize how many jobs will be created:
load():
This triggers 1 job to read the CSV file into a DataFrame.
First getNumPartitions():
This triggers 1 additional job to determine the number of partitions in the DataFrame after loading the data.
Second getNumPartitions():
This triggers 1 more job to check the number of partitions after the DataFrame has been repartitioned.
collect():
This triggers 1 job to execute the entire DataFrame pipeline, which includes the filter, select, groupBy, and count transformations.
Total Jobs:
1 (load) + 1 (first getNumPartitions) + 1 (second getNumPartitions) + 1 (collect) = 4 jobs
So, a total of 4 Spark jobs will be created by your code.
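For reference, a rough sketch of the kind of pipeline this thread is discussing; the file path, column names and filter condition are assumptions, not the exact code from the video:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job_count_demo").getOrCreate()

# load() triggers a job here because the schema is being inferred
employee_df = (spark.read
               .format("csv")
               .option("header", "true")
               .option("inferSchema", "true")
               .load("employee.csv"))

print(employee_df.rdd.getNumPartitions())   # first partition check

employee_df = employee_df.repartition(3)    # wide transformation (shuffle)

print(employee_df.rdd.getNumPartitions())   # second partition check

result = (employee_df
          .filter("salary > 90000")         # narrow
          .select("id", "name", "age")      # narrow
          .groupBy("age")                   # wide: 200 shuffle partitions by default
          .count()                          # aggregation here, not an action
          .collect())                       # action that actually triggers the job
```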
Ideally yes, this should be it! Can you elaborate on how many stages and tasks you believe will be created?
@manish_kumar, why is a new stage created when a filter operation is applied?
Bhai, I was eagerly waiting for your videos.
@manish_kumar_1: correction - job 2, stage 2 runs till the groupBy, and job 2, stage 3 runs till collect.
Yeah this is what chatGPT explains too!
Sir, please make playlists for other data engineering tools also.
great job bro, you are doing well.
Hello Manish, isn't the count() method an action? Also, once it returns the count, employee_df would hold an integer value, and you later perform a collect() operation; wouldn't that throw an exception?
Hi bhaiya. Why haven't we considered collect() as a job creator here in the program you discussed?
Hi Manish, I ran the same code in a Databricks notebook but it created 5 jobs instead of 2.
Same here: 5 jobs, 5 stages and 6 tasks! Did you figure out why?
Very well explained bhai.
Excellent explanation. One question: with groupBy, 200 tasks are created, but most of these tasks are useless, right? How do we avoid such scenarios? Because it will take extra effort for Spark to schedule such empty-partition tasks, right?
You can repartition it to a smaller number of partitions, or you can tweak the spark.sql.shuffle.partitions config by setting it to a desirable number.
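A quick sketch of both options, continuing with the spark session and employee_df from the sketch above (8 is just an illustrative value):

```python
# Option 1: lower the shuffle partition count for this session before the groupBy
spark.conf.set("spark.sql.shuffle.partitions", "8")
age_counts = employee_df.groupBy("age").count()   # now shuffles into 8 partitions

# Option 2: keep the default 200, but coalesce the result afterwards
age_counts = employee_df.groupBy("age").count().coalesce(8)
```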
count() is also an action. Why didn't we consider it as a new job?
Hi Manish, in Databricks too, when groupBy() is invoked, does it create 200 tasks by default? How can we reduce the 200 tasks when using groupBy(), to optimize the Spark job?
There is a configuration that can be set. Just google how to set a fixed number of partitions after a join.
count() used after a groupBy is treated as a transformation, not an action.
I had read that 'tasks = number of cores'. If there are 2 cores, how many tasks will it create: 2 or 200?
Very good explanation, bro.
Hi Manish,
In the second job there were 203 tasks and in the 1st job there was 1, so are there 204 tasks in total in the complete application? I am a bit confused between 203 and 204.
Kindly clarify..
Thank you so much Manish
Is the read operation considered an action?
Sir, in line 14 we have .groupBy and .count.
.count is an action, right? Not sure if you missed it by mistake or if it doesn't count as an action? 🙁
I had the same doubt. Did you get the answer to this question? As per the UI also, only 2 jobs are shown, whereas count should be an action :(
Functions like sum(), count(), etc. in Spark behave differently in different contexts!
Here count() is used along with other aggregation functions in a transformation chain, where it acts like a transformation instead of an action!
great explanation :)
I remember the wide dependency you explained with shuffling.
You're looking smart today, bhaiya.
Start a playlist with guided projects, so that we can apply these things in real life.
Found any sources/channels for DE follow-along projects?
Great Explanation ❤
After repartition(3), will the 200 default partitions still show in the DAG, Sir?
In the video you made 4 stages, but how come only 3 stages were created in the Spark UI? One stage was created in the read itself, right?
Can an executor have 2 partitions, or does having 2 partitions mean they will be on two different machines?
An executor can run multiple tasks (meaning it can hold multiple partitions) depending on the number of cores it has!
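A small sketch of how that is usually configured; the numbers are arbitrary and are honoured by a cluster manager such as YARN or Kubernetes, not in local mode. With these settings, each executor can run up to 4 tasks, i.e. work on 4 partitions, at the same time:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("executor_demo")
         .config("spark.executor.instances", "2")   # 2 executors requested
         .config("spark.executor.cores", "4")       # each can run 4 tasks concurrently
         .config("spark.executor.memory", "4g")
         .getOrCreate())
```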
VERY VERY HELPFUL
When I ran it in a notebook, it gave 5 jobs as below, and not only 2, for this snippet of code. Can you explain?
Job 80 View(Stages: 1/1)
Stage 95: 1/1
Job 81 View(Stages: 1/1)
Stage 96: 1/1
Job 82 View(Stages: 1/1)
Stage 97: 1/1
Job 83 View(Stages: 1/1, 1 skipped)
Stage 98: 0/1 skipped
Stage 99: 2/2
Job 84 View(Stages: 1/1, 2 skipped)
Stage 100: 0/1 skipped
Stage 101: 0/2 skipped
Stage 102: 1/1
Same here, did you figure out why?
df.rdd.getNumPartitions(): Output = 1
df.repartition(5)
df.rdd.getNumPartitions(): Output = 1
Using Community Databricks, sir.
Yes, I don't see any issue. If you don't assign your repartitioned df to some variable, then you will get the same result.
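In other words, a minimal sketch (assuming df is the DataFrame from the notebook above): DataFrames are immutable, so repartition() returns a new DataFrame that has to be captured.

```python
df.repartition(5)                    # returns a NEW DataFrame; the result is discarded
print(df.rdd.getNumPartitions())     # still the old value, e.g. 1

df = df.repartition(5)               # capture the repartitioned DataFrame
print(df.rdd.getNumPartitions())     # now 5
```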
glad to see you brother
at 6:40, .count() is also an action right?
Functions like sum(), count(), etc. in Spark behave differently in different contexts!
Here count() is used along with other aggregation functions in a transformation chain, where it acts like a transformation instead of an action!
Where have you shared that data, sir? Please tell us.
Which link is it? Please provide it.
Why was count() not counted as an action while counting the jobs?
You will understand it in the upcoming lectures.
Functions like sum(), count(), etc. in Spark behave differently in different contexts!
Here count() is used along with other aggregation functions in a transformation chain, where it acts like a transformation instead of an action!
.count on a grouped dataset is a transformation, not an action, correct? If it were like employee_df.count(), then it would be an action.
yep!
Since filter is a narrow transformation, why is a new stage being created after the repartition?
same doubt
7:40
print is an action, so there should be 4 jobs in the given code, right? Correct me if I am wrong.
No, print() is not an action in Spark. But I believe getNumPartitions() is triggering a job though!
Bhai, job 2 will start from collect, right? So from after the read up to collect, it will all run as job 1, right?
Bhai, I'm using Spark 3.4.1, and when I group data using groupBy (I have 15 records in a dummy dataset) it creates 4 jobs to process 200 partitions. Why? Is this a recent enhancement? And not only in the latest version; I observed the same thing in Spark 3.2.1. Could you please explain this?
Bhai, a job is created on an action, but the number of actions != the number of jobs, because a job is created whenever a new RDD is needed. In groupBy we need to shuffle the data, and since RDDs are immutable, a new RDD has to be created after the shuffle. So whenever a new RDD needs to be created, a job is created. In your case: 1 job for the read, 1 job for the schema, 1 job for the shuffling, and 1 job for the display.
@@villagers_01 If that's the case, then shouldn't it also create a job after each of the repartition, filter, and select phases in Manish's code, since each of those creates a new RDD / new set of partitions?
Nicely explained
Can I know more about executors?
How many executors will there be in a worker node?
And
Does the number of executors depend on the number of cores in the worker node?
nice explanation.
Are there 200 default tasks in a groupBy even if there are only 3 distinct ages? If so, what will be in the rest of the 197 tasks (which age group will be there)?
Empty partitions/tasks??
Glad to see you, Manish... Bhai, any update on the project details?
Manish bhai, how many videos in total will there be in the theory and practical series?
thank you sir..
What would happen if there were one more narrow transformation after the filter, like filter --> flatMap --> select? How many tasks would that create?
No additional tasks, I believe!
In the code snippet shown at the beginning, count is also a job, right?
No, you will find out why in the upcoming lectures.
Functions like sum(), count(), etc. in Spark behave differently in different contexts!
Here count() is used along with other aggregation functions in a transformation chain, where it acts like a transformation instead of an action!
I have a question: if one job has 2 consecutive wide-dependency transformations, then 1 narrow dependency, and again 1 wide dependency, how many stages will be created? Suppose repartition, after that groupBy, then filter and then join; how many stages will this create?
same question
As he said in the video, each wide-dependency transformation creates a new stage, so I am guessing in your case it will create 3 stages!
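A sketch of such a chain, assuming two hypothetical DataFrames df(dept_id, salary) and dept_df(dept_id, dept_name) and a shuffle join rather than a broadcast join; each wide transformation (repartition, groupBy, join) introduces a shuffle, and every shuffle marks a stage boundary within the job:

```python
from pyspark.sql import functions as F

result = (df
          .repartition(10, "dept_id")               # wide: shuffle -> stage boundary
          .groupBy("dept_id")                       # wide: shuffle -> stage boundary
          .agg(F.sum("salary").alias("total_salary"))
          .filter(F.col("total_salary") > 100000)   # narrow: stays in the same stage
          .join(dept_df, on="dept_id")              # wide: shuffle -> stage boundary
          .collect())                               # action that triggers the job
```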
So if there is any groupBy, do we have to consider 200 tasks?
Yes, if AQE is disabled. If it is enabled, then the count depends on the data volume and the default parallelism.
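A quick sketch of the relevant settings, assuming an active SparkSession named spark:

```python
# With AQE disabled, a groupBy shuffle always uses the fixed count
# from spark.sql.shuffle.partitions (200 by default)
spark.conf.set("spark.sql.adaptive.enabled", "false")

# With AQE enabled, Spark can coalesce small shuffle partitions at runtime,
# so the number of post-shuffle tasks depends on the actual data volume
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```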
count would also be an action, right? So that means 3 jobs were created?
Functions like sum(), count(), etc. in Spark behave differently in different contexts!
Here count() is used along with other aggregation functions in a transformation chain, where it acts like a transformation instead of an action!
good job man
Bhaiya, you are great.
What if spark.sql.shuffle.partitions is set to some value? In that case, what will be the number of tasks in the groupBy stage?
Your set value, I believe!
Bro amazing explanation
excellent!!
The number of actions is equal to the number of jobs. In the mentioned code snippet there were three actions (read, count, collect). As per the theory, three job IDs should be created, but in the Spark UI only two jobs are created. Can you help me with this?
Why were three job IDs not created?
count is both a transformation and an action. In the given example it is working as a transformation, not as an action.
I will be uploading the aggregation video soon. There you will get to know more about count's behavior.
@@manish_kumar_1 Thanks for the prompt response. Sure, eagerly waiting for your new video.
Functions like sum(), count(), etc. in Spark behave differently in different contexts!
Here count() is used along with other aggregation functions in a transformation chain, where it acts like a transformation instead of an action!
Hi sir, I have one doubt: will collect create a task and a stage or not? Because you mentioned 203 tasks.
200 tasks read the shuffled data, and the same 200 tasks perform the count() aggregation. There are no additional tasks, as collect() just uses the results from the aggregation tasks. Basically, it is 200 tasks for everything (the groupBy aggregation and collect).
Do you have an English version of the videos?
Could you please upload the required files? I just want to run it and see for myself.
I have been waiting for your video. How many more days will it take for the Spark series to complete?
Hi @manish_kumar_1,
I have one question: in the wide transformation part you said that in the groupBy stage (stage 3) there will be 200 tasks corresponding to the 200 partitions. But can you tell me why these 200 partitions happen in the first place?
Any wide transformation (which involves shuffling) causes 200 partitions by default in Spark, as he said in the video!
No problem bhaiya, just please finish this series completely, because no one on YouTube has explained it in this much detail.
Hi Manish, why doesn't the collect() method create a new stage (stage 4) in Job 2, since it needs to send the data from the 200 partitions to the driver node?
Collect is an action, not a transformation.
@@manish_kumar_1 Thanks for the reply, Manish. What happens after the groupBy in this case? Spark transfers the data in the 200 partitions to the driver, right? Don't we need any tasks for that process? Thanks in advance.
@@venumyneni6696 I think you are missing some key points about the driver and executors. Please clear your basics: read multiple blogs or watch my videos in sequence.
Correct, but you said that for every action 1 stage will be created, so the total stages should be 5, @@manish_kumar_1
@@AAMIRKHAN-qy2cl correct! Have you confirmed this?
Where do I get the file's data from? Where have you provided the data?
Bro, how many more days will the Spark series take, and will you make a complete DE project with Spark at the end?
BTW, I have watched and implemented all your theory and practical videos. Great sharing ❤
Wow!
Hi Manish, count is also an action, and you have written count just after groupBy in the code snippet. Why is count not considered a job here?
You will get the explanation for this in the upcoming videos. count is both an action and a transformation; when it behaves as which one is explained in detail in the later videos.
Functions like sum(), count(), etc. in Spark behave differently in different contexts!
Here count() is used along with other aggregation functions in a transformation chain, where it acts like a transformation instead of an action!