🔔🔔 Please remember to subscribe to the channel folks. It really motivates me to make more such videos :)
Your content is really awesome without a second thought, and it gives a very advanced-level understanding of PySpark.
Thanks a bunch. To my knowledge, no one has explained Spark's explain function at this level of detail. Very in-depth information.
My brother, my brother! 😍 Now thousands of students will come to you, brother, but don't forget your very first student 😜
Very proud of you, brother... And I can guarantee everyone here that he is the best teacher there is ❤️
My work laptop doesn't allow Gmail login, so I usually do not comment or like. But for this, I searched for the same video on mobile just to like and comment. Brilliantly explained. Your channel is definitely underrated.
Thank you for all the videos.
Hey @mohitshrivastava4967, thank you for this note and gesture, it means a lot to me. Really appreciate it, brother :)
Proud of you brother, looking forward to more of such videos. Great job!
After looking for some time for the best material that truly explains this topic and digs deep enough, you clearly delivered. Thanks, Afaque.
Glad it was helpful, appreciate it :)
Simple and most effective explanation.
Appreciate it man :)
Amazing content. I am a newbie to Spark but I am hooked. Sir, please post the rest of the series; awaiting your next videos. Amazing teacher.
This takes me back to my YaarPadhade times. Great work, brother, much love!
rare content! please don't stop making these
Bro, I am a beginner but I was able to understand everything. Really great content, and your explanations were amazing too. Please continue making such great videos. Thanks a lot for sharing.
@thecodingmind9319 Thanks for the kind words, means a lot :)
Thanks for such an in-depth overview!! helps a lot to grow!!
Afaque, THANK YOU SO MUCH FOR THESE VIDEOS!!
They are so amazing for a fast paced learning experience.
Hope you upload much more soon!!
Thanks! From Brazil
Appreciate it, thanks man! :)
One of the best videos I have seen on Spark, waiting for your Spark Architecture Video
No one teaches complex things in such a detailed way like you do. No matter what, please spread your knowledge to the world; I am sure there are people who learn from you and will remember you as a master for life, people like me who have settled into an IT job.
This is one of the best videos about Spark I have seen recently!
It's great to see such useful content on Spark... and it's easier to understand clearly with your notes! You rock... Countless thanks!!
Explained the concept really well!
Buddy! You got a new sub here.
Loved your detailed explanation. I see no one explaining the query plan in this much detail, and I believe this is the right way of learning. I would love to see an entire Spark series.
Thank you @snehitvaddi for the kind appreciation. A full-fledged, in-depth course on Spark coming soon :)
@@afaqueahmad7117 Most awaited. Keep up the 🚀
My well-spent 40 minutes for today. Thanks for the knowledge sharing.
Brother, you make excellent content. Love your videos. Please keep it up. You have great teaching skills.
Thank you so much, brother!
Great content with practical knowledge. Hats off to you !!!
This is the most detailed explanation I have ever seen.
Appreciate it man @ridewithsuraj-zz9cc :)
By far the best content I have seen on the explain/query plan topic!!! Keep it up, brother. Good luck!
Glad, you liked it, thank you! :)
Beautifully explained. Many concepts got cleared up. Thanks a lot. Keep going.
You are a gem bro. The content that you bring here is terrific. ❤❤❤
Thanks man, @yashwantdhole7645. This means a lot!
This is really informative; such details are not even present in the O'Reilly Learning Spark book. Please continue to make such content. Needless to say, I have already subscribed.
This is pure gold, congrats bro , keep the good work
Thank you @garydiaz8886, really appreciate it! :)
Great explanation of such a complex topic, thanks and keep up the good work.
Thanks man @Dhawal-ld2mc :)
one of the best videos i came across on spark query plan explanation. Thank you! :)
Appreciate it @myl1566, thank you!
Awesome video as always. Would really appreciate more videos explaining how DAGs can be read.
Bro, this is really nice. I just love the way you teach, and the content is very, very good. Thank you so much, and tons of love.
Thank you @RaviSingh-dp6xc, appreciate it man, means a lot, thank you! :)
It's a great video with a great explanation. Awesome. Thank you for such a detailed explanation. Please keep doing such content.
Excellent Video
Thanks for the beautiful explanation
Appreciate it @Ilovefriendswebseries, glad to hear that :)
Absolute gem ❤❤ Would like to have a video on handling real-time scenarios (slow-running jobs, OOM errors, etc.).
I am sure that down the line, in a few years, you will cross 100k subscribers. Great content BTW.
Hey @CoolGuy , thanks man! Means a lot to me :)
Excellent content, please make more videos like this with a deep understanding of "how stuff works"... Highly appreciate it. Love from 🇵🇰
Thank you @MuhammadAhmad-do1sk for the appreciation, love from India :)
Bro, you dropped this👑
amazing explanation!
Appreciate it :)
Just by watching the first 15 minutes of your YouTube video, I am awed beyond words.
What a great explanation @afaqueahmad. Kudos to you!
Please make more videos solving real-time scenarios using PySpark, and on cluster configuration. Again, BIG THANKS!
Hey @VenuuMaadhav, thank you for the kind words, means a lot. More coming soon :)
Very useful video, man. Thanks for explaining things in so much detail; keep up the good work.
a lot of knowledge in just one video
Appreciate it @user-pq9tx6ui2t :)
Thank you so much for making this video. this is really very helpful.
Underrated pro max!
Very useful, and you explain complex things in an easy manner. Thanks, and expecting more videos from you.
Love the video! great content and presentation. You definitely earned a subscriber in me and probably my fellow DE friends. :)
Glad to know you liked the video and for being a subscriber, thanks man :)
"God bless you! Great video! Learned a lot"
Explanation is so good
Thank you @satyajitmohanty5039 :)
Great Content. Nice and Detailed!!
Thank you @shaheelsahoo8535, appreciate it :)
Excellent job 🙌
Thanks @prasadrajupericharla5545, appreciate it :)
This is really good, thanks so much for this explanation!
Great Video!
Appreciate it @jjayeshpawar, thank you!
Very Informative.... Thanks for sharing 🙂
Great tutorials
Appreciate it! :)
Just 10 minutes into this notebook and I am awed beyond words.
What a great explanation Afaque. Kudos to you!
Please make more videos solving real-time scenarios using the Spark UI, and one on cluster configuration too. Again, BIG THANKS!
Hi @mohitupadhayay1439, really appreciate the kind words, it means a lot. A lot coming soon :)
One of the cleanest explanations I have ever come across on the internals of Spark. Really appreciate all the effort you are putting into making these videos.
If you don't mind, may I know which text editor you are using when pasting the physical plan?
Many thanks for the kind words @venkatyelava8043, means a lot. On the text editor - I'm using Notion :)
I loved your explanation and understood it very well. Could you help me understand, at the 23-minute mark: if we have the join key as cid and group by region, how does hash partitioning work? Will it consider both?
Nice explanation
Please do more videos, bro. Love this one.
Thank you @varunparuchuri9544, really appreciate it :)
Great explanation!! Keep uploading such quality content, bro
Very Good explanation...Keep Going
Thank you!
Amazing content! Thank you for sharing!
Thank you @crazypri8, appreciate it :)
this is gold. Thank you very much!
@user-meowmeow1 Glad you found it helpful :)
Thank you for taking the time to create such an in-depth video on Spark plans. This is very helpful!
Would you also be able to explain Spark memory tuning?
How do we decide how many resources to allocate (driver memory, executor memory, number of executors, etc.) for a spark-submit?
Also data structure tuning and garbage collection tuning!
Thanks again!
Thanks for the kind words @crystalllake3158, and for the suggestion; currently the focus of the series is to cover all possible code-level optimizations. Resource-level optimizations will come much later; no plans for the upcoming few months :)
Thanks ! Please do keep uploading, love your videos !
Great explanation.
Great content brother. Please post more 😁
Thanks for the nice explanation. My question is regarding COALESCE: what happens when a partition exceeds the default size? Let's say 128 MB is the default partition size and we are coalescing to 3 partitions, but the volume of data is 512 MB.
How would Spark handle such a scenario?
Hey @ajaykiranchundi9979, the default partition size `spark.sql.files.maxPartitionBytes` applies at read time. In this case, 4 partitions will initially be read (512/128). After coalescing, each will have an approximate size of ~170 MB.
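A minimal PySpark sketch of that scenario (the file path and size are hypothetical, just to illustrate the partition counts):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-demo").getOrCreate()

# At read time, a ~512 MB file is split by spark.sql.files.maxPartitionBytes
# (128 MB by default), giving roughly 4 input partitions.
df = spark.read.parquet("/data/orders_512mb.parquet")  # hypothetical path
print(df.rdd.getNumPartitions())   # ~4

# coalesce(3) merges the existing partitions without a shuffle, leaving
# 3 partitions of roughly 512 MB / 3 ≈ 170 MB each.
df3 = df.coalesce(3)
print(df3.rdd.getNumPartitions())  # 3
```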
This was awesome!
Please do post videos very often
You are awesome man❤
Great explanation man! Thank you! What's the editor that you use in the video to read query plans?
Thanks @venkateshkannan7398, appreciate it. Using Notion :)
Good knowledge
Appreciate it, thank you @savitajade8425 :)
Great explanation! I love the simplicity of it! I wonder what app you use to share your Mac screen so you can annotate it with your iPad?
Thanks @kvin007! So, basically I join a Zoom meeting with myself and annotate, haha!
Great video, thanks for sharing. I definitely subscribed.
Can you please prepare a video showing the storage anatomy of data during the job execution cycle? I am sure there are many aspiring Spark students who may be confused about the idea of an RDD or DataFrame and how it accesses data through APIs (since Spark is in-memory computation) during job execution. It would help many upcoming Spark developers.
Hey @udaymmmmmmmmmm, I added a video recently on Spark Memory Management. It talks about the storage and responsibilities of each of the memory components during job execution. You may want to have a look at it :)
Link here: ua-cam.com/video/sXL1qgrPysg/v-deo.html
At the very end of the video (38:36), we see that the cast("int") filter is present in the parsed logical plan and the analyzed logical plan. I am a little confused as to when we refer to those plans. Can you please explain?
I have a doubt: when is the data distributed to the executors? Is it before or after scheduling the tasks, and who assigns the data to the executors?
Hey @sangu2227, this requires an understanding of transformations/actions and lazy evaluation in Spark. Spark doesn't do anything (either scheduling a task or distributing data) until an action is called.
The moment an action is invoked, Spark creates a logical -> physical plan and Spark's scheduler divides the work into tasks. Spark's driver and the cluster manager then distribute the data to the executors for processing :)
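A small sketch of that lazy behaviour (the numbers and column names are made up for illustration):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# These transformations only build up the logical plan; no tasks are scheduled
# and no data is shipped to executors yet.
df = spark.range(1_000_000)
doubled = df.withColumn("doubled", F.col("id") * 2)
filtered = doubled.filter(F.col("doubled") > 100)

# Only this action triggers planning, task scheduling, and distribution of the
# work to the executors.
print(filtered.count())
```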
quality content
Thank you!
You were too good!
Thank you!
Thanks for the content. When can we expect a new video?
Coming soon, in the next few days! :)
Hello @afaqueahmad7117, thanks for the great video. While explaining repartition, you mentioned you’ve a video on the AQE. Please can you link that as well?
Thanks @nijanthanvijayakumar, yes that video is upcoming in the next few days :)
Can't wait for that @@afaqueahmad7117
These YouTube videos are so much more helpful. Hands down one of the best ones that explain Spark performance tuning and internals in the simplest form possible. Cheers!
You mentioned that for coalesce(2) a shuffle will happen, but later you mentioned that a shuffle will not happen in the case of coalesce, hence no partitioning scheme. Could you please explain this in detail?
So, coalesce will only incur a shuffle if it's a very aggressive situation. If the objective can be achieved by merging (reducing) the partitions on the same executor, it will go ahead with that. In the case of coalesce(2), it's an aggressive reduction in the number of partitions, meaning that Spark has no other option but to move the partitions. As there were 3 executors (in the example I referenced in the video), even if it reduced the partitions on each executor to a single partition, it would end up with 3 partitions in total; therefore it incurs a shuffle to arrive at 2 final partitions :)
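If you want to verify this on your own data, one way (a rough sketch; the partition counts are arbitrary) is to compare the physical plans and look for an Exchange node, which marks a full shuffle:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-vs-repartition-plan").getOrCreate()

df = spark.range(0, 1_000_000, numPartitions=8)

# Compare the physical plans: an "Exchange" node marks a full shuffle.
df.coalesce(2).explain()
df.repartition(2).explain()
```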
@@afaqueahmad7117 Thanks for clarification.
What drawing board are you using for those notes?
Using "Notion" for text, "Nebo" on iPad for the diagrams
@@afaqueahmad7117 cool thx!
If it is doing local aggregation before shuffling the data, then why would it throw an out-of-memory error while taking the count of each key when the column has a huge number of distinct values?
Hi sir, you mentioned that you referred to AQE before. Can I get that link? I want to know more about AQE.
Yes, I will be releasing the video in the next few days. :)
@@afaqueahmad7117 Thank you sir.
In Exchange hashpartitioning, what is the significance of the number 200? What does it mean?
200 is the default number of shuffle partitions. You can find it in this table, under the property name "spark.sql.shuffle.partitions": spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options
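For reference, this is roughly how you would check or override it in a session (a sketch; the value 64 is just an example, not a recommendation):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-partitions-demo").getOrCreate()

# Defaults to "200" unless overridden.
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Shuffles after this point (joins, groupBy aggregations, ...) will produce
# 64 shuffle partitions instead of 200.
spark.conf.set("spark.sql.shuffle.partitions", "64")
```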
I am doing coalesce(1) and getting the error: Unable to acquire 65536 bytes of memory, got 0.
But when I do repartition(1), it works. Can you please explain what happens internally in this case?
The error you’re seeing when using coalesce(1) versus repartition(1) in Spark is related to the way these two operations handle memory and shuffling.
Here’s what’s happening internally with each:
coalesce(1)
> Purpose: coalesce is used to reduce the number of partitions in a DataFrame, ideally with minimal data movement across the cluster.
> Memory Handling: When you call coalesce(1), Spark attempts to reduce the number of partitions to one. However, unlike repartition, it does so by consolidating data without shuffling across all nodes. Because there is no shuffle boundary, the upstream work also collapses into a single task, concentrating the processing on one executor. This can cause an out-of-memory error if the data volume is too high to fit into one partition on a single node.
> Error Explanation: The error message, "Unable to acquire 65536 bytes of memory, got 0," indicates that Spark couldn’t allocate the requested memory for the partition consolidation due to the memory limitations on a node.
repartition(1)
> Purpose: repartition increases or decreases the partitions by evenly redistributing the data through a full shuffle, which can spread the load more evenly.
> Memory Handling: When you use repartition(1), Spark performs a full shuffle across all nodes, moving data as needed. This distribution helps avoid memory overload on a single node by balancing memory across the cluster, making it more reliable for large data volumes.
> Why it Worked: Because repartition involves a shuffle, it doesn’t place as much of a memory burden on any one executor, unlike coalesce. This allowed your code to avoid the memory allocation issue and complete successfully.
Key Differences
> Shuffle: repartition triggers a shuffle, coalesce does not (or minimizes shuffling).
> Memory Demand: coalesce is more efficient for reducing partitions only if the data size per partition is small enough to fit within the available memory. For large datasets or tight memory, repartition is often more reliable.
In summary, repartition(1) worked because it redistributed the data across the cluster rather than trying to load it all into a single node's memory.
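A rough way to inspect the difference yourself (a sketch only; the input path is hypothetical and the actual behaviour depends on your data volume and executor memory):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-file-write-demo").getOrCreate()

df = spark.read.parquet("/data/large_dataset")  # hypothetical path

# coalesce(1): no shuffle boundary, so the upstream work collapses into a
# single task on one executor -- prone to OOM for large inputs.
df.coalesce(1).write.mode("overwrite").parquet("/out/coalesce_one")

# repartition(1): the upstream stage keeps its parallelism and writes shuffle
# files; only the final stage reads them back as a single partition.
df.repartition(1).write.mode("overwrite").parquet("/out/repartition_one")
```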
Hi sir, I came across a doubt.
Consider an executor size of 1 GB per executor. We have 3 executors, and initially 3 GB of data gets distributed across the 3 executors, so each executor holds a 1 GB partition. After various transformations, we come across a requirement to decrease the number of partitions to 1, for which we would use repartition(1) or coalesce(1). In this scenario, all 3 partitions merge into 1 partition; each partition is approximately 1 GB in size, so collectively they are approximately 3 GB. With repartition(1) or coalesce(1), all 3 GB of data should sit on 1 executor that has a capacity of only 1 GB. Here the data exceeds the executor size. What happens in this scenario? Could you please make a video on this? Requesting, sir.
Hi @bhargaviakkineni, in the scenario you described above, where the resulting partition size (3 GB) exceeds the memory available on a single executor (1 GB), Spark will attempt to spill data to disk. The spill to disk keeps the application from crashing due to out-of-memory errors; however, there is a performance impact associated with it, because disk I/O is slower.
On a side note, as a best practice, it's also worth re-evaluating the need to write to a single partition. Avoid writing to a single partition, because it generally creates a bottleneck if the sizes are large. Try to balance the partitions against the resources of the cluster (executors/cores).
Hope that clarifies :)
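If the goal is a manageable number of output files rather than exactly one, a common pattern (a sketch; the sizes are illustrative and the path is hypothetical) is to derive the partition count from the data volume:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("balanced-write-demo").getOrCreate()
df = spark.read.parquet("/data/large_dataset")  # hypothetical path

# Aim for roughly 128 MB per output partition instead of forcing one partition.
target_partition_bytes = 128 * 1024 * 1024
estimated_input_bytes = 3 * 1024 ** 3  # ~3 GB, e.g. taken from a previous run's metrics

num_partitions = max(1, estimated_input_bytes // target_partition_bytes)
df.repartition(int(num_partitions)).write.mode("overwrite").parquet("/out/balanced")
```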
I love to work
Hi Afaque, how can I download the data files you are using? I want to try it hands on :)
Should be available here: github.com/afaqueahmad7117/spark-experiments :)
Bro please make more videos !!!
A local distinct on cust id doesn't make sense to me, and I couldn't understand it. How does it do a distinct count globally if the count is already computed? Also, the reasoning behind why cast doesn't allow predicate pushdown is not clearly explained; it's just stated as it's mentioned in the docs.
After taking the distinct of cust id and city locally and then globally, the data size is reduced massively.
An easy way to understand it: Spark effectively adds a df.dropDuplicates(A, B) before running groupBy(A).countDistinct(B).
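A minimal sketch of that idea (the column names and data are made up):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("count-distinct-demo").getOrCreate()

df = spark.createDataFrame(
    [("NY", 1), ("NY", 1), ("NY", 2), ("CA", 3), ("CA", 3)],
    ["city", "cust_id"],
)

# countDistinct: Spark de-duplicates (city, cust_id) pairs locally on each
# partition, shuffles the already-reduced rows, de-duplicates again globally,
# and only then counts per city.
df.groupBy("city").agg(F.countDistinct("cust_id").alias("n_customers")).explain()

# Conceptually equivalent to the dropDuplicates + count mentioned above:
df.dropDuplicates(["city", "cust_id"]).groupBy("city").count().explain()
```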
Hi Afaque.
A suggestion.
You could start from the beginning to connect the DOTS!
Like, if in your scenario we have an X-node machine with Y workers and Z executors, and you do REPARTITION and fit the data like this, then this could happen.
Otherwise the machine would sit idle, and so on.
Fab content
Great work buddy, keep it up... Love your content, very simple to understand @Afaque Ahmed
Thanks a ton!
Hi sir, please make a detailed course on Apache Spark which includes every aspect of Spark for the Data Engineer role. There are already a lot of beginner courses in the market, so please keep the course at an intermediate to advanced level. Also, please try to make videos in Hindi; it would be very helpful.