Informative and well explained. Keep posting 👍
Sure 👍
Excellent 👍
Sir,
I have watched many videos on this topic, but very few people were able to explain these concepts the way you did. This video tempted me to watch the full playlist, and I definitely will.
Thanks for sharing your knowledge and understanding with us.
🙏🙏🙌
Really informative, neat explanation. Thank you.
Nice explanation of the key points.
Another good effort for aspiring data engineering candidates. A sound grounding for interview preparation.
Thank You
Could you please explain whether I'm getting this right? As I understand it, a partition is a logical division of the data into chunks (the unit of work that Spark operates on).
So, for example, when we create an RDD with 4 partitions, does it mean the driver node will read the data, create the partitions, serialize them, and ship those partitions to the worker nodes (where they are deserialized) so that computations can run in parallel?
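That's close, but one nuance is worth adding: it depends on where the data comes from. With parallelize() the data really does start on the driver and is serialized out to the executors; with textFile() the driver only plans the splits and each executor reads its own partition directly from storage. A minimal PySpark sketch, assuming a local Spark setup (the file path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("partition-demo").getOrCreate()
sc = spark.sparkContext

# Driver-side collection: the driver serializes these elements and ships
# them out to the executors as 4 partitions.
rdd_from_driver = sc.parallelize(range(100), numSlices=4)
print(rdd_from_driver.getNumPartitions())  # 4

# File-based RDD: the driver only computes split boundaries from file
# metadata; each executor reads its assigned byte range itself.
rdd_from_file = sc.textFile("/tmp/big_text_file.txt", minPartitions=4)
print(rdd_from_file.getNumPartitions())
```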
All your videos on Spark are good. Can you number them in the order they should be watched, from first to last?
We will try to do that. Thanks for watching the videos.
Can you please provide the download link for the CDH you are using?
Per my understanding, the driver sends the logic or program to each executor so that it reads only its given partition of the data. My doubt is how the driver node creates those instructions, since it does not know exactly what data is present in the file, especially if it is a big text file with no columns, keys, or indexes. How does it make sure that all the data is read by the different executors with no overlaps?
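As far as I understand, the driver never needs to look at the data itself: it takes the file's total size from filesystem metadata and cuts it into byte ranges. Overlaps are avoided by a simple convention (this mirrors the rule used by Hadoop's LineRecordReader, which sc.textFile builds on; the sketch below is illustrative Python, not Spark's actual code): every reader skips the first, possibly partial, line of its range unless it starts at offset 0, and reads past its end boundary to finish the last line it started.

```python
import os

def read_split(path, start, end):
    """Read the lines 'owned' by byte range [start, end]: skip the first
    (possibly partial) line unless starting at offset 0, and read past
    `end` to finish the last line that begins at or before it."""
    lines = []
    with open(path, "rb") as f:
        f.seek(start)
        if start != 0:
            f.readline()  # the previous split finishes this line
        while f.tell() <= end:
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip(b"\n"))
    return lines

# The driver's side of the job: plan non-overlapping byte ranges from the
# file size alone -- no data is read here. (Path is hypothetical.)
path = "/tmp/big_text_file.txt"
size = os.path.getsize(path)
n = 4
bounds = [size * i // n for i in range(n + 1)]
parts = [read_split(path, bounds[i], bounds[i + 1]) for i in range(n)]
```

Each executor then runs the equivalent of read_split on its own range, so no coordination is needed beyond the offsets the driver computed.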
Thank you
Can you explain this question: how do you move all partitions to a single node?
Are you asking about reducing/increasing the number of partitions? Then you can try repartition() or coalesce(). Remember that repartition() works for both increasing and decreasing the number of partitions, but coalesce() can only reduce it.
We can use df.coalesce(1) instead of df.repartition(1), since coalesce() involves little or no shuffle while repartition() performs a full shuffle of the data. Minimal shuffling of data is preferred.
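A small sketch of the difference, assuming a local PySpark session (the DataFrame is synthetic):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("coalesce-vs-repartition").getOrCreate()
df = spark.range(1_000_000).repartition(8)  # start with 8 partitions

# coalesce(1): narrow dependency -- existing partitions are merged on
# their current executors, so little or no data is shuffled.
one_partition_cheap = df.coalesce(1)

# repartition(1): wide dependency -- every row is shuffled across the
# cluster into the single target partition.
one_partition_shuffled = df.repartition(1)

print(one_partition_cheap.rdd.getNumPartitions())     # 1
print(one_partition_shuffled.rdd.getNumPartitions())  # 1
```

One caveat: coalesce(1) can also collapse the parallelism of upstream stages, so repartition(1) is occasionally the better choice when the preceding work is heavy.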