Great explanation. Could you please create more videos on Spark?
Will do!
Coalesce outperforms repartition in most cases. In one of my projects I was dealing with skewed data and needed to compact it into a single partition for a downstream application, and from there load it into Redshift. The problem arose when I used coalesce instead of repartition: a 1 hr job took 1.45 hrs due to uneven distribution. The job was stuck for a straight 45 mins, as I saw from the DAG. I went to the documentation and found out that coalesce puts only as many compute nodes to work as the number of partitions you ask for, which drastically reduces parallelism. Repartition distributes the data evenly because it sends rows across the partitions in round-robin fashion. Using repartition brought that step down from 45 mins to 8 mins, which is massive.
There are many use cases where repartition is a better choice. When you have a large data set and a complex operation other than count, coalescing down to one partition cannot take advantage of parallelism: only a single task is launched, so the job can take far longer to finish, whereas repartition runs the upstream work in parallel, one task per partition, and is much faster. As a matter of fact, if coalesce is the last step of the pipeline, the whole pipeline runs in a single task. Be aware!
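A minimal sketch of the pitfall described in this comment, assuming a local Spark session with 4 cores and an 8-partition RDD. The names `SingleTaskPitfall`, `expensive` and `build` are made up for illustration, and `expensive` is only a stand-in for a genuinely costly transformation. Printing `toDebugString` lets you compare the two lineages: the coalesce version has no shuffle boundary, so the map is pulled into the single final task, while the repartition version shows a shuffle, so the map still runs as 8 parallel tasks.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SingleTaskPitfall {
  // Hypothetical stand-in for an expensive per-record computation.
  def expensive(n: Int): Long = n.toLong * n

  // Builds the same pipeline twice, ending in coalesce(1) and repartition(1).
  def build(sc: SparkContext): (RDD[Long], RDD[Long]) = {
    val numbers = sc.parallelize(1 to 1000000, 8)
    // Narrow dependency: map and coalesce end up in one stage with one task.
    val viaCoalesce = numbers.map(expensive).coalesce(1)
    // Shuffle: map runs in 8 tasks, then one task collects the result.
    val viaRepartition = numbers.map(expensive).repartition(1)
    (viaCoalesce, viaRepartition)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("single-task-pitfall")
      .master("local[4]") // assumption: local run with 4 cores
      .getOrCreate()

    val (viaCoalesce, viaRepartition) = build(spark.sparkContext)

    // Compare the lineages; only the repartition one contains a shuffle.
    println(viaCoalesce.toDebugString)
    println(viaRepartition.toDebugString)

    // Both end with a single partition; the difference is upstream.
    println(viaCoalesce.getNumPartitions)
    println(viaRepartition.getNumPartitions)

    spark.stop()
  }
}
```

Both pipelines end in one partition, so the output file layout is the same. What changes is how many tasks do the upstream work, which you can also confirm in the Spark UI.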
Indeed, that's not to say that coalesce is always better. We'll do a deeper dive into the tradeoffs in a future video.
Thanks a lot Sir, your videos pulled me away from Java and Python to Scala👍. Could you please spend some time creating a video on how to add dependencies in build.sbt?
Will do - there's a lot of content coming soon!
Thank you!
How do we determine the number of partitions for repartition or coalesce?
numbers.repartition(n) or numbers.coalesce(n) - is there any calculation that can be done to come up with a number suitable for the operation?
There is no one perfect number - this depends on the shape of your data and what you want to do with it.
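As a starting point only, here is a rough sketch of one common rule of thumb, not a formula from the video. It takes the larger of two estimates: the data size divided by a target partition size, and a small multiple of the available cores (the Spark tuning guide suggests 2-3 tasks per CPU core). The function name `suggestPartitions`, its parameters, and the 128 MB target are all assumptions for illustration; you would still need to measure and adjust for your own data.

```scala
object PartitionEstimate {
  // Hypothetical helper: a first guess for n in repartition(n) / coalesce(n).
  // totalBytes           - estimated size of the data set
  // targetPartitionBytes - desired size per partition (assumption, e.g. 128 MB)
  // totalCores           - cores available to the job
  // tasksPerCore         - multiplier, 2-3 per the Spark tuning guide
  def suggestPartitions(
      totalBytes: Long,
      targetPartitionBytes: Long,
      totalCores: Int,
      tasksPerCore: Int = 2
  ): Int = {
    require(totalBytes >= 0, "totalBytes must be non-negative")
    require(targetPartitionBytes > 0, "targetPartitionBytes must be positive")
    require(totalCores > 0 && tasksPerCore > 0, "cores and multiplier must be positive")

    // Enough partitions to keep each one near the target size.
    val bySize = math.ceil(totalBytes.toDouble / targetPartitionBytes).toInt
    // Enough partitions to keep every core busy.
    val byCores = totalCores * tasksPerCore

    math.max(1, math.max(bySize, byCores))
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    // 10 GB of data, 128 MB target partitions, 16 cores.
    println(suggestPartitions(10L * 1024L * mb, 128L * mb, 16))
  }
}
```

Skewed keys, wide rows, or expensive per-record work can all push the right number well away from this estimate.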
Thank you. But in which scenarios should we go for repartition, and in which for coalesce? Please explain.
We cannot increase the number of partitions using coalesce. @Rock the JVM
Yes you can on an RDD, if you pass shuffle = true, and in that case it's the same as a repartition.
Fun fact: repartition is implemented in terms of coalesce.
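To make this exchange concrete, here is a small sketch using the RDD API, assuming a local Spark session and an RDD with 8 partitions (the object name `CoalesceGrowth` and the method `counts` are made up for illustration). Without a shuffle, `coalesce` can only merge existing partitions, so asking for more leaves the count where it was. With `shuffle = true` it can grow the count, and that is what `repartition` does on an RDD.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object CoalesceGrowth {
  // Returns the partition counts after four different calls on an 8-partition RDD.
  def counts(sc: SparkContext): Seq[Int] = {
    val numbers = sc.parallelize(1 to 1000000, 8)
    Seq(
      numbers.coalesce(2).getNumPartitions,                  // shrinks to 2
      numbers.coalesce(16).getNumPartitions,                 // no shuffle: stays at 8
      numbers.coalesce(16, shuffle = true).getNumPartitions, // shuffle: grows to 16
      numbers.repartition(16).getNumPartitions               // same effect: 16
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("coalesce-growth")
      .master("local[4]") // assumption: local run
      .getOrCreate()

    println(counts(spark.sparkContext))

    spark.stop()
  }
}
```

Note that this applies to RDDs. On a DataFrame or Dataset, `coalesce` takes no shuffle flag and will not increase the number of partitions.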