Repartition vs Coalesce in Apache Spark | Rock the JVM

  • Published Jan 21, 2025

COMMENTS • 13

  • @subimalkhatua2886 2 years ago +2

    Coalesce outperforms repartition in most cases. However, in one of my projects I was dealing with skewed data and had to compact it into a single partition for a downstream application, and from there load it into Redshift. When I used coalesce instead of repartition, a 1 hr job took 1.45 hrs because of the uneven distribution; the DAG showed the job stuck for a straight 45 mins. From the documentation I found out that coalesce can run the computation on only as many nodes as the number of partitions you ask for, which drastically reduces parallelism. Repartition distributes the data evenly because it sends rows to the target partitions in round-robin fashion. After switching to repartition, the time dropped from 45 mins to 8 mins, which is massive.
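The skew effect described in this comment can be sketched with a toy model in plain Python. This is not Spark code: the functions `coalesce` and `repartition` below are hypothetical stand-ins that only mimic the general idea (coalesce merges whole existing partitions, repartition redistributes rows round-robin), and the partition sizes are made up for illustration.

```python
def coalesce(partitions, n):
    """Toy model: merge whole parent partitions into at most n groups.

    Rows never move between groups, so skew in the input survives.
    """
    n = min(n, len(partitions))  # merging cannot create more partitions
    merged = []
    for i in range(n):
        start = i * len(partitions) // n
        end = (i + 1) * len(partitions) // n
        merged.append([row for part in partitions[start:end] for row in part])
    return merged


def repartition(partitions, n):
    """Toy model: send every row to the next target partition in turn."""
    out = [[] for _ in range(n)]
    position = 0
    for part in partitions:
        for row in part:
            out[position % n].append(row)
            position += 1
    return out


# Four input partitions with heavy skew (hypothetical sizes: 900, 50, 30, 20 rows).
skewed = [
    list(range(0, 900)),
    list(range(900, 950)),
    list(range(950, 980)),
    list(range(980, 1000)),
]

print([len(p) for p in coalesce(skewed, 2)])     # [950, 50]
print([len(p) for p in repartition(skewed, 2)])  # [500, 500]
```

In this model the merged partitions stay as lopsided as their inputs, while the round-robin version evens them out at the cost of touching every row, which in Spark means a shuffle.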

  • @heenagirdher6443 2 years ago +1

    Great explanation. Could you please create more videos on Spark?

  • @satyadevanwubhayavedantapu4860 3 years ago +2

    Thank you!
    How do we determine the number of partitions to pass to repartition or coalesce?
    For numbers.repartition(n) or numbers.coalesce(n), is there any calculation that yields a number suitable for the operation?

    • @rockthejvm 3 years ago

      There is no one perfect number - this depends on the shape of your data and what you want to do with it.
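As the reply says, there is no single right answer. One common starting point, which is an assumption here and not something stated in the video, is to aim for a target size per partition and divide the total data size by it. The sketch below uses plain Python; `suggest_partitions` is a hypothetical helper, and the 128 MiB default is only a rule of thumb (it happens to match Spark's default for `spark.sql.files.maxPartitionBytes`).

```python
import math


def suggest_partitions(total_bytes,
                       target_partition_bytes=128 * 1024 * 1024,
                       min_partitions=1):
    """Rough starting point: data size divided by a target partition size.

    min_partitions can be set to the number of available cores so that
    small datasets still keep every core busy.
    """
    by_size = math.ceil(total_bytes / target_partition_bytes)
    return max(min_partitions, by_size)


GIB = 1024 ** 3
print(suggest_partitions(10 * GIB))                    # 80
print(suggest_partitions(1 * GIB, min_partitions=16))  # 16
```

Treat the result as a first guess to measure against, since skew and the cost of the downstream operation can easily matter more than raw size.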

  • @prasadvenkataramasatyanand5559 4 years ago

    Thank you. But in which scenarios should we choose repartition, and in which coalesce? Please explain.

  • @seanxhuo 4 years ago +3

    There are many use cases where repartition is the better choice. When you have a large dataset and an operation more complex than count, calling coalesce (for example, down to one partition) cannot take advantage of parallelism, i.e. only a single task is launched, so the job can take far longer to finish.
    Repartition, by contrast, runs the upstream work in parallel across the existing partitions and can be much faster. As a matter of fact, if such a coalesce is the last step of the pipeline, the whole pipeline runs in a single task. Be aware!

    • @rockthejvm 4 years ago

      Indeed, that's not to say that coalesce is always better. We'll do a deeper dive into the tradeoffs in a future video.

  • @SriniVasan-ml6we 4 years ago +2

    Thanks a lot, sir, your videos are pulling me away from Java and Python to Scala 👍. Could you please spend some time creating a video on how to add dependencies in build.sbt?

    • @rockthejvm 4 years ago +1

      Will do - there's a lot of content coming soon!

  • @clasomblog8881 3 years ago

    We cannot increase the number of partitions using coalesce. @Rock the JVM

    • @rockthejvm 3 years ago

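      Yes you can, if you pass shuffle = true to coalesce on an RDD, and in that case it's the same as a repartition. Without a shuffle, coalesce can only keep or reduce the partition count.
      Fun fact: on RDDs, repartition is implemented in terms of coalesce.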
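The relationship in this reply can be sketched with a toy model in plain Python that tracks only partition sizes. It is not Spark code: it mirrors the shape of the RDD API, where `coalesce(numPartitions, shuffle)` takes a shuffle flag that defaults to false and `repartition(numPartitions)` calls `coalesce` with the shuffle enabled. The even split under shuffle is an idealization, and the sizes are made up.

```python
def coalesce(sizes, n, shuffle=False):
    """Toy model of partition sizes after coalescing to n partitions."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if shuffle:
        # Idealized shuffle: rows are spread evenly, so n may exceed len(sizes).
        base, extra = divmod(sum(sizes), n)
        return [base + (1 if i < extra else 0) for i in range(n)]
    # No shuffle: whole partitions are merged, so the count cannot grow.
    n = min(n, len(sizes))
    return [
        sum(sizes[i * len(sizes) // n:(i + 1) * len(sizes) // n])
        for i in range(n)
    ]


def repartition(sizes, n):
    """In this model, as in the RDD API, repartition is coalesce with a shuffle."""
    return coalesce(sizes, n, shuffle=True)


sizes = [900, 50, 30, 20]  # hypothetical row counts per partition
print(coalesce(sizes, 8))                # [900, 50, 30, 20]
print(coalesce(sizes, 8, shuffle=True))  # [125, 125, 125, 125, 125, 125, 125, 125]
print(repartition(sizes, 8))             # [125, 125, 125, 125, 125, 125, 125, 125]
```

Asking for eight partitions without a shuffle leaves the four original partitions untouched, which is the behaviour the parent comment describes; with the shuffle enabled the count grows and the result matches `repartition`.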