19 Understand and Optimize Shuffle in Spark

  • Published 11 Sep 2024

COMMENTS • 10

  • @anveshkonda8334 • 1 month ago

    Thanks a lot for sharing. It would be very helpful if you added the data directory to the GitHub repo.

    • @easewithdata • 1 month ago

      Some data files are too big to upload to GitHub. Most of the data is available at github.com/subhamkharwal/pyspark-zero-to-hero/tree/master/datasets

  • @at-cv9ky • 7 months ago

    Great explanation! And the article in the comments section is very good.

  • @adulterrier • 27 days ago

    Hi @easewithdata, I am using your Docker cluster setup. There is a folder called ease-with-apache-spark. Where can I find this series? It goes into far more depth.

    • @easewithdata • 27 days ago

      Yes, that series contains more advanced articles on Spark. It is part of the Medium article series:
      subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30

  • @mahendranarayana1744 • 1 month ago

    Great explanation, thank you.
    But how would we know the exact (or at least best) "spark.sql.shuffle.partitions" to configure at run time? The volume of data is going to change with each run/day.

    • @easewithdata • 1 month ago

      Yes, this is where AQE helps. Even if you have a partition setting of 200, AQE will coalesce unnecessary partitions with no data, so you don't have to tune the partition setting manually.
      This video was designed to explain how shuffle affects your job's performance and, if required, how you can tune it manually. Always try to set the shuffle partitions to a multiple of the parallel cores/tasks you have in your cluster.
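The rule of thumb in the reply above can be sketched as a small helper. This is a hedged illustration: the executor counts, the factor of 2, and the function name are assumptions for the example, not values from the video.

```python
# Sketch of the reply's advice: pick spark.sql.shuffle.partitions as a
# multiple of the cluster's total parallel task slots. With AQE enabled
# (spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled),
# Spark then coalesces any near-empty partitions at run time.

def shuffle_partitions(num_executors: int, cores_per_executor: int,
                       factor: int = 2) -> int:
    """Return a partition count that is a multiple of the parallel task slots."""
    parallel_tasks = num_executors * cores_per_executor
    return parallel_tasks * factor

# e.g. 4 executors with 4 cores each -> 16 parallel tasks -> 32 partitions
print(shuffle_partitions(4, 4))
```

In PySpark the result would typically be applied with `spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions(4, 4)))`.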

  • @sarthaks • 8 months ago

    Regarding your statement "to avoid un-necessary shuffle wherever necessary", can you give some examples or scenarios?

    • @easewithdata • 8 months ago

      Check out this article for an example of unnecessary use of shuffle:
      blog.devgenius.io/pyspark-worst-use-of-window-functions-f646754255d2

    • @sarthaks • 8 months ago

      @easewithdata Very useful, thanks for sharing the details!