From Query Plan to Performance: Supercharging your Apache Spark Queries using the Spark UI SQL Tab

  • Published 29 Aug 2024
  • The SQL tab in the Spark UI provides a lot of information for analysing your Spark queries, ranging from the query plan to all associated statistics. However, many new Spark practitioners are overwhelmed by the information presented and have trouble using it to their benefit. In this talk we give a gentle introduction to reading the SQL tab. We first go over the common Spark operations, such as scans, projects, filters, aggregations and joins, and how they relate to the Spark code you write. In the second part of the talk we show how to read the associated statistics to pinpoint performance bottlenecks.
    After attending this session you will have a better grasp of query plans and the SQL tab, and will be able to use this knowledge to improve the performance of your Spark queries.
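
One way to make "reading the statistics" concrete: many metrics in the SQL tab are reported per operator as a minimum, median and maximum across tasks, and a maximum far above the median is a common hint of data skew. The sketch below is only an illustration of that idea in plain Python. The task durations, the function names and the threshold of 5.0 are all hypothetical and do not come from the talk or from Spark itself.

```python
from statistics import median


def summarize(values):
    """Return (min, median, max) for a list of per-task metric values."""
    return min(values), median(values), max(values)


def skew_ratio(values):
    """Ratio of the largest task value to the median task value.

    A ratio close to 1.0 means the work is spread evenly across tasks.
    """
    _, med, largest = summarize(values)
    return largest / med if med else float("inf")


# Arbitrary illustrative cut-off (an assumption, not a Spark setting).
SKEW_THRESHOLD = 5.0


def is_skewed(values, threshold=SKEW_THRESHOLD):
    """Flag a metric whose slowest task is far above the median task."""
    return skew_ratio(values) > threshold


# Hypothetical per-task durations in milliseconds for two operators.
even_stage = [100, 110, 95, 105]
skewed_stage = [100, 110, 95, 2000]

if __name__ == "__main__":
    for name, durations in [("even", even_stage), ("skewed", skewed_stage)]:
        print(name, summarize(durations), "skewed:", is_skewed(durations))
```

When one task dominates like this, the usual next step is to look at the operator just upstream (often a join or aggregation key) to see which key carries most of the rows.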
    About:
    Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
    Read more here: databricks.com...

COMMENTS • 7

  • @viswanathana3759 7 months ago

    Awesome presentation. Really useful

  • @Sathishkumar-rl7gj 2 years ago +1

    Thanks much !!! Very useful

  • @anirvansen2941 3 years ago +1

    Awesome presentation :)

  • @Learn2Share786 10 months ago

    Is there a repository with real-world examples of badly written vs. well-written Spark SQL?

  • @aviyehuda 3 years ago

    Why is HashMergeJoin not mentioned in the presentation?

  • @aviyehuda 3 years ago

    Why is a Spark query translated into multiple Spark jobs?

    • @user-mx7mc7sv2q 2 years ago

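      A task is the unit of work executed by an executor on the cluster. A query is analyzed and planned, and the work is then submitted as one or more jobs. Every job is split into stages at shuffle boundaries, according to the transformations in the query, and every stage is split into multiple tasks that run in parallel, with the operators inside a stage pipelined for efficiency. A single query can launch several jobs, for example when a broadcast join first collects the small table in a separate job.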
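
To illustrate the hierarchy discussed in this thread, here is a toy model in plain Python, not Spark's actual scheduler. It assumes a simplified linear list of operator names in which `Exchange` marks a shuffle, cuts the list into stages at each shuffle, and counts one task per partition per stage. The function names, the operator list and the partition counts are hypothetical.

```python
def split_into_stages(operators, boundary="Exchange"):
    """Cut a linear operator list into stages at each shuffle boundary.

    Toy simplification: the boundary operator itself is dropped, whereas a
    real shuffle has a write side in one stage and a read side in the next.
    """
    stages, current = [], []
    for op in operators:
        if op == boundary:
            if current:
                stages.append(current)
                current = []
        else:
            current.append(op)
    if current:
        stages.append(current)
    return stages


def count_tasks(partitions_per_stage):
    """Each stage runs one task per partition, so tasks add up per stage."""
    return sum(partitions_per_stage)


# Hypothetical plan for a filter followed by a group-by aggregation.
plan = ["Scan", "Filter", "Project", "HashAggregate", "Exchange", "HashAggregate"]
stages = split_into_stages(plan)

if __name__ == "__main__":
    for i, stage in enumerate(stages):
        print(f"stage {i}: {' -> '.join(stage)}")
    # Assumed partition counts: 8 input partitions, 4 shuffle partitions.
    print("total tasks:", count_tasks([8, 4]))
```

In this model the operators before the shuffle are pipelined inside one stage, and the final aggregation runs in a second stage once the shuffled data is available.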