Advancing Spark - Crazy Performance with Spark 3 Adaptive Query Execution

Поділитися
Вставка
  • Опубліковано 10 вер 2024
  • One of the big announcements from Spark 3.0 was the Adaptive Query Execution feature... but noone seems to be celebrating it as much as Simon! In this video. Simon walks through what makes him just so excited about AQE and what it could mean to you!
    (also - yes, it's Adaptive Query Execution, not Engine - Simon's just too excited here...)
    Learn more about AQE at the Databricks blog here: databricks.com...
    Find the demo we used here: docs.databrick...
    And, as always, don't forget to like and subscribe if you want more Databricks, Spark & Azure Analytics content from Advancing Analytics
    www.advancinga...

КОМЕНТАРІ • 21

  • @carltonpatterson5539
    @carltonpatterson5539 4 роки тому +5

    This is the most enjoyable video on Spark Execution plans - I never thought in my wildest dreams I would enjoy listening to someone talk about data skew, execution plans and query engines.
    I definitely need to follow this guy

  • @JustLikeBuildingThings
    @JustLikeBuildingThings Місяць тому

    One thing to note at least what I was told on the dbx performance course is that when AQE is on, shuffle.partitions setting still acts as your ceiling.

  • @carltonpatterson5539
    @carltonpatterson5539 4 роки тому

    Oops!, I should not have said engine ... I meant execution :-)

  • @mohitsanghai5455
    @mohitsanghai5455 3 роки тому

    Wow...Just Perfect...Subscribed.

  • @akhilannan
    @akhilannan 4 роки тому

    Exactly what I needed! Thanks for this video! I was struggling with building a big datamart with lots of joins to huge tables (600m +). Even with a 8 node 56GB cluster, it didn't finish after 1.5hrs and I had to cancel the query. As per explain plan it was doing lots of Sort Mearge joins, and execution hangs with 1 executor doing all the work. I even tried computing table /column stats still didn't help. With Adaptive Query Execution the SQL ran in 28min!!!

    • @AdvancingAnalytics
      @AdvancingAnalytics  4 роки тому +1

      Fantastic news - glad it helped! I expect we'll see quite a few similar stories as people discover the config option, switch it on and watch their queries suddenly speed up! Definitely need to investigate to see if there are any potential downsides, but looks like a winner so far!
      Simon

    • @GhernieM
      @GhernieM 3 роки тому +1

      @@AdvancingAnalytics Hi, have you found any downsides after 9 months of use?

  • @radhsri
    @radhsri 4 роки тому

    Hi, what exactly is 26 here? I get the 200 default partitions that the clusters create with the data, but not sure of what the "26" is..thank you..

    • @AdvancingAnalytics
      @AdvancingAnalytics  4 роки тому +1

      Hey - Been a while since I ran the demo, but from what I recall the "sales" table that I'd saved down was from a dataframe with 26 RDD blocks, so had created 26 parquet files within the table. In any of the examples going back to that table, it needs 26 tasks to read each of those files, before we can do any of the actual shuffle/exchange steps that AQE is actually helping us with.
      Simon

  • @shakezi
    @shakezi 2 роки тому

    Thanks for the video!
    Any idea why Spark AQE creates multiple jobs, instead of multiple stages in the past...?
    You've mentioned the "jobs" but with no explanation... :)

  • @Dreamer-xj3ms
    @Dreamer-xj3ms 2 місяці тому

    Great video.
    Do I need to turn it on manually in 2024.

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 місяці тому

      Nope - turned on by default since Spark 3.2.0 (released end of 2021)!

  • @richardcarpenter6775
    @richardcarpenter6775 4 роки тому

    Great video and thanks for sharing!! I'm currently using Delta Lake on Databricks 2.4.5 and considering migrating to 3.x, do you know if AQE will bring speed enhancements when working with Delta files?

    • @AdvancingAnalytics
      @AdvancingAnalytics  4 роки тому

      Absolutely - the stats collected by Delta gives you some speed boosts from dynamic partition pruning in Spark 3, which is a huge boost in itself. AQE should help anytime there's a shuffle/exchange - so if you're joining tables, performing wide transformations etc, you should see the benefits.
      Only gotcha I found when jumping up from spark 2.4 to 3.x was the shift in date compatibility (ie: strict case sensitivity on date format strings). Otherwise, it was a pain-free upgrade with a few nice performance gains.
      Simon

    • @richardcarpenter6775
      @richardcarpenter6775 4 роки тому

      @@AdvancingAnalytics Thanks for this, much appreciated.

  • @WKhan-fh2pp
    @WKhan-fh2pp 3 роки тому

    I have a window operation on 8M records and even AQE is enabled but it is causing data skewness issue, any idea how to resolve it.

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 роки тому +1

      That's always gonna be an "it depends" :)
      If the window is partitioned by a skewed key, and there's no way you can break the partition down another way, you're going to get elements of skew during the stage that performs that transform. You could potentially use bucketing to collect some of the smaller partitions together - but if the issue is around one partition being way too big, it's not going to help too much.
      That said, 8M records isn't a crazy amount of data to just brute force through, unless it's an issue trying to serve data directly into dashboards etc that needs the faster execution time? If that's the case, you can always go for a larger worker size, but fewer nodes, so at least you're not hitting any memory boundaries?

  • @mohitsanghai5455
    @mohitsanghai5455 3 роки тому

    @Advanced Analytics Can you make a video on Dynamic Partition Pruning in Spark ? It is quite confusing

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 роки тому

      Yep, there's a DPP vid from a while ago - ua-cam.com/video/-86iMCKeYxI/v-deo.html.
      Let us know if that answers your questions!

    • @mohitsanghai5455
      @mohitsanghai5455 3 роки тому

      @@AdvancingAnalytics Yes..that was a great video.

    • @mohitsanghai5455
      @mohitsanghai5455 3 роки тому

      @@AdvancingAnalytics I have added few questions in the DPP vid. Can u please have a look at them