18 Understand DAG, Explain Plans & Spark Shuffle with Tasks

  • Published 11 Sep 2024

COMMENTS • 39

  • @sureshraina321 · 9 months ago +2

    Omg, this is serious stuff. I'm sure no online tutors are teaching in this much depth, and I'm glad I found your channel 😍

    • @easewithdata · 9 months ago +1

      Thank you 💗 Make sure to share with your network and tag us.

  • @sarthaks · 8 months ago +1

    Very thorough, detailed explanation of Spark internals... never seen such content before. Nice job!!

    • @easewithdata · 8 months ago

      Much appreciated!

    • @sarthaks · 8 months ago

      I have two follow-up questions:
      1. How can we avoid shuffle? Does shuffle really impact performance in all scenarios?
      2. Given a DAG, or say the explain plan, what are the areas or steps one needs to take care of when doing performance optimization? @easewithdata

    • @easewithdata · 8 months ago

      1. Yes, shuffle impacts performance, but we cannot avoid it in all cases (consider aggregations). We need to optimize those steps so that they behave well.
      2. There are many steps to optimizing a job in Spark. The first step starts with the explain plan/DAG; it only shows how Spark is going to execute the code. Each step can have different bottlenecks that you need to look out for (e.g. skewness or spill). See the sketch below for spotting shuffles in a plan.
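      For illustration, a minimal PySpark sketch for spotting shuffles in a plan (a hedged example, not code from the video; the dataframe and column names are made up):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # A hypothetical aggregation that forces a shuffle
        df = spark.range(1_000_000).withColumnRenamed("id", "order_id")
        agg = df.groupBy((df.order_id % 10).alias("bucket")).count()

        # "Exchange" nodes in the physical plan mark shuffle boundaries;
        # those are the steps to check for skew and spill in the Spark UI
        agg.explain("formatted")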

  • @chandrasekharreddy3617 · 9 months ago

    Perfect, detailed explanation of the DAG in an easy-to-understand manner. Thanks, now I feel more confident than before.

  • @aashisharora3536 · 4 months ago

    Why didn't I read your blog earlier and land on your channel? Your explanation is superb, man. Please keep posting videos.

    • @easewithdata · 4 months ago

      Thank you, please make sure to share with your network as well.

  • @Ravi_Teja_Padala_tAlKs · 8 months ago

    Seriously Super bro. Keep going and Thanks for all this 🎉

    • @easewithdata · 8 months ago +1

      Thanks 👍 Make sure to share with your network as well ☺️

  • @ravulapallivenkatagurnadha9605 · 8 months ago

    Great explanation. Please continue uploading videos.

    • @easewithdata · 8 months ago +1

      Please check the playlist for next videos.

  • @ashishsahu4025 · 5 months ago

    Nice work bro

  • @ateetagrawal9928 · 9 months ago

    Very very informative video

    • @easewithdata · 9 months ago

      Thank you, please make sure to share with your network 🛜

  • @deepjyotimitra1340 · 8 days ago +1

    Such an in-depth explanation. Thank you!
    But I have one question:
    For input data, partitions can be at most 128 MB by default. In this case, I believe each dataframe has less than 128 MB of data, which should create 1 partition. Why are 8 partitions created for each dataframe? Am I misunderstanding or missing something here? Please do reply, I'm a bit confused.

    • @easewithdata · 5 days ago

      Partitions are created based on the number of tasks working on the data. You can also alter the number of partitions using repartition or coalesce.
      Now, if you are reading a file smaller than 128 MB and 8 tasks are used to read it in parallel, you will have 8 partitions, each holding less than 128 MB of data. The main goal of Spark is to get the work done with parallelism. See the sketch below.
      And if you like my content, please make sure to share it with your network on LinkedIn 👍
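      For illustration, a minimal PySpark sketch for checking and changing partition counts (a hedged example; the CSV path is hypothetical):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # How many tasks can run at the same time on this setup
        print(spark.sparkContext.defaultParallelism)

        # A hypothetical small file; Spark may still read it with several tasks
        df = spark.read.csv("/data/orders.csv", header=True)
        print(df.rdd.getNumPartitions())

        # Explicitly change the partition count
        df8 = df.repartition(8)   # full shuffle into 8 partitions
        df2 = df8.coalesce(2)     # merge down to 2 partitions without a full shuffle
        print(df8.rdd.getNumPartitions(), df2.rdd.getNumPartitions())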

    • @deepjyotimitra1340 · 5 days ago

      @easewithdata Thank you for the explanation. I really like your content and have already recommended this channel to many of my colleagues.

    • @easewithdata · 5 days ago

      @deepjyotimitra1340 Thank you so much!

  • @samagraagrawal · 8 days ago +1

    One quick question on stage ID 3, where shuffle write data is more than shuffle read: how can the write size be larger than the read size when, as I understand it, only repartitioning is happening? How did the data grow before being written?

    • @easewithdata · 5 days ago

      Stages read shuffle data from previous stages and can write shuffle data for the next stage. The operation in a particular stage determines how much data is written for the next stage; in some cases it can be greater than the amount of data read (because of how shuffle data is serialized internally), which is OK.

  • @ComedyXRoad · 2 months ago +1

    Thank you! In real-world projects, do we use cluster mode or the client mode you are using now?

  • @rakeshpanigrahi577 · 3 months ago

    Thanks for the awesome explanation. I ran the exact code in Databricks, but it skipped the repartitioning step somehow. It is not showing the relevant steps for repartitioning in the DAG or in the explain plan.

    • @easewithdata · 3 months ago

      To replicate the same behaviour, just disable AQE; see the sketch below.
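      For illustration, a minimal sketch of turning AQE off in PySpark (spark.sql.adaptive.enabled is a real Spark config; the session setup is just an assumption for a runnable example):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Adaptive Query Execution rewrites plans at runtime (e.g. it can coalesce
        # shuffle partitions), so turning it off makes the static plan reappear
        spark.conf.set("spark.sql.adaptive.enabled", "false")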

  • @ansumansatpathy3923 · 3 months ago

    Why is there a shuffle write for the stage that reads the files into a dataframe? Does that involve a shuffle? And why is the shuffle write only a few KBs of data?

    • @easewithdata · 3 months ago

      Shuffle is only involved when the next step is a wide operation. As for the KBs of data, it depends on the next stage: with a count, Spark first computes a local (partial) count on each partition before shuffling, which reduces the shuffled data to KBs. See the sketch below.
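      For illustration, a small PySpark sketch that shows the partial aggregation before the shuffle (a hedged example; the data is made up):

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.getOrCreate()

        df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
        counts = df.groupBy("bucket").count()

        # The plan shows HashAggregate(partial_count) before the Exchange (shuffle)
        # and the final HashAggregate after it, so only small per-partition counts
        # get shuffled, which is why the shuffle write stays in the KB range
        counts.explain("formatted")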

  • @at-cv9ky · 7 months ago

    Since the default parallelism is 8, only 8 tasks can run in parallel. So can you explain how the 200 tasks in the join transformation ran in parallel?

    • @easewithdata · 7 months ago

      All 200 tasks didn't run in parallel. Batches of 8 tasks run one after another until all 200 tasks are completed (see the quick arithmetic below).
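      For illustration, the scheduling arithmetic as a tiny Python sketch (8 cores and 200 shuffle partitions are the numbers discussed above; 200 is also the default value of spark.sql.shuffle.partitions):

        import math

        cores = 8                 # tasks that can run at the same time
        shuffle_partitions = 200  # one task per shuffle partition in the join stage

        waves = math.ceil(shuffle_partitions / cores)
        print(waves)  # 25 waves of up to 8 parallel tasks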

    • @at-cv9ky · 7 months ago +1

      @easewithdata Thanks. If possible, kindly make a series on Databricks as well. Just curious to understand how it integrates with Spark.

  • @dataworksstudio · 9 months ago

    Bro, very good explanation, but I have a doubt... I followed all your steps, but instead of 229 tasks, 217 are created for me, with 4 stages... and in the Spark job I don't see the (7+5) tasks for repartitioning the dataframes, so 2 stages are also missing... any idea why? Thank you.

    • @easewithdata · 9 months ago

      Hello,
      Please disable AQE and broadcast join to replicate the same behaviour; I disabled them just to explain how things work in the background. With Spark 3, AQE is enabled by default. See the sketch below.
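      For illustration, a short sketch of the two settings (both are real Spark configs; the session setup is just an assumption for a runnable example):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Turn off Adaptive Query Execution so the static plan is reproduced
        spark.conf.set("spark.sql.adaptive.enabled", "false")
        # A threshold of -1 disables automatic broadcast joins, so the join falls
        # back to a sort-merge join with a full shuffle on both sides
        spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")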

  • @anupb.a983 · 7 months ago

    Doubt: since the sum also needs shuffling, just like the join, why are 200 partitions not created for the sum?

    • @easewithdata · 7 months ago

      If AQE is enabled, you will not find 200 shuffle partitions; it will coalesce all unnecessary shuffle partitions (see the sketch below).
      Check out: ua-cam.com/video/164OKvwW8T8/v-deo.html
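      For illustration, the AQE settings behind this coalescing (all real Spark configs; the values shown are the defaults in recent Spark 3.x releases, and the session setup is just an assumption):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        spark.conf.set("spark.sql.adaptive.enabled", "true")
        spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
        # AQE merges small post-shuffle partitions toward this advisory target size,
        # which is why far fewer than 200 partitions show up for the sum
        spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")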

  • @alishkumarmanvar7163 · 3 months ago

    .

  • @pawansharma-pz1hz · 6 months ago

    Thanks for the detailed explanation. I have one doubt: after the step df_union = df_sum.union(df_4), why does the job DAG show 229 tasks again?

    • @easewithdata · 5 months ago

      Yes, those would be skipped stages: they appear in the DAG, but they reuse shuffle output already computed by earlier jobs, so they do not actually re-run.