Master Reading Spark DAGs

  • Published 31 Dec 2024

COMMENTS • 66

  • @shivagarg9458
    @shivagarg9458 10 days ago

    Practically explained the AQE runtime optimization. Good work!

  • @afaqueahmad7117
    @afaqueahmad7117  1 year ago +6

    🔔🔔 Please remember to subscribe to the channel folks. It really motivates me to make more such videos :)

    • @lunatyck05
      @lunatyck05 1 year ago

      Done - awesome videos, will watch the rest of the series. Would be great to get some Databricks-oriented videos as well when possible

  • @gabriells9074
    @gabriells9074 1 year ago +3

    This is probably the best explanation I've seen of Spark DAGs. Please keep up the amazing content! Thank you

  • @niladridey9666
    @niladridey9666 1 year ago +1

    Again, in-depth content. Thanks a lot. Please discuss a scenario-based question on today's topics.

  • @BuvanAlmighty
    @BuvanAlmighty 1 year ago +1

    Beautiful content. Crystal-clear explanation. Thank you for doing this. ❤❤

  • @joseduarte5663
    @joseduarte5663 4 months ago

    Awesome video! I've been searching for something like this, and all the other videos I found don't get to the point or explain things as well as you do. I'm definitely subscribing and sharing this with other DEs on my team; please keep posting content like this!

    • @afaqueahmad7117
      @afaqueahmad7117  3 months ago

      Appreciate the kind words @joseduarte5663 :)

  • @yuvrajyuvas4730
    @yuvrajyuvas4730 10 months ago

    Bro, can't thank you enough... This is exactly what I was looking for... Thanks a ton, bro... 🎉

  • @Fullon2
    @Fullon2 1 year ago

    Nice series about performance, waiting for more videos, thanks.

  • @VijaySingh-x3f
    @VijaySingh-x3f 10 months ago

    Doing fantastic work bro.... Keep this up 💪❤

  • @CharanSaiAnnam
    @CharanSaiAnnam 10 months ago

    Very good explanation, thanks. You earned a new subscriber.

  • @varunparuchuri9544
    @varunparuchuri9544 7 months ago +2

    @Afaque as usual, amazing video bro. It's been more than a month; we're dying waiting for videos from you

  • @saravananvel2365
    @saravananvel2365 1 year ago

    Amazing explanation... waiting for more videos from you

  • @Learner1234-hv4be
    @Learner1234-hv4be 8 months ago

    Great explanation bro, thanks for the great work you are doing

    • @afaqueahmad7117
      @afaqueahmad7117  8 months ago +1

      Thank you @Learner1234-hv4be, really appreciate it :)

  • @ManaviVideos
    @ManaviVideos 1 year ago +1

    It's a really informative session, thank you!!

  • @HarbeerKadian-m3u
    @HarbeerKadian-m3u 5 months ago

    Amazing. This is just too good. Will share with my team also.

    • @afaqueahmad7117
      @afaqueahmad7117  4 months ago

      Really appreciate it @HarbeerKadian-m3u :)

  • @OmairaParveen-uy7qt
    @OmairaParveen-uy7qt 1 year ago

    Explained so well!! Crystal clear!

  • @CoolGuy
    @CoolGuy 1 year ago +1

    Done with the second video on this channel. See you tomorrow again.

  • @ankursinhaa2466
    @ankursinhaa2466 1 year ago

    Thank you bro!! Your videos are very informative and helpful. Can you please make one video explaining how to set up Spark on a local machine? That would be very helpful

    • @afaqueahmad7117
      @afaqueahmad7117  1 year ago

      Thanks @ankursinhaa2466, videos on deployment (local and cluster) coming soon :)
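
Until those videos land, a minimal local-mode sketch for the question above; this is a common pattern, not the author's recommended setup (assumes Python and a JDK are installed):

```python
# pip install pyspark   <- shell step; bundles a local Spark distribution
from pyspark.sql import SparkSession

# local[*] runs Spark inside this single process using all available
# cores; no cluster or separate installation is required.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-sketch")
    .getOrCreate()
)

print(spark.range(5).count())  # quick smoke test -> 5
```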

  • @RaviSingh-dp6xc
    @RaviSingh-dp6xc 2 months ago

    Please make a full PySpark tutorial. This is a very interesting topic, explained very nicely 👍

    • @afaqueahmad7117
      @afaqueahmad7117  2 months ago

      Thanks @RaviSingh-dp6xc, more PySpark content coming soon! :)

  • @balakrishna61
    @balakrishna61 8 months ago

    Nice explanation. Great work. Thank you. Liked and subscribed.

    • @afaqueahmad7117
      @afaqueahmad7117  8 months ago

      Thank you @balakrishna61, appreciate it :)

  • @tahiliani22
    @tahiliani22 9 months ago +1

    At 16:49, as part of the AQE plan for the larger dataset, the way I understood it, 1 skewed partition was split into 12, so we finally had 24+12 = 36 partitions. We see the same on Job Id 9 at 13:40, which had 36 tasks. But I heard you say that 36 partitions were reduced to 24. Can you please help clear up the confusion? Thank you.

    • @rambabuposa5082
      @rambabuposa5082 8 months ago

      I think in that AQE step, AQEShuffleRead reads 200 partitions (as per the previous node) from the customers dataset, then coalesces them to 24; then something happens that makes them 36, which is why the right-side node shows "number of partitions: 36".
      On the left side, for the transactions dataset, this "number of partitions: 36" appears as the last value, while on the right side, for the customers dataset, it appears as the first value.
      But I'm not sure what that "something" is??? (See the skew-join sketch just below.)
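
For readers experimenting with the skew behavior discussed in this thread, a minimal sketch of the AQE skew-join settings involved; the paths and join column are hypothetical, and the factor/threshold values shown are Spark's defaults:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-skew-sketch")
    # AQE and its skew-join optimization (on by default in recent
    # Spark versions; shown explicitly for clarity, set before the
    # session is created).
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # A partition counts as skewed if it is at least `factor` times the
    # median partition size AND exceeds the byte threshold; AQE then
    # splits it into several smaller reads, which is how counts like
    # 24 + 12 = 36 appear in the AQEShuffleRead node.
    .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
    .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
    .getOrCreate()
)

transactions = spark.read.parquet("/data/transactions")  # hypothetical path
customers = spark.read.parquet("/data/customers")        # hypothetical path

# After an action, the SQL tab's AQEShuffleRead node reports the final
# partition count for each side of the join.
transactions.join(customers, "customer_id").count()
```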

  • @rambabuposa5082
    @rambabuposa5082 8 months ago

    Hi Afaque Ahmad,
    At 13:37 you were saying there is a separate job for each shuffle operation: one job for the transactions dataset's shuffle and one for the customers dataset's. I'm a bit confused about why they need a separate job. As per my understanding, when Spark encounters a shuffle operation, it just creates a new stage within that job, right?
    When I execute the same code snippet, it creates 5 jobs in total: two for metadata (expected), two for the shuffle operations (not expected), and a final one for the join operation. (See the job-group sketch below.)
    Many thanks
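
One way to count the jobs a single action actually fires is to tag a job group and list its job ids afterwards; a minimal sketch (the `% 10` grouping key is arbitrary):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("job-count-sketch").getOrCreate()
sc = spark.sparkContext

# Tag everything triggered from here on so we can count the jobs.
sc.setJobGroup("one-action", "jobs fired by a single collect()")

df = spark.range(0, 1_000_000).withColumn("k", F.col("id") % 10)
df.groupBy("k").count().collect()  # one action...

# ...but possibly several job ids: Spark can run extra internal jobs
# (e.g. file listing / schema inference on file sources, or AQE
# materializing shuffle stages), which is why "one" join can show up
# as 5 jobs in the UI.
print(sc.statusTracker().getJobIdsForGroup("one-action"))
```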

  • @i_am_out_of_office_
    @i_am_out_of_office_ 9 months ago

    Very well explained!!

  • @jdisunil
    @jdisunil 11 months ago

    Your expertise and explanations are like "filtered gold in one can". Can you make a quick video on AQE in depth, please? 1000 thanks

    • @afaqueahmad7117
      @afaqueahmad7117  11 months ago

      Thanks @jdisunil for the kind words. There's already an in-depth video on AQE.
      You can refer here: ua-cam.com/video/bRjVa7MgsBM/v-deo.html

  • @rambabuposa5082
    @rambabuposa5082 8 months ago

    Hi Afaque Ahmad,
    At 7:24, you were saying that a batch is a group of rows and it's not the same as a partition.
    Shall we assume it is something like
    a group of rows read from one or more partitions available on one or more executors (not from all executors), just enough to satisfy that df.show() count? (See the sketch below.)
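
A tiny sketch of the show()-only-reads-what-it-needs behavior referenced at 7:24 (assumes an active PySpark session):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# show() only needs a handful of rows, so Spark plans a job that scans
# as few partitions as possible instead of materializing all 100.
df = spark.range(0, 1_000_000, numPartitions=100)
df.show(5)  # in the Spark UI, this job typically runs 1 task, not 100
```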

  • @subaruhassufferredenough7892

    Could you also do a video on Spark SQL and how to read DAGs/Execution Plans for that? Amazing video btw, subscribed!!

    • @afaqueahmad7117
      @afaqueahmad7117  1 year ago +1

      Hey @subaruhassufferredenough7892, thank you for the kind words, really appreciate it :)
      On Spark SQL: DAGs/execution plans for both Spark SQL and non-SQL (Python) queries are the same, as they are compiled/optimized by the same underlying engine, the Catalyst optimizer.
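
A quick way to check the point in the reply above, as a minimal sketch (assumes a running PySpark session; the view name `t` is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(10).createOrReplaceTempView("t")

df_api = spark.table("t").filter("id > 5")          # DataFrame API
df_sql = spark.sql("SELECT * FROM t WHERE id > 5")  # Spark SQL

# Both formulations compile through the Catalyst optimizer; the
# optimized and physical plans printed below should be identical.
df_api.explain(True)
df_sql.explain(True)
```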

  • @SHUBHAM_707
    @SHUBHAM_707 7 months ago

    Please make a dedicated video on shuffle partitions... how behavior changes when the setting is increased or decreased from 200

    • @afaqueahmad7117
      @afaqueahmad7117  7 months ago

      Hey @SHUBHAM_707, have you watched this - ua-cam.com/video/q1LtBU_ca20/v-deo.html
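
In the meantime, a minimal sketch of the setting in question; AQE is disabled here so the raw value is visible, and the `% 7` grouping key is arbitrary:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Disable AQE so the configured value shows through unmodified.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "24")  # default is 200

out = spark.range(0, 1_000_000).groupBy((F.col("id") % 7).alias("k")).count()
print(out.rdd.getNumPartitions())  # 24 -> fewer, larger shuffle partitions
```

Raising the value spreads the same shuffled data over more, smaller tasks; lowering it does the opposite, at the risk of large or skewed partitions.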

  • @tandaibhanukiran
    @tandaibhanukiran 10 months ago

    Hello bro,
    I have a doubt. At the "23:30" mark, it was mentioned that AQEShuffleRead coalesced the partitions into 1; won't the other worker nodes then sit idle?
    In the video it is mentioned that even after the shuffle, all A's will be in one partition and all B's in another partition.
    Can you please explain what is actually meant by "number of coalesced partitions = 1"? (See the sketch below.)
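
A minimal sketch of the "coalesced to 1" situation: when the post-shuffle data is tiny, AQE folds all shuffle outputs into a single partition, so a single task runs that stage while the other cores/executors are indeed idle for its (short) duration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# A tiny post-shuffle dataset: AQE merges the shuffle outputs into one
# partition because splitting them across many tasks isn't worth the
# scheduling overhead.
small = spark.createDataFrame([("A", 1), ("A", 2), ("B", 3)], ["key", "val"])
small.groupBy("key").count().collect()  # SQL tab: AQEShuffleRead -> 1 partition
```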

  • @viswanathana3759
    @viswanathana3759 9 months ago

    Amazing content. Keep it up

  • @tejasnareshsuvarna7948
    @tejasnareshsuvarna7948 5 months ago

    Thank you very much for the explanation. But I want to know what your source of knowledge is. Where do you learn these things?

  • @nayanroy13
    @nayanroy13 1 year ago

    Awesome explanation.👍

  • @zulqarnainali6560
    @zulqarnainali6560 1 year ago

    Beautifully explained!

  • @RishabhTakkar-o6l
    @RishabhTakkar-o6l 5 months ago

    How do you access this Spark UI?
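
For anyone else wondering: the UI is served by the driver while the application is running, by default at http://localhost:4040 (4041, 4042, ... if the port is taken). A one-liner to find the bound address from PySpark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The driver exposes the UI address programmatically as well.
print(spark.sparkContext.uiWebUrl)  # e.g. http://<driver-host>:4040
```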

  • @muhammadhassan1640
    @muhammadhassan1640 1 year ago

    Excellent bro

  • @TechnoSparkBigData
    @TechnoSparkBigData 1 year ago

    Thanks for this. When is the next video coming, sir?

  • @rohitsharma-mg7hd
    @rohitsharma-mg7hd 15 days ago

    At 29:10 it should be A1 B1 (in green) instead of A2 B1 (green)

    • @afaqueahmad7117
      @afaqueahmad7117  15 days ago

      Hey @rohitsharma-mg7hd, appreciate your attention to the video. Yes, it should be A1 B1 instead of A2 B1. If you watch the complete video, you'll see I corrected it at 29:57 :)

    • @rohitsharma-mg7hd
      @rohitsharma-mg7hd 14 days ago

      @@afaqueahmad7117 Thanks, I've only done half the video so far since I was also practicing on my system along with you. I will surely watch the rest today. Very good luck to you, very good content

  • @pratiksatpati3096
    @pratiksatpati3096 1 year ago

    Superb ❤

  • @ComedyXRoad
    @ComedyXRoad 7 months ago

    thank you

  • @VenkatakrishnaGangavarapu
    @VenkatakrishnaGangavarapu 1 year ago

    Why are shuffle partitions set to 200 when we have only 13 partitions max? At 14:55

    • @afaqueahmad7117
      @afaqueahmad7117  1 year ago

      By default, shuffle partitions are 200, hence you see that in the 'Exchange' step. The reduction (optimization) to fewer partitions takes place in the 'AQEShuffleRead' step below it (see the sketch that follows).
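
A minimal sketch reproducing the reply above (assumes an active session; the `% 13` key is arbitrary, echoing the 13 input partitions mentioned):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# The Exchange node still plans the static spark.sql.shuffle.partitions
# value (200 by default); AQEShuffleRead then coalesces those partitions
# at runtime based on the actual shuffle-block sizes.
out = spark.range(0, 1_000_000).groupBy((F.col("id") % 13).alias("k")).count()
out.collect()  # SQL tab: Exchange shows 200, AQEShuffleRead far fewer
```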

  • @abdulraheem2874
    @abdulraheem2874 1 year ago

    Bhai, can you help by making a video on Spark architecture as well, for beginners?

  • @satheeshkumar2149
    @satheeshkumar2149 10 months ago

    While stages are created whenever a shuffle occurs, how are jobs created?

    • @afaqueahmad7117
      @afaqueahmad7117  10 months ago

      Hey @satheeshkumar2149, jobs are created whenever an action is invoked. Examples of actions in Apache Spark are collect() and count().

    • @satheeshkumar2149
      @satheeshkumar2149 10 months ago

      @@afaqueahmad7117, but in some cases more than one job is created. This is where I have difficulty understanding. (See the sketch below.)
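
A hedged sketch of why job counts can exceed the number of actions you wrote (the DataFrame here is made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1_000_000).withColumn("k", F.col("id") % 10)

df.count()                          # action 1 -> its own job
df.groupBy("k").count().collect()   # action 2 -> a job whose shuffle adds a stage

# Extra jobs beyond your explicit actions typically come from
# file-source metadata scans (listing files, inferring schema),
# broadcast preparation, or AQE materializing each shuffle stage as a
# separate job at runtime.
```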

  • @NiranjanAnandam
    @NiranjanAnandam 6 months ago +1

    No clarity is provided on when a job is created. Stages are the result of shuffles. A task is just a unit of execution.