Apache Spark Memory Management

  • Published 16 Jan 2025

COMMENTS • 92

  • @hritikapal683
    @hritikapal683 9 months ago +10

    Please don't stop making videos they're highly insightful!

  • @nayanroy13
    @nayanroy13 9 months ago +4

    The best 23 mins 8 secs I have ever spent :). This is easily one of the most useful videos on YouTube!

  • @himanshuxyz87
    @himanshuxyz87 9 months ago +4

    I have read so many articles before on Spark Memory Management but this is the first time I have understood the allocation and other details so clearly. Thanks a lot. Really helpful video.

    • @afaqueahmad7117
      @afaqueahmad7117  9 months ago +1

      @himanshuxyz87 This means a lot, thank you for the appreciation :)

  • @adityanjsg99
    @adityanjsg99 17 days ago

    This lecture is well made!! No fancy edits needed

    • @afaqueahmad7117
      @afaqueahmad7117  17 days ago

      You watching this on 31st Dec shows your commitment. I’m glad you’re here and appreciate your kind words

  • @skybluelearner4198
    @skybluelearner4198 5 months ago +7

    I spent INR 42000 on a Big Data course but could not understand this concept clearly because the trainer himself lacked clarity. Here I understood completely.

    • @afaqueahmad7117
      @afaqueahmad7117  5 months ago

      Appreciate the kind words @skybluelearner4198 :)

  • @Pratik0917
    @Pratik0917 8 months ago +1

    All videos are of high quality. I don't think we could find this level of knowledge anywhere. Thank you, Afaque

    • @afaqueahmad7117
      @afaqueahmad7117  8 months ago +1

      Thank you @Pratik0917, appreciate it, means a lot to me :)

  • @sukanyanarayanan5763
    @sukanyanarayanan5763 1 month ago

    Great lecture and clean explanation. It's a very good video for those trying to understand memory management in depth. Thank you for the video.

  • @cloudanddatauniverse
    @cloudanddatauniverse 6 months ago

    Top Class brother! Simple, Amazing and impactful. You deserve great appreciation to bring these internals. May God bless you with great health, peace, mind and prosperity! Keep growing.

    • @afaqueahmad7117
      @afaqueahmad7117  6 months ago +1

      Many thanks @cloudanddatauniverse, this means a lot, thank you for the kind words :)

  • @iamexplorer6052
    @iamexplorer6052 10 months ago +2

    Thank you, we are expecting more solid content like this from you

    • @afaqueahmad7117
      @afaqueahmad7117  10 months ago

      @iamexplorer6052 Really appreciate it :)

  • @vinitrai5020
    @vinitrai5020 10 months ago +3

    Hey Afaque, thanks for the wonderful explanation.
    I have a few questions; please clear up my doubts:
    1. In unified memory, what if execution memory needs the full space that is occupied by storage memory? Can blocks be evicted from storage memory to make room for execution memory? In other words, can execution memory occupy 100% of the unified memory (execution + storage)?
    2. If yes, suppose execution memory occupies the full unified memory and still needs more.
    3. In that case we have two choices, a disk spill or off-heap memory, and we should opt for off-heap memory over a disk spill, as you explained in your video.
    4. The most important question: if we can use a disk spill or off-heap memory, why do we get Out Of Memory errors in executors?
    I hope my points are clear, and I look forward to your explanation.
    Thanks again.

    • @afaqueahmad7117
      @afaqueahmad7117  10 months ago +5

      Hi @vinitrai5020, Good question!
      Yes, execution can request 100% of the space in the unified memory pool. However, if you want to make cached blocks immune to eviction, you can set `spark.memory.storageFraction` (the default is 0.5). If you set it to, e.g., 0.1, cached blocks occupying up to 10% of the unified memory cannot be evicted by execution. It is important to note that this is on-demand: if `spark.memory.storageFraction` is set to 0.1 (10%) but nothing is cached, execution will just go ahead and use that storage memory, and storage will have to wait for that 10% to free up before it can use it. Refer to the Spark documentation here: spark.apache.org/docs/latest/tuning.html#memory-management-overview
      As for why Spark throws OOM errors despite always having the option to spill to disk: most in-memory structures used for joins, aggregations, sorting and shuffling cannot be “split”. Consider a join or an aggregation, where all rows with the same key land in the same partition. Imagine the data for one join/aggregation key being so large that it doesn’t fit in memory. Spilling doesn’t help here because the in-memory structure holding that key's data cannot be “split”; i.e., depending on the nature of the data, Spark cannot do half of the join, spill the rest, and later read the spilled data back to finish the join.
      Enabling off-heap memory would help reduce the memory pressure and now:
      - Total execution memory = execution (on-heap) + execution (off-heap)
      - Total storage memory = storage (on-heap) + storage (off-heap)
      If the large key (as discussed above) is small enough to fit in the total execution memory after enabling off-heap memory, an OOM will be avoided.
      Hope this clarifies :)
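
To make the numbers in the reply above concrete, here is a rough sketch of the arithmetic in plain Python. It is not from the video: it assumes Spark's documented defaults (300 MB of reserved heap and `spark.memory.fraction` = 0.6), and the function name `unified_memory_regions`, the 10 GB heap and the 2 GB off-heap size are hypothetical values chosen only for illustration.

```python
RESERVED_MB = 300  # assumption: Spark's fixed reserved memory on the heap


def unified_memory_regions(heap_mb, memory_fraction=0.6,
                           storage_fraction=0.5, off_heap_mb=0):
    """Approximate the regions discussed above, in whole MB.

    heap_mb          -> spark.executor.memory
    memory_fraction  -> spark.memory.fraction
    storage_fraction -> spark.memory.storageFraction
    off_heap_mb      -> spark.memory.offHeap.size (0 when off-heap is disabled)
    """
    # Unified (execution + storage) pool on the heap.
    unified = round((heap_mb - RESERVED_MB) * memory_fraction)
    # Part of the pool in which cached blocks cannot be evicted by execution.
    storage_protected = round(unified * storage_fraction)
    return {
        "unified": unified,
        "storage_protected": storage_protected,
        # Execution's share when storage fully occupies its protected part.
        "execution_floor": unified - storage_protected,
        # Upper bound for execution when nothing is cached, on-heap plus off-heap.
        "execution_ceiling": unified + off_heap_mb,
    }


regions = unified_memory_regions(10 * 1024, storage_fraction=0.1,
                                 off_heap_mb=2 * 1024)
for name, size_mb in regions.items():
    print(f"{name}: {size_mb} MB")
```

If the data for a single large key needs more than `execution_ceiling`, this model suggests an OOM is still possible even with off-heap memory enabled.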

    • @tahiliani22
      @tahiliani22 9 months ago

      @afaqueahmad7117 Thanks for explaining this. I had the same question and this really helps.

    • @mesurajyadav
      @mesurajyadav 1 month ago

      @afaqueahmad7117
      I just liked how you answered a big question clearly and with a thorough answer. That's not something most YouTubers do. Great minds are clear minds. 👏

  • @PratikPande-k5h
    @PratikPande-k5h 5 months ago

    Really appreciate your efforts. This was very easy to understand and comprehensive as well.

    • @afaqueahmad7117
      @afaqueahmad7117  5 months ago

      @PratikPande-k5h Glad you're finding it easy to understand :)

  • @senthilkumarpalanisamy365
    @senthilkumarpalanisamy365 6 months ago

    Excellent and clear-cut explanation, thanks very much for taking the time to prepare the content. Please do more.

    • @afaqueahmad7117
      @afaqueahmad7117  5 months ago

      Appreciate it @senthilkumarpalanisamy365. More coming soon, stay tuned :)

  • @pullasrikanth1495
    @pullasrikanth1495 3 months ago

    Perfect explanation 👏👏 with a detailed breakdown of memory management in Spark. This video answers most of the questions running in my head 😇😇. Big thanks to you!!

    • @afaqueahmad7117
      @afaqueahmad7117  2 months ago

      Thank you @pullasrikanth1495, Glad you found it helpful, appreciate the kind words, means a lot :)

  • @srinirow5808
    @srinirow5808 3 months ago

    Awesome content Afaque, it helped me understand the concepts. You nailed it. Thanks a lot.

    • @afaqueahmad7117
      @afaqueahmad7117  2 months ago

      Thanks @srinirow5808, Appreciate it, means a lot :)

  • @technicalsuranii
    @technicalsuranii 9 months ago

    Very in-depth description of Apache Spark Memory management 🎉🎉❤

    • @afaqueahmad7117
      @afaqueahmad7117  9 months ago

      Thank you @technicalsuranii, appreciate it :)

  • @coledenesik
    @coledenesik 7 months ago

    I have two accounts on YouTube and subscribed with both. The reason is that you are putting some serious effort into the content. Beautiful diagrams, clear explanations and accurate information are the beauty of your content. Thanks, Afaque Bhai

    • @afaqueahmad7117
      @afaqueahmad7117  7 months ago +1

      Thank you so much @coledenesik bhai :) This comment made my day. Thank you for appreciating my efforts, it means a lot to me brother

  • @iamkiri_
    @iamkiri_ 9 months ago

    Good one, bro. You are one of the elite data engineering YouTubers :-)

    • @afaqueahmad7117
      @afaqueahmad7117  9 months ago

      @iamkiri_ Thanks man, it means a lot to me :)

  • @SurendraKumar-qj9tv
    @SurendraKumar-qj9tv 5 months ago

    Awesome explanations! Please share more relevant videos with us

  • @pratikparbhane8677
    @pratikparbhane8677 9 months ago

    You are the Real Gem❤ , Thanks Bhai for crystal clear explanation❤❤

    • @afaqueahmad7117
      @afaqueahmad7117  9 months ago

      @pratikparbhane8677 Means a lot, thank you :)

  • @rgv5966
    @rgv5966 6 months ago

    Hey @afaque, this is top class stuff, thanks for putting in all the effort and making it available for us. Keep going :)

    • @afaqueahmad7117
      @afaqueahmad7117  6 months ago

      Many thanks @rgv5966, this means a lot, appreciate it :)

  • @RaviSingh-dp6xc
    @RaviSingh-dp6xc 3 months ago

    Great content, perfectly explained. ✌

    • @afaqueahmad7117
      @afaqueahmad7117  2 months ago

      @RaviSingh-dp6xc Glad to hear it was easy to understand! :)

  • @Ravi_Teja_Padala_tAlKs
    @Ravi_Teja_Padala_tAlKs 5 months ago

    Excellent 🎉 👍 appreciate your effort

  • @amiyakumarnayak8286
    @amiyakumarnayak8286 8 months ago

    very detailed explanation. Thanks

  • @prabas5646
    @prabas5646 7 months ago

    Excellent. Please keep posting on the internals of Spark

  • @AlluArjun-ds9hh
    @AlluArjun-ds9hh 10 months ago +1

    Can you please explain more about serialization and deserialization in spark?

  • @snehilverma1772
    @snehilverma1772 7 hours ago

    Hi Afaque, in this video you talked about the GC cycle: basically, when on-heap memory is full, GC happens. Where can we find out that a GC cycle has happened and that it's time to use some off-heap memory? I mean, do we check the DAG/query plan or somewhere else?

  • @ybalasaireddy1248
    @ybalasaireddy1248 10 months ago

    Thanks for the Fabulous content. More power to you

    • @afaqueahmad7117
      @afaqueahmad7117  10 months ago

      Thank you @ybalasaireddy1248, really appreciate it :)

  • @syam-t3f
    @syam-t3f 8 months ago

    Rare video thanks for making this video. Please make more videos ❤

    • @afaqueahmad7117
      @afaqueahmad7117  8 months ago

      Thank you @user-sk8vi1xy7q, appreciate the kind words :)

  • @adipondas
    @adipondas 18 days ago

    Can you please explain where storage blocks go when they are evicted from unified memory by the LRU algorithm? Are they written to disk?
    And I really appreciate that you are making Spark's internal workings really easy for enthusiasts like us. Thanks.

  • @nikhillingam4630
    @nikhillingam4630 6 months ago

    It's very useful ❤

  • @TJWGoodness
    @TJWGoodness 3 months ago

    Thank you

  • @sushanthsai2078
    @sushanthsai2078 2 months ago

    Can you elaborate on off-heap size and memory overhead: when to use each and their significance?

  • @avinash7003
    @avinash7003 9 months ago +2

    please do one full time project on Apache Spark

    • @afaqueahmad7117
      @afaqueahmad7117  8 months ago

      Thanks for the suggestion @avinash7003! It's in the plan.

    • @avinash7003
      @avinash7003 8 months ago +1

      @afaqueahmad7117 Please also upload the questions most often asked in data engineering interviews

  • @namanverma1507
    @namanverma1507 1 month ago

    Very nice explanation. Thanks for your efforts.
    I have one question:
    Why would we get an OOM if Spark could spill the data to disk? I understand that half of a data structure that is already in use can’t be pushed to disk, but I didn’t get this fully in terms of memory. Kindly reply in terms of memory.

  • @apurvsingh5541
    @apurvsingh5541 23 days ago

    Thank you for the video. One doubt: where do partitions reside in the executor? My guess is storage memory.

    • @afaqueahmad7117
      @afaqueahmad7117  22 days ago

      When Spark loads data (partitions), they're loaded into the "Execution" memory. If those partitions were cached, they would be stored in the "Storage" memory

    • @apurvsingh5541
      @apurvsingh5541 22 days ago

      @afaqueahmad7117 Thank you :)

  • @rambabuposa5082
    @rambabuposa5082 9 months ago

    Thanks Afaque Ahmad, very good series and loved all of them. Good work
    I have a few questions for you. Maybe we can discuss them here if possible, or if you are planning a new video, I will wait for it.
    1. Here you discussed Executor Memory Management. What about Driver Memory Management, how does it work internally?
    2. What are the similarities between Executor and Driver Memory Management?
    3. What are the differences between Executor and Driver Memory Management?
    Many thanks in advance.

    • @afaqueahmad7117
      @afaqueahmad7117  9 months ago +2

      Hey @rambabuposa5082, thank you for the kind words, really appreciate it :)
      Regarding `Driver Memory Management`, I appreciate the ask, but I do not have plans for a video yet. The reason is that driver and executor memory management go hand in hand, and it is relatively easy to manage the driver if your concepts are clear on executor memory management, because of several similarities (as you asked in one of your questions).
      Internally their memory components look similar in the sense that they both have JVM (on-heap) and off-heap memory and the division/logic of memory in the driver is just the same as the executor.
      Key differences are in terms of "roles and usage". You would have 1 driver which is solely responsible for creating tasks, scheduling those tasks, communicating back and forth with the executors on progress and aggregating the results (if needed), therefore its memory usage patterns differ from those of executors, which perform the actual data processing and storage.
      An important difference is in the way OOM (out-of-memory) errors happen on drivers vs executors. Hopefully, I'll be creating some content specifically on OOM and other issues and how to navigate through them.
      Hope that clarifies :)

  • @joseduarte5663
    @joseduarte5663 4 months ago

    Hey Afaque, awesome video as always! Quick question: if we have the chance to increase the memory of the Spark execution container, how do we decide between assigning that extra memory to on-heap or to off-heap memory, if in the end the total available memory is always the sum of the two? I know you mentioned that off-heap memory is not affected by the garbage collection process, but it is also slower than on-heap memory. So wouldn't it be better to always assign all possible memory to on-heap right from the beginning instead of waiting for off-heap memory to come into play?

    • @afaqueahmad7117
      @afaqueahmad7117  4 months ago

      Hey @joseduarte5663, good question! Generally, it is appropriate to assign all memory to on-heap, with off-heap mostly disabled. However, it's best to monitor job performance and look out for issues where the overall run is affected by, e.g., "GC cleanup" time; in such cases you may prefer to change your strategy and allocate 10-20% of memory to "off-heap".
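
One way to picture the split suggested in the reply above is a small helper that turns a total budget and an off-heap share into Spark settings. This is only a sketch, not the author's method: `split_executor_memory` is a hypothetical helper, the 20% share is just the upper end of the range mentioned in the reply, and overhead memory (which is requested on top of these values) is ignored. The configuration keys themselves are standard Spark settings.

```python
def split_executor_memory(total_gb, off_heap_share=0.0):
    """Split a memory budget (in GB) into on-heap and off-heap Spark settings.

    off_heap_share is the fraction of the budget moved off-heap; 0.0 keeps
    everything on the heap, which is the common starting point.
    """
    if not 0.0 <= off_heap_share < 1.0:
        raise ValueError("off_heap_share must be in [0, 1)")

    total_mb = total_gb * 1024
    off_heap_mb = round(total_mb * off_heap_share)
    on_heap_mb = total_mb - off_heap_mb

    conf = {
        "spark.executor.memory": f"{on_heap_mb}m",
        "spark.memory.offHeap.enabled": str(off_heap_mb > 0).lower(),
    }
    if off_heap_mb > 0:
        # Only meaningful when off-heap is enabled.
        conf["spark.memory.offHeap.size"] = f"{off_heap_mb}m"
    return conf


# All on-heap first; move 20% off-heap only if GC time becomes a problem.
print(split_executor_memory(10))
print(split_executor_memory(10, off_heap_share=0.2))
```

The resulting dictionary could be passed as `--conf key=value` pairs to `spark-submit`; whether the off-heap variant actually helps is something to confirm by watching GC time for the job.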

  • @BabaiChakraborty-ss8pt
    @BabaiChakraborty-ss8pt 4 months ago

    Great Work Bro.

  • @Akshaykumar-pu4vi
    @Akshaykumar-pu4vi 5 months ago

    Useful information

  • @malathiashok6650
    @malathiashok6650 21 days ago

    So why do we use persist as opposed to cache if disk access is very expensive and slow?

  • @suresh.suthar.24
    @suresh.suthar.24 2 months ago

    Hi Ahmed,
    I have a small doubt: when all 10 GB of memory is allocated to on-heap memory, how does overhead memory get 1 GB?
    Thanks for your efforts

  • @bhargaviakkineni
    @bhargaviakkineni 4 months ago

    Sir please do a video on executor out of memory in spark and driver out of memory in spark

  • @crustysoda
    @crustysoda 3 months ago

    Thank you Afaque, wonderful content. Do you have information regarding which memory region Python falls into in PySpark, specifically for pandas_udf? My understanding is that it should go into overhead memory

  • @dileepn2479
    @dileepn2479 8 months ago

    What is the use of overhead memory?

    • @afaqueahmad7117
      @afaqueahmad7117  8 months ago

      Hey @dileepn2479, as mentioned at 4.18, it's used for internal system-level operations. These are not directly related to data processing but are essential for the proper functioning of the executor, e.g. managing memory for the JVM, networking during shuffling, etc. Hope this clarifies :)
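
As a rough illustration of how the overhead is sized, the sketch below assumes the defaults described in the Spark configuration docs for `spark.executor.memoryOverhead`: 10% of executor memory, with a minimum of 384 MiB. The function name `default_overhead_mb` is hypothetical, and an explicitly configured overhead would replace this calculation entirely.

```python
MIN_OVERHEAD_MB = 384  # assumption: Spark's documented minimum overhead


def default_overhead_mb(executor_memory_mb, overhead_factor=0.10):
    """Default overhead (in MB) added on top of spark.executor.memory."""
    return max(MIN_OVERHEAD_MB, int(executor_memory_mb * overhead_factor))


for heap_mb in (2 * 1024, 10 * 1024):
    overhead = default_overhead_mb(heap_mb)
    # The container request is the heap plus the overhead, not a slice of the heap.
    print(f"heap={heap_mb} MB overhead={overhead} MB container={heap_mb + overhead} MB")
```

Under these assumptions a 10 GB heap gets roughly 1 GB of overhead in addition to the heap, so the container request is about 11 GB rather than 10 GB.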

    • @dileepn2479
      @dileepn2479 8 months ago

      Thank you @afaqueahmad7117. I wasn't expecting such a swift response from your end. Thanks much again!!

  • @piyushkumawat8042
    @piyushkumawat8042 5 months ago

    Why give such a large fraction (0.4) to user memory? In the end, when transformations are performed in a particular stage, whether we use a user-defined function or any other function, only execution memory will be used. So what exactly is the role of user memory?

  • @bhargaviakkineni
    @bhargaviakkineni 9 months ago

    Excellent video sir. Could you please make a video on garbage collection in Spark and the JVM?

  • @maheshphadale976
    @maheshphadale976 1 month ago

    What is the use of overhead memory? Can you please add a couple of points on it?

  • @marreddyp3010
    @marreddyp3010 9 months ago

    Thanks for the excellent content. Can we see all the mentioned memory details in the Spark UI?

    • @afaqueahmad7117
      @afaqueahmad7117  9 months ago +1

      Thanks @marreddyp3010! RE: Spark UI, on the "Executors" tab, you can see most of the memory components - storage, on-heap, off-heap memory, disk usage

    • @marreddyp3010
      @marreddyp3010 9 months ago +1

      @afaqueahmad7117 I am confused about user memory. As per the Spark documentation, by default it is 40% of total memory. How can we check the usage of this memory in the Spark UI? Could you please help clarify this? Also, please make a POC (proof of concept) video on resource usage with GBs of data.
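
For reference, a small sketch of where the 40% figure comes from. It assumes Spark's documented defaults: 300 MB of the heap is reserved, `spark.memory.fraction` (default 0.6) of the remainder is the unified execution + storage pool, and what is left over is user memory. The function name `user_memory_mb` is hypothetical, and the sketch only shows the arithmetic; it does not answer how to observe user memory in the Spark UI.

```python
RESERVED_MB = 300  # assumption: Spark's fixed reserved memory on the heap


def user_memory_mb(heap_mb, memory_fraction=0.6):
    """User memory (in MB): usable heap minus the unified pool."""
    usable = heap_mb - RESERVED_MB
    unified = round(usable * memory_fraction)
    return usable - unified


heap_mb = 10 * 1024
print(f"user memory for a {heap_mb} MB heap: {user_memory_mb(heap_mb)} MB")
```

So, under these assumptions, the 40% applies to the heap after the reserved 300 MB is subtracted, not to the whole executor memory setting.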

  • @deepakgonugunta
    @deepakgonugunta 9 months ago

    Please don't stop making videos

  • @grim_rreaperr
    @grim_rreaperr 9 months ago

    Thanks a lot bhai

  • @i_am_out_of_office_
    @i_am_out_of_office_ 9 months ago

    keep it coming!!

  • @snehilverma1772
    @snehilverma1772 3 months ago +1

    Bhai, a BJP supporter would stop supporting them after watching your videos. Jokes apart, this was a really insightful and uncomplicated explanation.

    • @afaqueahmad7117
      @afaqueahmad7117  2 months ago

      @snehilverma1772 I appreciate the joke, however, let's keep politics aside, bhai :) Thanks for the kind words, glad you found the explanation helpful

    • @suresh.suthar.24
      @suresh.suthar.24 2 months ago +1

      🤣🤣