Spark Out of Memory Issue | Spark Memory Tuning | Spark Memory Management | Part 1

  • Published 11 Sep 2024

COMMENTS • 94

  • @nikhilmishra7572
    @nikhilmishra7572 3 years ago +11

    Recently discovered this channel. This is gold.

  • @rijwanmohammed1309
    @rijwanmohammed1309 3 years ago +3

    Great, please don't stop uploading new content!!

  • @saivarunkolluru6792
    @saivarunkolluru6792 3 years ago +4

    Lots of respect for your content ❤️

  • @Nonamaee
    @Nonamaee 3 years ago +2

    So well explained, even the images were very useful. Thank you very much!

  • @RamRam-jp2kc
    @RamRam-jp2kc 4 years ago +2

    Your videos on troubleshooting are pretty good.

  • @sarfarazhussain6883
    @sarfarazhussain6883 4 years ago +2

    Waiting for Part 2 :)

  • @kaladharnaidusompalyam851
    @kaladharnaidusompalyam851 4 years ago +2

    Thank you so much. I have been facing this question many times in recent days. 👍

  • @bhuvaneshkumarsrivastava906
    @bhuvaneshkumarsrivastava906 3 years ago +2

    Is the 2nd Part not there yet?
    Your videos are AWSUUMMM !!! :D

  • @minalmoon4605
    @minalmoon4605 3 years ago

    It is a great video. The content is very useful. Keep it up, man 👍🏻👍🏻👍🏻

  • @viraajsivaraju2329
    @viraajsivaraju2329 4 years ago +1

    Very useful. Please keep making more such videos.

  • @lxkakkarot3689
    @lxkakkarot3689 2 years ago +2

    Can you please also show code that repartitions data and increases executors on a dummy job, changing the values so we can see the impact on the run time of the jobs? That would be really great for understanding the concepts.
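
    (A minimal sketch of such an experiment, assuming a local PySpark session; the dataset size and partition counts below are illustrative, and the executor count itself is set at submit time, e.g. via spark-submit --num-executors:)

        import time
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("repartition-demo").getOrCreate()
        df = spark.range(0, 50_000_000)  # dummy dataset

        # Time the same aggregation at different partition counts.
        for n in [8, 64, 256]:
            start = time.time()
            df.repartition(n).selectExpr("sum(id)").collect()
            print(f"partitions={n} took {time.time() - start:.1f}s")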

  • @prasadadsul8703
    @prasadadsul8703 2 years ago

    Great information...... 👏👏👏

  • @sambitkumardash9585
    @sambitkumardash9585 4 years ago +1

    Nice video, Sir. This is a much-asked interview question. Could you please make one video about the other issues we face in Spark?

    • @DataSavvy
      @DataSavvy  4 years ago

      Sure Sambit... Do you have any other suggestions for questions?

    • @sambitkumardash9585
      @sambitkumardash9585 4 years ago

      @@DataSavvy Could you please explain how to deal with semi-structured data, from ingestion to computation.

  • @PrasadNadiger456
    @PrasadNadiger456 2 years ago

    Great video.. perfect explanation

  • @NishaKumari-op2ek
    @NishaKumari-op2ek 3 years ago +1

    Very useful videos. Thank you :)

  • @RAKESHKUMAR-tp8zj
    @RAKESHKUMAR-tp8zj 3 years ago

    You are one of the best mentors I have ever seen on YouTube. The way you explain is awesome, and the questions are all real-time ones.
    If my cluster memory is 10 GB and the data we want to process is 20 GB, will it process the data? Sir, can you please explain this topic?

    • @medotop330
      @medotop330 3 years ago

      No, you cannot process it.

    • @medotop330
      @medotop330 3 years ago

      You can do it using MapReduce if it is in the batch layer, or if you are not using iterative algorithms like machine-learning algos.

  • @vijeandran
    @vijeandran 3 years ago

    Neatly explained, thank you...

  • @suresh.suthar.24
    @suresh.suthar.24 4 months ago

    I have one doubt:
    Are reserved memory and YARN overhead memory the same? Because reserved memory also stores Spark internals.
    Thank you for your time.
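
    (For reference, a minimal sketch of where the two knobs live; the values are illustrative. Reserved memory is a fixed ~300 MB carved out of the JVM heap for Spark internals, while the YARN overhead is off-heap space requested on top of the heap, so they are not the same thing:)

        from pyspark.sql import SparkSession

        # Illustrative values, not recommendations.
        spark = (SparkSession.builder
                 .config("spark.executor.memory", "4g")            # JVM heap; ~300 MB of it is reserved memory
                 .config("spark.executor.memoryOverhead", "512m")  # off-heap space YARN grants on top of the heap
                 .getOrCreate())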

  • @praptijoshi9102
    @praptijoshi9102 5 months ago

    Amazing

  • @nikhilgupta110
    @nikhilgupta110 3 years ago

    Pure content, great topic, informative, interactive and simple. Thank you!!

  • @riyasmohammad9234
    @riyasmohammad9234 2 years ago +1

    Great video. Can you share the source of information for further reading?

  • @bhatiaparesh89
    @bhatiaparesh89 4 years ago

    Waiting for part 2! :🙈

    • @DataSavvy
      @DataSavvy  4 years ago

      Working on it... Will post in a few weeks. I need to explain one related concept first before that video.

  • @ravikumarkumashi7065
    @ravikumarkumashi7065 3 years ago

    Very well explained, thank you.

  • @ANUKARTHIM
    @ANUKARTHIM 4 years ago +3

    Dear Data Savvy,
    Could you please clarify: if we go for a broadcast join, it copies the small file into all available executors' memory, right? How come it causes a driver out-of-memory exception?

    • @DataSavvy
      @DataSavvy  4 years ago

      That file is first brought to the driver and merged (if it has multiple partitions), then it is sent to the executors.

    • @ANUKARTHIM
      @ANUKARTHIM 4 years ago +1

      @@DataSavvy Thanks for the answer

    • @DataSavvy
      @DataSavvy  4 years ago

      Thanks

    • @svsvikky
      @svsvikky 3 years ago

      @@DataSavvy Isn't broadcast done executor-to-executor, similar to BitTorrent? Please correct me if I am wrong.
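
    (A minimal sketch of the pattern discussed in this thread, with hypothetical paths and join column; the small side passes through the driver first, which is why an oversized "small" table can OOM the driver:)

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import broadcast

        spark = SparkSession.builder.getOrCreate()
        big = spark.read.parquet("/data/events")         # hypothetical large table
        small = spark.read.parquet("/data/dim_country")  # hypothetical small table

        # The small side is collected to the driver, merged, then shipped to
        # every executor; if it exceeds driver memory, the driver OOMs.
        joined = big.join(broadcast(small), "country_id")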

  • @rohithsaivemula3200
    @rohithsaivemula3200 2 years ago

    Very helpful

  • @ajaykiranchundi9979
    @ajaykiranchundi9979 3 years ago

    Very nice video!! Thank you.

  • @PrasadChallagondla
    @PrasadChallagondla 3 years ago

    Is there any real-time Spark project? Please upload a video on it. It would be helpful.

  • @nakkaeswaraoeswar2140
    @nakkaeswaraoeswar2140 1 year ago

    Thank you. Can you make a video about what Azure SQL is?

  • @naveena2226
    @naveena2226 4 months ago

    Hi all,
    I just got to know about the wonderful videos on the DataSavvy channel.
    In the executor OOM - big partitions slide: in Spark, every partition is of block size (128 MB), right? Then how can a big partition cause an issue?
    Can someone please explain this? I'm a little confused here.
    Even if there is a 10 GB file, when Spark reads it, it creates around 80 partitions of 128 MB. Even if one of the partitions is large, it cannot exceed 128 MB, right? So how does the OOM occur??

  • @ravikirantuduru1061
    @ravikirantuduru1061 4 years ago +1

    Good videos

  • @kiranmudradi26
    @kiranmudradi26 4 years ago +2

    Nice video. Question: when we call coalesce(1), does it cause any OOM issues in either the driver or the executor? If calling this operation does not throw any OOM, what could be the reason? Please clarify.

    • @DataSavvy
      @DataSavvy  4 years ago +1

      You are right... Coalesce can also cause a memory breach in a few situations...

    • @kiranmudradi26
      @kiranmudradi26 4 years ago +1

      @@DataSavvy Thanks. In that case the OOM will happen on the executor side, not the driver side. Is my understanding correct?

    • @DataSavvy
      @DataSavvy  4 years ago +1

      Yes...

    • @DataSavvy
      @DataSavvy  4 years ago +1

      Wait... A correction here... repartition(1) can cause the issue, not coalesce(1), as coalesce will not cause a shuffle and the data will stay on the same machines...

    • @kiranmudradi26
      @kiranmudradi26 4 years ago +2

      @@DataSavvy Thanks. I was about to ask the same question; you replied in time. Kudos.
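
    (A minimal sketch of the distinction above, with illustrative data and output paths. repartition(1) shuffles every row toward a single task, while coalesce(1) merges existing partitions in place without a shuffle:)

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()
        df = spark.range(0, 10_000_000)  # illustrative dataset

        # coalesce(1): no shuffle; partitions are merged on the machines
        # where they already live, so memory pressure stays bounded.
        df.coalesce(1).write.mode("overwrite").parquet("/tmp/out_coalesce")

        # repartition(1): full shuffle; all rows converge on one task,
        # which can OOM that executor on large data.
        df.repartition(1).write.mode("overwrite").parquet("/tmp/out_repartition")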

  • @aashishraina2831
    @aashishraina2831 3 years ago

    Recruiters say that you don't have production experience and that POC Spark work will not help. How can we convince them despite having a good understanding of PySpark? Please suggest.

  • @sundarkris1320
    @sundarkris1320 3 years ago +1

    Can you explain the difference between YARN memory overhead vs. Spark reserved and user memory?

  • @RAVIC3200
    @RAVIC3200 4 years ago

    Nice video again, Harjeet :) Hey, can you make videos on test cases in Spark/Scala as well? I have seen no one talk about it.

    • @DataSavvy
      @DataSavvy  4 years ago

      Hi Ravi, test cases are generally functional and use-case specific...

    • @rajlakshmipatil4415
      @rajlakshmipatil4415 4 years ago +1

      Ravishankar Maybe you can try using holdenkarau

    • @DataSavvy
      @DataSavvy  4 years ago +1

      Thanks for suggesting... Looks like a good resource... I will go through github.com/holdenk/spark-testing-base
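
    (A minimal sketch of a plain PySpark unit test, independent of spark-testing-base; the class name and data are illustrative:)

        import unittest
        from pyspark.sql import SparkSession

        class WordCountTest(unittest.TestCase):
            @classmethod
            def setUpClass(cls):
                # Small local session is enough for unit tests.
                cls.spark = (SparkSession.builder
                             .master("local[2]")
                             .appName("unit-test")
                             .getOrCreate())

            @classmethod
            def tearDownClass(cls):
                cls.spark.stop()

            def test_count(self):
                df = self.spark.createDataFrame([("a",), ("b",), ("a",)], ["word"])
                counts = dict(df.groupBy("word").count().collect())
                self.assertEqual(counts["a"], 2)

        if __name__ == "__main__":
            unittest.main()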

  • @Fresh-sh2gc
    @Fresh-sh2gc 2 years ago

    Spark on Kubernetes works completely differently. This applies only to Spark on Hadoop.

  • @krupab3388
    @krupab3388 2 years ago

    Can you please give an example of each OOM you have explained here? Lots of blogs give the same explanations, so what is extra here? Please provide examples; it would be great.

  • @touristplaces7837
    @touristplaces7837 1 year ago

    Hello. I have 16 crore (160 million) records on which I want to use a window function, but the order by is taking a huge amount of time and causing memory issues. Is there any alternative approach?
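
    (One common mitigation, sketched with a hypothetical input path and column names: give the window a partitionBy so each sort stays local to a key instead of becoming one global sort:)

        from pyspark.sql import SparkSession
        from pyspark.sql.window import Window
        from pyspark.sql.functions import row_number

        spark = SparkSession.builder.getOrCreate()
        df = spark.read.parquet("/data/events")  # hypothetical input

        # A window with orderBy but no partitionBy pulls all rows into a
        # single sorted partition; partitioning keeps each sort bounded.
        w = Window.partitionBy("customer_id").orderBy("event_time")
        ranked = df.withColumn("rn", row_number().over(w))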

  • @arvind30
    @arvind30 3 years ago +1

    Great video! I had a question regarding the YARN memory overhead. When a PySpark job runs, my understanding is that Python worker processes are started within the memory allocated to the executor, and the JVM then sends data back and forth to these Python processes. Won't the allocated Python objects use the memory of these Python processes instead of the YARN memory overhead?

    • @Fresh-sh2gc
      @Fresh-sh2gc 2 years ago +1

      The worker nodes run on resources from YARN memory. YARN normally runs on a shared cluster, so there is always a tug of war between the tenants of the cluster for memory; as a result, one cannot always use too much memory. However, when there is ample YARN memory, there is a process called preemption which gets more memory for the executors.
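
    (For context, a minimal sketch of the related knobs, with illustrative values. Python worker processes live outside the JVM heap and, unless capped separately, are accounted against the overhead region:)

        from pyspark.sql import SparkSession

        # Illustrative values, not recommendations.
        spark = (SparkSession.builder
                 .config("spark.executor.memoryOverhead", "1g")    # off-heap; covers Python workers by default
                 .config("spark.executor.pyspark.memory", "512m")  # optional explicit cap for Python workers
                 .getOrCreate())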

  • @anuragamit727
    @anuragamit727 2 years ago

    Hi Sir, Could you please make a video on the factors that decide the number of tasks, stages, and jobs created after submitting our application.

  • @carlosllerena3922
    @carlosllerena3922 3 years ago +1

    Question: if I use PySpark, do I still get those errors? Another question: instead of collect, what other command can we use?

  • @subimalkhatua2886
    @subimalkhatua2886 3 years ago

    Issue: container killed by YARN, Spark application exited with code 1. This is most common in AWS Glue or any Spark job. Increasing spark.yarn.executor.memoryOverhead and spark.executor.memory will help, but make sure they don't exceed the total yarn.nodemanager memory, or else there will be a configuration issue.

  • @saisravankumar6020
    @saisravankumar6020 2 years ago

    When loading a file into a DataFrame you get an OOM error; how will you rectify it? Can we get a demo?

  • @user-dl3ck6ym4r
    @user-dl3ck6ym4r 8 months ago +1

    How would we know which file is small and which file is larger? An interviewer asked me this question.

    • @DataSavvy
      @DataSavvy  7 months ago +1

      You can list the files in the folder and see the size of each file... hdfs dfs -ls is the command.

    • @user-dl3ck6ym4r
      @user-dl3ck6ym4r 7 months ago

      Thank you @@DataSavvy

    • @user-dl3ck6ym4r
      @user-dl3ck6ym4r 7 months ago

      But I am using an S3 bucket, so... @@DataSavvy
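
    (For S3, a minimal sketch using boto3, with a hypothetical bucket and prefix:)

        import boto3

        s3 = boto3.client("s3")
        resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="data/")  # hypothetical names
        for obj in resp.get("Contents", []):
            print(obj["Key"], obj["Size"])  # object size in bytes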

  • @krunalgoswami4654
    @krunalgoswami4654 2 years ago

    Why use RDDs in all the questions? Why not DataFrames?

  • @k.saibhargav8072
    @k.saibhargav8072 8 months ago +1

    How to avoid the collect operation?

    • @DataSavvy
      @DataSavvy  8 months ago

      You usually don't need collect... Can you give an example of where you are using it? I can suggest how to avoid it and code it properly.
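
    (A minimal sketch of common alternatives, with a hypothetical input path; collect() pulls every row to the driver, which is what usually OOMs it:)

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()
        df = spark.read.parquet("/data/events")  # hypothetical input

        rows = df.take(20)   # bounded sample instead of collect()
        df.show(20)          # inspect a few rows without materializing all
        df.write.mode("overwrite").parquet("/tmp/out")  # keep large results distributed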

  • @divit00
    @divit00 1 year ago

    Part 2??

  • @vishalmishra863
    @vishalmishra863 3 years ago

    Where is the second part?

  • @amitpadhi2717
    @amitpadhi2717 3 years ago

    I am not able to join your WhatsApp group. I am facing some issues on my local machine while setting up Spark; please let me know where to post my query.

    • @DataSavvy
      @DataSavvy  3 years ago +1

      Please join the Telegram group and send your query there... We have moved to Telegram... Http://t.me/bigdata_hkr

    • @amitpadhi2717
      @amitpadhi2717 3 years ago

      @@DataSavvy I have already dropped a mail to aforalgo@gmail.com; could you please check the issue I faced?

  • @sreenivasmekala6198
    @sreenivasmekala6198 3 years ago

    groupByKey is also a cause of out-of-memory, right?

    • @DataSavvy
      @DataSavvy  3 years ago

      You are right... If there is skewness in the data, then in the case of groupByKey we can end up facing a memory issue.
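
    (A minimal RDD sketch of the point above, with illustrative data; reduceByKey pre-aggregates within each partition before the shuffle, so a hot key never has to hold all its values in one task's memory the way groupByKey does:)

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()
        sc = spark.sparkContext

        pairs = sc.parallelize([("hot", 1)] * 100_000 + [("cold", 1)])

        # groupByKey would ship every value for "hot" to a single task,
        # which is where a skewed key can cause an OOM.
        counts = pairs.reduceByKey(lambda a, b: a + b)
        print(counts.collect())  # tiny result, safe to collect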

  • @midhileshmomidi2434
    @midhileshmomidi2434 3 years ago

    I am learning the concepts, but without real-time experience I am not able to get practice with data collection from various sources.
    I am able to clean data well using PySpark and can do ML with Spark ML via the MLlib library.
    But please suggest some sources to practice data collection from various sources.
    Thank you.

    • @DataSavvy
      @DataSavvy  3 years ago

      Sure, let me look into this and I will share some links... You can join our document library and Data Savvy group... You will get a lot of relevant information there.

  • @rahulpandit9082
    @rahulpandit9082 3 years ago +1

    Who is the person who disliked this video... I think they're frustrated with life or wife... 😀😀😀

  • @subajecintha173
    @subajecintha173 3 years ago

    The WhatsApp group is full.

    • @DataSavvy
      @DataSavvy  3 years ago

      Yes... Please join the Telegram group.

    • @RajuSharma-qd2uv
      @RajuSharma-qd2uv 1 year ago

      Can you please share your Telegram group name?