Tutorial 8 - Exploding Gradient Problem in Neural Network

  • Published 3 Dec 2024

COMMENTS • 178

  • @khalidal-reemi3361
    @khalidal-reemi3361 4 years ago +33

    I never got such a clear explanation of deep learning concepts.
    I took Coursera's deep learning course. They make it more difficult than it is.
    Thank you Krish.

  • @midhileshmomidi2434
    @midhileshmomidi2434 5 years ago +40

    From now on, if anyone asks me about Vanishing Gradient Descent OR Exploding Gradient Descent, I will not just answer, I will even take a class for them.
    The best video I've ever seen

    • @manishsharma2211
      @manishsharma2211 4 years ago

      Exactly

    • @kiruthigakumar8557
      @kiruthigakumar8557 4 years ago +3

      I have a small doubt... in vanishing, the values were very small, but here they are high, yet both have the same equation, right? Or is it because the weights in the vanishing case were normal and in the exploding case they are high?... Your help is really appreciated.

    • @sargun_narula
      @sargun_narula 4 years ago

      @@kiruthigakumar8557 Even I have the same doubt; if anyone can help, it would be really appreciated.

    • @chiragchauhan8429
      @chiragchauhan8429 4 years ago +4

      @@sargun_narula As he said, when using sigmoid the values will be between 0 and 1. If the weights are small when we initialise them, then for a smaller network with 1 or 2 hidden layers vanishing won't be a problem; but if it has more, say 10 layers, then when backpropagating from the last few layers the derivative keeps decreasing with every layer, and because of that the optimizer will be very slow to reach the minimum. That's what the vanishing gradient is. As for the exploding gradient, if the weights are big and the derivative keeps increasing while backpropagating, that may push our optimizer into diverging rather than reaching the minimum, i.e. the exploding problem. Simply put, weights shouldn't be initialized too high or too low. (A small numeric sketch of this follows at the end of this thread.)

    • @babupatil2416
      @babupatil2416 3 years ago

      @@kiruthigakumar8557 Irrespective of your activation function, your weights cause the exploding/vanishing gradient problem. Weights shouldn't be initialized too high or too low. Here is the Andrew Ng video on the same topic: ua-cam.com/video/qhXZsFVxGKo/v-deo.html
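
    A minimal numeric sketch of the point made in this thread, assuming a toy 10-layer chain where each layer contributes sigmoid'(z) times its weight to the backpropagated gradient (the weight and z values are made up for illustration):

        import math

        def sigmoid(z):
            return 1.0 / (1.0 + math.exp(-z))

        def sigmoid_prime(z):
            s = sigmoid(z)
            return s * (1.0 - s)  # always in (0, 0.25]

        def chained_gradient(weight, n_layers=10, z=0.5):
            # Multiply one local gradient, sigmoid'(z) * weight, per layer.
            grad = 1.0
            for _ in range(n_layers):
                grad *= sigmoid_prime(z) * weight
            return grad

        print(chained_gradient(weight=0.5))    # tiny value: vanishing gradient
        print(chained_gradient(weight=500.0))  # huge value: exploding gradient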

  • @winviki123
    @winviki123 5 years ago +39

    Loving this playlist
    Most of these abstract concepts are explained very elegantly
    Thank you so much

  • @tarun4705
    @tarun4705 1 year ago +2

    This playlist is like a treasure.

  • @skviknesh
    @skviknesh 3 years ago +5

    9:32 peak of interest! Happiness in explaining why it will not converge... I love that reaction!!!😍😍😍

  • @rukeshshrestha5938
    @rukeshshrestha5938 4 years ago +6

    I really love your videos. I started watching your tutorials only today. They were really helpful. Thank you so much for sharing your knowledge.

  • @raidblade2307
    @raidblade2307 5 years ago +2

    Deep Concepts are getting clear.
    Thank you sir. Such a beautiful explanation

  • @whitemamba7128
    @whitemamba7128 4 years ago +3

    Sir, your videos are very educational, and you put a lot of energy into making them. They make the learning process easy, and they also let me develop an interest in deep learning. That's the best I could have asked for, and you delivered it. Thank you, Sir.

  • @bigbull266
    @bigbull266 3 years ago +6

    The exploding gradient problem comes from initializing the weights too high. If the weights are large, then during backprop the gradient values will be large, which makes the new weights swing wildly when we update them [W_new = W_old - lr * Grad]. Because of this the weight values vary a lot at every epoch, and that is why gradient descent will never converge. (See the short update-rule sketch below.)
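
    A tiny sketch of that update rule on a one-dimensional loss L(w) = w**2 (made-up numbers), showing how an oversized gradient makes W_new = W_old - lr * Grad overshoot back and forth instead of converging:

        def descend(w, lr=0.1, grad_scale=1.0, steps=5):
            # Plain gradient descent on L(w) = w**2, whose true gradient is 2*w.
            history = [w]
            for _ in range(steps):
                grad = grad_scale * 2 * w  # grad_scale mimics an exploded gradient
                w = w - lr * grad          # W_new = W_old - lr * Grad
                history.append(w)
            return history

        print(descend(1.0, grad_scale=1.0))    # shrinks steadily toward 0: converges
        print(descend(1.0, grad_scale=100.0))  # swings to ever larger +/- values: diverges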

  • @pushkarajpalnitkar1695
    @pushkarajpalnitkar1695 3 years ago

    Best explanation for EXPLODING gradient problem on the internet I have encountered so far. Awesome!

  • @ArthurCor-ts2bg
    @ArthurCor-ts2bg 4 years ago +1

    Very passionate and articulate lecture well done

  • @annalyticsannalizaramos5890
    @annalyticsannalizaramos5890 3 years ago +1

    Congrats for a well explained topic. Now I know the effect of exploding gradients

  • @aravindpiratla2443
    @aravindpiratla2443 2 years ago

    Love the explanation bro... I used to initialize weights randomly but after watching this, I came to know the impact of such initializations...

  • @kishanpandey4798
    @kishanpandey4798 5 years ago +8

    Please see, the chain rule is missing something at 2:55. @krish naik

    • @omkarrane1347
      @omkarrane1347 5 years ago +9

      Yes, there is a mistake; it is missing del L / del O31 onwards.

    • @amrousimen684
      @amrousimen684 4 years ago

      @@omkarrane1347 yes this is a miss

  • @basharfocke
    @basharfocke 1 year ago

    Best explanation so far. No doubt !!!

  • @somanathking4694
    @somanathking4694 7 months ago

    How did I miss this class all these years?
    How come you are able to simplify the topics?
    👏

  • @farzanehparvar_
    @farzanehparvar_ 3 years ago

    That was one of the best explanations of the exploding gradient problem. But please mention the next video in the description box; I found it hard to find.

  • @tinumathews
    @tinumathews 5 years ago +3

    This is super, Krish, it's like a story that you explain... at 9:35 the whole picture jumps into your mind. Neat explanation. Nice work Krish... awaiting more videos. Meet you on Saturday... till then, cheers.

  • @slaozturk47
    @slaozturk47 2 years ago

    Your classes are quite clear, thank you so much !!!!

  • @vincenzo3908
    @vincenzo3908 4 years ago +1

    Very well explained, and the writings and drawings are very clear too by the way

  • @adityashewale7983
    @adityashewale7983 1 year ago

    Hats off to you sir, your explanation is top level. Thank you so much for guiding us...

  • @4abdoulaye
    @4abdoulaye 5 years ago +1

    YOU ARE JUST KIND DUDE. THANKS

  • @sindhuorigins
    @sindhuorigins 4 years ago +2

    The activation function is denoted by phi, not to be confused with the symbol for a cyclic (contour) integral.

  • @ronishsharma8825
    @ronishsharma8825 5 years ago +18

    There is a mistake in the chain rule, please correct it.

  • @anshulzade6355
    @anshulzade6355 2 years ago

    keep up the good work, disrupting the education system. Lots of love

  • @yogenderkushwaha5523
    @yogenderkushwaha5523 4 years ago

    Amazing explanation sir. I am going to learn whole deep learning from your videos only

  • @kueen3032
    @kueen3032 3 years ago +44

    One correction: dL/dW'11 should be (dL/dO31 · dO31/dO21 · dO21/dO11 · dO11/dW'11); it is written out in full after this thread.

    • @vikrambharadwaj7072
      @vikrambharadwaj7072 3 years ago +3

      In Tutorial 6 also there was a correction...!
      Is there an explanation?

    • @adarshyadav340
      @adarshyadav340 3 years ago

      You are right @kueen, krish has missed out the first term in the chain rule.

    • @vvek27
      @vvek27 3 years ago

      yes you are right

    • @manojsamal7248
      @manojsamal7248 3 years ago

      But what will come in "dL"? Is it (y-Y)^2, or will the log loss function come in "dL"?

    • @indrashispowali
      @indrashispowali 2 years ago

      Just wanted to know... does the chain rule refer to partial derivatives?
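
    For reference, the full corrected chain rule this thread is pointing at, written out in the video's notation (LaTeX form):

        \[
        \frac{\partial L}{\partial w'_{11}}
          = \frac{\partial L}{\partial O_{31}}
            \cdot \frac{\partial O_{31}}{\partial O_{21}}
            \cdot \frac{\partial O_{21}}{\partial O_{11}}
            \cdot \frac{\partial O_{11}}{\partial w'_{11}}
        \]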

  • @-birigamingcallofduty2219
    @-birigamingcallofduty2219 3 years ago

    Very very effective video sir 👍👍👍👍👍👍....my love and gratitude to you 🙏...

  • @PeyiOyelo
    @PeyiOyelo 4 years ago +1

    Another Great Video. Namaste

  • @ne2514
    @ne2514 3 years ago

    love your video of machine learning algorithms, kudos

  • @143balug
    @143balug 4 years ago +1

    Excellent videos bro, I am getting a clear picture of those concepts. Thank you very much for making the videos in a clear, understandable manner.
    I am following your every video.

  • @DanielSzalko
    @DanielSzalko 5 years ago +2

    Please keep making videos like this!

  • @emirozgun3368
    @emirozgun3368 4 years ago +1

    Pure passion, appreciate it.

  • @nareshbabu9517
    @nareshbabu9517 5 years ago +4

    Do tutorials based on machine learning like regression, classification and clustering, sir.

  • @indrashispowali
    @indrashispowali 2 years ago

    thanks Krish... nice explanations

  • @rajaramk1993
    @rajaramk1993 5 years ago +1

    excellent and to the point explanation sir. Waiting for your future videos in Deep Learning.

  • @janekou2482
    @janekou2482 4 years ago

    Awesome explanation! Best video I have seen for this problem.

  • @nitayg1326
    @nitayg1326 5 years ago

    Exploding GD explained nicely!

  • @pranjalgupta9427
    @pranjalgupta9427 3 years ago +1

    Awesome 😊👏👍

  • @sahilsaini3783
    @sahilsaini3783 3 years ago +2

    At 08:30, the derivative of O21 w.r.t. O11 is 125, but O21 is a sigmoid function. How can its derivative be 125 when the derivative of the sigmoid function ranges from 0 to 0.25?

  • @pranavgandhiprojects
    @pranavgandhiprojects 4 months ago

    so well explained!

  • @ganeshkharad
    @ganeshkharad 4 years ago

    best explanation... thanks for making this video

  • @sandipansarkar9211
    @sandipansarkar9211 4 years ago

    Superb video once again. But I need to study a little bit of theory. Still, I have no idea how questions are framed in an interview with regard to deep learning.

  • @harshsharma-jp9uk
    @harshsharma-jp9uk 3 years ago

    great work.. Kudos to u!!!!!!!!!!

  • @pdteach
    @pdteach 5 years ago

    Very nice explanation. Thanks.

  • @bangarrajumuppidu8354
    @bangarrajumuppidu8354 3 years ago

    super explanation sir !!

  • @nitishkumar-bk8kd
    @nitishkumar-bk8kd 4 years ago

    beautiful explanation

  • @brindapatel1750
    @brindapatel1750 4 years ago

    excellent krish
    love to watch your videos

  • @shamussim137
    @shamussim137 3 years ago +4

    Question:
    Hi Krish. dO21/dO11 is large because we multiply the derivative of the sigmoid (between 0 and 0.25) by a large weight. However, in Tutorial 7 we didn't use this formula (chain rule derivation); we directly said dO21/dO11 is between 0 and 0.25. Please can you clarify this? (See the small sketch at the end of this thread.)

    • @hritiknandanwar5095
      @hritiknandanwar5095 2 years ago

      Even I have the same question, sir can you please explain this section?

    • @shrikotha3899
      @shrikotha3899 2 years ago

      even I have the same doubt.. can u explain this?

    • @aadityabhardwaj4036
      @aadityabhardwaj4036 1 year ago +1

      That is because O21 = sigmoid(ff21), and when we take the derivative of O21 with respect to any variable (be it O11), we know it will range between 0 and 0.25, because the derivative of sigmoid(x) ranges from 0 to 0.25 and x can be any value.
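
    A small sketch of the distinction discussed in this thread, assuming the numbers used in the video (a weight of 500, and the sigmoid derivative taken at its maximum of 0.25): the derivative of the sigmoid with respect to its own input z stays within (0, 0.25], but dO21/dO11 also picks up the weight w21 through the chain rule, so it can go far above 0.25.

        import math

        def sigmoid(z):
            return 1.0 / (1.0 + math.exp(-z))

        def sigmoid_prime(z):
            s = sigmoid(z)
            return s * (1.0 - s)        # bounded by 0.25, as used in Tutorial 7

        w21 = 500.0                     # large weight, as in the video's example
        z21 = 0.0                       # at z = 0 the sigmoid derivative peaks at 0.25

        dO21_dz21 = sigmoid_prime(z21)  # 0.25
        dO21_dO11 = dO21_dz21 * w21     # chain rule pulls in w21: 0.25 * 500 = 125

        print(dO21_dz21, dO21_dO11)     # 0.25 125.0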

  • @tarunbhatia8652
    @tarunbhatia8652 3 years ago

    Best video. Hands down

  • @YoutubePremium-ny2ys
    @YoutubePremium-ny2ys 3 years ago

    Request for a video on side by side comparison of vanishing gradient and exploding gradient...

  • @kalpeshnaik8826
    @kalpeshnaik8826 4 years ago +1

    Is the exploding gradient problem only for the sigmoid activation function, or for all activation functions?

  • @sarrae100
    @sarrae100 3 years ago

    Excellent.

  • @jasbirsingh8849
    @jasbirsingh8849 4 years ago +4

    In the vanishing gradient video you directly put values between 0 and 0.25, since the derivative lies in that range, but why not put direct values here?
    I mean, we could have done the same in the vanishing gradient case as well, i.e. expanding the equation and multiplying by its weight?

    • @anshul8258
      @anshul8258 4 years ago

      Even I have the same doubt. After watching this video, I cannot understand why (dO21 / dO11) was directly put between 0 and 0.25 in the Vanishing Gradient Problem video.

    • @souravsaha1973
      @souravsaha1973 3 years ago

      @krish naik sir, can you please help clarify this doubt

    • @elileman6599
      @elileman6599 2 years ago

      yes it made me confused too

  • @saikiran-mi3jc
    @saikiran-mi3jc 5 years ago +1

    Waiting for future videos on DL

  • @ankurmodi4588
    @ankurmodi4588 3 years ago +1

    These likes will turn into 1M likes after mid-2021. People do not understand the effort and hard work, as they are also not doing anything right now. Wait and watch.

  • @praneethcj6544
    @praneethcj6544 4 years ago +1

    Excellent ..!!!

  • @samyakjain8079
    @samyakjain8079 3 years ago +1

    @7:47 d(w_21 * O_11) = O_11 dw_21 + w_21 dO_11 (why are you assuming w_21 is constant)

  • @omkarrane1347
    @omkarrane1347 5 years ago +3

    Sir, please note that in the last two videos there was a wrong application of the chain rule. Even our teacher, who referred to the video, has written the same mistake in her notes. Ref: del L / del O31 onwards.

    • @krishnaik06
      @krishnaik06  5 years ago

      I probably made a mistake in the last part

    • @shubhammaurya2658
      @shubhammaurya2658 5 years ago +1

      Can you explain briefly what is wrong, so I can understand?

    • @chinmaybhat9636
      @chinmaybhat9636 4 years ago

      Which one is correct then, the one used in this video or the one used in the previous video?

  • @y.mamathareddy8699
    @y.mamathareddy8699 5 years ago +1

    Sir, please make a video on Bayes' theorem and its concepts...

  • @louerleseigneur4532
    @louerleseigneur4532 3 years ago

    Thanks krish

  • @karunasagargundiga5821
    @karunasagargundiga5821 4 years ago +3

    Hello sir,
    In the vanishing gradient problem video you mentioned that the derivative of sigmoid is always between 0 and 0.25. When you took the derivative of the sigmoid function, i.e. the derivative of O21 w.r.t. O11, it should be in the range 0 to 0.25, but when you expanded it we got the answer 125. I did not understand how the derivative of the sigmoid exceeded the range 0 to 0.25. It seems contradictory. Hope you can clear my doubt, sir.

    • @priyanath2754
      @priyanath2754 4 years ago +1

      I am having the same doubt. Can anyone please explain it?

    • @reachDeepNeuron
      @reachDeepNeuron 4 years ago

      Even I had this question

    • @praneetkuber7210
      @praneetkuber7210 4 years ago

      He multiplied 0.25 by the initial weight w21, which was 500. w21 is the derivative of z w.r.t. O11 in his case.

  • @Mustafa-jy8el
    @Mustafa-jy8el 4 years ago +1

    I love the energy

  • @SuryaDasSD
    @SuryaDasSD 4 years ago +1

    7:56 there's a mistake in derivative.. please correct it

  • @emilyme9478
    @emilyme9478 3 years ago

    great video !

  • @sushantshukla6673
    @sushantshukla6673 5 years ago

    u doing great job man

  • @makemoney7506
    @makemoney7506 4 months ago

    Thank you very much, I learned a lot. I think in the gradient you forgot one term, the first one, dL/dO3.

  • @sushmitapoudel8500
    @sushmitapoudel8500 3 years ago

    You're great!

  • @16876
    @16876 4 years ago

    awesome video, much respect

  • @jagadeeswarareddy9726
    @jagadeeswarareddy9726 3 years ago

    Really very good videos. One doubt: high-value weights cause this exploding problem. But W_old might also be a large value, right? So doing W_old - dL/dW would not cause a big variance, right? Please help me.

  • @shambhuthakur5562
    @shambhuthakur5562 4 years ago +5

    Thanks Krish for the video; however, I didn't understand how you replaced the loss function with the output of the output layer. It should actually be the real output minus the predicted one. Please suggest.

    • @shashwatsinha4170
      @shashwatsinha4170 3 years ago

      He has just shown that the predicted output will be made an input to the loss function (not that the predicted output is the loss function, as you have understood it).

  • @rmn7086
    @rmn7086 4 years ago

    Krish Naik, top man!

  • @vd.se.17
    @vd.se.17 4 years ago

    Thank you.

  • @sumaiyachoudhury7091
    @sumaiyachoudhury7091 1 year ago

    at 2:47 you are missing the dL/dO31 term

  • @dhruvajpatil8359
    @dhruvajpatil8359 4 years ago

    Too good man !!! #BohotHard

  • @thunder440v3
    @thunder440v3 4 years ago

    Awesome video!

  • @quranicscience9631
    @quranicscience9631 5 years ago

    very good content

  • @anirbandas6122
    @anirbandas6122 2 years ago

    @2:37 you have missed a derivative, dL/dO31, on the RHS.

  • @alphonseinbaraj7602
    @alphonseinbaraj7602 5 years ago

    In this video, at 5:30 you mentioned w21'. Is this correct? I think it should be w11''. Am I right or wrong? So z = O11·w11'' + b2 should come instead of O11·w21 + b2. Am I right? Please clarify.

  • @jt007rai
    @jt007rai 4 years ago

    Thanks for this amazing video sir!
    Just to summarize, can I say that I will experience this problem only if my weight initialization is very high, the activation function is sigmoid, and the learning rate is also very high, and in no other cases?

    • @32deepan
      @32deepan 4 years ago

      The activation function doesn't matter for the exploding gradient problem to occur. High-magnitude weight initialization alone can cause this problem.

    • @songs-jn1cf
      @songs-jn1cf 4 years ago

      deepan chakravarthi
      The activation function's output is proportional to the weights being applied, so the exploding gradient indirectly depends on the activation function and directly on the weights.

    • @manishsharma2211
      @manishsharma2211 4 years ago

      The derivative should also be high.

  • @benvelloor
    @benvelloor 4 years ago

    Thanks a lot sir

  • @invisible2836
    @invisible2836 5 months ago

    So overall you're saying that if you choose high values for the weights, it will cause a problem in reaching, or maybe never reach, the global minimum?

  • @sumeetseth22
    @sumeetseth22 4 years ago

    Love your videos and can't thank you enough. Thank you so much for the awesomest lessons.

  • @pratikkhadse732
    @pratikkhadse732 4 years ago

    Doubt: the bias that is added, what constitutes this bias?
    For instance, the learning rate is found by optimization methods; what methodology is used to introduce the bias?

  • @SimoneIovane
    @SimoneIovane 5 years ago +2

    Very well explained, thanks! I have a doubt though: are vanishing and exploding gradients coexistent phenomena? Since they both happen in BP, does their occurrence depend exclusively on the value of the loss at a particular epoch? Hope my question is clear.

    • @reachDeepNeuron
      @reachDeepNeuron 4 years ago

      Even I have the same question. Would appreciate it if you can clear this up.

  • @subrataghosh735
    @subrataghosh735 3 years ago

    Thanks for the great explanation. One small doubt/clarification would be helpful. Since we have sigmoid, if the weight value is around 2 (e.g. 2) for dO11/dW11, then the dO21/dO11 value will be 0.25*2 = 0.5, and then the chain rule ((dO21/dO11) * (dO11/dW11)) will be 0.5*0.5 = 0.25, considering the dO11/dW11 weight is also 2. Then instead of exploding it will be shrinking. Can you please suggest what the thinking is for this scenario?

  • @shahariarsarkar3433
    @shahariarsarkar3433 3 years ago

    Sir, maybe there is a problem in the chain rule that you explained. Something is missing here, namely the derivative of L with respect to O31.

  • @komandoorideekshith85
    @komandoorideekshith85 8 months ago

    A small doubt: in another video you said that the derivative of the loss w.r.t. the weight equals the derivative of the loss w.r.t. the output, etc., but in this video you started directly from the output on the RHS. Could you please confirm it?

  • @MoosaMemon.
    @MoosaMemon. 6 months ago

    At 5:56, shouldn't it be the "derivative of z w.r.t. w_11" instead of the "derivative of z w.r.t. O_11"?

  • @revanthshalon5626
    @revanthshalon5626 4 years ago +1

    Sir, the exploding gradient problem occurs only when the weights are high, and the vanishing gradient occurs when the weights are too low. Is my assumption correct?

  • @KamalkaGermany
    @KamalkaGermany 2 years ago +2

    Shouldn't the derivative be dL/dW'11 = dL/dO31 and then the rest? Could someone please clarify? Thanks.

  • @Kabir_Narayan_Jha
    @Kabir_Narayan_Jha 5 years ago

    Great video

  • @Adinasa2
    @Adinasa2 4 years ago

    On what basis are the weights initialised?

  • @SambitBasu22MCA024
    @SambitBasu22MCA024 8 months ago

    So basically exploding and vanishing gradients depend on how the weights are initialised?

  • @jsverma143
    @jsverma143 5 years ago

    just excellent :-)

  • @soodipaj6477
    @soodipaj6477 4 years ago

    How do you define O_11 in the first hidden layer?

  • @subhamsekharpradhan297
    @subhamsekharpradhan297 3 years ago

    Sir, in the chain rule formula, I guess you have left out del(L)/del(O31) at the start.

  • @jibinsebastian187
    @jibinsebastian187 3 years ago

    How can we assign the weight value as 500? The normalized values are in (-1, 1).

  • @avsheshkumar8352
    @avsheshkumar8352 4 years ago

    Can you please tell how the weight is applied?

  • @smarttaurian30
    @smarttaurian30 1 year ago

    I don't understand the chain rule equation: how do we get the activation function there, when it should begin from dO21?