I never got such a clear explanation of deep learning concepts.
I took the Coursera deep learning course; they make it more difficult than it is.
Thank you, Krish.
From now on, if anyone asks me about the vanishing gradient or exploding gradient problem, I will not just answer; I will even take a whole class for them.
The best video I've ever seen
Exactly
I have a small doubt... in the vanishing case the values were very small, but here they are high, yet both use the same equation, right? Or is it because the weights in the vanishing case were normal and in the exploding case they are high?... Your help is really appreciated.
@@kiruthigakumar8557 Even I have the same doubt; if anyone can help, it would be really appreciated.
@@sargun_narula As he said, with sigmoid the values would be between 0 and 1. If the weights are initialised small, a network with only 1 or 2 hidden layers won't have a vanishing problem, but with something like 10 layers the derivative keeps decreasing with every layer during backpropagation, so the optimizer becomes very slow to reach the minimum; that is the vanishing gradient problem. For the exploding gradient, if the weights are bigger and the derivative keeps increasing during backpropagation, the optimizer may diverge rather than reach the minimum. Simply put, weights shouldn't be initialized too high or too low (a small numeric sketch of both cases follows below).
@@kiruthigakumar8557 Irrespective of your activation function, it is your weights that cause the exploding/vanishing gradient problem. Weights shouldn't be initialized too high or too low. Here is the Andrew Ng video on the same topic: ua-cam.com/video/qhXZsFVxGKo/v-deo.html
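A minimal numpy sketch of what the replies above describe (the layer count, the z value, and the weight values are made-up, just for illustration): the backpropagated gradient is roughly a product of one sigmoid'(z) * weight factor per layer, so small weights shrink it toward zero while large weights blow it up.

import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)               # always between 0 and 0.25

def backprop_factor(weight, n_layers=10, z=0.5):
    # each layer contributes roughly sigmoid'(z) * weight to the chain-rule product
    grad = 1.0
    for _ in range(n_layers):
        grad *= sigmoid_grad(z) * weight
    return grad

print(backprop_factor(weight=0.5))     # around 5e-10 -> vanishing gradient
print(backprop_factor(weight=500.0))   # around 5e+20 -> exploding gradient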
Loving this playlist
Most of these abstract concepts are explained very elegantly
Thank you so much
This playlist is like a treasure.
9:32 peak of interest! Happiness in explaining why it will not converge... I love that reaction!!!😍😍😍
yeah, the time where I smiled with respect :)
I really love your videos. I started watching your tutorials only today. It was really helpful. Thank you so much for sharing your knowledge.
Deep Concepts are getting clear.
Thank you sir. Such a beautiful explanation
Sir, your videos are very educational, and you put a lot of energy into making them. They make the learning process easy, and they have also developed my interest in deep learning. That's the best I could have asked for, and you delivered it. Thank you, Sir.
The exploding gradient problem is caused by high weight initialization. If the weights are high, then during backprop the gradient values will be high, which in turn makes the weight update [Wnew = Wold - lr * Grad] very large. Because of this the weights vary a lot at every epoch, and that is why gradient descent never converges (a tiny numerical example follows below).
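To see that non-convergence in numbers, here is a tiny toy example (the loss, learning rate, and blow-up factor are made-up assumptions, not values from the video): with an inflated gradient, the update Wnew = Wold - lr * grad overshoots the minimum further on every step.

# toy loss L(w) = w**2, whose minimum is at w = 0 and whose gradient is 2*w
w = 3.0
lr = 0.1
blow_up = 50.0                     # hypothetical factor mimicking an exploded gradient

for epoch in range(5):
    grad = 2 * w * blow_up         # artificially inflated gradient
    w = w - lr * grad              # Wnew = Wold - lr * grad
    print(epoch, w)                # w swings between ever larger +/- values, never converging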
Best explanation for EXPLODING gradient problem on the internet I have encountered so far. Awesome!
Very passionate and articulate lecture well done
Congrats for a well explained topic. Now I know the effect of exploding gradients
Love the explanation bro... I used to initialize weights randomly but after watching this, I came to know the impact of such initializations...
Please see: the chain rule is missing a term at 2:55. @krish naik
Yes, there is a mistake; the del L / del O31 term is missing onwards.
@@omkarrane1347 Yes, this is a miss.
Best explanation so far. No doubt !!!
how i missed the class all these years
how come you are able to simplify the topics.
👏
That was one of the best explanations of the exploding gradient problem. But please mention the next video in the description box; I found it hard to locate.
This is super, Krish, it's like a story that you explain... at 9:35 the whole picture jumps into your mind. Neat explanation. Nice work, Krish... awaiting more videos. See you on Saturday... till then, cheers.
Your classes are quite clear, thank you so much !!!!
Very well explained, and the writings and drawings are very clear too by the way
Hats off to you sir, your explanation is top level. Thank you so much for guiding us...
YOU ARE JUST KIND DUDE. THANKS
The activation function is denoted by phi, not to be confused with the symbol for a cyclic (closed-path) integral.
There is a mistake in the chain rule, please correct it.
Yeah I commented on it too
No. It is correct
No it is incorrect
Yes, the del L / del O31 term is missed.
keep up the good work, disrupting the education system. Lots of love
Amazing explanation sir. I am going to learn whole deep learning from your videos only
One correction: dL/dW'11 should be (dL/dO31 · dO31/dO21 · dO21/dO11 · dO11/dW'11); see it written out after this thread.
In tutorial 6 also there was a correction...!
is there an explanation
You are right @kueen, krish has missed out the first term in the chain rule.
yes you are right
But what goes in "dL": is it (y - y_hat)^2 or the log loss function that goes in "dL"?
Just wanted to know... does the chain rule here refer to partial derivatives?
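For reference, and if I am reading the video's notation right, this is the corrected chain rule the thread above is pointing to, with the missing first term included (LaTeX notation):

\frac{\partial L}{\partial w'_{11}} = \frac{\partial L}{\partial O_{31}} \cdot \frac{\partial O_{31}}{\partial O_{21}} \cdot \frac{\partial O_{21}}{\partial O_{11}} \cdot \frac{\partial O_{11}}{\partial w'_{11}}

And yes, every factor here is a partial derivative, taken along the path from w'11 through O11, O21 and O31 up to the loss L.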
Very very effective video sir 👍👍👍👍👍👍....my love and gratitude to you 🙏...
Another Great Video. Namaste
love your video of machine learning algorithms, kudos
Excellent videos bro, I am getting a clear picture of those concepts. Thank you very much for making the videos in a clear, understandable manner.
I am following your every video.
Please keep making videos like this!
Pure passion, appreciate it.
Do tutorials on machine learning, like regression, classification and clustering, sir.
thanks Krish... nice explanations
excellent and to the point explanation sir. Waiting for your future videos in Deep Learning.
Amazing In-Depth Explanation!
Awesome explanation! Best video I have seen for this problem.
Exploding GD explained nicely!
Awesome 😊👏👍
At 08:30, the derivative of O21 w.r.t. O11 is 125, but O21 is a sigmoid function. How can its derivative be 125 when the derivative of the sigmoid function ranges from 0 to 0.25?
so well explained!
Best explanation... thanks for making this video.
Superb video once again. But I need to study a little bit of theory. Still, I have no idea how questions are framed in interviews with regard to deep learning.
great work.. Kudos to u!!!!!!!!!!
Very nice explanation. Thanks.
super explanation sir !!
beautiful explanation
excellent krish
love to watch your videos
Question:
Hi Krish. dO21/dO11 is large because we multiply the derivative of the sigmoid (between 0 and 0.25) by a large weight. However, in Tutorial 7 we didn't use this formula (the chain rule expansion); we directly said dO21/dO11 is between 0 and 0.25. Please can you clarify this?
Even I have the same question, sir can you please explain this section?
even I have the same doubt.. can u explain this?
That is because O21 = sigmoid(ff21), and when we take the derivative of O21 with respect to its input, we know the sigmoid factor will range between 0 and 0.25, because the derivative of sigmoid(x) ranges from 0 to 0.25 for any x (a quick numeric check follows below).
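A quick numeric check of the 0 to 0.25 claim (plain numpy over a range of inputs, nothing taken from the video):

import numpy as np

z = np.linspace(-10, 10, 10001)
s = 1.0 / (1.0 + np.exp(-z))
ds_dz = s * (1.0 - s)              # derivative of sigmoid w.r.t. its own input
print(ds_dz.max())                 # about 0.25, reached at z = 0
print(ds_dz.min())                 # approaches 0 for large |z|

Note that this bound applies to the sigmoid's own input; once the chain rule also multiplies by a weight, as in the exploding-gradient expansion, the overall dO21/dO11 can leave this range.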
Best video. Hands down
Request for a video on side by side comparison of vanishing gradient and exploding gradient...
Is the exploding gradient problem only for the sigmoid activation function, or for all activation functions?
Excellent.
In the vanishing gradient video you directly put values between 0 and 0.25, since the derivative lies in that range, but why not put direct values here?
I mean, we could have done the same thing in the vanishing gradient case as well, i.e. expanding the equation and multiplying by the weight.
Even I am having the same doubt. After watching this video, I cannot understand why (dO21 / dO11) was directly put between 0 and 0.25 in the Vanishing Gradient Problem video.
@krish naik sir, can you please help clarify this doubt
yes it made me confused too
Waiting for future videos on DL
These likes will turn into 1M likes after mid-2021. People do not understand the effort and hard work, as they are also not doing anything right now. Wait and watch.
Excellent ..!!!
@7:47 d(w_21 * O_11) = O_11 dw_21 + w_21 dO_11 (why are you assuming w_21 is constant?)
Sir, please note that in the last two videos there was a wrong application of the chain rule. Even our teacher, who referred to the video, has written the same mistake in her notes. Ref: the del L / del O31 term onwards.
I probably made a mistake in the last part
Can you explain briefly what is wrong, so I can understand?
Which one is correct then, the one used in this video or the one used in the previous video?
Sir, please make a video on Bayes' theorem and its concepts...
Thanks krish
Hello sir,
In the vanishing gradient problem you mentioned that the derivative of the sigmoid is always between 0 and 0.25. When you took the derivative of the sigmoid function, i.e. the derivative of O21 w.r.t. O11, it should be in the range 0 to 0.25, but when you expanded it we got the answer 125. I did not understand how the derivative of the sigmoid exceeded the range 0 to 0.25. It seems contradictory. Hope you can clear my doubt, sir.
I am having the same doubt. Can anyone please explain it?
Even I had this question
He multiplied 0.25 by the initial weight w21, which was 500. W21 is the derivative of z w.r.t. O11 in his case (a quick numeric check follows below).
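A quick check of that arithmetic (w21 = 500 is the large weight from the example; z21 = 0 is just an assumed input, the point where the sigmoid derivative peaks):

from math import exp

def sigmoid_grad(z):
    s = 1.0 / (1.0 + exp(-z))
    return s * (1.0 - s)           # at most 0.25

w21 = 500.0                        # large weight used in the example
z21 = 0.0                          # pre-activation input to O21, chosen where sigmoid' is largest
# chain rule: dO21/dO11 = sigmoid'(z21) * dz21/dO11, and since z21 = w21*O11 + b2, dz21/dO11 = w21
dO21_dO11 = sigmoid_grad(z21) * w21
print(dO21_dO11)                   # 0.25 * 500 = 125.0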
I love the energy
7:56 there's a mistake in the derivative... please correct it.
great video !
u doing great job man
Thank you very much, I learned a lot. I think in the gradient you forgot one term, the first one, dL/dO31.
You're great!
awesome video, much respect
Really very good videos. One doubt: high-value weights cause this exploding problem, but W-old might also be a large value, right? If we do W-old - dL/dW, would that not also cause a big variance? Please help me.
Thanks Krish for the video. However, I didn't understand how you replaced the loss function with the output of the output layer; it should actually be the real output minus the predicted output. Please suggest.
He has just shown that the predicted output is fed as input to the loss function (not that the predicted output is the loss function, as you have understood it).
Krish Naik, best man!
Thank you.
at 2:47 you are missing the dL/dO31 term
Too good man !!! #BohotHard
Awesome video!
very good content
@2:37 you have missed a derivative, dL/dO31, on the RHS.
In this video at 5:30 you mentioned w21'; is this correct? I think that should be w11''. Am I right or wrong? So it would be z = O11·w11'' + b2 instead of O11·w21 + b2. Am I right? Please clarify.
Thanks for this amazing video sir!
Just to summarize, can I say that I will experience this problem only if my weight initialization is very high, the activation function is sigmoid, and the learning rate is also very high, and in no other cases?
The activation function doesn't matter for the exploding gradient problem to occur; high-magnitude weight initialization alone can cause it (see the sketch after this thread).
deepan chakravarthi
The activation function's output depends on the weights being applied, so the exploding gradient indirectly depends on the activation function and directly on the weights.
The derivative should also be high.
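A small sketch in support of that point: large weights alone can blow up the backprop product even without sigmoid (the depth, width, and weight scale here are made-up; all ReLU units are assumed active, so each local derivative is 1 and the weights dominate).

import numpy as np

np.random.seed(0)
depth, width = 10, 8
layers = [np.random.randn(width, width) * 5.0 for _ in range(depth)]   # large random weights

grad = np.ones(width)
for W in reversed(layers):
    # backprop through a layer whose ReLU units are all assumed active (local derivative = 1)
    grad = W.T @ grad
print(np.linalg.norm(grad))        # huge value -> exploding, and no sigmoid involved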
Thanks a lot sir
So overall you're saying that if you choose high values for the weights, it will cause problems in reaching, or maybe never reach, the global minimum.
Love your videos and can't thank you enough. Thank you so much for the awesome lessons.
Doubt: the bias that is added, what constitutes this bias?
For instance, the learning rate is found by optimization methods; what methodology is used to introduce the bias?
Very well explained, thanks! I have a doubt though: are vanishing and exploding gradients coexistent phenomena? Since they both happen during backpropagation, does their occurrence depend exclusively on the value of the loss at a particular epoch? Hope my question is clear.
Even I have the same question. Appreciate it if you can clear this up.
Thanks for the great explanation. One small doubt/clarification would be helpful. Since we have sigmoid, if the weight value is somewhere around 2, then the dO21/dO11 value will be 0.25 * 2 = 0.5, and the chain rule product ((dO21/dO11) * (dO11/dW11)) will be 0.5 * 0.5 = 0.25, considering the dO11/dW11 weight is also 2. Then instead of exploding it will be shrinking. Can you please suggest what the thinking is for this scenario? (quick check below)
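A quick check of the arithmetic in the question above (0.25 is the sigmoid-derivative bound; w = 2 is the weight assumed in the question). With weights around 2 the per-layer factor is at most 0.5, so the product does indeed shrink; roughly speaking, the sigmoid-path factor only exceeds 1 once the weight magnitude goes beyond 4.

sig_grad_max = 0.25                # upper bound on the sigmoid derivative
w = 2.0                            # weight value assumed in the question
per_layer = sig_grad_max * w       # at most 0.5 per layer
print(per_layer * per_layer)       # 0.25 after two factors -> shrinking, not exploding
print(sig_grad_max * 4.0)          # 1.0 -> break-even weight magnitude is about 4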
Sir, maybe there is a problem in the chain rule that you explained. Something is missing here, namely the derivative of L with respect to O31.
A small doubt: in another video you said that the derivative of the loss w.r.t. the weight equals the derivative of the loss w.r.t. the output, and so on, but in this video you started directly from the output on the R.H.S. Could you please confirm?
At 5:56, shouldn't it be "derivative of z w.r.t. w_11" instead of "derivative of z w.r.t. O_11"?
Sir, is my assumption correct that the exploding gradient problem occurs only when the weights are high, and the vanishing gradient problem occurs only when the weights are too low?
Yes
Shouldn't the derivative be dL/dW'11 = dL/dO31 times the rest? Could someone please clarify? Thanks.
You're right
Great video
On what basis are the weights initialised?
So basically Exploding and vanishing are dependent on how the weights are initialised?
just excellent :-)
How do you define O_11? in the first hidden layer?
Sir, in the chain rule formula, I guess you have left out the del(L)/del(O31) term at the start.
How can we assign the weight value as 500? The normalized range is (-1, 1).
Can you please tell how the weight is applied?
I don't understand the chain rule equation: how do we get the activation function there, when it should begin from dO21?