I am a data scientist working at a startup company in Bangladesh.
Thank you so much for preparing such wonderful videos.
You are a really great teacher; I really learned a lot from you.
This video cleared many doubts. I would suggest everyone watch it, even if you have watched the previous videos.
You are a very good teacher. I have been following your videos since 2020. They have helped me to understand so many concepts in machine learning and deep learning. I like the simplicity of your teaching. Thanks a lot. Writing from Nigeria.
I am becoming a fan of yours. I am a very lazy person and never want to study, but after watching your videos it feels good to learn. Thank you, Krish Sir.
The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. At the opposite extreme, Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance. Obviously this makes the algorithm much faster, since it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (SGD can be implemented as an out-of-core algorithm).
On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down. So once the algorithm stops, the final parameter values are good, but not optimal.
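To make the contrast concrete, here is a minimal sketch in plain NumPy (toy linear-regression data; the names and numbers are made up for illustration, not taken from the video): batch GD computes one gradient over all rows per update, while SGD updates on a single randomly picked row.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # toy feature matrix
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0 # toy targets
Xb = np.c_[np.ones(len(X)), X]           # prepend a bias column
lr = 0.01

# Batch GD: one update per full pass, gradient averaged over ALL rows
w = np.zeros(Xb.shape[1])
for epoch in range(100):
    grad = Xb.T @ (Xb @ w - y) / len(Xb)
    w -= lr * grad

# SGD: one update per randomly picked row (only that row is needed in memory)
w = np.zeros(Xb.shape[1])
for step in range(10_000):
    i = rng.integers(len(Xb))
    xi, yi = Xb[i], y[i]
    grad = (xi @ w - yi) * xi
    w -= lr * grad
```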
Randomness is good for escaping local optima but bad because the algorithm can never settle at the minimum; to overcome this you can use a learning-rate scheduler, which reduces the step size as we approach the global minimum.
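A minimal sketch of what such a schedule could look like (the 1/(1 + decay·t) form and the numbers are just one illustrative choice, not the one from the video):

```python
def lr_schedule(step, lr0=0.1, decay=0.01):
    """Simple 1/(1 + decay*step) decay: big steps early, tiny steps near the minimum."""
    return lr0 / (1.0 + decay * step)

# e.g. lr_schedule(0) -> 0.1, lr_schedule(1000) -> ~0.009
```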
The whole data is fed in GD; in stochastic we take one record at a time, and in (mini-)batch we batch the records. We call it stochastic because there is more randomness in it: since the data is fed one record at a time, there is randomness and noise.
I am amazed by the improvement in the quality, clarity and depth of intuition in your recent videos. Keep up the great work. I have watched most of your deep learning videos and I must say you make learning very easy.
Because Tony learns from his mistakes.
Hands down the best explanation of optimizers on planet Earth to this day. Whenever I want a refresher I simply go back to this video; nothing better has ever come out since this video was launched. Thank you, Krish, from the bottom of my heart.
You are the best teacher. I have seen many videos, but no one explains concepts so deeply and clearly.
I had always been confused by optimizers in neural networks; however, this was the best resource available on the internet and gave me end-to-end clarity. Hats off to Krish Sir.
Earlier I was literally struggling with the concept of optimizers, but after I watched your video it became very easy to understand. A very simple way of explaining even a complex topic.
An extremely important video, Sir..❤
This guy's great; he repeats himself so you can't forget.
Many thanks, Krish, for your great efforts; it's excellent material on optimizers.
Very informative and the best video on YouTube for understanding the details of all the optimization techniques. Thanks @Krish Naik. I have become your admirer.
You are a fantastic teacher Krish.
Simple.
This is what I call amazing. I went through the paper but was not able to grasp the concept; your teaching skill is amazing. Thanks for this video. I request a video on the Yogi algorithm.
Krish, you have taught it very nicely; it became simple to learn, it is like a story. Thanks a lot for making NNs and optimizers very easy to learn.
Beautifully explained and taught. Hats off!
Amazing explanation Krish. You teach in a very simple manner. I respect your skills. You made deep learning concepts so easy. Keep doing the good work. Thank you so much and All the Best.
This is a well explained video to understand Optimizers. Thanks a lot Krish!
Nice explanation, Krish sir... wonderfully explained all the optimizers.
Krish is fast becoming my favorite teacher.
Good stuff, Mr. Krish; simple and step by step, helped me a lot.
Thank you so much, Sir, for this educative video.
VERY VERY VERY EDUCATIVE.
THANKS A LOT, Sir 🙏
Not even my course faculty had taught like this.
And the words you spoke @56:00 increased my respect towards you, Sir.
That's 💯% true. We should be respectful to the researchers and everyone behind what we are learning.
Wonderful way of teaching ! Krish Rockzzzz
Thanks a lot for this video! It was very lucid and comprehensive!
Thank you Krish. You make it seem easy to grasp.
Try to watch the full video; that would be better for understanding every optimizer... Thank you so much, Krish Naik Ji 👍👍👍
Thank you, Mr Krish, your work inspired me; now I understand optimizers.
Really nice way of teaching, Krish. Thank you so much.
One of the best methods of teaching. Thanks a lot; so simple, so concise, and every point is important.
About bias correction in Adam: just wanted to write about the need for it. When we have B1 and B2 (the betas), for the first iteration both the momentum accumulator (Vdw) and the squared-gradient accumulator (Sdw) start at zero, so Sdw(1) ends up being very small. Since Sdw sits in the denominator while updating the new weight w1, this gives a really huge change in the initial iterations, and the paper mentions that due to this bias of starting from zero the loss might not reduce over time. So the authors proposed a bias correction, where we do a weighted average instead of a simple moving average:
Sdw(t)_corrected = Sdw(t) / (1 - B2^t), where t is the number of iterations. If you notice, for the first few iterations the bias-corrected Sdw(t) differs from Sdw(t), but as t increases Sdw(t)_corrected becomes equal to Sdw(t) because the denominator approaches 1. This correction removes the bias created by initializing Vdw(0) = 0 and Sdw(0) = 0.
Also, wonderful explanation @Krish Naik sir. I learnt all about optimizers from your video, wanted to find out the reason behind bias correction, and ended up finding this. Awesome explanations!!
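To see the bias-correction point in the comment above numerically, here is a tiny sketch (a constant squared gradient of 0.04 is assumed purely for illustration) of how dividing by (1 - beta^t) fixes the zero-initialization bias in the early iterations:

```python
beta2 = 0.999
g2 = 0.04      # pretend the squared gradient is constant at 0.04
s = 0.0        # Sdw initialised to zero

for t in range(1, 6):
    s = beta2 * s + (1 - beta2) * g2
    s_hat = s / (1 - beta2 ** t)        # bias-corrected estimate
    print(t, round(s, 6), round(s_hat, 6))

# Without correction, s starts near zero (0.00004, 0.00008, ...), which would blow up
# the step because sqrt(s) sits in the denominator; s_hat is already ~0.04 from step 1.
```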
Thanks for the nice explanation
Thank you for your efforts krish.... Your videos are incredible!!!!
You're too awesome to exist ! Thanks a lot man !!
Shocked, such a clear explanation!
Nice explanation, thanks a lot.
Very good lecture, thank you. I am watching from Nepal.
This is a really awesome video. The maths is explained in real detail. Thanks.
what an effort! stunning
We are understanding because you teach brilliantly.
Excellent video! Understood things I had problems with.
Amazing video krish sir..!
Thank you so much; in one video my whole unit is finished.
I mean, to change the learning rate we should use a scalar value to decay it after each iteration.
Thank you so much for this live session !!
Very good: one by one from basic to advanced, each one related to the previous, gradually building up to the best optimizer, Adam ❤️
Hello Krish
I'm a research scholar and I was looking for some good explanations, and luckily I found your video; you made it clear in the first shot.
Forget the formulas; why one algorithm after another came into the picture was made really clear with the math intuition.
Will save your video for future reference.
Thank you, Krish. Appreciate your work.
Awesome explanation
Really very great effort ..
Thank you sir
Mini-batch Gradient Descent: at each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini-batches. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.
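A minimal sketch of the mini-batch idea (NumPy, toy data, arbitrary batch size; an illustration rather than the video's exact setup): shuffle, slice into small batches, and do one vectorised gradient step per batch.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0
Xb = np.c_[np.ones(len(X)), X]
w = np.zeros(Xb.shape[1])
lr, batch_size = 0.05, 32

for epoch in range(20):
    idx = rng.permutation(len(Xb))              # reshuffle every epoch
    for start in range(0, len(Xb), batch_size):
        b = idx[start:start + batch_size]       # a small random set of instances
        grad = Xb[b].T @ (Xb[b] @ w - y[b]) / len(b)
        w -= lr * grad                          # one matrix-level update per mini-batch
```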
In an earlier comment you said that stochastic takes the whole data.
you are the best at explaining
I don't know what to say, so I subscribed!!
nice video :)
Amazing video Sir.
Hi sir, you made a great video.
But there is a small error in the SGD with momentum equation.
In the video, you explained this equation:
w_t = w_t-1 - (learning_rate) * dw_t
dw_t = B * dw_t-1 + (1 - B) * dL/dw_t-1
But in this equation it should not be dL/dw_t-1; it should be dL/dw_t.
So the correct equation is
dw_t = B * dw_t-1 + (1 - B) * dL/dw_t
I guess you are absolutely correct.
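A small sketch of SGD with momentum under that convention (toy scalar loss, generic names; this is just an illustration of the idea, not the video's exact notation): the smoothed term is an exponentially weighted average of the gradients, and the gradient used at step t is the one evaluated at the weight being updated.

```python
beta, lr = 0.9, 0.01
v = 0.0      # momentum accumulator (the comment's dw_t), v_0 = 0
w = 5.0      # some scalar weight

def dL_dw(w):            # toy loss L = w**2, so dL/dw = 2*w
    return 2 * w

for t in range(100):
    g = dL_dw(w)                   # gradient at the current weight (dL/dw_t)
    v = beta * v + (1 - beta) * g  # exponentially weighted average of gradients
    w = w - lr * v                 # w_t = w_t-1 - lr * v_t
```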
Thank you sir...you have explained everything very nicely... excellent work..🙏❤️
Respect, Sir, such an amazing explanation, super crystal clear. Super thank you.
Find Nachiketa Hebbar's video on SGD with momentum to get a feel for the topic. Krish has.....written the formulas.
Great brother, doing excellent work 👍👍
Thank you for the wonderful intuition and for clearing the concepts.
Bias correction is done to correct the bias in the velocity and squared-velocity estimates at low iteration counts t; for higher iterations it won't make much of a difference.
Sir, please do a video on SVM kernels.
very well explained 👍
Hey Krish, at 1:32:26 it should be Vdb instead of Vdw, right, since we are calculating with respect to the bias?
Thank you for this interesting story of Optimizers, really helped a lot : )
Yes, you are right.
Thanks for explaining it simply and easily.
@1:39:40 Why is this equation being changed? Is it not for the weight? If it's bias correction, then should we not update only Vdb? Why Vdw? Should it rather not just be called "correction"? Why is the term "bias" used?
Hi Krish, your live data science project demos are very useful. Interested to know when we can see further new projects. Requesting projects with stock market applications.
Time taken to update the weights is the same in all three cases (GD, SGD and Mini-batch SGD). Only forward propagation will take different times in these three cases.
Thanks, Krish, for this great explanation. I have understood it now; earlier I was not able to follow, so thanks again, Krish.
Wonderful session sir 🔥🔥
Understood 100%; I got it in a single pass. Excellent explanation.
In the loss formula for stochastic gradient descent for n records, you are dividing by 2 after calculating the sum of squares of the loss. I did not understand the reason for dividing by 2. Should it not be division by n? I am referring to the loss formula written on the extreme right-hand side at timestamp 8:35 of the video.
I think it's wrong. As you rightly pointed out, it should be 1/n and not 1/2. Also, if you look at his session on loss functions, he did use 1/n.
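On the 1/2: in many texts a 1/2 in front of a squared-error term is just a convention so that the derivative has no stray factor of 2 (this is a general remark, not a claim about the exact formula written in the video). A tiny numerical check:

```python
# With L = 0.5 * (y - y_hat)**2, dL/dy_hat = (y_hat - y)  -> no factor of 2.
# With L = (y - y_hat)**2,       dL/dy_hat = 2 * (y_hat - y).
y, y_hat, eps = 3.0, 2.0, 1e-6
L = lambda yh: 0.5 * (y - yh) ** 2
numeric_grad = (L(y_hat + eps) - L(y_hat - eps)) / (2 * eps)
print(numeric_grad, y_hat - y)   # both are approximately -1.0
```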
Alpha_t is wrong: you are giving w_t, but it should be w_t-1 for Adagrad.
Great Video Sir !!!
Sir, in the live session you use Sdw in RMSProp, and in the recorded videos you use Wavg; are both the same or different?
@Krish Naik, OK, then the previous video on SGD with momentum was wrong and this explanation is correct?
Thanks Krish
I think you are wrong about SGD. Stochastic stands for random, so it means it will choose random inputs and perform GD on them, so it will converge faster; it does not mean it iterates one by one.
After seeing this video, I'm getting dizzy. You taught very well, but my mind is dancing with fear after seeing so much.
Thank you so much Sir for this
Thank you sir....you are amazing
@37:49 What is this a1, a2, a3 data, and why are you replacing it with dL/dw or dL/db at @45:06?
Thank you for the video, it's beyond helpful!
One question: when talking about the implementation at 01:23:00, would we have two different effective learning rates, one for updating the weights and the other for updating the bias, since the calculation of the exponentially weighted average would be different: one depends on the derivative of the loss w.r.t. the weights and the other on the derivative of the loss w.r.t. the bias? :)
Thank you! :)
Very nice video. One doubt here: in each epoch, are we using the weights from the last completed epoch, or just randomly generating them in each epoch?
Hi sir, first of all, thank you for providing such valuable education. Sir, where can we get these notes?
One of the best video about optimization algorithm.❤️
Great Video.
respect!!!! subbed
Hi Krish, can you do the same thing for ML techniques?
Sir, thank you so much for this story. It has cleared all my doubts; the maths behind all this is so interesting. But sir, you have not explained RMSProp anywhere, neither in this mega video on optimizers nor in the linked Tutorial 16 (AdaDelta and RMSprop optimizer). Can you please walk us through RMSProp once, or make a short video on it? Even in this video you state RMSProp directly, but we don't know why RMSProp was introduced, the way we do for the other optimizers. Looking forward to this.
Also, sir, in Tutorial 16 (AdaDelta and RMSprop optimizer) gamma is used and the terminology is weighted average (Wavg), whereas in this mega optimizer video you say beta and Sdw (in place of Wavg). We are still learning, sir, and this will confuse us all the more. Please use the same symbols/terminology across all the videos.
AdaDelta is RMSProp.
Thanks sir, helped me.
Thank u bro!
Question: should I retain that GD => one epoch leads to underfitting, and SGD => requires more resources, RAM etc. (computational explosion)?
At 1:32:27, it should be V_db (not V_dw).
@1:08:44 Why would it be a bigger number? 0.001 > (0.001)^2, so squaring it rather makes it smaller, no?
@1:09:38 α_t will be a very big number only if the values at t1, t2 are greater than 1, no? If they are less than 1, then it keeps becoming smaller, no?
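On this question: each individual squared gradient is indeed smaller than the gradient when it is below 1, but in Adagrad α_t is a running sum of the squares, so it only ever grows with t, and the effective learning rate only ever shrinks. A small sketch with made-up gradient values:

```python
grads = [0.001, 0.003, 0.002, 0.004] * 50   # tiny gradients, many iterations
alpha_t, eps, lr = 0.0, 1e-8, 0.1

for g in grads:
    alpha_t += g ** 2                        # monotonically increasing, never shrinks
    effective_lr = lr / (alpha_t ** 0.5 + eps)

print(alpha_t, effective_lr)   # alpha_t keeps growing, so the effective lr keeps decaying over iterations
```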
Awesome !
Thanks so much. Recently I was asked about the comparison between SGD and Adam, and I didn't know the intuition behind Adam, but now everything is clear. Thanks so much again.
And I have a question now: I want to build a custom optimizer in Keras, so is there a good resource for this?
I know we can manually take the derivative of a matrix–vector multiplication w.r.t. the vector, but is it possible to manually take the derivative w.r.t. a matrix, for example taking the gradient w.r.t. the weights? I know it is done numerically with autodiff, but could I solve it manually?
Please explain Nadam, FTRL, etc. too.
Do the weights get updated for every iteration or for every epoch?
The SGD-with-momentum formula changes from the old video to the new video, which confused me, but I got that we do an exponential moving average in SGD with momentum.
@1:32:30 I think it should be (1-β1), and
@1:33:30 I think it should be (1-β2), no?
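For reference, the standard Adam moment updates from the original paper do use (1 - β1) and (1 - β2), in line with the comment above. A compact toy sketch (scalar weight, toy loss, purely for illustration):

```python
import math

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m = v = 0.0
w = 5.0

def dL_dw(w):               # toy loss L = w**2
    return 2 * w

for t in range(1, 1001):
    g = dL_dw(w)
    m = beta1 * m + (1 - beta1) * g          # first moment, uses (1 - beta1)
    v = beta2 * v + (1 - beta2) * g * g      # second moment, uses (1 - beta2)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
```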
thanks. well taught
: )