Deep Learning - All Optimizers In One Video - SGD with Momentum, Adagrad, Adadelta, RMSprop, Adam Optimizers

  • Published 19 Jan 2025

COMMENTS • 194

  • @md.nazrulislam6739
    @md.nazrulislam6739 2 years ago +5

    I am a data scientist working at a startup company in Bangladesh.
    Thank you so much for preparing such wonderful videos.
    You are a really great teacher; I have learned a lot from you.

  • @tesla1772
    @tesla1772 3 years ago +31

    This video cleared many doubts. I would suggest everyone watch it, even if you have watched the previous videos.

  • @techwithsolo
    @techwithsolo 3 years ago +5

    You are a very good teacher. I have been following your videos since 2020. They have helped me understand so many concepts in machine learning and deep learning. I like the simplicity of your teaching. Thanks a lot. Writing from Nigeria.

  • @shraddhaagrahari7519
    @shraddhaagrahari7519 2 years ago +4

    I am becoming a fan of yours ....
    I am a very lazy person ...
    I never want to study, but after watching your videos it feels good to learn ....
    Thank you,
    Krish Sir....

  • @ColdZera14
    @ColdZera14 4 years ago +61

    The main problem with Batch Gradient Descent is the fact that it uses the whole
    training set to compute the gradients at every step, which makes it very slow when
    the training set is large. At the opposite extreme, Stochastic Gradient Descent just
    picks a random instance in the training set at every step and computes the gradients
    based only on that single instance. Obviously this makes the algorithm much faster
    since it has very little data to manipulate at every iteration. It also makes it possible to
    train on huge training sets, since only one instance needs to be in memory at each
    iteration (SGD can be implemented as an out-of-core algorithm).
    On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much
    less regular than Batch Gradient Descent: instead of gently decreasing until it reaches
    the minimum, the cost function will bounce up and down, decreasing only on average.
    Over time it will end up very close to the minimum, but once it gets there it will
    continue to bounce around, never settling down. So once the algorithm stops,
    the final parameter values are good, but not optimal.

    • @adityabobde2882
      @adityabobde2882 3 years ago +1

      Randomness is good for escaping local optima, but bad because the algorithm can never settle at the minimum; to overcome this you can use a learning-rate scheduler, which reduces the step size as we approach the global minimum.

    • @nishah4058
      @nishah4058 2 years ago +1

      The whole data is fed in GD... in stochastic we take one record at a time, and in mini-batch we batch the records. We call it stochastic because there is more randomness in it: since the data is fed one record at a time, there is randomness and noise...
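
A minimal NumPy sketch of the trade-off described in the thread above: batch gradient descent uses the whole training set per step, while stochastic gradient descent uses one randomly picked instance per step, here combined with the simple learning-rate decay mentioned in the reply. The toy data and all names below are illustrative assumptions, not code from the video.

```python
import numpy as np

# Toy regression data: y = 3x + 2 plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 0.1, size=100)

def gradients(w, b, xb, yb):
    """Gradients of the mean-squared-error loss w.r.t. w and b on a (mini)batch."""
    err = w * xb + b - yb
    return 2.0 * np.mean(err * xb), 2.0 * np.mean(err)

# Batch GD: every update uses the entire training set.
w, b, lr = 0.0, 0.0, 0.1
for epoch in range(100):
    dw, db = gradients(w, b, X[:, 0], y)
    w, b = w - lr * dw, b - lr * db

# Stochastic GD: every update uses a single random instance, with a decaying
# learning rate so the "bouncing around the minimum" settles down over time.
w, b, lr0, step = 0.0, 0.0, 0.1, 0
for epoch in range(100):
    for i in rng.permutation(len(X)):
        lr = lr0 / (1.0 + 0.01 * step)   # simple decay schedule
        dw, db = gradients(w, b, X[i:i + 1, 0], y[i:i + 1])
        w, b = w - lr * dw, b - lr * db
        step += 1
```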

  • @nehabalani7290
    @nehabalani7290 3 years ago +17

    I am amazed by the improvement in the quality, clarity and depth of intuition in your recent videos. Keep up the great work. I have watched most of your deep learning videos and I must say you make learning very easy.

    • @moindalvs
      @moindalvs 2 years ago +1

      because tony learns from his mistakes

  • @alibastami
    @alibastami 9 months ago

    Hands down the best explanation of optimizers on planet earth to this day. Whenever I want a refresher I simply go back to this video; nothing better has ever come out since this video was launched. Thank you Krish, deep from my heart.

  • @RashiJakhar-w6r
    @RashiJakhar-w6r 1 year ago +1

    You are the best teacher. I have seen many videos, but no one explains concepts so deeply and clearly.

  • @vikashdas1852
    @vikashdas1852 3 years ago +1

    I have always been confused by optimizers in NNs; however, this was the best resource available on the internet and it gave me end-to-end clarity. Hats off to Krish Sir.

  • @sumitagarwal2335
    @sumitagarwal2335 3 years ago

    Earlier I was literally struggling with the concept of optimizers, but after I watched your video it became very easy to understand. A very simple way of explaining even complex topics.

  • @yogeshkadam8160
    @yogeshkadam8160 4 years ago +2

    A very important video, sir..❤

  • @merv893
    @merv893 2 years ago +1

    This guy's great; he repeats himself so you can't forget.

  • @subashp7925
    @subashp7925 1 month ago

    Many thanks, Krish, for your great efforts; it's excellent material on optimizers.

  • @MrAyandebnath
    @MrAyandebnath 3 years ago

    Very informative and the best video on YouTube for understanding the details of all the optimization techniques. Thanks @Krish Naik.. I have become your admirer.

  • @pravalikadas5496
    @pravalikadas5496 2 years ago

    You are a fantastic teacher Krish.
    Simple.

  • @sumitkumarsharma4004
    @sumitkumarsharma4004 2 years ago

    This is called amazing. I went through the paper and was not able to grasp the concept, but your teaching skill is amazing. Thanks for this video. I request a video on the Yogi algorithm.

  • @monalisameena103
    @monalisameena103 4 years ago +1

    Krish, you have taught it very nicely; it became simple to learn, like a story. Thanks a lot for making NNs and optimizers very easy to learn.

  • @gourab469
    @gourab469 2 years ago +1

    Beautifully explained and taught. Hats off!

  • @sharanpreetsandhu3215
    @sharanpreetsandhu3215 3 years ago

    Amazing explanation Krish. You teach in a very simple manner. I respect your skills. You made deep learning concepts so easy. Keep doing the good work. Thank you so much and All the Best.

  • @BalaguruGupta
    @BalaguruGupta 3 years ago

    This is a well explained video to understand Optimizers. Thanks a lot Krish!

  • @hemamaliniveeranarayanan9901
    @hemamaliniveeranarayanan9901 7 months ago

    Nice explanation, Krish sir... wonderfully explained all the optimizers.

  • @francisegah6115
    @francisegah6115 2 years ago

    Krish is fast becoming my favorite teacher.

  • @MikiSiguriči1389
    @MikiSiguriči1389 2 years ago

    Good stuff, Mr Krish; simple and step by step, it helped me a lot.

  • @tagoreji2143
    @tagoreji2143 2 years ago

    Thank you so much, Sir, for this educative video.
    VERY VERY VERY EDUCATIVE.
    THANKS A LOT, Sir 🙏
    Not even my course faculty taught like this.
    And the words you spoke @56:00 increased my respect towards you, Sir.
    That's 💯% true. We should be respectful to the researchers and everyone behind what we are learning.

  • @DHGokul
    @DHGokul 2 years ago

    Wonderful way of teaching ! Krish Rockzzzz

  • @rijulsharma8148
    @rijulsharma8148 9 days ago

    Thanks a lot for this video! It was very lucid and comprehensive!

  • @nazaninayareh5008
    @nazaninayareh5008 1 year ago

    Thank you Krish. You make it seem easy to grasp.

  • @sivabalaram4962
    @sivabalaram4962 3 years ago

    Try to watch the full video; that is better for understanding every optimizer... thank you so much, Krish Naik Ji 👍👍👍

  • @mohamedbadi8875
    @mohamedbadi8875 9 months ago

    Thank you, Mr Krish, your work inspired me; now I understand optimizers.

  • @kcihtrakd
    @kcihtrakd 2 years ago

    Really nice way of teaching, Krish. Thank you so much.

  • @sur_yt805
    @sur_yt805 3 years ago

    One of the best ways of teaching, thanks a lot: so simple, so concise, and every point is important.

  • @arjunsubramaniyan1675
    @arjunsubramaniyan1675 4 years ago +2

    About bias correction in Adam: just wanted to write about the need for it. When we have B1 and B2 (the betas), both the momentum term and the squared-gradient term are initialized to zero, so for the first iteration Sdw(1) ends up being very small; and since Sdw sits in the denominator when updating the new weight w1, this gives a really huge change in the initial iterations. The paper mentions that, due to this bias from initializing at zero (for the first iterations), the loss might not reduce over time, so the authors proposed a bias correction, where we use a bias-corrected estimate instead of the raw moving average:
    Sdw(t)_corrected = Sdw(t) / (1 - B2^t), where t is the number of iterations. Notice that for the first few iterations the bias-corrected Sdw(t) is different from Sdw(t), but as t increases they become equal because the denominator approaches 1; this correction removes the bias created by initializing Vdw(0) = 0 and Sdw(0) = 0.

    • @arjunsubramaniyan1675
      @arjunsubramaniyan1675 4 years ago +1

      Also, wonderful explanation @Krish Naik sir. I learnt all about optimizers from your video, wanted to find out the reason behind bias correction, and ended up finding this. Awesome explanations!!

    • @aditisrivastava7079
      @aditisrivastava7079 3 years ago +1

      Thanks for the nice explanation
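
A small sketch of the bias correction described in the thread above, following the standard Adam update for a single scalar parameter; the names and default values here are illustrative assumptions, not the video's code.

```python
import numpy as np

def adam_step(w, dw, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w with gradient dw.

    state holds the first moment Vdw, the second moment Sdw and the step
    counter t. Both moments start at 0, which is exactly the bias that the
    1/(1 - beta^t) factors correct for in the early iterations.
    """
    state["t"] += 1
    t = state["t"]
    state["Vdw"] = beta1 * state["Vdw"] + (1 - beta1) * dw
    state["Sdw"] = beta2 * state["Sdw"] + (1 - beta2) * dw ** 2

    # Bias-corrected estimates: the correction is large for small t
    # and fades to a no-op as (1 - beta^t) approaches 1.
    Vdw_hat = state["Vdw"] / (1 - beta1 ** t)
    Sdw_hat = state["Sdw"] / (1 - beta2 ** t)

    return w - lr * Vdw_hat / (np.sqrt(Sdw_hat) + eps)

state = {"Vdw": 0.0, "Sdw": 0.0, "t": 0}
w = adam_step(1.0, dw=0.5, state=state)   # first step: correction is at its largest
```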

  • @SwethaNandyala-sf9lt
    @SwethaNandyala-sf9lt 1 year ago

    Thank you for your efforts krish.... Your videos are incredible!!!!

  • @ishantsingh3366
    @ishantsingh3366 3 years ago

    You're too awesome to exist ! Thanks a lot man !!

  • @liudreamer8403
    @liudreamer8403 3 years ago

    Shocked, such a clear explanation!

  • @lekkalanaveenkumarreddy1539
    @lekkalanaveenkumarreddy1539 2 years ago +1

    Nice explanation. Thanks a lot.

  • @bhuwanacharya6275
    @bhuwanacharya6275 3 years ago

    Very good lecture, thank you. I am watching from Nepal.

  • @anupsahoo8561
    @anupsahoo8561 4 years ago +1

    This is a really awesome video. The maths is explained in real detail. Thanks.

  • @akashkumar-ni9ec
    @akashkumar-ni9ec 2 years ago

    what an effort! stunning

  • @hashimhafeez21
    @hashimhafeez21 3 years ago

    We understand because you teach brilliantly..

  • @alexmash1353
    @alexmash1353 3 years ago

    Excellent video! Understood things I had problems with.

  • @ppsheth91
    @ppsheth91 4 years ago

    Amazing video krish sir..!

  • @bipulsingh6232
    @bipulsingh6232 2 years ago

    Thank you so much; with one video my whole unit is finished.

  • @nhactrutinh6201
    @nhactrutinh6201 3 years ago +1

    I mean that to change the learning rate, we should use a scalar value to decay it after each iteration.

  • @arjyabasu1311
    @arjyabasu1311 4 years ago

    Thank you so much for this live session !!

  • @voidknown2338
    @voidknown2338 3 years ago

    Very good: one by one from basic to advanced, each related to the other, gradually building up to the best optimizer, Adam ❤️

  • @LakshmiDevilifentravels
    @LakshmiDevilifentravels 1 year ago

    Hello Krish,
    I'm a research scholar and I was looking for some good explanations; luckily I found your video and you made it clear in the first shot.
    Forget the formulas: why the algorithms came into the picture one after the other was made really clear with the math intuitions.
    I will save your video for future reference.
    Thank you Krish. I appreciate your work.

  • @dr.ratnapatil9272
    @dr.ratnapatil9272 3 years ago +1

    Awesome explanation

  • @meenalpande
    @meenalpande 1 year ago

    Really very great effort ..
    Thank you sir

  • @ColdZera14
    @ColdZera14 4 years ago +7

    Mini-batch Gradient Descent: at each step, instead of computing the gradients based on the full
    training set (as in Batch GD) or based on just one instance (as in Stochastic GD),
    Mini-batch GD computes the gradients on small random sets of instances called
    mini-batches. The main advantage of Mini-batch GD over Stochastic GD is that you can
    get a performance boost from hardware optimization of matrix operations, especially
    when using GPUs.

    • @nishah4058
      @nishah4058 2 years ago

      In an earlier comment you said that stochastic takes the whole data.
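
For contrast with the batch and stochastic sketches earlier, a minimal mini-batch version of the comment above: each update averages the gradients over a small random subset, which is what lets the matrix arithmetic vectorize well on GPUs. It assumes the same illustrative toy arrays X and y as before; all names are assumptions, not the video's code.

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, lr=0.05, epochs=50, seed=0):
    """Mini-batch gradient descent on the toy linear-regression setup above."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            err = w * X[batch, 0] + b - y[batch]
            # Gradients averaged over the mini-batch: one vectorized step.
            w -= lr * 2.0 * np.mean(err * X[batch, 0])
            b -= lr * 2.0 * np.mean(err)
    return w, b
```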

  • @jameswestbrook5709
    @jameswestbrook5709 3 years ago

    you are the best at explaining

  • @harshkhandelwal2974
    @harshkhandelwal2974 3 years ago

    I don't know what to say, so I subscribed!!
    nice video :)

  • @kanhataak1269
    @kanhataak1269 4 years ago +2

    Amazing video Sir.

  • @shahrukhsharif9382
    @shahrukhsharif9382 4 years ago +9

    Hi sir, you made a great video.
    But in the SGD with momentum equation there is a small error.
    In the video you explained this equation:
    w_t = w_(t-1) - (learning_rate)*dw_t
    dw_t = B*dw_(t-1) + (1-B)*dL/dw_(t-1)
    but in this equation it should not be dL/dw_(t-1); it should be dL/dw_t.
    So the correct equation is
    dw_t = B*dw_(t-1) + (1-B)*dL/dw_t
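
A minimal sketch of the exponentially weighted (momentum) update discussed in the comment above, with the gradient taken at the current step as the commenter suggests. This uses the (1 - beta)-weighted form from the video; classical momentum drops that factor. All names are illustrative.

```python
def sgd_momentum_step(w, grad, v, lr=0.01, beta=0.9):
    """One SGD-with-momentum update for a scalar parameter.

    v is the exponentially weighted average of past gradients and
    grad is dL/dw evaluated at the current weights.
    """
    v = beta * v + (1 - beta) * grad
    return w - lr * v, v

w, v = 1.0, 0.0
for grad in [0.4, 0.3, 0.25]:   # pretend gradients from successive steps
    w, v = sgd_momentum_step(w, grad, v)
```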

  • @richadhiman585
    @richadhiman585 3 years ago

    Thank you sir...you have explained everything very nicely... excellent work..🙏❤️

  • @pruattea0302
    @pruattea0302 3 years ago

    Respect, Sir, such an amazing explanation, super crystal clear. Super thank you.

  • @yashmishra12
    @yashmishra12 3 years ago

    Find Nachiketa Hebbar's video on SGD with momentum to get a feel for the topic. Krish has.....written the formulas.

  • @oss1996
    @oss1996 4 years ago

    Great brother, doing excellent work 👍👍

  • @nishitgala2861
    @nishitgala2861 3 years ago

    Thank you for the wonderful intuition and for clearing the concepts.

  • @ashishanand1466
    @ashishanand1466 3 years ago

    Bias correction is done to compensate for the zero-initialized velocity and squared-velocity components at low iteration counts t, but for higher iterations it won't make much of a difference.
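
In symbols, the point made above (using the video's Vdw/Sdw notation):

```latex
\hat{V}_{dw}^{(t)} = \frac{V_{dw}^{(t)}}{1 - \beta_1^{\,t}}, \qquad
\hat{S}_{dw}^{(t)} = \frac{S_{dw}^{(t)}}{1 - \beta_2^{\,t}}, \qquad
1 - \beta^{\,t} \to 1 \text{ as } t \text{ grows, so the correction fades away.}
```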

  • @anandhiselvi3174
    @anandhiselvi3174 4 years ago +5

    Sir, please do a video on SVM kernels.

  • @deelipvenkat5161
    @deelipvenkat5161 2 years ago

    very well explained 👍

  • @pavankumarpk9013
    @pavankumarpk9013 3 years ago

    Hey Krish, at 1:32:26 it should be Vdb instead of Vdw, right, since we are calculating with respect to the bias?
    Thank you for this interesting story of optimizers, it really helped a lot : )

  • @joelbraganza3819
    @joelbraganza3819 4 years ago

    Thanks for explaining it simply and easily.

  • @harshvardhanagrawal
    @harshvardhanagrawal 5 months ago

    @1:39:40 Why is this equation being changed? Is it not for the weight? If it's bias correction, then should we not update only Vdb? Why Vdw? Should it rather not just be called correction? Why is the term "bias" used?

  • @senthilkumara3653
    @senthilkumara3653 4 years ago +3

    Hi Krish, your live data science project demos are very useful. Interested to know when we can see further new projects? Requesting projects with stock market applications.

  • @Hari-xr7ob
    @Hari-xr7ob 3 years ago

    The time taken to update the weights is the same in all three cases (GD, SGD and mini-batch SGD). Only forward propagation will take different times in these three cases.

  • @priyabratasahoo8535
    @priyabratasahoo8535 3 years ago +3

    Thanks, Krish, for this great explanation. I have understood it now.. previously I was not able to follow, so thanks again Krish.

  • @BMESparshJain
    @BMESparshJain 3 years ago

    Wonderful session sir 🔥🔥

  • @rajak7410
    @rajak7410 4 years ago

    Understood 100%, I got it in a single pass. Excellent explanation.

  • @RajeshGupta-wx5qd
    @RajeshGupta-wx5qd 4 years ago +4

    In the loss formula for stochastic gradient descent for n records, you are dividing by 2 after calculating the sum of squares of the loss. I did not understand the reason for dividing by 2. Should it not be division by n? I am referring to the loss formula written on the extreme right-hand side at 8:35 of the video clip.

    • @yashmishra12
      @yashmishra12 3 years ago

      I think it's wrong. As you rightly pointed out, it should be "1/n" and not 1/2. Also, if you look at his session on loss function, he did use 1/n.

  • @ArunKumar-sg6jf
    @ArunKumar-sg6jf 11 months ago +1

    Alpha_t is wrong: you are using w_t, but it should be w_(t-1) for Adagrad.

  • @GauravSharma-kb9np
    @GauravSharma-kb9np 4 years ago +1

    Great Video Sir !!!

  • @wahabali828
    @wahabali828 3 years ago +1

    Sir, in the live session you use Sdw in RMSprop and in the recorded videos you use Wavg; are both the same or different?

  • @yathishs1895
    @yathishs1895 3 years ago +1

    @krish Naik OK, then was the previous video on SGD with momentum wrong, and this explanation is correct?

  • @louerleseigneur4532
    @louerleseigneur4532 3 years ago

    Thanks Krish

  • @mhadnanali
    @mhadnanali 2 years ago

    I think you are wrong about SGD. Stochastic stands for random, so it means it will choose random inputs and perform GD on them, so it will converge faster. It does not mean it iterates one by one.

  • @sweetisah735
    @sweetisah735 3 years ago

    After seeing this video I'm getting dizzy. You taught very well, but my mind is dancing with fear after seeing so much.

  • @daur_e_jaun9201
    @daur_e_jaun9201 2 years ago

    Thank you so much Sir for this

  • @shubhamchoudhary5461
    @shubhamchoudhary5461 3 years ago

    Thank you sir....you are amazing

  • @harshvardhanagrawal
    @harshvardhanagrawal 5 months ago

    @37:49 What is this a1, a2, a3 data, and why are you replacing it with dL/dw or dL/db @45:06?

  • @marijatosic217
    @marijatosic217 2 years ago

    Thank you for the video, it's beyond helpful!
    One question: when talking about the implementation at 01:23:00, would we have two different effective learning rates, one for updating the weights and the other for updating the bias, since the exponentially weighted averages would be different (one depends on the derivative of the loss w.r.t. the weights, and the other on the derivative of the loss w.r.t. the bias)? :)
    Thank you! :)

  • @raghvendrapal1762
    @raghvendrapal1762 3 years ago

    Very nice video. One doubt here: I want to know, in each epoch, are we using the weights from the last completed epoch or just randomly generating them in each epoch?

  • @Rahul_Singh_Rajput_04
    @Rahul_Singh_Rajput_04 3 years ago

    Hi sir, first of all thank you for providing such valuable education. Sir, where can we get these notes?

  • @Amankumar-by9ed
    @Amankumar-by9ed 4 years ago

    One of the best videos about optimization algorithms. ❤️

  • @asiftandel8750
    @asiftandel8750 4 years ago +1

    Great Video.

  • @stipepavic843
    @stipepavic843 2 years ago

    respect!!!! subbed

  • @OmkarYadavDhudi
    @OmkarYadavDhudi 4 years ago +3

    Hi Krish, can you do the same thing for ML techniques?

  • @rasikai102
    @rasikai102 4 years ago +1

    Sir, thank you so much for this story. It has cleared all my doubts. The maths behind all this is so interesting. Sir, but you have not explained RMSprop anywhere: not in this mega video on optimizers and not in "Tutorial 16- AdaDelta and RMSprop optimizer". Can you please walk us through RMSprop once, or make a short video on it? Even in this video you have directly stated the RMSprop update, but we don't know why RMSprop was introduced, the way we do for the other optimizers. Looking forward to this.
    Also, sir, in "Tutorial 16- AdaDelta and RMSprop optimizer" gamma is used and the terminology is weighted average (Wavg), whereas in this current mega optimizer video you say beta and Sdw (replacing Wavg). We are still learning, sir; this will confuse us all the more. Please use the same signs/terminology across all the videos.
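
On the notation question above: "Wavg" with gamma in the earlier tutorial and "Sdw" with beta here name the same quantity, an exponentially weighted average of squared gradients. A minimal RMSprop step, with illustrative names only:

```python
import numpy as np

def rmsprop_step(w, grad, s, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update for a scalar parameter.

    s is the running (exponentially weighted) average of squared gradients;
    it is the quantity called Wavg (with gamma) or Sdw (with beta) in the videos.
    """
    s = beta * s + (1 - beta) * grad ** 2
    return w - lr * grad / (np.sqrt(s) + eps), s
```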

  • @shantanu556
    @shantanu556 1 year ago

    Thanks sir, helped me.

  • @hichamkalkha5847
    @hichamkalkha5847 2 years ago

    Thank you, bro!
    Question: should I take away that GD => one epoch leads to underfitting, and SGD => requires more resources, RAM, etc. (comp. explos.)?

  • @sumaiyachoudhury7091
    @sumaiyachoudhury7091 1 year ago

    At 1:32:27, it should be V_db (not V_dw).

  • @harshvardhanagrawal
    @harshvardhanagrawal 5 months ago

    @1:08:44 Why would it be a bigger number? 0.001 > (0.001)^2, so squaring it rather makes it smaller, no?
    @1:09:38 - αt will be a very big number only if the values at t1, t2 are greater than 1, no? If they are less than 1, then it keeps becoming smaller, no?

  • @adithyajob8728
    @adithyajob8728 5 months ago

    Awesome !

  • @kagglefire6545
    @kagglefire6545 2 years ago

    Thanks so much. Recently I was asked to compare SGD and Adam, and I didn't know the intuition behind Adam. But now everything is clear. Thanks so much again.
    And I have a question now: I want to build a custom optimizer in Keras. Is there a good resource for this?
    Also, I know we can manually take the derivative w.r.t. the vector in a matrix-vector multiplication. But is it possible to manually take the derivative w.r.t. a matrix, for example taking the gradient w.r.t. the weights? I know it is done numerically with auto-diff, but could I solve it manually?

  • @scienceandmathbyankitsir6403
    @scienceandmathbyankitsir6403 2 years ago

    Please explain Nadam, FTRL, etc. too.

  • @ramakrishnayellela7455
    @ramakrishnayellela7455 9 months ago

    Do the weights get updated every iteration or every epoch?

  • @thepresistence5935
    @thepresistence5935 3 years ago

    The SGD-with-momentum formula changes from the old video to the new video, which confused me, but I got that we do an exponential moving average in SGD with momentum.

  • @harshvardhanagrawal
    @harshvardhanagrawal 5 months ago

    @1:32:30 I think it should be (1-β1), and
    @1:33:30 I think it should be (1-β2), no?

  • @aDarkDay
    @aDarkDay 1 year ago

    thanks. well taught
    : )