Optimization for Deep Learning (Momentum, RMSprop, AdaGrad, Adam)

  • Published 22 Nov 2024

COMMENTS •

  • @HojjatMonzavi
    @HojjatMonzavi 3 months ago +9

    As a junior AI developer, this was the best tutorial on Adam and other optimizers I've ever seen. Simply explained, but not so simple as to be a useless overview.
    Thanks

  • @rhugvedchaudhari4584
    @rhugvedchaudhari4584 1 year ago +22

    The best explanation I've seen till now!

  • @zhang_han
    @zhang_han 1 year ago +19

    The most mind-blowing thing in this video was what Cauchy did in 1847.

  • @AkhilKrishnaatg
    @AkhilKrishnaatg 8 months ago +2

    Beautifully explained. Thank you!

  • @fzigunov
    @fzigunov 1 month ago

    Yours is the best explanation out there, in my opinion. I appreciate you!!

  • @tempetedecafe7416
    @tempetedecafe7416 11 months ago +2

    Very good explanation!
    15:03 Arguably, I would say that it's not the responsibility of the optimization algorithm to ensure good generalization. I feel like it would be more fair to judge optimizers only on their fit of the training data, and leave the responsibility of generalization out of their benchmark. In your example, I think it would be the responsibility of model architecture design to get rid of this sharp minimum (by having dropout, fewer parameters, etc...), rather than the responsibility of Adam not to fall inside of it.

  • @dongthinh2001
    @dongthinh2001 10 months ago +1

    Clearly explained indeed! Great video!

  • @saqibsarwarkhan5549
    @saqibsarwarkhan5549 6 months ago

    That's a great video with clear explanations in such a short time. Thanks a lot.

  • @EFCK555
    @EFCK555 3 months ago

    Good work, man, it's the best explanation I have ever seen. Thank you so much for your work.

  • @oinotnarasec
    @oinotnarasec 2 months ago

    Beautiful video. Thank you

  • @sokrozayeng7691
    @sokrozayeng7691 3 months ago

    Great explanation! Thank you.

  • @idiosinkrazijske.rutine
    @idiosinkrazijske.rutine 1 year ago +2

    Very nice explanation!

  • @markr9640
    @markr9640 10 months ago

    Fantastic video and graphics. Please find time to make more. Subscribed 👍

  • @luiskraker807
    @luiskraker807 9 months ago

    Many thanks, clear explanation!!!

  • @Justin-zw1hx
    @Justin-zw1hx 1 year ago +2

    keep doing the awesome work, you deserve more subs

  • @rasha8541
    @rasha8541 11 months ago

    really well explained

  • @benwinstanleymusic
    @benwinstanleymusic 8 months ago

    Great video thank you!

  • @na50r24
    @na50r24 4 days ago

    Can w be considered a vector that represents all adjustable parameters? I.e., not just the weights of one linear transformation matrix from input to hidden layer, but all of them plus the bias values.
    So when you compute the gradient of L with respect to w, you compute a vector for which each entry is the partial derivative of L with respect to w_i?

    • @deepbean
      @deepbean  3 days ago +1

      Yup, that's correct!
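      (A minimal numpy sketch of exactly this picture: every weight matrix and bias flattened into one vector w, with grad[i] = the partial derivative of L with respect to w_i. The tiny two-layer network and the finite-difference gradient below are illustrative stand-ins, not the video's code.)

          import numpy as np

          # All adjustable parameters -- both weight matrices and both bias vectors --
          # flattened into a single vector w.
          W1, b1 = np.random.randn(3, 2), np.zeros(3)   # input -> hidden
          W2, b2 = np.random.randn(1, 3), np.zeros(1)   # hidden -> output
          w = np.concatenate([W1.ravel(), b1, W2.ravel(), b2])

          def loss(w, x=np.array([0.5, -1.0]), y=1.0):
              # Unpack the flat vector back into per-layer parameters.
              W1, b1 = w[0:6].reshape(3, 2), w[6:9]
              W2, b2 = w[9:12].reshape(1, 3), w[12:13]
              h = np.tanh(W1 @ x + b1)
              return ((W2 @ h + b2)[0] - y) ** 2

          # grad[i] is the partial derivative of L with respect to w_i
          # (finite differences stand in for backprop in this sketch).
          eps = 1e-6
          grad = np.array([
              (loss(w + eps * np.eye(w.size)[i]) - loss(w - eps * np.eye(w.size)[i])) / (2 * eps)
              for i in range(w.size)
          ])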

  • @leohuang-sz2rf
    @leohuang-sz2rf 7 months ago

    I love your explanation

  • @physis6356
    @physis6356 7 months ago

    great video, thanks!

  • @makgaiduk
    @makgaiduk 1 year ago

    Well explained!

  • @TheTimtimtimtam
    @TheTimtimtimtam 1 year ago +1

    Thank you, this is really well put together and presented!

  • @KwangrokRyoo
    @KwangrokRyoo 2 months ago

    this is amazing 🤩

  • @MikeSieko17
    @MikeSieko17 8 months ago

    why didn't you explain the (1-\beta_1) term?
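    (For anyone else wondering about that term: in standard Adam, the (1-\beta_1) factor turns the first moment m into an exponential moving average of the gradients, so m stays on the same scale as the gradient itself, and the division by (1-\beta_1^t) is the bias correction for starting m at zero. A tiny numeric sketch with an illustrative constant gradient:)

        beta1 = 0.9
        m = 0.0
        for t in range(1, 5):
            g = 1.0                            # pretend the gradient is constant
            m = beta1 * m + (1 - beta1) * g    # EMA: the weights sum to 1, so m stays on g's scale
            m_hat = m / (1 - beta1 ** t)       # bias correction for m's zero initialization
            print(t, round(m, 4), round(m_hat, 4))
        # Without the (1 - beta1) factor, m would drift toward g / (1 - beta1) = 10 * g
        # instead of tracking g; with it, m_hat recovers exactly g = 1.0 at every step.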

  • @wishIKnewHowToLove
    @wishIKnewHowToLove 1 year ago

    thank you so much :)

  • @donmiguel4848
    @donmiguel4848 8 months ago +2

    Nesterov is silly. You have the gradient g(w(t)) because the weight w enters the forward pass, shaping the neuron's activation and contributing to the loss. You don't have the gradient g(w(t)+pV(t)), because no inference was calculated at that fictive weight position, so you have no information about what the loss contribution there would have been. It's PURE NONSENSE. But it only costs a few more calculations without doing much damage, so no one really seems to complain about it.

    • @Nerdimo
      @Nerdimo 7 months ago

      This does not make sense…at all. The intuition is that you’re making an educated guess for the gradient in the future; you’re already going to compute g(w(t) + pV(t)) anyway, so why not correct for that and move in that direction instead on the current step?

    • @donmiguel4848
      @donmiguel4848 7 months ago

      @@Nerdimo Let's remember that the actual correct gradient of w is computed as the average gradient over ALL samples. So, for runtime-complexity reasons, we already make an "educated guess", or rather a stochastic approximation, with our per-sample or per-batch gradient, using a running gradient or a batch gradient. But those approximations are based on inference we have actually calculated. Adding, on top of that uncertainty, some guessing about what will happen in the future is not a correction based on facts; it's pure fiction. Of course, for every training process you will find a hyperparameter configuration with which this fiction is beneficial, just as you will find configurations with which it is not. But you only gain that knowledge by experiment, instead of having an algorithm that is beneficial in general.

    • @Nerdimo
      @Nerdimo 7 months ago

      @@donmiguel4848 Starting to wonder if this is AI generated “pure fiction” 😂.

    • @Nerdimo
      @Nerdimo 7 months ago

      @@donmiguel4848 I understand your point; however, I think it's unfair to dismiss it as "fiction". My main argument is just that there are intuitions for why doing this could help take good steps toward the local minimum of the loss function.

    • @donmiguel4848
      @donmiguel4848 7 months ago

      @@Nerdimo These "intuitions" are based on assumptions about the NN that don't match reality. We humans understand a hill and a sink, or a mountain and a canyon, and we assume the loss function looks like that, but the real power of neural networks is the non-linearity of the activations and the flexibility of many interacting non-linear components. If our intuition matched what is actually going on in the NN, we could write an algorithm that would be much faster than the NN. But NNs are far more complex and beyond human imagination, so I think we have to be very careful with our assumptions and "intuitions", even though that may seem "unfair".😉
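      (For anyone weighing the two sides of this thread: a minimal sketch of Nesterov momentum on a toy 1-D quadratic loss, with illustrative learning rate and momentum values. Note that it still uses exactly one gradient evaluation per step; that evaluation is simply taken at the look-ahead point w + mu*v instead of at w.)

          # Nesterov momentum on L(w) = w^2 (so dL/dw = 2w); values are illustrative.
          def grad(w):
              return 2.0 * w

          lr, mu = 0.1, 0.9
          w, v = 5.0, 0.0
          for _ in range(50):
              g = grad(w + mu * v)   # the one gradient of this step, taken at the look-ahead point
              v = mu * v - lr * g    # classic momentum would instead use g = grad(w) here
              w = w + v
          print(w)                   # approaches the minimum at w = 0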

  • @wishIKnewHowToLove
    @wishIKnewHowToLove 1 year ago

    Really? I didn't know SGD generalized better than Adam

    • @deepbean
      @deepbean  1 year ago

      Thank you for your comments Sebastian! This result doesn't seem completely clear-cut, so it may be open to refutation in some cases. For instance, one Medium article concludes that "fine-tuned Adam is always better than SGD, while there exists a performance gap between Adam and SGD when using default hyperparameters", which means the problem is one of hyperparameter optimization, which can be more difficult with Adam. Let me know what you think!
      medium.com/geekculture/a-2021-guide-to-improving-cnns-optimizers-adam-vs-sgd-495848ac6008
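      (A minimal PyTorch-style sketch of the kind of comparison discussed above; the placeholder model, the "tuned" SGD values, and the idea of training a fresh copy per optimizer are assumptions for illustration, not numbers from the linked article.)

          import torch
          import torch.nn as nn

          def make_model():
              # Placeholder classifier; in practice this would be the network being benchmarked.
              return nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

          # Train one fresh copy of the model per optimizer and compare validation accuracy.
          model_adam, model_sgd = make_model(), make_model()
          optimizers = {
              # Out-of-the-box Adam (defaults: lr=1e-3, betas=(0.9, 0.999)) ...
              "adam_default": torch.optim.Adam(model_adam.parameters()),
              # ... versus SGD whose lr / momentum / weight decay have been swept.
              "sgd_tuned": torch.optim.SGD(model_sgd.parameters(), lr=0.05,
                                           momentum=0.9, weight_decay=5e-4),
          }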

    • @wishIKnewHowToLove
      @wishIKnewHowToLove 1 year ago

      @@deepbean it's sebastiEn with an E. Learn how to read carefully :)

    • @deepbean
      @deepbean  1 year ago +3

      🤣

    • @deepbean
      @deepbean  1 year ago

      @@wishIKnewHowToLove my bad

    • @dgnu
      @dgnu 1 year ago +9

      @@wishIKnewHowToLove bruh cmon the man is being nice enough to u just by replying jesus

  • @Stopinvadingmyhardware
    @Stopinvadingmyhardware 1 year ago +1

    nom nom nom learn to program.

  • @fullerholiday2872
    @fullerholiday2872 2 months ago

    Martin Jessica Moore Carol Taylor Dorothy

  • @MrWater2
    @MrWater2 5 months ago

    Wonderful explanation!!