The best explanation I've seen till now!
confirmed
As a junior AI developer, this was the best tutorial on Adam and other optimizers I've ever seen. Simply explained, but not so simply that it becomes a useless overview.
Thanks
Beautifully explained. Thank you!
Most mind blowing thing in this video was what Cauchy did in 1847.
Very good explanation!
15:03 Arguably, it's not the responsibility of the optimization algorithm to ensure good generalization. I feel it would be fairer to judge optimizers only on how well they fit the training data, and leave generalization out of their benchmark. In your example, I think it would be the responsibility of the model architecture design to get rid of this sharp minimum (through dropout, fewer parameters, etc.), rather than the responsibility of Adam not to fall into it.
Can w be considered a vector that represents all adjustable parameters? I.e., not just the weights of one linear transformation matrix from the input to the hidden layer, but all of them plus the bias values.
So when you compute the gradient of L with respect to w, you compute a vector for which each entry is the partial derivative of L with respect to w_i?
Yup, that's correct!
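For concreteness, here is a minimal sketch of that idea (plain NumPy, hypothetical tiny network with made-up sizes): every weight matrix and every bias vector is flattened into a single vector w, and the gradient of L is the vector whose i-th entry is the partial derivative of L with respect to w_i. Finite differences are used below purely to make the "one partial per parameter" point explicit, not as how you'd compute gradients in practice.

```python
# Minimal sketch (NumPy, hypothetical toy network): "w" is every adjustable
# parameter -- all weight matrices AND bias vectors -- flattened into one vector,
# and the gradient of L w.r.t. w is the vector of partials dL/dw_i.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(2)   # input -> hidden
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)   # hidden -> output
x, y = rng.normal(size=3), np.array([1.0])      # one toy sample

def pack(W1, b1, W2, b2):
    return np.concatenate([W1.ravel(), b1, W2.ravel(), b2])

def unpack(w):
    W1 = w[:6].reshape(3, 2); b1 = w[6:8]
    W2 = w[8:10].reshape(2, 1); b2 = w[10:11]
    return W1, b1, W2, b2

def loss(w):
    W1, b1, W2, b2 = unpack(w)
    h = np.tanh(x @ W1 + b1)                    # hidden activations
    y_hat = h @ W2 + b2
    return float(np.mean((y_hat - y) ** 2))     # squared-error loss L

w = pack(W1, b1, W2, b2)

# Each gradient entry is the partial derivative dL/dw_i
# (central finite differences here, one per parameter).
eps = 1e-6
grad = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                 for e in np.eye(w.size)])
print(w.shape, grad.shape)   # both (11,): one partial derivative per parameter
```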
Yours is the best explanation out there, in my opinion. I appreciate you!!
That was awesome, extremely helpful. Thank you!
Clearly explained indeed! Great video!
That's a great video with clear explanations in such a short time. Thanks a lot.
Very nice explanation!
Good work, man, it's the best explanation I have ever seen. Thank you so much for your work.
keep doing the awesome work, you deserve more subs
Beautiful video. Thank you
Great explanation! Thank you.
Many thanks, clear explanation!!!
really well explained
Great video thank you!
Why didn't you explain the (1-\beta_1) term?
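In case it helps: in Adam the first moment is an exponential moving average, m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t. The (1-\beta_1) factor keeps the weights on past gradients summing to 1, so m_t stays on the same scale as the gradients themselves, and dividing by (1-\beta_1^t) corrects the bias that comes from initializing m_0 = 0. A minimal sketch (plain NumPy, made-up numbers):

```python
# Sketch of Adam's first-moment update, to show what the (1 - beta1) term does.
import numpy as np

beta1 = 0.9
m = np.zeros(3)                      # first moment starts at zero
for t, g in enumerate(np.ones((5, 3)), start=1):   # pretend every gradient is all ones
    m = beta1 * m + (1 - beta1) * g  # EMA: weights on past gradients sum to 1
    m_hat = m / (1 - beta1 ** t)     # bias correction for the zero initialization
    print(t, m[0], m_hat[0])         # m creeps up toward 1; m_hat is 1 from step one
```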
Fantastic video and graphics. Please find time to make more. Subscribed 👍
this is amazing 🤩
great video, thanks!
I love your explanation
Well explained!
thank you so much :)
Thank you, this is really well put together and presented!
Nesterov is silly. You have the gradient g(w(t)) because the weight w contributes, in the forward pass, to the neuron's activation and therefore to the loss. You don't have the gradient g(w(t)+pV(t)), because at that fictive weight position no inference was computed, so you have no information about what the loss contribution at that position would have been. It's PURE NONSENSE. But it only costs a few extra calculations without doing much damage, so no one really seems to complain about it.
This does not make sense…at all. The intuition is that you're making an educated guess about the future gradient: the momentum term is going to carry you toward w(t) + pV(t) anyway, so why not evaluate the gradient there and use that information on the current step?
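For reference, a minimal sketch of the lookahead step being debated, on a hypothetical quadratic loss (plain NumPy, made-up values). The only difference from standard momentum is that the gradient is evaluated at the shifted point w + mu*v instead of at w.

```python
# Sketch of the Nesterov lookahead step being debated: the gradient is evaluated
# at the "peeked" position w + mu * v instead of at w. Hypothetical quadratic loss.
import numpy as np

def grad(w):                         # gradient of the toy loss L(w) = 0.5 * ||w||^2
    return w

w = np.array([5.0, -3.0])
v = np.zeros_like(w)
lr, mu = 0.1, 0.9

for _ in range(100):
    g_lookahead = grad(w + mu * v)   # gradient computed at the shifted weights
    v = mu * v - lr * g_lookahead    # momentum update driven by the lookahead gradient
    w = w + v

print(w)                             # converges toward the minimum at the origin
```

As written, the lookahead gradient does require evaluating the loss at the shifted weights; common implementations rearrange the update algebraically so that, in practice, no separate pass at a "fictive" weight position is needed.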
@@Nerdimo Let's remember that the actual correct gradient of w is computed as the average gradient over ALL samples. So, for runtime-complexity reasons, we already make an "educated guess", or rather a stochastic approximation, with our per-sample or per-batch gradient, by using a running gradient or a batch gradient. But those approximations are based on actual inference that we have computed. Adding to that uncertainty some guessing about what will happen in the future is not a correction based on facts, it's pure fiction. Of course, for every training process you will find a hyperparameter configuration with which this fiction is beneficial, just as you will find configurations with which it is not. But you only get that knowledge by experiment, instead of having an algorithm that is beneficial in general.
@@donmiguel4848 Starting to wonder if this is AI generated “pure fiction” 😂.
@@donmiguel4848 I understand your point; however, I think it's unfair to discount it as "fiction". My main argument is just that there are intuitive reasons why doing this could help take better steps toward a local minimum of the loss function.
@@Nerdimo These "intuitions" are based on assumptions about the NN that don't match reality. We humans understand a hill and a sink, or a mountain and a canyon, and we assume the loss function looks like that, but the real power of neural networks is the non-linearity of the activations and the flexibility of many interacting non-linear components. If our intuition matched what is actually going on in the NN, we could write an algorithm that would be much faster than the NN. But NNs are far more complex and beyond human imagination, so I think we have to be very careful with our assumptions and "intuitions", even if that seems "unfair". 😉
Really? I didn't know SGD generalized better than Adam.
Thank you for your comments Sebastian! This result doesn't seem completely clear-cut, so it may be open to refutation in some cases. For instance, one Medium article concludes that "fine-tuned Adam is always better than SGD, while there exists a performance gap between Adam and SGD when using default hyperparameters", which suggests the problem is one of hyperparameter optimization, which can be more difficult with Adam. Let me know what you think!
medium.com/geekculture/a-2021-guide-to-improving-cnns-optimizers-adam-vs-sgd-495848ac6008
@@deepbean it's sebastiEn, with an E. Learn how to read carefully :)
🤣
@@wishIKnewHowToLove my bad
@@wishIKnewHowToLove Bruh, c'mon, the man is being nice enough to you just by replying, jesus.
nom nom nom learn to program.
Wonderful explanation!!