As a junior AI developer, this was the best tutorial on Adam and other optimizers I've ever seen. Simply explained, but not so simple as to be a useless overview.
Thanks
The best explanation I've seen till now!
confirmed
Most mind blowing thing in this video was what Cauchy did in 1847.
Beautifully explained. Thank you!
Yours is the best explanation out there, in my opinion. I appreciate you!!
Very good explanation!
15:03 Arguably, it's not the responsibility of the optimization algorithm to ensure good generalization. I feel it would be fairer to judge optimizers only on how well they fit the training data, and leave generalization out of their benchmark. In your example, I think it would be the responsibility of the model architecture design to get rid of this sharp minimum (via dropout, fewer parameters, etc.), rather than Adam's responsibility not to fall into it.
Clearly explained indeed! Great video!
That's a great video with clear explanations in such a short time. Thanks a lot.
Good work man, it's the best explanation I have ever seen. Thank you so much for your work.
Beautiful video. Thank you
Great explanation! Thank you.
Very nice explanation!
Fantastic video and graphics. Please find time to make more. Subscribed 👍
Many thanks, clear explanation!!!
keep doing the awesome work, you deserve more subs
really well explained
Great video thank you!
Can w be considered as a vector that represents all adjustable parameters? I.e., not just the weights of one linear transformation matrix from input to hidden layer, but all of them plus the bias values.
So when you compute the gradient of L with respect to w, you compute a vector in which each entry is the partial derivative of L with respect to w_i?
Yup, that's correct!
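For anyone who wants to see this concretely, here's a minimal PyTorch sketch (the layer sizes and variable names are my own, just for illustration):

```python
import torch
import torch.nn as nn

# w, viewed as one vector, stacks every weight matrix AND every bias term.
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 1))

x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# One flat vector: each entry is the partial derivative dL/dw_i for one
# scalar parameter w_i (weights and biases alike).
grad = torch.cat([p.grad.flatten() for p in model.parameters()])
print(grad.shape)  # 4*3 + 3 + 3*1 + 1 = 19 entries
```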
I love your explanation
great video, thanks!
Well explained!
Thank you, this is really well put together and presented!
this is amazing 🤩
Why didn't you explain the (1-\beta_1) term?
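Not the video author, but in case it helps: the (1-\beta_1) factor is what makes m an exponentially weighted average of past gradients rather than a growing sum, and it is tied to Adam's bias correction for the zero initialization. A rough NumPy sketch (variable names are my own):

```python
import numpy as np

beta1 = 0.9
grads = np.random.randn(1000)  # stand-in for a stream of gradients

m = 0.0
for t, g in enumerate(grads, start=1):
    m = beta1 * m + (1 - beta1) * g  # without (1 - beta1), m would be a
                                     # growing sum, not an average
    m_hat = m / (1 - beta1**t)       # bias correction for the m = 0 init

# The weights (1 - beta1) * beta1**k on past gradients sum to ~1 as t
# grows, so m stays on the same scale as the raw gradients.
print(m, m_hat)
```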
thank you so much :)
Nesterov is silly. You have the gradient g(w(t)) because the weight w contributes, through the forward pass, to the neuron's activation and hence to the loss. You don't have the gradient g(w(t)+pV(t)), because at this fictive weight position no inference was computed, so you have no information about what the loss contribution at that position would have been. It's PURE NONSENSE. But it only costs a few extra calculations without doing much damage, so no one really seems to complain about it.
This does not make sense…at all. The intuition is that you’re making an educated guess for the gradient in the future; you’re already going to compute g(w(t) + pV(t)) anyway, so why not correct for that and move in that direction instead on the current step?
@@Nerdimo Let's remember that the actual correct gradient of w is computed as the average gradient over ALL samples. So, for runtime-complexity reasons, we already make an "educated guess", or better, a stochastic approximation, with our per-sample or per-batch gradient, by using a running gradient or a batch gradient. But these approximations are based on actual inference that we have computed. Adding on top of that uncertainty some guessing about what will happen in the future is not a correction based on facts; it's pure fiction. Of course, for every training process you will find a hyperparameter configuration with which this fiction is beneficial, just as you will find configurations with which it is not. But you get this knowledge only by experiment, instead of having an algorithm that is beneficial in general.
@@donmiguel4848 Starting to wonder if this is AI generated “pure fiction” 😂.
@@donmiguel4848 I understand your point; however, I think it's unfair to dismiss it as "fiction". My main argument is just that there are intuitions for why doing this could help take good steps toward the local minimum of the loss function.
@@Nerdimo These "intuitions" are based on assumptions about the NN which don't match reality. We humans understand a hill and a sink, or a mountain and a canyon, and we assume the loss function looks like that, but the real power of neural networks is the non-linearity of the activations and the flexibility of many interacting non-linear components. If our intuition matched what is actually going on in the NN, we could write an algorithm that would be much faster than the NN. But NNs are far more complex and beyond human imagination, so I think we have to be very careful with our assumptions and "intuitions", even though that seems "unfair".😉
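For readers following this thread, here's what the disputed lookahead step actually computes, as a toy sketch on a 1-D quadratic (the names w, v, p, lr are mine, mirroring the video's g(w(t)+pV(t)) notation; this is not the video's exact code):

```python
# Toy 1-D loss L(w) = w^2, so the gradient is g(w) = 2w.
def g(w):
    return 2.0 * w

w, v = 5.0, 0.0   # weight and velocity
p, lr = 0.9, 0.1  # momentum coefficient and learning rate

for t in range(100):
    lookahead = w + p * v          # the "fictive" position in question:
                                   # no forward pass happened here, the
                                   # gradient is simply re-evaluated there
    v = p * v - lr * g(lookahead)  # Nesterov: gradient at the lookahead
    # classical momentum would use: v = p * v - lr * g(w)
    w = w + v

print(w)  # approaches the minimum at w = 0
```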
Really? I didn't know SGD generalized better than Adam.
Thank you for your comments Sebastian! This result doesn't seem completely clear-cut, so it may be open to refutation in some cases. For instance, one Medium article concludes that "fine-tuned Adam is always better than SGD, while there exists a performance gap between Adam and SGD when using default hyperparameters", suggesting the problem is one of hyperparameter optimization, which can be more difficult with Adam. Let me know what you think!
medium.com/geekculture/a-2021-guide-to-improving-cnns-optimizers-adam-vs-sgd-495848ac6008
@@deepbean it's SebastiEn, with an E. Learn how to read carefully :)
🤣
@@wishIKnewHowToLove my bad
@@wishIKnewHowToLove bruh cmon the man is being nice enough to u just by replying jesus
nom nom nom learn to program.
Wonderful explanation!!