The intuition behind why the methods help with convergence is a bit misleading imo. The problem is not, in general, slow convergence close to the optimum because of a small gradient; that can easily be fixed by letting the step size depend on the gradient size. The problem they solve is the iterations zig-zagging because of large gradient components in some directions and small components in the direction you actually want to move. By averaging past gradients (or using them in similar ways) you effectively cancel out the components causing the zig-zag.
Hello! Thanks for the comment. Optimizers like RMSProp and Adam do make the step size dependent on gradient size, which I showcase in the video, so while there are other techniques to deal with slow convergence near the optimum due to small gradients, these optimizers still help. Maybe I could've made this part clearer though. Also, from my understanding, learning rate decay is a pretty popular technique, so wouldn't that slow down convergence even more as the learning rate decays and the loss approaches the area with smaller gradients? However, I definitely agree with your bigger point about these optimizers preventing the loss from zig-zagging! In my RMSProp example, I do show how the loss is able to take a more direct route from the starting point to the minimum. Maybe I could've showcased a bigger example where SGD zig-zags more prominently to further illustrate the benefit that RMSProp and Adam bring to the table. I really appreciate you taking the time to give me feedback.
@@sourishk07 Yeah, I absolutely think the animations give good insight into the different strategies within "moment"-based optimizers. My point was more that even with "vanilla" gradient descent methods, the step sizes can be handled so they don't vanish as the gradient gets smaller, and that the real benefit of the other methods is in altering the _direction_ of descent to deal with situations where the eigenvalues of the (locally approximate) quadratic form differ by orders of magnitude. But I must also admit that (especially in the field of machine learning) the name SGD seems to be more or less _defined_ to include a fixed decay rate of step sizes, rather than just the method of finding a step direction (where finding step sizes would be a separate (sub-)problem), so your interpretation is probably more accurate than mine. Anyway, thanks for replying and I hope you continue making videos on the topic!
Thanks for sharing your insights! I'm glad you enjoyed the video. Maybe I could make a video that dives deeper into step sizes or learning rate decay and the role that they play on convergence!
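To make the zig-zag point in this thread concrete, here's a tiny numerical sketch (my own toy code, not from the video; the loss, learning rate, and beta are made up): on a quadratic that is much steeper in one direction, a plain gradient step is dominated by the steep coordinate, while RMSProp's component-wise scaling roughly equalizes the two.

```python
import numpy as np

# Toy loss that is steep in y and shallow in x: L(w) = 0.5 * (x^2 + 25*y^2)
def grad(w):
    return np.array([w[0], 25.0 * w[1]])

w = np.array([5.0, 1.0])
g = grad(w)                        # [5, 25]: the y-gradient is 5x larger
lr, beta, eps = 0.1, 0.9, 1e-8

# Plain gradient descent: the step is proportional to the raw gradient,
# so the steep direction dominates (this is what produces the zig-zag)
sgd_step = lr * g

# RMSProp: divide by the root of a running average of squared gradients,
# which roughly equalizes the step magnitude across parameters
v = (1 - beta) * g**2              # first update of the running average
rms_step = lr * g / (np.sqrt(v) + eps)

print(sgd_step)  # components differ by 5x
print(rms_step)  # components are nearly equal
```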
So cool, just subscribed! I literally just started researching more about how optimizers work this week as part of my bachelor's thesis. Once I'm finished with my thesis I would love to see if I can create my own optimizer algorithm. Thanks for sharing! Do you happen to have the manimgl code you used to create the animations for visualizing the gradient path of the optimizers?
Thank you for subscribing! Maybe once you make your own optimizer, I can make a video on it for you! I do have the manimgl code but it's so messy haha. I do plan on publishing all of the code for my animations once I get a chance to clean up the codebase. However, if you want the equations for the loss functions in the meantime, let me know!
I guess technically I didn’t talk about how the dataset was batched when performing GD, so no stochastic elements were touched upon. However, I just used SGD as a general term to talk about vanilla gradient descent, like how PyTorch and Tensorflow’s APIs are structured.
@@sourishk07 I see! It would be interesting to see if/how the stochastic element helps with the landscape l(x, y) = x^2 + a|y| or whatever that example was :)
If you're interested, consider playing around with batch & mini batch gradient descent! There's been a lot of research on how batch size affects convergence so it might be a fun experiment to try out.
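For anyone who wants to try that experiment, here's a minimal sketch (my own toy setup; the data, learning rate, and batch sizes are arbitrary) comparing full-batch gradient descent with mini-batch SGD on a tiny least-squares problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny least-squares problem: y = X @ true_w + noise
X = rng.normal(size=(256, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=256)

def run(batch_size, lr=0.1, epochs=50):
    w = np.zeros(2)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)            # reshuffle every epoch
        for start in range(0, n, batch_size):
            b = order[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w = w - lr * grad
    return w

w_full = run(batch_size=256)   # full-batch: one deterministic step per epoch
w_mini = run(batch_size=32)    # mini-batch: several noisy steps per epoch
print(w_full, w_mini)
```

Both settings recover the true weights here; the interesting part is watching how the trajectory noise changes as you vary `batch_size`.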
Yeah, as crazy as it sounds, there is already research being done in this area! I encourage you to take a look at some of those papers if you're interested! 1. Learning to Learn by Gradient Descent by Gradient Descent (2016, Andrychowicz et al.) 2. Learning to Optimize (2017, Li and Malik)
Hi! Using reinforcement learning in the realm of optimizers is a fascinating concept and there's already research being done on it! Here are a couple cool papers that might be worth your time: 1. Learning to Learn by Gradient Descent by Gradient Descent (2016, Andrychowicz et al.) 2. Learning to Optimize (2017, Li and Malik) It would be fascinating to see GPT-4 help write more efficient optimizers though. LLMs helping accelerate the training process for other AI models seems like the gateway into AGI
Sorry, maybe I'm misinterpreting your question, but just to clarify the RMSProp optimizer: After the gradient term is calculated during backpropagation, you take the element-wise square of it. These values help determine by how much to modulate the learning rates individually for each parameter! The reason squaring is useful is because we don't actually care about the sign, but rather just the magnitude. Same concept applies to Adam. Let me know if that answers your question.
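A quick sketch of the "sign doesn't matter, only magnitude" point above (toy code with made-up gradient values): opposite gradients produce the same running average, and the per-parameter effective learning rate shrinks as the gradient magnitude grows.

```python
import numpy as np

beta, lr, eps = 0.9, 0.01, 1e-8
g = np.array([0.5, -2.0, 3.0])           # made-up gradient components

# Element-wise squaring discards the sign but keeps the magnitude,
# so g and -g contribute identically to the running average v
v_pos = (1 - beta) * g**2
v_neg = (1 - beta) * (-g)**2
same = bool(np.allclose(v_pos, v_neg))

# Per-parameter "effective learning rate": shrinks as |gradient| grows
eff_lr = lr / (np.sqrt(v_pos) + eps)
print(same, eff_lr)
```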
The “problem” the Adam algorithm is presented as solving here (the one with local and global minima) is simply wrong: in a small number of dimensions this is in fact a problem, but the conditions for the existence of a local minimum become harder and harder to satisfy as the number of dimensions grows. So in practice, when you have millions of parameters and therefore dimensions, local minima that aren't the global minimum will simply not exist; the probability of such existence is unfathomably small.
Hi! This is a fascinating point you bring up. I did say at the beginning that the scope of optimizers wasn't just limited to neural networks in high dimensions, but could also be applicable in lower dimensions. However, I probably should've added a section about saddle points to make this part of the video more thorough, so I really appreciate the feedback!
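As a rough numerical illustration of this thread (a sketch, not a proof; the "Hessians" are just random symmetric matrices, which is a strong modeling assumption): the fraction of random critical points whose curvature is positive in every direction collapses quickly as the dimension grows, which is why saddle points dominate in high dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_all_positive(dim, trials=1000):
    """Fraction of random symmetric matrices ('Hessians' at a critical
    point) whose eigenvalues are ALL positive, i.e. look like minima."""
    hits = 0
    for _ in range(trials):
        a = rng.normal(size=(dim, dim))
        h = (a + a.T) / 2                 # random symmetric matrix
        if np.linalg.eigvalsh(h).min() > 0:
            hits += 1
    return hits / trials

f2 = frac_all_positive(2)    # low-dimensional: minima are reasonably common
f10 = frac_all_positive(10)  # higher-dimensional: essentially all saddles
print(f2, f10)
```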
When you write square root of V_t in a denominator, do you mean this component-wise? V is a high dimensional vector I assume. Also what if it has negative values? Don’t you mean the norm of V?
That's a good point. I did mean component-wise, which I should've mentioned in the video. Also, V shouldn't have negative values because we're always squaring the gradients when calculating the V term. Since beta is always between 0 and 1, we're always multiplying positive numbers to calculate V.
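A quick sanity check of that claim (toy code, arbitrary shapes): even when the gradients themselves take both signs, every component of V stays non-negative, so the component-wise square root is always defined.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.9
v = np.zeros(4)

# Gradients with arbitrary signs: squaring makes every contribution >= 0,
# so v (and hence sqrt(v)) is always well defined
for _ in range(1000):
    g = rng.normal(size=4)
    v = beta * v + (1 - beta) * g**2

print(v.min() > 0)  # True
```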
A lot of times in academia, people are just using SGD with momentum but playing around with learning rate scheduling a lot. You don't always want to get the deepest minimum since it can actually give you poor generalizability. That's why Adam isn't that popular when researchers are trying to push to SOTA.
Hi! I can only speak to the papers that I've read, but I still seem to see Adam being used a decent amount. Your point about overfitting is valid, but wouldn't the same thing be achieved by using Adam and just training for fewer iterations?
Hi! I've read the Adabelief paper and it seems really promising, but I wanted to focus on the preliminary optimizers first. I think this might be a great candidate if I were to work on a part 2 to this video! Thanks for the idea!
P.S. in Denmark we have this extremely silly song called "Adam" (Viro) that went viral like 8 years ago, and that song played inside my head every time you said Adam lmao.
Are you referring to my animated examples or the benchmarks towards the end of the video? The animations were contrived just to showcase each optimizer, but the performance of RMSProp during the benchmarks at the end vary based on the domain. It actually sometimes manages to beat Adam as we saw in the research papers! This is where experimentation might be worthwhile depending on what resources are available to you.
@@sourishk07 thanks! There was some hard work behind them, so I’m happy to hear they’re appreciated. But I don’t need to tell you that. This video is a masterpiece!
Gemini 1.5 Pro: This video is about optimizers in machine learning. Optimizers are algorithms that are used to adjust the weights of a machine learning model during training. The goal is to find the optimal set of weights that will minimize the loss function. The video discusses four different optimizers: Stochastic Gradient Descent (SGD), SGD with Momentum, RMSprop, and Adam.
* Stochastic Gradient Descent (SGD) is the simplest optimizer. It takes a step in the direction of the negative gradient of the loss function. The size of the step is determined by the learning rate.
* SGD with Momentum is a variant of SGD that takes into account the history of the gradients. This can help the optimizer to converge more quickly.
* RMSprop is another variant of SGD that adapts the learning rate for each parameter of the model. This can help to prevent the optimizer from getting stuck in local minima.
* Adam is an optimizer that combines the ideas of momentum and adaptive learning rates. It is often considered to be a very effective optimizer.
The video also discusses the fact that different optimizers can be better suited for different tasks. For example, Adam is often a good choice for training deep neural networks. Here are some of the key points from the video:
* Optimizers are algorithms that are used to adjust the weights of a machine learning model during training.
* The goal of an optimizer is to find the optimal set of weights that will minimize the loss function.
* There are many different optimizers available, each with its own strengths and weaknesses.
* The choice of optimizer can have a significant impact on the performance of a machine learning model.
I'm currently consolidating a list of more advanced optimizers for a follow up video so I really appreciate the recommendation. I'm adding it to the list!
Unfortunately, the code for the animations is not ready for the public haha. It's wayyy too messy. However, I didn't include the code for the optimizers because the equations are straightforward to implement, but how you use the gradients to update the weights depends greatly on how the rest of the code is structured.
Hi! There seems to be many interesting papers about using metaheuristic approaches with machine learning, but I haven't seen too many applications of them in industry. However, this is a topic I haven't looked too deeply into! I simply wanted to discuss the strategies that are commonly used by modern day deep learning and maybe I'll make another video about metaheuristic approaches! Thanks for the idea!
Please explain one thing to me. Why do we negate the gradient vector to get the downhill direction? What if directly opposite to the gradient vector there's a bump instead of a smooth descent? Shouldn't we instead negate the parameter field itself, transforming holes into bumps, and then calculating the gradient?
Hi, that's a great question! Remember, what the gradient gives us is a vector of the "instantaneous" rate of change for each parameter at the current location on the cost function. So if there is a bump 1 epsilon (using an arbitrarily small number as a unit) or 10 epsilons away, our gradient vector has no way of knowing that. What you'll see is that if you negate the entire cost function (which is what I'm assuming you meant by 'parameter field') and perform gradient ascent rather than gradient descent, you'll end up with the exact same problem: "What happens if there is a tiny divot in the direction of steepest ascent?" At the end of the day, no cost function in the real world will be as smooth and predictable as the ones I animated. There will always be small bumps and divots along the way, which is the entire point of using more advanced optimizers like RMSProp or Adam: we're hoping that they're able to circumvent these small obstacles and still reach the global minimum!
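A tiny sketch of the symmetry described above (my own toy 1-D example, arbitrary learning rate): gradient ascent on the negated cost takes exactly the same steps as gradient descent on the original, so negating the function buys you nothing.

```python
# Toy 1-D cost L(w) = (w - 1)^2 with derivative L'(w) = 2*(w - 1)
def dL(w):
    return 2.0 * (w - 1.0)

lr = 0.1
w_desc = 3.0   # gradient DESCENT on L
w_asc = 3.0    # gradient ASCENT on the negated cost -L

for _ in range(50):
    w_desc = w_desc - lr * dL(w_desc)   # step against the gradient of L
    w_asc = w_asc + lr * (-dL(w_asc))   # step along the gradient of -L

print(w_desc, w_asc)  # identical: negating the cost just flips the signs back
```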
10:19 What a weird formula for NAG! It's much easier to remember a formulation where you always take the antigradient. You want to *add* velocity and take the gradient with a *minus*. The formula just changes to:

V_t+1 = b V_t - a grad(W_t + b V_t)
W_t+1 = W_t + V_t+1

It's more intuitive and more similar to standard GD. Why would anyone want to change these signs? How often do you subtract velocity to update the position? Do you want to *add* the gradient to update V right after you explained we want to subtract the gradient in general to minimize the loss function? It makes everything twice as hard and just... wtf...
Hi! Thanks for bringing this up! I've seen the equation written in both forms, but probably should've elected for the one suggested by you! This is what I was referring to for the equation: www.arxiv.org/abs/1609.04747
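For what it's worth, the two sign conventions in this thread really do produce the same iterates; here's a throwaway check on a 1-D quadratic (my own toy code; the momentum and learning rate values are arbitrary):

```python
# 1-D quadratic loss L(w) = w^2, with gradient g(w) = 2w
def g(w):
    return 2.0 * w

gamma, eta = 0.9, 0.05   # momentum coefficient and learning rate

w_video, v = 2.0, 0.0    # convention from the video: subtract velocity
w_flip, u = 2.0, 0.0     # sign-flipped convention: add velocity

for _ in range(40):
    v = gamma * v + eta * g(w_video - gamma * v)
    w_video = w_video - v

    u = gamma * u - eta * g(w_flip + gamma * u)
    w_flip = w_flip + u

print(w_video, w_flip)  # identical trajectories, since u is just -v
```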
@@sourishk07 I loved your animations; they are well presented. Are you planning on sharing a little insight on making those? I feel like in academia the biggest challenge for us is to communicate in an engaging way.
@@renanmonteirobarbosa8129 Hi, thanks for the idea! I want to get a little bit better at creating them before I share how I create them. But I used manimgl so I encourage you to check that out in the meantime!
It really annoys me when people claim using less electricity has anything to do with environmentalism. Most large ML models are trained in the Pacific North West (Seattle and Vancouver area) where most power comes from hydroelectric. Using more electricity has no meaningful environmental impact there since it's just diverting the energy that rivers are dumping in the ocean anyway. If you're worried about the environmental impact of energy, focus on the generation methods, not the consumption methods.
I appreciate you bringing up this point. But at the end of the day, the Seattle/Vancouver area isn't enough to handle the entire world's demand for electricity, especially with the additional burden that training large ML models brings. Not to mention, as long as not all of our electricity is derived from green sources, it doesn't matter if training jobs for ML models get their energy solely from green sources, because that demand is still competing with other sectors of consumption. While there remains a lot of work left in optimizing our hardware to run more efficiently, there is no harm in optimizing our algorithms to use fewer resources in the meantime.
@@rudypieplenbosch6752 if you think addressing environmental issues is environmental nonsense, you need to get your head out of your --- and read a scientific paper on climate change. And if you don't believe in that, publish some proof or kindly stay out of the scientific discourse that this video, for example, is part of. Thank you!
Are you real? I have a feeling you are AI video generation hooked up to a LLM that varies a script that the MIC uses to steer humanity to building it's AI god.
The Adam Optimizer is a very complex topic that you introduced and explained so well, and in a surprisingly short video! I'm impressed Sourish! Definitely one of my favorite videos from you!
Thank you so much Akshay! I'm glad you enjoyed it!
I don't disagree that it's a very good video. But calling something that can be taught in a 20 minute youtube video "very complex topic" is funny.
When I was in college studying CS (before Adam even existed, I am this old), the entire topic of AI and neural networks was covered in 1 semester with only 2 hours of class per week.
In fact, that is what's both surprising and amazing about the current state of AI: the math behind it is so simple that most researchers were positive we would need much more complex algorithms to get to where we are now. But then teams like OpenAI proved that it was just a matter of massively scaling up those simple concepts and feeding them insane amounts of data.
@@elirane85 Hi! Thanks for commenting. I believe Akshay was just being nice haha. But you're definitely right about how the potential lies not in complex algorithms, but the scale at which these algorithms are run! I guess that's the marvel of modern hardware
I remember when my teacher gave me an assignment on optimizers. I went through blogs, papers, and videos, but everywhere I looked I saw different formulas and I was so confused. You explained everything in one place very easily.
I'm really glad I was able to help!
I don't comment on videos a lot... but I just wanted to let you know this is the best visualization and explanation of optimizers I've found on YouTube. Great job.
Absolutely loved the graphics and the intensive paper-based proofs of how the different optimizers work, all in the same video. You just earned a loyal viewer.
Thank you so much! I'm honored to hear that!
Excellent video. Especially since you wrote down the equations from the viewpoint of a single weight. Many researchers like to use more abstract notation that is hard to grasp for people who have never studied this topic.
Love the simplified explanation and animation! Videos with this quality and educational value would be worth millions of likes and subscribers on other channels... this is so underrated.
Haha I really appreciate the kind words! More content like this is on the horizon
Didn't expect to have to learn three optimizers in order to understand Adam, but here we are. It took me so much time to go through this video, and I had to have ChatGPT explain those formulas a bit more in depth before it slowly started to make sense. But I think I've got the (math) intuition behind it now. Thanks for this video; lots of others skip the math, but you cannot understand without the math, because ML IS math, right?! Btw the visualizations were pretty great!
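For anyone else piecing the three optimizers together, a minimal sketch of how they combine (my own toy code, not from the video; `adam_step` is just an illustrative name and the hyperparameters are the common defaults): Adam is momentum's running average of gradients plus RMSProp's running average of squared gradients, with bias correction.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum EMA + RMSProp EMA + bias correction."""
    m = b1 * m + (1 - b1) * g        # momentum piece: EMA of gradients
    v = b2 * v + (1 - b2) * g**2     # RMSProp piece: EMA of squared gradients
    m_hat = m / (1 - b1**t)          # bias correction (t starts at 1)
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
g = np.array([0.1, -0.4])            # a made-up gradient
w, m, v = adam_step(w, g, m, v, t=1)
print(w)  # each weight moved roughly lr against its gradient's sign
```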
This video is amazing!
You covered one of the most important topics in ML, with all the major optimization algorithms. I literally had no idea about Momentum, NAG, RMSprop, AdaGrad, or Adam.
Now, I have a good overview of all, will deep dive in all of them.
Thanks for the video! ❤
I'm really glad to hear that it was helpful! Good luck on your deep dive!
Very Clear Explanation! Thank you. I especially appreciate the fact that you included the equations.
Thank you! And I’m glad you enjoyed it
Sir your exposition is excellent, the presentation, the cadence , the simplicity.
I really appreciate that! Looking forward to sharing more content like this
Thanks for this video. I am just a bit confused about what W is. Is it the parameters of the objective function or the function itself? From my view, W_t+1 are the resulting parameters after updates, but on the other hand, the gradient needs to be computed for the objective function and not the parameters directly
Thank you for such an easy, simple, and great explanation. I searched for a quick overview of how Adam works and found your video. I am actually training a DRL REINFORCE policy gradient algorithm with theta parameters as the weights and biases from a CNN, which is exactly where Adam is involved. Thanks again, very informative.
Thanks for watching and I'm really glad I was able to help!
I used to have networks where the loss was fluctuating in a very periodic manner every 30 or so steps and I never knew why that happened. Now it makes sense! It just takes a number of steps for the direction of Adam weight updates to change.
I really should have looked this up earlier.
Hmm while this might be Adam's fault, I would encourage you to see if you can replicate the issue with SGD w/ Momentum or see if another optimizer without momentum solves it.
I believe there are a wide array of reasons as to why this periodic behavior might emerge.
Nice animations and nice explanations of the math behind them. I was curious about how the different optimizers work but didn't want to spend an hour going through documentation, and this video answered most of my questions!
The one that remains is about the AdamW optimizer. I read that it is practically just a better version of Adam, but didn't really find any intuitive explanations of how it affects training (ideally with graphics like these hahaha). There are not many videos on YouTube about it.
I'm glad I was able to be of help! I hope to make a part 2 where I cover more optimizers such as AdamW! Stay tuned!
I am coding backpropagation right now and this helped me so much.
Glad to hear that! That's a very exciting project and I wish you luck on it!
Very nicely explained. Wish you brought up the relationship between these optimizers and numerical procedures though. Like how vanilla gradient descent is just Euler's method applied to a gradient rather than one derivative.
Thank you so much. And there were so many topics I wanted to cram into this video but couldn't in the interest of time. That is a very interesting topic to cover and I'll add it to my list! Hopefully we can visit it soon :) I appreciate the idea
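The correspondence mentioned in this thread is easy to check numerically; here's a throwaway sketch (my own toy 1-D cost, arbitrary step size) showing that an explicit Euler step on the gradient-flow ODE dw/dt = -L'(w) is exactly a vanilla gradient descent update:

```python
# Toy 1-D cost L(w) = (w - 3)^2, so L'(w) = 2*(w - 3)
def dL(w):
    return 2.0 * (w - 3.0)

h = 0.1                     # Euler step size == learning rate
w_gd, w_euler = 0.0, 0.0

for _ in range(30):
    w_gd = w_gd - h * dL(w_gd)              # vanilla gradient descent
    w_euler = w_euler + h * (-dL(w_euler))  # explicit Euler on dw/dt = -L'(w)

print(w_gd, w_euler)  # the same sequence of iterates
```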
Thanks for the great explanations! The graphics and benchmark were particularly useful.
I'm really glad to hear that!
Woah what a great video! And how you are helping people on the comments kind of has me amazed. Thank you for your work!
Haha thank you, I really appreciate that!
The animations are amazing - what did you use to make them??
Thank you so much! I'm glad you liked them. I used the manimgl python library!
Wow! Great video, more of these deep dives into basic components of ML please
Thank you for watching. We have many more topics lined up!
Extremely insightful... Thanks for the video, helped a lot!!!
Glad to hear it!
I'm not even a data scientist or machine learning expert but I enjoyed this video!
I love to hear that!
Nice vid. I'd mention MAS too, to explicitly say that Adam is weaker at the start and can fall into local minima (until it gets enough data), while SGD performs well early thanks to its stochasticity and then slows down, so the two methods complement each other (they performed roughly like I described in the MAS paper).
Thank you for the feedback! These are great things to include in a part 2!
what a good video, I watched it and bookmarked so I can come back to it when I understand more about the topic
Glad it was helpful! What concepts do you feel like you don’t understand yet?
You are incredibly intelligent to explain such a complex topic, built from tens of research papers' worth of knowledge, in a single 20-minute video... what the heck!
Wow thank you for those kind words! I'm glad you enjoyed the video!
What is the mathematical expression for the boss cost function at the end?
Haha it took me a long time to "engineer" the cost function to look exactly how it did! It consists of three parts: the parabola shape and two holes. They're added together to yield the final result.
I've inserted the python code below, but it might seem overwhelming! If you're really curious, I encourage you to change each of the constants and see how the function changes.
```python
import numpy as np

w1 = 2        # width of hole 1
w2 = 4        # width of hole 2
h1 = 0.5      # depth of hole 1
h2 = 0.75     # depth of hole 2
c = 0.075     # overall steepness of the bowl
bowl_constant = 3**2                 # flattens the bowl along y
center_1_x, center_1_y = 0.0, 0.0    # center of hole 1
center_2_x, center_2_y = 1, 1        # center of hole 2

def f(x, y):
    parabola = c * x**2 + c * y**2 / bowl_constant
    hole1 = -h1 * (np.exp(-w1*(x-center_1_x)**2) * np.exp(-w1*(y-center_1_y)**2))
    hole2 = -h2 * (np.exp(-w2*(x-center_2_x)**2) * np.exp(-w2*(y-center_2_y)**2))
    return parabola + hole1 + hole2
```
@@sourishk07 This is really great! I work in molecular QM and needed an image to display a potential energy surface for a reaction and the transition between reactants and products. This is one of the cleanest analytical ones that I've seen, and I'll be using this in the future, thanks!
@@jevandezande I'm glad I was able to be of assistance! Let me know if you need anything else!
Adam - A dynamic adjustment mechanism
Yes, that's exactly what it is!
One point.
Memory availability has increased exponentially while energy requirements for TPUs have decreased in a more linear fashion; ergo, over time, using more memory to run better optimizers will yield a net reduction in energy consumption.
That's a very valid point! It's really good that GPUs these days are shipping with boatloads of VRAM, especially because optimizers such as Adam require so much additional information to be stored per parameter.
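To put a number on that per-parameter overhead, here's a quick sketch (toy sizes, float32 assumed): Adam's two moment buffers cost two extra floats for every weight.

```python
import numpy as np

n_params = 1_000_000
weights = np.zeros(n_params, dtype=np.float32)

# Adam keeps two extra buffers of the same shape as the weights:
m = np.zeros_like(weights)   # first moment: running average of gradients
v = np.zeros_like(weights)   # second moment: running average of squared gradients

extra_bytes = m.nbytes + v.nbytes
print(extra_bytes / weights.nbytes)  # 2.0 -> two extra floats per parameter
```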
Nice video :) I appreciate the visual examples of the various optimizers.
Glad to hear that!
I tried using momentum for a 3SAT optimizer I worked on in 2010. It doesn't help with 3SAT, since all the variables are binary. It's cool that it works with NNs though!
Oh wow that's an interesting experiment to run! Glad you decided to try it out
Am I the only one who doesn't understand the RMS propagation math formula? What is the gradient squared: is it per component, or is it the Hessian? And how do you divide a vector by another vector? Could someone explain, please?
Hi! Sorry, this is something I should've definitely clarified in the video! I've gotten a couple other comments about this as well. Everything in the formula is component-wise. You square each element in the gradient matrix individually & you perform component-wise division, along with the component-wise square root.
Again, I really apologize for the confusion! I'll make sure to make these things clearer next time.
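For anyone who wants to see the component-wise operations in code, here's a minimal RMSProp-style update in numpy (the learning rate, beta, and epsilon are illustrative defaults, not values from the video):

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update; every operation here is element-wise."""
    v = beta * v + (1 - beta) * grad**2     # element-wise square of the gradient
    w = w - lr * grad / (np.sqrt(v) + eps)  # element-wise sqrt and division
    return w, v

w, v = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.5, -4.0])  # gradient components differ by 8x in magnitude
w, v = rmsprop_step(w, grad, v)
print(w, v)
```

Note how the two parameters end up taking equally sized steps even though their gradient magnitudes differ by 8x; that's the per-parameter learning-rate modulation at work.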
This was a really interesting video! I feel like this helped me understand the intuitions behind optimizers, thank you!
I really appreciate the comment! Glad you could learn something new!
Great video, glad the algorithm brought me. The visualizations helped a lot
Thank you so much! I'm glad you liked the visualizations! I had a great time working on them
dude love the video title. came just to comment that. i think i searched something like "who is adam w" when i started my ai journey
Haha I'm glad you liked the title. Don't worry I did that too!
Incredible video. I especially love the math and intuition behind it that you explain. Keep it up!
Thanks, will do! Don't worry, there is more to come
The intuition behind why these methods help with convergence is a bit misleading, imo. The problem is not, in general, slow convergence close to the optimum because of a small gradient; that can easily be fixed by letting the step size depend on the gradient size. The problem they solve is when the iterations zig-zag because of large gradient components in some directions and small components in the direction you actually want to move. By averaging (or a similar use of past gradients) you effectively cancel out the components causing the zig-zag.
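The zig-zag cancellation described here is easy to see in a toy numpy sketch (the loss surface and hyperparameters are made up for illustration): on an elongated bowl, plain gradient descent makes the y-gradient flip sign every step, and an exponential moving average of the gradients nearly cancels that component:

```python
import numpy as np

# Elongated bowl: shallow in x, steep in y, so plain GD zig-zags in y
def grad(w):
    x, y = w
    return np.array([x, 20.0 * y])

w = np.array([5.0, 1.0])
lr, beta = 0.09, 0.9
v = np.zeros(2)
raw_y, avg_y = [], []
for _ in range(30):
    g = grad(w)
    v = beta * v + (1 - beta) * g  # exponential moving average of gradients
    raw_y.append(abs(g[1]))
    avg_y.append(abs(v[1]))
    w = w - lr * g                 # plain GD step: y overshoots and flips sign

print(f"mean |raw y-gradient|:      {np.mean(raw_y):.3f}")
print(f"mean |averaged y-gradient|: {np.mean(avg_y):.3f}")  # far smaller
```

The averaged y-component is much smaller than the raw one because consecutive y-gradients alternate in sign and cancel, which is exactly the effect momentum exploits.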
Hello! Thanks for the comment. Optimizers like RMSProp and Adam do make the step size dependent on gradient size, which I showcase in the video, so while there are other techniques to deal with slow convergence near the optimum due to small gradients, having these optimizers still helps. Maybe I could've made this part clearer though.
Also, from my understanding, learning rate decay is a pretty popular technique used so wouldn't that just slow down convergence even more as the learning rate decays & the loss approaches the area with smaller gradients?
However, I definitely agree with your bigger point about these optimizers preventing the loss from zig-zagging! In my RMSProp example, I do show how the loss is able to take a more direct route from the starting point to the minimum. Maybe I could've showcased a bigger example where SGD zig-zags more prominently to further illustrate the benefit that RMSProp & Adam bring to the table. I really appreciate you taking the time to give me feedback.
@@sourishk07 Yeah, I absolutely think the animations give good insight into the different strategies within "moment"-based optimizers. My point was more that even with "vanilla" gradient descent methods, the step sizes can be handled so they don't vanish as the gradient gets smaller, and that the real benefit of the other methods is in altering the _direction_ of descent, to deal with situations where the eigenvalues of the (locally approximate) quadratic form differ by orders of magnitude. But I must also admit that (especially in the field of machine learning) the name SGD seems to be more or less _defined_ to include a fixed decay rate of step sizes, rather than just the method of finding a step direction (with finding step sizes as a separate sub-problem), so your interpretation is probably more accurate than mine. Anyway, thanks for replying and I hope you continue making videos on the topic!
Thanks for sharing your insights! I'm glad you enjoyed the video. Maybe I could make a video that dives deeper into step sizes or learning rate decay and the role that they play on convergence!
So cool, just subscribed! I literally just started researching more about how optimizers work this week as part of my bachelor's thesis. Once I'm finished with my thesis I would love to see if I can create my own optimizer algorithm. Thanks for sharing! Do you happen to have the manimgl code you used to create the animations for visualizing the gradient path of the optimizers?
Thank you for subscribing! Maybe once you make your own optimizer, I can make a video on it for you!
I do have the manimgl code but it's so messy haha. I do plan on publishing all of the code for my animations once I get a chance to clean up the codebase. However, if you want the equations for the loss functions in the meantime, let me know!
Just found out your channel. Instant follow 🙏🏼 Hope we can see more Computer Science content like this. Thank you ;)
Thank you so much for watching! Don't worry, I have many more videos like this planned! Stay tuned :)
Sorry did I misunderstand something or did you say SGD when it was only GD you talked about? When was stochastic elements discussed?
I guess technically I didn't talk about how the dataset was batched when performing GD, so no stochastic elements were touched upon. However, I just used SGD as a general term for vanilla gradient descent, similar to how PyTorch and TensorFlow's APIs are structured.
@@sourishk07 I see! It would be interesting to see if/how the stochastic element helps with the landscape l(x, y) = x^2 + a|y| or whatever that example was :)
If you're interested, consider playing around with batch & mini batch gradient descent! There's been a lot of research on how batch size affects convergence so it might be a fun experiment to try out.
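As a starting point for such an experiment, here's a minimal mini-batch gradient descent sketch on a toy one-parameter regression (the data and hyperparameters are invented; try varying `batch_size` and watching how convergence changes):

```python
import numpy as np

# Toy 1-D linear regression: y = 3x + noise (data made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=256)

w, lr, batch_size = 0.0, 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]  # one mini-batch of indices
        pred = w * X[b, 0]
        g = 2.0 * np.mean((pred - y[b]) * X[b, 0])  # MSE gradient on the batch
        w -= lr * g

print(f"learned slope: {w:.2f}")  # should land near the true slope of 3
```

Setting `batch_size = len(X)` recovers full-batch GD, while `batch_size = 1` gives the noisiest, most "stochastic" trajectory.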
Thank you So much sir. But I will like you to create videos on upconvolutions or transposed convolutions. Thank you for understanding
Hi! Thank you for the great video ideas. I'll definitely add those to my list!
Hey what do you think, could we use reinforcement learning to train the perfect optimizer?
Yeah, as crazy as it sounds, there is already research being done in this area! I encourage you to take a look at some of those papers if you're interested!
1. Learning to Learn by Gradient Descent by Gradient Descent (2016, Andrychowicz et al.)
2. Learning to Optimize (2017, Li and Malik)
I wonder if we could use the same training loop NVIDIA used in the DrEureka paper to find even better optimizers.
Hi! Using reinforcement learning in the realm of optimizers is a fascinating concept and there's already research being done on it! Here are a couple cool papers that might be worth your time:
1. Learning to Learn by Gradient Descent by Gradient Descent (2016, Andrychowicz et al.)
2. Learning to Optimize (2017, Li and Malik)
It would be fascinating to see GPT-4 help write more efficient optimizers though. LLMs helping accelerate the training process for other AI models seems like the gateway into AGI
@@sourishk07 Thanks for the answer!
I never comment on anything, but wanted to let you know that this video was really well done. Looking forward to more!
Thank you, I really appreciate it!
What is the square of the gradient?
Sorry, maybe I'm misinterpreting your question, but just to clarify the RMSProp optimizer: After the gradient term is calculated during backpropagation, you take the element-wise square of it. These values help determine by how much to modulate the learning rates individually for each parameter! The reason squaring is useful is because we don't actually care about the sign, but rather just the magnitude. Same concept applies to Adam. Let me know if that answers your question.
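A tiny numpy illustration of that point (the numbers are arbitrary):

```python
import numpy as np

grad = np.array([-3.0, 0.5, -0.01])
sq = grad**2           # element-wise square: [9.0, 0.25, 0.0001]
# Squaring throws away the sign but keeps each parameter's magnitude,
# which is exactly what the adaptive learning-rate term needs
print(np.sqrt(sq))     # equals |grad|
```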
very clearly explained - thanks
Glad you liked it
The “problem” the Adam algorithm is presented as solving here (the one with local and global minima) is simply wrong. In a small number of dimensions this is in fact a problem, but the conditions for a local minimum to exist become harder and harder to satisfy as the number of dimensions grows. So in practice, when you have millions of parameters and therefore dimensions, local minima that aren't the global minimum will simply not exist; the probability of such existence is unfathomably small.
Hi! This is a fascinating point you bring up. I did say at the beginning that the scope of optimizers wasn't just limited to neural networks in high dimensions, but could also be applicable in lower dimensions.
However, I probably should've added a section about saddle points to make this part of the video more thorough, so I really appreciate the feedback!
Very well explained and awesome animations. Hope to see more content in the future!
Don't worry, I have many more videos like this lined up!
When you write square root of V_t in a denominator, do you mean this component-wise? V is a high dimensional vector I assume. Also what if it has negative values? Don’t you mean the norm of V?
That's a good point. I did mean component-wise, which I should've mentioned in the video. Also, V shouldn't have negative values because we're always squaring the gradients when calculating the V term. Since beta is always between 0 and 1, we're always multiplying positive numbers to calculate V.
Good work man. Which tool do you use for making the animations?
Thank you! I used manimgl!
Keep up with the great videos man
Thank you so much! Have many more planned for the future
A lot of times in academia, people are just using SGD with momentum but playing around with learning rate scheduling a lot. You don't always want to get the deepest minimum since it can actually give you poor generalizability. That's why Adam isn't that popular when researchers are trying to push to SOTA.
Hi! I can only speak to the papers that I've read, but I still seem to see Adam being used a decent amount. Your point about overfitting is valid, but wouldn't the same thing be achieved by using Adam but just training for less iterations?
@@sourishk07 Right! And that's the whole purpose of regularization
Could you share something about batch SGD in the pre-training of LLMs? What were the results?
Hi! I haven't performed any pre-training for LLMs yet, but that's a good idea for a future video. I'll definitely add it to my list!
What about the Adabelief optimizer? I use it most of the time and it is a bit faster and needs less tuning than the Adam optimizer.
Hi! I've read the Adabelief paper and it seems really promising, but I wanted to focus on the preliminary optimizers first. I think this might be a great candidate if I were to work on a part 2 to this video! Thanks for the idea!
Thank you for a very clear explanation. Liked and subscribed
Thanks for the sub! I'm glad you enjoyed the video
Ps. in Denmark we have this extremely silly song called "Adam" (Viro) that went viral like 8 years ago or so, that song played inside my head every time you said Adam lmao.
Excellent video, please keep it up! Subscribed and will share with my colleagues too :)
I really appreciate it! Excited to share more content
best video on optimizers thanks
Glad you think so!
Hmm while rmsprop speeds up the demonstrated example, it slows down the first example.
Are you referring to my animated examples or the benchmarks towards the end of the video? The animations were contrived just to showcase each optimizer, but the performance of RMSProp in the benchmarks at the end varies based on the domain. It actually sometimes manages to beat Adam, as we saw in the research papers! This is where experimentation might be worthwhile, depending on what resources are available to you.
Great video dude!
Thanks so much! I've seen your videos before! I really liked your videos about Policy Gradients methods & Importance Sampling!!!
@@sourishk07 thanks! There was some hard work behind them, so I’m happy to hear they’re appreciated. But I don’t need to tell you that. This video is a master piece!
I really appreciate that coming from you!!
The notation for the gradient is a bit weird but nice video!
Sorry haha. I wanted to keep it as simple as possible, but maybe I didn't do such a good job at that! Will keep in mind for next time
This video is super helpful my god thank you
I’m really glad you think so! Thanks
That is some quality work sir!
I really appreciate that! Don't worry, we got more to come
I love it! You are also nice to hear and see! :D
Haha thank you very much!
Gemini 1.5 Pro: This video is about optimizers in machine learning. Optimizers are algorithms that are used to adjust the weights of a machine learning model during training. The goal is to find the optimal set of weights that will minimize the loss function. The video discusses four different optimizers: Stochastic Gradient Descent (SGD), SGD with Momentum, RMSprop, and Adam.
* Stochastic Gradient Descent (SGD) is the simplest optimizer. It takes a step in the direction of the negative gradient of the loss function. The size of the step is determined by the learning rate.
* SGD with Momentum is a variant of SGD that takes into account the history of the gradients. This can help the optimizer to converge more quickly.
* RMSprop is another variant of SGD that adapts the learning rate for each parameter of the model. This can help to prevent the optimizer from getting stuck in local minima.
* Adam is an optimizer that combines the ideas of momentum and adaptive learning rates. It is often considered to be a very effective optimizer.
The video also discusses the fact that different optimizers can be better suited for different tasks. For example, Adam is often a good choice for training deep neural networks.
Here are some of the key points from the video:
* Optimizers are algorithms that are used to adjust the weights of a machine learning model during training.
* The goal of an optimizer is to find the optimal set of weights that will minimize the loss function.
* There are many different optimizers available, each with its own strengths and weaknesses.
* The choice of optimizer can have a significant impact on the performance of a machine learning model.
Thank you Gemini for watching, although I'm not sure you learned anything from this lol
Make one about the Sophia optimizer please!
I'm currently consolidating a list of more advanced optimizers for a follow up video so I really appreciate the recommendation. I'm adding it to the list!
Have you seen KAN?
I have not heard of that! I'd love to learn more though
is the code available?
Unfortunately, the code for the animations isn't ready for the public haha. It's wayyy too messy.
I also didn't include the code for the optimizers because, while the equations are straightforward to implement, how you use the gradients to update the weights depends greatly on how the rest of the code is structured.
1.54k subs it's crazy low for this quality remember me when you make it my boy
Thank you for those kind words! I'm glad you liked the video
why not using a metaheuristic approach?
Hi! There seems to be many interesting papers about using metaheuristic approaches with machine learning, but I haven't seen too many applications of them in industry.
However, this is a topic I haven't looked too deeply into! I simply wanted to discuss the strategies that are commonly used by modern day deep learning and maybe I'll make another video about metaheuristic approaches! Thanks for the idea!
Best ML video title I think I've ever seen haha
LOL thank you so much!
Great video
Great explanations
Glad you think so!
Good video, thank you!
Thank you for watching!
AMAZING!
Thank you
You're welcome. Thanks for watching!
Yooo Sourish this is heat do you remember hs speech
Thanks Lan! Yeah I remember high school speech! It's crazy to reconnect on UA-cam lol
Please explain one thing to me. Why do we negate the gradient vector to get the downhill direction? What if directly opposite to the gradient vector there's a bump instead of a smooth descent? Shouldn't we instead negate the parameter field itself, transforming holes into bumps, and then calculating the gradient?
Hi, that's a great question! Remember, what the gradient gives is a vector of the "instantaneous" rate of change for each parameter at the current location of the cost function. So if there is a bump 1 epsilon (using an arbitrarily small number as a unit) or 10 epsilons away, our gradient vector has no way of knowing that.
What you'll see is that if you negate the entire cost function (which is what I'm assuming you meant by 'parameter field') and perform gradient ascent rather than gradient descent, you'll end up with the exact same problem: "What happens if there is a tiny divot in the direction of steepest ascent?"
At the end of the day, no cost function in the real world will be as smooth and predictable as the ones I animated. There will always be small bumps and divots along the way, which is the entire point of using more advanced optimizers like RMSProp or Adam: we're hoping they're able to circumvent these small obstacles and still reach the global minimum!
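Here's a one-dimensional sketch of that locality (the bump function is invented purely for illustration): the derivative at a point is blind to a bump even a short distance away.

```python
import numpy as np

def f(x):
    # A parabola with a small, sharp bump placed "downhill" at x = 0.5
    return x**2 + 0.3 * np.exp(-200 * (x - 0.5)**2)

def df(x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)  # central-difference derivative

# At x = 1 the gradient only sees the local slope of the parabola (2x = 2);
# the bump half a unit away contributes essentially nothing to it
print(df(1.0))  # approximately 2.0
```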
This is sick
Thank you Ben!
Great video!
Thanks!
Great, great, great!!!
Thanks!!!
"Benifiting the enviroment" Not entirely sure we can say that. Makes ML less bad for the enviroment is a better way to say it.
Fantastic
Thank you! Cheers!
Like for the title. :)
Haha I'm glad you liked it!
well done!
Thank you!
10:19 What a weird formula for NAG! It's much easier to remember the formulation where you always take the antigradient: you *add* velocity and take the gradient with a *minus*. The formulas just change to
V_{t+1} = b V_t - a grad(W_t + b V_t)
W_{t+1} = W_t + V_{t+1}
It's more intuitive and more similar to standard GD. Why would anyone want to flip these signs? How often do you subtract velocity to update a position? And do you really want to *add* the gradient to update V right after explaining that we generally want to subtract the gradient to minimize the loss function? It makes everything twice as hard and just... wtf...
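In code, the sign convention advocated here would look something like this (a sketch with made-up hyperparameters and a toy quadratic loss, not the video's exact formulation):

```python
import numpy as np

def nag_step(w, v, grad_fn, lr=0.01, beta=0.9):
    """Nesterov momentum with the 'add velocity, subtract gradient' convention."""
    v = beta * v - lr * grad_fn(w + beta * v)  # look ahead, then step downhill
    w = w + v
    return w, v

# Toy quadratic loss L(w) = 0.5 * ||w||^2, so grad L(w) = w
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    w, v = nag_step(w, v, lambda p: p)
print(w)  # converges toward the minimum at the origin
```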
Hi! Thanks for bringing this up! I've seen the equation written in both forms, but probably should've elected for the one suggested by you!
This is what I was referring to for the equation: www.arxiv.org/abs/1609.04747
you can actually beat all other lectures on AI if u kept on making such videos
love that title haha
Haha thank you!
Who is Adam is what sold me hahahahahaha
LOL I'm glad you liked the title. I feel like it wrote itself though haha
@@sourishk07 I loved your animations, it is well presented. Are you planning on sharing a little insight on making those ? I feel in academia the biggest challenge for us is to communicate in an engaging way
@@renanmonteirobarbosa8129 Hi, thanks for the idea! I want to get a little bit better at creating them before I share how I create them. But I used manimgl so I encourage you to check that out in the meantime!
what a title 😂
Appreciate the visit!
the title made me laugh i had to click this
Haha I appreciate it. Thanks for the visit!
Adam has Parkinson's
I'm not sure I understand haha
It really annoys me when people claim using less electricity has anything to do with environmentalism. Most large ML models are trained in the Pacific North West (Seattle and Vancouver area) where most power comes from hydroelectric. Using more electricity has no meaningful environmental impact there since it's just diverting the energy that rivers are dumping in the ocean anyway. If you're worried about the environmental impact of energy, focus on the generation methods, not the consumption methods.
Demand influences consumption (and consumption methods) so decoupling them as you suggest is naive
I appreciate you bringing up this point. But at the end of the day, the Seattle/Vancouver area isn't enough to handle the entire world's demand for electricity, especially with the additional burden that training large ML models bring.
Not to mention, until all of our electricity is derived from green sources, it doesn't matter if training jobs for ML models get their energy solely from green sources, because that demand is still competing with other sectors of consumption.
While there remains a lot of work left in optimizing our hardware to run more efficiently, there is no harm in optimizing our algorithms to use less resources in the meantime.
I gave a like, until the environmental nonsense came up, just stick to the topic, no virtue signalling wanted.
@@rudypieplenbosch6752 if you think addressing environmental issues is environmental nonsense you need to get your head out of your --- and read a scientific paper on climate change. And if you don’t believe in that, publish some proof or kindly get out of the scientific discourse, that i.e. this video is. Thank you!
@@rudypieplenbosch6752 rounded discussion is not exclusive to virtue signaling
Consider watching at 0.75 speed.
Hi, thanks for the feedback. I'll make sure to take things a little slower next time!
@@sourishk07 Maybe it's just me, check other peoples' feedbacks on the matter.
Are you real? I have a feeling you're AI video generation hooked up to an LLM that varies a script the MIC uses to steer humanity toward building its AI god.
LMAO dw I'm very much real. I recently graduated college and am currently living in the Bay Area working at TikTok!