How do you minimize a function when you can't take derivatives? CMA-ES and PSO

  • Published Jun 27, 2024
  • What happens when you want to minimize a function, say, the error function used to train a machine learning model, but the function has no derivatives, or they are very hard to calculate? You can use gradient-free optimizers. In this video, I show you two of them (a minimal code sketch of the shared sample-and-update idea follows the description):
    - CMA-ES (Covariance Matrix Adaptation Evolution Strategy)
    - PSO (Particle swarm optimization)
    This video is a sequel to "What is Quantum Machine Learning"
    • What is Quantum Machin...
    and also part of the blog post:
    www.zapatacomputing.com/why-g...
    Introduction: (0:00)
    CMA-ES: (1:23)
    PSO: (9:17)
    Conclusion: (14:00)
  • Science & Technology
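
    A minimal sketch of the sample-and-update loop these methods share (a simplified evolution-strategy flavour, not the full CMA-ES covariance update; the quadratic test function and all parameter values are illustrative):

      import numpy as np

      def f(x):
          # Illustrative objective with no derivatives needed: (x - 3)^2 + (y + 1)^2
          return (x[0] - 3) ** 2 + (x[1] + 1) ** 2

      rng = np.random.default_rng(0)
      mean = np.zeros(2)           # current guess
      sigma = 1.0                  # spread of the Gaussian around the guess
      population, elite = 20, 5    # candidates per generation, best candidates kept

      for generation in range(50):
          # 1. Sample candidates from a Gaussian centred at the current mean
          candidates = mean + sigma * rng.standard_normal((population, 2))
          # 2. Evaluate them and keep the best few
          best = candidates[np.argsort([f(c) for c in candidates])[:elite]]
          # 3. Move the mean toward the best candidates and shrink the spread
          mean = best.mean(axis=0)
          sigma *= 0.95

      print(mean)  # approaches the minimum at (3, -1)

    Full CMA-ES additionally adapts the whole covariance matrix of the sampling Gaussian, not just a single step size.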

COMMENTS • 31

  • @chyldstudios
    @chyldstudios 1 year ago +6

    Very clear explanation of these two optimization algorithms. Well done!

  • @cesarkadirtorricovillanuev9761

    Very good video, very clear and intuitive explanation. Keep it up!
    Greetings from Bolivia.

  • @luisvasquez5015
    @luisvasquez5015 1 year ago +2

    Fantastic video! Can't wait for the third part of the series!

  • @shazajmal9695
    @shazajmal9695 2 months ago +1

    Thanks a lot for this super informative lecture! Could you please make one on Genetic Algorithms? 😅

  • @Todorkotev
    @Todorkotev 1 year ago +4

    Luis, yet another awesome video! Thanks to you, I've learned something new today! The step-by-step visualization with the gaussian evolution of candidates is epic - super helpful and eye-opening! Thank you!

    • @SerranoAcademy
      @SerranoAcademy  1 year ago +1

      Thank you so much Boyko, I’m glad you enjoyed it! :)

    • @SerranoAcademy
      @SerranoAcademy  1 year ago +1

      And thank you so much for your contribution! It's very kind.

  • @RK-TKLINK
    @RK-TKLINK 5 months ago

    The best explanation I've found, thank you!

  • @imadsaddik
    @imadsaddik 1 year ago

    Thanks, the explanation was crystal clear!

  • @omarmohy3975
    @omarmohy3975 1 year ago

    Amazing video, really. I also wish you could explain it with more math and some coding; a second part would be amazing.

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago

    So well explained, that is amazing. I wonder what the downside of such a particle swarm could be.

    • @SerranoAcademy
      @SerranoAcademy  1 year ago

      Thanks! Great question. The exact same problems can happen with PSO: it can get stuck at a local minimum just like CMA-ES, and the way to overcome it is the same.

  • @elie_
    @elie_ 1 year ago

    Thanks a lot, very good introductory video!
    One question on PSO: is the size of the steps fixed, or proportional to (respectively) the current « speed », the distance to the personal best, and the distance to the group best?
    Thanks
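
    For reference, in the textbook PSO velocity update the steps are not fixed: each pull is proportional to the corresponding distance, scaled by a coefficient and a fresh random factor (a sketch with typical coefficient values; the names are illustrative):

      import numpy as np

      rng = np.random.default_rng()
      w, c1, c2 = 0.7, 1.5, 1.5    # inertia, personal-best pull, group-best pull

      def pso_step(x, v, personal_best, group_best):
          r1, r2 = rng.random(x.shape), rng.random(x.shape)   # fresh randomness each step
          v = (w * v                               # keep part of the current velocity
               + c1 * r1 * (personal_best - x)     # step proportional to distance to personal best
               + c2 * r2 * (group_best - x))       # step proportional to distance to group best
          return x + v, v

    So the effective step size shrinks automatically as a particle closes in on its personal best and the group best.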

  • @ZaCharlemagne
    @ZaCharlemagne 5 months ago

    CMA-ES is pretty similar to CEM (the Cross-Entropy Method).

  • @mohammadarafah7757
    @mohammadarafah7757 1 year ago

    I wish you could make a detailed course on statistics, or write a book about it.

    • @SerranoAcademy
      @SerranoAcademy  1 year ago +2

      Thank you! I'm building a course on that; hopefully it'll be out in the next few months. I'll announce it on the channel when it's ready! :)

    • @mohammadarafah7757
      @mohammadarafah7757 1 year ago

      That's great news. I can help you with practical labs; I am a PhD researcher in generative modelling. @@SerranoAcademy

  • @brunorcabral
    @brunorcabral 1 year ago

    Great explanation. Where can I get the math formulas?

    • @SerranoAcademy
      @SerranoAcademy  1 year ago +1

      Thanks! Here's a huge repository of CMA-ES info, code, tutorials, etc. For PSO I haven't found as much info, so mostly Wikipedia.

  • @floydmaseda
    @floydmaseda 1 year ago

    How fast is it? If I train a neural net (which we know how to compute the gradient of) with CMA-ES or PSO, will it take longer to converge? I would imagine particularly PSO is pretty slow, maybe only slightly better than a Monte Carlo approach. You're basically doing a line search algorithm without the advantage of knowing the direction you're moving is a descent direction. CMA-ES, on the other hand, might be reasonable?

    • @SerranoAcademy
      @SerranoAcademy  1 year ago +2

      Great question! I haven't used it in neural networks; I imagine that gradient descent is better there. I've used CMA-ES and PSO for quantum neural networks, where derivatives are hard to compute, and I've noticed that CMA-ES tends to work better. Not so much in speed, but in finding minima and not getting stuck. That's where I think the gains are: the randomness in CMA-ES lets it explore more of the space than a gradient-based algorithm that only takes small steps. In the end, I think a good combination of gradient-based and gradient-free methods works best.

    • @floydmaseda
      @floydmaseda 1 year ago

      @@SerranoAcademy A common strategy when training a model is to reduce the learning rate of gradient descent when the loss is no longer decreasing, to see if we're just bouncing around inside a local minimum. I wonder if trying an iteration or two of CMA-ES at these times might sometimes allow us to jump to nearby local minima that may be deeper but could not be reached with any gradient-based approach.
      Or another use might be during initialization, which is often just random. Maybe doing a few iterations of CMA-ES at the beginning of training and picking the best out of, say, 5 choices might shoehorn the network into a better minimum than a single initialization point.
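
      A rough sketch of that warm-restart idea, assuming the weights fit in one array and the loss can be re-evaluated a few times (the function name, sigma, and sample count are made up for illustration):

        import numpy as np

        def escape_plateau(weights, loss_fn, sigma=0.01, samples=10, rng=None):
            # Try a small cloud of Gaussian perturbations around the current weights
            rng = rng or np.random.default_rng()
            candidates = [weights] + [weights + sigma * rng.standard_normal(weights.shape)
                                      for _ in range(samples)]
            # Keep whichever candidate has the lowest loss (the current weights win ties)
            return min(candidates, key=loss_fn)

      Calling something like this when the loss plateaus, then resuming gradient descent from the returned weights, matches the idea described in the comment above.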

  • @centscents
    @centscents 1 year ago

    Do you have a citation for this method? Thanks.

  • @zyzhang1130
    @zyzhang1130 1 year ago

    Question: you don't have a gradient, so how do you know if CMA-ES has reached a local minimum?

    • @SerranoAcademy
      @SerranoAcademy  1 year ago +1

      Great question! If, after several iterations, you keep getting generations that don't improve your minimum, or improve it only very slightly, then you can assume you're at a local minimum.

    • @zyzhang1130
      @zyzhang1130 1 year ago

      @@SerranoAcademy So it's similar to the convergence analysis of gradient descent. Thank you for your reply 😁
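
      A hedged sketch of that stopping test: track the best value seen so far and declare convergence when it has not improved by more than a tolerance for a fixed number of generations (the tolerance and patience values are arbitrary):

        def converged_without_gradients(history, tol=1e-8, patience=20):
            # history: best objective value found in each generation, in order
            best, stale = float("inf"), 0
            for value in history:
                if best - value > tol:
                    best, stale = value, 0   # meaningful improvement: reset the counter
                else:
                    stale += 1               # no real improvement this generation
                    if stale >= patience:
                        return True          # assume we are at a (local) minimum
            return False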

  • @java2379
    @java2379 1 year ago

    Thanks. That's just a guess, but I doubt this method is efficient in higher dimensions, for the following reason. For instance, we had to take 5 points randomly in 2D. Take the square root to estimate how many points per dimension you need: that's about 2. So with 10 dimensions I would need 2^10 = 1024 points. But I think ML involves many more dimensions; it's basically the number of weights in a multi-layered network. Say 3 layers of fully connected neurons, 100*100 weights each: about 2^10000 points. 1 gigabyte of RAM is 2^30, so there's no way to apply this. Am I wrong somewhere? :-)

    • @elie_
      @elie_ 1 year ago

      The idea of such methods is to optimize functions when gradient descent isn't available because of a lack of differentiability. In the case you mention, the network optimization is done in the usual way (SGD, Adam, …), and what you look at is, say, the loss after a fixed number of epochs for a given set of hyperparameters (learning rate, beta coefficients, etc.), which are far less numerous. Then you reiterate the process using CMA-ES/PSO solely on those hyperparameters.

    • @dvunkannon
      @dvunkannon 1 month ago

      For a "guarantee" of convergence, yes, you need a lot of points and a lot of function evaluations. But in practice you can avoid scaling the number of points strictly with the number of dimensions. ES algorithms can easily solve 30-dimensional problems with a population size of 100, not 2^30, for example.
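
      For context, the commonly used default offspring population size in CMA-ES grows only logarithmically with the dimension n (Hansen's heuristic, lambda = 4 + floor(3 * ln(n))), which is why moderate dimensions stay tractable; a quick check:

        import math

        # Default CMA-ES population size: lambda = 4 + floor(3 * ln(n))
        for n in (2, 10, 30, 100, 1000):
            print(n, 4 + math.floor(3 * math.log(n)))
        # n = 30 gives a population of only 14, nowhere near 2**30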
