OUTLINE:
0:00 - Foreword
1:15 - Intro & Overview
3:40 - Backpropagation through iterated systems
12:10 - Connection to the spectrum of the Jacobian
15:35 - The Reparameterization Trick
21:30 - Problems of reparameterization
26:35 - Example 1: Policy Learning in Simulation
33:05 - Example 2: Meta-Learning Optimizers
36:15 - Example 3: Disk packing
37:45 - Analysis of Jacobians
40:20 - What can be done?
45:40 - Just use Black-Box methods
Wut is a Jacobin 🤔
Love those foundational explanations. Don’t ever apologize for them! Keep them coming.
The authors' history is big on gradient-everything (evidenced by the self-citations sprinkled throughout the paper - I didn't look at the author list and already know who's on it LOL), so instead of RL/Evo to learn the optimizer/architecture, they make everything differentiable and cook up complicated schemes to use these gradients. Now I'm impressed that they dug deeper into this direction, criticized it, and released this paper. Great science being done here.
Please try to simplify this paper for the benefit of some of us who are new entrants into the field of AI. Thank you.
Two things:
1. Your explanation of the reparameterization trick is amazing! It's exactly these Aha! moments that I love science for!
2. I really love that you're so honest about the fact that your understanding of the paper might be wrong. As a master's student aiming for a PhD in ML, it's so good to see that not every paper is self-explanatory. When I don't understand a paper, I always tend to doubt my skills. Watching you be honest about this issue really helps. Thank you! 🙏🏼
From what I understand, the paper says that the gradient variance can be thought of as a measure of how inefficient backprop is for a given system. If the variance is high, the weights ping-pong around chaotically without moving in a consistent direction.
The paper then argues that the gradient variance is high in certain chaotic systems, which makes sense, since a small change in initial conditions can make a large difference. Maybe I did not understand it correctly, but I don't think this is a shortcoming of SGD. Rather, it just means that predicting chaotic systems is hard, which is what SGD needs to do in order to optimize. If there are optimizers available that do not need to predict how chaotic systems behave and can still get the job done, then it is preferable to use those instead.
Also, this kind of applies when training a very deep neural network. The top layers of the network become the "chaotic system", and the gradients in the lower layers then get high variance.
44:58 Gradient clipping has always been a band-aid. We do not expect the neural network to learn as efficiently as it otherwise would have; it just makes the impossible possible.
45:45 Black-box methods are usually the first thing we throw at a problem, and then we try differentiable systems. Black-box methods take forever to train compared to differentiable systems, so when a differentiable system works, it is much preferable to use it.
Maybe I am a bit biased because I have put a lot of work into differentiable systems :P Regardless, good paper.
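(To make the chaos point above concrete, here is a toy numpy sketch of my own, not an experiment from the paper: gradients through an unrolled logistic map, whose per-step Jacobian routinely exceeds 1 in magnitude, so the end-to-end gradient varies wildly with the initial state.)

```python
import numpy as np

def unrolled_grad(theta, s0, T=50):
    """Iterate the logistic map s_{t+1} = theta * s_t * (1 - s_t) and
    accumulate d s_T / d theta with the chain rule (forward mode)."""
    s, ds = s0, 0.0
    for _ in range(T):
        # product rule: d s_{t+1}/d theta = s_t (1 - s_t) + theta (1 - 2 s_t) * d s_t/d theta
        ds = s * (1.0 - s) + theta * (1.0 - 2.0 * s) * ds
        s = theta * s * (1.0 - s)
    return ds

# theta = 3.9 puts the map in its chaotic regime.
rng = np.random.default_rng(0)
grads = [unrolled_grad(theta=3.9, s0=rng.uniform(0.2, 0.8)) for _ in range(1000)]
print(np.mean(grads), np.std(grads))   # enormous spread: tiny changes in s0 flip the gradient
```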
Finally I understood the logic behind VAEs and the reparameterization trick. Thank you so much. I failed to get it during my deep learning lectures, but YouTube saved my *ss. What a time to be alive.
Thank you very much! I'm already looking forward to watching the video at my leisure.
It's interesting: if this problem is effectively solved for RNNs by simply using LSTMs, this kind of implies that there could be some sort of corrective layer added to differentiable simulators, meta-learning, and the other chaotic regimes, right? Overall, great video and great paper! Keep up the good work, everyone!
Finally someone showed that RNNs suffer from vanishing/exploding gradients! So cool!
Black-box gradient is not a very good name here. The key is that the gradient is approximated over a scale which is large compared to the fluctuations. There are several "black box" gradients (like the complex-step method) whose scales are normally very, very small while still being accurate. These will have very similar problems to backprop. In signal processing you find similar issues (though for different reasons) when estimating spectra. The solution there was the development of multi-taper spectral techniques, which are essentially multiple-scale methods weighted to optimize the desired result. I suspect a multi-scale analog for Jacobian estimates would be a solution here.
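(For readers who have not seen it, here is a minimal numpy illustration of the complex-step derivative mentioned above; the test function is a placeholder. Note that the probe scale h is essentially infinitesimal, which is the commenter's point: it would inherit the same pathologies as backprop on chaotic objectives.)

```python
import numpy as np

def complex_step_grad(f, x, h=1e-20):
    """Complex-step derivative: f'(x) ~= Im(f(x + i*h)) / h.
    Accurate to machine precision for analytic f (no subtractive cancellation),
    but it still probes f at a single, essentially infinitesimal scale."""
    return np.imag(f(x + 1j * h)) / h

f = lambda x: np.sin(x) * np.exp(x)                 # placeholder analytic function
print(complex_step_grad(f, 1.0))                    # numerical derivative at x = 1
print((np.cos(1.0) + np.sin(1.0)) * np.exp(1.0))    # analytic derivative for comparison
```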
I think the essential question is a limit theorem from analysis: given a function and a Hilbert space, under what conditions is the convergence good enough to take derivatives? It's mathematically subtle, and if you are not careful in carrying out the calculation or choosing the basis, then you will see what is called large variance, and so on.
Such a great explanation! I seriously love the way you go through these papers.
Yannic, I found the basic explanations very useful. You explain things in a very intuitive way, which is very good complementary information for someone like me who self-taught these kinds of subjects from books only. More... give us more! ;)
Are the eigenvalues of the Jacobian differentiable without the Hessian matrix? If so, could we add the eigenvalues to the loss so that the parameters try to keep the eigenvalues below 1 while updating?
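(Eigenvalues are differentiable wherever they are simple, and backpropagating through such a penalty does implicitly bring in second-order terms, which autograd pays for without ever forming the full Hessian. Below is a hedged PyTorch-style sketch of a related idea: penalizing the Jacobian's largest singular value, which upper-bounds the spectral radius, so pushing it below 1 also keeps the eigenvalue moduli below 1. The toy dynamics, margin, and sizes are placeholders of mine, not anything from the paper.)

```python
import torch
from torch.autograd.functional import jacobian

def spectral_penalty(step_fn, s, margin=1.0):
    """Penalize the largest singular value of d step_fn / d s above `margin`.
    With create_graph=True the penalty itself stays differentiable w.r.t. the
    parameters of step_fn (the mixed second derivatives are left to autograd)."""
    J = jacobian(step_fn, s, create_graph=True)      # full Jacobian: fine for small states
    sigma_max = torch.linalg.svdvals(J).max()        # spectral norm >= spectral radius
    return torch.relu(sigma_max - margin)

# Toy dynamics step with a learnable matrix A (placeholder).
A = torch.nn.Parameter(1.5 * torch.eye(3))
step = lambda s: torch.tanh(A @ s)
s0 = 0.1 * torch.randn(3)

penalty = spectral_penalty(step, s0)
penalty.backward()                                   # gradient flows into A
print(penalty.item(), A.grad.norm().item())
```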
Great video, thanks for the detailed explanation!
How do you find the papers that you want to make a video about? Do you just go through arXiv and choose the papers that sound interesting or do you only choose papers from well-known conferences like NeurIPS etc.?
The index problem proposed at 11:30 could actually be OK (notice the s_{i-1}).
I don't really understand two parts of the reparametrization trick.
First of all, you said you sample the vector from a Gaussian. Do you mean a multidimensional Gaussian, or are all of the values in the vector drawn from a single distribution?
Second, if you always sample from the standard normal distribution for the reparametrization trick to work, wouldn't you just get noise? The standard normal distribution always has the same mean and standard deviation, so where would the learned part from the encoder to the latent vector come in?
- Yes, it's usually a multidimensional Gaussian with a diagonal covariance matrix, so you can treat each dimension independently.
- The learned part is the weights that are used to produce mu and sigma, and every sample is multiplied by sigma and shifted by mu. So yes, there is noise from the sampling, but there is also signal coming from the encoder.
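(A minimal numpy sketch of what this reply describes, with placeholder values standing in for the encoder outputs: the randomness comes from a fixed standard normal, and the learned mu and sigma shift and scale it, so gradients can flow back into the encoder.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these came out of the encoder for one input (placeholder values).
mu    = np.array([0.3, -1.2])
sigma = np.array([0.5,  0.8])

# Reparameterization: sample eps from a FIXED standard normal, then shift and scale.
eps = rng.standard_normal(size=mu.shape)
z   = mu + sigma * eps            # z ~ N(mu, diag(sigma^2))

# The sampling node eps carries no parameters, so dz/dmu = 1 and dz/dsigma = eps:
# gradients of the decoder loss flow into the encoder through mu and sigma.
print(z)
```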
@@YannicKilcher Alright, thanks. I did not get the part where the (learned) sigma and mu were used in the reparametrization the first time around, so I got confused. But now it makes sense. Thank you!
Curious to see comparisons and tests of theories about why we sometimes see success on these otherwise ill-posed test problems. It can't just be high dimensionality if the feedback is consistently, effectively random. How do you detect when a step of backprop was too chaotic?
Very awesome revision for me thanks :)
Equations (4)-(6) seem wrong: d l_0 / d Phi = something + d l_0 / d Phi would imply something = 0. Probably they mean something like an L1 or L2 loss in the second d l_0 / d Phi term?
I guess the gradients having high variance speaks in favour of applying some kind of low-pass filter on them - such as what's being done in practice with momentum, Adam, etc.
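(A tiny numpy sketch of that low-pass idea, i.e. the exponential moving average that momentum and Adam's first moment implement; the noisy gradient stream here is synthetic.)

```python
import numpy as np

def ema_filter(grads, beta=0.9):
    """Exponential moving average of a gradient sequence -- the low-pass
    filter that momentum / Adam's first moment apply to raw gradients."""
    m, out = 0.0, []
    for g in grads:
        m = beta * m + (1.0 - beta) * g
        out.append(m)
    return out

# Noisy gradient estimates around a true value of 1.0 (synthetic placeholder).
rng = np.random.default_rng(0)
noisy = [1.0 + 5.0 * rng.standard_normal() for _ in range(200)]
smoothed = ema_filter(noisy)
print(np.std(noisy[50:]), np.std(smoothed[50:]))   # variance drops sharply after filtering
```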
I consider backprop to be analogous to a horse-drawn carriage, whereas some kind of sparse hierarchical predictive modeling, like brains use, is the way forward to sports-car machine learning.
absolutely love this channel
In the end, is it that the gradient calculation, although correct, has large variance due to the stochastic conditions of its approximation (minibatches, truncated backprop), or is it that taking the exact derivative to converge is simply a bad idea?
I think the main point is that in chaotic systems, the gradients can have very large variance, just due to the nature of these systems (small changes lead to large effects).
I would expect they try sampling multiple times and averaging the losses to reduce the variance.
You are both right; the latter is the point of the paper. I guess they study problems where the loss landscape is super bumpy and we cannot smooth it using residuals or batch norm, so we need another way of smoothing it, or we give up on gradients? (Or use them together with RL for the parts with bad gradients?)
One thing I'm always a bit confused by with these more theoretical papers is what task the models in the graphs are actually doing. I would assume it's an important thing to specify in order to argue that the behavior of gradients in those situations is generalizable.
Amazing want more~
May I ask what kind of application you are using to read PDFs with such huge margins on the side? I have been looking for something like that for a while now!
I laughed so hard alone in my room when you said "paper rejected" because of the index. Btw, the index is correct because of the i-1 in the product; it just applies for t >= 1.
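(For reference, here is the standard unrolled-gradient expression this index discussion is about, written in the usual BPTT form rather than copied verbatim from the paper:

$$
\frac{dL}{d\theta} = \sum_{t=1}^{T} \frac{\partial l_t}{\partial s_t}\,\frac{d s_t}{d\theta},
\qquad
\frac{d s_t}{d\theta} = \sum_{k=1}^{t} \left( \prod_{i=k+1}^{t} \frac{\partial s_i}{\partial s_{i-1}} \right) \frac{\partial s_k}{\partial \theta},
$$

where the product over \partial s_i / \partial s_{i-1} is empty, i.e. the identity, when k = t, which is why the indexing only needs t >= 1.)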
If you could also tell us how they came up with this, that would be even greater. Still, the fact that you do this explaining is great.
Imagine backpropagating through backpropagation itself
my brain just went NaN trying to make sense of it 😂
that would be forward propagation, silly…
@@IoannisNousias 5D multiverse propagation with time travel!
You can. Think of the update rule, i.e. w = w + \Delta w; why not extend this expression to a whole polynomial with higher-order terms? But then, why not go even further and replace the rule with an RNN? Now you have to optimize over the RNN, which is then used to update the original network. Look for "learning to learn" models.
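(A hedged PyTorch-style sketch of that idea, loosely in the spirit of the learned-optimizer / "learning to learn" papers rather than their actual code: a shared LSTM cell looks at each coordinate's gradient and proposes that coordinate's update. The toy task, sizes, and the absent meta-training loop are all placeholders.)

```python
import torch
import torch.nn as nn

class LearnedOptimizer(nn.Module):
    """Coordinatewise learned update rule: each parameter's gradient goes
    through a shared LSTM cell, which emits that parameter's update."""
    def __init__(self, hidden=16):
        super().__init__()
        self.cell = nn.LSTMCell(1, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, grad, state):
        h, c = self.cell(grad.reshape(-1, 1), state)   # one "coordinate" per row
        delta = self.head(h).reshape(grad.shape)
        return delta, (h, c)

opt_net = LearnedOptimizer()
w = torch.randn(5, requires_grad=True)                 # parameters of the inner task
state = (torch.zeros(5, 16), torch.zeros(5, 16))

for _ in range(10):
    loss = (w ** 2).sum()                              # toy inner objective: drive w to 0
    grad, = torch.autograd.grad(loss, w)
    delta, state = opt_net(grad, state)                # the RNN proposes the update
    w = (w + delta).detach().requires_grad_(True)
print(loss.item())
# Meta-training opt_net would skip the detach, unroll this loop, and backprop the
# final loss into opt_net's weights -- which is exactly where the paper's
# chaotic-gradient problems show up.
```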
There are quite a few papers which have done that. There are a few Yannic videos on it too.
First! Looking forward to hearing how this applies to inner-optimization routines.
After finishing the video, I'm wondering if the authors could find ways of including some sort of group norm (batch norm, etc.) to reduce the gradient variance.
Good explanations
"We found a mistake... Paper is rejected!" 😂
Gradient is everything we were made to believe
16:30 "Turns out autoencoders by themselves don't really work"
To be fair, a result that proves you wrong was published just a few days ago - indeed, a field that moves *fast* 😆
The paper is "Masked Autoencoders Are Scalable Vision Learners"
I dunno man, their generated images look pretty blurry by my standards. In fact they look like the output images from every other autoencoder paper. Pretty sure autoencoders are still dead.
The names to be pressed on t-shirts
Backpropagation can be quite wasteful.
Watching this video, the paper seems contradictory, since it says gradients are not all you need and then recommends approximating them using "black box methods". I would say the gradient is not the issue; it is either the way we are computing it or the structure we are using to solve the problem.
This paper sounds great in principle. Please make it a little more practical for some of us trying to apply a backprop loss function to interpret the real estate market here in Nigeria.