OUTLINE:
0:00 - Foreword
1:15 - Intro & Overview
3:40 - Backpropagation through iterated systems
12:10 - Connection to the spectrum of the Jacobian
15:35 - The Reparameterization Trick
21:30 - Problems of reparameterization
26:35 - Example 1: Policy Learning in Simulation
33:05 - Example 2: Meta-Learning Optimizers
36:15 - Example 3: Disk packing
37:45 - Analysis of Jacobians
40:20 - What can be done?
45:40 - Just use Black-Box methods
Wut is a Jacobin 🤔
Love those foundational explanations. Don’t ever apologize for them! Keep them coming.
The authors' history is big on gradient-everything (evidenced by the self-citations sprinkled throughout the paper - I didn't look at the author list and already know who's on it LOL), so instead of RL/Evo to learn the optimizer/architecture, they make everything differentiable and cook up complicated schemes to use these gradients. Now I'm impressed that they dug deeper into this direction, criticized it, and released this paper. Great science being done here.
Please try to simplify this paper for the benefit of some of us who are new entrants into the field of AI. Thank you.
Two things:
1. Your explanation of the reparameterization trick is amazing! It's exactly these Aha! moments that I love science for!
2. I really love that you're so honest about the fact that your understanding of the paper might be wrong. As a master's student aiming for a PhD in ML, it's so good to see that not every paper is self-explanatory. When I don't understand a paper, I always tend to doubt my skills. Watching you be honest about this issue really helps. Thank you! 🙏🏼
From what I understand, the paper says that the gradient variance can be thought of as a measure of how inefficient backprop is for a given system. If the variance is high, the weights ping-pong around chaotically without moving in a consistent direction.
The paper then argues that the gradient variance is high in certain chaotic systems, which makes sense, since a small change in initial conditions can make a large difference. Maybe I did not understand it correctly, but I don't think this is a shortcoming of SGD. Rather, it just means that predicting chaotic systems is hard, which is what SGD needs to do in order to optimize. If there are optimizers available that do not need to predict how chaotic systems behave and can still get the job done, then it is preferable to use those instead.
Also, this kind of applies when training a very deep neural network. The top layers of the network become the "chaotic system", and the gradients in the lower layers then get high variance.
44:58 Gradient clipping has always been a band-aid. We do not expect the neural network to learn as efficiently as it otherwise would have; it just makes the impossible possible.
45:45 Black-box methods are usually the first thing we throw at a problem, and then we try differentiable systems. Black-box methods take forever to train compared to differentiable systems, so when a differentiable system works, it is much preferable to use it.
Maybe I am a bit biased because I have put a lot of work into differentiable systems :P Regardless, good paper.
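(To make the chaos point above concrete, here is a toy numpy sketch of my own, not an experiment from the paper: gradients through an unrolled logistic map, whose per-step Jacobian routinely exceeds 1 in magnitude, so the end-to-end gradient varies wildly with the initial state.)

```python
import numpy as np

def unrolled_grad(theta, s0, T=50):
    """Iterate the logistic map s_{t+1} = theta * s_t * (1 - s_t) and
    accumulate d s_T / d theta with the chain rule (forward mode)."""
    s, ds = s0, 0.0
    for _ in range(T):
        # product rule: d s_{t+1}/d theta = s_t (1 - s_t) + theta (1 - 2 s_t) * d s_t/d theta
        ds = s * (1.0 - s) + theta * (1.0 - 2.0 * s) * ds
        s = theta * s * (1.0 - s)
    return ds

# theta = 3.9 puts the map in its chaotic regime.
rng = np.random.default_rng(0)
grads = [unrolled_grad(theta=3.9, s0=rng.uniform(0.2, 0.8)) for _ in range(1000)]
print(np.mean(grads), np.std(grads))   # enormous spread: tiny changes in s0 flip the gradient
```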
Finally I understood the logic behind VAEs and the reparameterization trick. Thank you so much. I failed to get it during my deep learning lectures, but YouTube saved my *ss. What a time to be alive.
Thank you very much! I'm already looking forward to watching the video at my leisure.
It's interesting: if this problem is effectively solved for RNNs by simply using LSTMs, this kind of implies that there could be some sort of corrective layer added to differentiable simulators, meta-learning, and the other chaotic regimes, right? Overall, great video and great paper! Keep up the good work, everyone!
Finally someone showed that RNNs suffer from vanishing/exploding gradients! So cool!
Black-box gradient is not a very good name here. The key is that the gradient is approximated over a scale which is large compared to the fluctuations. There are several "black box" gradients (like the complex-step method) whose scales are normally very, very small while still being accurate. These will have very similar problems to backprop. In signal processing you find similar issues (though for different reasons) when estimating spectra. The solution there was the development of multi-taper spectral techniques, which are essentially multiple-scale methods weighted to optimize the desired result. I suspect a multi-scale analog for Jacobian estimates would be a solution here.
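(For readers who have not seen it, here is a minimal numpy illustration of the complex-step derivative mentioned above; the test function is a placeholder. Note that the probe scale h is essentially infinitesimal, which is the commenter's point: it would inherit the same pathologies as backprop on chaotic objectives.)

```python
import numpy as np

def complex_step_grad(f, x, h=1e-20):
    """Complex-step derivative: f'(x) ~= Im(f(x + i*h)) / h.
    Accurate to machine precision for analytic f (no subtractive cancellation),
    but it still probes f at a single, essentially infinitesimal scale."""
    return np.imag(f(x + 1j * h)) / h

f = lambda x: np.sin(x) * np.exp(x)                 # placeholder analytic function
print(complex_step_grad(f, 1.0))                    # numerical derivative at x = 1
print((np.cos(1.0) + np.sin(1.0)) * np.exp(1.0))    # analytic derivative for comparison
```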
I think the essential question is a limit theorem from analysis: given a function and a Hilbert space, under what conditions is the convergence good enough to take derivatives? It's mathematically subtle, and if you are not careful in carrying out the calculation or choosing the basis, then you will see what is called large variance, and so on.
Such a great explanation! I seriously love the way you go through these papers.
Yannic, I found the basic explanations very useful. You explain things in a very intuitive way, which is very good complementary information for someone like me who self-taught these kinds of subjects from books only. More... give us more! ;)
Are the eigenvalues of the Jacobian differentiable without the Hessian matrix? If so, could we add the eigenvalues to the loss so that the parameters try to keep the eigenvalues below 1 while updating?
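(Eigenvalues are differentiable wherever they are simple, and backpropagating through such a penalty does implicitly bring in second-order terms, which autograd pays for without ever forming the full Hessian. Below is a hedged PyTorch-style sketch of a related idea: penalizing the Jacobian's largest singular value, which upper-bounds the spectral radius, so pushing it below 1 also keeps the eigenvalue moduli below 1. The toy dynamics, margin, and sizes are placeholders of mine, not anything from the paper.)

```python
import torch
from torch.autograd.functional import jacobian

def spectral_penalty(step_fn, s, margin=1.0):
    """Penalize the largest singular value of d step_fn / d s above `margin`.
    With create_graph=True the penalty itself stays differentiable w.r.t. the
    parameters of step_fn (the mixed second derivatives are left to autograd)."""
    J = jacobian(step_fn, s, create_graph=True)      # full Jacobian: fine for small states
    sigma_max = torch.linalg.svdvals(J).max()        # spectral norm >= spectral radius
    return torch.relu(sigma_max - margin)

# Toy dynamics step with a learnable matrix A (placeholder).
A = torch.nn.Parameter(1.5 * torch.eye(3))
step = lambda s: torch.tanh(A @ s)
s0 = 0.1 * torch.randn(3)

penalty = spectral_penalty(step, s0)
penalty.backward()                                   # gradient flows into A
print(penalty.item(), A.grad.norm().item())
```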
Great video, thanks for the detailed explanation!
How do you find the papers that you want to make a video about? Do you just go through arXiv and choose the papers that sound interesting or do you only choose papers from well-known conferences like NeurIPS etc.?
The index problem proposed at 11:30 could actually be OK (notice the s_{i-1}).
I don't really understand two parts of the reparametrization trick.
First of all, you said you sample the vector from a Gaussian. Do you mean a multidimensional Gaussian, or are all of the values in the vector drawn from a single distribution?
Second, if you always sample from the standard normal distribution for the reparametrization trick to work, wouldn't you just get noise? The standard normal distribution always has the same mean and standard deviation, so where would the learned part from the encoder to the latent vector come in?
- Yes, it's usually a multidimensional Gaussian with a diagonal covariance matrix, so you can treat each dimension independently.
- The learned part is the weights that are used to produce mu and sigma, and every sample is multiplied by sigma and shifted by mu. So yes, there is noise from the sampling, but there is also signal coming from the encoder.
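(A minimal numpy sketch of what this reply describes, with placeholder values standing in for the encoder outputs: the randomness comes from a fixed standard normal, and the learned mu and sigma shift and scale it, so gradients can flow back into the encoder.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these came out of the encoder for one input (placeholder values).
mu    = np.array([0.3, -1.2])
sigma = np.array([0.5,  0.8])

# Reparameterization: sample eps from a FIXED standard normal, then shift and scale.
eps = rng.standard_normal(size=mu.shape)
z   = mu + sigma * eps            # z ~ N(mu, diag(sigma^2))

# The sampling node eps carries no parameters, so dz/dmu = 1 and dz/dsigma = eps:
# gradients of the decoder loss flow into the encoder through mu and sigma.
print(z)
```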
@@YannicKilcher Alright, thanks. I did not get the part where the (learned) sigma and mu were used in the reparametrization the first time around, so I got confused. But now it makes sense. Thank you!
Curious to see comparisons and tests of theories about why we sometimes see success on these otherwise ill-posed test problems. It can't just be high dimensionality if the feedback is consistently, effectively random. How do you detect when a step of backprop was too chaotic?
Very awesome revision for me thanks :)
Equations (4)-(6) seem wrong: d l_0 / d Phi = something + d l_0 / d Phi would imply something = 0. Probably they mean something like an L1 or L2 loss in the second d l_0 / d Phi term?
I guess the gradients having high variance speaks in favour of applying some kind of low-pass filter on them - such as what's being done in practice with momentum, Adam, etc.
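(A tiny numpy sketch of that low-pass idea, i.e. the exponential moving average that momentum and Adam's first moment implement; the noisy gradient stream here is synthetic.)

```python
import numpy as np

def ema_filter(grads, beta=0.9):
    """Exponential moving average of a gradient sequence -- the low-pass
    filter that momentum / Adam's first moment apply to raw gradients."""
    m, out = 0.0, []
    for g in grads:
        m = beta * m + (1.0 - beta) * g
        out.append(m)
    return out

# Noisy gradient estimates around a true value of 1.0 (synthetic placeholder).
rng = np.random.default_rng(0)
noisy = [1.0 + 5.0 * rng.standard_normal() for _ in range(200)]
smoothed = ema_filter(noisy)
print(np.std(noisy[50:]), np.std(smoothed[50:]))   # variance drops sharply after filtering
```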
I consider backprop to be analogous to a horse-drawn carriage, whereas some kind of sparse hierarchical predictive modeling, like brains use, is the way forward to sports-car machine learning.
absolutely love this channel
In the end, is it that the gradient calculation, although correct, has large variance due to the stochastic conditions of its approximation (minibatches, truncated backprop), or is it that taking the exact derivative to converge is simply a bad idea?
I think the main point is that in chaotic systems, the gradients can have very large variance, just due to the nature of these systems (small changes lead to large effects).
I would expect they try sampling multiple times and averaging the losses to reduce the variance.
You are both right; the latter is the point of the paper. I guess they study problems where the loss landscape is super bumpy and we cannot smooth it using residuals or batch norm, so we need another way of smoothing it, or we give up on gradients? (Or use them together with RL for the parts with bad gradients?)
One thing I'm always a bit confused by with these more theoretical papers is what task the models in the graphs are actually doing. I would assume it's an important thing to specify in order to argue that the behavior of gradients in those situations is generalizable.
Amazing want more~
May I ask what kind of application you are using to read PDFs with such huge margins on the side? I have been looking for something like that for a while now!
I laughed so hard alone in my room when you said "paper rejected" because of the index. Btw, the index is correct because of the i-1 in the product; it just applies for t >= 1.
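(For reference, here is the standard unrolled-gradient expression this index discussion is about, written in the usual BPTT form rather than copied verbatim from the paper:

$$
\frac{dL}{d\theta} = \sum_{t=1}^{T} \frac{\partial l_t}{\partial s_t}\,\frac{d s_t}{d\theta},
\qquad
\frac{d s_t}{d\theta} = \sum_{k=1}^{t} \left( \prod_{i=k+1}^{t} \frac{\partial s_i}{\partial s_{i-1}} \right) \frac{\partial s_k}{\partial \theta},
$$

where the product over \partial s_i / \partial s_{i-1} is empty, i.e. the identity, when k = t, which is why the indexing only needs t >= 1.)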
If you could also tell us how they came up with this, that would be even greater. Still, the fact that you do this explaining is great.
Imagine backpropagating through backpropagation itself
my brain just went NaN trying to make sense of it 😂
that would be forward propagation, silly…
@@IoannisNousias 5D multiverse propagation with time travel!
You can. Think of the update rule, i.e. w = w + \Delta w; why not extend this expression to a whole polynomial with higher-order terms? But then, why not go even further and replace the rule with an RNN? Now you have to optimize over the RNN, which is then used to update the original network. Look for "learning to learn" models.
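(A hedged PyTorch-style sketch of that idea, loosely in the spirit of the learned-optimizer / "learning to learn" papers rather than their actual code: a shared LSTM cell looks at each coordinate's gradient and proposes that coordinate's update. The toy task, sizes, and the absent meta-training loop are all placeholders.)

```python
import torch
import torch.nn as nn

class LearnedOptimizer(nn.Module):
    """Coordinatewise learned update rule: each parameter's gradient goes
    through a shared LSTM cell, which emits that parameter's update."""
    def __init__(self, hidden=16):
        super().__init__()
        self.cell = nn.LSTMCell(1, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, grad, state):
        h, c = self.cell(grad.reshape(-1, 1), state)   # one "coordinate" per row
        delta = self.head(h).reshape(grad.shape)
        return delta, (h, c)

opt_net = LearnedOptimizer()
w = torch.randn(5, requires_grad=True)                 # parameters of the inner task
state = (torch.zeros(5, 16), torch.zeros(5, 16))

for _ in range(10):
    loss = (w ** 2).sum()                              # toy inner objective: drive w to 0
    grad, = torch.autograd.grad(loss, w)
    delta, state = opt_net(grad, state)                # the RNN proposes the update
    w = (w + delta).detach().requires_grad_(True)
print(loss.item())
# Meta-training opt_net would skip the detach, unroll this loop, and backprop the
# final loss into opt_net's weights -- which is exactly where the paper's
# chaotic-gradient problems show up.
```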
There are quite a few papers which have done that. There are a few Yannic videos on it too.
First! Looking forward to hearing how this applies to inner-optimization routines.
After finishing the video, I'm wondering if the authors could find ways of including some sort of group norm (batch norm, etc.) to reduce the gradient variance.
Good explanations
"We found a mistake... Paper is rejected!" 😂
Gradient is everything we were made to believe
16:30 "Turns out autoencoders by themselves don't really work"
To be fair, a result that proves you wrong was published just a few days ago - indeed, a field that moves *fast* 😆
The paper is "Masked Autoencoders Are Scalable Vision Learners"
I dunno man, their generated images look pretty blurry by my standards. In fact they look like the output images from every other autoencoder paper. Pretty sure autoencoders are still dead.
The names to be pressed on t-shirts
Backpropagation can be quite wasteful.
Watching this video, the paper seems contradictory, since it says gradients are not all you need and then recommends approximating them using "black box methods". I would say the gradient is not the issue; it is either the way we are computing it or the structure we are using to solve the problem.
This paper sounds great in principle. Please make it a little more practical for some of us trying to apply a backprop loss function to interpret the real estate market here in Nigeria.