OUTLINE
0:00 - Rant about Reviewer #2
6:25 - Intro & Overview
12:25 - Adaptive Optimization Methods
20:15 - Grafting Algorithm
26:45 - Experimental Results
31:35 - Static Transfer of Learning Rate Ratios
35:25 - Conclusion & Discussion
I feel like we need more reviews of reviews. Maybe one day it'll expose how poor the process is currently.
Agreed 💯
The current process is absolutely broken. Getting a paper published is more a function of luck and fame rather than merit.
I loved watching this video especially the review of the reviewer - it made my day!
Particularly since the Manhattan Project, a quasi-religious attitude toward Theory has been developing that is undoing the empirical tradition. While it is true that we cannot inherently generalize from empirical results, just about the only prior we have over those results is the theory of algorithmic information. Yet there have, in recent decades, been repeated examples of important empirical results rejected in peer review simply because there was no accompanying Theory to explain them. If this doesn't sicken you, then there is something wrong with your rational faculties.
Scientist: "Hey, look! When I bump these two rocks together, this third rock floats! Isn't that neat?"
Reviewer: "I don't think that should happen."
Scientist: "Oh, neither did I. But it does! Look, try it yourself!"
Reviewer: "No thanks. I don't think there's enough theory backing up what you're doing here."
Scientist: "Well of course not. This is totally new. What do you expect?"
Reviewer: "I'm going to have to reject your experiments because there isn't strong enough evidence."
Scientist: "....."
Reviewing the reviewers should be standard practice to increase the quality of reviews and, in turn, the value of accepted papers.
reviewing those OpenReview threads is really fun to me personally
The mistake the authors made was not calling the paper "Step Size is All You Need"
It would have been accepted without review
I love your review of the reviews. 😂
there are pretty much two modes to this channel: the facecam one with a more punchy persona, and the one without the facecam with a more academic persona
both are cool in the context of making YT videos
I went back and watched some of the older videos; I think Yannic is still very much grounded and has not "taken off" into memes and clickbait in order to be "successful" on YT
keep up the honest work, I very much like your content
Lol, so by training a lot of models, you can calculate a learning rate schedule. And then you can train your model with larger batch sizes, because you save memory by just using SGD * scheduled-learning-rate. I wonder if this only works on the same dataset, and whether you save compute/time in the end. To get a decent schedule I would expect that you would need a lot of runs to get a robust schedule 😂 Jeremy Howard already used learning rate scheduling years ago across datasets, so I would expect the schedule to somewhat generalise. Thanks for helping us all get smarter ❤️
I'm going to be writing about this paper. Thanks for sharing
31:46 One thing that would be really helpful in building intuition would be seeing how the parameters of Adam/Adagrad change over the course of training. If we see that after some steps they converge to some value, then we graft and switch optimizers. We are assuming that the other variables would stay roughly constant for the remainder of the training. I wonder how that would work out. That way we are not arbitrarily picking 2000 or so steps.
true, but they must have done some experimentation to come up with that number in the first place
This was the case with most ICLR reviews this year. I don't know what they did to the reviewer pool, but I feel the review quality was worse than the previous 2 years (not to say it was stellar then). I have never seen reviews on ICLR that are so all over the place, and so devoid of meaningful dialogue/feedback.
The rant was hilarious lmao
2:34 This is gonna be a meme🤣
Hi Yannic. Thanks for the paper review. Imho I don't see the innovation in this paper. These variants of gradient descent algorithms, normalisation, and preconditioning have been comprehensively studied in the field of optimisation in general. On a theoretical note, the authors seem to assume that the direction of the gradient is fixed and one only needs to find optimal step sizes by transferring learning rates between algorithms. Both overshooting and undershooting the optimal step would result in inefficiency. Furthermore, no mention is made of how the proposed methodology behaves in the neighbourhood of a saddle point.
I don't think that the authors here go for innovation. They clearly propose this as an investigative tool to disentangle implicit step size and direction correction. Any "innovation" is simply a side-product. Also, I don't think they make any of these assumptions, but simply propose this as a transfer mechanism, explicitly discarding the direction of one algorithm. If they were to assume that the direction were fixed, there would be no need to transfer the step size at all, since the directions would be the same.
@@YannicKilcher Thanks for the clarifications. I eventually noticed that the proposed grafting is not a guaranteed gradient descent method. Interesting paper anyway :)
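For concreteness, here is a minimal sketch of what a single grafted step looks like under that reading, with AdaGrad playing the magnitude donor M and plain SGD the direction donor D. This is my own toy code, not the authors'; the names `grafted_update`, `param`, `grad`, and `state` are mine, and the accumulator update is the textbook AdaGrad rule rather than whatever exact variant the paper uses.

```python
import torch

def grafted_update(param, grad, state, lr=0.1, eps=1e-8):
    """One per-layer grafting step (toy sketch): take the direction of the SGD
    step but rescale it so its norm matches the AdaGrad step."""
    state.add_(grad * grad)                        # AdaGrad accumulator for the magnitude donor M
    m_step = lr * grad / (state.sqrt() + eps)      # step M would take
    d_step = lr * grad                             # step D (plain SGD) would take
    scale = m_step.norm() / (d_step.norm() + eps)  # implicit per-layer learning rate correction
    param.sub_(scale * d_step)                     # D's direction with M's magnitude
    return scale

# toy usage on a single weight tensor
w, acc = torch.randn(4, 4), torch.zeros(4, 4)
g = torch.randn(4, 4)                              # stand-in for a gradient
grafted_update(w, g, acc)
```

The returned `scale` is the interesting quantity: D only contributes a direction, and any disagreement about how far to move is absorbed into that single per-layer scalar.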
Did I get that correctly? We only save memory once we have learned how one optimizer "corrects" the other, and then use that knowledge to continue with the one requiring less memory.
correct
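To make the memory point concrete, here is a hypothetical end-to-end sketch of the "static transfer" idea as I understand it; the toy objective, the step counts, and all variable names are mine. Phase one is a short grafted probe run that still needs the AdaGrad accumulator (one entry per weight); phase two continues with plain SGD and only replays the recorded per-step scalars.

```python
import torch

torch.manual_seed(0)
w = torch.randn(8, 8, requires_grad=True)
acc = torch.zeros(8, 8)          # AdaGrad accumulator: as large as the weights themselves
lr, eps = 0.1, 1e-8
schedule = []                    # one scalar per step: cheap to store and to reuse

def loss_fn(w):
    return (w ** 2).sum()        # toy objective standing in for the real training loss

# Phase 1: grafted AdaGrad#SGD probe run, recording the norm ratio at every step
for _ in range(20):
    loss_fn(w).backward()
    with torch.no_grad():
        g = w.grad
        acc += g * g
        m_step = lr * g / (acc.sqrt() + eps)
        d_step = lr * g
        scale = (m_step.norm() / (d_step.norm() + eps)).item()
        schedule.append(scale)
        w -= scale * d_step      # grafted step: SGD direction, AdaGrad magnitude
        w.grad.zero_()

# Phase 2: plain SGD scaled by the recorded schedule; the accumulator can be freed
del acc
for scale in schedule:
    loss_fn(w).backward()
    with torch.no_grad():
        w -= scale * lr * w.grad
        w.grad.zero_()
```

Whether ratios recorded in a short probe run really keep being the right correction later in training is exactly the open question raised in the other comments here.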
An interesting question is whether the dependency observed during the initial steps holds over the course of subsequent training. If the optimizer was initially in a region with small curvature and then got into a high-curvature region, I would expect that grafting won't be beneficial.
That's a great explanation / interesting paper, thank you 🙏🏻. I am wondering why, when people write a paper, they don't put all the signs / letters in a table in an appendix for the reader to get back to. Why do you have to say "I am guessing M is magnitude and D is direction" ☹️? It shakes my whole understanding once we start the guessing process; this happens to me all the time 🤷🏻♂️
I also really hate it when math gets dumped in a paper and you're expected to be fluent in whatever was going through their head.
39:12 "get enough sleep"
How did Yannic know that I watched this at 2:20 AM before sleep?
This is a very similar approach to my current project, which looks to use PBR raytracing optimizers grafted onto SGD models using Blender and the Cycles render engine. 👍
Please share!
Working on it. Blender runs Python natively, though, so if you learn material nodes you can apply it too, right out of the box. I'm still trying to learn all the math and Python lol
Going by a similar idea... why don't we train the model in parts? Maybe train with one optimiser for 100 epochs, then test three possible optimisers for maybe 5 epochs each and use the better one... The loss landscape keeps changing in structure as the model is optimised.
That is probably why AdaBelief (derived from Adam) gives generalization similar to SGD.
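A hypothetical sketch of that "probe a few optimizers, keep the best" idea from a couple of comments up (not from the paper; the function name, the candidate set, and the tiny regression problem are all mine). It clones the model, gives each candidate a few probe steps, and continues from whichever ended with the lowest loss; in practice you would want a held-out batch and more than one probe run, since a single last-batch loss is noisy.

```python
import copy
import torch

def pick_best_optimizer(model, loss_fn, batches, candidates, probe_steps=5):
    """Branch training from the current weights, probe each candidate optimizer
    for a few steps, and keep the weights of whichever reached the lowest loss."""
    best_name, best_loss, best_state = None, float("inf"), None
    for name, make_opt in candidates.items():
        trial = copy.deepcopy(model)             # independent copy per candidate
        opt = make_opt(trial.parameters())
        for x, y in batches[:probe_steps]:
            opt.zero_grad()
            loss = loss_fn(trial(x), y)
            loss.backward()
            opt.step()
        if loss.item() < best_loss:
            best_name, best_loss = name, loss.item()
            best_state = copy.deepcopy(trial.state_dict())
    model.load_state_dict(best_state)            # continue the main run with the winner
    return best_name

# toy usage: probe SGD vs Adam vs RMSprop on a tiny regression problem
torch.manual_seed(0)
net = torch.nn.Linear(3, 1)
data = [(torch.randn(16, 3), torch.randn(16, 1)) for _ in range(5)]
winner = pick_best_optimizer(
    net, torch.nn.functional.mse_loss, data,
    candidates={
        "sgd": lambda p: torch.optim.SGD(p, lr=0.1),
        "adam": lambda p: torch.optim.Adam(p, lr=1e-3),
        "rmsprop": lambda p: torch.optim.RMSprop(p, lr=1e-3),
    },
)
print("continuing with:", winner)
```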
Before ML I used Conjugate Gradients, GMRES, and Levenberg-Marquardt. They all seemed pretty smart. Why does ML use SGD and Adam, which seem simpler and less smart? Especially CG with A-orthogonality in theory works great for elongated valleys with minimal memory requirements.
Conjugate gradient solves a different problem, finding the solution of a linear system, and even if you could adapt it to NNs it requires a symmetric positive definite matrix.
GMRES is more general but is still for solving linear systems, while Levenberg-Marquardt is for least-squares curve fitting.
The problem with all those classical iterative algorithms is that they assume quite strong conditions on the problem you are trying to solve (linearity, convexity, or a single global optimum) and are usually intractable in huge dimensions (like Newton's method). NNs, on the other hand, have almost no theoretical guarantees on anything, and a global solution isn't even important: we have a lot of experimental evidence about the behaviour of NN minima, and those reached by current methods are often as good as you can get.
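For reference, and purely as a restatement of the point above rather than anything from the paper: standard CG is tied to the convex quadratic whose minimizer solves a symmetric positive-definite linear system, which is exactly the structure a neural-network loss lacks.

```latex
% Standard CG minimizes the convex quadratic attached to a symmetric
% positive-definite matrix A, which is the same as solving Ax = b:
\[
  f(x) = \tfrac{1}{2}\, x^{\top} A x - b^{\top} x,
  \qquad
  \nabla f(x) = A x - b = 0 \;\Longleftrightarrow\; A x = b,
  \qquad
  A = A^{\top} \succ 0 .
\]
```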
@@mgostIH First, thanks a lot for your quite detailed answer! I think you have some important points, but on the other hand, I and others have applied these classical methods many times to non-linear problems with good results, and there exist many variations that overcome their limitations, e.g. various optimizations of the update directions. It appears a bit to me that the ML community has tested all available methods and found which works best; maybe less stress was put on determining why certain methods work so well, or maybe I'm just not aware. I agree that in ML the global solution is not important, whereas in physics problems we want to prevent getting stuck in a local minimum as much as possible. BTW, if locally your matrix is indefinite then you have a saddle point and thus no optimum. If your matrix is semi-definite then I think classical optimization, including CG, still works; only the gradient becomes zero. I still have the feeling that there is a disconnect between the traditional mathematicians working on optimization problems and the new big-tech researchers on this point.
@@richardbloemenkamp8532 Don't take it as a knock on older methods that may still work; in my experience there's a lot of overlooked stuff in ML. You could try an implementation of those algorithms yourself and see where they hit a bottleneck or where their assumptions break down in current regimes.
If you have results that are competitive with current methods that can be worth a paper!
Hilarious rant😂😂
what's the name of the discord channel?
Link is in the video description
That's true, such comments will be really disappointing and confusing for PhD students. Please put yourself in the shoes of the person publishing before saying anything, and consider whether it is constructive or destructive. For a researcher to grow requires support from the community / no one is perfect and we all have to start from somewhere.
did you just tell me to get enough sleep
you watched to the end, good job :D
The algebra must be consistent with the topology (N. Bourbaki). Dividing mathematics into "topology", "algebra", "analysis", "geometry", etc., is mainly done for clarity in (undergraduate) teaching. Mathematics should rather be seen as a highly coherent body of knowledge.
"Theory is not reasonable."
OK. I'm done. ×DDD
The worst part of that review is the 5 the reviewer assigned for their confidence in the review.
Also, I think the reviewer's first language isn't English, but on the other hand the reviewer critiques the language; insert something about "glass houses".
this reviewer should be removed from his post
Saying "despite the evidence, I don't feel like will hold up can be perfectly reasonable.
Anytime you're trying to "defeat" a model/optimizer/theory that has held up for a long time, you're more likely to find spurious results, or to have missed something. Case in point: if you were to say this about every paper that claimed to beat Adam, you'd be pretty much right all the time.
Yes, I agree. But that's not the point here. The point is that this is an official reviewer listing their own misunderstanding as weaknesses of the paper, dismissing the authors as amateurs, and directly ignoring the paper's evidence. Sure, this might not generalize (which is what the other reviewers also point out), but you can't just look at a graph showing X and then say "I don't think X", especially as a reviewer. Maybe they meant generalization but failed to formulate that, but given the text of the review, I don't think that's the case.
There's also no "theory" behind Adam or SGD as well, as well as 99 percent of existing optimizer "tricks".
So it's pointless to ask for any theory at all, if the entire field is empirically motivated (whatever trains fast is good stuff) by nature.
So today one can run experiments on 200 GPUs and publish a paper that has no conclusive result... The experiments do not show anything; there is no conclusion, as the results are within the margin of uncertainty (1-2%).
If you don't think there is a conclusive result, I may not have done a good job explaining the paper. Keep in mind that not every paper's goal is to be better than all others.
@@YannicKilcher That reviewer you criticized is clearly not good at writing proper English and probably made their review hastily. However, I think they have a strong point, and so has @John Doe: most of the elicited "phenomena" in this paper seem to be within the margin of uncertainty, and the paper does not provide any supporting theory. I would have rejected it as well for the lack of either of those two kinds of evidence. Probably the authors should have run fewer experiments but provided better statistics (e.g. confidence intervals) for their measurements.
2:31 - We should be able to vote these reviewers out. If your review is nonsensical, you should lose your right to review anything. And maybe more...
academic environment
4:17 - How do people who can't even write become reviewers? *Summary Of The Review... paper is **_insufficinet._*
Yannic Hudziak do you know this person? He's trolling people in comments with GPT.
I'm embarrassed to say it took me two comments to notice...
So there is no one right learning rate adaptation algorithm. It seems to me, the paper is a good example of hacking the optimizers, which themselves are a bunch of hacks.
20:35 The level of science is so low - middle school mathematics is published as a "new method", no theoretical justification, no general theory in deep learning at all, only speculations.
Ok buddy, let’s see your papers
Where does this claim to be a "new method"? I thought this paper only seeks to empirically demonstrate a somewhat more competitive application of SGD to problem domains where its performance was previously thought completely laughable, by searching for a corrective learning rate schedule using adaptive methods.