OUTLINE:
0:00 - Intro & Overview
2:30 - Problem Statement
8:00 - Probabilistic formulation of dynamic halting
14:40 - Training via unrolling
22:30 - Loss function and regularization of the halting distribution
27:35 - Experimental Results
37:10 - Sensitivity to hyperparameter choice
41:15 - Discussion, Conclusion, Broader Impact
Yasss! paper explained is back! :D
About time...
Thank you sir. An international treasure.
Amazing! So happy to see paper explained series back!
I believe that the recurrent structure is the reason that they're able to maintain stability despite attempting to solve two problems at once. My feeling is that the reason it's typically bad to solve two problems at once is that you will be inconsistent about credit assignment in ways that are determined by incidental noise. The incidental noise washes out, in a within-sample sense (as opposed to an across-sample one, which wouldn't be sufficient), due to the recurrent structure of the model. The architecture encourages learning how to do credit assignment correctly in the sense needed for the particular sample under consideration.
Across-sample washing out of incidental noise doesn't work because each sample has a different credit assignment problem associated with it. But for a given sample, at different time steps in the network's operation, the underlying credit assignment problem to be solved remains the same.
So glad to watch your paper introduction video again :-)
Thanks for reviewing this! I love papers that push for different approaches. I think another interesting field coming up is making more things differentiable, like rendering (I am sure you saw that recent painting transformer paper) or optimization.
A benchmark I wish they had included for PonderNet is learning how to sum and do other operations on integers, since that seems to be quite hard even for the largest transformers.
Could you tell me what this "painting paper" is called? I am interested :)
@@WatchAndGame Paint Transformer: Feed Forward Neural Painting with Stroke Prediction
What they do is very similar to DETR (a paper Yannic reviewed). The architecture is quite simple, but the core thing they need is a neural renderer: something that takes the strokes to draw as input and actually renders them onto an image, all while being differentiable so the loss can backpropagate to the rest of the architecture. This lets them avoid Reinforcement Learning, which is usually much less stable.
@@mgostIH Cool thanks!
@@mgostIH Is RL not differentiable? I'm quite new to ML and NNs and I'm not entirely sure what "differentiable" means, other than "you can backpropagate".
@@Supreme_Lobster The main issue of RL is that, while you can make *part of it* differentiable (Deep Q Learning, Policy Gradient), you usually don't have a differentiable model of the game state and no information about what causes a good reward (so you can't backpropagate a loss like "Hey, I want the end-game screen to look like this").
Think for example of Chess: you get a reward only at the end of the game (win/lose), but you don't have information about which specific action was good and which was bad. This is called the "Credit Assignment Problem"; a lot of algorithms try to tackle it, but it's still largely unsolved.
This isn't to say that RL is impossible, but it's one of the areas where ML still struggles a lot: all the methods we use are still very specific, unstable (some runs converge to a good game-playing agent, some don't, out of pure chance) and require **TONS** of compute power for anything non-trivial.
Meanwhile, if you check the Paint Transformer paper, their differentiable renderer allowed them to just optimize everything based on the desired image loss; compared to other approaches that solve the same problem, they trained it much faster and can run it faster too (check their benchmarks).
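To make the contrast concrete, here's a toy PyTorch sketch of the two kinds of training signal: a score-function (REINFORCE-style) update where the renderer/reward is a black box, versus backpropagating an image loss straight through a differentiable renderer. Everything here (the toy "renderer", the made-up reward) is purely illustrative, not from either paper:
```python
import torch

# RL-style: the environment is a black box, so the gradient only reaches
# the policy through the log-probability of the sampled actions.
strokes = torch.randn(8, 5, requires_grad=True)       # hypothetical stroke parameters
dist = torch.distributions.Normal(strokes, 1.0)
sample = dist.sample()                                 # non-differentiable sampling step
reward = -((sample.sum() - 1.0) ** 2)                  # stand-in scalar reward
rl_loss = -(dist.log_prob(sample).sum() * reward)      # REINFORCE estimator
rl_loss.backward()                                     # high-variance gradient

# Differentiable-renderer style: the "renderer" is ordinary tensor ops,
# so the image loss backpropagates directly into the stroke parameters.
strokes2 = torch.randn(8, 5, requires_grad=True)
canvas = strokes2.tanh().mean()                        # toy differentiable "renderer"
image_loss = (canvas - 0.2) ** 2                       # compare against a desired target
image_loss.backward()                                  # direct, low-variance gradient
```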
Welcome back Yannic! I've missed your videos.
Love your work. Very clear explanation. Indeed an interesting innovation.
21:20 my impression is that the (?)regularization(?) or, err, the term they add to make it prefer to halt earlier if it can while still having good results, should somewhat counteract that? But maybe it wouldn’t be enough, I wouldn’t know
Edit: nvm you were about to get to that part
Oh good, I remembered that word “regularization” correctly.
Hello Yannic, you confused p with λ in the loss function: p_n = λ_n · ∏_{j&lt;n} (1 − λ_j). This is why the trivial solution is not making all lambdas equal to zero.
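For anyone who wants to see it numerically, here's a minimal sketch of how the halting distribution follows from the per-step λ's (plain PyTorch; the specific values and variable names are my own):
```python
import torch

# Per-step halting probabilities lambda_1..lambda_N predicted by the network;
# forcing the last one to 1 makes p a proper distribution.
lambdas = torch.tensor([0.1, 0.3, 0.5, 1.0])

# Probability of still running before step n: prod_{j<n} (1 - lambda_j)
still_running = torch.cat([torch.ones(1), torch.cumprod(1 - lambdas, dim=0)[:-1]])

# p_n = lambda_n * prod_{j<n} (1 - lambda_j)
p = lambdas * still_running
print(p)        # [0.1000, 0.2700, 0.3150, 0.3150]
print(p.sum())  # sums to 1 (up to float precision)
```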
He acknowledged that later
Omg a new papers explAIned video 😍 my brain is about to explode.
Welcome Back!!!
Nice! I normally take notes while watching these and often leave side notes to myself about stuff I didn't understand, so I can look into it further in the paper. But this time I paused and wrote a note that I was confused about the loss function, because I don't get how they handle the risk of λ going to 0 and the two-variable problem being unstable... unpause, and you say basically the exact same concerns. I feel like I actually must have understood an ML paper at first glance for once! It was very gratifying, haha.
I think the regularization term does a lot of work in forcing the loss to push towards a sane output, though. But that creates an assumption about the computation that might not hold in the real world. I mean, if I'm given a math problem, I don't gradually improve my understanding past some threshold: some math problems are instant, some I can't solve. At least at first glance, as I type this, I don't think this algorithm will be as useful on types of problems that need highly variable amounts of computation, but I'd probably have to implement it to be certain.
Love such papers! So much better than 'all you need' hype
As always, you've made another fascinating video. Thank you. What I wonder is what kinds of models can be trained and used for inference using this architecture on small GPUs? Does this open up possibilities given resource constraints? Can I get GPT3-like performance on a K80 using PonderNet because my network isn't so deep? Or is this just a way to speed up inference? I suppose that with each pass through the model, the combinations of parameters multiply to a Cartesian product, but it's not intuitive to me how this works with a backward pass. After all, this doesn't seem to give new functionality over a feed forward model other than the ability to halt early. In other words, only the same kinds of things can be learned, but perhaps they can be learned more quickly.
I'm no expert, but... it seems like a DreamCoder penalizing Kolmogorov complexity works better for parity, along with the general idea of "aligning model & task complexity"?
I was kinda hoping for ablation for the KL divergence. Good stuff though.
What an interesting idea!
Interesting. Good explanation
I wonder if one can just use the expectation of the distribution induced by p_i as a regularizer. Such a regularizer would not force a geometric shape on p_i, just ask it to take fewer steps. And the network would be able to model things like sudden changes in p_i more easily.
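For comparison, here's a rough PyTorch sketch of both options: the paper's KL regularizer against a (truncated) geometric prior, and the expected-number-of-steps penalty suggested above. The helper names and the lambda_p value are my own illustration, not taken from the paper's code:
```python
import torch

def geometric_prior(lambda_p, n_steps):
    # p_G(n) = lambda_p * (1 - lambda_p)^(n-1), truncated to n_steps and renormalized
    n = torch.arange(n_steps, dtype=torch.float32)
    prior = lambda_p * (1 - lambda_p) ** n
    return prior / prior.sum()

def kl_regularizer(p, lambda_p=0.2):
    # KL(p || geometric prior): pushes p towards a geometric shape
    prior = geometric_prior(lambda_p, p.shape[0])
    return torch.sum(p * torch.log(p / prior))

def expected_steps_regularizer(p):
    # E[n] under the halting distribution: only asks for fewer steps,
    # without prescribing any particular shape for p
    steps = torch.arange(1, p.shape[0] + 1, dtype=torch.float32)
    return torch.sum(p * steps)

p = torch.tensor([0.1, 0.27, 0.315, 0.315])
print(kl_regularizer(p), expected_steps_regularizer(p))
```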
Consider doing a video on PerceiverIO, it's a major upgrade to vanilla Perceiver and I can easily see it's descendants taking over many areas
Hello Yannic! Can you teach us to matrix multiply without multiplying?
Hi, thanks for your video!
I plan to do a project on the complexity of tasks on image datasets like ImageNet and CIFAR-100. If I use a vision transformer, can I implement my project? And is it meaningful?
Yannic, please do a video on the new Fastformer
The recurrent part seems somewhat like a GAN? ACT is like AdaBoost, while PonderNet is like a boosting tree.
41:00 "it is completely thinkable" lol I think the word you're looking for is plausible?
I don't see how the training works with that added output at every timestep.
By adding up all possible outputs and their probabilities, you get an overall, statistical error but no feedback signal for individual outputs?
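The way I read it, each step's output does get its own gradient: the reconstruction loss is a probability-weighted sum of per-step losses, so the gradient flowing to step n's output is scaled by p_n rather than averaged away. A small sketch (my own toy example with a classification loss, not the paper's code):
```python
import torch
import torch.nn.functional as F

per_step_logits = torch.randn(4, 10, requires_grad=True)  # outputs from 4 unrolled steps
p = torch.tensor([0.1, 0.27, 0.315, 0.315])                # halting distribution
target = torch.tensor([3])                                 # true class

# L_rec = sum_n p_n * L(y, y_hat_n): a weighted sum of per-step losses,
# not a single blended output.
per_step_losses = torch.stack([
    F.cross_entropy(per_step_logits[n].unsqueeze(0), target)
    for n in range(per_step_logits.shape[0])
])
loss = torch.sum(p * per_step_losses)
loss.backward()

# Each step's output receives its own gradient, scaled by that step's p_n.
print(per_step_logits.grad.abs().sum(dim=1))
```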
It's Monday, folks!!!
Can you do a Tesla ai day review.
22:18 Why can't you just add a small loss for low halting probability, so that the network tries to increase it?
Does the paper references universal transformers?
Yes it does! In the bAbI tasks they compare them with transformers + PonderNet, which seems to do better, but IMO the big deal of the paper is that the method is very general and can be applied to anything you might think of.
@@mgostIH So there isn't really an "architecture" in the sense of, say, Transformers vs LSTMs. The contribution is more: (1) The clearer formulation (?), and (2) The corrected term for the stopping probability. Yes?
@@aspergale9836 Indeed, you can apply this method to pretty much any DL model you can think of: instead of adding more layers, you use this procedure so that the network learns how deep it needs to be for each input.
In this sense, it's similar to Deep Equilibrium Models, without the need to redefine backpropagation.
Maybe I'm just a noob and I'm missing something... But why not just train a feed-forward network as a halting mechanism on top of another simple CNN, like an NN manager? It seems way simpler than integrating the halting procedure into a single network.
That's entirely possible in this framework. The step function can be two different NNs, or a combined one.
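To illustrate the reply: a rough sketch of a step function where the halting decision comes from its own small head sitting next to the "compute" network. All module and variable names here are my own, not the paper's:
```python
import torch
import torch.nn as nn

class PonderStep(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.recurrence = nn.GRUCell(input_dim, hidden_dim)   # the "worker" network
        self.output_head = nn.Linear(hidden_dim, output_dim)  # prediction y_hat_n
        self.halt_head = nn.Linear(hidden_dim, 1)              # the "manager": predicts lambda_n

    def forward(self, x, h):
        h = self.recurrence(x, h)
        y_hat = self.output_head(h)
        lam = torch.sigmoid(self.halt_head(h)).squeeze(-1)    # lambda_n in (0, 1)
        return h, y_hat, lam

step = PonderStep(input_dim=16, hidden_dim=32, output_dim=10)
x, h = torch.randn(8, 16), torch.zeros(8, 32)
h, y_hat, lam = step(x, h)   # one pondering step for a batch of 8 inputs
```
The halting head and the prediction head are still trained jointly through the same loss, which is essentially what "a combined one" means above.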
Did anyone catch how they normalized the probabilities (lambdas) across time?
There's a hyperparameter determining the minimum cumulative halt probability before ending network rollouts. I'm guessing that when calculating the expected loss, they normalize by the actual cumulative halt probability of the rollouts during training?
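For what it's worth, here's one plausible way the truncation could be handled: assign whatever probability mass is left to the final unrolled step so the distribution sums to 1. This is my guess at an implementation, not something I've verified against the paper or its code:
```python
import torch

def truncated_halting_distribution(lambdas, eps=0.05):
    # Unroll until the cumulative halting probability exceeds 1 - eps
    # (or we run out of steps), then assign the leftover mass to the
    # last unrolled step so that p sums to 1.
    p, still_running = [], 1.0
    for lam in lambdas.tolist():
        p.append(still_running * lam)
        still_running *= 1.0 - lam
        if 1.0 - sum(p) < eps:
            break
    p[-1] += still_running
    return torch.tensor(p)

p = truncated_halting_distribution(torch.tensor([0.1, 0.3, 0.5, 0.9, 0.9]))
print(p, p.sum())   # stops after 4 steps; sums to 1
```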
QUICK YANNIC! THE TESLA AI DAY IS OUT!
No hurry, it can be stressful. Some are so eager that they don't love slow-paced videos. But yeah, we would love for you to present those Tesla snippets.
Lex Fridman did a review of that, you can check it out
General Kenobi
Hello there
Holla Todos
I dislike the wishful mnemonics in the paper's title
Be honest, Yan, you down vote your own vids right? Lol you've got a loyal hater out there if not
All things in the universe must have balance :D