Beyond neural scaling laws - Paper Explained

  • Published 18 Oct 2024

COMMENTS • 49

  • @amenezes
    @amenezes 2 years ago +5

    Great summary of the paper, thank you!
    I've dived a bit deeper into it and I think the explanation of the theoretical setup in the video does not fully match the one in the paper.
    What I got from the video:
    1. We have a labeled (infinite) dataset
    2. The teacher perceptron learns to label the training data
    3. The student also learns on the training data but only for a few epochs
    4. The margin is the difference between the distance of the point to the teacher and to the student boundaries
    What I got from the paper:
    1. We get (infinite) data points from a normal distribution
    2. We initialize the teacher perceptron with a random weights vector and use it to label the data (i.e. the teacher is only used to generate synthetic labels)
    3. The student learns from the labeled data
    4. The margin is the distance from the point to the student boundary (the teacher is not involved here)
    The results in Fig.1 assume the student is perfectly aligned with the teacher (i.e. the margin perfectly reflects the distance to the real class boundary), while in Fig.2 the authors show the effect of having a misaligned student.
    Let me know your thoughts on this :)
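    For concreteness, the four points above can be sketched as a toy NumPy simulation (my own illustration, not the paper's code; the dimensions, the plain perceptron update rule, and all variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 1000  # input dimension, number of training points

# 1. Data points drawn from a normal distribution
X = rng.standard_normal((n, d))

# 2. A teacher perceptron with random weights labels the data
teacher_w = rng.standard_normal(d)
y = np.sign(X @ teacher_w)

# 3. The student learns from the labeled data (plain perceptron updates)
student_w = np.zeros(d)
for _ in range(10):  # a few epochs
    for x_i, y_i in zip(X, y):
        if y_i * (x_i @ student_w) <= 0:  # misclassified (or on the boundary)
            student_w += y_i * x_i

# 4. The margin is the distance from a point to the STUDENT boundary
margins = np.abs(X @ student_w) / np.linalg.norm(student_w)
```

    In this reading, the teacher is involved only in generating the labels, and the margin is measured against the student's own boundary.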

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      Thanks for the comment! The way you understood it is the way I understood it at first too, but on second thought, it made no sense.
      Thanks to your comment, I am on my third iteration, and your explanation makes sense again, so let's discuss a bit: you mean that we do not need the teacher model for anything other than labelling the data. Then why bother using it to generate the labels, if we could just assume some labels?
      Also, how did they otherwise estimate the angle Theta between the probe student and the teacher T? (paper page 5 top).

    • @amenezes
      @amenezes 2 years ago +3

      @@AICoffeeBreak Thanks for the questions, they also helped me clarify my thoughts. The point where I say the teacher is ONLY used to generate the labels is indeed incorrect. I meant to emphasize that the teacher is not used to compute the margin, but it is actually relevant for the rest of the study.
      To elaborate on the questions, I will explain my understanding of the teacher-student perceptron setup for studying learning mechanics in general, regardless of the phenomenon being studied (data pruning in this case).
      In general, on a machine learning task we have
      1. a set of observations from the "world", which is governed by some unknown real model of the "world"
      2. a model with learnable parameters, which we assume is able to approximate the real model of the "world"
      3. the learning process, where we find the parameters that best fit the observations
      This theoretical setup allows us to isolate the learning process, since
      1. the observations are taken from a "world" which is governed by a known model: the teacher perceptron
      2. the assumption that our model (the student) is able to approximate the real model of the "world" perfectly holds, since they are both perceptrons (we just need to find the right parameters)
      Adding these points to the limit of infinite data and infinite parameters, we get a perfect scenario where we can study the learning process without the influence of the limitations that exist in real scenarios. And since we know the real model that governs our synthetic world, we can also quantify the actual deviation between the real and the learned models (which is different from the error on the observations) and use it when studying learning mechanics.
      I guess this would explain the points you've raised. Disclaimer: I didn't go into the statistical mechanics papers in the references, this is just my interpretation from this paper.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      @@amenezes Thanks for the clarification. I think you are right. This also makes the proposed method more theoretically motivated (the one with the k-means clustering of representations taken from a pretrained model). The discrepancy between theory (in my understanding) and the proposed method was really stark.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      I've pinned your comment as an erratum to the video explanation.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      I am still a bit confused (or the paper is extremely confusing). I cite from the paper's introduction (page 2, numbered point 1): "where examples are pruned based on their teacher margin", so the distance to the teacher boundary is relevant.

  • @Neptutron
    @Neptutron 2 years ago +8

    So awesome that you have NVIDIA as a sponsor xD

  • @WilliamDye-willdye
    @WilliamDye-willdye 2 years ago +4

    The comparison of pruning strategies was very helpful to me. Thank you for summarizing the paper, and best wishes at the conference.

  • @frommarkham424
    @frommarkham424 1 month ago +1

    3:22 thanks for the knowledge🙏we gonna make it out the data center with this tutorial🗣🗣🗣🗣

  • @Erosis
    @Erosis 2 years ago +5

    These results feel intuitive with what I've felt in practice. The math is nuts, though. :)

  • @frommarkham424
    @frommarkham424 1 month ago +1

    4:39 mann the diminishing returns be hitting real hard today💀

  • @thipoktham5164
    @thipoktham5164 2 years ago +2

    I was going to read this paper, thanks for the nice explanation!

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      Great timing! ⏲ Glad it was helpful! :)

  • @Self-Duality
    @Self-Duality 2 years ago +2

    Nice summary analysis 😊💭

  • @joecincotta5805
    @joecincotta5805 3 months ago

    Super interesting. I thought they were going to map the entropy of the dataset, which is kind of what they imply: easy vs. hard is equivalent to novel vs. non-novel data in the distribution.

  • @lighterswang4507
    @lighterswang4507 11 months ago +2

    Very similar to the idea of active learning

  • @cipritom
    @cipritom 2 years ago

    Super nice explanation and reasoning! Thanks for the insight

  • @ScriptureFirst
    @ScriptureFirst 2 years ago +1

    Great content, very accessible. Thank you!

  • @TheNettforce
    @TheNettforce 2 years ago

    Thanks for the great introduction to this topic

  • @vadrif-draco
    @vadrif-draco 2 years ago +1

    Very exciting, thank you

  • @RfMac
    @RfMac 2 years ago +1

    Awesome video, love your explanations!

  • @dr.mikeybee
    @dr.mikeybee 2 years ago +1

    Nicely done!

  • @mandarjoshi6814
    @mandarjoshi6814 2 years ago +2

    11:40 So in the experiment, the authors selected the top 80% most difficult examples from the clusters and did not include the bottom 20% of easy examples during training, because the initial dataset (ImageNet) is fairly large. Is my understanding correct?
    Thanks for explaining.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      Exactly! They do not discard much data: only 20%, while keeping the same performance as when not discarding anything.
      But imagine: when working with billions of examples, discarding 20% of the data is a considerable amount. :)
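      As a rough sketch of that recipe (random vectors stand in here for the pretrained-model embeddings, and the plain Lloyd's k-means and the exact 80/20 split are illustrative assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal((500, 32))  # stand-in for pretrained-model embeddings
k = 10

# Plain Lloyd's k-means: the centroids act as class "prototypes"
centroids = emb[rng.choice(len(emb), size=k, replace=False)].copy()
for _ in range(20):
    d2 = ((emb[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(1)
    for j in range(k):
        if (labels == j).any():
            centroids[j] = emb[labels == j].mean(0)

# Difficulty score: distance to the nearest prototype
# (easy = close to a centroid, hard = far from one)
labels = ((emb[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
dist = np.sqrt(((emb - centroids[labels]) ** 2).sum(1))

# Large dataset: keep the hardest 80%, discard the easiest 20%
keep = np.argsort(dist)[int(0.2 * len(emb)):]
pruned = emb[keep]
```

      Flipping the slice would keep the easiest examples instead; per the thread above, pruning the easy ones only pays off because the initial dataset is large.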

    • @mandarjoshi6814
      @mandarjoshi6814 2 years ago +1

      @@AICoffeeBreak Thank you 🤗

  • @frenchmarty7446
    @frenchmarty7446 2 years ago +3

    Could this be useful for data augmentation?
    For example: assuming I start with a certain size dataset and don't need to prune any examples, could/should I make more augmented copies of the more informative samples? Could I also test to see what kinds of augmentations are more or less useful?

  • @worldofai2924
    @worldofai2924 2 years ago

    Thank you for a great video!

  • @sonataarcfan9279
    @sonataarcfan9279 2 years ago +1

    How do you make those animations, like in the "Exponential scaling in theory" part? Which software do you use? I would really appreciate it if you could tell me :)

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +1

      With PowerPoint. I draw with the drawing functionality. Then I select the drawing, go to the Animations tab and click on Replay.

  • @joecincotta5805
    @joecincotta5805 3 months ago +1

    My new favourite video

  • @flamboyanta4993
    @flamboyanta4993 2 years ago

    The screenshot of the mathematics made me chuckle... in horror. Thanks Letitia for an excellent video!

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +1

      🤣🤣🤣 Yeah, the math part is really impressive. 😏

  • @averma12
    @averma12 2 years ago

    How does this compare to finetuning the same model on less data? How much data would be needed?

  • @kailashj2145
    @kailashj2145 2 years ago +1

    Hey, any update on the giveaway?

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +1

      Check your email. You should have received a notification whether you won or not. ☺️

  • @DerPylz
    @DerPylz 2 years ago +2

    📈

  • @brandomiranda6703
    @brandomiranda6703 2 years ago

    Do you have a mailing list?

  • @kornellewychan
    @kornellewychan 2 years ago +1

    Great work, more like this!

  • @poketopa1234
    @poketopa1234 1 year ago

    Isn't this just hard sample mining?

  • @Quaquaquaqua
    @Quaquaquaqua 1 month ago

    Shouldn't you use density-based clustering?

  • @TheTimtimtimtam
    @TheTimtimtimtam 2 years ago +1

    First :)

  • @JorgetePanete
    @JorgetePanete 2 years ago +1

    0:10 " "*

  • @brandomiranda6703
    @brandomiranda6703 2 years ago +1

    Funny they prune the "easy" examples close to the prototypical centroids. Most few-shot learning methods like fo-proto-maml use prototypical examples as the key. Is this suggesting that doing that is wrong?
    Also, I would have intuitively expected the prototypical examples to summarize the data better and thus be the ones to keep. But they do the opposite. That seems bizarre.
    I think, at least as a sanity check to see whether their theory really holds in practice and to truly challenge their hypothesis, they should've tested the reverse: remove the "hard" examples, and if the results still work out, I'd personally be very skeptical. It probably didn't occur to them to do this due to confirmation bias... it's happened to me! 😳🫣 But it's no excuse. As a reviewer I'd immediately reject it unless my experiment or something equivalent is done. A falsification experiment.

    • @huonglarne
      @huonglarne 1 year ago

      Thanks for the insight. I never would have realized that

    • @huonglarne
      @huonglarne 1 year ago

      I think maybe they want the model to generalize, even for "outliers" in the data.
      Or maybe, when the dataset is imbalanced and some classes are underrepresented, pruning the easy samples may help the model not overfit.
