Big Self-Supervised Models are Strong Semi-Supervised Learners (Paper Explained)

  • Published 2 May 2024
  • This paper proposes SimCLRv2 and shows that semi-supervised learning benefits a lot from self-supervised pre-training. And stunningly, that effect gets larger the fewer labels are available and the more parameters the model has.
    OUTLINE:
    0:00 - Intro & Overview
    1:40 - Semi-Supervised Learning
    3:50 - Pre-Training via Self-Supervision
    5:45 - Contrastive Loss
    10:50 - Retaining Projection Heads
    13:10 - Supervised Fine-Tuning
    13:45 - Unsupervised Distillation & Self-Training
    18:45 - Architecture Recap
    22:25 - Experiments
    34:15 - Broader Impact
    Paper: arxiv.org/abs/2006.10029
    Code: github.com/google-research/si...
    Abstract:
    One paradigm for learning from few labeled examples while making best use of a large amount of unlabeled data is unsupervised pretraining followed by supervised fine-tuning. Although this paradigm uses unlabeled data in a task-agnostic way, in contrast to most previous approaches to semi-supervised learning for computer vision, we show that it is surprisingly effective for semi-supervised learning on ImageNet. A key ingredient of our approach is the use of a big (deep and wide) network during pretraining and fine-tuning. We find that, the fewer the labels, the more this approach (task-agnostic use of unlabeled data) benefits from a bigger network. After fine-tuning, the big network can be further improved and distilled into a much smaller one with little loss in classification accuracy by using the unlabeled examples for a second time, but in a task-specific way. The proposed semi-supervised learning algorithm can be summarized in three steps: unsupervised pretraining of a big ResNet model using SimCLRv2 (a modification of SimCLR), supervised fine-tuning on a few labeled examples, and distillation with unlabeled examples for refining and transferring the task-specific knowledge. This procedure achieves 73.9% ImageNet top-1 accuracy with just 1% of the labels (≤13 labeled images per class) using ResNet-50, a 10× improvement in label efficiency over the previous state-of-the-art. With 10% of labels, ResNet-50 trained with our method achieves 77.5% top-1 accuracy, outperforming standard supervised training with all of the labels.
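    For readers who want the recipe at a glance, here is a rough pseudocode sketch of the three-step procedure the abstract describes. The function names and arguments are placeholders for illustration only, not the authors' actual code:

    def simclrv2_semi_supervised(unlabeled_images, labeled_subset, big_resnet, small_resnet):
        # pretrain_contrastive / finetune / distill are hypothetical helpers, not real APIs.
        # 1. Unsupervised pretraining: contrastive (SimCLRv2) learning on all unlabeled images.
        teacher = pretrain_contrastive(big_resnet, unlabeled_images)
        # 2. Supervised fine-tuning on the few labeled examples (e.g. 1% or 10% of ImageNet labels).
        teacher = finetune(teacher, labeled_subset)
        # 3. Distillation: reuse the unlabeled images, now task-specifically, by training a
        #    (possibly smaller) student to match the teacher's predicted label distribution.
        student = distill(teacher, small_resnet, unlabeled_images)
        return student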
    Authors: Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, Geoffrey Hinton
    Links:
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
  • Science & Technology

COMMENTS • 79

  • @drhilm
    @drhilm 3 years ago +61

    You are so good at distilling the knowledge from these papers and making the top insights clear. Thanks.

  • @CosmiaNebula
    @CosmiaNebula A year ago

    The "projection layer" is not an architecture, but a job description. Any module that performs the job of a projection is a "projection layer".
    SimCLR is an abstract framework for self-supervised contrastive learning. It consists of the following components:
    1. data augmentation: turning data points into data point pairs (or triples, or n-tuples), to be used for contrastive learning.
    2. working layer: a module for turning data points into general representations
    3. projection layer: a module for turning general representation into specific representation adapted to specific purposes.
    4. student network: a separate network into which the teacher network is distilled.
    In the paper, SimCLRv2 is concretely instantiated as follows:
    1. data augmentation: random cropping, color distortion, and gaussian blur
    2. working layer: ResNet-152
    3. projection layer: 3-layered MLP
    4. student network: ResNet but smaller than ResNet-152
    The idea of a projection layer is to let the working layer focus on learning the general representation, instead of learning both a general representation AND the specific task during self-supervised training. Even self-supervised training is not a general task; it is specific, as they say in the SimCLRv1 paper:
    > We conjecture that the importance of using the representation before the nonlinear projection is due to loss of information induced by the contrastive loss. In particular, z = g(h) is trained to be invariant to data transformation. Thus, g can remove information that may be useful for the downstream task, such as the color or orientation of objects. By leveraging the nonlinear transformation g(·), more information can be formed and maintained in h.
    This is similar to how, in iGPT (2020), the authors found that linear probing works best in the middle. Probably because in the middle, the Transformer has fully understood the image, and then starts to focus back on predicting the next pixel. Imagine its attention as a spindle: starting local, going global, and finally becoming local again.
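    For concreteness, a minimal PyTorch sketch of the working-layer / projection-layer split described above (the ResNet-50 backbone and the dimensions are illustrative assumptions, not the paper's exact configuration):

    import torch.nn as nn
    import torchvision

    class ContrastiveModel(nn.Module):
        def __init__(self, feature_dim=2048, proj_dim=128):
            super().__init__()
            backbone = torchvision.models.resnet50()
            backbone.fc = nn.Identity()       # "working layer": outputs the general representation h
            self.encoder = backbone
            self.projection = nn.Sequential(  # "projection layer": 3-layer MLP used for the contrastive task
                nn.Linear(feature_dim, feature_dim), nn.ReLU(),
                nn.Linear(feature_dim, feature_dim), nn.ReLU(),
                nn.Linear(feature_dim, proj_dim),
            )

        def forward(self, x):
            h = self.encoder(x)      # h is kept for downstream tasks / fine-tuning
            z = self.projection(h)   # z is used only by the contrastive loss
            return h, z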

  • @Progfrag
    @Progfrag 3 years ago +17

    Wow! So self-distillation is basically label smoothing but smoothing at the right places instead of evenly
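    A toy numeric illustration of that point (my own, with made-up numbers for a 4-class problem): label smoothing spreads the leftover probability mass evenly over the wrong classes, while a distillation target concentrates it on the classes the teacher finds plausible.

    import torch

    num_classes, eps = 4, 0.1
    one_hot = torch.tensor([1.0, 0.0, 0.0, 0.0])

    # label smoothing: the same eps/num_classes goes to every class
    label_smoothed = (1 - eps) * one_hot + eps / num_classes
    # -> tensor([0.9250, 0.0250, 0.0250, 0.0250])

    # a teacher's soft target: leftover mass lands mostly on the classes it considers similar
    teacher_target = torch.tensor([0.9000, 0.0800, 0.0150, 0.0050])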

  • @ensabinha
    @ensabinha A month ago +1

    Essentially, they pre-train with contrastive learning and fine-tune, then do pseudo-labeling (but using the full probability distribution over the labels) and retrain on that.
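    As a minimal sketch of that last retraining step (my own PyTorch code, not the paper's; the temperature value is an assumption): the student is trained against the teacher's full output distribution on unlabeled images.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=0.1):
        """Cross-entropy between the teacher's softened distribution and the student's prediction."""
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
        student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

    # Usage on a batch of unlabeled images (teacher frozen, student being trained):
    # with torch.no_grad():
    #     teacher_logits = teacher(unlabeled_batch)
    # loss = distillation_loss(student(unlabeled_batch), teacher_logits)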

  • @sudhanshumittal8921
    @sudhanshumittal8921 3 years ago +4

    Thanks a lot Yannic for the latest updates.

  • @mohammadaliheydari9093
    @mohammadaliheydari9093 2 years ago

    very good presentation. thank you Yannic!!

  • @ralf2202
    @ralf2202 2 years ago +4

    Yannic, you are a great teacher network! Thank you.

  • @jaesikkim6218
    @jaesikkim6218 3 years ago +1

    Really awesome explanation! Easy to understand!

  • @SungEunSo
    @SungEunSo 3 years ago +1

    Thank you for the great explanation!

  • @alviur
    @alviur 3 years ago

    Thanks a lot Yannic!

  • @teslaonly2136
    @teslaonly2136 3 years ago +3

    I was stunned when I saw the broader impact section.

  • @salmaalsinan8612
    @salmaalsinan8612 2 years ago +1

    It's been a while since I laughed while listening to something technical :D. Excellent review, and I appreciate the funny commentary, as I had similar questions.

  • @christianleininger2954
    @christianleininger2954 3 years ago +1

    Great job! Amazing.

  • @authmanapatira3016
    @authmanapatira3016 3 years ago

    Love all your videos.

  • @dipamchakraborty
    @dipamchakraborty 3 years ago +18

    I think there are two reasons to make a big deal out of that extra projection layer.
    1. It's not standard practice, so their comparisons with previous methods aren't fully fair, and doing this might improve other methods as well.
    2. The last layer of ResNet-50 is CNN -> Activation -> Global Average Pooling, so it's kind of different from regular models with only a single linear layer on top of the CNN.

    • @quAdxify
      @quAdxify 2 years ago +2

      There usually isn't an activation in front of GAP, I think at least. But yeah, it's basically not just a stacked matrix multiplication (which would be equivalent to just using a wider layer), because of GAP. It's pretty obvious why it works better, though: they are basically bringing back the fully connected layer that was commonplace before GAP replaced it for most cases, so there is simply more representational power. We shouldn't forget that a fully connected layer has orders of magnitude more weights than a conv layer (it depends on the number of filters, but let's keep that reasonable). I'd bet it wouldn't matter if they just replaced GAP with a regular fully connected layer.

  • @sam.q
    @sam.q 3 years ago +1

    Thank you!

  • @sathisha2394
    @sathisha2394 3 years ago +3

    You are so good at explaining things in a non-mathematical way, which helps me grasp the insights very quickly. I feel like I have gained so much knowledge just from watching your videos. Thank you so much. Keep posting. Can you make a video about SIREN?

  • @ProfessionalTycoons
    @ProfessionalTycoons 3 years ago +4

    Such a great paper, still so many secrets to unravel.

  • @Guesstahw
    @Guesstahw 3 years ago

    Many thanks @Yannik for the video, you did a great job. On the intuition behind the Figure 1 plots and why they look that way, here are my two cents:
    You just have to think in terms of the percentage of parameters that are trainable for the downstream task. First, keep in mind that growing the model size means growing only the encoder; the size of the classification (linear) head stays constant. Since only the head parameters are trained during fine-tuning, as you grow the self-supervised encoder the ratio of trainable parameters (the head) shrinks relative to the total model parameters. A downstream task with fewer labels therefore benefits more from this drop in the percentage of trainable parameters (as the encoder grows) than its counterparts with more labels. I think that is the intuition behind the observed larger gains. In other words, the fewer the labels, the more expressive an encoder is required, to capture as much information about the structure and geometry of the (unlabeled) data as possible and compensate for the shortage of labels.

  • @AdnanKhan-cx9it
    @AdnanKhan-cx9it A year ago

    That horrific background sound at 18:43, btw. Excellent explanation as always.

  • @vitocorleone1991
    @vitocorleone1991 A year ago

    I salute you sir!!!

  • @JavierPortillo1
    @JavierPortillo1 2 years ago

    Thanks! Very clearly explained! Could you please explain the SwAV model?

  • @PhucLe-qs7nx
    @PhucLe-qs7nx 3 years ago +1

    Self-distillation is bootstrapping / self-play in RL. The recent BYOL paper also uses bootstrapping to ditch the negative samples altogether.
    I guess the reason this self-play or distillation works is the initial inductive bias in the random initialization + architecture.
    If you can't bootstrap and learn from the initial inductive biases, no learning is possible. And since we know learning is possible, even from zero labels, as long as the inductive biases and procedure are correct, bootstrapping / self-distillation / self-play must work.

  • @johnkrafnik5414
    @johnkrafnik5414 3 years ago

    Great video, thanks for making this so digestible.
    I am curious what the long-term goal is here; it feels like we are piling on hack after hack to gain small percentage points. I understand that the overall goal of transitioning to semi-supervised learning is important, but so far it feels very incremental.

  • @christianleininger2954
    @christianleininger2954 3 years ago +1

    I really like your videos. Maybe you would like to make a video about the paper "Accelerating Online Reinforcement Learning with Offline Datasets".

  • @herp_derpingson
    @herp_derpingson 3 years ago +7

    Great paper. Definitely the quality you expect from Hinton. Fun fact: his great-great-grandfather was George Boole (of Boolean algebra).
    .
    21:20 I think it's worth noting that ResNet-50 probably went through some extensive hyperparameter tuning to do exactly what it was supposed to do, and thus had a fixed number of dense layers at the end. So perhaps adding a new layer just happens to help in the problem we are trying to solve, i.e. the teacher-student thing instead of one-hot labels.
    .
    18:43 The whistling in the background. Is someone snoring?

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      Wow, didn't know Hinton had royal blood :D
      Yea I agree this extra layer is super problem specific, but I don't get why they don't just say the encoder is now bigger, but instead make the claim that this is part of the projection head.
      and no, I have no clue what that noise is O.o

  • @drdca8263
    @drdca8263 3 years ago +2

    Regarding the broader impact statement, while I generally agree that many broader impact statements appear to not be useful, I do think that the “where it is more expensive or difficult to label additional data than to train larger models” point, along with the example of needing clinicians to carefully create annotations for the medical applications, was probably worth saying. That part appears to point to a specific area in which this improvement is useful. Of course, it would still be interesting even if it couldn’t be used for anything, but I do think that detail is still worthy of note.
    I imagine (with no real justification) that the reason that they mentioned crop yield was because they felt obligated to include at least one negative example, but wanted the positive examples they listed to outnumber the negative ones, so they needed a second one.
    Another beneficial use-case where getting labeled data is especially expensive or difficult, compared to other use-cases, and where it is clear that that is the case, may have been better than the part about food, but eh.

    • @YannicKilcher
      @YannicKilcher  3 years ago +3

      Yea it's kind of like a job interview where they ask you about your weaknesses and you want to say something that's so minor it's almost irrelevant :D
      Jokes aside, it's actually awesome that you don't have to collect as many labels. But that doesn't belong in the broader impact section, at least not as it is defined by NeurIPS, because it still deals with the field of ML. In the BI section, you're supposed to preview how your method will influence greater society.

  • @shivanshu6204
    @shivanshu6204 3 years ago

    Damn you went hard after the broader impact lol.

  • @theodorosgalanos9663
    @theodorosgalanos9663 3 years ago +2

    Thanks Yannic this is great! I wonder, are you aware of any approach that deals with domains where augmentation, at least most of it, is not available? The best I remember is the ablation study on augmentations from...I forget which paper, might have been v1 of this one? In my domain, most augmentations, other than random crop, invalidate the image completely (they are physics simulations), I wonder if anyone has tested if the SSL approach still helps in these cases.

    • @YannicKilcher
      @YannicKilcher  3 years ago

      No idea. Yes, I also recall that crop is the main driver of performance here.

  • @victorrielly4588
    @victorrielly4588 3 years ago +1

    Very interesting hypothesis about why a bigger model provides better improvements through self-supervised learning. However, I would caution that bigger models do not necessarily mean more learned features: for instance, suppose you use a giant model where the last layer is 1-dimensional. The dimensionality of the feature space does not depend on the size of the model at all, but on the dimensionality of the output layer.

  • @phsamuelwork
    @phsamuelwork 3 years ago +2

    Broader impact... it is like something one puts in an NSF proposal.

  • @eelcohoogendoorn8044
    @eelcohoogendoorn8044 3 years ago +2

    So.. putting a bunch of cool existing methods together works pretty well? Sarcasm aside, the extensive experiments are appreciated.

  • @johngrabner
    @johngrabner 3 years ago +2

    Wow, maybe this paper discovered why we dream.

  • @grinps
    @grinps A year ago

    Thanks for the great review. What app did you use to read and annotate the PDF in this video?

  • @bengineer_the
    @bengineer_the 3 years ago +1

    Hi Yannic, this set of ideas feels like gold. This is how 'we' as humans learn. Children are allowed to experience the world with as few 'adult labels' as possible, to "get a feel" for the world. We then come along and explain things; they kind of memorise what you said, but then years later come back going, "Ahh, now I get it on my own terms." So perhaps the warning about future abuse of this technique is somewhat valid. Could we now make a form of "accumulative consciousness scheme" (over all time) that could then be queried & labelled in the future, retroactively plucking the knowledge after you become aware of the concept-label? This could be quite far-reaching.

    • @bengineer_the
      @bengineer_the 3 years ago

      Hmm, how about teaching such a system as described, then [later] giving it an inference-based connection to the internet (let it search) and letting it figure out the labels later? Going on a tangent, but I wonder if there has been much research into clustering multiple image classifiers & NLP transformers into a label-acquisition learning scheme?

    • @bengineer_the
      @bengineer_the 3 years ago

      This form of learning (minimal-labelling combined with jittered-input forms) gives the network time to breathe. A bad teacher barks the answer. I like it a lot.

    • @YannicKilcher
      @YannicKilcher  3 years ago

      Super interesting suggestions, I think what you're describing really goes into the direction of AGI where the system sort-of learns to reflect on what it knows and how it can learn the things it doesn't!

  • @SachinSingh-do5ju
    @SachinSingh-do5ju 3 years ago +8

    You have fans..,
    And many of them 😛
    I am one now

  • @hexinlei6250
    @hexinlei6250 3 years ago +1

    Really good presentation!!! btw, may I ask what's the presenting app?

  • @RohitKumarSingh25
    @RohitKumarSingh25 3 years ago +1

    So the only novel idea in this paper is adding the self-training or distillation part, right? I wonder how we had never thought of it before for unlabelled data, given it seems so obvious, especially after realising the benefits of label smoothing and the mix-up technique.

  • @mhadnanali
    @mhadnanali A year ago

    You are really good at paper reading. How do you gain this skill?

  • @sudhanshumittal8921
    @sudhanshumittal8921 3 years ago +3

    And that saturates the semi-supervised image classification performance. The community needs more realistic/harder benchmarks.

  • @twobob
    @twobob A year ago

    agree

  • @slackstation
    @slackstation 3 years ago +8

    From this one video alone, I feel like I've learned so many different insights. I'm still trying to level up to where I understand the math annotations as easily and clearly as Mr. Kilcher, but the insights here are amazing.
    If I could suggest a paper/video to explain, SIREN: ua-cam.com/video/Q2fLWGBeaiI/v-deo.html Paper: arxiv.org/abs/2006.09661
    The video does a decent job of explaining the concept and application. I'm more interested in your opinion on the impact this could have on the rest of the field by replacing ReLU and others with SIREN. As always, thank you for your work.

    • @RohitKumarSingh25
      @RohitKumarSingh25 3 years ago +1

      Agree. Yannic please review this paper if you get time.

  • @rajeshdhawan4624
    @rajeshdhawan4624 3 years ago

    I want to connect about the same... kindly let me know how?

  • @RobNeuhaus
    @RobNeuhaus 3 years ago +1

    Do you have more information or intuition on self-distillation? Why does distilling the same model/architecture on unlabeled data, using an identical architecture, improve the student over the teacher?

    • @YannicKilcher
      @YannicKilcher  3 years ago

      because it sees more data than the teacher

  • @theodorosgalanos9663
    @theodorosgalanos9663 3 years ago +1

    So SSL gives us access to a sort of large feature space, and distillation filters out which of those features are important for the task at hand? I wonder if there is an experiment without distillation to see if that extra noise in the feature space hurts (so fine-tune and predict without the student). Okay, I'll stop being lazy and check!

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      Yes, the first experiments in the paper are without distillation, as far as I understand (it's not explicitly clear, though)

  • @nopnopnopnopnopnopnop
    @nopnopnopnopnopnopnop 2 years ago

    I still don't get the self-distillation part. If the teacher and the student are the same network, then they produce the same outputs. So what is there to even learn?
    In this case, the student didn't have the additional projection layer, so at least the networks aren't identical (though I still don't understand what there is to learn). But Kilcher made it look like it would be useful even if the networks were the same.

  • @sacramentofwilderness6656
    @sacramentofwilderness6656 3 years ago +1

    I would like a neural network that slows down time so I can keep up with the advances in machine learning and AI.

  • @sayakpaul3152
    @sayakpaul3152 3 years ago +2

    22:13 why did you mention you were wrong in the supervised loss part? Sorry if this is a redundant question.

    • @YannicKilcher
      @YannicKilcher  3 years ago

      I just re-watched it and I can't figure it out myself :D

    • @sayakpaul3152
      @sayakpaul3152 3 years ago

      @@YannicKilcher no worries man. I think these little traits make us human. Anyway, great explanation as always.

  • @andres_pq
    @andres_pq 3 years ago +1

    What is the difference between contrastive loss and triplet loss?

    • @YannicKilcher
      @YannicKilcher  3 years ago

      Haven't looked at triplet loss yet, but contrastive loss has an entire set of negatives

    • @snippletrap
      @snippletrap 3 years ago

      Triplet loss is a kind of contrastive loss
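      To make the difference concrete, here is a small sketch of both losses (my own code; the margin and temperature values are assumptions). Triplet loss compares one anchor against a single positive and a single negative, while an NT-Xent-style contrastive loss treats every other example in the batch as a negative:

      import torch
      import torch.nn.functional as F

      def triplet_loss(anchor, positive, negative, margin=0.2):
          # one positive and one negative per anchor
          d_pos = (anchor - positive).pow(2).sum(-1)
          d_neg = (anchor - negative).pow(2).sum(-1)
          return F.relu(d_pos - d_neg + margin).mean()

      def nt_xent_loss(z1, z2, temperature=0.1):
          # z1, z2: (N, D) projections of two augmented views of the same N images
          z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)      # (2N, D)
          sim = z @ z.t() / temperature                            # cosine similarities as logits
          sim.fill_diagonal_(float('-inf'))                        # drop self-similarity
          n = z1.shape[0]
          targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])  # each view's positive is its pair
          return F.cross_entropy(sim, targets)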

  • @florisas.7557
    @florisas.7557 3 years ago +1

    One thing confuses me about distillation/self-supervised learning: some methods enhance the pseudo-labels, some use confidence thresholds, some use augmentations for the student input, but this paper doesn't do any of those?

    • @rpcruz
      @rpcruz 3 years ago +2

      It uses augmentation. From the paper: "SimCLR learns representations by maximizing agreement [26] between differently augmented views of the same data example via a contrastive loss in the latent space. (...) We use the same set of simple augmentations as SimCLR."
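      For reference, that augmentation family looks roughly like this in torchvision (the exact strengths and probabilities below are assumptions, not the paper's settings):

      from torchvision import transforms

      simclr_augment = transforms.Compose([
          transforms.RandomResizedCrop(224),
          transforms.RandomHorizontalFlip(),
          transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
          transforms.RandomGrayscale(p=0.2),
          transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
          transforms.ToTensor(),
      ])

      # Two independent draws give the two "views" compared by the contrastive loss:
      # view1, view2 = simclr_augment(img), simclr_augment(img)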

    • @florisas.7557
      @florisas.7557 3 years ago +1

      @@rpcruz ah ok thanks! makes a lot more sense then

  • @MastroXel
    @MastroXel 3 years ago +1

    You mentioned that it's not well understood why we get a better model after distillation. Let me push this question further: if that's the case, why can't we now take the student and treat it as a new teacher to obtain an even better student? That doesn't make too much sense, does it?

    • @YannicKilcher
      @YannicKilcher  3 years ago

      People do that, but there are diminishing returns.

  • @alonsomartinez8630
    @alonsomartinez8630 3 years ago

    pure alchemy...

  • @Sileadim
    @Sileadim 3 years ago

    "That would be ridiculous. Well, I guess in this day and age nothing is ridiculous." xD

  • @sayakpaul3152
    @sayakpaul3152 3 years ago +2

    21:20 I think the representations do pass through a non-linearity; there's a sigma there. But anyway, the notation is more complicated than it needed to be, frankly.

  • @scottmiller2591
    @scottmiller2591 3 years ago +1

    If I were cynical, I would think you don't see much value in broader impact statements. If I were cynical.

    • @YannicKilcher
      @YannicKilcher  3 years ago +2

      hypothetically

    • @snippletrap
      @snippletrap 3 years ago +1

      Lol. The same busybodies and morality police sticking their noses into open source communities, renaming NIPS, etc. Why does no one say No to these humorless twats and control freaks?

  • @deeplearner2634
    @deeplearner2634 3 years ago

    Crop yields... haha. Did the researchers suffer from a food shortage??