Self-training with Noisy Student improves ImageNet classification (Paper Explained)

  • Published 17 May 2024
  • Data on the internet is abundant; unlabeled images in particular are plentiful and can be collected with ease. This paper investigates a new method for incorporating unlabeled data into a supervised learning pipeline. First, a teacher model is trained in a supervised fashion. Then that teacher is used to label the unlabeled data. Next, a larger student model is trained on the combination of all the data and achieves better performance than the teacher alone. A rough, toy-scale code sketch of this loop is included at the end of the description.
    OUTLINE:
    0:00 - Intro & Overview
    1:05 - Semi-Supervised & Transfer Learning
    5:45 - Self-Training & Knowledge Distillation
    10:00 - Noisy Student Algorithm Overview
    20:20 - Noise Methods
    22:30 - Dataset Balancing
    25:20 - Results
    30:15 - Perturbation Robustness
    34:35 - Ablation Studies
    39:30 - Conclusion & Comments
    Paper: arxiv.org/abs/1911.04252
    Code: github.com/google-research/no...
    Models: github.com/tensorflow/tpu/tre...
    Abstract:
    We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2.
    Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images. We iterate this process by putting back the student as the teacher. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment to the student so that the student generalizes better than the teacher. Models are available at this https URL. Code is available at this https URL.
    Authors: Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le
    Links:
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar (preferred to Patreon): www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology
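
For readers who want to see the shape of the method described above: a rough, toy-scale sketch of the Noisy Student loop from the abstract, using small stand-in models and random tensors instead of EfficientNets and the 300M unlabeled images. The model widths, epoch counts, and the 0.3 confidence threshold are illustrative assumptions, not the paper's exact settings.

```python
# Toy-scale sketch of the Noisy Student loop (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(width: int, num_classes: int = 10) -> nn.Module:
    # Stand-in for an EfficientNet of a given capacity; the student is
    # typically equal to or larger than the teacher.
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(32 * 32 * 3, width),
        nn.ReLU(),
        nn.Dropout(p=0.5),  # dropout is one of the noise sources for the student
        nn.Linear(width, num_classes),
    )

def train(model: nn.Module, images: torch.Tensor, labels: torch.Tensor,
          epochs: int = 5) -> nn.Module:
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    model.train()  # noise (dropout) is active during training
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(model(images), labels).backward()
        opt.step()
    return model

# Toy data standing in for labeled ImageNet and the unlabeled image corpus.
labeled_x = torch.randn(256, 3, 32, 32)
labeled_y = torch.randint(0, 10, (256,))
unlabeled_x = torch.randn(1024, 3, 32, 32)

# Step 1: train the teacher on labeled data.
teacher = train(make_model(width=64), labeled_x, labeled_y)

for width in (128, 256):  # iterate: each student becomes the next teacher
    teacher.eval()
    with torch.no_grad():  # the teacher is NOT noised when generating pseudo labels
        probs = F.softmax(teacher(unlabeled_x), dim=-1)
    conf, pseudo_y = probs.max(dim=-1)
    keep = conf > 0.3  # keep only confident pseudo labels (assumed threshold)

    # Steps 2-3: train a larger, noised student on labeled + pseudo-labeled data.
    all_x = torch.cat([labeled_x, unlabeled_x[keep]])
    all_y = torch.cat([labeled_y, pseudo_y[keep]])
    teacher = train(make_model(width=width), all_x, all_y)
```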

COMMENTS • 46

  • @MrjbushM 3 years ago

    Crystal clear explanation, thanks!!!

  • @alceubissoto 3 years ago

    Thanks for the amazing explanation!

  • @omarsilva924 3 years ago +1

    Wow! What a great analysis. Thank you

  • @mohamedbahabenticha8624 2 years ago

    Your explanation is amazing and very clear for a very interesting piece of work! Inspiring for my own work!!!

  • @bluel1ng 3 years ago +12

    24:15 I think it is very important that they reject images with high-entropy soft pseudo-labels (= low model confidence) and only use the most confident images per class (>0.3 probability). The images the model is confident about increase generalization the most, since they get classified correctly and then extend the class region through noise and augmentation, especially when previously unseen images lie at the "fringe" of the existing training set or closer to the decision boundary than other samples. Since the whole input space is always mapped to class probabilities, a region can be mapped to a different/wrong class even though the model has seen little evidence there. Through new examples this space can be "conquered" by the correct class. And of course, each correctly classified new image also yields new augmented views, which amplifies the effect. (A small code sketch of this filtering idea follows this thread.)

    • @emilzakirov5173 3 years ago

      I think the problem here is that they use softmax. If you use sigmoid, then for unconfident predictions the model would simply output zeros as class probabilities. It would alleviate any need for rejecting images
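
A small sketch of the confidence filtering discussed in the parent comment: reject pseudo-labeled images whose soft labels have high entropy (low confidence), or, equivalently in spirit, keep only those whose top-class probability clears a threshold. The 0.3 value mirrors the comment above; the entropy cutoff is my own illustrative assumption.

```python
import torch
import torch.nn.functional as F

def filter_pseudo_labels(logits: torch.Tensor,
                         min_top_prob: float = 0.3,
                         max_entropy: float = 2.0) -> torch.Tensor:
    """Return a boolean mask over the batch selecting confident predictions."""
    probs = F.softmax(logits, dim=-1)
    top_prob, _ = probs.max(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return (top_prob > min_top_prob) & (entropy < max_entropy)

# Toy example: 8 "unlabeled" images, 10 classes.
logits = torch.randn(8, 10) * 3
print(filter_pseudo_labels(logits))  # mask of images passing both checks
```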

  • @AdamRaudonis 3 years ago +1

    Super great explanation!!!

  • @48956l 2 years ago

    this seems insanely resource-intensive lol

  • @DrDjango24 3 years ago

    Amazing review. Keep going

  • @sanderbos4243 3 years ago +6

    39:12 I'd love to see a video on minima distributions :)

  • @blanamaxima 3 years ago

    I would not say I am surprised after the double descent paper... I would have thought someone did this already.

  • @pranshurastogi1130 3 years ago

    Thanks, now I have some new tricks up my sleeve

  • @herp_derpingson 3 years ago +6

    11:56 Never heard of stochastic depth before. Interesting.
    .
    After the pandemic is over, have you considered giving talks at conferences to gain popularity?

    • @YannicKilcher 3 years ago +1

      Yea I don't think conferences will have me :D

    • @herp_derpingson 3 years ago

      @YannicKilcher It's a numbers game. Keep swiping right.

  • @karanjeswani21 3 years ago +1

    With a PGD attack, the model is not dead. It's still better than random. Random classification accuracy for 1000 classes would be 0.1%.

  • @hafezfarazi5513 3 years ago +1

    @11:22 You explained DropConnect instead of Dropout!
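
To make the distinction in this comment concrete: a tiny illustrative sketch (my own, not from the paper or the video) for a single linear layer. Dropout zeroes random activations (outputs), whereas DropConnect zeroes random weights (connections).

```python
import torch

def dropout(x: torch.Tensor, w: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    y = x @ w.T
    mask = (torch.rand_like(y) > p).float()
    return y * mask / (1 - p)            # drop whole output units

def dropconnect(x: torch.Tensor, w: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    mask = (torch.rand_like(w) > p).float()
    return x @ (w * mask / (1 - p)).T    # drop individual weights instead

x, w = torch.randn(4, 16), torch.randn(8, 16)
print(dropout(x, w).shape, dropconnect(x, w).shape)  # both torch.Size([4, 8])
```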

  • @JoaoVitor-mf8iq 3 years ago

    That deep-ensemble paper could be used here (38:40), for the multiple local minima that are almost as good as the global minimum

  • @veedrac 3 years ago +1

    This is one of those papers that makes so much sense they could tell you the method and the results might as well be implicit.

  • @samanthaqiu3416 3 years ago +1

    @Yannic please consider making a video on RealNVP/NICE and generative flows, and on why there is such a fetish for tractable log-likelihoods

  • @roohollakhorrambakht8104 3 years ago +1

    Filtering the labels based on the confidence level of the model is a good idea, but the entropy of the predicted distribution is not necessarily a good indicator of that. This is because the probability outputs of the classifier are not calibrated and only express relative confidence (with respect to the other labels). There are many papers on ANN uncertainty estimation, but I find this one from Kendall to be a good example: arxiv.org/pdf/1703.04977.pdf
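
To illustrate the calibration point above: a tiny sketch of temperature scaling, one common post-hoc calibration method (my example, not something taken from the paper or the linked reference). A temperature above 1 softens over-confident softmax outputs; in practice the temperature is fit on a held-out validation set.

```python
import torch
import torch.nn.functional as F

def calibrated_probs(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    # Divide logits by a temperature before the softmax; T > 1 flattens the
    # distribution, T = 1 leaves it unchanged.
    return F.softmax(logits / temperature, dim=-1)

logits = torch.tensor([[4.0, 1.0, 0.5]])
print(calibrated_probs(logits, temperature=1.0))  # raw, over-confident
print(calibrated_probs(logits, temperature=2.5))  # softened after calibration
```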

  • @BanditZA 3 years ago +1

    If it’s just due to augmentation and model size, why not just augment the data the teacher trains on and increase the size of the teacher model? Is there a need to introduce the “student”?

    • @YannicKilcher 3 years ago

      It seems like the distillation itself is important, too

  • @mehribaniasadi6027 3 years ago +1

    Thanks, great explanation. I have a question though.
    At minute 14:40, where the steps of Algorithm 1 (the Noisy Student method) are explained, Step 1 is to train a noised teacher, but then in Step 2, for labelling the unlabelled data, they use the teacher without noise at inference time.
    So I don't get why they train a noised teacher in Step 1 when they eventually use a not-noised teacher for inference.
    I get that in the end the final network is noised, but during the intermediate steps they use not-noised teachers for inference, so how are the noised teachers trained in those intermediate iterations actually used?

    • @YannicKilcher 3 years ago

      it's only used via the labels it outputs.

  • @aa-xn5hc 3 years ago

    Great great channel....

  • @MrjbushM 3 years ago

    Cool video!!!!!!

  • @kanakraj3198 3 years ago

    During the first training, the "real" teacher model (EfficientNet-B5) was trained using augmentations, dropout, and stochastic depth, so the model becomes "noisy". But for inference it was mentioned to use a "clean", not-noised teacher. Then why do we train with noise the first time?

  • @dmitrysamoylenko6775 3 years ago

    Basically they achieve more precise learning on smaller data, and without labels, only from the teacher. Interesting

  • @thuyennguyenhoang9473 3 years ago

    Now top-2 on classification; top-1 is FixEfficientNet-L2

  • @muzammilaziz9979 3 years ago

    I personally think this paper has more hacking than actual novel contribution. It's researcher bias that made them push the idea more and more. It seems like the hacks had more to do with getting the SOTA than with the main idea of the paper.

  • @shivamjalotra7919 3 years ago

    Great

  • @cameron4814 3 years ago +1

    @11:40 "depth dropout"??? I think this paper describes it: users.cecs.anu.edu.au/~sgould/papers/dicta16-depthdropout.pdf

  • @tripzero0 3 years ago

    Trying this now: resnet50 -> efficientnetB2 -> efficientnetB7. The only problem is that it's difficult to increase the batch size as the model size increases :(.

    • @mkamp 3 years ago

      Because of your GPU memory limitations? Have you considered gradient accumulation? (A small sketch of the idea follows this thread.)

    • @tripzero0 3 years ago +1

      @mkamp Didn't know about that until now. Thanks!

    • @tripzero0 3 years ago

      I think this method somewhat depends on having a large-ish "good" initial dataset for the first teacher. I got my resnet50 network to 0.64 recall and 0.84 precision on a multilabel dataset, and the results were still very poor. Relabeling at a 0.8 threshold produces only one or two labels per image to train students from, so a lot of labels get missed from there on. The certainty of getting those few labels right increases, but I'm not sure that trade-off is worth it.
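
Since gradient accumulation came up in this thread, here is a minimal sketch of the idea: run several small micro-batches, accumulate their gradients, and only then take an optimizer step, so the effective batch size grows without needing more GPU memory. The model, data, and step counts below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(128, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 8  # e.g. 8 micro-batches of 32 ~ one effective batch of 256

opt.zero_grad()
for step in range(64):
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    loss = F.cross_entropy(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()                                    # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        opt.step()       # one update per accumulated "large" batch
        opt.zero_grad()
```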

  • @Alex-ms1yd 3 years ago

    At first it sounds quite counter-intuitive that this might work; I would expect the student to become more confident about the teacher's mistakes. But thinking it over, maybe the idea is that by using soft pseudo-labels with big batch sizes we are kind of bumping the student's top-1 closer to the teacher's top-5, and the teacher's mistakes are balanced out by other valid data points.
    The paper itself gives me mixed feelings: on one side, all those tricks distract from the main idea and its evaluation; on the other side, it's what they needed to do to beat SOTA, because everyone does this. But they tried their best to minimize this effect with many baseline comparisons.

  • @michamartyniak9255 3 years ago

    Isn't it already known as Active Learning?

    • @arkasaha4412 3 years ago

      Active learning involves a human in the loop, doesn't it?

  • @Ruhgtfo 3 years ago

    M pretty sure m silent

  • @impolitevegan3179 3 years ago

    Correct me if I'm wrong, but if you trained a bigger model with the same augmentation techniques on ImageNet and performed the same trick described in the paper, then you probably wouldn't get a much better model than the original, right? I feel like it's unfair to have a not-noised teacher and then say the student outperformed the teacher.

    • @YannicKilcher 3 years ago

      Maybe. It's worth a try

    • @impolitevegan3179 3 years ago

      @YannicKilcher Sure, just need to get a few dozen GPUs to train on 130M images

  • @mehermanoj45 3 years ago

    1st, thanks