Self-Training with Noisy Student (87.4% ImageNet Top-1 Accuracy!)
- Published 25 Jan 2025
- This video explains the new state-of-the-art for ImageNet classification from Google AI. The technique has a very interesting approach to Knowledge Distillation: rather than using the student network for model compression (usually either faster inference or lower memory requirements), this approach iteratively scales up the capacity of the student network. The video also covers other interesting aspects of the paper, such as the use of noise via data augmentation, dropout, and stochastic depth, as well as ideas like class balancing with pseudo-labels! (A minimal code sketch of the training loop follows the paper links below.)
Thanks for watching! Please Subscribe!
Self-training with Noisy Student improves ImageNet classification: arxiv.org/pdf/...
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
arxiv.org/pdf/...
Billion-scale semi-supervised learning for state-of-the-art image and video classification:
/ billion-scale-semi-sup...
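Below is a minimal sketch of the self-training loop described above, not the paper's code: toy tensors and tiny fully-connected models stand in for ImageNet/JFT and the EfficientNets, and the widths and step counts are illustrative. The teacher pseudo-labels the unlabeled pool, a larger noised student is trained on labeled plus pseudo-labeled data, and the student then becomes the next teacher.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(width):
    # Dropout stands in for the paper's "model noise"; the real setup also
    # uses stochastic depth and strong data augmentation (RandAugment).
    return nn.Sequential(nn.Linear(32, width), nn.ReLU(),
                         nn.Dropout(p=0.5), nn.Linear(width, 10))

def train(model, x, y, steps=200, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    return model

# Toy labeled and unlabeled sets (stand-ins for ImageNet and the unlabeled images).
x_lab, y_lab = torch.randn(256, 32), torch.randint(0, 10, (256,))
x_unlab = torch.randn(1024, 32)

teacher = train(make_model(width=64), x_lab, y_lab)

for width in (128, 256):                 # each iteration trains an equal-or-larger student
    teacher.eval()
    with torch.no_grad():
        pseudo = teacher(x_unlab).argmax(dim=1)        # hard pseudo-labels
    x_all = torch.cat([x_lab, x_unlab])
    y_all = torch.cat([y_lab, pseudo])
    student = train(make_model(width), x_all, y_all)   # student is noised via Dropout
    teacher = student                                  # the student becomes the new teacher
```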
A very accessible discussion of the paper that gets its main ideas across well. Thanks, subbed!
Loved it! Please continue more of this!
Thank you so much!!
Thank you!
Thanks for making this video! Clear and succinct.
Thank you!!
Amazing! Thank you! I do research involving knowledge distillation and this is amazing motivation to keep moving in that direction
Thank you!! I have been really interested in distillation lately as well! Planning on making a few videos of odd papers I came across while doing a literature review of distillation!
@@connor-shorten Awesome, please do! I'd love to talk sometime with you about my work.
@@BlakeEdwards333 That sounds great! Could you send me an email at henryailabs@gmail.com to discuss this? Looking forward to it!
Thank you for your work, I greatly appreciate it. One suggestion for future episodes: if applicable, could you dedicate a slide or a few sentences to the novelty and/or contributions as stated in the paper? That would be great to know.
In this paper I'm inclined to think it was the stochastic depth noise, and I couldn't make out if anything else was new.
Thank you so much for the suggestion! I don't think the incremental scaling up of the student networks was the key idea either (in the paper they attribute +0.5% to this, compared to 1.9% for the noise trio of stochastic depth, Dropout, and data augmentation). I think the scale of the experiments is novel as well: going from EfficientNet-B7 to the bigger L0, L1, and L2 models is a computationally intense workload, and it illustrates an underexplored / underpromoted training methodology. More generally, it flips common knowledge on its head: we usually think of distillation as a technique for reducing capacity / getting faster models, but this shows how it can be used to train larger models as well. I haven't completely surveyed the literature on distillation, so please share any papers similar to this study!
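For reference, here is a rough sketch of what stochastic depth looks like: during training the residual branch of a block is randomly skipped, and at inference it is kept but scaled by its survival probability. The toy Linear layers are placeholders, not EfficientNet's actual blocks.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block whose transform branch is randomly dropped during training."""
    def __init__(self, dim, survival_prob=0.8):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival_prob:
                return x + self.branch(x)   # branch survives this step
            return x                        # branch dropped: identity shortcut only
        # At inference every branch is kept but scaled by its survival probability.
        return x + self.survival_prob * self.branch(x)
```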
Subbed and recommended to all the people I know :)
Keep up the great work!
Thank you so much!!
Can you turn on captions? That would be very appreciated!
Thank you as always. Could you elaborate a bit more on the stochastic depth & class balancing parts?
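Not the author, but here is a rough sketch of the class-balancing step as described in the paper: for each class, the pseudo-labeled pool is trimmed to the most confident images when over-represented, and images are duplicated when under-represented, so every class contributes roughly the same count. The `per_class` parameter and tensor shapes here are illustrative.

```python
import torch

def balance_pseudo_labels(probs, per_class):
    """probs: (N, C) teacher softmax outputs for the unlabeled images.

    Returns indices into the unlabeled set and hard labels such that every
    class with at least one pseudo-label contributes exactly `per_class`
    examples.
    """
    conf, labels = probs.max(dim=1)
    keep_idx, keep_lab = [], []
    for c in range(probs.shape[1]):
        idx = (labels == c).nonzero(as_tuple=True)[0]
        if len(idx) == 0:
            continue                                    # no pseudo-labels for this class
        idx = idx[conf[idx].argsort(descending=True)]   # most confident first
        if len(idx) >= per_class:
            idx = idx[:per_class]                       # trim over-represented classes
        else:
            reps = (per_class + len(idx) - 1) // len(idx)
            idx = idx.repeat(reps)[:per_class]          # duplicate under-represented ones
        keep_idx.append(idx)
        keep_lab.append(torch.full((per_class,), c, dtype=torch.long))
    return torch.cat(keep_idx), torch.cat(keep_lab)
```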
Awesome work! Can you consider doing a series dedicated to the intuition behind loss functions and their use cases?
Thank you and thanks for the suggestion, it sounds like an interesting project!
@@connor-shorten Really looking forward to it!
Thank you and keep up the good work.
Thank you so much!!
Thank you!
Thank you!
@@connor-shorten It helps me so much, please don't stop making these videos
Train two teacher networks and train the student only on images whose content the teachers agree on.
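A minimal sketch of that agreement filter (not from the paper; `teacher_a`, `teacher_b`, and the tensor shapes are hypothetical): keep only the unlabeled images for which both teachers predict the same class, and use that shared prediction as the pseudo-label.

```python
import torch

def agreed_pseudo_labels(teacher_a, teacher_b, x_unlab):
    # Keep only examples where both teachers predict the same class.
    teacher_a.eval()
    teacher_b.eval()
    with torch.no_grad():
        pred_a = teacher_a(x_unlab).argmax(dim=1)
        pred_b = teacher_b(x_unlab).argmax(dim=1)
    mask = pred_a == pred_b
    return x_unlab[mask], pred_a[mask]   # images plus the agreed-upon pseudo-labels
```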
Hey, I know it's been five months. I really like your idea. Did you try it out?
@@maxmetz01 Nope, sadly no.
Only 1% improvement
Well, nowadays it's hard to improve by even 1%.
Usain Bolt only broke the 100m record by 0.28 seconds lol