Big Transfer (BiT): General Visual Representation Learning (Paper Explained)
- Published 20 May 2024
- One CNN to rule them all! BiT is a pre-trained ResNet that can be used as a starting point for any visual task. This paper explains what it takes to pre-train such a large model and details how fine-tuning on downstream tasks is done best.
Paper: arxiv.org/abs/1912.11370
Code & Models: TBA
Abstract:
Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.
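The "simple heuristic" the abstract refers to is the paper's BiT-HyperRule, which picks the fine-tuning schedule length, whether to use MixUp, and the input resolution purely from the downstream dataset's size. A rough Python sketch of that rule, with thresholds as I recall them from the paper (treat the exact numbers as approximations, not the authors' verbatim recipe):

```python
def bit_hyperrule(num_examples, img_area):
    """Sketch of the BiT-HyperRule fine-tuning heuristic.

    num_examples: size of the downstream training set.
    img_area: pixel area (height * width) of the dataset's images.
    Thresholds below are hedged approximations of the paper's rule.
    """
    if num_examples < 20_000:        # "small" regime
        steps, mixup = 500, False
    elif num_examples < 500_000:     # "medium" regime
        steps, mixup = 10_000, True
    else:                            # "large" regime
        steps, mixup = 20_000, True

    # Resolution rule: small images get resized to 160 and cropped to 128,
    # larger ones resized to 448 and cropped to 384 (again, approximate).
    if img_area < 96 * 96:
        resize, crop = 160, 128
    else:
        resize, crop = 448, 384

    return dict(steps=steps, mixup=mixup, resize=resize, crop=crop)

# e.g. CIFAR-10: 50k examples of 32x32 images
print(bit_hyperrule(50_000, 32 * 32))
# -> {'steps': 10000, 'mixup': True, 'resize': 160, 'crop': 128}
```

The point of the rule is that none of these choices are tuned per task: one lookup replaces a hyperparameter search, which is what makes transfer to 20+ datasets practical.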
Authors: Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby
Links:
UA-cam: / yannickilcher
Twitter: / ykilcher
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Saw the paper yesterday, thanks for the video!
I read the paper first and came here and found your insights extra useful. And nice humor too.
You might be right with your thoughts on #Parameters/#Datapoints affecting the accuracies. We'd just need to look up how many parameters each model has, and take into account that for two different #Datapoints the graph is most probably not exactly the same.
You are awesome man, keep it up
Now I want to see pre-training on a quadrillion YouTube videos where the model tries to predict the next frame, and then fine-tuning it on different robotics RL tasks. Truly one model to rule them all.
Or do Noisy Student training where you not only predict the teacher's output on an image from a video, but also predict the same output on the frames close to that frame.
Yannic, you have it backwards: transformers were NLP's ImageNet moment! Pre-training on huge image datasets and fine-tuning for your task has been common practice for at least 6-7 years now.
At a certain image recognition startup we had datasets with tens to hundreds of millions of images and 10-30K classes back in 2014, and used them as a backbone for all of our downstream models.
Yeah sure, you're right of course, transfer learning isn't new. I think this model just aims to be the common starting point for any sort of visual task. As I say, there's nothing particularly new here, just consolidation.
The generalization results they get from 10 examples per class are quite impressive. 8 GPU months lol ... BTW the standard classifier architecture shares a lot of internal features that are picked up by the logistic regression layer ... that's why in general we do not train a separate net for each class. An interesting dimension to analyse would be how transfer performance depends on the number of classes in the pre-training task ... not only net size and total training examples.
Thanks for your perspective, Andreas. We were (happily!) surprised by the good performance on low-shot too, didn't expect that a priori!
Regarding studying pre-training #classes effect on transfer performance, that has been done previously (caveat: at smaller scale) by Huh, Agrawal and Efros in "What makes ImageNet good for transfer learning?" (arxiv.org/abs/1608.08614) and the TL;DR answer is: it doesn't really matter, at least within [100-1000].
The new operating system!!
Thanks for this, Yannic. Have the BiT-S and/or BiT-M models been released? I looked at the code associated with the paper, but it looks incomplete.
If it isn't available, do you think we could train this ourselves by implementing what's in the paper?
Hi, the code you saw was definitely not official. We now released code in TF2, PyTorch, and Jax, as well as all pre-trained models (except JFT) on github! Check it out: github.com/google-research/big_transfer
Edit: @Yannic, would be cool if you could update your description's "TBA" to the repo link!
13:00 More data hurts ResNet-50 only because it's obviously noisier
To the theory you are proposing: the difference between the performances of small models on different datasets is so small that it could be due to random initialization, unless they trained like 5 times on each dataset and took an average, which I highly doubt because the scale is just too big.
I don't think I'm saying that anywhere and if I do, I'm sorry :) but it's an interesting proposal
@@YannicKilcher No, I was referring to the theory you propose at 15:10. I'm just saying that the theory might be true in itself, but the performance difference is so small it could even be due to random initialization of the weights. I wasn't clear, sorry.
Is a batch size of 8 really not that good for BN? I remember using a batch size of 3-4 and still getting a noticeable difference with BN.
Sure, you'll get a difference compared to no normalization, but if you use GN instead of BN, you might get a better result at low batch sizes.
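This GN-vs-BN point is exactly the swap the BiT paper makes: BatchNorm's statistics degrade at small per-device batch sizes, so they use GroupNorm plus Weight Standardization instead, neither of which depends on the batch at all. A minimal NumPy sketch of both operations (my own illustration, not the paper's code):

```python
import numpy as np

def group_norm(x, num_groups=2, eps=1e-5):
    """GroupNorm: normalize over groups of channels within each sample.
    x has shape (N, C, H, W); assumes C is divisible by num_groups.
    No batch statistics are used, so batch size 1 works fine."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, c, h, w)

def standardize_weights(w, eps=1e-10):
    """Weight Standardization: each output filter of a conv kernel is
    rescaled to zero mean and unit variance before use.
    w has shape (C_out, C_in, kH, kW)."""
    mean = w.mean(axis=(1, 2, 3), keepdims=True)
    var = w.var(axis=(1, 2, 3), keepdims=True)
    return (w - mean) / np.sqrt(var + eps)

# Even a single-example "batch" has well-defined statistics:
x = np.random.randn(1, 4, 8, 8)
y = group_norm(x)
print(y.shape)  # (1, 4, 8, 8)
```

(A real layer would also keep learnable per-channel scale and shift parameters; they're omitted here to keep the normalization itself visible.)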
Makes me think that maybe we humans can learn so data-efficiently because evolution already gives us a pre-trained brain at birth. We have millions of years of pre-training even before we are born.
32:09 That's an owl?
Yes that's totally an owl :D
I'm so curious about this, because this biggering of models and datasets feels almost wasteful when I compare it to human learning. But of course human learning also builds on evolution, not just the little learning we see. At the same time, we can learn things very quickly that we were likely not at all evolved to do, like maths or art, and it makes me wonder if there's some key architectural thing we're missing that would drastically speed up learning without needing so much data.
@@NicheAsQuicheMe from the future here!
Maybe our brain is always trying to predict the next N frames, and that builds a great autoregressive foundation model. So even if we disregard evolution, from birth we already have a brain trained on this massive unlabeled training set, which helps in all sorts of things when trained on downstream specific tasks.
Personally, I have noticed this: whenever something unexpected happens (something falls that wasn't supposed to, or there's some glitch in a video game), my brain blanks out and I close my eyes to reset. This suggests the perpetual autoregressive theory may hold water.
8 GPU millennia