Big Transfer (BiT): General Visual Representation Learning (Paper Explained)
- Published 20 May 2024
- One CNN to rule them all! BiT is a pre-trained ResNet that can be used as a starting point for any visual task. This paper explains what it takes to pre-train such a large model and details how fine-tuning on downstream tasks is done best.
Paper: arxiv.org/abs/1912.11370
Code & Models: TBA
Abstract:
Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.
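The "simple heuristic" the abstract refers to is the paper's BiT-HyperRule, which picks the fine-tuning schedule length, whether to use MixUp, and the input resolution purely from the downstream dataset's size. A rough Python sketch of that rule, with thresholds as I recall them from the paper (treat the exact numbers as approximations, not the authors' verbatim recipe):

```python
def bit_hyperrule(num_examples, img_area):
    """Sketch of the BiT-HyperRule fine-tuning heuristic.

    num_examples: size of the downstream training set.
    img_area: pixel area (height * width) of the dataset's images.
    Thresholds below are hedged approximations of the paper's rule.
    """
    if num_examples < 20_000:        # "small" regime
        steps, mixup = 500, False
    elif num_examples < 500_000:     # "medium" regime
        steps, mixup = 10_000, True
    else:                            # "large" regime
        steps, mixup = 20_000, True

    # Resolution rule: small images get resized to 160 and cropped to 128,
    # larger ones resized to 448 and cropped to 384 (again, approximate).
    if img_area < 96 * 96:
        resize, crop = 160, 128
    else:
        resize, crop = 448, 384

    return dict(steps=steps, mixup=mixup, resize=resize, crop=crop)

# e.g. CIFAR-10: 50k examples of 32x32 images
print(bit_hyperrule(50_000, 32 * 32))
# -> {'steps': 10000, 'mixup': True, 'resize': 160, 'crop': 128}
```

The point of the rule is that none of these choices are tuned per task: one lookup replaces a hyperparameter search, which is what makes transfer to 20+ datasets practical.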
Authors: Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby
Links:
UA-cam: / yannickilcher
Twitter: / ykilcher
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Saw the paper yesterday, thanks for the video!
I read the paper first and came here and found your insights extra useful. And nice humor too.
You might be right with your thoughts on #Parameters/#Datapoints affecting the accuracies. We'd just need to look up how many parameters each model has, and take into account that for two different #Datapoints the graph is most probably not exactly the same.
You are awesome man, keep it up
Now I want to see pre-training on a quadrillion YouTube videos where the model tries to predict the next frame, and then fine-tuning it on different robotics RL tasks. Truly one model to rule them all.
Or do Noisy Student training where you not only predict the teacher's output on an image from a video, but also predict the same output on the frames close to that frame.
Yannic, you have it backwards: transformers were NLP's ImageNet moment! Pre-training on huge image datasets and fine-tuning for your task has been common practice for at least 6-7 years now.
At a certain image recognition startup we had datasets with tens to hundreds of millions of images and 10-30K classes back in 2014, and used them as a backbone for all of our downstream models.
Yeah sure, you're right of course, transfer learning isn't new. I think this model just aims to be the common starting point for any sort of visual task. As I say, there's nothing particularly new here, just consolidation.
The generalization results they get from 10 examples per class are quite impressive. 8 GPU months lol ... BTW the standard classifier architecture shares a lot of internal features that are picked up by the logistic regression layer ... that's why in general we do not train a separate net for each class. An interesting dimension to analyse would be how transfer performance depends on the number of classes in the pre-training task ... not only net size and total training examples.
Thanks for your perspective, Andreas. We were (happily!) surprised by the good performance on low-shot too, didn't expect that a priori!
Regarding studying pre-training #classes effect on transfer performance, that has been done previously (caveat: at smaller scale) by Huh, Agrawal and Efros in "What makes ImageNet good for transfer learning?" (arxiv.org/abs/1608.08614) and the TL;DR answer is: it doesn't really matter, at least within [100-1000].
The new operating system!!
Thanks for this, Yannic. Have the BiT-S and/or BiT-M models been released? I looked at the code associated with the paper, but it looks incomplete.
If it isn't available, do you think we could train this ourselves by implementing what's in the paper?
Hi, the code you saw was definitely not official. We now released code in TF2, PyTorch, and Jax, as well as all pre-trained models (except JFT) on github! Check it out: github.com/google-research/big_transfer
Edit: @Yannic, would be cool if you could update your description's "TBA" to the repo link!
13:00 More data hurts ResNet-50 only because it's obviously noisier
To the theory you are proposing: the difference between the performances of small models on different datasets is so small that it could be due to random initialization, unless they trained like 5 times on each dataset and took an average, which I highly doubt because the scale is just too big.
I don't think I'm saying that anywhere and if I do, I'm sorry :) but it's an interesting proposal
@@YannicKilcher No, I was referring to the theory you propose at 15:10. I'm just saying that the theory might be true in itself, but the performance difference is so small it could even be due to random initialization of the weights. I wasn't clear, sorry.
Is a batch size of 8 really not that good for BN? I remember using a batch size of 3-4 and still getting a noticeable difference with BN.
Sure, you'll get a difference compared to no normalization, but if you use GN instead of BN, you might get a better result at low batch sizes.
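This GN-vs-BN point is exactly the swap the BiT paper makes: BatchNorm's statistics degrade at small per-device batch sizes, so they use GroupNorm plus Weight Standardization instead, neither of which depends on the batch at all. A minimal NumPy sketch of both operations (my own illustration, not the paper's code):

```python
import numpy as np

def group_norm(x, num_groups=2, eps=1e-5):
    """GroupNorm: normalize over groups of channels within each sample.
    x has shape (N, C, H, W); assumes C is divisible by num_groups.
    No batch statistics are used, so batch size 1 works fine."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, c, h, w)

def standardize_weights(w, eps=1e-10):
    """Weight Standardization: each output filter of a conv kernel is
    rescaled to zero mean and unit variance before use.
    w has shape (C_out, C_in, kH, kW)."""
    mean = w.mean(axis=(1, 2, 3), keepdims=True)
    var = w.var(axis=(1, 2, 3), keepdims=True)
    return (w - mean) / np.sqrt(var + eps)

# Even a single-example "batch" has well-defined statistics:
x = np.random.randn(1, 4, 8, 8)
y = group_norm(x)
print(y.shape)  # (1, 4, 8, 8)
```

(A real layer would also keep learnable per-channel scale and shift parameters; they're omitted here to keep the normalization itself visible.)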
Makes me think that maybe we humans can learn so data-efficiently because evolution already gives us a pre-trained brain at birth. We have millions of years of pre-training even before we are born.
32:09 That's an owl?
Yes that's totally an owl :D
I'm so curious about this, because this biggering of models and datasets feels almost wasteful when I compare it to human learning. But of course human learning also builds on evolution, not just the little learning we see. At the same time, we can learn things very quickly that we were likely not at all evolved to do, like maths or art, and it makes me wonder if there's some key architectural thing we're missing that would drastically speed up learning without needing so much data.
@@NicheAsQuicheMe from the future here!
Maybe our brain is always trying to predict the next N frames, and that builds a great autoregressive foundation model. So even if we disregard evolution, from birth we already have a brain trained on this massive unlabeled training set, which helps in all sorts of things when trained on downstream specific tasks.
Personally, I have noticed this: whenever something unexpected happens (something falls that wasn't supposed to, or there's some glitch in a video game), my brain blanks out and I close my eyes to reset. This suggests the perpetual autoregressive theory may hold water.
8 GPU millennia