Image GPT: Generative Pretraining from Pixels (Paper Explained)

  • Published 21 May 2024
  • BERT and GPT-2/3 have shown the enormous power of using generative models as pre-training for classification tasks. However, for images, pre-training is usually done with supervised or self-supervised objectives. This paper investigates how far you can get when applying the principles from the world of NLP to the world of images.
    OUTLINE:
    0:00 - Intro & Overview
    2:50 - Generative Models for Pretraining
    4:50 - Pretraining for Visual Tasks
    7:40 - Model Architecture
    15:15 - Linear Probe Experiments
    24:15 - Fine-Tuning Experiments
    30:25 - Conclusion & Comments
    Paper:
    cdn.openai.com/papers/Generat...
    Blog: openai.com/blog/image-gpt/
    Code: github.com/openai/image-gpt
    Abstract:
    Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full finetuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
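The core setup in the abstract — flatten each low-resolution image into a raster-order pixel sequence and train the transformer to predict the next pixel — can be sketched in a few lines. This is a minimal illustration, not the paper's code; the helper name `make_next_pixel_pairs` is invented here:

```python
import numpy as np

def make_next_pixel_pairs(image):
    """Flatten an HxW image and return (inputs, targets) for
    autoregressive next-pixel prediction."""
    seq = image.reshape(-1)   # raster-scan order: row by row
    inputs = seq[:-1]         # the model sees pixels 0..n-2
    targets = seq[1:]         # and must predict pixels 1..n-1
    return inputs, targets

# Toy "image" whose pixel values are just their raster positions.
img = np.arange(32 * 32, dtype=np.int64).reshape(32, 32)
x, y = make_next_pixel_pairs(img)
assert x.shape == (1023,) and y.shape == (1023,)
assert y[0] == x[1]  # the target at step t is the input at step t+1
```

Note the model never receives the 2D coordinates; the sequence order is all it gets, which is exactly the point the abstract makes.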
    Authors: Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, Ilya Sutskever
    Links:
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
  • Science & Technology

COMMENTS • 75

  • @pmdl · 3 years ago · +6

    Watching the video is similar to a very good pretraining before fine-tuning by actually reading the paper! Frankly it has almost halved the time it took me to go through (and understand) a paper!

  • @JinayShah · 3 years ago · +32

    I think your explanations are a lot better than Henry’s videos. I would love to hear your explanation, please don’t skip it!

    • @DistortedV12 · 3 years ago · +10

      I'm sure Yannic appreciates the compliment, but please don't pit them against one another as "better than". It's better to say which you prefer; no need to minimize another's hard work.

    • @JinayShah · 3 years ago · +1

      @@DistortedV12 I think Henry is doing great work too, really appreciate his channel! Not taking away from his work, they both are better than I'll ever be. I just like how Yannic makes his explanations geared towards beginners like me.

  • @harshrajj9995 · 3 years ago · +1

    I was surprised when the paper had just come out and you'd already made a vid on it. Pro YouTuber move... btw great explanation, love your content!

  • @sando_7 · 1 year ago

    I wouldn't have learned a new cool concept without your explanation. Thank you so much!

  • @MrVaunorage · 3 years ago · +1

    3:39 Please always make a full explanation, I can't get enough :)

  • @bycloudAI · 3 years ago · +12

    damn you are speedy with the papers I love that

  • @dark808bb8 · 3 years ago · +4

    It would be cool to see if iGPT is good at image segmentation. Thanks for the great video!

  • @florianhonicke5448 · 3 years ago · +1

    Great work!

  • @kazinazmulhaqueshezan4219 · 3 years ago · +1

    Amazing Mate.

  • @alansmithee419 · 3 years ago · +2

    2:20
    The first one of the generated images is so cute.
    I want it.

  • @Bodenseecraft · 3 years ago · +4

    A possible implication for future models could be that OpenAI may just use text and image data simultaneously in a combined model, i.e. read and produce image and text data at the same time. E.g. if crawled from the same web page, or using captions, a model could potentially learn common representations. A month ago I would have said that this is pretty unreasonable (although there was previous work such as the Image Transformer), but given the kinds of model capacities that we see now, I'm not so sure anymore.

    • @YannicKilcher · 3 years ago

      Yes I guess VirTex is already going in that direction a bit, but having the transformer architecture throughout will certainly help

  • @BlakeEdwards333 · 3 years ago · +1

    This is awesome!!!!!!

  • @jasdeepsinghgrover2470 · 3 years ago · +2

    Would love to see some attention maps... Really difficult to visualize some sort of hierarchical attention and features like CNNs coming out of this!!
    Thanks for the amazing video!!

  • @SachinSingh-do5ju · 3 years ago · +1

    I love your videos, man

  • @anjandash_ · 3 years ago · +1

    Man! You're so fast!

  • @Leibniz_28 · 3 years ago · +4

    I think you received the paper a week earlier than anyone else, cause you're so fast XD

  • @MarkMifsud · 3 years ago · +1

    This is so amazing, it's fucked up. I'm glad I went to Uni to learn Computer Science 4 years ago (at age 38). This is stuff I can now get into more easily.

  • @bengineer_the · 3 years ago

    oh... here it is. :O Thank you!

  • @herp_derpingson · 3 years ago · +5

    Henry AI Labs looks interesting. Subbed to that too. It's a shame that YouTube's recommendation algorithm wasn't able to correlate the channels.

    29:59 I think it's called hydra nets. The more sources of gradient you give a neural network, the better it does. Even if the tasks are unrelated, as long as you have multiple heads at the end of the trunk, it works.

    • @YannicKilcher · 3 years ago

      Nice. Would be interesting to see if there's a point where it becomes detrimental.

  • @jeffkeller2590 · 1 year ago

    Loved this video, and where this research is headed. This paper seems to validate one of the most basic assumptions behind the semi-supervised training schemes outlined in the 'Dark Matter of Intelligence' paper: that we can train vision models the same way we do NLP models, by self-supervised prediction of the next PIXEL. The Dark Matter paper seems to have gone down a rabbit hole of workarounds for the vision case. Your thoughts?

  • @teslaonly2136 · 3 years ago · +1

    Just finished reading this paper in the afternoon.

    • @teslaonly2136 · 3 years ago · +1

      I think the cool thing about this paper is the context reduction, and how it can complete the image without permutation invariance over the channels.

  • @IRiViI · 3 years ago · +4

    So the quote "What I cannot create, I do not understand" also holds a bit for neural networks =).

  • @jeremykothe2847 · 3 years ago · +38

    I hope you're at least getting some sleep!

  • @teslaonly2136 · 3 years ago · +4

    Great job Yannic. Do you mind sharing at the end of each video what you are going to study in the next one? It would let audiences like me go through the paper first and share our insights in the comment section when you post the video. Just my two cents.

    • @jeremykothe2847 · 3 years ago · +12

      Well you can pause the video, go read the paper and return to watch it...

    • @Phobos11 · 3 years ago · +2

      Jeremy Kothe big brain

    • @YannicKilcher · 3 years ago

      Haha, never thought of that 😁 genius

  • @XOPOIIIO · 3 years ago · +5

    Did it figure out by itself that cats can hold a sheet of paper in their paws, or are such images in the dataset?

  • @bengineer_the · 3 years ago

    So does random cropping induce a non-localised patch of weights (in effect providing contrastive weight spaces), which can then combine in a 'holographic manner' to contribute towards an answer?

  • @glennkroegel1342 · 3 years ago · +2

    I would like to see this done with sparse attention using the row and column for queries and keys. Maybe then you don't have to downsize the images so much.

  • @Laszer271 · 3 years ago

    31:00 You could use the discriminator from a GAN, and I think that's the most common practice, but it wouldn't be pixel by pixel. Autoregressive models can also use convolutions though (e.g. PixelCNN). They just use half of a filter, because they can't see what's ahead, as that would be cheating :P
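The "half of a filter" trick the comment mentions can be shown with a tiny mask. This is my own minimal sketch of a PixelCNN-style causal ("type A") mask, not code from any paper: every kernel weight at or after the centre pixel is zeroed, so the output at a position only depends on pixels earlier in raster order.

```python
import numpy as np

def causal_mask(k):
    """k x k mask keeping only strictly-preceding raster positions."""
    mask = np.zeros((k, k))
    mask[:k // 2, :] = 1        # all rows above the centre row
    mask[k // 2, :k // 2] = 1   # same row, columns left of the centre
    return mask

m = causal_mask(3)
# the centre (1,1) and everything after it are blocked:
assert m[1, 1] == 0 and m[2, 2] == 0
# pixels preceding the centre in raster order are visible:
assert m[0, 2] == 1 and m[1, 0] == 1
```

Multiplying a convolution kernel elementwise by this mask before applying it is what makes the convolution autoregressive.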

  • @Guytron95 · 3 years ago · +3

    I wonder how difficult it would be to switch from image blocking to adding noise and getting denoising out of this? Maybe the BERT model would work better for that.

    • @Phobos11 · 3 years ago · +1

      How would you model noise in a linear fashion? I may be dumb, but I don't see how it will differentiate the information statistics from noise. You could use masking as in BERT, but then you would have to manually define the noise distribution at inference, defeating the purpose. I don't see it 🤔

    • @pmdl · 3 years ago

      @@Phobos11 randomly blocking multiple patches of an image and asking the model to predict those patches before averaging over all?

  • @circuit10 · 3 years ago · +1

    Could this be used for compression by only storing the pixel if it's different from what's expected?
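As a toy illustration of the compression idea in this comment (entirely my own sketch, not from the paper): pair a predictor with residual storage, so correctly predicted pixels cost nothing to store. Here the "model" is just "predict the previous pixel"; a strong model like iGPT would simply drive far more residuals to zero.

```python
import numpy as np

def encode(seq):
    """Store only the difference between each pixel and its prediction."""
    pred = np.concatenate(([0], seq[:-1]))  # predict the previous pixel
    return seq - pred                       # zero wherever prediction was right

def decode(residual):
    """Invert the encoding: seq[t] = seq[t-1] + residual[t]."""
    return np.cumsum(residual)

seq = np.array([5, 5, 5, 6, 6, 9], dtype=np.int64)
r = encode(seq)
assert np.array_equal(decode(r), seq)   # lossless round trip
assert (r == 0).sum() == 3              # runs of equal pixels cost nothing
```

The zeros in the residual stream are what an entropy coder would then compress away; a better predictor means more zeros, i.e. a smaller file.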

  • @herecomesyouknowwho · 3 years ago · +1

    With "rolled-out" pixels, the last known pixel always has relationships to pixels at each fixed distance away. E.g. given a 32x32 image, the pixel at distance -1 from the pixel to be predicted has a similar relationship to the pixel at distance -32 (which is -1 vertically before "roll-out"); -2 is similar to -64, etc. But with language there's no repeating 32-word pattern, and there's never a similar relationship between two words at two fixed distances away (maybe in poetry!). Is that fact built into the model before training, or is it a kind of "image grammar" learned by the lower layers?

    • @YannicKilcher · 3 years ago

      True. Good point. The model here has to learn these relationships
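The offset arithmetic in the parent comment is easy to verify numerically (a small self-contained check, not from the paper): in a raster-scanned 32x32 image, flat-sequence offset -1 is the left neighbour and offset -32 is the neighbour one row up.

```python
import numpy as np

W = 32
img = np.arange(W * W).reshape(W, W)  # pixel value == raster position
seq = img.reshape(-1)                 # "rolled-out" 1D sequence

i, j = 10, 7                          # some interior pixel
t = i * W + j                         # its position in the flat sequence
assert seq[t - 1] == img[i, j - 1]    # -1  -> left neighbour
assert seq[t - W] == img[i - 1, j]    # -32 -> neighbour one row up
```

Since iGPT is given no 2D structure, the regular reappearance of related pixels at offsets -1, -W, -2W, ... is exactly the "image grammar" the attention layers must discover on their own.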

  • @patrickbestgen8834 · 3 years ago · +1

    I don't have a PhD like many of the commenters here, so I'm sorry if my question sounds a bit dumb or goofy, but I wonder whether this paper and the last few papers studied by Yannic (like VirTex, for example) lead to an understanding of the generalizing capacity of the biological brain? Or do we still have a long way to go?

    • @YannicKilcher · 3 years ago

      yes, I think what we're doing has relatively little to do with the brain as such :)

  • @Landonio · 2 years ago · +1

    How exactly do we download this and use it?

  • @kazz811 · 3 years ago · +1

    So BERT is an autoencoder objective, so the only difference here compared to people trying autoencoders ("back in the day!") for semi-supervised learning is self-attention and lots more data? Pretty nuts. I guess the autoregressive GPT objective, as opposed to the autoencoder objective, is something new.

    • @YannicKilcher · 3 years ago

      It's a de-noising autoencoder, which is not exactly the same as a classic autoencoder, but it does share the objective.

    • @kazz811 · 3 years ago · +1

      @@YannicKilcher Yup, that's true! I guess I think of denoising as a key augmentation to the normal autoencoder training objective.

    • @YannicKilcher · 3 years ago

      @@kazz811 yes, that's a nice way of thinking about it. The other difference is that classical AEs usually have some sort of bottleneck in the middle, which is mostly absent from DAEs

    • @kazz811 · 3 years ago

      @@YannicKilcher True! But pre-fine tuning, the middle layers are the best for transfer learning so I guess that is consistent with the emergence of some sort of encoder-decoder structure. It's different with fine tuning though, when Bert obviously improves substantially.

  • @gatoatigrado1 · 3 years ago · +1

    Hmm, I don't think your suggestion of linear probing after fine-tuning is likely to help much. IIUC the linear-probe accuracy at the last layer should rediscover the fine-tuning result (the 99% accuracy). It seems pretty unlikely (though not impossible) that removing later layers would help, unless you think the model adds too much noise in those layers and destroys signal from previous layers.

    • @YannicKilcher · 3 years ago

      Yes, but I'm interested in what happens at the middle layers.

  • @joelye8373 · 3 years ago · +2

    Couldn’t the improved linear probing vs model size be just a result of better disentanglement with a larger layer getting probed?

    • @YannicKilcher · 3 years ago · +1

      Yes absolutely

    • @ChurchOfThought · 3 years ago · +1

      Yes, depends on the actual entropy inherent in the input. A larger number of linear terms has a larger entropy, and therefore can support simpler, more linear representations, within that "bandwidth."
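For readers unfamiliar with the setup discussed in this thread: a linear probe freezes the features from some layer and fits only a linear classifier on top. A minimal synthetic sketch (all data invented here, with least squares standing in for the logistic regression the paper uses):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, classes = 200, 16, 4

# Pretend these are frozen features from a pretrained layer; labels are a
# (noisy) linear function of them, i.e. the features are linearly separable.
W_true = rng.normal(size=(d, classes))
feats = rng.normal(size=(n, d))
y = (feats @ W_true + 0.1 * rng.normal(size=(n, classes))).argmax(1)

# The probe: a single linear map is the only thing that gets "trained".
onehot = np.eye(classes)[y]
W, *_ = np.linalg.lstsq(feats, onehot, rcond=None)
acc = ((feats @ W).argmax(1) == y).mean()
assert acc > 0.8  # good features -> high linear-probe accuracy
```

This also makes the thread's point concrete: probe accuracy depends on both how linearly decodable the features are and how wide the probed layer is.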

  • @firey6220 · 3 years ago · +1

    How to use those!!!!!!

  • @theYoutubeHandle · 3 years ago

    it's more gooder.

  • @Wobuffet3 · 3 years ago · +1

    I'm a dummy who isn't good at computers, how do I use this program?

    • @YannicKilcher · 3 years ago

      Probably not for a while :)

    • @Wobuffet3 · 3 years ago

      @@YannicKilcher Aw dang.

    • @grkb · 3 years ago

      @@Wobuffet3 yeah, you need to install an old version of Ubuntu and figure out a lot of stuff.

  • @sathyanarayanankulasekaran1674 · 3 years ago · +1

    Isn't this similar to PixelGAN?

  • @bryand3576 · 3 years ago

    What if you train this stuff using memes ?