Vision Transformer for Image Classification
- Published 16 Jan 2025
- Vision Transformer (ViT) is the new state-of-the-art for image classification. ViT was posted on arXiv in Oct 2020 and officially published in 2021. On all the public datasets, ViT beats the best ResNet by a small margin, provided that ViT has been pretrained on a sufficiently large dataset. The bigger the dataset, the greater the advantage of the ViT over ResNet.
Slides: github.com/wan...
Reference:
Dosovitskiy et al. An image is worth 16×16 words: transformers for image recognition at scale. In ICLR, 2021.
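To make the "16×16 words" idea in the title concrete, here is a minimal sketch (in PyTorch, not the authors' code) of how an image is cut into 16×16 patches and each patch is linearly projected into a token embedding; all shapes and names are illustrative assumptions.

```python
# Minimal sketch: turning an image into "16x16 words" (patch tokens).
# Shapes correspond to ViT-Base on a 224x224 image; they are illustrative.
import torch
import torch.nn as nn

patch_size = 16
embed_dim = 768                        # hidden size of ViT-Base
img = torch.randn(1, 3, 224, 224)      # (batch, channels, height, width)

# Split the image into non-overlapping 16x16 patches and flatten each one.
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(1, 3, -1, patch_size, patch_size)   # (1, 3, 196, 16, 16)
patches = patches.permute(0, 2, 1, 3, 4).flatten(2)                     # (1, 196, 768) raw pixels

# A linear projection turns each flattened patch into a token embedding.
proj = nn.Linear(3 * patch_size * patch_size, embed_dim)
tokens = proj(patches)                 # (1, 196, 768): 196 "words" per image
print(tokens.shape)
```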
Great explanation with detailed notation. Most of the videos on UA-cam are just verbal explanations, but this kind of symbolic notation is very helpful for grasping the real picture, especially if anyone wants to re-implement the model or add new ideas to it. Thank you so much. Please continue helping us by making these kinds of videos.
Can't stress enough on how easy to understand you made it
great explanation! Good for you! Don't stop giving ML guides!
These are some of the best, hands-on and simple explanations I've seen in a while on a new CS method. Straight to the point with no superfluous details, and at a pace that let me consider and visualize each step in my mind without having to constantly pause or rewind the video. Thanks a lot for your amazing work! :)
Clear, concise, and overall easy to understand for a newbie like me. Thanks!
The best ViT explanation available. Understanding this is also key to understanding Dino and Dino V2.
The best video so far. The animation is easy to follow and the explanation is very straightforward.
Amazing video. It helped me to really understand the vision transformers. Thanks a lot.
Amazing, I am in a rush to implement a vision transformer as an assignment, and this saved me so much time!
lol , same
Man, you made my day! These lectures were golden. I hope you continue to make more of these
Very good explanation, better than many other videos on UA-cam, thank you!
This was a great video. Thanks for your time producing great content.
Thank you. Best ViT video I found.
15 minutes of heaven 🌿. Thanks a lot understood clearly!
Best ViT explanation ever!!!!!!
Thank you, your video is way underrated. Keep it up!
Thank you for your Attention Models playlist. Well explained.
This reminds me of Encarta encyclopedia clips when I was a kid lol! Good job mate!
Very nice job, Shusen, thanks!
If we ignore outputs c1 ... cn, what do c1 ... cn represent then?
Wonderful explanation!👏
Good video, what a splendid presentation. Wang Shusen yyds.
Nicely explained. Appreciate your efforts.
Very clear, thanks for your work.
Awesome Explanation.
Thank you
Thank you so much for this amazing presentation. You have a very clear explanation, I have learnt so much. I will definitely watch your Attention models playlist.
Thank you for the clear explanation!!☺
You have explained ViT in simple words. Thanks
amazing precise explanation
Great explanation. Thank you!
Amazing video. Please do one for Swin Transformers if possible. Thanks a lot.
Brilliant. Thanks a million
@9:30 Why do we discard c1... cn and use only c0? How is it that all the necessary information from the image gets collected & preserved in c0? Thanks
Hey, did you get answer to your question?
Brilliant explanation, thank you.
Excellent explanation 👌
This is a great explanation video.
One nit: you are misusing the term 'dimension'. If a classification vector is a flat array with 8 values, that's not '8-dimensional'; it is a 1-dimensional vector with 8 values.
thank you so much for the clear explanation
Very good explanation
subscribed!
Wonderful talk
Awesome explanation man thanks a tonne!!!
Great explanation
Really great explanation, thank you.
Nice video!! Just a question: what is the argument behind getting rid of the vectors c1 to cn and keeping only c0? Thanks
In the job market, do data scientists use transformers?
great video. thanks. could u plz explain swin transformer too?
The class token output c0 is in the embedding dimension; does that mean we should add a linear layer from the embedding dimension to the number of classes before the softmax for the classification?
Good job! Thanks
The simplest and most interesting explanation, many thanks. A question about object detection models: have you explained them before?
that was educational!
How is the model trained on dataset A? I mean, what is the loss function? Does it use only the encoder, or both the encoder and decoder?
Really good, thx.
Can you please explain the paper "Your classifier is secretly an energy based model and you should treat it like one"? I want to understand these energy-based models.
CNN on images + positional info = Transformers for images
Super clear explanation! Thanks! I want to understand how attention is applied to the images. I mean, with a CNN you can "see" where the neural network is focusing, but how do you do that with transformers?
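One common way to do this: look at the attention that the [CLS] token pays to the patch tokens and reshape it into a heatmap over the image. The sketch below assumes PyTorch and uses a random stand-in for the attention weights; in practice you would hook them out of an encoder layer (or use attention rollout across layers).

```python
# Hedged sketch: visualizing where ViT "looks" by reading the attention that
# the [CLS] token pays to the patch tokens in one encoder layer.
# The attention tensor below is a random stand-in for the hooked weights.
import torch

num_heads, num_patches = 12, 196                 # 14 x 14 patches for a 224x224 image
attn = torch.rand(num_heads, num_patches + 1, num_patches + 1)
attn = attn / attn.sum(dim=-1, keepdim=True)     # each row sums to 1, like a softmax output

cls_attn = attn[:, 0, 1:]                        # attention from [CLS] to every patch token
heatmap = cls_attn.mean(dim=0)                   # average over heads
heatmap = heatmap.reshape(14, 14)                # back onto the spatial grid of patches
# Upsample `heatmap` to 224x224 and overlay it on the image to see the focus regions.
```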
Why does the transformer require so many images to train? And why doesn't ResNet keep improving with more pretraining data the way ViT does?
great video!
Very good explanation! Can you please explain how we can fine-tune these models on our own dataset? Is it possible on a local computer?
Unfortunately, no. Google has TPU clusters. The amount of computation is insane.
@@ShusenWangEng Actually, I have my project proposal due today. I was proposing this on the FOOD-101 dataset; it has 101,000 images.
So it can’t be done?
What size dataset can we train on our local PC
Can you please reply?
Stuck at the moment..
Thanks
@@parveenkaur2747 If your dataset is very different from ImageNet, Google's pretrained model may not transfer well to your problem. The performance can be bad.
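For readers in a similar situation, here is a rough fine-tuning sketch. It assumes the `timm` library and its `vit_base_patch16_224` checkpoint; the dataset size, hyperparameters, and the idea of freezing the backbone so training fits on a local GPU are illustrative choices, not the procedure from the video.

```python
# Rough sketch of fine-tuning a pretrained ViT on a smaller dataset such as
# Food-101 (101 classes) on a single local GPU. Assumes the `timm` library.
import timm
import torch
import torch.nn as nn

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=101)

# Freezing the backbone and training only the new classification head keeps the
# compute small enough for a local machine (at some cost in accuracy).
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("head")

optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)      # stand-in for a real Food-101 batch
labels = torch.randint(0, 101, (8,))
logits = model(images)                    # (8, 101)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```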
Great explanation :)
Amazing video. It helped me to really understand vision transformers. Thanks a lot. But I have a question: why do we only use the CLS token for the classifier?
Looks like, thanks to the attention layers, the CLS token is able to extract all the information it needs for a good classification from the other tokens. Using all tokens for classification would just unnecessarily increase computation.
@@NeketShark That's a good answer. At 9:40, any idea how a softmax function was able to increase (or decrease) the dimension of the vector "c" into "p"? I thought softmax only changes the entries of a vector, not its dimension.
@@Darkev77 I think it first goes through a linear layer and then through a softmax, so it's the linear layer that changes the dimension. In the video this info was probably omitted for simplification.
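A minimal sketch of that head, assuming PyTorch and ViT-Base shapes (both illustrative): the linear layer maps c0 from the embedding dimension to the number of classes, and the softmax then only rescales the entries; the other outputs c1 ... cn are simply not used.

```python
# Minimal sketch of the classification head discussed above, in PyTorch.
# Shapes are illustrative (ViT-Base embedding size, 1000 classes).
import torch
import torch.nn as nn

embed_dim, num_classes, num_patches = 768, 1000, 196

encoder_out = torch.randn(1, num_patches + 1, embed_dim)  # c0, c1, ..., cn from the encoder
c0 = encoder_out[:, 0]                                    # only the [CLS] output is kept

head = nn.Linear(embed_dim, num_classes)                  # this layer changes the dimension
p = torch.softmax(head(c0), dim=-1)                       # softmax only rescales the entries
print(p.shape)   # (1, 1000), a probability vector over the classes
```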
great
If you remove the positional encoding step, the whole thing is almost equivalent to a CNN, right?
I mean, those dense layers act just like the filters of a CNN.
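That observation can be checked directly: a dense layer applied to each flattened patch computes the same thing as a convolution whose kernel size and stride both equal the patch size. A small PyTorch sketch with illustrative shapes (not code from the video):

```python
# Sketch: the per-patch dense layer equals a strided convolution with
# kernel_size = stride = patch_size, once the weights are shared.
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768
img = torch.randn(1, 3, 224, 224)

# Linear projection of flattened 16x16 patches ...
linear = nn.Linear(3 * patch_size * patch_size, embed_dim)

# ... versus a strided convolution carrying the same weights.
conv = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
with torch.no_grad():
    conv.weight.copy_(linear.weight.view(embed_dim, 3, patch_size, patch_size))
    conv.bias.copy_(linear.bias)

patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)   # (1, 196, 768)
out_linear = linear(patches)
out_conv = conv(img).flatten(2).transpose(1, 2)                        # (1, 196, 768)
print(torch.allclose(out_linear, out_conv, atol=1e-5))                 # True
```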
Great great great
The concept has similarities to TCP protocol in terms of segmentation and positional encoding. 😅😅😅
Why do the authors evaluate and compare their results against the old ResNet architecture? Why not use EfficientNets for comparison? It looks like not the strongest baseline...
ResNet is a family of CNNs. Many tricks are applied to make ResNet work better. The reported numbers are indeed the best accuracies that CNNs can achieve.
👏
Not All Heroes Wear Capes
Actually, I think it's better when the creator speaks Chinese 🥰🤣
There is also a Chinese version ( ua-cam.com/video/BbzOZ9THriY/v-deo.html ); different languages have different audiences.
1) You mentioned the pretrained model: it uses a large-scale dataset, and then a smaller dataset for fine-tuning. Does that mean everything stays the same, except that the last softmax layer is adjusted to the number of classes and then trained on the fine-tuning dataset? Or are there other different settings? 2) Another doubt for me: there is no masking at all in ViT, right? Since it comes from MLM ... um ...
This English is really something.
Is this supposed to be English?
That was great and helpful 🤌🏻
Thank you for the clear explanation