Vision Transformer for Image Classification

  • Published 16 Jan 2025
  • Vision Transformer (ViT) is the new state-of-the-art for image classification. ViT was posted on arXiv in Oct 2020 and officially published in 2021. On all the public datasets, ViT beats the best ResNet by a small margin, provided that ViT has been pretrained on a sufficiently large dataset. The bigger the dataset, the greater the advantage of the ViT over ResNet.
    Slides: github.com/wan...
    Reference:
    Dosovitskiy et al. An image is worth 16×16 words: transformers for image recognition at scale. In ICLR, 2021.

COMMENTS • 84

  • @UzzalPodder
    @UzzalPodder 3 years ago +41

    Great explanation with detailed notation. Most of the videos found on YouTube were some kind of oral explanation. But this kind of symbolic notation is very helpful for grasping the real picture, especially if anyone wants to re-implement it or add a new idea to it. Thank you so much. Please continue helping us by making these kinds of videos.

  • @mmpattnaik97
    @mmpattnaik97 2 years ago +2

    Can't stress enough how easy to understand you made it

  • @ai_lite
    @ai_lite 10 months ago +1

    Great explanation! Good for you! Don't stop giving ML guides!

  • @drakehinst271
    @drakehinst271 2 years ago +6

    These are some of the best, hands-on and simple explanations I've seen in a while on a new CS method. Straight to the point with no superfluous details, and at a pace that let me consider and visualize each step in my mind without having to constantly pause or rewind the video. Thanks a lot for your amazing work! :)

  • @adityapillai3091
    @adityapillai3091 11 months ago

    Clear, concise, and overall easy to understand for a newbie like me. Thanks!

  • @thecheekychinaman6713
    @thecheekychinaman6713 1 year ago

    The best ViT explanation available. Understanding this is also key to understanding DINO and DINOv2.

  • @drelvenkee1885
    @drelvenkee1885 1 year ago

    The best video so far. The animation is easy to follow and the explanation is very straightforward.

  • @aimeroundiaye1378
    @aimeroundiaye1378 3 years ago +10

    Amazing video. It helped me to really understand the vision transformers. Thanks a lot.

  • @valentinfontanger4962
    @valentinfontanger4962 2 years ago +1

    Amazing, I am in a rush to implement a vision transformer as an assignment, and this saved me so much time!

  • @sheikhshafayat6984
    @sheikhshafayat6984 2 years ago +3

    Man, you made my day! These lectures were golden. I hope you continue to make more of these

  • @vladi21k
    @vladi21k 2 years ago

    Very good explanation, better than many other videos on YouTube, thank you!

  • @MonaJalal
    @MonaJalal 3 years ago +2

    This was a great video. Thanks for your time producing great content.

  • @soumyajitdatta9203
    @soumyajitdatta9203 1 year ago

    Thank you. Best ViT video I found.

  • @thepresistence5935
    @thepresistence5935 2 years ago +2

    15 minutes of heaven 🌿. Thanks a lot, understood clearly!

  • @Peiying-h4m
    @Peiying-h4m 1 year ago

    Best ViT explanation ever!!!!!!

  • @DerekChiach
    @DerekChiach 3 years ago

    Thank you, your video is way underrated. Keep it up!

  • @arash_mehrabi
    @arash_mehrabi 2 years ago

    Thank you for your Attention Models playlist. Well explained.

  • @swishgtv7827
    @swishgtv7827 3 years ago

    This reminds me of Encarta encyclopedia clips when I was a kid lol! Good job mate!

  • @NisseOhlsen
    @NisseOhlsen 3 years ago

    Very nice job, Shusen, thanks!

  • @nova2577
    @nova2577 2 years ago +2

    If we ignore the outputs c1 ... cn, what do c1 ... cn represent then?

  • @ronalkobi4356
    @ronalkobi4356 7 months ago

    Wonderful explanation!👏

  • @wengxiaoxiong666
    @wengxiaoxiong666 1 year ago

    Good video, what a splendid presentation. Wang Shusen yyds.

  • @nehalkalita
    @nehalkalita 1 year ago

    Nicely explained. Appreciate your efforts.

  • @lionhuang9209
    @lionhuang9209 3 years ago

    Very clear, thanks for your work.

  • @muhammadfaseeh5810
    @muhammadfaseeh5810 2 years ago

    Awesome Explanation.
    Thank you

  • @sehaba9531
    @sehaba9531 2 years ago

    Thank you so much for this amazing presentation. You have a very clear explanation, I have learnt so much. I will definitely watch your Attention models playlist.

  • @hongkyulee9724
    @hongkyulee9724 1 year ago

    Thank you for the clear explanation!!☺

  • @rajgothi2633
    @rajgothi2633 1 year ago

    You have explained ViT in simple words. Thanks

  • @xXMaDGaMeR
    @xXMaDGaMeR 1 year ago

    amazing precise explanation

  • @parmanandchauhan6182
    @parmanandchauhan6182 6 months ago

    Great explanation. Thank you.

  • @aryanmobiny7340
    @aryanmobiny7340 3 years ago +1

    Amazing video. Please do one for Swin Transformers if possible. Thanks a lot.

  • @jidd32
    @jidd32 2 years ago

    Brilliant. Thanks a million

  • @MenTaLLyMenTaL
    @MenTaLLyMenTaL 2 years ago +1

    @9:30 Why do we discard c1... cn and use only c0? How is it that all the necessary information from the image gets collected & preserved in c0? Thanks

  • @tallwaters9708
    @tallwaters9708 2 years ago

    Brilliant explanation, thank you.

  • @mmazher5826
    @mmazher5826 1 year ago

    Excellent explanation 👌

  • @ervinperetz5973
    @ervinperetz5973 2 years ago +1

    This is a great explanation video.
    One nit: you are misusing the term 'dimension'. If a classification vector is linear with 8 values, that's not '8-dimensional' -- it is a 1-dimensional vector with 8 values.

  • @boemioofworld
    @boemioofworld 3 years ago

    thank you so much for the clear explanation

  • @deeplearn6584
    @deeplearn6584 2 years ago

    Very good explanation
    subscribed!

  • @sudhakartummala4701
    @sudhakartummala4701 3 years ago

    Wonderful talk

  • @t.pranav2834
    @t.pranav2834 3 years ago

    Awesome explanation man thanks a tonne!!!

  • @medomed1105
    @medomed1105 2 years ago

    Great explanation

  • @mariamwaleed2132
    @mariamwaleed2132 2 years ago

    Really great explanation, thank you.

  • @ASdASd-kr1ft
    @ASdASd-kr1ft 1 year ago

    Nice video!! Just a question: what is the argument for getting rid of the vectors c1 to cn and keeping only c0? Thanks

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago

    In the job market, do data scientists use transformers?

  • @saeedataei269
    @saeedataei269 2 years ago

    great video. thanks. could u plz explain swin transformer too?

  • @chawkinasrallah7269
    @chawkinasrallah7269 8 months ago

    The class token 0 is in the embed dim, does that mean we should add a linear layer from embed to number of classes before the softmax for the classification?

  • @ogsconnect1312
    @ogsconnect1312 3 years ago

    Good job! Thanks

  • @DrAhmedShahin_707
    @DrAhmedShahin_707 3 years ago

    The simplest and most interesting explanation, many thanks. I am asking about object detection models: have you explained them before?

  • @BeytullahAhmetKINDAN
    @BeytullahAhmetKINDAN 1 year ago

    that was educational!

  • @DrAIScience
    @DrAIScience 8 months ago

    How is data A trained? I mean, what is the loss function? Does it use only the encoder, or both the encoder and the decoder?

  • @ME-mp3ne
    @ME-mp3ne 3 years ago

    Really good, thx.

  • @bbss8758
    @bbss8758 3 years ago

    Can you please explain this paper: “Your classifier is secretly an energy based model and you should treat it like one”? I want to understand these energy-based models.

  • @sevovo
    @sevovo 1 year ago +1

    CNN on images + positional info = Transformers for images

  • @fedegonzal
    @fedegonzal 3 years ago

    Super clear explanation! Thanks! I want to understand how attention is applied to the images. I mean, using cnn you can "see" where the neural network is focusing, but with transformers?

  • @ThamizhanDaa1
    @ThamizhanDaa1 2 years ago

    WHY does the transformer require so many images to train?? And why is ResNet not becoming better with more training vs. ViT?

  • @zeweichu550
    @zeweichu550 2 years ago

    great video!

  • @parveenkaur2747
    @parveenkaur2747 3 years ago

    Very good explanation! Can you please explain how we can fine-tune these models on our own dataset? Is it possible on our local computer?

    • @ShusenWangEng
      @ShusenWangEng 3 years ago +4

      Unfortunately, no. Google has TPU clusters. The amount of computation is insane.

    • @parveenkaur2747
      @parveenkaur2747 3 years ago

      @@ShusenWangEng Actually, I have my project proposal due today. I was proposing this on the FOOD-101 dataset, which has 101,000 images.
      So it can’t be done?
      What size of dataset can we train on with our local PC?

    • @parveenkaur2747
      @parveenkaur2747 3 years ago

      Can you please reply?
      Stuck at the moment..
      Thanks

    • @ShusenWangEng
      @ShusenWangEng 3 years ago

      @@parveenkaur2747 If your dataset is very different from ImageNet, Google's pretrained model may not transfer well to your problem. The performance can be bad.

  • @ansharora3248
    @ansharora3248 3 years ago

    Great explanation :)

  • @DungPham-ai
    @DungPham-ai 3 years ago +1

    Amazing video. It helped me to really understand the vision transformers. Thanks a lot. But I have a question: why do we only use the cls token for the classifier?

    • @NeketShark
      @NeketShark 3 years ago

      It looks like, thanks to the attention layers, the cls token is able to extract all the information it needs for a good classification from the other tokens. Using all tokens for classification would just unnecessarily increase computation.

    • @Darkev77
      @Darkev77 3 years ago

      @@NeketShark that’s a good answer. At 9:40, any idea how a softmax function was able to increase (or decrease) the dimension of vector “c” into “p”? I thought softmax would only change the entries of a vector, not its dimensions

    • @NeketShark
      @NeketShark 3 years ago +1

      @@Darkev77 I think it first goes through a linear layer and then through a softmax, so it's the linear layer that changes the dimension. In the video this detail was probably omitted for simplification.
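
The linear-layer-then-softmax step described in this reply can be sketched in a few lines of plain Python. This is only a minimal illustration, not the paper's implementation: the sizes `embed_dim` and `num_classes`, the random stand-in for c0, and the untrained head parameters `W` and `b` are all assumptions made up for the example.

```python
import math
import random

def linear(x, weight, bias):
    """Map a length-d vector to length k; weight holds k rows of d values."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_j
            for row, b_j in zip(weight, bias)]

def softmax(z):
    """Turn scores into probabilities; output length equals input length."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
embed_dim, num_classes = 8, 3  # hypothetical sizes, chosen for illustration

# c0: the encoder output at the class-token position (random stand-in here)
c0 = [random.gauss(0, 1) for _ in range(embed_dim)]

# Hypothetical, untrained classification-head parameters
W = [[random.gauss(0, 0.1) for _ in range(embed_dim)]
     for _ in range(num_classes)]
b = [0.0] * num_classes

logits = linear(c0, W, b)  # length changes: embed_dim -> num_classes
p = softmax(logits)        # length stays num_classes; entries sum to 1

print(len(c0), len(logits), len(p))
```

Running it prints `8 3 3`: the change in vector length happens in the linear layer, while softmax only rescales the entries so that they sum to 1. The outputs c1 ... cn are never passed to the head, which matches the point made earlier in this thread about using only the cls token.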

  • @mahmoudtarek6859
    @mahmoudtarek6859 2 years ago

    great

  • @shamsarfeen2729
    @shamsarfeen2729 3 years ago

    If you remove the positional encoding step, the whole thing is almost equivalent to a CNN, right?
    I mean, those dense layers are just like the filters of a CNN.

  • @st-hs2ve
    @st-hs2ve 3 years ago

    Great great great

  • @swishgtv7827
    @swishgtv7827 3 years ago +1

    The concept has similarities to TCP protocol in terms of segmentation and positional encoding. 😅😅😅

  • @palyashuk42
    @palyashuk42 3 years ago

    Why do the authors evaluate and compare their results with the old ResNet architecture? Why not use EfficientNets for comparison? Looks like not the best result...

    • @ShusenWangEng
      @ShusenWangEng 3 years ago +1

      ResNet is a family of CNNs. Many tricks are applied to make ResNet work better. The reported numbers are indeed the best accuracies that CNNs can achieve.

  • @顾小杰
    @顾小杰 2 years ago

    👏

  • @randomperson5303
    @randomperson5303 2 years ago

    Not All Heroes Wear Capes

  • @seakan6835
    @seakan6835 2 years ago

    Actually, I think it would be better if the uploader spoke Chinese 🥰🤣

    • @boyang6105
      @boyang6105 2 years ago

      There is also a Chinese version ( ua-cam.com/video/BbzOZ9THriY/v-deo.html ); different languages have different audiences.

  • @yinghaohu8784
    @yinghaohu8784 3 years ago

    1) You mentioned the pretrained model: it uses a large-scale dataset, and then a smaller dataset is used for fine-tuning. Does that mean c0 stays almost the same, except that the last softmax layer is adjusted based on class_num, and then the model is trained on the fine-tuning dataset? Or are there other settings that differ? 2) Another doubt for me: there is no mask at all in ViT, right? Since it is from MLM ... um ...

  • @yuan6950
    @yuan6950 2 years ago

    This English leaves me speechless.

  • @kutilkol
    @kutilkol 9 months ago

    this is supposed to be english?

  • @mahdiyehbasereh
    @mahdiyehbasereh 1 year ago

    That was great and helpful 🤌🏻

  • @tianbaoxie2324
    @tianbaoxie2324 2 years ago

    Very clear, thanks for your work.

  • @Raulvic
    @Raulvic 3 years ago

    Thank you for the clear explanation