Vision Transformer Quick Guide - Theory and Code in (almost) 15 min

  • Published 15 May 2024
  • ▬▬ Papers / Resources ▬▬▬
    Colab Notebook: colab.research.google.com/dri...
    ViT paper: arxiv.org/abs/2010.11929
    Best Transformer intro: jalammar.github.io/illustrate...
    CNNs vs ViT: arxiv.org/abs/2108.08810
    CNNs vs ViT Blog: towardsdatascience.com/do-vis...
    Swin Transformer: arxiv.org/abs/2103.14030
    DeiT: arxiv.org/abs/2012.12877
    ▬▬ Support me if you like 🌟
    ►Link to this channel: bit.ly/3zEqL1W
    ►Support me on Patreon: bit.ly/2Wed242
    ►Buy me a coffee on Ko-Fi: bit.ly/3kJYEdl
    ►E-Mail: deepfindr@gmail.com
    ▬▬ Used Music ▬▬▬▬▬▬▬▬▬▬▬
    Music from #Uppbeat (free for Creators!):
    uppbeat.io/t/92elm/jasmine
    License code: SMTWRWLNGHZHH0OC
    ▬▬ Used Icons ▬▬▬▬▬▬▬▬▬▬
    All Icons are from flaticon: www.flaticon.com/authors/freepik
    ▬▬ Timestamps ▬▬▬▬▬▬▬▬▬▬▬
    00:00 Introduction
    00:16 ViT Intro
    01:12 Input embeddings
    01:50 Image patching
    02:54 Einops reshaping
    04:13 [CODE] Patching
    05:35 CLS Token
    06:40 Positional Embeddings
    08:09 Transformer Encoder
    08:30 Multi-head attention
    08:50 [CODE] Multi-head attention
    09:12 Layer Norm
    09:30 [CODE] Layer Norm
    09:55 Feed Forward Head
    10:05 [CODE] Feed Forward Head
    10:21 Residuals
    10:45 [CODE] final ViT
    13:10 CNN vs. ViT
    14:45 ViT Variants
    ▬▬ My equipment 💻
    - Microphone: amzn.to/3DVqB8H
    - Microphone mount: amzn.to/3BWUcOJ
    - Monitors: amzn.to/3G2Jjgr
    - Monitor mount: amzn.to/3AWGIAY
    - Height-adjustable table: amzn.to/3aUysXC
    - Ergonomic chair: amzn.to/3phQg7r
    - PC case: amzn.to/3jdlI2Y
    - GPU: amzn.to/3AWyzwy
    - Keyboard: amzn.to/2XskWHP
    - Bluelight filter glasses: amzn.to/3pj0fK2

COMMENTS • 45

  • @JessSightler
    @JessSightler 7 days ago

    I've changed the output layer a bit... this:
    self.head_ln = nn.LayerNorm(emb_dim)
    self.head = nn.Sequential(nn.Linear(int((1 + self.height/self.patch_size * self.width/self.patch_size) * emb_dim), out_dim))
    Then in forward:
    x = x.view(x.shape[0], int((1 + self.height/self.patch_size * self.width/self.patch_size) * x.shape[-1]))
    out = self.head(x)
    The downside is that you'll likely get a lot more overfitting, but without it the network was not really training at all.

    • @DeepFindr
      @DeepFindr 6 days ago

      Hi, thanks for your recommendation.
      I would probably not use this model for real-world data, as many important details are missing (for the sake of providing a simple overview).
      I will pin your comment for others who also want to use this implementation.
      Thank you!
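
    A minimal sketch of the flattened-head variant described in this thread, next to the standard CLS-token head. The sizes (height, width, patch_size, emb_dim, out_dim) are assumed example values, not necessarily the notebook's:

      import torch
      import torch.nn as nn

      # Assumed example sizes; the notebook's actual values may differ.
      height, width, patch_size, emb_dim, out_dim = 144, 144, 8, 128, 37
      num_tokens = 1 + (height // patch_size) * (width // patch_size)   # CLS token + patches

      # Variant from the comment above: normalize, flatten ALL token embeddings, then classify.
      head_ln = nn.LayerNorm(emb_dim)
      head = nn.Sequential(nn.Linear(num_tokens * emb_dim, out_dim))

      x = torch.randn(4, num_tokens, emb_dim)          # (batch, tokens, emb_dim) from the encoder
      out = head(head_ln(x).reshape(x.shape[0], -1))   # flatten tokens -> (batch, num_tokens * emb_dim)

      # Standard ViT head for comparison: classify from the CLS token only.
      cls_head = nn.Linear(emb_dim, out_dim)
      out_cls = cls_head(head_ln(x)[:, 0])             # (batch, out_dim)

    The flattened head has far more parameters than the CLS-token head, which matches the overfitting caveat in the comment above.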

  • @geekyprogrammer4831
    @geekyprogrammer4831 8 months ago +7

    This is a very underrated channel. You deserve way more viewers!!

  • @tenma5220
    @tenma5220 9 months ago +1

    This channel is amazing. Please continue making videos!

  • @hemanthvemuluri9997
    @hemanthvemuluri9997 5 months ago

    Awesome man!! You code and explain with such simplicity.

  • @netanelmad
    @netanelmad 5 months ago

    Thank you! Very clear and informative.

  • @hmind9836
    @hmind9836 10 months ago +4

    You're awesome man!!! I clicked your video so fast, you're one of my favorite AI youtubers. I work in the field and I think you have a wonderful ability to explain complex concepts in your videos.

    • @DeepFindr
      @DeepFindr 10 months ago

      thanks for the kind words :)

  • @romanlyskov9785
    @romanlyskov9785 3 months ago

    Awesome! Thanks for the excellent explanation!

  • @florianhonicke5448
    @florianhonicke5448 10 months ago +4

    Really great explanation. Nice visuals

    • @DeepFindr
      @DeepFindr 10 months ago +1

      Much appreciated!

  • @user-xm7yi8rn4j
    @user-xm7yi8rn4j 3 months ago

    Thank you!!

  • @anightattheraces
    @anightattheraces 5 months ago

    Very helpful video, thanks!

  • @marcossrivas
    @marcossrivas 8 months ago

    Cool video! What do you think about applying ViT to signal processing (spectrogram analysis), for example for audio? What advantages could it have over classic convolutional networks?

  • @datascienceworld
    @datascienceworld 3 months ago

    Great tutorial

  • @kristoferkrus
    @kristoferkrus 7 months ago +2

    Nice video! However, I think it's incorrect that you would get separate vectors for the three channels? This is not how they do it in the paper; there they say that the number of patches is N = HW/P^2, where H and W are the height and width of the original image and (P, P) is the resolution of each patch, so the number of color channels doesn't affect the number of patches you get.
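
    For reference, a small sketch of the patching step with einops, showing that the channels fold into each patch vector rather than creating extra patches. The image and patch sizes are example values, not necessarily the ones used in the video:

      import torch
      from einops import rearrange

      B, C, H, W, P = 2, 3, 144, 144, 8   # example sizes
      img = torch.randn(B, C, H, W)

      # N = H*W / P^2 patches per image; each patch is flattened to P*P*C values,
      # so the channels enlarge the patch vectors, not the number of patches.
      patches = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=P, p2=P)
      print(patches.shape)   # torch.Size([2, 324, 192])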

  • @murphy1162
    @murphy1162 9 months ago

    Hope you can explain Swin Transformer object detection in a new video, please.

  • @hautran-uc8gz
    @hautran-uc8gz 2 months ago

    thank you

  • @RAZZKIRAN
    @RAZZKIRAN 10 months ago

    thank u ,

  • @kitgary
    @kitgary 9 months ago +1

    Awesome video! But I wonder if you reversed the order of LayerNorm and Multi-Head Attention? I think the LayerNorm should be applied after Multi-Head Attention, but your implementation applies the LayerNorm before it.

    • @DeepFindr
      @DeepFindr 9 months ago +6

      Hi! Thanks!
      There is a paper that investigated pre- vs. post-LayerNorm in transformers (see arxiv.org/pdf/2002.04745). The "pre" variant seems to perform better than the traditional ordering suggested in the original Transformer paper. This is also what most public implementations do :)
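
    A rough, self-contained sketch of the two orderings discussed here; the attn/ff sublayers and sizes are illustrative stand-ins, not the notebook's exact code:

      import torch
      import torch.nn as nn

      dim = 128
      attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
      ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
      norm1, norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
      x = torch.randn(2, 65, dim)   # (batch, tokens, dim)

      # Post-norm (original Transformer): normalize after the residual addition.
      y = norm1(x + attn(x, x, x)[0])
      y = norm2(y + ff(y))

      # Pre-norm (as in this video and most public ViT code): normalize before each sublayer.
      z = x + attn(norm1(x), norm1(x), norm1(x))[0]
      z = z + ff(norm2(z))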

  • @chinnum9716
    @chinnum9716 7 months ago

    Hey, great video. I have a question, though. Isn't the entire point of pre-norm that the normalization is applied before the attention computation?
    But from the code, norm = PreNorm(128, Attention(dim=128, n_heads=4, dropout=0.)), it seems like you are performing attention first and then normalizing, aka post-norm. Please correct me if I'm wrong :)

    • @DeepFindr
      @DeepFindr 7 months ago +1

      Hi! The forward pass of the PreNorm layer contains this line:
      self.fn(self.norm(x), **kwargs)
      So normalization is applied first and then the function (such as attention in this example)
      The line you are referencing is just the initialization, not the actual call
      Hope that helps :)

    • @chinnum9716
      @chinnum9716 7 months ago

      @@DeepFindr stupid of me to not see that first. Thank you for the reply
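
    A minimal, self-contained version of the PreNorm wrapper being discussed; the Attention module here is a thin stand-in around nn.MultiheadAttention, not the notebook's implementation:

      import torch
      import torch.nn as nn

      class PreNorm(nn.Module):
          def __init__(self, dim, fn):
              super().__init__()
              self.norm = nn.LayerNorm(dim)
              self.fn = fn

          def forward(self, x, **kwargs):
              # Normalization happens first, then the wrapped function (e.g. attention).
              return self.fn(self.norm(x), **kwargs)

      class Attention(nn.Module):
          def __init__(self, dim, n_heads, dropout=0.):
              super().__init__()
              self.att = nn.MultiheadAttention(dim, n_heads, dropout=dropout, batch_first=True)

          def forward(self, x):
              return self.att(x, x, x)[0]

      norm = PreNorm(128, Attention(dim=128, n_heads=4, dropout=0.))
      out = norm(torch.randn(2, 65, 128))   # LayerNorm is applied before attention

    The wrapped function only ever sees the already-normalized input, which is what makes this pre-norm even though the Attention module is constructed first.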

  • @vero811
    @vero811 10 days ago

    I think there is some confusion between the CLS token and the positional embedding at 6:09?

  • @justsomeone3375
    @justsomeone3375 6 months ago

    Can someone help me with the training code in the Google Colab link in the description?

  • @user-px7zm2hy3c
    @user-px7zm2hy3c 6 months ago +2

    Hello, first of all, great tutorial video. I've tried running the provided training code, but after ~400 epochs the loss is still the same (~3.61) and the model always predicts the same class. Do you have an idea what the problem might be?

    • @DeepFindr
      @DeepFindr 6 months ago

      Hi, have you tried a lower learning rate? Also, is the training loss decreasing or also stuck?

    • @user-px7zm2hy3c
      @user-px7zm2hy3c 6 months ago +3

      @@DeepFindr Actually, I've already found one bug in the notebook: in the forward method of the Attention module, the input is passed directly to MultiheadAttention, bypassing the linear layers.
      Changing the learning rate doesn't affect training at all. Also, while training I've noticed that the model's output converges to all zeros.
      I've checked the gradients in the network, and it turns out that gradient flow stops at the PatchEmbedding layer. All layers after it have non-zero gradients. I still don't know why this happens.

    • @DeepFindr
      @DeepFindr 6 months ago

      Thanks for finding this bug. But I actually think it's not super relevant for this issue - I experimented with the attention previously and tried both ways (with and without the linear layers); that's how this bug was created in the first place.
      When I started the training back then, the loss was definitely decreasing, but I didn't expect it to get stuck at a plateau.
      Typically, when a model always predicts the same class, there can be a couple of reasons. I already checked these:
      - Input data is normalized
      - Too few / too many parameters (I would recommend counting the model parameters to get a feeling for this)
      - Learning rate
      - SGD optimizer (seems to work a bit better)
      - Batch size, I set it to 128
      - Embedding size, make it a bit smaller
      After 100 epochs the loss also converges to 3.61, but the model predicts different classes. Maybe the dataset is not big enough. What about trying another dataset? Alternatively, try data augmentation.
      As stated in the video, transformers need to see a lot of examples.
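
    A rough sketch of an attention module where the input does go through explicit q/k/v linear projections before nn.MultiheadAttention, addressing the bypassed-linear-layers bug mentioned above; the names and sizes are illustrative, not the notebook's exact code:

      import torch
      import torch.nn as nn

      class Attention(nn.Module):
          def __init__(self, dim, n_heads, dropout=0.):
              super().__init__()
              # Explicit projections so the input no longer bypasses the linear layers
              # (nn.MultiheadAttention additionally applies its own internal projections).
              self.q = nn.Linear(dim, dim)
              self.k = nn.Linear(dim, dim)
              self.v = nn.Linear(dim, dim)
              self.att = nn.MultiheadAttention(dim, n_heads, dropout=dropout, batch_first=True)

          def forward(self, x):
              return self.att(self.q(x), self.k(x), self.v(x))[0]

      x = torch.randn(2, 325, 128)          # (batch, tokens, emb_dim)
      print(Attention(128, 4)(x).shape)     # torch.Size([2, 325, 128])

    This only addresses the bypassed projections; the PatchEmbedding gradient issue mentioned above would still need to be debugged separately.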

  • @MrMadmaggot
    @MrMadmaggot 1 month ago

    Is the Colab using CUDA? If so, how can I tell whether it is using CUDA?
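
    One way to check this in any notebook, using standard PyTorch calls (not specific to this Colab):

      import torch
      import torch.nn as nn

      print(torch.cuda.is_available())   # True if a CUDA GPU is visible to the runtime
      device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

      # Models and tensors must be moved to the device explicitly, e.g.:
      layer = nn.Linear(8, 8).to(device)
      x = torch.randn(2, 8, device=device)
      print(layer(x).device)             # cuda:0 when the GPU is actually used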

  • @muhammadtariq7474
    @muhammadtariq7474 11 days ago

    Where can I get the slides used in the video?

  • @beratcokhavali
    @beratcokhavali 9 months ago +1

    At 05:08, how was that calculated? When I calculated the patch shape I got a different result. Could someone explain that?

    • @chandank5266
      @chandank5266 4 days ago

      Yes, exactly, I have the same doubt; for me it's 192 instead of 324.
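
    The two numbers in this thread are consistent with the two different quantities computed in the patching step; a small worked example, assuming 144x144 RGB images and a patch size of 8 (these sizes are an assumption, the video's exact values may differ):

      H, W, C, P = 144, 144, 3, 8

      num_patches = (H // P) * (W // P)   # 18 * 18 = 324 patches per image
      patch_dim   = P * P * C             # 8 * 8 * 3 = 192 values per flattened patch
      print(num_patches, patch_dim)       # 324 192

    Under that assumption, 324 would be the number of patches per image and 192 the length of each flattened patch vector.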

  • @abhinavvura4973
    @abhinavvura4973 11 days ago

    Hi there,
    I have used the code for binary classification, but I'm encountering a problem with accuracy: it shows 100%
    accuracy only on label 1 and sometimes on label 2. It would be helpful if you could provide a solution.

    • @DeepFindr
      @DeepFindr 6 days ago

      Hi, please see pinned comment. Maybe this helps :)

  • @cosminpetrescu860
    @cosminpetrescu860 4 months ago +2

    Why are the positional embeddings learnable? It doesn't make sense to me

    • @trendingtech4youth989
      @trendingtech4youth989 4 months ago

      Because the positional embedding represents the address, i.e. the position, of each image patch.

    • @cosminpetrescu860
      @cosminpetrescu860 4 months ago

      @@trendingtech4youth989 In "Attention Is All You Need", afaik, positional embeddings are not learnable.

    • @Omsip123
      @Omsip123 17 days ago

      @trendingtech4youth989 So they are given, like the patches; why should they be learned?

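    For comparison, a minimal sketch of both options discussed in this thread: learnable positional embeddings as in ViT and fixed sinusoidal embeddings as in "Attention Is All You Need". The token count and embedding size are illustrative:

      import torch
      import torch.nn as nn

      num_tokens, emb_dim = 325, 128   # e.g. 324 patches + 1 CLS token

      # ViT: positional embeddings are a trainable parameter, updated by backprop.
      pos_learnable = nn.Parameter(torch.randn(1, num_tokens, emb_dim))

      # Original Transformer: fixed sinusoidal embeddings, never trained.
      pos = torch.arange(num_tokens).unsqueeze(1)
      i = torch.arange(0, emb_dim, 2)
      angles = pos / (10000 ** (i / emb_dim))
      pos_fixed = torch.zeros(num_tokens, emb_dim)
      pos_fixed[:, 0::2] = torch.sin(angles)
      pos_fixed[:, 1::2] = torch.cos(angles)

      x = torch.randn(4, num_tokens, emb_dim)
      x = x + pos_learnable            # or: x + pos_fixed (broadcasts over the batch)

    ViT simply lets the optimizer learn a useful encoding of patch position instead of fixing one by hand.
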
  • @avirangal2044
    @avirangal2044 15 days ago

    The video is great, but the training in the code didn't work for the entire 1000 epochs. Although the code looks logical, there are endless things that can go wrong, so I think it would have been better to do the tutorial with a working ViT notebook.

    • @DeepFindr
      @DeepFindr 15 days ago

      Hi! I think this is because the dataset is too small. Transformers are data-hungry. It should work with a bigger dataset.

    • @DeepFindr
      @DeepFindr 2 days ago

      Also have a look at the pinned comment, maybe that helps :)

  • @coke_and_cake
    @coke_and_cake 2 months ago +1

    And now Sora uses the same algorithm. This video aged so well.

    • @simpleplant606
      @simpleplant606 2 months ago +3

      Sora is using DiT (Diffusion Transformer)