Vision Transformer - Keras Code Examples!!

  • Published 4 Jun 2024
  • This video walks through the Keras Code Example implementation of Vision Transformers!! I see this as a huge opportunity for graduate students and researchers because this architecture has serious room for improvement. I predict that Attention will outperform CNN models like ResNets, EfficientNets, etc.; it will just take the discovery of complementary priors, e.g. custom data augmentations or pre-training tasks. I hope you find this video useful, and please check out the rest of the Keras Code Examples playlist!
    Content Links:
    Keras Code Examples - Vision Transformers: keras.io/examples/vision/imag...
    Google AI Blog Visualization: ai.googleblog.com/2020/12/tra...
    Formal Paper describing this model: arxiv.org/pdf/2010.11929.pdf
    TensorFlow Addons: www.tensorflow.org/addons
    TensorFlow Addons - AdamW: www.tensorflow.org/addons/api...
    Chapters
    0:00 Welcome to the Keras Code Examples!
    0:45 Vision Transformer Explained
    2:47 TensorFlow Add-Ons
    3:29 Hyperparameters
    7:04 Data Augmentations
    8:30 Patch Construction
    11:52 Patch Embeddings
    14:01 ViT Classifier
    16:30 Compile and Run
    19:02 Analysis of Final Performance
  • Science & Technology

COMMENTS • 61

  • @artukikemty
    @artukikemty 1 month ago

    Amazing. Few people can even do this kind of line-by-line explanation; a great contribution to democratizing AI knowledge!

  • @NehadHirmiz
    @NehadHirmiz 3 years ago

    Thank you very much for these amazing videos. Your contribution is key to the applications of these methods.

  • @sz4746
    @sz4746 2 years ago +1

    It's so easy to implement ViT. Before, I was afraid of using those big models because I thought they would be hard to implement, but Keras and PyTorch do have MultiHeadAttention built in!
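
    A minimal sketch of the built-in layer the comment refers to; the shapes (batch of 2, 144 patches, 64-dim embeddings) are illustrative, not taken from the video:

    ```python
    import tensorflow as tf
    from tensorflow.keras import layers

    x = tf.random.normal((2, 144, 64))                 # (batch, patches, dim)
    mha = layers.MultiHeadAttention(num_heads=4, key_dim=64)
    out = mha(query=x, value=x, key=x)                 # self-attention
    print(out.shape)                                   # (2, 144, 64)
    ```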

  • @sendjasniabderrezzaq9347
    @sendjasniabderrezzaq9347 2 years ago +1

    Hi,
    Thank you for the explanation.
    I have a question regarding the variable `position_dim`: how was it chosen? If I change the patch size, do I need to change that too?

  • @vl7283
    @vl7283 2 years ago

    Great job!! Quick question: I see that the labels in both CSV files are different from the previous CNN vision CSV files. Is this because the data needs to be encoded? By any chance, do you know how to encode it? If not, that's OK. Thanks for your videos!

  • @sinancalsr726
    @sinancalsr726 3 years ago +6

    Hi, thanks for the video :)
    At 10:35, I guess the -1 comes from the number of patches. For example, if batch_size=2, the output dimension of the tf.reshape function will be 2x144x108, since there are 144 patches inside the 72x72 image (patch_size=6). Also, in the plotting loop, we are looping through the second dimension, which has 144 elements.
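
    A quick sketch confirming this arithmetic, using tf.image.extract_patches as the Keras example does (batch size and image size taken from the comment):

    ```python
    import tensorflow as tf

    images = tf.random.normal((2, 72, 72, 3))   # batch_size=2, 72x72 RGB
    patch_size = 6
    patches = tf.image.extract_patches(
        images=images,
        sizes=[1, patch_size, patch_size, 1],
        strides=[1, patch_size, patch_size, 1],
        rates=[1, 1, 1, 1],
        padding="VALID",
    )                                            # (2, 12, 12, 108)
    patches = tf.reshape(patches, [2, -1, patch_size * patch_size * 3])
    print(patches.shape)                         # (2, 144, 108): 144 patches of 6*6*3 values
    ```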

  • @sayakpaul3152
    @sayakpaul3152 3 years ago +5

    -1 inside reshaping is a handy, neat trick. Let's say you want to flatten a tensor of shape (batch_size, 512, 16). You can easily do that with something like tf.reshape(your_tensor, (batch_size, -1)). You don't need to explicitly specify the flattened dimensions, as in the example below.
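
    A runnable version of that example:

    ```python
    import tensorflow as tf

    batch_size = 4
    t = tf.random.normal((batch_size, 512, 16))
    flat = tf.reshape(t, (batch_size, -1))  # -1 is inferred as 512 * 16 = 8192
    print(flat.shape)                       # (4, 8192)
    ```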

  • @JoseMiguel_____
    @JoseMiguel_____ 2 years ago

    Great explanation! Keep doing this.

  • @abdurrahmansefer2548
    @abdurrahmansefer2548 1 year ago +1

    Hello, thanks! But I want to ask a question:
    in the input section (extra learnable [class] embedding),
    what is the zero (0) index used for and what information does it contain?

  • @nitishsingla9057
    @nitishsingla9057 3 years ago

    I have checked the GitHub link given in the original paper. Is this Keras code different from what is mentioned at the GitHub link?

  • @mahdiyehbasereh
    @mahdiyehbasereh 7 months ago

    It was very helpful, thanks a lot.

  • @annicetrazafindratovolahy1512
    @annicetrazafindratovolahy1512 1 year ago +1

    Hello! Please, can you do a video on how to use a Swin Transformer in an autoencoder architecture? Thank you in advance. I have difficulty restoring the patches into an image (for the decoder part).

  • @sayakpaul3152
    @sayakpaul3152 3 years ago +10

    When you specify `from_logits=True`, softmax is first applied and then cross-entropy is taken.

    • @connorshorten6311
      @connorshorten6311  3 years ago +2

      Thanks again Sayak, really appreciate it!

    • @CristianGarcia
      @CristianGarcia 3 years ago +4

      This is the main idea, but internally "log_softmax" is used for performance. Actually, if you pass from_logits=False, Keras turns the output of the softmax back into logits via log:
      github.com/tensorflow/tensorflow/blob/85c8b2a817f95a3e979ecd1ed95bff1dc1335cff/tensorflow/python/keras/backend.py#L4908

    • @sayakpaul3152
      @sayakpaul3152 3 years ago +1

      Yes, totally correct. I didn't mention it for simplicity. But giving it another thought, I should have been clearer in my answer. Thank you!

    • @santhoshckumar7367
      @santhoshckumar7367 1 year ago

      Appreciate your additional clarification. Thanks.
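
      A small sketch of the equivalence this thread describes (logit and label values are illustrative):

      ```python
      import tensorflow as tf

      logits = tf.constant([[1.5, 3.5, 2.5]])
      labels = tf.constant([1])

      from_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
      from_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

      # Both print ~0.408: the loss applies (log-)softmax internally when
      # from_logits=True, matching an explicit softmax with from_logits=False.
      print(from_logits(labels, logits).numpy())
      print(from_probs(labels, tf.nn.softmax(logits)).numpy())
      ```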

  • @billiartag
    @billiartag 3 years ago

    Might be a stupid question, but how do I visualize the attention? I'm honestly confused about extracting the attention.

  • @CristianGarcia
    @CristianGarcia 3 years ago +2

    Since TF 2.0 you can use the regular plus (+) operator instead of the Add layer.
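
    A minimal sketch of the two equivalent spellings (tensor names are illustrative stand-ins for a residual connection):

    ```python
    import tensorflow as tf
    from tensorflow.keras import layers

    x = tf.random.normal((2, 144, 64))
    attn = tf.random.normal((2, 144, 64))      # stand-in attention output

    via_layer = layers.Add()([attn, x])        # explicit Add layer
    via_plus = attn + x                        # equivalent since TF 2.0
    print(tf.reduce_all(tf.equal(via_layer, via_plus)).numpy())  # True
    ```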

  • @pakistanproud8123
    @pakistanproud8123 2 years ago +1

    Can anybody explain this paragraph to me:
    Unlike the technique described in the paper, which prepends a learnable embedding to the sequence of encoded patches to serve as the image representation, all the outputs of the final Transformer block are reshaped with layers.Flatten() and used as the image representation input to the classifier head.
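
    A minimal sketch of the difference that paragraph describes (tensor names are illustrative; `encoded` stands for the final Transformer block output):

    ```python
    import tensorflow as tf
    from tensorflow.keras import layers

    encoded = tf.random.normal((2, 144, 64))   # (batch, num_patches, projection_dim)

    # Keras example: keep ALL patch outputs and flatten to one vector per image.
    representation = layers.Flatten()(encoded)  # (2, 144 * 64) = (2, 9216)

    # Paper: a learnable [class] token is prepended to the patch sequence before
    # the Transformer, and only that token's output (index 0) is read out.
    cls_representation = encoded[:, 0, :]       # (2, 64)
    ```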

  • @Bomerang23
    @Bomerang23 1 year ago

    Maybe it's a silly question, but does ViT work on grayscale pictures?

  • @jayakrishnankv1681
    @jayakrishnankv1681 2 years ago

    Thank you for the video.

  • @DiogoSanti
    @DiogoSanti 3 years ago +3

    Cool job... For the "from_logits=True" part: the loss expects only the logits (without the softmax activation); SparseCategoricalCrossentropy will apply softmax for you with that option...
    Just be careful: if people set from_logits to True and still apply the softmax at the end of their network, the loss function will apply softmax (again) to what is already a probability distribution.

    • @connorshorten6311
      @connorshorten6311  3 years ago +1

      Thank you so much for the clarification, really appreciate it! What would be the major problem with double softmaxes? I guess slow computation and a massive blowup of large densities come to mind.

    • @DiogoSanti
      @DiogoSanti 3 years ago

      @@connorshorten6311 Happy i could help, thanks for all the good content!

    • @LiveLifeWithLove
      @LiveLifeWithLove 1 year ago

      @@connorshorten6311 Softmax does two things: one, it makes the sum equal to 1 (a probability distribution); the other, it pulls far-apart logits closer together. So if you apply it once, far-apart logits are transformed to values that are relatively near each other and form a probability distribution, but still maintain nice separation. Apply it again and it brings the outputs even closer; apply it yet again and they will be so near that you won't be able to find the pattern. Logits X = (1.5, 3.5, 2.5), X1 = softmax(X) = (0.09, 0.67, 0.24), X2 = softmax(X1) = (0.25, 0.45, 0.30)
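
      A quick numeric check of that example:

      ```python
      import tensorflow as tf

      x = tf.constant([1.5, 3.5, 2.5])
      x1 = tf.nn.softmax(x)    # ~[0.09, 0.67, 0.24]: still well separated
      x2 = tf.nn.softmax(x1)   # ~[0.25, 0.45, 0.30]: squeezed toward uniform
      print(x1.numpy(), x2.numpy())
      ```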

  • @khalladisofiane9195
    @khalladisofiane9195 1 year ago

    Please, I have a custom dataset with 3 folders, hence 3 classes. How can I use the ViT to do classification?

  • @sayakpaul3152
    @sayakpaul3152 3 years ago +1

    I second your thoughts on complementary priors. In fact, BoTNets, IMO, are a step in that direction. DeiT as well.

    • @connorshorten6311
      @connorshorten6311  3 years ago +1

      Thanks Sayak! Yeah, DeiT's distillation with the CNN activations is incredibly interesting. I think the large-scale data pre-training could be a complementary prior with respect to the global aggregation thing and just needing a lot of data to get a sense of that. I hope data augmentations can also be customized to the global prior vs. local prior in CNNs.

    • @sayakpaul3152
      @sayakpaul3152 3 years ago +1

      @@connorshorten6311 Yes, seconded.
      As I mentioned earlier along those lines, BoTNet seems to be a really good proposal not only for image classification but for other tasks as well (instance segmentation, object detection) where modeling long-range dependencies is crucial.

  • @chaymaebenhammacht1618
    @chaymaebenhammacht1618 1 year ago

    Hi, thank you for this video, it's very useful. But I found some problems when I used this model to do my own image classification on multiple malware classes. I tried many times to solve the problem, but unfortunately couldn't. Can you help me, please?

  • @user-xw9cp3fo2n
    @user-xw9cp3fo2n 2 years ago +1

    Your explanation is amazing, thank you very much, but I want to ask a question: what is the projection dimension, and why is it 64 when there are 144 patches per image and the index runs from 0 to 143? Thank you very much again for your attention!

    • @connorshorten6311
      @connorshorten6311  2 years ago +1

      Thank you! The projection dimension is analogous to the embedding dimension in, say, word embeddings or any kind of categorical encoding. In the end you transform the feature set into a 144 x 64 representation, with 64 dimensions encoding each of the 144 patches.

    • @lifted1785
      @lifted1785 2 years ago +1

      Was the pun intended at the end? 😂 funny
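
      A sketch of the projection the reply describes, mirroring the Keras example's PatchEncoder (the 108 patch values assume flattened 6x6 RGB patches):

      ```python
      import tensorflow as tf
      from tensorflow.keras import layers

      num_patches, projection_dim = 144, 64
      patches = tf.random.normal((2, num_patches, 108))   # flattened 6x6x3 patches

      projection = layers.Dense(projection_dim)
      position_embedding = layers.Embedding(input_dim=num_patches,
                                            output_dim=projection_dim)

      positions = tf.range(start=0, limit=num_patches)    # indices 0..143
      encoded = projection(patches) + position_embedding(positions)
      print(encoded.shape)                                # (2, 144, 64)
      ```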

  • @isaacbaffoursenkyire1018
    @isaacbaffoursenkyire1018 2 years ago

    Hello,
    Thanks for the video.
    I have a question, please. I have written the exact same lines of code in Google Colab, but I don't get any results, i.e., after running def run_experiment(model) I don't get any results (the epochs with the accuracy). Is there anything I am not doing right?

  • @jason-yb9qk
    @jason-yb9qk 1 year ago

    Guys, how do I modify the code so I can use a dataset from Kaggle?

  • @yaswanth1679
    @yaswanth1679 3 years ago +6

    Can we implement this ViT on our own dataset?

  • @javaqtquicktutorials1131
    @javaqtquicktutorials1131 2 years ago +1

    Hi sir, can I use this code on a custom dataset?

    • @connorshorten6311
      @connorshorten6311  2 years ago

      Yes, be mindful of the resolution size and how that changes the hard-coded parameters for the patching -- it can be a bit tricky. I recommend borrowing that same matplotlib code to plot the patchings to make sure you did it correctly, as in the sketch below.
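
      A sketch of that sanity check (image_size and patch_size are placeholders for your own dataset's values; a random image stands in for real data):

      ```python
      import matplotlib.pyplot as plt
      import tensorflow as tf

      image_size, patch_size = 72, 6                    # adjust to your data
      image = tf.random.uniform((1, image_size, image_size, 3))  # stand-in image
      patches = tf.image.extract_patches(
          images=image,
          sizes=[1, patch_size, patch_size, 1],
          strides=[1, patch_size, patch_size, 1],
          rates=[1, 1, 1, 1],
          padding="VALID",
      )
      n = image_size // patch_size
      patches = tf.reshape(patches, (n * n, patch_size, patch_size, 3))
      plt.figure(figsize=(4, 4))
      for i in range(n * n):                            # draw each patch in a grid
          plt.subplot(n, n, i + 1)
          plt.imshow(patches[i])
          plt.axis("off")
      plt.show()
      ```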

  • @adizhol
    @adizhol 3 years ago

    Where is the CLS token read?

  • @khalladisofiane9195
    @khalladisofiane9195 1 year ago

    How can I use this code on my custom data with 3 classes, please?

    • @draaken0
      @draaken0 10 months ago

      Just change the image size input and num_classes=3. Also, you can play with the patch size according to your image shape.
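
      A sketch of those changes against the Keras example's hyperparameters (values here are for a hypothetical 3-class dataset):

      ```python
      num_classes = 3                                 # was 100 for CIFAR-100
      input_shape = (128, 128, 3)                     # match your images
      image_size = 72                                 # images are resized to this
      patch_size = 6                                  # keep image_size % patch_size == 0
      num_patches = (image_size // patch_size) ** 2   # 144 here
      ```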

  • @suke933
    @suke933 2 years ago

    Hi Henry, could you kindly explain how it can be used for binary classification problems?

  • @AbdennacerAyeb
    @AbdennacerAyeb 3 years ago +2

    We miss the weekly updates in AI...

    • @connorshorten6311
      @connorshorten6311  3 years ago +2

      Thank you so much for your interest in the series! I’m hoping to get back to it soon

    • @dome8116
      @dome8116 3 years ago

      @@connorshorten6311 Yes, please bring them back, sir!

  • @m.hassan8142
    @m.hassan8142 3 years ago

    I came here from the BERT model.

  • @squirrel4635
    @squirrel4635 1 year ago +1

    How much coffee did you drink?

  • @WahranRai
    @WahranRai 2 years ago +2

    Too much animation; what about reducing your speed and letting us examine the slides?

  • @graceln2480
    @graceln2480 2 years ago

    Too fast; illustrations with figures as you explain would be more useful.

  • @yifeipei5484
    @yifeipei5484 3 years ago +1

    If you want to explain the code, you should figure out every part of it. For "from_logits", if you didn't know what it was, you should have looked it up in the TensorFlow API reference before the tutorial. However, you didn't, and that was very lazy.

  • @khalladisofiane9195
    @khalladisofiane9195 2 years ago

    Hi, thanks! Can you help me? I want to use ViT on my custom dataset for classification. Please, can I get your email?
