Vision Transformer - Keras Code Examples!!
- Published 4 Jun 2024
- This video walks through the Keras Code Example implementation of Vision Transformers! I see this as a huge opportunity for graduate students and researchers because this architecture has serious room for improvement. I predict that Attention will outperform CNN models like ResNets, EfficientNets, etc.; it will just take the discovery of complementary priors, e.g. custom data augmentations or pre-training tasks. I hope you find this video useful; please check out the rest of the Keras Code Examples playlist!
Content Links:
Keras Code Examples - Vision Transformers: keras.io/examples/vision/imag...
Google AI Blog Visualization: ai.googleblog.com/2020/12/tra...
Formal Paper describing this model: arxiv.org/pdf/2010.11929.pdf
TensorFlow Addons: www.tensorflow.org/addons
TensorFlow Addons - AdamW: www.tensorflow.org/addons/api...
Chapters
0:00 Welcome to the Keras Code Examples!
0:45 Vision Transformer Explained
2:47 TensorFlow Add-Ons
3:29 Hyperparameters
7:04 Data Augmentations
8:30 Patch Construction
11:52 Patch Embeddings
14:01 ViT Classifier
16:30 Compile and Run
19:02 Analysis of Final Performance
- Science & Technology
Amazing, few people can explain this line by line; great contribution to democratizing AI knowledge!
Thank you very much for these amazing videos. Your contribution is key to the applications of these methods.
It's so easy to implement ViT. I used to be afraid of using those big models because I thought they would be hard to implement, but Keras and PyTorch do have MultiHeadAttention as a built-in layer!
Hi,
Thank you for the explanation.
I have a question regarding the variable `position_dim`: how was it chosen? If I change the patch size, do I need to change that too?
Great job!! Quick question: I see that the labels in both CSV files are different from the previous CNN vision CSV files. Is this because the data needs to be encoded? By any chance, do you know how to encode it? If not, that's okay, thanks for your videos!
Hi, thanks for the video :)
At 10:35, I guess the -1 comes from the number of patches. For example, if batch_size=2, the output dimension of the tf.reshape call will be 2x144x108, since there are 144 patches in a 72x72 image (patch_size=6). Also, in the plotting loop we loop through the second dimension, which has 144 elements.
Thank you so much for the clarification on this!
-1 inside reshaping is a handy trick. Let's say you want to flatten a tensor of shape (batch_size, 512, 16). You can easily do that with something like tf.reshape(your_tensor, (batch_size, -1)). You don't need to explicitly specify the flattened dimension.
Thanks Sayak! I was really confused about that haha
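To make the -1 trick from this thread concrete, here is a tiny sketch using NumPy's reshape (tf.reshape behaves the same way for this case); the shapes mirror the ViT example: a 72x72 image with patch_size=6 gives 144 patches of 6*6*3 = 108 values each.

```python
import numpy as np

# Toy shapes matching the thread: (batch, num_patches, patch_dim)
batch_size, num_patches, patch_dim = 2, 144, 108

patches = np.zeros((batch_size, num_patches, patch_dim))

# Flatten everything after the batch dimension without computing 144*108 by hand:
# the -1 tells reshape to infer the remaining dimension.
flat = patches.reshape(batch_size, -1)
print(flat.shape)  # (2, 15552)
```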
great explanation! keep doing this
Hello, thanks! I want to ask a question: in the input section (the extra learnable [class] embedding), what is the zero (0) index used for, and what information does it contain?
I have checked the GitHub link given in the original paper. Is this Keras code different from what is in that GitHub link?
It was very helpful, thanks a lot!
Hello! Please, can you do a video on how to use Swin Transformer in an autoencoder architecture? Thank you in advance. I have difficulty restoring the patches into an image (for the decoder part).
When you specify `from_logits=True`, softmax is first applied internally and then the cross-entropy is taken.
Thanks again Sayak, really appreciate it!
This is the main idea, but internally "log_softmax" is used for performance. Actually, if you pass from_logits=False, Keras turns the output of the softmax back into logits via log:
github.com/tensorflow/tensorflow/blob/85c8b2a817f95a3e979ecd1ed95bff1dc1335cff/tensorflow/python/keras/backend.py#L4908
Yes, totally correct. I didn't mention it for simplicity. But giving it another thought, I should have been clearer in my answer. Thank you!
Appreciate your additional clarification. Thanks!
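A quick NumPy sketch of the log trick discussed above: taking the log of softmax outputs recovers the original logits up to a constant shift, which is all cross-entropy needs since it is shift-invariant in the logits.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([1.5, 3.5, 2.5])
probs = softmax(logits)

# log(probs) gives the logits shifted by a constant (-logsumexp of the logits),
# so the gaps between entries are preserved exactly.
recovered = np.log(probs)
print(recovered - recovered[0])  # ~ [0., 2., 1.], same gaps as the original logits
```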
Might be a stupid question, but how do you visualize the attention? I'm honestly confused about extracting the attention.
Since TF 2.0 you can use the regular plus (+) operator instead of the Add layer.
Thanks! Definitely cleans it up a bit
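A minimal sketch of what that cleanup looks like for the residual (skip) connections in the ViT encoder, using plain arrays as a stand-in for Keras tensors; in Keras since TF 2.0, `sublayer(x) + x` on tensors builds the same graph op as `layers.Add()([sublayer(x), x])`.

```python
import numpy as np

def residual_block(x, sublayer):
    # Skip connection written with the plain + operator, as the comment suggests.
    return sublayer(x) + x

# Shapes mirror the ViT encoder: (batch, num_patches, projection_dim)
x = np.ones((2, 144, 64))
out = residual_block(x, lambda t: t * 0.5)
print(out.shape)  # (2, 144, 64)
```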
Can anybody explain this paragraph to me:
Unlike the technique described in the paper, which prepends a learnable embedding to the sequence of encoded patches to serve as the image representation, all the outputs of the final Transformer block are reshaped with layers.Flatten() and used as the image representation input to the classifier head.
Maybe it's a silly question, but does ViT work on grayscale pictures?
thank you for the video
Cool job... For the "from_logits=True" part: the loss expects only the logits (without the softmax activation); SparseCategoricalCrossentropy will apply the softmax for you with that option...
Just be careful: if people set from_logits to True and still apply the softmax at the end of their network, the loss function (with its softmax) will be applied to what is already a probability distribution.
Thank you so much for the clarification, really appreciate it! What would be the major problem with double softmaxes? I guess slow computation and a massive blowup of large densities come to mind.
@@connorshorten6311 Happy i could help, thanks for all the good content!
@@connorshorten6311 Softmax does two things: it makes the sum equal to 1 (a probability distribution), and it pulls far-apart logits closer together. So if you apply it once, far-apart logits are transformed into values which are relatively near and form a probability distribution, but still maintain nice separation. If you apply it again, it will bring the outputs even closer; apply it again and they will be so near that you won't be able to find the pattern. Logits X = (1.5, 3.5, 2.5), X1 = softmax(X) ≈ (0.09, 0.67, 0.24), X2 = softmax(X1) ≈ (0.25, 0.45, 0.30).
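The numbers in that comment can be checked with a few lines of NumPy; the spread of the values visibly shrinks with each extra softmax application.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # max-shift for numerical stability
    return e / e.sum()

x = np.array([1.5, 3.5, 2.5])
x1 = softmax(x)    # ~ [0.09, 0.67, 0.24]: still clearly separated
x2 = softmax(x1)   # ~ [0.25, 0.45, 0.30]: separation shrinks
x3 = softmax(x2)   # even closer together

# The max-min spread collapses with each application:
print(x1.max() - x1.min(), x2.max() - x2.min(), x3.max() - x3.min())
```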
Please, I have a custom dataset with 3 folders, i.e. 3 classes. How can I use the ViT to do classification?
I second your thoughts on complementary priors. In fact, BotNets, IMO, are a step in that direction. DeiT as well.
Thanks Sayak! Yeah, DeiT's distillation with the CNN activations is incredibly interesting. I think the large-scale data pre-training could be a complementary prior with respect to the global aggregation thing and just needing a lot of data to get a sense of that. I hope data augmentations can also be customized to the global prior vs. local prior in CNNs.
@@connorshorten6311 yes, seconded.
As I mentioned earlier in those lines, BotNet seems to be a really good proposal not only for image classification but for other tasks (instance segmentation, object detection) as well, where modeling long-range dependencies is crucial.
Hi, thank you for this video, it's very useful. But I found some problems when I used this model to do my own image classification on multiple malware classes. I tried many times to solve the problem, but unfortunately couldn't. Can you help me, please?
Your explanation is amazing, thank you very much. But I want to ask a question: what is the projection dimension, and why is it 64 when there are 144 patches per image (indexed 0 to 143)? Thank you again for your attention!
Thank you! The projection dimension is analogous to the embedding dimension in, say, word embeddings or any kind of categorical encoding. In the end you transform the feature set into a 144 x 64 representation, with 64 dimensions encoding each of the 144 patches.
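A hedged sketch of that projection step with NumPy shapes as a stand-in for the Keras layers: each of the 144 flattened patches (108 values for a 6x6x3 patch) is linearly mapped to projection_dim=64, like the Dense(units=projection_dim) layer in the Keras example, and a positional embedding per patch index is added.

```python
import numpy as np

num_patches, patch_dim, projection_dim = 144, 108, 64
rng = np.random.default_rng(0)

patches = rng.normal(size=(num_patches, patch_dim))   # (144, 108) flattened patches
W = rng.normal(size=(patch_dim, projection_dim))      # stand-in for the learned Dense weights
embedded = patches @ W                                # (144, 64)

# One 64-dim positional embedding per patch index 0..143, added elementwise
positions = rng.normal(size=(num_patches, projection_dim))
encoded = embedded + positions
print(encoded.shape)  # (144, 64)
```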
Was the pun intended at the end? 😂 funny
Hello,
Thanks for the video.
I have a question, please. I have written the same exact lines of code in Google Colab, but I don't get any results, i.e. after running run_experiment(model) I don't see the epochs with the accuracy. Is there anything I am not doing right?
did you find a solution to this problem ?
@@ferdoussedjamai1954 No please
Guys, how do I modify the code so I can use a dataset from Kaggle?
Can we implement this ViT on our own dataset?
did you try to do it ?
Hi sir, can I use this code on a custom dataset?
Yes, be mindful of the resolution size and how that changes the hard coded parameters for the patching -- can be a bit tricky, I recommend borrowing that same matplotlib code to plot the patchings to make sure you did it correctly.
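Before borrowing the matplotlib plotting code, a quick NumPy sanity check like the following (a sketch, not the example's actual tf.image.extract_patches pipeline) can confirm the patch size divides the resolution and that the patch shapes come out right.

```python
import numpy as np

image_size, patch_size, channels = 72, 6, 3
assert image_size % patch_size == 0, "patch_size must divide image_size evenly"

num_patches = (image_size // patch_size) ** 2  # 144 for 72/6
image = np.zeros((image_size, image_size, channels))

# Split into non-overlapping patches, each flattened to patch_size*patch_size*channels values
grid = image_size // patch_size
patches = (image
           .reshape(grid, patch_size, grid, patch_size, channels)
           .transpose(0, 2, 1, 3, 4)
           .reshape(num_patches, patch_size * patch_size * channels))
print(patches.shape)  # (144, 108)
```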
Where is the CLS token read?
How can I use this code on my custom data with 3 classes, please?
Just change the image size input and set num_classes=3. Also, you can play with the patch size according to your image shape.
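As a hedged sketch, here are the hyperparameters that advice touches for a hypothetical 3-class dataset of 96x96 images (variable names follow the Keras example; the exact values are illustrative assumptions).

```python
# Hypothetical custom-dataset settings, following the Keras example's names
num_classes = 3            # was 100 for CIFAR-100
input_shape = (96, 96, 3)  # your dataset's native resolution (assumed here)
image_size = 72            # images are resized to this before patching
patch_size = 6             # must divide image_size evenly
num_patches = (image_size // patch_size) ** 2  # 144
```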
Hi Henry, could you kindly explain how it can be used for binary classification problems?
We miss the weekly updates in AI...
Thank you so much for your interest in the series! I’m hoping to get back to it soon
@@connorshorten6311 yes please bring them back sir
I came here from bert model
How much coffee did you drink?
Too much animation. What about reducing your speed and letting us examine the slides?
Too fast; illustrations with figures as you explain would be more useful.
If you want to explain the code, you should understand every part of it. For "from_logits", if you didn't know it, you should have looked it up in the TensorFlow API reference before the tutorial. However, you didn't, and that was very lazy.
Hi, thanks! Can you help me? I want to use ViT on my custom dataset for classification. Please, can I get your email?