Errata:
* Lines 217/218 of `custom.py`: the shape should be (n_samples, n_patches + 1, out_features)
This is awesome! Glad this got recommended, will watch this later 👍
Appreciate the message! I hope you will like it:) BTW you are creating great content! Keep it up!
@@mildlyoverfitted Yeah I liked it :)
@devstuff2576 Put it into ChatGPT.
LOVE this live coding channel format! I always find it much easier to understand a paper when I see a simple implementation and this makes it even easier! Keep it up!
Thank you! It is funny that you say that because I am exactly like you!
A great channel about PyTorch. I like the way you carefully explain the meaning of each function. It encourages me to get my hands dirty. Thank you, and I'm looking forward to seeing more videos from you.
Great to hear!
Excellent presentation, thank you for sharing!
A few reasons why I enjoyed the video:
1. < 30min, no fluff, no typos, no BS, good consistent pace. Everyone is busy, staying under 30min is extremely helpful and will force you to optimize your time!
2. Useful sidebars; break down key concepts into useful nuggets. Very helpful
3. Chose a popular topic, based on one of the best repos, and gave a nice intro
4. Stay with pytorch please, no one likes tensorflow ;)
I look forward to more of your work.
Thank you
Thank you very much:) Very encouraging and motivating comment!
No one likes tensorflow, haha. Strongly agree with you.
A much-needed video. I went through several iterations of the paper and supplementary videos online explaining it. I always had some residual doubts and didn't understand it with pinpoint accuracy. After this video everything is now clear!!
I appreciate your comment!
❤️❤️ Absolutely fantastic presentation. This cured my depression after 5 days of banging my head against the wall.
The pace of this video is so ideal.
One suggestion I want to propose is to add the network architecture figure/diagram from the paper while writing the code, so it's easier for new ML/DL coders to understand.
Keep it up. Looking forward to more. Amazing work. ❤️thank you so much ❤️
Heh:) Thank you for the kind words! That is a great suggestion actually!
@@mildlyoverfitted Yes the diagram might be very helpful!
Haven't watched the full video yet, but I like the way you explain things clearly with the IPython demos, for a beginner like me. Nice video!
Appreciate it! Nice to know that you enjoyed the ipython segments:) I will definitely try to include them in my future videos too!
Thank you so much for helping me to understand ViT!! Great work
Happy to help!
Amazing clarity. Your tutorial is gold!! Great work.
Glad you enjoyed it!
Beautiful code, wonderful explanations to follow along. Thanks for taking the extra time to look at some of the essential concepts in iPython. Superb content!
Much appreciated!
Thank you for uploading this video, it made me learn a lot and get more familiar with the ViT model.
Glad you enjoyed it!
@@mildlyoverfitted After some study, I realized that the ViT model is actually the encoder part of a Transformer. May I expect an introduction to the decoder part or a complete seq2seq model in the future? 🤣
Besides, I was surprised that the ViT implementation was completed without using nn.MultiheadAttention or nn.Transformer. Isn't it more convenient?
@@yichehsieh243 Good question actually. I guess one of the goals of this channel is to do things "from scratch" mostly for educational purposes. However, in real life I would always go for well maintained libraries rather than reinventing the wheel and reimplementing everything.
Thank you for helping me understand ViT! It's a great and kind Video!!
Much appreciated!
Thank you!!!!! Super useful. Before, I knew how drop out works but I didn't know how pytorch handle it .
Glad it helped!
Excellent material. Thanks for preparing and sharing it! Keep up the good work.
Thank you for watching the video:)
Thank you! It's best video about VIT for understanding.
Appreciate your comment!
@@mildlyoverfitted In the PyTorch transformer (torch.nn.modules.transformer.py), q, k and v are all set to x. That was a discovery for me, and it gives better convergence of the net. I didn't know that until yesterday.
# self-attention block
def _sa_block(self, x: Tensor,
              attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:
    x = self.self_attn(x, x, x,
                       attn_mask=attn_mask,
                       key_padding_mask=key_padding_mask,
                       need_weights=False)[0]
    return self.dropout1(x)
This method passes 'x' to the MultiheadAttention class in torch.nn.modules.activation.py
Thank you so much.. this is a lifesaver! Bless you my friend!
I'm so glad that i found this channel , you are a gem :) !!
Thank you! I appreciate it:)
I subscribed owing to such a clean, well-explained implementation. I love how you comment the code and check shapes on the go. Please make a video on your approach to implementing papers.
Great to hear that! I guess it is easier to take an existing code and modify it rather than starting from scratch:)
Just found your amazing channel. I love it, pls continue.
Thank you for the kind message! I will definitely continue:)
Fantastic video, just a quick note: at 16:01 you say that "none of the operations are changing the shape of the tensor", but isn't this wrong? When applying fc2, the last dim should be out_features, not hidden_features, so the shape comments are also wrong.
Nice find and sorry for the mistake:)! Somebody already pointed it out a while ago:) Look at the pinned errata comment:)
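For reference, a minimal sketch of the MLP forward pass with the corrected shape comments (the fc1/act/fc2 naming and the ViT-Base sizes below are just assumptions for illustration):

import torch
import torch.nn as nn

in_features, hidden_features, out_features = 768, 3072, 768   # hypothetical ViT-Base sizes
fc1 = nn.Linear(in_features, hidden_features)
act = nn.GELU()
fc2 = nn.Linear(hidden_features, out_features)

x = torch.randn(2, 197, in_features)   # (n_samples, n_patches + 1, in_features)
x = act(fc1(x))                        # (n_samples, n_patches + 1, hidden_features)
x = fc2(x)                             # (n_samples, n_patches + 1, out_features)
print(x.shape)                         # torch.Size([2, 197, 768])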
ah i see, my bad :D @@mildlyoverfitted
I don't understand the reason why you use an nn.Conv2d layer in the patch embedding module at 2:53. In my mind, I would only use nn.Linear(in_channels, out_channels). Can you explain it?
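(For anyone with the same question: a Conv2d whose kernel_size and stride both equal the patch size applies one linear map per non-overlapping patch, so it does the same job as flattening each patch and feeding it to an nn.Linear, just without slicing the image manually. A quick sketch of the equivalence; the image size, patch size and embedding dim here are hypothetical:)

import torch
import torch.nn as nn

img = torch.randn(1, 3, 32, 32)   # tiny image
patch, dim = 16, 8                # patch size and embedding dim

conv = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
lin = nn.Linear(3 * patch * patch, dim)
with torch.no_grad():             # reuse the conv weights so both paths match exactly
    lin.weight.copy_(conv.weight.reshape(dim, -1))
    lin.bias.copy_(conv.bias)

out_conv = conv(img).flatten(2).transpose(1, 2)                  # (1, n_patches, dim)

patches = img.unfold(2, patch, patch).unfold(3, patch, patch)    # (1, 3, 2, 2, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)
out_lin = lin(patches)                                           # (1, n_patches, dim)

print(torch.allclose(out_conv, out_lin, atol=1e-6))              # True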
Thanks !!! I like your clear way of explanation
You are welcome!
very helpful video, please make a similar video explaining the decoder architecture as well
Thank you for the idea!
shape of v : (n_samples, n_heads, n_patches + 1, head_dim)
shape of atten : (n_samples, n_heads, n_patches + 1, n_patches + 1)
How can you multiply these two tensors?
And how is the result's shape the same as v's?
Please explain. BTW great content. Glad I found this channel.
atten @ v can be done: (n_samples, n_heads, n_patches + 1, n_patches + 1) @ (n_samples, n_heads, n_patches + 1, head_dim) = (n_samples, n_heads, n_patches + 1, head_dim). For example, let's say you have two tensors with shapes (2, 2, 5, 5) and (2, 2, 5, 3); then the output will be (2, 2, 5, 3).
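A quick shape check in PyTorch (the sizes are the same placeholders as above):

import torch

atten = torch.randn(2, 2, 5, 5)   # (n_samples, n_heads, n_patches + 1, n_patches + 1)
v = torch.randn(2, 2, 5, 3)       # (n_samples, n_heads, n_patches + 1, head_dim)
print((atten @ v).shape)          # torch.Size([2, 2, 5, 3]); matmul broadcasts over the leading dims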
@@suleymanerim2119 sorry my bad. I was doing v @ atten instead of atten @ v. Thanks anyway
Great video saw this posted on the Artificial Intelligence and deep learning group on facebook
Thank you!!
Really like the videos on the channel, keep them coming. I knew I had to subscribe just a few minutes into the video.
Hehe:) Much appreciated.
I like the shape-checking part and your vim usage; using old-style vim just shows your ability to play around with code.
Thank you!
Thank you for the tutorial, your explanation was perfect!
Glad it was helpful!
Thanks for your great video and description, I have learned a lot.
Thank you for your comment!
THANK YOU! JUST THANK YOU! 😂 I don't know why I thought the linear layer only accepts 2d tensors.........................
Glad you liked it!
Loved it!
Please make a video on Transformer in Transformer (TNT) pytorch implementation .
EXTREMELY HELPFUL AS ALWAYS. KUDOS
Thank you!!!
Amazing explanation!!! Just loved it !!!
Glad you liked it!
Thanks for all great content!
You're welcome!
Love this format 👍
♥️Glad you like it!
Love this vid :)
Clear explained with nice examples
Glad to hear!
Fantastic live coding video!!!!!!!! You saved my day, and I hope you keep on making such nice videos. I believe it is the best video explaining ViT~
Very helpful explanation. Thank you!
Glad it was helpful!
Loved the video! Just a quick question: here, you save the custom_model, which has not been trained for a single epoch. How is it able to predict the image correctly (without training)? Or am I missing something here?
I got it! You are copying the learned weights from the official_model to the custom_model. I missed it the first time!
Yeh, that’s right! Anyway, thank you for your comment!!
I have the same problem!! I can't understand how it is able to predict without training. Please, can you explain what happens? And how can I train this model?
@@ibtissamsaadi6250 I just took a pretrained model from `timm` and copied its weights
@@mildlyoverfitted Thanks for your reply! Can you help me do training and testing with your code? Is that possible? 1 - load the pretrained model
2 - fine-tune this model and train it
3 - test step
Is that correct? I want to apply ViT to facial expression classification but I didn't find any example of how to do it.
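(For anyone looking for a concrete starting point, a rough sketch of exactly those three steps with a timm ViT; the model name, dataset path, number of classes and hyperparameters are placeholders to adapt:)

import timm
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# 1) load the pretrained ViT with a fresh head, e.g. for 7 expression classes
model = timm.create_model("vit_base_patch16_384", pretrained=True, num_classes=7)

# 2) fine-tune on a custom dataset laid out as root/<class_name>/<image>
tfm = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])
train_ds = datasets.ImageFolder("path/to/train", transform=tfm)
loader = DataLoader(train_ds, batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
model.train()
for x, y in loader:                      # one epoch; repeat as needed
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# 3) evaluate the same way on a held-out test folder, with model.eval() and torch.no_grad()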
Hi, great explanation. Can this transformer be used only for embedding extraction, leaving out classification?
Thank you! You can simply take the final CLS token embedding:)
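A minimal sketch of that, assuming a timm-style model with a `head` attribute (as in the video's custom implementation):

import timm
import torch
import torch.nn as nn

model = timm.create_model("vit_base_patch16_384", pretrained=True)
model.head = nn.Identity()   # drop the classification layer

model.eval()
with torch.no_grad():
    emb = model(torch.randn(1, 3, 384, 384))   # (1, 768): the final CLS token embedding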
Amazing clarity. Your tutorial is gold!! Great work.
Can you please make a video on code implementation of VOLO-D5 model (Vision Outlooker for Visual Recognition)
Appreciate it! Thank you for the suggestion!
🎉🎉🎉 1000 subscribers 👏👏👏
Great video! Love it! My question is: why did you make the positional embedding a learnable parameter?
I think I just did what the paper suggested. However, yes, there are positional encodings that are not learnable so it is a possible alternative:)
Hi, great video indeed!
Thanks for your time for making such a video and for sharing it with the community. Do you have plans to create further videos on the implementation of other types of architectures or training/inference of models that might be more difficult than or different from the standard setups?
Thank you for your message! Really nice to hear that. Yes, I am planning to create many videos like this on different architectures/approaches. The format will stay very similar to this video: Taking an existing github repo and implementing it from scratch! I guess the goal is to change things up and cover different fields and topics!
Best video ever. Please implement the Swin Transformer, which is the latest in the image transformer family. I find it difficult to understand the source code of the window attention in the Swin Transformer. It would be very useful if you could upload either a walkthrough or an implementation of the Swin Transformer code.
Appreciate it:) Anyway, I haven't even heard about this Swin Transformer. I will definitely try to read up on it and maybe make a video on it:)
Has the literature tried working with Transformer + CNN combinations, like replacing the 2D poolings with attention?
I love things like these that are application focused. I am currently experimenting with editing backbones so that they start with Gabor filters. These backbones are loaded from mmdetection or detectron2. Can you do something like that, i.e. how we could edit backbones? That might be useful to people who want to experiment.
Thank you for the feedback! Interesting suggestion! I am writing it down:)
@@mildlyoverfitted Thanks. It's kind of a transfer learning but with the ability to either replace layers or edit them. Thanks for these videos, btw
It's a great explanation, found it extremely useful!!
Glad to hear that!
Very helpful video! Thanks!! BTW, what's your development environment?
You are welcome! Thank you for your comment! I only use vim and tmux:)
Amazing video. Just curious, what keyboard are you using?
Glad you enjoyed it! Logitech MX Keys S
Thank you. This is a really helpful video.
Glad it was helpful!
You're awesome, literally! Please make more videos!
Thank you for the tutorial, it helps me very much.
Thank you for watching:)
Thanks for your video. I have a question: I got a result from the trained model, but I can't see results like in your video. Did you train the ViT model on ImageNet data?
I used the pretrained model from the timm package as shown in the video. Not sure what it was trained on.
what a beautiful demo, Thank you
Appreciate it very much!!
excellent video. Could you create a YOLO algorithm tutorial? It would be very helpful for me
Really appreciate your feedback:) Thank you! I will definitely try to create a YOLO video in the future:)
Thanks for the explanation. Can you explain how to train ViT?
hi, fantastic video! Is it better to get the input images in range -1 to 1 or 0 to 1?
Thank you! I guess it should not make a difference as long as you do the same thing at training and inference time.
I think the batchnorm layer does that in the forward pass
Where did you include the positional encoding? Or is it not needed when using convolutions for patching and embedding?
Thanks so much for the video. Easy to follow, and the detours to explain the side stuff are also relevant. Should the line 217/218 comments on the shape be changed to (n_samples, n_patches + 1, out_features), or am I wrong?
Thank you! You are absolutely right! Nice find! I will create an errata comment and fix it on github.
this is perfect code review
thank u for sharing good review
Really appreciate it!
Thanks again for the great hands-on tutorial on ViT. This helped me greatly to understand the Transformer implementation in PyTorch. My understanding is that you have covered the encoder part here (for classification tasks). Do you have a separate session on the decoder part, or is it implemented here?
Glad it helped! As you pointed out, this video is purely about the encoder! I don't have a video on the BERT decoder with cross attention, however, I have a new video on the GPT-2 model that does contain a variant of the decoder block. Feel free to check it out:)
@@mildlyoverfitted Ok, Thanks. Will check it out soon.
Thank you! That is fantastic, I can understand it deeply. Could you please present the Swin Transformer like this?
Glad you liked it! Thank you for the suggestion!
can you make a video about the "Rethinking Attention with Performers" paper? :D
That is a great idea actually! Thank you! I just read through the paper and it looks interesting.
I am building a custom model based on ViT. It is almost the same as ViT, with just a few additional layers. I am trying to load the pretrained weights of ViT using load_state_dict() function. But the size of input image I am feeding to the model is not 384x384, rather 640x640. So the positional embedding layer of my model has more parameters than ViT. How to handle these extra parameters of positional embedding? Can I perform some interpolation of the existing parameters?
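(For what it's worth, interpolating the existing position embeddings is the usual trick; timm does something similar internally when the input resolution changes. A rough sketch, assuming patch size 16 and the standard (1, n_patches + 1, embed_dim) layout with the CLS position first:)

import torch
import torch.nn.functional as F

embed_dim = 768
old_grid, new_grid = 384 // 16, 640 // 16                   # 24 -> 40 patches per side
pos_embed = torch.randn(1, old_grid ** 2 + 1, embed_dim)    # stand-in for the pretrained tensor

cls_pos = pos_embed[:, :1]                                  # keep the CLS position untouched
grid_pos = pos_embed[:, 1:].reshape(1, old_grid, old_grid, embed_dim).permute(0, 3, 1, 2)
grid_pos = F.interpolate(grid_pos, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
grid_pos = grid_pos.permute(0, 2, 3, 1).reshape(1, new_grid ** 2, embed_dim)
new_pos_embed = torch.cat([cls_pos, grid_pos], dim=1)       # (1, 1601, 768)

You would then overwrite the pos_embed entry in the state dict with new_pos_embed before calling load_state_dict().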
freaking awesome, Niubility!
hehe:) I had to google that word:) Thank you !
@@mildlyoverfitted We call it Chinglish(Chinese English). Ha, ha, ha.
Hi, I couldn't get why the position embedding was initialized as a zero tensor. Why wasn't it initialized with the index values of the patches in the original image, as the flow diagram suggests? I would highly appreciate clarification on this. Great video btw!!
How can I train the Vision Transformer on FaceForensics++, given that you have used the already-present classes?
Is it possible to fine tune vision transformers on a single GPU machine? Given that they’ve been trained using tons of TPUs, I’m inclined to think fine tuning also requires huge compute power and thus out of reach of most people at the moment.
I have never fine-tuned a Vision Transformer myself, however, I would imagine it takes fewer resources than training it from scratch. Just give it a go with a single GPU and monitor the performance:) Good luck!
Are the positions of the patch embeddings learned during training? And why?
Hi, thanks for the nice video, I have a question. I am doing a CNN + ViT project using 3 conv layers. Can you show me how to incorporate the CNN layers into the ViT architecture that you implemented in your video, and how I can optimize it? Please help me. Thank you very much.
Can we use this code for change detection between two satellite images?
Quick question: for the forward.py file, what is the purpose of k=10? I see it's used for the topk function, but I was curious what the k variable denotes, as well as why specifically 10 was chosen.
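(In case it helps: torch.topk just returns the k largest values and their indices, so k = 10 presumably means only the 10 most probable ImageNet classes get displayed; the value 10 itself looks like an arbitrary display choice. A tiny illustration with fake logits:)

import torch

logits = torch.randn(1, 1000)               # fake output over 1000 ImageNet classes
probs = torch.softmax(logits, dim=-1)
top_probs, top_idxs = probs[0].topk(k=10)   # 10 highest probabilities and their class indices
for p, i in zip(top_probs, top_idxs):
    print(f"class {i.item()}: {p.item():.4f}")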
Have you tried exploring what the different inner layers of the Vision Transformer see?
Hi, can I run this code on a laptop without a GPU?
Sure:)
One quick question: I have implemented ViT, but when I try to train it from scratch it seems like it is not learning at all (the loss is not going down). I have been using a simple dataset (cats vs dogs) with the AdamW optimizer and lr = 0.001. What should I do, other than loading the pretrained weights?
I would definitely try to overfit one single batch of your training data. If it is possible, then in theory your setup is correct and you just need to train on more data/longer. If it is not possible, something went wrong with your architecture/loss.
I hope that helps. Also, I know that this might be against the spirit of what you are trying to do but there are a bunch of frameworks that implemented the architecture /training logic already. Examples:
* rwightman.github.io/pytorch-image-models/models/vision-transformer/
* huggingface.co/docs/transformers/model_doc/vit
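As an illustration, a single-batch overfitting check could look roughly like this (a sketch only; the model is created from scratch here, swap in your own):

import timm
import torch
import torch.nn as nn

# untrained ViT with a 2-class head (e.g. cats vs dogs)
model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=2)

# ONE fixed batch; random tensors here just to keep the snippet self-contained
x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, 2, (8,))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()
model.train()
for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    if step % 20 == 0:
        print(step, loss.item())  # should drop towards 0 if the setup is sound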
Why is the 2nd dim of the MLP input n_patches + 1? Isn't the MLP just applied to the class token?
So the `MLP` module is used inside the Transformer block and it takes a 3D tensor as input. See this link for the only place where the CLS token is explicitly extracted: github.com/jankrepl/mildlyoverfitted/blob/22f0ecc67cef14267ee91ff2e4df6bf9f6d65bc2/github_adventures/vision_transformer/custom.py#L423-L424
Hope that helps:)
Thanks, yeah, I confused the MLP inside the block with the MLP at the end for classification. @@mildlyoverfitted
Instantly subscribed
Thank you!
I tried to change the hyperparameters, add MLP blocks, and train the network. But the result is the same, 61% validation accuracy on CIFAR10. Why...?
How would such an implementation be modified to accommodate the vit_base_patch16_224_sam model?
Also, how would fine-tuning be done with this model or the SAM model to customize it for more unique datasets?
Thanks, can you please implement or point me to a repo which uses ViT for image captioning?
You're welcome! I am sorry but I have very little knowledge about image captioning:(
@@mildlyoverfitted No worries, thanks for your work anyway, really love it!
Hi, thanks for sharing, it's great. Could you please share your experience of training a transformer from scratch? I am trying to train one on skeleton datasets in a self-supervised manner with the SimCLR loss, and my transformer does not seem to learn much; after a few epochs the loss increases. I am new to this and don't understand what's wrong.
Hey! Thank you for your comment! Hmmmm, it is a pretty hard question since I don't know what your code and exact setup look like. Anyway, a few ideas I would recommend (not sure if that applies to your problem):
* Make sure it is possible to "overfit" your network on a single batch of samples
* Track as many relevant metrics (+other artifacts) as possible (with tools like TensorBoard) to understand what is going on
* Try to use a popular open-source package/repository for the training before actually writing custom code
@@mildlyoverfitted Thanks a lot. I have just one concern. Transformers are really great in NLP and on image or video data. But my data is a sequence of frames, with each frame containing just 30 values (10 joints with 3 x-y-z coordinates). Do you think a 300x30 input is too low-dimensional for a Transformer to learn something meaningful?
@@saniazahan5424 Interesting! I don't think that should be a problem. However, as I said, it is really hard to give any tips without actually knowing all the technical details:( Good luck with your project!!!
@@mildlyoverfitted I guess it is. Thanks.
How about making a video on "Escaping the Big Data Paradigm with Compact Transformers" paper?
I am not familiar with this paper. However, I will definitely try to read it! Thanks for the tip!
very cool channel, keep going! :)
Thank you!
novelty explained in just over 6 minutes. 🙇
Hope you liked it:)
Can I train this on the Google Colab free plan on a dataset of 21.4k cassava leaf images?
I guess it should be possible:) Just give it a try:)
Great video, can I run the code on a Mac with an M1 chip as-is?
Thanks! Yes, you can:)
Many thanks for this amazing explanation. Do you, by any chance, know of a tutorial on how to use transformers on tabular data (using PyTorch)?
Thanks again.
:-}
Great video on vision transformers. However, I have a small problem in the implementation. When I tried to train the model that I implemented, I was getting the same outputs for all the images in a batch. On further investigation, I found out that the first row of every tensor in a batch, i.e., the cls_token for every image in a batch, is not changing when it passes through all the layers. Is this problem occurring because we are giving the same cls_token to every image, or is it because of some other implementation error? It would be really great if someone could answer. Thanks in advance.
Thank you! AFAIK if your batch contains different images then you should indeed have different embeddings of the CLS token after the forward pass. Anyway, it is hard to say what the issue could be without seeing the code. If you think the problem is coming from my code feel free to create an issue on github where we could discuss it in detail!
Cheers!
@@mildlyoverfitted Thanks for the reply. In my implementation, I was passing the tensor we get after layer normalization directly to the attention layer as the query, key and value. However, in your implementation and PyTorch timm's implementation, the input tensor is passed through a linear layer and reshaped to get the query, key and value. That was the problem with my code, but I still do not understand the reasoning behind my mistake. In the original transformer, I thought we just pass the embeddings as key, value and query directly without performing any linear projections, so I assumed the same would be applicable here. However, that was not the case. If anyone can give the reasoning behind this procedure, it would be really appreciated. Thanks in advance.
AFAIK one always needs to apply the linear projection. What paper do you refer to when you say "original transformer"?
@@mildlyoverfitted In Vaswani et al., in the description of the attention module, I thought they never mentioned applying a linear projection. However, I might have missed that information in the original paper. Anyways, thanks for the reply.
@@baveshbalaji301 Just checked the paper. The Figure 2 (right) shows the linear mapping logic. But I agree that it is easy to miss:) In the text they actually use the W^{Q}, W^{K}, W^{V} matrices to refer to this linear mapping (no bias).
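For completeness, a small sketch of that projection in the qkv style used by the video and timm (a single linear layer producing queries, keys and values at once; the sizes below are just ViT-Base defaults used for illustration):

import torch
import torch.nn as nn

dim, n_heads = 768, 12
head_dim = dim // n_heads
qkv = nn.Linear(dim, dim * 3)

x = torch.randn(2, 197, dim)                           # (n_samples, n_patches + 1, dim)
q, k, v = qkv(x).reshape(2, 197, 3, n_heads, head_dim).permute(2, 0, 3, 1, 4)
# each of q, k, v: (n_samples, n_heads, n_patches + 1, head_dim)
attn = (q @ k.transpose(-2, -1)) * head_dim ** -0.5    # scaled dot-product scores
attn = attn.softmax(dim=-1)
out = (attn @ v).transpose(1, 2).reshape(2, 197, dim)  # back to (n_samples, n_patches + 1, dim)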
what a video!
thanks so much! helped me a lot
Happy to hear that!
Thanks, please explain more about the code, and please implement the Swin Transformer too.
Great idea:)
Thank you! New subscriber :D
Really appreciate it!
awesome video
"mildly overfitted" is how I like to keep my underwear so I don't get the hyena.
Haha:) Made me laugh:D
Awesome!!!!
Thanks!!
Great!
What coding speed!!! Did you speed up the video or were you actually coding in real time?
It is all sped up:) The goal is to keep the videos as short as possible:)
@@mildlyoverfitted Oh no :v I was kinda motivated to code fast. Nice tutorial by the way :)
I got the error in assert_tensors_equal(res_c, res_o)
Feel free to create an issue on GitHub if it is related to the implementation from this video.
@@mildlyoverfitted Thank you for quick reply. I checked again and I found my mistake.
@@mildlyoverfitted Make more tutorials like this; it helps a lot, not only for implementation purposes, but I also learn how to write clean code with docstrings.
@@vkmavani7878 Thank you! I will try to continue making videos like this:)