Vision Transformer in PyTorch

  • Published 17 Dec 2024

COMMENTS • 235

  • @mildlyoverfitted
    @mildlyoverfitted  2 years ago +4

    Errata:
    * Lines 217/218 of `custom.py`: shape should be (n_samples, n_patches+1, out_features)
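
    For context, a minimal sketch of the MLP block the errata refers to, with the corrected shape comments (names such as hidden_features follow the video's custom.py, but the exact module layout here is only an approximation):

        import torch
        from torch import nn

        class MLP(nn.Module):
            """Feed-forward block applied to every token independently."""
            def __init__(self, in_features, hidden_features, out_features, p=0.0):
                super().__init__()
                self.fc1 = nn.Linear(in_features, hidden_features)
                self.act = nn.GELU()
                self.fc2 = nn.Linear(hidden_features, out_features)
                self.drop = nn.Dropout(p)

            def forward(self, x):
                x = self.fc1(x)   # (n_samples, n_patches + 1, hidden_features)
                x = self.act(x)   # (n_samples, n_patches + 1, hidden_features)
                x = self.fc2(x)   # (n_samples, n_patches + 1, out_features)  <- corrected comment
                x = self.drop(x)  # (n_samples, n_patches + 1, out_features)
                return x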

  • @AladdinPersson
    @AladdinPersson 3 years ago +54

    This is awesome! Glad this got recommended, will watch this later 👍

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +10

      Appreciate the message! I hope you will like it:) BTW you are creating great content! Keep it up!

    • @AladdinPersson
      @AladdinPersson 3 years ago +2

      @@mildlyoverfitted Yeah I liked it :)

    • @Patrick-wn6uj
      @Patrick-wn6uj 8 months ago

      @devstuff2576 Put it into ChatGPT

  • @liam9519
    @liam9519 3 years ago +45

    LOVE this live coding channel format! I always find it much easier to understand a paper when I see a simple implementation and this makes it even easier! Keep it up!

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      Thank you! It is funny that you say that because I am exactly like you!

  • @tuankhoanguyen3222
    @tuankhoanguyen3222 3 years ago +5

    A great channel about PyTorch. I like the way you carefully explain the meaning of each function. It encourages me to get my hands dirty more. Thank you, and looking forward to seeing more videos from you.

  • @vishalgoklani
    @vishalgoklani 3 years ago +6

    Excellent presentation, thank you for sharing!
    A few reasons why I enjoyed the video:
    1. < 30min, no fluff, no typos, no BS, good consistent pace. Everyone is busy, staying under 30min is extremely helpful and will force you to optimize your time!
    2. Useful sidebars; break down key concepts into useful nuggets. Very helpful
    3. Chose a popular topic, based on one of the best repos, and gave a nice intro
    4. Stay with pytorch please, no one likes tensorflow ;)
    I look forward to more of your work.
    Thank you

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +2

      Thank you very much:) Very encouraging and motivating comment!

    • @jamgplus334
      @jamgplus334 3 years ago

      No one likes TensorFlow, haha. Strongly agree with you.

  • @mohitlamba117
    @mohitlamba117 2 years ago +1

    A much needed video. I went through several iterations of the paper and supplementary videos online explaining it. I always had some residual doubts remaining and didn't understand it with pinpoint accuracy. After this video everything is now clear !!

  • @100vivasvan
    @100vivasvan 3 years ago +15

    ❤️❤️ absolutely fantastic presentation. This cured my depression after 5 days of banging my head against the wall.
    The pace of this video is so ideal.
    One suggestion that I want to propose is to add the network architecture figure/diagram from the paper while writing the code, so it's easier for new ML/DL coders to understand.
    Keep it up. Looking forward to more. Amazing work. ❤️ thank you so much ❤️

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +2

      Heh:) Thank you for the kind words! That is a great suggestion actually!

    • @rushirajparmar9602
      @rushirajparmar9602 3 years ago +1

      @@mildlyoverfitted Yes the diagram might be very helpful!

  • @tranquangkhai8329
    @tranquangkhai8329 3 years ago +4

    Haven't watched the full video yet, but I like the way you explain things clearly with the IPython demo for a beginner like me. Nice video!

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      Appreciate it! Nice to know that you enjoyed the ipython segments:) I will definitely try to include them in my future videos too!

  • @danielasefa8087
    @danielasefa8087 8 months ago +1

    Thank you so much for helping me to understand ViT!! Great work

  • @AnkityadavGrowConscious
    @AnkityadavGrowConscious 3 years ago +2

    Amazing clarity. Your tutorial is gold!! Great work.

  • @mkamp
    @mkamp 2 years ago +4

    Beautiful code, wonderful explanations to follow along. Thanks for taking the extra time to look at some of the essential concepts in iPython. Superb content!

  • @yichehsieh243
    @yichehsieh243 3 years ago +1

    Thank you for uploading this video, it made me learn a lot and get more familiar with the ViT model.

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      Glad you enjoyed it!

    • @yichehsieh243
      @yichehsieh243 3 years ago

      @@mildlyoverfitted After some study, I got that the ViT model is actually the encoder of a transformer. May I expect an introduction to the decoder part or a complete seq2seq model in the future🤣
      Besides, I was surprised that the implementation of the ViT model was completed without using nn.MultiheadAttention or nn.Transformer. Isn't it more convenient?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      @@yichehsieh243 Good question actually. I guess one of the goals of this channel is to do things "from scratch" mostly for educational purposes. However, in real life I would always go for well maintained libraries rather than reinventing the wheel and reimplementing everything.
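
    For anyone curious about the library route mentioned above, a rough sketch of self-attention over the (n_samples, n_patches + 1, dim) token tensor using nn.MultiheadAttention; this is an illustration, not the code from the video:

        import torch
        from torch import nn

        dim, n_heads = 768, 12
        attn = nn.MultiheadAttention(embed_dim=dim, num_heads=n_heads, batch_first=True)

        tokens = torch.rand(2, 197, dim)  # (n_samples, n_patches + 1, dim)
        out, _ = attn(tokens, tokens, tokens, need_weights=False)  # query = key = value = tokens
        print(out.shape)  # torch.Size([2, 197, 768])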

  • @froghana1995
    @froghana1995 3 years ago +1

    Thank you for helping me understand ViT! It's a great and kind Video!!

  • @news2000tw
    @news2000tw 1 year ago

    Thank you!!!!! Super useful. Before, I knew how dropout works but I didn't know how PyTorch handles it.

  • @mlearnxyz
    @mlearnxyz 2 years ago +2

    Excellent material. Thanks for preparing and sharing it! Keep up the good work.

  • @StasGT
    @StasGT 1 year ago +1

    Thank you! It's the best video for understanding ViT.

    • @mildlyoverfitted
      @mildlyoverfitted  1 year ago

      Appreciate your comment!

    • @StasGT
      @StasGT 1 year ago

      @@mildlyoverfitted, in the PyTorch transformer (torch.nn.modules.transformer.py), q & k & v = x. It was a discovery for me. But it gives better convergence of the net. I didn't know that until yesterday.
      # self-attention block
      def _sa_block(self, x: Tensor,
                    attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:
          x = self.self_attn(x, x, x,
                             attn_mask=attn_mask,
                             key_padding_mask=key_padding_mask,
                             need_weights=False)[0]
          return self.dropout1(x)
      This method pushes 'x' to the MultiheadAttention class in torch.nn.modules.activation.py.

  • @goldfishjy95
    @goldfishjy95 3 years ago +2

    Thank you so much.. this is a lifesaver! Bless you my friend!

  • @sanskarshrivastava5193
    @sanskarshrivastava5193 3 years ago +1

    I'm so glad that i found this channel , you are a gem :) !!

  • @talhayousuf4599
    @talhayousuf4599 3 years ago +5

    I subscribed owing to such a clean implementation, well explained. I love how you comment the code and check shapes on the go. I request you to please make a video on your approach to implement papers.

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      Great to hear that! I guess it is easier to take an existing code and modify it rather than starting from scratch:)

  • @shahriarshayesteh8602
    @shahriarshayesteh8602 3 years ago +2

    Just found your amazing channel. I love it, pls continue.

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      Thank you for the kind message! I will definitely continue:)

  • @macx7760
    @macx7760 10 months ago +1

    fantastic video, just a quick note: at 16:01 you say that "none of the operations are changing the shape of the tensor", but isn't this wrong? When applying fc2, the last dim should be out_features, not hidden_features, so the shapes are also wrongly commented.

    • @mildlyoverfitted
      @mildlyoverfitted  10 months ago +1

      Nice find and sorry for the mistake:)! Somebody already pointed it out a while ago:) Look at the pinned errata comment:)

    • @macx7760
      @macx7760 10 months ago

      ah i see, my bad :D @@mildlyoverfitted

  • @thuancollege5594
    @thuancollege5594 2 years ago +2

    I don't understand the reason why you use an nn.Conv2d layer in the patch embedding module at 2:53. In my mind, I would only use nn.Linear(in_channels, out_channels). Can you explain it?
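
    A small sketch of why nn.Conv2d with kernel_size == stride == patch_size acts as a per-patch linear layer; the 224x224 image and 16x16 patches below are illustrative choices, not the video's exact code:

        import torch
        from torch import nn

        img = torch.rand(1, 3, 224, 224)
        patch_size, embed_dim = 16, 768

        # Conv2d route: each 3x16x16 patch is mapped to one 768-dim vector.
        proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        x = proj(img)                      # (1, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)   # (1, 196, 768) == (n_samples, n_patches, embed_dim)

        # nn.Linear route: flatten each patch first, then project it.
        patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 3 * patch_size ** 2)
        y = nn.Linear(3 * patch_size ** 2, embed_dim)(patches)   # (1, 196, 768), same shape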

  • @vasylcf
    @vasylcf 2 years ago +2

    Thanks !!! I like your clear way of explanation

  • @nishantbhansali3671
    @nishantbhansali3671 2 years ago +1

    very helpful video, please make a similar video explaining the decoder architecture as well

  • @sushilkhadka8069
    @sushilkhadka8069 1 year ago +1

    shape of v : (n_samples, n_heads, n_patches + 1, head_dim)
    shape of atten : (n_samples, n_heads, n_patches + 1, n_patches + 1)
    How can you multiply these two tensors?
    And how is the result's shape the same as v's?
    Please explain. BTW great content. Glad I found this channel.

    • @suleymanerim2119
      @suleymanerim2119 1 year ago +1

      atten @ v can be done. The output is (n_samples, n_heads, n_patches + 1, n_patches + 1) @ (n_samples, n_heads, n_patches + 1, head_dim) = (n_samples, n_heads, n_patches + 1, head_dim). For example, let's say you have two matrices with shapes (2,2,5,5) and (2,2,5,3); then the output will be (2,2,5,3).

    • @sushilkhadka8069
      @sushilkhadka8069 1 year ago

      @@suleymanerim2119 sorry, my bad. I was doing v @ atten instead of atten @ v. Thanks anyway.
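
    A quick sketch of the batched matmul shapes discussed in this thread; the sizes are made up for illustration:

        import torch

        n_samples, n_heads, n_tokens, head_dim = 2, 12, 197, 64  # n_tokens = n_patches + 1
        atten = torch.rand(n_samples, n_heads, n_tokens, n_tokens)
        v = torch.rand(n_samples, n_heads, n_tokens, head_dim)

        out = atten @ v    # matmul over the last two dims, batched over the first two
        print(out.shape)   # torch.Size([2, 12, 197, 64]) -- same shape as v

        # v @ atten fails: (..., 197, 64) @ (..., 197, 197) has mismatched inner dims (64 vs 197).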

  • @laxlyfters8695
    @laxlyfters8695 3 years ago +2

    Great video, saw this posted in the Artificial Intelligence and Deep Learning group on Facebook.

  • @dhananjayraut
    @dhananjayraut 3 years ago +1

    really like the videos on the channel, keep them coming. I knew I had to subscribe just a few minutes into the video.

  • @zeamon4932
    @zeamon4932 3 years ago +1

    I like the shape-checking part and your vim usage; using old-style vim just shows your ability to play around with code.

  • @fuat7775
    @fuat7775 2 years ago +1

    Thank you for the tutorial, your explanation was perfect!

  • @elaherahimian3619
    @elaherahimian3619 3 years ago +1

    Thanks for your great video and description, I have learned a lot.

  • @EstZorion
    @EstZorion 2 years ago +1

    THANK YOU! JUST THANK YOU! 😂 I don't know why I thought the linear layer only accepts 2d tensors.........................

  • @dewan_shaheb
    @dewan_shaheb 3 years ago +1

    Loved it!
    Please make a video on Transformer in Transformer (TNT) pytorch implementation .

  • @visuality2541
    @visuality2541 2 years ago +1

    EXTREMELY HELPFUL AS ALWAYS. KUDOS

  • @vishakbhat3032
    @vishakbhat3032 1 year ago

    Amazing explanation!!! Just loved it !!!

  • @omerfarukyasar4681
    @omerfarukyasar4681 2 years ago +1

    Thanks for all great content!

  • @_shikh4r_
    @_shikh4r_ 3 years ago +2

    Love this format 👍

  • @junhyeokpark1214
    @junhyeokpark1214 2 years ago

    Love this vid :)
    Clearly explained with nice examples

  • @陳思愷-b1y
    @陳思愷-b1y 3 years ago

    fantastic live coding video!!!!!!!! You saved my day, and I hope you can keep on making such nice videos. I believe it is the best video explaining ViT~

  • @danyellacarvalho4120
    @danyellacarvalho4120 1 year ago

    Very helpful explanation. Thank you!

  • @mevanekanayake4363
    @mevanekanayake4363 3 years ago +2

    Loved the video! Just a quick question: here, you save the custom_model that has not been trained for a single epoch. How is it able to predict the image correctly (without training)? Or am I missing something here?

    • @mevanekanayake4363
      @mevanekanayake4363 3 years ago

      I got it! You are copying the learned weights from the official_model to the custom_model. I missed it the first time!

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      Yeh, that’s right! Anyway, thank you for your comment!!

    • @ibtissamsaadi6250
      @ibtissamsaadi6250 3 years ago

      I have the same problem!! I can't understand how it is able to predict without training. Please can you explain what happens, and how can I train this model?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      @@ibtissamsaadi6250 I just took a pretrained model from `timm` and copied its weights

    • @ibtissamsaadi6250
      @ibtissamsaadi6250 3 years ago

      @@mildlyoverfitted thanks for your reply! Can you help me do training and testing with your code? Is it possible? 1- load the pretrained model
      2- fine-tune this model and train it
      3- test step
      Is that correct? I want to apply ViT to facial expression classification but I didn't find any example of how to do it.
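
    Not the video's code, but a rough sketch of the kind of fine-tuning loop asked about here, assuming the timm package and your own labeled DataLoader; the model name and the 7-class facial-expression head are example choices:

        import timm
        import torch
        from torch import nn

        # 1- load a pretrained ViT and replace the classification head (7 expression classes)
        model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=7)

        # 2- fine-tune: a standard supervised loop over your DataLoader
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
        criterion = nn.CrossEntropyLoss()

        def train_one_epoch(loader):
            model.train()
            for images, labels in loader:          # images: (batch, 3, 224, 224)
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()

        # 3- test: switch to eval mode and take the argmax over the logits
        @torch.no_grad()
        def predict(images):
            model.eval()
            return model(images).argmax(dim=1)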

  • @aravindreddy4871
    @aravindreddy4871 3 years ago +2

    Hi, great explanation. Can this transformer be used only for embedding extraction, leaving out classification?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      Thank you! You can simply take the final CLS token embedding:)

  • @pranavkathar5383
    @pranavkathar5383 2 years ago +1

    Amazing clarity. Your tutorial is gold!! Great work.
    Can you please make a video on code implementation of VOLO-D5 model (Vision Outlooker for Visual Recognition)

    • @mildlyoverfitted
      @mildlyoverfitted  11 months ago

      Appreciate it! Thank you for the suggestion!

  • @lauraennature
    @lauraennature 3 years ago +1

    🎉🎉🎉 1000 subscribers 👏👏👏

  • @tuoxin7800
    @tuoxin7800 1 year ago

    Great video! Love it! My question is: why did you set pos_embedding to a learnable parameter?

    • @mildlyoverfitted
      @mildlyoverfitted  1 year ago

      I think I just did what the paper suggested. However, yes, there are positional encodings that are not learnable so it is a possible alternative:)
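
    For reference, a minimal sketch of the learnable positional embedding discussed here next to a fixed sinusoidal alternative; the tensor sizes are illustrative:

        import torch
        from torch import nn

        embed_dim, n_tokens = 768, 197  # n_tokens = n_patches + 1 (including CLS)

        # Learnable (as in the ViT paper): a Parameter that receives gradients.
        pos_embed = nn.Parameter(torch.zeros(1, n_tokens, embed_dim))

        # Fixed alternative: a sinusoidal table that is never updated.
        position = torch.arange(n_tokens).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_dim, 2) * (-torch.log(torch.tensor(10000.0)) / embed_dim))
        fixed = torch.zeros(1, n_tokens, embed_dim)
        fixed[0, :, 0::2] = torch.sin(position * div_term)
        fixed[0, :, 1::2] = torch.cos(position * div_term)

        # Either way, it is simply added to the patch embeddings:
        tokens = torch.rand(2, n_tokens, embed_dim)
        tokens = tokens + pos_embed   # broadcast over the batch dimension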

  • @hamedhemati5151
    @hamedhemati5151 3 years ago +2

    Hi, great video indeed!
    Thanks for your time for making such a video and for sharing it with the community. Do you have plans to create further videos on the implementation of other types of architectures or training/inference of models that might be more difficult than or different from the standard setups?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +3

      Thank you for your message! Really nice to hear that. Yes, I am planning to create many videos like this on different architectures/approaches. The format will stay very similar to this video: Taking an existing github repo and implementing it from scratch! I guess the goal is to change things up and cover different fields and topics!

  • @bhanu0669
    @bhanu0669 3 years ago +1

    Best video ever. Please implement the Swin Transformer, which is the latest in the image transformer family. I find it difficult to understand the source code of Window Attention in the Swin Transformer. It would be very useful if you could upload either a walkthrough or an implementation of the Swin Transformer code.

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      Appreciate it:) Anyway, I haven't even heard about this Swin Transformer. I will definitely try to read up on it and maybe make a video on it:)

  • @TheAero
    @TheAero 1 year ago

    Has the literature tried combining Transformer + CNN, like replacing the 2D poolings with attention?

  • @iinarrab19
    @iinarrab19 3 years ago +4

    I love things like these that are application focused. I am currently experimenting on editing backbones so that they should start with Gabor filters. These backbones are loaded from mmdetection or detectron2. Can you do something like that? As to how we could edit backbones? That might be useful to people that want to experiment.

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      Thank you for the feedback! Interesting suggestion! I am writing it down:)

    • @iinarrab19
      @iinarrab19 3 years ago +2

      @@mildlyoverfitted Thanks. It's kind of a transfer learning but with the ability to either replace layers or edit them. Thanks for these videos, btw

  • @MercyPrasanna
    @MercyPrasanna 2 years ago

    It's a great explanation, found it extremely useful!!

  • @陈文-p6u
    @陈文-p6u 3 years ago +1

    Very helpful video! Thanks!! BTW, what's your development environment?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      You are welcome! Thank you for your comment! I only use vim and tmux:)

  • @harrisnisar5345
    @harrisnisar5345 9 months ago

    Amazing video. Just curious, what keyboard are you using?

  • @marearts.
    @marearts. 3 years ago +1

    Thank you. This is a really helpful video.

  • @mybirdda
    @mybirdda 3 years ago +2

    You're awesome, literally! Please make more videos!

  • @abdallahghazaly359
    @abdallahghazaly359 3 years ago +1

    Thank you for the tutorial, it helps me very much.

  • @조원기-w6b
    @조원기-w6b 3 years ago +1

    Thanks for your video. I have a question: I got a result from the trained model, but I can't see the result like in your video. Did you train the ViT model on ImageNet data?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      I used the pretrained model from the timm package as shown in the video. Not sure what it was trained on.

  • @ahmedyassin7684
    @ahmedyassin7684 2 years ago

    what a beautiful demo, Thank you

  • @mehedihasanshuvo4874
    @mehedihasanshuvo4874 3 years ago +3

    excellent video. Could you create a YOLO algorithm tutorial? It would be very helpful for me

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      Really appreciate your feedback:) Thank you! I will definitely try to create a YOLO video in the future:)

  • @HaiderAli-lr9fw
    @HaiderAli-lr9fw 1 year ago

    Thanks for the explanation. Can you explain how to train ViT?

  • @vaehligchs
    @vaehligchs 2 years ago

    hi, fantastic video! Is it better to get the input images in range -1 to 1 or 0 to 1?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago

      Thank you! I guess it should not make a difference as long as you do the same thing at training and inference time.

    • @muhammadnaufil5237
      @muhammadnaufil5237 1 year ago

      I think the batchnorm layer does that in the forward pass

  • @prajyotmane9067
    @prajyotmane9067 8 months ago

    Where did you include the positional encoding? Or is it not needed when using convolutions for patching and embedding?

  • @gopsda
    @gopsda 2 years ago

    Thanks so much for the video. Easy to follow, and the detours to explain the side topics are also relevant. Should the line 217/218 comments on shape be changed to (n_samples, n_patches+1, out_features), or am I wrong?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago

      Thank you! You are absolutely right! Nice find! I will create an errata comment and fix it on github.

  • @jjongjjong2365
    @jjongjjong2365 3 years ago +1

    this is a perfect code review
    thank you for sharing such a good review

  • @gopsda
    @gopsda 2 years ago

    Thanks again for the great hands-on tutorial on ViT. This helped me greatly to understand the Transformer implementation in Pytorch. My understanding is that you have covered the Encoder part here (for Classification tasks). Do you have a separate session on Decoder part or is it implemented here?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago +1

      Glad it helped! As you pointed out, this video is purely about the encoder! I don't have a video on a decoder with cross attention, however, I have a new video on the GPT-2 model that does contain a variant of the decoder block. Feel free to check it out:)

    • @gopsda
      @gopsda 2 years ago

      @@mildlyoverfitted Ok, Thanks. Will check it out soon.

  • @maralzarvani8154
    @maralzarvani8154 2 years ago +1

    Thank you! That is fantastic, I could understand it deeply. Could you please present the Swin Transformer like this?

  • @dontaskme1625
    @dontaskme1625 3 years ago +2

    can you make a video about the "Rethinking Attention with Performers" paper? :D

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      That is a great idea actually! Thank you! I just read through the paper and it looks interesting.

  • @preetomsahaarko8145
    @preetomsahaarko8145 1 year ago

    I am building a custom model based on ViT. It is almost the same as ViT, with just a few additional layers. I am trying to load the pretrained weights of ViT using the load_state_dict() function. But the size of the input image I am feeding to the model is not 384x384 but rather 640x640, so the positional embedding layer of my model has more parameters than ViT's. How do I handle these extra positional embedding parameters? Can I perform some interpolation of the existing parameters?
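
    Interpolating the pretrained positional embeddings is indeed the usual trick (timm does a version of this when a checkpoint is loaded at a new resolution). A rough sketch, assuming 16x16 patches, a 384x384 checkpoint (24x24 patch grid) and a 640x640 target (40x40 grid):

        import torch
        import torch.nn.functional as F

        def resize_pos_embed(pos_embed, old_grid=24, new_grid=40):
            """pos_embed: (1, 1 + old_grid**2, dim) -> (1, 1 + new_grid**2, dim)."""
            cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
            dim = pos_embed.shape[-1]
            # put the per-patch embeddings back on the 2D patch grid, interpolate, flatten again
            patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
            patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                                      mode="bicubic", align_corners=False)
            patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
            return torch.cat([cls_tok, patch_pos], dim=1)

        old = torch.rand(1, 1 + 24 * 24, 768)   # 384 / 16 = 24 patches per side
        new = resize_pos_embed(old)             # (1, 1 + 40 * 40, 768) for 640 / 16 = 40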

  • @cwang6936
    @cwang6936 3 years ago

    freaking awesome, Niubility!

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      hehe:) I had to google that word:) Thank you !

    • @cwang6936
      @cwang6936 3 years ago

      @@mildlyoverfitted We call it Chinglish(Chinese English). Ha, ha, ha.

  • @nikhilmehra5559
    @nikhilmehra5559 1 year ago

    Hi, I couldn't get why the position embedding was initialized as a zero tensor. Why wasn't it initialized with the index values of the patches in the original image, as the flow diagram suggests? I would highly appreciate clarification on this. Great video btw!!

  • @HamzaAli-dy1qp
    @HamzaAli-dy1qp 2 years ago

    How can I train FaceForensics++ on the Vision Transformer, given that you have used already existing classes?

  • @HassanKhan-fe3pn
    @HassanKhan-fe3pn 3 years ago +1

    Is it possible to fine tune vision transformers on a single GPU machine? Given that they’ve been trained using tons of TPUs, I’m inclined to think fine tuning also requires huge compute power and thus out of reach of most people at the moment.

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      I have never fine tuned a Vision Transformer myself, however, I would imagine it takes fewer resources than training it from scratch. Just give it a go with a single GPU and monitor the performance:) Good luck!

  • @klindatv
    @klindatv 1 year ago

    Are the positions of the patch embeddings learned during training? And why?

  • @hakunamatata0014
    @hakunamatata0014 2 years ago

    Hi, thanks for the nice video, I have a question. I am doing a CNN + ViT project using 3 conv layers; can you show me how to incorporate the CNN layers into the ViT architecture that you have implemented in your video, and how I can optimize it? Please help me. Thank you very much.

  • @PriyaDas-he4te
    @PriyaDas-he4te 3 months ago

    Can we use this code for change detection between two satellite images?

  • @DCentFN
    @DCentFN 2 years ago

    Quick question. For the forward.py file, what is the purpose of k=10? I see it's used for the topk function but I was curious as to what the k variable denotes as well as why specifically 10 was chosen
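
    For what it's worth, the k in torch.topk just controls how many of the highest-scoring classes get reported, and 10 is an arbitrary display choice. A tiny illustration (not the exact forward.py code):

        import torch

        logits = torch.rand(1, 1000)                 # one image, 1000 ImageNet classes
        probs = logits.softmax(dim=-1)

        top_probs, top_indices = probs.topk(k=10)    # the 10 best classes and their probabilities
        for p, idx in zip(top_probs[0], top_indices[0]):
            print(f"class {idx.item()}: {p.item():.4f}")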

  • @stanley_george
    @stanley_george 1 year ago

    Have you tried exploring what the different inner layers of the Vision Transformer see?

  • @macknightxu2199
    @macknightxu2199 1 year ago +1

    Hi, can I run this code on a laptop without a GPU?

  • @georgemichel9278
    @georgemichel9278 2 years ago

    One quick question: I have implemented ViT, but when I try to train it from scratch it seems like it is not learning at all (the loss is not going down), and I have been using a simple dataset (cats vs dogs) with the AdamW optimizer and lr = 0.001. What should I do other than loading the pretrained weights?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago +1

      I would definitely try to overfit one single batch of your training data. If it is possible, then in theory your setup is correct and you just need to train on more data/longer. If it is not possible, something went wrong with your architecture/loss.
      I hope that helps. Also, I know that this might be against the spirit of what you are trying to do but there are a bunch of frameworks that implemented the architecture /training logic already. Examples:
      * rwightman.github.io/pytorch-image-models/models/vision-transformer/
      * huggingface.co/docs/transformers/model_doc/vit
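
    A minimal version of the "overfit a single batch" sanity check suggested above; the model, batch and hyperparameters are placeholders:

        import torch
        from torch import nn

        def overfit_single_batch(model, images, labels, n_steps=200, lr=1e-3):
            """If the loss cannot be driven close to zero on one fixed batch,
            something is wrong with the architecture, loss or data pipeline."""
            optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
            criterion = nn.CrossEntropyLoss()
            model.train()
            for step in range(n_steps):
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()
                if step % 50 == 0:
                    print(f"step {step}: loss {loss.item():.4f}")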

  • @macx7760
    @macx7760 10 months ago

    Why is the 2nd dim of the MLP input n_patches + 1? Isn't the MLP just applied to the class token?

    • @mildlyoverfitted
      @mildlyoverfitted  10 months ago +1

      So the `MLP` module is used inside of the Transformer block and it takes a 3D tensor as input. See this link for the only place where the CLS token is explicitly extracted github.com/jankrepl/mildlyoverfitted/blob/22f0ecc67cef14267ee91ff2e4df6bf9f6d65bc2/github_adventures/vision_transformer/custom.py#L423-L424
      Hope that helps:)

    • @macx7760
      @macx7760 10 months ago

      Thanks, yeah, I confused the MLP inside the block with the MLP at the end for classification @@mildlyoverfitted

  • @mayukh3556
    @mayukh3556 3 years ago +1

    Instantly subscribed

  • @StasGT
    @StasGT 1 year ago

    I tried changing the hyper-parameters, adding MLP blocks & training the network. But the result is the same, 61% validation accuracy on CIFAR10. Why...?

  • @DCentFN
    @DCentFN 2 years ago

    How would such an implementation be modified to accommodate the vit_base_patch16_224_sam model?

    • @DCentFN
      @DCentFN 2 years ago

      Also, how would fine-tuning be done with this model or the SAM model to customize it for more unique datasets?

  • @sayeedchowdhury11
    @sayeedchowdhury11 2 years ago

    Thanks, can you please implement or point me to a repo which uses ViT for image captioning?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago +1

      You're welcome! I am sorry but I have very little knowledge about image captioning:(

    • @sayeedchowdhury11
      @sayeedchowdhury11 2 years ago

      @@mildlyoverfitted No worries, thanks for your work anyway, really love it!

  • @saniazahan5424
    @saniazahan5424 3 years ago

    Hi, thanks for sharing. It's great. Could you please share your experience of training a transformer from scratch? I am trying to train one on skeleton datasets in a self-supervised manner with the SimCLR loss, and my transformer seems not to learn much; after a few epochs the loss increases. I am new to this and don't understand what's wrong.

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      Hey! Thank you for your comment! Hmmmm, it is a pretty hard question since I don't know what your code and exact setup look like. Anyway, a few ideas I would recommend (not sure if that applies to your problem):
      * Make sure it is possible to "overfit" your network on a single batch of samples
      * Track as many relevant metrics (+other artifacts) as possible (with tools like TensorBoard) to understand what is going on
      * Try to use a popular open-source package/repository for the training before actually writing custom code

    • @saniazahan5424
      @saniazahan5424 3 years ago +1

      @@mildlyoverfitted Thanks a lot. I have just one concern. Transformers are really great for NLP and image or video data. But my data is a sequence of frames, with each frame containing just 30 values (10 joints with 3 x-y-z coordinates). Do you think a 300x30 dimension is too low for a Transformer to learn something meaningful?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      @@saniazahan5424 Interesting! I don't think that should be a problem. However, as I said, it is really hard to give any tips without actually knowing all the technical details:( Good luck with you project!!!

    • @saniazahan5424
      @saniazahan5424 3 years ago +1

      @@mildlyoverfitted I guess it is. Thanks.

  • @adityamishra348
    @adityamishra348 3 years ago +1

    How about making a video on "Escaping the Big Data Paradigm with Compact Transformers" paper?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      I am not familiar with this paper. However, I will definitely try to read it! Thanks for the tip!

  • @johanngerberding5956
    @johanngerberding5956 3 years ago +1

    very cool channel, keep going! :)

  • @vidinvijay
    @vidinvijay 10 months ago +1

    novelty explained in just over 6 minutes. 🙇

  • @siddharthmagadum16
    @siddharthmagadum16 2 years ago

    Can I train this on the Google Colab free plan on a dataset of 21.4k cassava leaf images?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago

      I guess it should be possible:) Just give it a try:)

  • @danieltello8016
    @danieltello8016 8 months ago

    Great video, can I run the code on a Mac with an M1 chip as it is?

  • @KountayDwivedi
    @KountayDwivedi 2 years ago +1

    Many thanks for this amazing explanation. Would you, by any chance, know of a tutorial on how to utilize transformers on tabular data (using PyTorch)?
    Thanks again.
    :-}

  • @baveshbalaji301
    @baveshbalaji301 2 years ago

    Great video on vision transformers. However, I have a small problem in the implementation. When I tried to train the model that I implemented, I was getting the same outputs for all the images in a batch. On further investigation, I found out that the first row of every tensor in a batch, i.e., the cls_token for every image in a batch, is not changing when it passes through all the layers. Is this problem occurring because we are giving the same cls_token to every image, or is it because of some other implementation error? It would be really great if someone could answer. Thanks in advance.

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago

      Thank you! AFAIK if your batch contains different images then you should indeed have different embeddings of the CLS token after the forward pass. Anyway, it is hard to say what the issue could be without seeing the code. If you think the problem is coming from my code feel free to create an issue on github where we could discuss it in detail!
      Cheers!

    • @baveshbalaji301
      @baveshbalaji301 2 years ago +1

      @@mildlyoverfitted Thanks for the reply. In my implementation, I was passing the tensor we get after performing layer normalization directly to the attention layer as the query, key and value. However, in your implementation and PyTorch timm's implementation, you pass the input tensor through a linear layer and reshape it to get query, key and value. That was the problem with my code, but I still do not understand the reasoning behind my mistake. Because in the original transformer, we just pass the embeddings as key, value and query directly without performing any linear projections, so I thought the same would be applicable here. However, that was not the case. If anyone can give the reasoning behind this procedure, it would be really appreciated. Thanks in advance.

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago

      AFAIK one always needs to apply the linear projection. What paper do you refer to when you say "original transformer"?

    • @baveshbalaji301
      @baveshbalaji301 2 years ago

      @@mildlyoverfitted In Vaswani et al., in the description of the attention module, I thought they never mentioned applying a linear projection. However, I might have missed that information in the original paper. Anyways, thanks for the reply.

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago

      @@baveshbalaji301 Just checked the paper. The Figure 2 (right) shows the linear mapping logic. But I agree that it is easy to miss:) In the text they actually use the W^{Q}, W^{K}, W^{V} matrices to refer to this linear mapping (no bias).
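
    To make the point above concrete, a bare-bones sketch of single-head attention with the W^{Q}, W^{K}, W^{V} projections (no bias, as in the paper); the shapes are illustrative:

        import torch
        from torch import nn

        dim = 768
        w_q = nn.Linear(dim, dim, bias=False)   # W^{Q}
        w_k = nn.Linear(dim, dim, bias=False)   # W^{K}
        w_v = nn.Linear(dim, dim, bias=False)   # W^{V}

        x = torch.rand(2, 197, dim)             # (n_samples, n_tokens, dim)
        q, k, v = w_q(x), w_k(x), w_v(x)        # the learned projections are what make q, k, v differ from x

        scores = (q @ k.transpose(-2, -1)) / dim ** 0.5   # (2, 197, 197)
        out = scores.softmax(dim=-1) @ v                  # (2, 197, 768)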

  • @kbkim-f4z
    @kbkim-f4z 3 years ago +1

    what a video!

  • @youngyulkim3072
    @youngyulkim3072 2 years ago

    thanks so much! helped me a lot

  • @saeedataei269
    @saeedataei269 2 years ago

    Thanks, please explain more about the code, and please implement the Swin Transformer too.

  • @rafaelgg1291
    @rafaelgg1291 3 years ago +2

    Thank you! New subscriber :D

  • @jamgplus334
    @jamgplus334 3 years ago +1

    awesome video

  • @jeffg4686
    @jeffg4686 9 months ago

    "mildly overfitted" is how I like to keep my underwear so I don't get the hyena.

  • @andreydung
    @andreydung 2 years ago +1

    Awesome!!!!

  • @hamzaahmed5837
    @hamzaahmed5837 3 years ago +2

    Great!

  • @awsaf49
    @awsaf49 3 years ago

    What coding speed!!! Did you speed up the video or were you actually coding it in real time?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      It is all sped up:) The goal is to keep the videos as short as possible:)

    • @awsaf49
      @awsaf49 3 years ago

      @@mildlyoverfitted Oh no :v I was kinda motivated to code fast. Nice tutorial by the way :)

  • @vkmavani7878
    @vkmavani7878 3 years ago

    I got the error in assert_tensors_equal(res_c, res_o)

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      Feel free to create an issue on GitHub if it is related to the implementation from this video.

    • @vkmavani7878
      @vkmavani7878 3 years ago +1

      @@mildlyoverfitted Thank you for quick reply. I checked again and I found my mistake.

    • @vkmavani7878
      @vkmavani7878 3 years ago +1

      @@mildlyoverfitted make more tutorials like this; it helps a lot not only for implementation purposes, but I also learn how to write clean code with docstrings.

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      @@vkmavani7878 Thank you! I will try to continue making videos like this:)