Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (paper illustrated)

  • Published 17 Dec 2024

COMMENTS • 63

  • @phattailam9814
    @phattailam9814 1 year ago +1

    Thank you so much for the explanation!

  • @mahmoudimus
    @mahmoudimus 10 months ago +1

    Great explanation. Love the music + the voice :)

    • @AIBites
      @AIBites  10 months ago

      Thanks. Glad you liked it!

  • @MangalisoMngomezulu-y3b
    @MangalisoMngomezulu-y3b 8 months ago +1

    This is brilliant!

    • @AIBites
      @AIBites  8 months ago

      Thanks 👍

  • @kalluriramakrishna5732
    @kalluriramakrishna5732 2 years ago +1

    Thank you for your fabulous Explanation

  • @tonywang7933
    @tonywang7933 8 months ago +1

    Thank you!! So nicely explained

    • @AIBites
      @AIBites  8 months ago

      You're welcome. Would you like to see more papers explained, or more coding videos?

  • @tensing2009
    @tensing2009 2 years ago

    Great Video!
    Thanks for making it! :)

  • @muhammadsalmanali1066
    @muhammadsalmanali1066 3 years ago

    Thank you so much for the explanation. Please keep the videos coming.

    • @AIBites
      @AIBites  3 years ago +1

      Sure will do!

  • @suke933
    @suke933 2 years ago +3

    Thanks for the video, dear AI Bites. I was struggling to understand the Swin architecture, and it was explained very clearly here. I would like to ask about the motivation for selecting different values of C. Why is it important? If you could explain that, it would give me a deeper understanding.

  • @robosergTV
    @robosergTV 6 months ago +1

    Huh? ViT was the first Transformer backbone architecture for vision, not Swin.

    • @AIBites
      @AIBites  3 months ago

      awesome spot. And thanks for this info.

  • @deadbeat_genius_daydreamer
    @deadbeat_genius_daydreamer 1 year ago

    This is seriously underrated. I enjoyed this visual approach. Thanks and regards for your efforts in making this explanation. Cheers🎊👍

    • @AIBites
      @AIBites  1 year ago

      Thank you so much Harshad! 😊

  • @JagannathanK-y5e
    @JagannathanK-y5e 1 year ago +1

    Great explanation

  • @JC-ru4bp
    @JC-ru4bp 3 years ago +1

    Very clear explanation of the paper idea, thanks.

    • @AIBites
      @AIBites  3 years ago

      very encouraging to keep making videos :)

    • @JC-ru4bp
      @JC-ru4bp 3 years ago

      @AIBites Keep it up, man.

  • @manub.n2451
    @manub.n2451 2 years ago +1

    Thank you so much

  • @sanjeetpatil1249
    @sanjeetpatil1249 2 years ago

    Can you kindly explain this line in the paper, related to the patch merging layer: "The first patch merging layer concatenates the features of each group of 2 × 2 neighboring patches, and applies a linear layer on the 4C-dimensional concatenated features".
    Thank you for the video
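
For readers with the same question, the quoted sentence can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `patch_merging` and `weight` are hypothetical names, the random `weight` matrix stands in for the learned linear layer, and any normalization around that layer is left out. The 2C output width follows the paper's description of the first merging layer.

```python
import numpy as np

def patch_merging(x, weight):
    """Merge every 2x2 group of neighbouring patches (illustrative sketch).

    x:      (H, W, C) feature map with even H and W
    weight: (4C, 2C) matrix standing in for the learned linear layer
    """
    # Pick the four members of each 2x2 neighbourhood.
    x0 = x[0::2, 0::2, :]  # top-left
    x1 = x[1::2, 0::2, :]  # bottom-left
    x2 = x[0::2, 1::2, :]  # top-right
    x3 = x[1::2, 1::2, :]  # bottom-right
    # Concatenate along channels: (H/2, W/2, 4C).
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)
    # Linear projection from 4C down to 2C channels.
    return merged @ weight

rng = np.random.default_rng(0)
C = 96
x = rng.standard_normal((56, 56, C))           # assumed stage-1 resolution
weight = rng.standard_normal((4 * C, 2 * C))   # placeholder for learned weights
out = patch_merging(x, weight)
print(out.shape)  # (28, 28, 192)
```

The spatial resolution halves in each direction while the channel count doubles, which is what gives the network its hierarchical, CNN-like feature pyramid.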

  • @muhammadwaseem_
    @muhammadwaseem_ 1 year ago +1

    Good explanation

  • @TheMomentumhd
    @TheMomentumhd 2 years ago

    Do you think these Swin Transformers would be useful in real-time object detection? (Are they fast enough?)

  • @harutmargaryan9980
    @harutmargaryan9980 3 years ago

    Thank you, well done!

  • @triminh3849
    @triminh3849 3 years ago

    great video with excellent visualization, thanks a lot

    • @AIBites
      @AIBites  3 years ago

      Glad you like it! :)

  • @anhminhtran7609
    @anhminhtran7609 3 years ago

    Can you cover a bit more on using Swin for object detection, please?

  • @EngRiadAlmadani
    @EngRiadAlmadani 3 years ago +2

    Thanks for this great video. Just one question: why do we use a linear layer in patch merging when we could reshape the input patches directly with a reshape method?

    • @AIBites
      @AIBites  3 years ago +2

      Great question. One thing I can think of is efficiency. I believe reshape is also challenging to propagate gradients backwards.

    • @Deshwal.mahesh
      @Deshwal.mahesh 2 years ago +1

      Maybe they're trying to make the model learn how to merge with knowledge? Just like solving a graphical puzzle?

    • @suke933
      @suke933 2 years ago

      @AIBites Can we use convolution in this scenario?

  • @arpitaingermany
    @arpitaingermany 2 years ago +1

    Thank you for illustrating this architecture. Can you make more videos on the segmentation algorithms that are being used nowadays, please? Thanks.

    • @AIBites
      @AIBites  2 years ago +2

      Sure. Will plan to make one on SegFormers.

    • @arpitaingermany
      @arpitaingermany 2 years ago

      @AIBites cool ❤️
      And thanks for this presentation

  • @saeedataei269
    @saeedataei269 2 years ago +1

    Thanks for the explanation. Please review more SOTA papers.

    • @AIBites
      @AIBites  2 years ago +1

      Sure will do Saeed! Thx. 🙂

  • @jialima8298
    @jialima8298 3 years ago

    Love the voice!

  • @anonymous-random
    @anonymous-random 3 years ago

    The video is awesome! Thanks a lot!

    • @AIBites
      @AIBites  3 years ago

      Glad you liked it!

  • @parveenkaur2747
    @parveenkaur2747 3 years ago +1

    Very informative video!

    • @AIBites
      @AIBites  3 years ago

      Thanks! Glad you liked it.

  • @knowhowww
    @knowhowww 3 years ago

    Thank you for the great effort.

  • @kashishbansal2651
    @kashishbansal2651 3 years ago

    AMAZING EXPLANATION!

  • @taoufiqelfilali2224
    @taoufiqelfilali2224 3 years ago

    Great explanation, thank you

    • @AIBites
      @AIBites  3 years ago

      Thanks for your positive comment! :)

  • @harshkumaragarwal8326
    @harshkumaragarwal8326 3 years ago

    great work, thanks :)

  • @rybdenis
    @rybdenis 3 years ago +1

    cool, thank you

  • @keroldjoumessi
    @keroldjoumessi 3 years ago +1

    Thanks for the video. It was very awesome and easy to follow. Even if the window-based architecture reduces the complexity of computing self-attention, I think we still have this computational issue for the overall image, and the attention becomes local as in CNNs instead of global as in RNNs. Anyway, thanks for your explanation.

    • @readera84
      @readera84 3 years ago +1

      How are you saying such complex things so easily 😫 I couldn't even understand what he said 🤕

    • @keroldjoumessi9597
      @keroldjoumessi9597 3 years ago

      @readera84 What don't you understand? Maybe I can give you a hand.

    • @readera84
      @readera84 3 years ago

      @keroldjoumessi9597 The windows shifting diagonally... can you make that clearer for me?
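
For anyone stuck on the same point, the diagonal shift can be pictured with a toy grid. The sketch below is an assumption-laden illustration, not the paper's code: `window_partition` is a hypothetical helper, the 8 × 8 grid and window size M = 4 are arbitrary, and the attention masking that Swin applies to the wrapped-around regions is omitted. It rolls the grid by half a window toward the top-left and then cuts the same regular windows again.

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W) grid into non-overlapping M x M windows (row-major order)."""
    H, W = x.shape
    x = x.reshape(H // M, M, W // M, M)
    return x.transpose(0, 2, 1, 3).reshape(-1, M, M)

M = 4
grid = np.arange(8 * 8).reshape(8, 8)  # each cell holds its own flat index

# Regular partition: four 4x4 windows aligned with the grid.
regular = window_partition(grid, M)

# Shifted partition: cyclically roll the grid by M // 2 toward the top-left,
# then partition exactly as before.
rolled = np.roll(grid, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
shifted = window_partition(rolled, M)

print(regular[0])  # rows 0-3, cols 0-3 of the original grid
print(shifted[0])  # rows 2-5, cols 2-5 of the original grid
```

The first shifted window covers rows 2 to 5 and columns 2 to 5 of the original grid, so it overlaps all four regular windows. That overlap is how information crosses window boundaries between consecutive layers, even though attention inside each layer stays local.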

  • @garyhuntress6871
    @garyhuntress6871 3 years ago +1

    Excellent review, thanks. I've subscribed for future papers! Do you use manim for your animations?

    • @AIBites
      @AIBites  3 years ago

      Hi Gary, Thanks for your comments! In some places I use manim but not always. :)

  • @rajatayyab7737
    @rajatayyab7737 3 years ago +1

    Next should be Dynamic Head: Unifying Object Detection Heads with Attentions

    • @rybdenis
      @rybdenis 3 years ago

      agreed

    • @AIBites
      @AIBites  3 years ago

      Thanks, Raja, for pointing it out. We will try to prioritise the paper at some point.

  • @peddisaivivek6676
    @peddisaivivek6676 2 years ago

    Great video. But could you refrain from putting music in the background while explaining? It's a little distracting when viewing at higher speed.

    • @AIBites
      @AIBites  2 years ago

      Sure will take it on board when making the future ones 👍

  • @nguyenanhnguyen7658
    @nguyenanhnguyen7658 3 years ago

    NLP, you have 100,000 words at most to permute and train with. With images? Well. ViT with 400m images can hardly manage to match ImageNet :)