Thank you so much for the explanation!
Great explanation. Love the music + the voice :)
Thanks. Glad you liked it!
This is brilliant!
Thanks 👍
Thank you for your fabulous explanation
Thank you!! So nicely explained
You're welcome. So would you like to see more paper explanations, or more coding videos?
Great Video!
Thanks for making it! :)
Thank you so much for the explanation. Please keep the videos coming.
Sure will do!
Thanks for the video, dear AI Bites. I was struggling to understand the Swin architecture, and you explained it very clearly. But I would like to ask about the motivation for choosing different C values. Why is it important? An explanation would give me a more meaningful understanding.
huh? ViT was the first backbone Transformer arch for vision, not Swin
awesome spot. And thanks for this info.
This is seriously underrated. I enjoyed the visual approach. Thanks and regards for your effort in making this explanation. Cheers🎊👍
Thank you so much Harshad! 😊
Great explanation
Thanks!
Very clear explanation of the paper idea, thanks.
very encouraging to keep making videos :)
@@AIBites Keep it up, man!
Thank you so much
Can you kindly explain this line in the paper, related to the patch merging layer: "The first patch merging layer concatenates the features of each group of 2 × 2 neighboring patches, and applies a linear layer on the 4C-dimensional concatenated features".
Thank you for the video
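For anyone puzzling over that line, here is a toy sketch of what it describes, in plain NumPy. The random weight matrix stands in for the learned projection; the shapes are the point, not the values:

```python
import numpy as np

# Toy patch-merging sketch (illustrative, not the official Swin code).
# Input: an H x W grid of patch tokens, each with C features.
H, W, C = 4, 4, 8
x = np.random.randn(H, W, C)

# 1) Group each 2x2 neighborhood of patches and concatenate
#    their features -> one 4C-dimensional vector per group.
x = x.reshape(H // 2, 2, W // 2, 2, C)      # split into 2x2 blocks
x = x.transpose(0, 2, 1, 3, 4)              # (H/2, W/2, 2, 2, C)
merged = x.reshape(H // 2, W // 2, 4 * C)   # (H/2, W/2, 4C)

# 2) Apply a linear layer on the 4C-dimensional concatenated
#    features (in Swin it projects down to 2C).
W_proj = np.random.randn(4 * C, 2 * C) / np.sqrt(4 * C)
out = merged @ W_proj                       # (H/2, W/2, 2C)

print(merged.shape, out.shape)   # (2, 2, 32) (2, 2, 16)
```

So the spatial resolution halves in each direction while the channel count doubles, which is how Swin builds its hierarchical feature maps.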
Good explanation
Do you think these Swin Transformers would be useful in real-time object detection? (Are they fast enough?)
Thank you, well done!
great video with excellent visualization, thanks a lot
Glad you like it! :)
Can you cover a bit more on using Swin for object detection, please?
Thanks for this great video. Just one question: why is a linear layer used in patch merging when we could rearrange the input patches directly with a reshape?
Great question. One thing to note is that a reshape only rearranges the features into a 4C-dimensional vector; the linear layer then learns how to mix those features and also reduces the dimension from 4C to 2C, which a plain reshape can't do.
Maybe they're trying to make the model learn how to merge the patches, like solving a jigsaw puzzle?
@@AIBites Can we use a convolution in this scenario?
Thank you for illustrating this architecture. Can you make more videos on the segmentation algorithms that are in use nowadays, please? Thanks.
Sure. Will plan to make one on SegFormers.
@@AIBites cool ❤️
And thanks for this presentation
Thanks for the explanation. Please review more SOTA papers.
Sure will do Saeed! Thx. 🙂
Love the voice!
The video is awesome! Thanks a lot!
Glad you liked it!
Very informative video!
Thanks! Glad you liked it.
Thank you for the great effort.
My pleasure!
AMAZING EXPLANATION!
great explanation, thank you
Thanks for your positive comment! :)
great work, thanks :)
cool, thank you
Thanks for the video. It was awesome and easy to follow. However, even though the window architecture reduces the complexity of computing self-attention, I think we still have a computational issue for the overall image, and the attention becomes local as in CNNs instead of global as in RNNs. Anyway, thanks for your explanation.
How are you saying such complex things so easily 😫 I couldn't even understand what he said 🤕
@@readera84 what don't you understand? maybe I can give you a hand
@@keroldjoumessi9597 The windows shifting diagonally... can you make it clearer to me?
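On the diagonal shift: between consecutive Swin blocks, the window grid is displaced by half a window size in both directions, which in practice is implemented as a cyclic shift of the feature map before re-partitioning it into windows. A tiny single-channel NumPy illustration (window size M is my choice here, not from the thread):

```python
import numpy as np

M = 4                                 # window size
x = np.arange(8 * 8).reshape(8, 8)    # toy 8x8 feature map, 1 channel

# Shifted-window step: roll the map up-left by M//2 so the new
# window grid sits diagonally between the previous windows.
shifted = np.roll(x, shift=(-M // 2, -M // 2), axis=(0, 1))

# Partition into non-overlapping MxM windows, same as before.
windows = shifted.reshape(8 // M, M, 8 // M, M).transpose(0, 2, 1, 3)
print(windows.shape)   # (2, 2, 4, 4): a 2x2 grid of 4x4 windows
```

After the roll, tokens that sat in four different windows in the previous block now share one window, which is how information flows across window boundaries without computing global attention.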
Excellent review, thanks. I've subscribed for future papers! Do you use manim for your animations?
Hi Gary, Thanks for your comments! In some places I use manim but not always. :)
Next should be "Dynamic Head: Unifying Object Detection Heads with Attentions"
agreed
Thanks, Raja, for pointing it out. We will try to prioritise that paper at some point.
Great video. But can you refrain from putting music in the background while explaining? It's a little distracting when viewing at higher speed.
Sure will take it on board when making the future ones 👍
In NLP, you have at most ~100,000 words to permute and train with. With images? Well, ViT with 400M images can hardly manage to match ImageNet :)