By far one of the best + complete, SWIN transformer explanations on the entire Internet.
Thanks!
@soroushmehraban Hi sir, could you also explain the FasterViT and GCViT papers...
Great explanation, thanks!
That's The Most Illustrative Video Of Swin-Transformers on The Internet!
Glad you enjoyed it 😃
@soroushmehraban Yes, absolutely, thanks so much! Although I have a quick question that's actually more related to PyTorch, about min 12:49, line 239 in the code. First, what does the -1 mean there and what exactly does it do to the tensor? Second, where does the [4, 16] come from? The 4 in particular, where did we get it, since it's not mentioned in the reshaping? Thanks in advance.
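Not the exact tensor from the video, but as a quick PyTorch sketch of what -1 does in view/reshape: PyTorch infers that one dimension from the total number of elements, so a shape like [4, 16] can appear even though the 4 was never written out explicitly (the numbers below are just illustrative).

    import torch

    x = torch.arange(64)   # 64 elements in total
    a = x.view(4, -1)      # -1 is inferred as 64 / 4 = 16
    print(a.shape)         # torch.Size([4, 16])
    b = x.view(-1, 4, 4)   # here -1 is inferred as 64 / (4*4) = 4
    print(b.shape)         # torch.Size([4, 4, 4])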
Thorough! Very comprehensible, thank you.
Really informative, it helped me a lot to understand many concepts here. Keep up the good work!
Thanks! I’ll try my best.
Very well explained, thank you Soroush.
Glad you liked it
You deserve more likes and subscribers
Thanks man🙂 appreciated
Thanks for the good explanation!
17:15, may I ask why the number at the right bottom of the 3rd swin block is 6?
That's a hyperparameter, I believe. It's hard to use a lot of layers in the first and second stages because of the memory constraints that come with 4x4 and 8x8 patches, and the 32x32 patches in the last stage are the coarsest (least attention to detail). So they put the most blocks at the 16x16 patch size instead.
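For context, the Swin-T configuration in the paper uses [2, 2, 6, 2] blocks per stage, and with a 224x224 input the token count drops by 4x at every stage; a rough sketch of the trade-off:

    # Swin-T block counts per stage vs. effective patch size (224x224 input)
    depths = [2, 2, 6, 2]
    patch_sizes = [4, 8, 16, 32]
    for depth, patch in zip(depths, patch_sizes):
        tokens = (224 // patch) ** 2
        print(f"{patch}x{patch} patches -> {tokens} tokens, {depth} blocks")
    # 4x4 -> 3136 tokens (expensive), 8x8 -> 784, 16x16 -> 196 (cheap but still
    # fairly detailed, hence 6 blocks), 32x32 -> 49 (coarsest resolution)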
Amazing video !
Thanks!
The discussion about patch size at around 16:40 is confusing
I was comparing a 4x4 Swin Transformer vs a 4x4 ViT. In a 4x4 ViT, every layer works on 4x4-pixel patches, so all layers keep good attention to detail. But in the Swin Transformer, as we go forward we merge these tokens, so the deeper layers have less attention to detail (that's why the last layer's output alone is not enough for segmentation).
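As a rough sketch of that merging step (my own simplification of patch merging, not the exact repo code): every 2x2 neighbourhood of tokens is concatenated and linearly projected, so the token grid halves in each direction and spatial detail is gradually lost.

    import torch
    import torch.nn as nn

    B, H, W, C = 1, 56, 56, 96           # token grid after 4x4 patch embedding (Swin-T-like sizes)
    x = torch.randn(B, H, W, C)

    # gather the 4 tokens of every 2x2 neighbourhood and concatenate their channels
    x0 = x[:, 0::2, 0::2, :]
    x1 = x[:, 1::2, 0::2, :]
    x2 = x[:, 0::2, 1::2, :]
    x3 = x[:, 1::2, 1::2, :]
    merged = torch.cat([x0, x1, x2, x3], dim=-1)   # (1, 28, 28, 384)

    reduction = nn.Linear(4 * C, 2 * C)            # project 4C -> 2C
    out = reduction(merged)
    print(out.shape)                               # (1, 28, 28, 192): half the resolution, double the channels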
perfect description.
Glad it was helpful 🙂
Great video! Thanks
Thanks for the feedback 🙂
Where is the code that you were referring to?
github.com/microsoft/Swin-Transformer/blob/main/models/swin_transformer.py#L222
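For anyone landing here, the reshape discussed above is (as far as I can tell) the window-partition step in that file; here's a simplified sketch with made-up sizes (window_size=7, a 14x14 token grid) showing where the -1 and the inferred window count come from. Check the linked file for the exact code.

    import torch

    B, H, W, C = 1, 14, 14, 384
    window_size = 7
    x = torch.randn(B, H, W, C)

    # split H and W into (num_windows_h, window_size) and (num_windows_w, window_size)
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # bring the two window-grid axes together, then let -1 absorb B * num_windows
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
    print(windows.shape)   # torch.Size([4, 7, 7, 384]); the 4 is B * (H//7) * (W//7) = 1 * 2 * 2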
I enjoyed it very much.
Very nicely explained, thank you! The likes are at 314, so I didn't hit like 😁 subscribed instead.
2:43 C would be equal to the number of filters, not the number of kernels. In the torch.nn.Conv2d operation being performed, each filter has 3 kernels (one per input channel), and there are C filters in total. So each filter has 3 kernels, not C kernels.
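To make that concrete, in torch.nn.Conv2d the weight tensor has shape [out_channels, in_channels, kH, kW], so with an RGB input each of the C filters carries 3 kernels (C=96 and the 4x4 patch-embedding conv below are just example values):

    import torch.nn as nn

    C = 96                                      # embedding dim, e.g. Swin-T
    patch_embed = nn.Conv2d(in_channels=3, out_channels=C, kernel_size=4, stride=4)
    print(patch_embed.weight.shape)             # torch.Size([96, 3, 4, 4])
    # 96 filters (output channels), each with 3 kernels of size 4x4 (one per input channel)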