*Github Code* - github.com/explainingai-code/VIT-Pytorch
*Patch Embedding* - Vision Transformer (Part One) - ua-cam.com/video/lBicvB4iyYU/v-deo.html
*Attention* in Vision Transformer (Part Two) - ua-cam.com/video/zT_el_cjiJw/v-deo.html
*Implementing Vision Transformer* (Part Three) - ua-cam.com/video/G6_IA5vKXRI/v-deo.html
Very well explained, with the math behind it made clear. Very good.
Thank You!
Thanks for the amazing explanation!! It is well explained and concise. I like the recap at the end where you summarized what was covered. Keep going!
Thanks for the feedback and am really happy that you liked it.
I've been struggling with this lol. Really good stuff.
Fantastic explanation.
Amazing video! Thanks a lot!
Thank You!
Wonderful video!
Thank you very much!
Thank you for the amazing video. Only thing for the future: Maybe remove that click sound when you change the slides!
Thank you for the feedback. Will take care of it in future videos :)
Short and concise, thanks!
Initializing pos_embed with zeros vs. with randn made the position embedding similarity visualization (pesv) look sensible vs. nonsensical. Any idea why this happens?
When I initialized with randn, the pesv looked random throughout training (with better losses and eval metrics).
When I initialized with all zeros, the pesv made a lot of sense in very early iterations, i.e. similar regions in the dataset had similar pos_embed (with slightly worse eval metrics and losses than before).
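A minimal sketch of the visualization being described, for anyone who wants to reproduce it; the names, grid size and 0.02 init scale are assumptions, not values from the repo:

```python
# Sketch (assumed names/shapes): the two initializations being compared, and the
# cosine-similarity map you would plot for a (trained) positional embedding table.
import torch
import matplotlib.pyplot as plt

grid, emb_dim = 14, 192                      # hypothetical 14x14 patch grid
num_patches = grid * grid

pos_embed_zeros = torch.nn.Parameter(torch.zeros(1, num_patches, emb_dim))
pos_embed_randn = torch.nn.Parameter(torch.randn(1, num_patches, emb_dim) * 0.02)

def similarity_map(pos_embed, query_idx):
    # Cosine similarity of one position's embedding against every position,
    # reshaped back onto the patch grid for visualization.
    pe = pos_embed.squeeze(0)                # (num_patches, emb_dim)
    sim = torch.nn.functional.cosine_similarity(pe[query_idx:query_idx + 1], pe, dim=-1)
    return sim.reshape(grid, grid).detach()

plt.imshow(similarity_map(pos_embed_randn, query_idx=104))   # roughly central patch
plt.colorbar()
plt.show()
```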
That’s interesting. What if you reduce the positional embedding space to a much lower dimensionality, say with PCA, and then look at the similarity between the positional embedding projections in that space?
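A rough sketch of that PCA idea, using a random stand-in for a trained positional embedding table (all names and sizes here are assumed):

```python
# Project the embeddings down to a few principal components, then compare
# pairwise similarities in that lower-dimensional space.
import torch
from sklearn.decomposition import PCA

pos_embed = torch.randn(196, 192)            # stand-in for a learned (num_patches, emb_dim) table

pca = PCA(n_components=8)
proj = torch.tensor(pca.fit_transform(pos_embed.numpy()))     # (196, 8)

proj = torch.nn.functional.normalize(proj, dim=-1)
sim = proj @ proj.T                          # (196, 196) cosine similarities in PCA space
print(sim.shape, pca.explained_variance_ratio_.sum())
```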
Thanks for the video, I learned a lot. What if I have 3D images (C, W, H, Z (slices))?
Which parameters should I change or add?
What patch size should I consider? Thank you in advance.
Thank you :)
So I have never actually worked with 3D images (I am assuming you have medical 3D images), so take everything I say below with a grain of salt.
If your goal is classification, you could simply consider tubelets instead of patches, meaning you patchify not just along the spatial dimensions but also along the z (slices, I am guessing) dimension (a rough tubelet-embedding sketch is at the end of this reply).
But if you trivially compute attention between these tubelet embeddings, it would most likely end up very costly (though if your images are small and the z dimension is not huge, you can try this). You can refer to the ViViT paper to understand this, the details involved, and some approaches to reduce the cost.
If your images/slices are large, you can try the Swin Transformer; specifically, you can refer to the Video Swin Transformer paper. Just note that for you the temporal dimension becomes slices.
For specific patch sizes and other parameters, there's a Swin paper for 3D medical images you can take a look at (Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis), specifically their encoder part.
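To make the tubelet idea concrete, here is a minimal sketch in the spirit of ViViT; the shapes and embedding width are assumptions, not from any particular implementation:

```python
# A strided 3D convolution patchifies H, W and the slice dimension together,
# producing one token per tubelet (the same trick ViT uses in 2D with Conv2d).
import torch
import torch.nn as nn

B, C, D, H, W = 2, 1, 40, 224, 224            # hypothetical volume: 40 slices of 224x224
patch_d, patch_hw, emb_dim = 4, 16, 384

tubelet_embed = nn.Conv3d(
    in_channels=C, out_channels=emb_dim,
    kernel_size=(patch_d, patch_hw, patch_hw),
    stride=(patch_d, patch_hw, patch_hw),
)

x = torch.randn(B, C, D, H, W)
tokens = tubelet_embed(x)                      # (B, emb_dim, D/4, H/16, W/16)
tokens = tokens.flatten(2).transpose(1, 2)     # (B, num_tubelets, emb_dim)
print(tokens.shape)                            # torch.Size([2, 1960, 384]) -> 10*14*14 tubelets
```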
@@Explaining-AI Thank you for such a great explanation and for the suggestions regarding my question. My image is 1x244x244x40 (1 channel and 40 slices). Do you think this is a large input for ViT? And yes, I am working on a classification task. I will definitely read and try ViViT and the Swin papers for my problem. Thank you again for creating great content and for your help.😇
@@buh357 It's my pleasure :) Maybe you can try downscaling the images and seeing how long convergence takes, to test that quickly. For an easy calculation, assume your images are 224x224x40 and the patch dimensions are 16x16x4. Then you would essentially be running a transformer on a sequence of 14x14x10 ≈ 2000 tokens (a quick sanity check of that count, and of the attention cost it implies, is sketched below). So I do feel Swin might be the better option here, but still try simply converting the ViT to a 3D ViT on downscaled images, and see what kind of results you get, to decide better.
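A back-of-the-envelope check of those numbers (same assumed sizes as above, nothing measured):

```python
# Token count for 224x224x40 input with 16x16x4 patches, plus the pairwise
# attention entries full self-attention would need per head per layer.
img_hw, img_d = 224, 40
patch_hw, patch_d = 16, 4

tokens = (img_hw // patch_hw) ** 2 * (img_d // patch_d)   # 14 * 14 * 10 = 1960
attn_entries = tokens ** 2                                # full self-attention scores per head
print(tokens, attn_entries)                               # 1960 3841600
```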
@@Explaining-AI Hi, sir. You were right, the Video Swin Transformer performed better than the 3D Vision Transformer (in my training of the latter, both train and val loss did not decrease). But the problem is that the Video Swin Transformer is not performing better than 3D ResNet-50 and 3D EfficientNet. I am wondering how I can boost the performance of the Video Swin Transformer. More data? Do you have any suggestions or tips for training the Video Swin Transformer? BTW, I am training from scratch. Thank you in advance.😁
Hello @@buh357, assuming you have already tried changing things like the number of layers and tuning the hyperparameters, then yes, you can try with more data; that should help.
But if getting more data requires more investment (time or cost), then to get a sense of whether it will actually benefit you, you can try training both (Swin and ResNet) on less data, say 10%/20%/50%/75% and so on of the current dataset (a rough sketch of how to carve out those subsets is below).
If you see that the performance gap between ResNet and Swin keeps decreasing as you give them a higher fraction of the dataset, that is a good indication that more data will further help bridge the gap.
And like you said, do give it a shot with pretrained checkpoints as well (for both variants, for a fair comparison).
BTW, how large is your dataset, and what is the difference in performance between the two?
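A hedged sketch of that subset experiment; the dummy dataset and its sizes are placeholders just to keep the snippet self-contained, and the training loop is left as a comment since it would be your existing one:

```python
# Carve out fixed 10%/20%/50%/75% subsets with a fixed seed so both models
# (Video Swin and 3D ResNet) see exactly the same samples at each fraction.
import torch
from torch.utils.data import TensorDataset, Subset

# Tiny stand-in for the real volumes, only so the snippet runs on its own
full_dataset = TensorDataset(torch.randn(200, 1, 8, 32, 32), torch.randint(0, 2, (200,)))

perm = torch.randperm(len(full_dataset), generator=torch.Generator().manual_seed(0))
subsets = {
    frac: Subset(full_dataset, perm[: int(frac * len(full_dataset))].tolist())
    for frac in (0.10, 0.20, 0.50, 0.75)
}

for frac, subset in subsets.items():
    print(frac, len(subset))
    # train Video Swin and 3D ResNet on `subset` here, record val metrics,
    # and check whether the gap between them shrinks as frac grows
```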
Please don't use music in the background, it's very distracting, thanks.
Thank you for the feedback. Have taken care of this in my recent videos.