Mathilde Caron, thanks a lot for such a clear and concise explanation. I enjoyed your talk on Machine Learning Street Talk too. Looking forward to more great work from you.
When using multi-crops, the size of the crops is not fixed. For example, the author uses 2 crops of 160x160 and 4 crops of 96x96. How can the network architecture remain the same when the input dimensions are not fixed?
It uses this function pytorch.org/vision/0.8/_modules/torchvision/transforms/transforms.html#RandomResizedCrop, which crops first and then resizes the crops, so the output size of all the crops is the same.
@@jjxed I don't think this is correct. If you look at the code, the model really does process (3, 224, 224) and (3, 96, 96) images. It's the AdaptiveAvgPool2d layer that pools the final representation in a size-specific way, so the output size stays the same.
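A minimal sketch of the point above, using a toy backbone (the conv layer and feature width here are made up for illustration, not taken from the actual model): nn.AdaptiveAvgPool2d always emits a fixed spatial size regardless of the input resolution, so crops of different sizes produce representations of identical shape.

```python
import torch
import torch.nn as nn

# Toy backbone: one conv layer followed by adaptive pooling.
# AdaptiveAvgPool2d((1, 1)) averages whatever spatial grid it receives
# down to 1x1, so different input resolutions give the same output shape.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
)

global_crops = torch.randn(2, 3, 224, 224)  # e.g. 2 large crops
local_crops = torch.randn(4, 3, 96, 96)     # e.g. 4 small crops

print(backbone(global_crops).shape)  # torch.Size([2, 8])
print(backbone(local_crops).shape)   # torch.Size([4, 8])
```

Both batches come out with an 8-dimensional feature per image, even though their spatial inputs differ, which is why the same network can consume multi-crop inputs of varying size.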
Great topic, but I can't understand her... I was looking for subtitles, but they are not enabled... It would be a good idea to enable them. Thanks.
Awesome! Great work Mathilde!
A proper sound quality would have been much appreciated.
Agreed, thanks! We have been working on ensuring better sound quality in our videos
@@PyTorchLightning Is there a possibility to add subtitles to the video? It would help with the videos that already have poor audio quality.
French accent is classy but also unfortunately hard to understand
The PhD student looks like she's 17.
Thanks for this information.