Receptive Fields: Why 3x3 conv layer is the best?

  • Published Dec 29, 2024

COMMENTS • 27

  • @zukofire6424
    @zukofire6424 1 year ago +3

    just subscribed thanks for this, great explanation!

  • @ericsy78
    @ericsy78 2 years ago +1

    Great video as always

  • @DimaZheludko
    @DimaZheludko 2 years ago +2

    Thanks for explanation!
    As for the Max-pooling at 03:55, could you compare it to average pooling? What are the advantages of each, and where should one use either?
    I'm implementing a simplified version of StyleGAN, and in the discriminator (or critic) part I use AveragePooling. But I see that MaxPooling is used there more commonly. Should I switch, or does it depend on something else? I haven't tested these methods side by side. Maybe I should.

    • @soroushmehraban
      @soroushmehraban  2 years ago +2

      Great question, Dima!
      I think the usage depends on the application and the dataset we use during training. Comparing max pooling vs. average pooling for object detection, I believe max pooling is the better option: where the object is located matters in object detection, and object edges (which likely have the highest values relative to their neighbors) play an important role.
      Discriminators, however, should be concerned with every minute detail of the input images. When you use max pooling, you throw away 3 of the 4 values in every 2x2 region of the feature map, and each of those values may have a huge receptive field over the input layer. As a result, you make it harder for the discriminator to discriminate.
      What I said is based on my intuition, which may or may not be correct. Please try average pooling instead and share the results with me. I'm excited to see the change (you can find my email in the about section).
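To make the max- vs. average-pooling trade-off above concrete, here is a minimal NumPy sketch (the `pool2x2` helper and the toy feature map are my own illustration, not from the video): max pooling keeps only 1 of the 4 values in each 2x2 region, while average pooling blends all 4.

```python
import numpy as np

def pool2x2(x, mode="max"):
    """2x2 pooling with stride 2 on a 2D feature map (H and W must be even)."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)  # group pixels into 2x2 regions
    if mode == "max":
        return blocks.max(axis=(1, 3))   # keeps 1 of 4 values per region
    return blocks.mean(axis=(1, 3))      # blends all 4 values per region

fmap = np.array([[1., 2., 0., 1.],
                 [3., 4., 1., 0.],
                 [0., 1., 5., 2.],
                 [2., 0., 3., 4.]])

print(pool2x2(fmap, "max"))  # [[4. 1.]  [2. 5.]]  -- 3 of 4 values discarded
print(pool2x2(fmap, "avg"))  # [[2.5  0.5 ] [0.75 3.5 ]]  -- all values contribute
```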

    • @DimaZheludko
      @DimaZheludko 2 years ago

      @@soroushmehraban Well, for now AveragePooling definitely works for me. I have other technical problems in my training, but I believe the pooling method is not what ruins my day. I'm looking forward to trying two identical networks, one with average and the other with max pooling; I'm curious to compare the results as well. For now my conjecture is that I may be keeping too much useless information in each layer because of the wrong pooling method, thus choking my model and reducing its usefulness.
      However, that's not my main problem for now. The main limitation is that I don't have an Nvidia GPU, so I train the model on one of the cheapest Ryzen processors, and on top of that, on a Windows machine. As a result, I can't implement WGAN-GP: the GP part means that in order to train, the model has to compute gradients of gradients, and my framework doesn't seem able to do that. That doesn't relate to the pooling method, just a fun story.

    • @soroushmehraban
      @soroushmehraban  2 years ago +1

      @@DimaZheludko Why don't you use Google Colab? Currently, it offers a Tesla T4 GPU for free. You can check the GPU type on Colab simply by running
      !nvidia-smi
      in a cell.

    • @DimaZheludko
      @DimaZheludko 2 years ago +1

      @@soroushmehraban Yeah, Colab is a great service. I still can't believe it's free and totally usable, although limited by session time. I've considered migrating to it, but for now I'm not sure it's worth all the hassle. See, my project seems to be somewhat opposite to Colab's ideology: it aims to make image generation work locally on an AMD GPU. Generating is actually the simple part, but training is not. Colab's main advantage is CUDA, so the two are quite opposite in idea. You could, of course, train in Colab and then transfer the model to run locally, but that would require a lot of work to harmonize the two environments.

    • @soroushmehraban
      @soroushmehraban  2 years ago +1

      @@DimaZheludko I see... Good luck!

  • @arohawrami8132
    @arohawrami8132 1 year ago

    Thanks a lot for your great explanation.

  • @hamedshokripoor
    @hamedshokripoor 1 year ago +2

    Thank you very much for your video. Are the filter parameters the same across all input dimensions? For example, do we use one kernel for all of R, G, and B in an RGB image?

    • @soroushmehraban
      @soroushmehraban  1 year ago

      That would reduce the model's flexibility. The whole point of having different weights for different channels is to learn a different representation for each channel. With the same weights, I believe we wouldn't end up with rich high-level semantics.
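As a rough illustration of the point above (the channel counts are my own example, not from the video): a standard convolution filter carries a separate k x k weight slice for every input channel, so a 3x3 filter over an RGB input has 27 weights, not 9.

```python
import numpy as np

in_channels, k = 3, 3   # RGB input, 3x3 kernel
out_channels = 64       # hypothetical layer width

# One filter holds a separate 3x3 weight slice per input channel;
# its output is the sum of the per-channel responses.
one_filter = np.random.randn(in_channels, k, k)
print(one_filter.size)  # 27 weights, not 9

# Full layer parameter count (weights + one bias per output channel):
params = out_channels * (in_channels * k * k + 1)
print(params)           # 64 * (27 + 1) = 1792
```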

  • @dddz7738
    @dddz7738 1 year ago +4

    Don't you think that if the network gets too deep, the receptive field will exceed the size of the input image itself? What happens in that case?
    Great video mate!

    • @soroushmehraban
      @soroushmehraban  1 year ago +2

      If it goes too deep, it ends up with a layer where each element is a vector representing the whole image. If we then add another layer, every 3x3 region acts as a weighted sum of those vectors, where the weights are the kernel's learnable parameters. From that layer on, I think the receptive field stays the same.
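A quick sketch of the saturation described above (the function and the 32x32 input size are my own example): for stacked stride-1 convolutions, the receptive field grows by K-1 per layer until it covers the whole input, after which it stops growing.

```python
def receptive_field(num_layers, kernel=3, input_size=32):
    """Receptive field of stacked stride-1 convolutions, capped at the input size."""
    rf = 1
    for layer in range(1, num_layers + 1):
        # grows by K-1 per layer until it covers the whole input
        rf = min(rf + (kernel - 1), input_size)
        print(f"layer {layer}: receptive field {rf}x{rf}")
    return rf

receptive_field(20)  # saturates at 32x32 once 1 + 2L >= 32 (layer 16 onward)
```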

    • @dddz7738
      @dddz7738 1 year ago +1

      @@soroushmehraban This is a very logical explanation. Thanks! Will try coding it out to confirm :) Do you have a discord?

    • @soroushmehraban
      @soroushmehraban  1 year ago +1

      @@dddz7738 I'm not active on Discord, I'm afraid. If you are on Twitter, mine is twitter.com/soroushmhrbn

  • @aaomms7986
    @aaomms7986 1 month ago

    Thank you so much, and I did subscribe!!!

  • @atharvjagtap4865
      @atharvjagtap4865 18 days ago

    New subscriber 🎉🎉

  • @DimaZheludko
    @DimaZheludko 2 years ago +1

    Hey there. I tested MaxPooling against AveragePooling, as I asked in my other comment here.
    For me and for my network, the result seems quite conclusive:
    MaxPooling is much better.
    1. It makes the network converge much faster, probably 3-4 times faster per number of steps.
    2. It runs a bit faster per step, though the speedup is barely noticeable, somewhere around 2-3%.
    Also, I suspect that MaxPooling uses the network's capacity much more efficiently, but that is still just an assumption.
    Anyway, I'm staying with MaxPooling.

    • @soroushmehraban
      @soroushmehraban  2 years ago

      Thanks for sharing your experience! I wonder why, then, they used AvgPooling in papers like ProGAN 🤔

    • @DimaZheludko
      @DimaZheludko 2 years ago

      @@soroushmehraban Good question.
      The simplest assumption is that it's just simpler to understand and, probably, to implement.
      Maybe it has something to do with the way new layers are added and blended in ProGAN.
      Still, I have no idea. Something to think about.

  • @Jianju69
    @Jianju69 1 year ago

    It looks to me like the receptive field size increases by 2 with each layer, yet you mark the 2nd layer as having a receptive field of 3. How is the difference between 1 and 3 equal to 3?

    • @soroushmehraban
      @soroushmehraban  1 year ago

      It's correct that it increases by 2 with each layer. Based on the formula, it generally increases by L(K-1), and in our case, since K=3, it increases by 2L. I marked the second layer as having a receptive field of 3 because one pixel in the output layer corresponds to 3x3 pixels in the input layer.
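The formula above can be checked directly (the helper name `rf` is my own; L counts conv layers, so the input itself is "layer 1" with a receptive field of 1):

```python
def rf(L, K=3):
    # receptive field after L stacked stride-1 KxK convolutions: 1 + L*(K-1)
    return 1 + L * (K - 1)

print(rf(1))  # 3: one 3x3 conv sees a 3x3 patch of the input
print(rf(2))  # 5: each extra layer adds K-1 = 2
```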

  • @BrandonFurtwangler
    @BrandonFurtwangler 1 year ago +1

    By the same logic as in this video, using 3x1 followed by 1x3 kernels would be better in terms of parameter count and FLOPs while having the same receptive field. Why isn't that common?

    • @soroushmehraban
      @soroushmehraban  1 year ago

      Good question. To answer it, I think we also have to look at convolution kernels from a different perspective: ultimately, after training, they have to detect some "edges" in the input feature volume, and with 3x1 followed by 1x3 kernels, each kernel can only detect horizontal or vertical edges and isn't flexible enough to learn more complex features. Having said that, maybe that's why ConvNeXt (from the "A ConvNet for the 2020s" video) uses 7x7 depthwise convolution instead (a larger receptive field for a single layer, and with depthwise, different features are learned for different channels). Replacing the depthwise convolution in ConvNeXt with three depthwise 3x3 convolutions (or 3x1 followed by 1x3) would be a good experiment to try!
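For reference, the parameter savings the question alludes to are easy to verify (the channel count 64 is my own example; biases ignored): factorizing a 3x3 convolution into 3x1 followed by 1x3 keeps the 3x3 receptive field but cuts weights by a third, at the cost that each factor only mixes one spatial direction.

```python
C = 64  # input and output channels (hypothetical)

full_3x3   = C * C * 3 * 3                  # one 3x3 conv: 36864 weights
factorized = C * C * 3 * 1 + C * C * 1 * 3  # 3x1 then 1x3: 24576 weights

print(full_3x3, factorized)       # 36864 24576
print(factorized / full_3x3)      # 0.666... -> one third fewer parameters
```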

  • @kinger1080
    @kinger1080 9 months ago

    good