Just subscribed, thanks for this, great explanation!
Glad you enjoyed!
Great video as always
Thanks for the explanation!
As for max pooling at 03:55, could you compare it to average pooling? What are the advantages of each, and where should one use either?
I'm implementing a simplified version of StyleGAN, and in the discriminator (or critic) part I use AveragePooling. But I see that MaxPooling is used there more commonly. Should I switch, or does it depend on something else? I haven't tested these methods side by side. Maybe I should.
Great question, Dima!
I think the choice depends on the application and the dataset we use during training. Comparing max pooling vs. average pooling for the task of object detection, I believe max pooling is the better option, because where the object is located matters in object detection, and object edges (which likely have the highest values relative to their neighbors) play an important role.
However, a discriminator should be concerned with every minute detail in the input images. When you use max pooling, you throw away 3 of the 4 values in every 2x2 region of the feature map, and each of those values may have a huge receptive field over the input. As a result, you make it more difficult for the discriminator to differentiate.
What I said was based on my intuition, which may or may not be correct. Please try average pooling instead and share the results with me. I'm excited to see the change (you can find my email in the about section).
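If it helps to see the difference concretely, here is a minimal PyTorch sketch (just an illustration I made up, not something from the video) that applies both pooling types to the same feature map. Max pooling keeps only the strongest activation in each 2x2 window, while average pooling blends all four values:

import torch
import torch.nn as nn

# A 1x1x4x4 feature map with one strong "edge" response.
x = torch.tensor([[[[0.1, 0.2, 0.1, 0.0],
                    [0.3, 9.0, 0.2, 0.1],
                    [0.0, 0.1, 0.2, 0.3],
                    [0.1, 0.0, 0.4, 0.2]]]])

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)  # keeps only the peak per 2x2 window
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)  # blends all 4 values per window

print(max_pool(x))  # the 9.0 survives untouched; the other 3 values in its window are discarded
print(avg_pool(x))  # the 9.0 is diluted to (0.1 + 0.2 + 0.3 + 9.0) / 4 = 2.4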
@@soroushmehraban Well, for now AveragePooling definitely works for me. I have other technical problems in my training, but I believe the pooling method is not what ruins my day. I'm looking forward to trying two identical networks, one with average and the other with max pooling. I'm curious to compare the results as well. For now, my conjecture is that I may be taking in too much useless information in each layer because of the wrong pooling method, thus choking my model and reducing its usefulness.
However, that's not my main problem for now. The main limitation is that I don't have an Nvidia GPU. Thus, I train the model on one of the cheapest Ryzen processors, and on top of that, I do it on a Windows machine. As a result, I cannot implement WGAN-GP. The GP (gradient penalty) part means that in order to train, the model has to compute gradients of gradients, and my framework does not seem to be able to do that. It doesn't relate to the pooling method, just a fun story.
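For reference, the gradient penalty looks roughly like this in PyTorch (just a sketch of the standard WGAN-GP formulation, not my actual code, since my framework isn't PyTorch). The create_graph=True part is the "gradient of a gradient" step that my framework can't handle:

import torch

def gradient_penalty(critic, real, fake):
    # Random interpolation between real and fake images.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(mixed)
    # First-order gradient of the critic's output w.r.t. its input.
    # create_graph=True keeps the graph so this gradient can itself be
    # backpropagated later: the "gradient of a gradient" step.
    grads = torch.autograd.grad(outputs=scores, inputs=mixed,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    grads = grads.view(grads.size(0), -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()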
@@DimaZheludko Why don't you use Google Colab? Currently, it offers a Tesla T4 GPU for free. You can check the GPU type on Google Colab simply by entering
!nvidia-smi
in a cell.
@@soroushmehraban Yeah, Colab is a great service. I still can't believe it is free and totally usable, although limited by session time. I considered trying to migrate to it. Still, for now, I'm not sure it's worth all the hassle. See, my project seems to be somewhat opposite to the Colab ideology. My project aims to establish the ability to generate images locally on an AMD GPU. Actually, generating is the simple part, but training is not. Colab's main advantage is CUDA. So they're quite opposite in idea. You could, of course, train in Colab and then transfer the model to run locally, but that would require a lot of work to harmonize these two environments.
@@DimaZheludko I see... Good luck!
Thanks a lot for your great explanation.
Thank you very much for your video. Are the filter parameters the same for all dimensions of the input? For example, do we use one kernel for all of R, G, and B in an RGB image?
That would reduce the model's flexibility. The whole point of having different weights in different channels is to learn a different representation for each channel. By sharing the same weights, I believe we wouldn't end up with rich high-level semantics.
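To make it concrete, here is a quick PyTorch check (just an illustration, not from the video): a standard Conv2d layer keeps separate weights for every input channel, so each filter has its own 3x3 slice for R, G, and B:

import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3])
# 16 filters, each with its own 3x3 weights for every input channel (R, G, B),
# rather than one shared 3x3 kernel reused across the three channels.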
Don't you think that if the network gets too deep, the receptive field will exceed the size of the input image itself? What happens in that case?
Great video mate!
If it goes too deep, it ends up with a layer where each element is a vector representing the whole image. If we then add another layer, for every 3x3 region it acts as a weighted sum of those vectors, where the weights are the kernel's learnable parameters. I think from that layer on, the receptive field stays the same.
@@soroushmehraban This is a very logical explanation. Thanks! Will try coding it out to confirm :) Do you have a Discord?
@@dddz7738 I'm not active on discord I'm afraid. If you are on Twitter, mine is twitter.com/soroushmhrbn
Thank you so much, and I did subscribe!!!
New subscriber 🎉🎉
Hey there. I tested MaxPooling compared to AveragePooling, as I mentioned in another comment here.
So for me and my network, the result seems quite conclusive.
MaxPooling is much better.
1. The network converges much faster, probably 3-4 times faster in terms of the number of steps.
2. Each step also runs a bit faster, though the difference is barely noticeable, somewhere around 2-3%.
Also, I suspect that MaxPooling uses the network's capacity much more efficiently, but that is still just an assumption.
Anyway, I'm staying with MaxPooling.
Thanks for sharing your experience! I wonder, then, why they used AvgPooling in papers like ProGAN 🤔
@@soroushmehraban Good question.
The simplest assumption is that it's just easier to understand and, probably, to implement.
Maybe it has something to do with the way new layers are added and blended into ProGAN.
Still, I have no idea. Something to think about.
Looks to me like the receptive field size increases by 2 with each layer, yet you mark the 2nd layer as increasing by 3. How is the difference between 1 and 3 equal to 3?
You're right that it increases by 2 with each layer. In general, it increases by L(K-1) over L layers, and in our case, since K=3, it increases by 2L. I marked the second layer as having a receptive field of 3 because one pixel in that layer corresponds to 3x3 pixels in the input layer.
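Here is a tiny sketch of that computation (my own illustration, assuming stride-1 convolutions as in the example):

def receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field (in input pixels) of one output pixel after stacking convs."""
    rf, jump = 1, 1  # jump = cumulative stride of the layers below
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

print([receptive_field(n) for n in range(4)])  # [1, 3, 5, 7]: grows by 2 per layer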
By the same logic as in this video, using 3x1 followed by 1x3 kernels would be better in terms of parameter count and FLOPs while having the same receptive field. Why isn't that common?
Good question. I think to answer it we also have to look at convolution kernels from a different perspective. Ultimately, after training, they have to detect some "edges" in the input feature volume, and with 3x1 followed by 1x3 kernels, each kernel can only detect horizontal or vertical edges and isn't flexible enough to learn more complex features. Having said that, maybe that's why in ConvNeXt (from the "A ConvNet for the 2020s" video) they use a 7x7 depthwise convolution instead, which gives a larger receptive field in a single layer and, being depthwise, learns different features for different channels. Replacing the depthwise convolution in ConvNeXt with three depthwise 3x3 convolutions (or 3x1 followed by 1x3) would be a good experiment to try!
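On the parameter-count side of the question, here is a rough comparison (just a sketch that counts spatial weights only, ignoring biases; C = 64 is an arbitrary channel count I picked for illustration):

C = 64  # hypothetical number of input and output channels

full_3x3      = 3 * 3 * C * C            # one standard 3x3 convolution
separable     = (3 * 1 + 1 * 3) * C * C  # 3x1 followed by 1x3, same 3x3 receptive field
depthwise_7x7 = 7 * 7 * C                # ConvNeXt-style 7x7 depthwise (one kernel per channel)

print(full_3x3, separable, depthwise_7x7)  # 36864 24576 3136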
good