Thanks a lot for this fantastic explanation
At 00:34, for pixel norm, could you please share the link? It is not there in the description.
this is very good visualization of Faster-RCNN !!!
Thanks for such a good explanation
I have a question related to the Faster R-CNN network, which I am struggling with.
My question is: during prediction, what if my image size is 3000x3000 instead of 600x1000 and I feed it into the network as input? What will happen? Does the Faster R-CNN network resize it to 600x1000 by itself?
This confuses me.
I'm working on TensorFlow 1.15.
Sorry if I made a mistake.
Thanks in advance.
The video series was great! And thanks for explaining it very clearly and neatly.
However, I have a doubt: at 4:34 the black box (the sliding window at the center) is generated by a 3x3 conv on the 38x50 feature map. Since the receptive field at the feature map from VGG-16 is 16, when a 3x3 conv is applied on it, the receptive field for the conv becomes 16x3, right? Please correct me if I am wrong.
From a comment below, I understood that we use the 196x196 output from VGG, so that could give us 228, but here we are using 38x50, right?
@abhishekjatram zike.io/posts/calculate-receptive-field-for-vgg-16/
If we use a 224x224 image as input, then the feature map dim at the last conv layer would be 14x14.
If we use a 600x800 image, then the FM dim will be 38x50.
Irrespective of the FM dim, the receptive field at any given layer won't change. Here it is 196x196 at the last conv layer.
That is, if you consider any pixel in the FM of the last conv layer of this network, we are effectively looking at a 196x196 patch of the image.
Since the effective stride is 16, a 3x3 patch of the FM will be covering a 228x228 patch of the image, as you have pointed out.
The receptive field only changes if we change the network configuration (stride, kernel size, number of layers, etc).
Please let me know if I need to elaborate further.
@Cogneethi Ok, got it. The receptive field of VGG-16 at the FM is 196x196, and when we move by 1 pixel in the FM => we move by 16 pixels in the image (width(image) / width(FM) = 16). So the receptive field of a 3x3 conv on the FM would be 196 (0,0) + 16 (0,1) + 16 (0,2) = 228x228.
Thank you once again :)
@abhishekjatram Yes, your calculation is right.
You are Welcome.
@Cogneethi I might be missing something very fundamental. How can an image of 600x800 be fed into VGG16, when only an image size of 224x224 is allowed?
Finally a source that explains it so well... Thanks a lot
welcome!
How did you get the value of 228x228 at 4:46?
For this, we need to understand receptive field. But the explanation is long.
To get the intuition about receptive field, see this: ua-cam.com/video/QyM8c8XK01g/v-deo.html
The detailed calculation of Receptive Fields for VGG16 can be seen here: zike.io/posts/calculate-receptive-field-for-vgg-16/
Here, the receptive field at the last conv layer is 196x196 (we are not using the FC layers of this network).
The effective stride of the VGG16 network is 16.
Since we are using a 3x3 conv, the receptive field of this filter is 196 (receptive field of the last conv layer) + 16 (stride of the 2nd pixel) + 16 (stride of the 3rd pixel) = 228.
So, in total, we have a 228x228 receptive field for a 3x3 conv filter.
For more details see:
Paper and Github: arxiv.org/pdf/1603.07285.pdf & github.com/vdumoulin/conv_arithmetic
Please let me know if I need to elaborate further.
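In case it helps, here is a minimal Python sketch of this calculation. It is not from the video or the repo, just the standard receptive-field recursion r_out = r_in + (k - 1) * j_in, j_out = j_in * s, applied to the usual VGG16 conv/pool configuration plus the RPN's 3x3 conv:
---------------------------------------------------------------------------------
# Receptive-field recursion: r_out = r_in + (k - 1) * j_in ; j_out = j_in * s
# (kernel, stride) pairs for VGG16 up to conv5_3, then the RPN's 3x3 conv.
layers = (
    [(3, 1), (3, 1), (2, 2)] +            # conv1_1, conv1_2, pool1
    [(3, 1), (3, 1), (2, 2)] +            # conv2_1, conv2_2, pool2
    [(3, 1), (3, 1), (3, 1), (2, 2)] +    # conv3_1..3, pool3
    [(3, 1), (3, 1), (3, 1), (2, 2)] +    # conv4_1..3, pool4
    [(3, 1), (3, 1), (3, 1)] +            # conv5_1..3 -> RF 196, stride 16
    [(3, 1)]                              # RPN 3x3 conv -> RF 228
)

def receptive_field(layers):
    r, j = 1, 1  # receptive field and effective stride (jump) at the input
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r, j

print(receptive_field(layers))  # -> (228, 16)
---------------------------------------------------------------------------------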
Amazing Lecture!!
Can you please let me know: since VGG16 reduces the feature map to 38x50, how come the sliding window box would be 228x228? Is there a video where you have explained that?
Thanks for the video. Can you just give a slight intuition as to why we slide the anchor boxes at a stride of 16 pixels?
ua-cam.com/video/50-PhoCJEOk/v-deo.html
This is really fantastic, thanks for saving so much time!
Welcome Praveen
this is the real stuff man
Thank you !!!
Hi, where are you explaining the loss function with multiple objects in the image?
Hey, the loss function is the same, you just add up the Classification and BBox losses for each ROI proposal. It doesn't matter if there are 2 objects or 3 in your image.
See: github.com/endernewton/tf-faster-rcnn/blob/0e61bacf863f9f466b54770f35a130514e85cac6/lib/nets/network.py
def _smooth_l1_loss() & def _add_losses()
Let me know if it is not clear.
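For illustration only, here is a tiny plain-NumPy sketch (not the repo's actual TensorFlow code, and the per-ROI numbers are made up) of how the per-ROI classification and smooth-L1 bbox losses simply get added up, no matter how many objects/ROIs there are:
---------------------------------------------------------------------------------
import numpy as np

def smooth_l1(x, sigma=1.0):
    # Smooth-L1 on the bbox regression residuals (same shape of formula as the
    # repo's _smooth_l1_loss, without the in/out weighting terms).
    ax = np.abs(x)
    return np.where(ax < 1.0 / sigma**2,
                    0.5 * (sigma * ax)**2,
                    ax - 0.5 / sigma**2)

# Hypothetical per-ROI values for an image with several objects:
cls_losses  = np.array([0.20, 1.10, 0.40])      # cross-entropy, one per ROI
bbox_pred   = np.array([[0.1, 0.2, 0.0, 0.3],
                        [0.5, 0.1, 0.2, 0.0],
                        [0.0, 0.0, 0.1, 0.1]])
bbox_target = np.zeros_like(bbox_pred)
bbox_losses = smooth_l1(bbox_pred - bbox_target).sum(axis=1)

# The total loss is just the sum of the (averaged) classification and bbox terms.
total_loss = cls_losses.mean() + bbox_losses.mean()
print(total_loss)
---------------------------------------------------------------------------------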
Thank you!
How do you find whether there is an object or not? Out of 17000 proposals, how are 6000 selected, given there is no Selective Search and only convolution is being done?
Since we will be using Softmax for Foreground/Background classification in the RPN, we will be getting a score. Using this score, we can sort the Region Proposals. From this sorted list, we pick the top 6000.
Is it clear now?
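For illustration only (hypothetical numbers, not the repo's code), picking the top 6000 proposals by their foreground softmax score is just a sort:
---------------------------------------------------------------------------------
import numpy as np

# Hypothetical RPN output: one foreground score and one box per anchor.
num_anchors = 17000
scores = np.random.rand(num_anchors)         # softmax foreground probability
proposals = np.random.rand(num_anchors, 4)   # (x1, y1, x2, y2)

pre_nms_top_n = 6000
order = scores.argsort()[::-1][:pre_nms_top_n]   # indices of the 6000 highest scores
top_proposals = proposals[order]
print(top_proposals.shape)   # (6000, 4)
---------------------------------------------------------------------------------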
Hi, thanks for the beautiful explanation. I have a small doubt.
Are you visualizing the filter output of the last conv layer? Are you using a Class Activation Map? How do you visualize the heatmaps of different filters overlaid on the original image? When I try to visualize, it shows the heatmap of a particular layer (i.e., Block5_Conv1), not an individual filter's output. Kindly enlighten me. Thank you in advance.
I took the code from github.com/endernewton/tf-faster-rcnn/
way back in 2018. The repo has been updated since, so I am not sure if my code mods will still work.
But this is what I did.
In file test.py
---------------------------------------------------------------------------------
# Needed at the top of test.py:
#   import matplotlib.pyplot as plt
#   from pathlib import Path

In function: def im_detect(sess, net, im, im_name):
    ...
    # I added these 2 lines - start
    feat_map = net.extract_head(sess, blobs['data'])
    plot_feat_map(im, feat_map, False, im_name)
    # I added these 2 lines - end

# This is the new fn to dump the feature maps overlaid on the image.
def plot_feat_map(im, feat_map, savefig, im_name):
    print("im shape: {}".format(im.shape))
    print("fm shape: {}".format(feat_map[0].shape))
    fig, ax = plt.subplots(figsize=(12, 12))
    # Sample every 50th of the 512 filters of the last conv layer.
    for i in range(0, 512, 50):
        print(i)
        plt.title('Filter ' + str(i))
        # Stretch the feature map to the image extent, then overlay the image on top.
        ax.imshow(feat_map[0, :, :, i], aspect='auto', cmap="gray",
                  extent=(0, im.shape[1], 0, im.shape[0]))
        ax.imshow(im, alpha=0.3, aspect='equal')
        plt.tight_layout()
        if savefig:
            pth = Path("out/featmap")
            pth.mkdir(parents=True, exist_ok=True)
            plt.draw()
            plt.savefig(pth / (im_name + '_' + str(i) + ".jpg"))
---------------------------------------------------------------------------------
As I might have said in the video, I am not sure if my approach to visualization is correct. This is just my hack.
Nowadays, there are other visualization libs/tools which might be better and more accurate.
I request you to please check them too. Also, let me know what you find.
Hope this helps.
Where is the link that explains pixel normalization further?
I seem to have missed the original link, but here are similar references:
In the repo, you can see the pixel norm related code here:
github.com/endernewton/tf-faster-rcnn/blob/0e61bacf863f9f466b54770f35a130514e85cac6/lib/model/config.py
__C.PIXEL_MEANS = np.array([[[102.9801, 115.9465, 122.7717]]])
&
github.com/endernewton/tf-faster-rcnn/blob/0e61bacf863f9f466b54770f35a130514e85cac6/lib/utils/blob.py
def prep_im_for_blob(im, pixel_means, target_size, max_size):
Some links explaining it:
forums.fast.ai/t/images-normalization/4058/2
Here, in Faster RCNN they are basically just doing mean subtraction, (x - mean), with the fixed per-channel means above (no division by the standard deviation).
stats.stackexchange.com/a/220970
arthurdouillard.com/post/normalization/
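For illustration, a minimal sketch of what this mean subtraction amounts to (the PIXEL_MEANS values are the fixed BGR means from the repo's config; the helper function name is my own, not from the repo):
---------------------------------------------------------------------------------
import numpy as np

# Fixed BGR means from the repo's config (__C.PIXEL_MEANS):
PIXEL_MEANS = np.array([[[102.9801, 115.9465, 122.7717]]])

def subtract_pixel_means(im):
    # im is an HxWx3 BGR image; only the fixed mean is subtracted, no std scaling.
    return im.astype(np.float32) - PIXEL_MEANS

im = np.random.randint(0, 256, size=(600, 800, 3))
blob = subtract_pixel_means(im)
print(blob.shape, blob.dtype)   # (600, 800, 3) float32
---------------------------------------------------------------------------------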
Nice demo
Excuse me sir, may I have the code for these explanations? Thanks a lot in advance.
github.com/endernewton/tf-faster-rcnn
The code has been updated since I made these videos, so there might be some differences. Please check accordingly.