See full course on Object Detection: ua-cam.com/play/PL1GQaVhO4f_jLxOokW7CS5kY_J1t1T17S.html and Subscribe to my channel
If you found this tutorial useful, please share with your friends (WhatsApp/iMessage/Messenger/WeChat/Line/KaTalk/Telegram) and on social media (LinkedIn/Quora/Reddit),
Tag @cogneethi on twitter.com
Let me know your feedback @ cogneethi.com/contact
To add to the above video:
If you have an image with 3 channels, 3x300x300, and your goal is to get an output with 6 channels from the convolution, then you need a kernel of shape 6x3xPxP.
For example: 3x300x300 * 6x3x5x5 = 6x296x296.
If you convolve NxN with PxP, the output is (N-P+1)x(N-P+1) for a valid-mode convolution.
Always remember that the number of channels in your kernel should be the same as the number of channels in the input image (in the above case it's 3), and the number of channels in the output of the convolution is the extra dimension that you add to the kernel (in the above case it's 6).
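Here is a minimal PyTorch sketch verifying those shapes (the tensor names are made up for illustration):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 300, 300)  # batch of 1, 3 channels, 300x300 image
w = torch.randn(6, 3, 5, 5)      # 6 filters, each with 3 channels of 5x5
                                 # (6 x 3 = 18 two-dimensional matrices in total)

y = F.conv2d(x, w)               # no padding, i.e. a valid-mode convolution
print(y.shape)                   # torch.Size([1, 6, 296, 296]): 300 - 5 + 1 = 296
```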
If we use ResNet or any other network, do we get the same size of feature map?
Awesome, thanks for such a comprehensive explanation.
Here, as you mentioned, the bounding box regression output for an image of scale 281x317 is 2x3x4xC. Can you please tell the exact spatial dimension? Is it (6x4xC)?
No. For every sliding window, the regressor head gives 4 outputs and the classifier head gives C class scores. So, the shapes when you implement this would be 2x3x4 and 2x3xC.
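To make that concrete, here is a hedged sketch (not the exact OverFeat code; the 4096-channel feature map and the 1x1-conv heads are assumptions for illustration):

```python
import torch
import torch.nn as nn

C = 20                                  # number of classes (assumed)
feat = torch.randn(1, 4096, 2, 3)       # hypothetical feature map with 2x3 spatial extent

reg_head = nn.Conv2d(4096, 4, kernel_size=1)  # 4 bbox offsets per window
cls_head = nn.Conv2d(4096, C, kernel_size=1)  # C class scores per window

print(reg_head(feat).shape)             # torch.Size([1, 4, 2, 3]) -> 2x3x4
print(cls_head(feat).shape)             # torch.Size([1, 20, 2, 3]) -> 2x3xC
```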
Hi Cogneethi! Here again at 8:02 you are saying 256x4096 filters. I think we are using only 4096 filters of 5x5 dimensions, right, making 256x4096x5x5 weight parameters?
Hey Bharat, you are right. I made that mistake throughout the course.
@@Cogneethi Can you explain it, please? I didn't get you.
Why are we convolving the 5x5 feature map with a 5x5 filter? We could convolve with a 3x3 filter.
It's a design choice. Their basic idea seems to be to get a single prediction for the smallest dimension in the image pyramid, and you then get a spatial output for the other dimensions.
And using 1x1 convolutions in later stages, you can also save on model parameters.
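A small sketch of why the 5x5 filter is convenient here (the 256 and 4096 channel counts follow the discussion above; the 6x7 map is an assumed larger scale): on the smallest scale the feature map is itself 5x5, so a valid 5x5 convolution collapses it to a single 1x1 prediction, while on a larger scale the same filter slides and yields a spatial output for free.

```python
import torch
import torch.nn.functional as F

w = torch.randn(4096, 256, 5, 5)   # 4096 filters of shape 256x5x5

small = torch.randn(1, 256, 5, 5)  # feature map at the smallest pyramid scale
large = torch.randn(1, 256, 6, 7)  # feature map at a larger scale (assumed size)

print(F.conv2d(small, w).shape)    # torch.Size([1, 4096, 1, 1]) -> single prediction
print(F.conv2d(large, w).shape)    # torch.Size([1, 4096, 2, 3]) -> spatial output
```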
I understand that each filter of the final conv corresponds to a class. And this makes sense for the smallest image in the image pyramid, because the output is 1x1xC. However, for the output maps that are NOT 1x1xC, what does the ground truth look like? For example, if my output is 2x3xC, what is the ground truth for each of those C spatial outputs in the classification case and in the regression case? Does one ground-truth 2x3 map contain the same number everywhere, i.e. the number corresponding to the class label, or is it something else? Help!
GT boxes/labels are not related to our CNN logic.
If there is only 1 object in an image, there will be only 1 GT with a bbox and a class label.
In the spatial output, we are getting more fine-grained bboxes, which will get eliminated during NMS.
How does softmax deal with a (2x3) output per class?
As I understand it, softmax takes one value for each class. Can you please help me understand how softmax deals with this?
This might help: ua-cam.com/video/B4svfUzNWWw/v-deo.html
@@useForwardMax Softmax takes a single (predicted) value for each class and outputs probabilities in [0, 1] that sum to 1. In the case mentioned above, you can take a max or an average over the spatial positions to get one value per class before passing it to softmax.
I haven't implemented the algorithm, but I believe that is the idea.
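Here is one way to make that concrete (my own sketch, not the paper's code): apply softmax along the class dimension independently at each spatial location, then reduce over locations if a single per-class score is needed.

```python
import torch

scores = torch.randn(1, 20, 2, 3)      # C=20 classes (assumed), 2x3 spatial output
probs = torch.softmax(scores, dim=1)   # softmax over classes at each location
print(probs.sum(dim=1))                # all ones: each location's probabilities sum to 1

per_class = probs.amax(dim=(2, 3))     # one value per class via a spatial max
print(per_class.shape)                 # torch.Size([1, 20])
```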
@@useForwardMax Hey, you can take all the detections to the next stage, that is, NMS (Non-Max Suppression), where only the most accurate prediction is kept and the rest are discarded.
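For reference, a minimal sketch of that NMS step using torchvision's built-in nms (the boxes and scores are made-up values for illustration):

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 50., 50.],
                      [12., 12., 52., 52.],      # heavily overlaps the first box
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.5)     # indices of the surviving boxes
print(keep)                                      # tensor([0, 2]): the overlapping box is dropped
```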
I have a question.
I'm trying to perform object detection of manhole covers in mobile mapping data and was planning to do it using a sliding window.
Now, thanks to your videos, I understand OverFeat and the principle behind it, but I am wondering: can this also be performed efficiently on very large images of resolution 4000x8000, or will the efficiency of this method decrease?
Honestly, I have not worked on images of that size, so I can't be sure, but theoretically it should work. So, this is one approach you can take.
I checked whether anyone has worked on images of 8K resolution and found one paper: arxiv.org/abs/1810.10551
In it, they use 2 other approaches:
a. Downscale the image and do the detection, or
b. Take overlapping crops of the image and do detection separately on each. (I've simplified their technique, see the paper for details; a rough sketch of approach b is below.)
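A rough sketch of approach (b), overlapping crops (the tile size, overlap, and the detect() helper are placeholders, not from the paper):

```python
def tiles(height, width, tile=1024, overlap=128):
    """Yield (top, left) corners of overlapping crops covering the image."""
    step = tile - overlap
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            # clamp so the last crops stay inside the image
            yield min(top, height - tile), min(left, width - tile)

# for top, left in tiles(4000, 8000):
#     crop = image[top:top + 1024, left:left + 1024]
#     detections += detect(crop, offset=(top, left))  # hypothetical detector
# ...then run NMS on the union of all detections.
```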
Btw, is the dataset you are working on publicly available? I am also curious to try it out. Anyway, can you please let me know what your final approach was and how it turned out?
You should be thinking about whether you really need 4000x8000 resolution to identify manholes in a frame. If you photograph from an aerial perspective, like Google Earth or a bird's-eye view, a higher-resolution image can hold more manholes, since each manhole will cover only a few pixels of that image. If your images are taken from human eye level, they will contain only one or two manholes, and these can be detected in a lower-resolution image, because a manhole needs only a certain number of pixels to be identifiable.
Hi, just want to add that, technically speaking, one can apply filters larger than the feature map by padding. Related to: ua-cam.com/video/JKTzkcaWfuk/v-deo.html
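A quick check of that point (the shapes are chosen arbitrarily): with enough zero padding, a 5x5 filter can be applied to a 3x3 feature map.

```python
import torch
import torch.nn.functional as F

fmap = torch.randn(1, 1, 3, 3)            # feature map smaller than the kernel
kernel = torch.randn(1, 1, 5, 5)

out = F.conv2d(fmap, kernel, padding=2)   # pad 3x3 to 7x7, then 7 - 5 + 1 = 3
print(out.shape)                          # torch.Size([1, 1, 3, 3])
```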
18 different filters?
Sorry, that was a mistake in the terminology I used throughout the course.
Here, I should have said 18 matrices. 3 matrices make a filter, and there are 6 filters. Each filter gives us one feature map as output.
Last time I requested your email; kindly send me your email.
cogneethi.com/contact