C 8.1 | Faster RCNN | Absolute vs Relative BBOX Regression | Anchor Boxes | CNN | Machine Learning
- Published 23 Dec 2024
- This video discusses the absolute and relative bounding box regression techniques.
Which of these would be suitable for our RPN design?
If the objects were not overlapping, either of these techniques could be used.
But if the objects are overlapping, neither of these techniques works as it is, since they will end up fitting the most dominant object in the foreground.
To solve this, they used the concept of Anchor Boxes. Instead of regressing from the Sliding Window as the reference, they use another box of a fixed size as the reference and regress from it.
And if you use 3 different reference boxes of 3 different aspect ratios, then each would individually fit the object that is nearest to its aspect ratio. And to take care of objects of different sizes, reference boxes of different scales are used.
In total we have 3 aspect ratios and 3 scales, making 9 different reference boxes.
These reference boxes are called Anchor Boxes.
Note that the midpoint of these anchor boxes should match the midpoint of the sliding window at all positions.
Each of these anchor boxes is used along with a different BBox Regressor.
One BBox Regressor would regress and fit square objects, one would fit wide objects, and one would fit tall objects. Along the same lines, you would have BBox Regressors to fit objects of bigger scales.
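As a concrete sketch, here is minimal Python (NumPy) code generating the 9 anchors described above, 3 scales x 3 aspect ratios, centered on a sliding-window position. The scales (box areas of 128^2, 256^2, 512^2 pixels) and ratios (1:2, 1:1, 2:1) follow the Faster RCNN paper; the function name and interface are just for illustration.

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate 9 anchor boxes (3 scales x 3 ratios) centered at (cx, cy),
    returned as (x1, y1, x2, y2)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area fixed at s*s while varying the aspect ratio r = h/w.
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

# All 9 anchors share the sliding window's midpoint, as noted above.
print(make_anchors(cx=320, cy=240).round(1))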
Note that this technique is different from a Feature Pyramid. With a Feature Pyramid (FP), you would have had 9 different sliding windows. But here, we have 1 sliding window and 9 Anchor Boxes. This has the additional advantage of doing backpropagation only once at each position. Had we used a Feature Pyramid instead, we would have had to do backpropagation 9 times.
But, though this technique looks promising, does it have any drawbacks?
If we are using 9 anchor boxes at every position of a 40x60 Feature Map, then we would end up with 40 x 60 x 9 = 21,600, which is around 20,000 proposals.
Now, the question is, how will I reduce the number of proposals?
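The series takes this up later; as a preview, one standard reduction step is non-maximum suppression (NMS), which the Faster RCNN paper applies to RPN proposals with an IoU threshold of 0.7 based on their objectness scores. Below is a minimal NumPy sketch of greedy NMS on (x1, y1, x2, y2) boxes; this is an illustration, not the full pipeline, which also keeps only the top-N proposals afterwards.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS: keep highest-scoring boxes, drop near-duplicates."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]           # indices sorted by score, descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou < iou_thresh]  # keep only sufficiently different boxes
    return keep

boxes = np.array([[0, 0, 100, 100], [5, 5, 105, 105], [200, 200, 300, 300]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the near-duplicate second box is suppressed
```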
------------------------
This is a part of the course 'Evolution of Object Detection Networks'.
See full playlist here: • Evolution Of Object De...
------------------------
Copyright Disclaimer: Under section 107 of the Copyright Act 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.
Very good explanations.
If the purpose of the RPN network is to propose 9 fixed anchor boxes for each position of the sliding window, what's the need for a pretrained AlexNet/VGGNet in it?
Hmm.. the RPN network does not output the anchor boxes but deltas. Probably, the bounding boxes described by these deltas are still crude, to be refined by the ROI pooling and the regressor at the bottom of the diagram.
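For reference, the deltas (tx, ty, tw, th) are defined relative to the anchor. A minimal sketch of decoding them back into a box, following the parameterization in the Faster RCNN paper (the helper name is my own):

```python
import numpy as np

def decode_deltas(anchor, deltas):
    """Apply RPN deltas (tx, ty, tw, th) to an anchor (x1, y1, x2, y2).

    Parameterization from the Faster RCNN paper:
        x = xa + wa*tx,  y = ya + ha*ty,  w = wa*exp(tw),  h = ha*exp(th)
    where (xa, ya, wa, ha) are the anchor's center and size.
    """
    x1, y1, x2, y2 = anchor
    wa, ha = x2 - x1, y2 - y1
    xa, ya = x1 + wa / 2, y1 + ha / 2
    tx, ty, tw, th = deltas
    x, y = xa + wa * tx, ya + ha * ty
    w, h = wa * np.exp(tw), ha * np.exp(th)
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

# Zero deltas reproduce the anchor itself.
print(decode_deltas((100, 100, 228, 228), (0.0, 0.0, 0.0, 0.0)))
```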
How is using 3 different sizes at 11:33 different from the Image pyramid technique?
I think I have explained this in one of the videos.
But the gist is that, with Image Pyramid, you will have to process 3 images through the entire pipeline (Feature Extractor and Downstream tasks).
But here, you need to process only 1 image through the Feature Extractor. But you will still need to pay a penalty for the downstream tasks. So you will save some compute/time.
Here Downstream tasks = Classification/Regression etc.
Please correct me if I am wrong.
Do we have a regressor (3 in this case, for 3 aspect ratios) for every sliding window position on the feature map? And finally, do we do NMS based on the objectness scores across all sliding window positions? I remember reading in the paper that if the feature map is WxH and we have k anchors, then there are WxHxk score values and WxHxkx4 coordinates for bounding boxes.
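Those shapes can be made concrete with a minimal PyTorch sketch of the RPN head. The paper's implementation actually outputs 2k scores via a two-class softmax, but it notes that k scores with logistic regression is an alternative, which is what this sketch assumes; channel sizes follow the paper, while the class and variable names are my own.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sliding 3x3 conv followed by two 1x1 sibling heads (cls and reg)."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, 3, padding=1)
        self.cls = nn.Conv2d(512, k, 1)      # 1 objectness score per anchor
        self.reg = nn.Conv2d(512, 4 * k, 1)  # 4 box deltas per anchor

    def forward(self, feat):
        x = torch.relu(self.conv(feat))
        return self.cls(x), self.reg(x)

feat = torch.randn(1, 512, 40, 60)    # backbone feature map (H=40, W=60)
scores, deltas = RPNHead()(feat)
print(scores.shape)   # torch.Size([1, 9, 40, 60])  -> W*H*k scores
print(deltas.shape)   # torch.Size([1, 36, 40, 60]) -> W*H*k*4 coordinates
```

NMS is then applied across all positions based on these objectness scores, as the comment describes.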
I'm having trouble understanding how applying the bounding boxes to the filters finds the locations of the objects in the original image. I understood how many bounding box regressors there are (number of aspect ratios x number of area sizes) and that they are applied to each individual filter of the backbone network in a sliding window fashion. But when you explain how the anchor boxes fit different objects, you draw the boxes over the objects in the original images and don't really show what that has to do with the filters. That's the part that got me confused. Very helpful series overall. Thanks for the effort.
@dipakkumarmohnani7063 my understanding is that, since we know by how much the backbone scales its input down, when an object is detected in the feature maps, we have to remember that the coordinates of the detected object are also scaled down by the same factor.
For example, we know that VGG-16 produces feature maps which are 32 times smaller than its input image. So, an 800x608 image will produce feature maps of size 25x19. When an object is detected by either the RPN or Fast RCNN in coordinates 4x3 of these feature maps, and let's say this object has width and height 2x2, we multiply these values by 32 and discover that the object is in position 128x96 of the original image (in pixels) with dimensions 64x64 pixels.
I don't really know if this is exactly correct, but it is my takeaway from the videos. Tell me if I'm wrong.
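That stride arithmetic can be written out directly (a hypothetical helper; the stride of 32 is VGG-16's total downsampling factor, as in the example above):

```python
# Map feature-map coordinates back to image pixels using the backbone's
# total stride (32 for VGG-16's final conv feature map).
STRIDE = 32

def feat_to_image(x_f, y_f, w_f, h_f, stride=STRIDE):
    return (x_f * stride, y_f * stride, w_f * stride, h_f * stride)

print(feat_to_image(4, 3, 2, 2))  # (128, 96, 64, 64), matching the numbers above
```

In practice, Faster RCNN defines the anchors in image coordinates from the start, so this mapping is built into the anchor generation rather than applied after detection.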
How is ground truth generated, and how does the prediction model predict the face based on ground truth? I mean, we are giving one set of image data for training and a different set of images to test. How is the prediction happening?
Ground truth is manually annotated.
Yes, different images to test. That is the point. We expect the neural networks to learn general characteristics, just like we humans do.
On point explanation. Great work!!
Hi, I am a bit confused. If every method will have the ground truth, then we would always use the relative method only, right? Because we will predict one bounding box and there will be a ground truth box, and we will adjust our prediction so that it matches the ground truth. Am I missing something? In the sliding window method, the window boundary is predicted and we also have the ground truth, right? In case of ROI, the boundary of the regions is predicted, in my assumption. Or is there a separate prediction apart from the region boundary?
I have a few doubts:
1. What if more than 3 objects overlap with almost comparable, but still distinguishable, aspect ratios? Is the Faster RCNN algorithm able to localize all those objects?
2. Why would they not use the same VGG net for both the RPN and the classification/bbox regression? (Intuitively, it sounds like both networks use similar filters.)
I think I have covered both these questions in the further videos. Maybe you could go through all of them and let me know if you still have any queries.
Do we have ground truth boxes for RCNN also, or does the bbox regressor just use the proposal from selective search and try to adjust it?
Hey Afsan, the GT boxes are needed irrespective of the architecture you are using. We need those to train the model.
Will you discuss YOLO or SSD?
Not finding time for that. But will definitely cover those.
See full course on Object Detection: ua-cam.com/play/PL1GQaVhO4f_jLxOokW7CS5kY_J1t1T17S.html
If you found this tutorial useful, please share with your friends(WhatsApp/iMessage/Messenger/WeChat/Line/KaTalk/Telegram) and on Social(LinkedIn/Quora/Reddit),
Tag @cogneethi on twitter.com
Let me know your feedback @ cogneethi.com/contact
Why does Faster RCNN have better accuracy than YOLO?
In general, 2 stage networks tend to have better accuracy.
SSD/Yolo does not have the RPN stage.
But latest advances might change that.
Would you know how we can add a new head to a Faster RCNN architecture to make Mask RCNN? I am looking for step-by-step code implementation guidelines/tutorials.
Sorry, as of now, I have not covered Mask RCNN; maybe I will include it at a later date.