I want to express my gratitude for making these lectures available for free :). I want to note that C4W3L05 is not there. Thank you again!
ua-cam.com/video/gKreZOUi-O0/v-deo.html
Pictures the volume as a rectangle for simplification. Proceeds to draw the volume by hand :)
It's about the images on the following lines.
damn, whoever came up with this idea deserves a cookie
Convolutional Implementation of Sliding Windows *CORRECTION*
At 7:14, Andrew should have said 2x2x400 instead of 2x2x40.
At 10:04 onward, the size of the second layer should be 24 x 24 instead of 16 x 16.
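For anyone checking the arithmetic, both corrections follow from the no-padding output-size formula out = (n - f)/s + 1. A small illustrative sketch using the lecture's layer sizes (the helper name is made up):

```python
# Output size of a "valid" (no-padding) conv or pool: out = (n - f) // s + 1
def out_size(n, f, s=1):
    return (n - f) // s + 1

# 16x16x3 test image (the 7:14 correction):
n = out_size(16, 5)      # 5x5 conv       -> 12
n = out_size(n, 2, s=2)  # 2x2 max pool   -> 6
n = out_size(n, 5)       # 5x5 "FC" conv  -> 2, i.e. 2x2x400 (not 2x2x40)

# 28x28x3 test image (the 10:04 correction):
m = out_size(28, 5)      # 5x5 conv       -> 24, i.e. 24x24 (not 16x16)
```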
10:13 A small note: the size of the second layer should be 24x24 after the 5x5 convolution, not 16x16.
It is really like magic. Andrew, I love you...
omg, I couldn't completely get this in class just now, but now I do! Thanks
8:19 wow thats amazing
One problem I see with this implementation: the specific window size the model was trained on may not suit the test objects. For example, if you trained on 14x14x3 windows, a car might cover the whole area of a 28x28x3 image, and the model may perform poorly there!
At the end of the video, the bounding box inaccuracy is mentioned. In addition, I'd like to remark that at this point the network can still only recognize fully visible, unobscured cars.
I am in love with this model
I follow the idea; however, I don't get how it can be implemented programmatically.
When you train your convolutional neural network, you define an input size. If a larger image is pushed through the network, I assume an error on the input dimensions will pop up. Can the dimensions be easily changed after training?
You need to preprocess your image (cropping/resizing) so it conforms to the image size used in the training process.
I have the same question. I really wish Andrew would talk more about the backpropagation.
Since the idea is to change the FC layers to convolutional layers, we can train and test the model without a fixed width and height; for instance, we can set the input dimensions to (None, None, 3) in Keras. Remember, a convolutional layer is different from an FC layer: it shares its weights across each feature map.
Yes, because you save the weights of the kernels, so when you're testing your network you never have to worry about the input dimensions as long as they are at least as large as the dimensions used in training.
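A minimal Keras sketch of the point made in the two replies above, assuming TensorFlow 2.x; the layer sizes mirror the lecture's toy network, and the model is untrained, purely illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Fully convolutional version of the lecture's toy network: no Flatten/Dense,
# so the spatial input size can stay unspecified as (None, None, 3).
inputs = tf.keras.Input(shape=(None, None, 3))
x = layers.Conv2D(16, 5, activation="relu")(inputs)     # 5x5 conv, 16 filters
x = layers.MaxPooling2D(2)(x)                           # 2x2 max pool
x = layers.Conv2D(400, 5, activation="relu")(x)         # "FC" as a 5x5 conv
x = layers.Conv2D(400, 1, activation="relu")(x)         # "FC" as a 1x1 conv
outputs = layers.Conv2D(4, 1, activation="softmax")(x)  # per-window class scores
model = tf.keras.Model(inputs, outputs)

# The same weights run on any input >= 14x14; the output grid grows with it.
print(model(tf.zeros((1, 14, 14, 3))).shape)  # (1, 1, 1, 4)
print(model(tf.zeros((1, 28, 28, 3))).shape)  # (1, 8, 8, 4)
```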
A question: is a sliding window the same as the feature map we get when applying a convolution filter? Thank you.
When 5x5x16 changes to 1x1x400, I think this step should be linear. So is there no ReLU function in this step?
(I mean only the 5x5x16 -> 1x1x400 part.)
It does have one. He skipped the flatten layer and continued with the FC layer, here implemented as a convolution.
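To make the reply above concrete, here is a sketch of the FC-to-conv equivalence, assuming TensorFlow/Keras (the weight reshape is the whole trick; the variable names are made up):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 5, 5, 16))

# FC view: flatten the 5x5x16 volume to 400 inputs, then Dense(400) + ReLU.
dense = layers.Dense(400, activation="relu")
y_fc = dense(layers.Flatten()(x))          # shape (1, 400)

# Conv view: one 5x5x16 filter per output unit, with the same ReLU after it.
conv = layers.Conv2D(400, 5, activation="relu")
conv.build(x.shape)
# Reuse the Dense weights, reshaped to (5, 5, 16, 400), to show the equivalence.
conv.set_weights([dense.kernel.numpy().reshape(5, 5, 16, 400),
                  dense.bias.numpy()])
y_conv = conv(x)                           # shape (1, 1, 1, 400)

print(np.allclose(y_fc.numpy(), y_conv.numpy().reshape(1, 400), atol=1e-5))  # True
```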
Why, in the sliding-window approach, is matching the exact position of an object a problem? If the stride is 1, then we cover every pixel of the image (say, with a 14x14 box centered at each pixel), so we cover all possible locations in the image and will therefore match the exact position of an object (its center). The problem only arises when we use a bigger stride.
A smaller stride = more computation. Also, objects may show up with different aspect ratios, which would require many sliding windows of different sizes to detect them all, so you can imagine how badly this scales when combined with a small stride.
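To put rough numbers on the reply above, using the lecture's 28x28 test image and 14x14 window (the helper function is just illustrative):

```python
# Number of f x f window positions in an n x n image at stride s.
def n_windows(n, f, s):
    return ((n - f) // s + 1) ** 2

# 14x14 window over a 28x28 test image:
print(n_windows(28, 14, 1))  # 225 crops -> 225 separate forward passes, done naively
print(n_windows(28, 14, 2))  # 64 crops  -> the conv implementation gets all 64 in one pass
```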
Wow!! Absolutely wow!
I am using CNNs with a lot of layers. They use padding so that the input size doesn't shrink, which makes this approach less straightforward. Any idea how to deal with that? Another case is ResNet-like blocks, which merge different convolutions along different paths. Without padding this is difficult; any ideas?
This video comes after the next video in the list. (26 -> 25 -> 27 ... is the correct sequence of videos specified in the course.)
Slides are created/deleted/rearranged each session, but the material is more or less the same. What's really missing are the problem sets. They are quite difficult if you're a newbie, but with a lot of 'Net searching, they are solvable. If you just audit the course, you can't download the datasets, but you can search for equivalent datasets and use those.
This is golden.
I am pretty sure that the next video is not uploaded correctly. One video is missing, and because of that the anchor box lecture does not make sense.
Can someone explain this video? I'm almost done with all the previous videos, but the more I watch this one, the more I feel like I'm missing something, and I still don't know why.
What if the dimensions of the test image are smaller than those of the training images? Do we use padding?
Using the parameter settings in this example, the last 3 conv layers need more parameters than the last 3 FC layers... am I wrong, or is that actually the case?
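For what it's worth, a quick count with the lecture's sizes suggests the two are exactly equal, since the conv layers reuse the FC weights. A small sketch:

```python
# FC view: 5x5x16 = 400 inputs -> 400 units, then 400 -> 400, then 400 -> 4.
fc_params   = ((5*5*16)*400 + 400, 400*400 + 400, 400*4 + 4)

# Conv view: 5x5 conv (16 -> 400), 1x1 conv (400 -> 400), 1x1 conv (400 -> 4).
conv_params = (5*5*16*400 + 400, 1*1*400*400 + 400, 1*1*400*4 + 4)

print(fc_params, conv_params, fc_params == conv_params)  # ... True
```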
How do you build training data for this?!
I didn't understand how the number of iterations for a given stride becomes smaller when done convolutionally?
Hello sir, I think you have not provided the C4W3L04 video.
For the previous video link: ua-cam.com/video/5e5pjeojznk/v-deo.html or search with the title "C4W3L03 Object Detection".
How did we drop from 28x28 to 16x16?
It seems like a typo; it should be 24x24.
Can someone please explain the last 6 minutes of the video? I can't follow any of it.
Don't worry if you can't follow it
thank you!
How do you set the size of the sliding window in a CNN?
Superb explanation.
Extraordinary video.
This left me more confused
Anyone got some sort of written reference (books/papers) for this?
The OverFeat paper from arXiv.
So then FCNs are CNNs?
I was reading about that yesterday. Actually, FCN stands for Fully Convolutional Network, where you have ONLY convolution operators. CNN stands for Convolutional Neural Network, which contains not only convolution operators but also fully connected layer(s). This is what I have understood. Hope this is clear.