Great video!!!
thank you for making this video
I was praying to see this 2 years ago lol
I am glad you enjoyed it!
This is an actual explanation. Unlike most of the other channels that purport to "explain" these architectures.
thank you so much for this amazing video! looking forward to more of your content :D
Glad you enjoyed it! More to come!
Literally the only in-depth source other than the "Deformable Convolution Networks" paper. Helped me a lot with my bachelor's thesis!
Check out my next video: reading Deformable DETR source code ua-cam.com/video/3M9mS_3eiaw/v-deo.html
Great explanation!!
Could I request videos covering the object tracking problem, and more specifically models like MOTR?
Certainly! I was hoping to climb up to current state of the art in object detection, and then expand towards more advanced problems like object tracking
@@makgaiduk Great!! Looking forward to it
Was really helpful :) keep it up
Glad it helped!
@@makgaiduk by any chance, do you plan to go over SOTA segmentation models as well?
@@Taehyoung_Kim I have that in my plan. I plan to make videos about BERT, Co-DETR, Grounding DINO, and probably Mamba for Vision, and then start digging into segmentation models. It will also probably take some time to get up to speed on all the concepts before doing SOTA. I am making one video per week, so we are looking at something like 2-3 months at least
@@makgaiduk nice! I’ll stay tuned
In the deformable convolution, I still don't get how the "offset branch" calculates the offset map via a convolution kernel of the same size as the original one. How is its output rearranged to match the specific pixel offsets?
EDIT:
I think it is the following:
N refers to the number of kernel elements (e.g. 9 for a 3x3 kernel), and 2 to the x and y offsets. So channels 1 and 2 refer to the x and y offsets for the top-left position of the kernel.
Then, the spatial dimensions of the offset map correspond to the current position of the sliding kernel. Thus, the first 2 channels of the top-left value in the offset map determine the x and y offsets of the top-left kernel element when the kernel is at its first position during sliding
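To sanity-check the shape bookkeeping, here is a tiny NumPy sketch of that reading of the offset map. The channel ordering (y-offset then x-offset per kernel point) is an assumption for illustration; real implementations may interleave the 2N channels differently:

```python
import numpy as np

# Assumed setup: 3x3 kernel, so N = 9 sampling points and the offset
# branch outputs 2N = 18 channels over the same spatial grid as the
# main convolution's output.
H, W, N = 5, 5, 9
offset_map = np.zeros((2 * N, H, W))  # stand-in for the offset branch output

# Offsets for the top-left kernel element (point n = 0) when the kernel
# sits at its first sliding position, i.e. output pixel (0, 0):
n = 0
dy = offset_map[2 * n, 0, 0]      # channel 2n   -> y-offset of point n
dx = offset_map[2 * n + 1, 0, 0]  # channel 2n+1 -> x-offset of point n
```

So each spatial location of the offset map holds a full set of 2N per-point offsets for one sliding position of the kernel, which matches the interpretation above.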
Good question. I guess I should do a "deformable convolution" code read
@@makgaiduk would be cool, but maybe not that relevant anymore... I also think it's written in CUDA, as they did for deformable attention, because of the bilinear interpolation thingy
This was amazing, nice work - I really appreciate it. Please continue with the vids :)
Thank you for this insightful video! The explanations are clear and easy to follow. Love it!
Regarding the object detection task, especially for detecting stacked or cluttered items, would a DETR-based model be more suitable than YOLO?
By design and reported metrics, more advanced DETR-based models like DINO or Co-DETR should be better.
Depending on what sort of data you have, you might also take a look at multi-modal models like OpenAI's CLIP or Grounding DINO; they might get better accuracy without finetuning
@@makgaiduk Got it. Thank you for sharing❤
this is awesome !!
Great video!!
Glad it was useful! And thanks for commenting!