DETR: End-to-End Object Detection with Transformers (Paper Explained)

  • Published 14 May 2024
  • Object detection in images is a notoriously hard task! Objects can be of a wide variety of classes, can be numerous or absent, they can occlude each other or be out of frame. All of this makes it even more surprising that the architecture in this paper is so simple. Thanks to a clever loss function, a single Transformer stacked on a CNN is enough to handle the entire task!
    OUTLINE:
    0:00 - Intro & High-Level Overview
    0:50 - Problem Formulation
    2:30 - Architecture Overview
    6:20 - Bipartite Match Loss Function
    15:55 - Architecture in Detail
    25:00 - Object Queries
    31:00 - Transformer Properties
    35:40 - Results
    ERRATA:
    When I introduce bounding boxes, I say they consist of x and y, but you also need the width and height.
    My Video on Transformers: • Attention Is All You Need
    Paper: arxiv.org/abs/2005.12872
    Blog: / end-to-end-object-dete...
    Code: github.com/facebookresearch/detr
    Abstract:
    We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines. Training code and pretrained models are available at this https URL.
    Authors: Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko
    Links:
    YouTube: / yannickilcher
    Twitter: / ykilcher
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
  • Science & Technology
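
A minimal sketch (not the official facebookresearch/detr code) of the pipeline the abstract describes: a CNN backbone, a transformer encoder-decoder fed one token per feature-map pixel plus positional encodings, a fixed set of learned object queries, and small heads that emit a class and a box per query. Module names, default dimensions, and the learned 2D positional encoding below are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # CNN feature extractor
        self.proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)          # project to transformer width
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))  # learned object queries
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))        # learned 2D positional encoding
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for the "no object" class
        self.bbox_head = nn.Linear(hidden_dim, 4)                 # (cx, cy, w, h), normalized

    def forward(self, images):                                   # images: (B, 3, H, W)
        feat = self.proj(self.backbone(images))                  # (B, hidden_dim, h, w)
        B, C, h, w = feat.shape
        pos = torch.cat([self.col_embed[:w].unsqueeze(0).repeat(h, 1, 1),
                         self.row_embed[:h].unsqueeze(1).repeat(1, w, 1)],
                        dim=-1).flatten(0, 1).unsqueeze(1)        # (h*w, 1, hidden_dim)
        src = pos + feat.flatten(2).permute(2, 0, 1)              # one token per feature-map pixel
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)       # (num_queries, B, hidden_dim)
        hs = self.transformer(src, tgt)                           # (num_queries, B, hidden_dim)
        return self.class_head(hs), self.bbox_head(hs).sigmoid()  # per-query class logits and boxes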

COMMENTS • 175

  • @slackstation
    @slackstation 4 years ago +111

    This is a gift. The clarity of the explanation, the speed at which it comes out. Thank you for all of your work.

  • @ankitbhardwaj1956
    @ankitbhardwaj1956 3 years ago +1

    I had seen your Attention is all you need video and now watching this, I am astounded by the clarity you give in your videos. Subscribed!

  • @aashishghosh8246
    @aashishghosh8246 4 years ago +1

    Yup. Subscribed with notifications. I love that you enjoy the content of the papers. It really shows! Thank you for these videos.

  • @rishabpal2726
    @rishabpal2726 3 years ago +2

    Really appreciate the effort you are putting into this. Your paper explanations make my day every day!

  • @Phobos11
    @Phobos11 4 years ago +13

    The attention visualizations are practically instance segmentations; very impressive results, and great job untangling it all.

  • @sahandsesoot
    @sahandsesoot 4 years ago +4

    Greatest find on YouTube for me to date!! Thank you for the great videos!

  • @chaouidhuzgen6818
    @chaouidhuzgen6818 2 years ago

    Wow, the way you've explained and broken down this paper is spectacular.
    Thanks, mate!

  • @adisingh4422
    @adisingh4422 2 years ago +2

    Awesome video. Highly recommend reading the paper first and then watching this to solidify your understanding. This definitely helped me understand the DETR model better.

  • @user-ze2lj2nr1p
    @user-ze2lj2nr1p 3 years ago +3

    Thank you for your wonderful video. When I first read this paper, I couldn't understand what the input of the decoder (the object queries) is, but after watching your video I finally got it: random vectors!

  • @michaelcarlon1831
    @michaelcarlon1831 4 years ago

    A great paper and a great review of the paper! As always nice work!

  • @opiido
    @opiido 4 years ago

    Great!!! Absolutely great! Fast, to the point, and extremely clear. Thanks!!

  • @edwarddixon
    @edwarddixon 4 years ago +3

    "Maximal benefit of the doubt" - love it!

  • @renehaas7866
    @renehaas7866 3 years ago

    Thank you for this content! I have recommended this channel to my colleagues.

  • @AishaUroojKhan
    @AishaUroojKhan 2 years ago

    Thanks so much for making it so easy to understand these papers.

  • @Konstantin-qk6hv
    @Konstantin-qk6hv 2 years ago +2

    Very informative. Thanks for explanation!

  • @Gotrek103
    @Gotrek103 3 years ago +1

    Very well done and understandable. Thank you!

  • @hackercop
    @hackercop 2 years ago

    This video was absolutely amazing. You explained this concept really well, and I loved the bit at 33:00 about flattening the image twice and using the rows and columns to create an attention matrix where every pixel can relate to every other pixel. Also loved the bit at the beginning where you explained the loss in detail; a lot of other videos just gloss over that part. Have liked and subscribed.

  • @ramandutt3646
    @ramandutt3646 4 years ago +13

    Was waiting for this. Thanks a lot! Also dude, how many papers do you read every day?!!!

  • @pranabsarkar
    @pranabsarkar 4 years ago +1

    Fantastic explanation 👌 looking forward to more videos ❤️

  • @biswadeepchakraborty685
    @biswadeepchakraborty685 3 years ago +2

    You are a godsend! Please keep up the good work!

  • @kodjigarpp
    @kodjigarpp 10 months ago

    Thanks for the walkthrough!

  • @tsunamidestructor
    @tsunamidestructor 4 years ago +3

    YES! I was waiting for this!

  • @sawanaich4765
    @sawanaich4765 3 years ago +1

    You saved my project. Thank you 🙏🏻

  • @tae898
    @tae898 3 years ago

    What an amazing paper and an explanation!

  • @pravindesai6687
    @pravindesai6687 4 months ago

    Amazing explanation. Keep up the great work.

  • @zeynolabedinsoleymani4591
    @zeynolabedinsoleymani4591 5 months ago

    I like the way you DECIPHER things! thanks!

  • @AlexOmbla
    @AlexOmbla 1 year ago

    Very very nice explanation, I really subscribed for that quadratic attention explanation. Thanks! :D

  • @sungso7689
    @sungso7689 3 years ago +1

    Thanks for great explanation!

  • @mahimanzum
    @mahimanzum 4 years ago +1

    You explained it so well. Thanks, best of luck!

  • @TheAhmadob
    @TheAhmadob 2 years ago +1

    Really smart idea about how the (HxW)^2 matrix naturally embeds bounding box information. I am impressed :)

  • @cuiqingli2077
    @cuiqingli2077 3 years ago +1

    Thank you very much for your explanation!

  • @AonoGK
    @AonoGK 3 years ago +9

    Infinite respect for the Ali G reference.

  • @dheerajkhanna7697
    @dheerajkhanna7697 9 months ago

    Thank you sooo much for this explanation!!

  • @RyanMartinRAM
    @RyanMartinRAM 5 months ago

    Holy shit. Instant subscribe within 3 minutes. Bravo!!

  • @uditagarwal6435
    @uditagarwal6435 1 year ago

    very clear explanation, great work sir. thanks

  • @Charles-my2pb
    @Charles-my2pb 1 year ago

    Thank you so much for the video! It's amazing and helped me understand this paper much better ^^

  • @oldcoolbroqiuqiu6593
    @oldcoolbroqiuqiu6593 2 years ago

    34:08 GOAT explanation of the bbox in the attention feature map.

  • @pokinchaitanasakul-boss3370
    @pokinchaitanasakul-boss3370 2 years ago +1

    Thank you very much. This is a very good video. Very easy to understand.

  • @user-nh3er6vh1r
    @user-nh3er6vh1r 1 year ago

    Excellent work, thanks!

  • @drhilm
    @drhilm 4 years ago +2

    Thank you very much, this was really good.

  • @user-gy9ef7mr7g
    @user-gy9ef7mr7g 1 year ago

    Great explanation

  • @Muhammadw92
    @Muhammadw92 1 month ago

    Thanks for the explanation

  • @apkingboy
    @apkingboy 3 years ago +1

    Love this content bro, thank you so much. Hoping to get an MSc in Artificial Intelligence.

  • @user-sv5uc7vc9j
    @user-sv5uc7vc9j 3 years ago

    Thank you for providing such an interesting paper reading, Yannic Kilcher!

  • @tianhao7783
    @tianhao7783 4 years ago +1

    Really quite quick. Thanks, make more...

  • @tarmiziizzuddin337
    @tarmiziizzuddin337 3 years ago +9

    "First paper ever to have ever cite a youtube channel." ...challenge accepted.

  • @arturiooo
    @arturiooo 2 years ago

    I love how it understands which part of the image belongs to which object (elephant example) regardless of overlapping. Kind of understands the depth. Maybe transformers can be used for depth-mapping?

  • @anheuser-busch
    @anheuser-busch 4 years ago +2

    Awesome!!! Yannic, by any chance, would you mind reviewing the paper (1) Fawkes: Protecting Personal Privacy against Unauthorized Deep Learning Models or (2) Analyzing and Improving the Image Quality of StyleGAN? I would find it helpful to have those papers deconstructed a bit!

  • @kylepena8908
    @kylepena8908 2 years ago

    This is a really great idea

  • @krocodilnaohote1412
    @krocodilnaohote1412 2 years ago

    Very cool video, thank you!

  • @yashmandilwar8904
    @yashmandilwar8904 4 years ago +121

    Are you even human? You're really quick.

    • @m.s.d2656
      @m.s.d2656 4 years ago

      Nope .. A Bot

    • @meerkatj9363
      @meerkatj9363 4 years ago +1

      @@m.s.d2656 I don't actually know which is the most impressive

    • @krishnendusengupta6158
      @krishnendusengupta6158 3 years ago +2

      There's a bird!!! There's a bird...

    • @sadraxis
      @sadraxis 3 years ago +1

      ​@@krishnendusengupta6158 bird, bird, bird, bird, bird, bird, bird, bird, its a BIRD

  • @frederickwilliam6497
    @frederickwilliam6497 4 years ago +1

    Great content!

  • @jjmachan
    @jjmachan 4 years ago +2

    Awesome 🔥🔥🔥

  • @thivyesh
    @thivyesh 2 years ago +1

    Great video! What about a video on this paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows? They split the image into patches, use self-attention locally within every window of patches, and then shift the windows. Would be great to hear your explanation of this!

  • @quantum01010101
    @quantum01010101 3 years ago +1

    Excellent

  • @florianhonicke5448
    @florianhonicke5448 4 years ago +2

    So cool! You are great!

  • @KarolMajek
    @KarolMajek 4 years ago +2

    Thanks for this vid, really fast. I still (after 2 days) haven't tried to run it on my data - feeling bad.

  • @diplodopote
    @diplodopote 2 years ago

    Thanks a lot for this, really helpful.

  • @wjmuse
    @wjmuse 3 years ago +1

    Great sharing! I'd like to ask: is there any clue for deciding how many object queries we should use for a particular object detection problem? Thanks!

  • @CristianGarcia
    @CristianGarcia 4 years ago +52

    Loved the video! I was just reading the paper.
    Just wanted to point out that Transformers, or rather multi-head attention, naturally process sets, not sequences; this is why you have to include the positional embeddings.
    Do a video about the Set Transformer! In that paper they call the technique used by the decoder in this paper "Pooling by Multihead Attention".

    • @YannicKilcher
      @YannicKilcher  4 years ago +7

      Very true, I was just still in the mode where transformers are applied to text ;)

    • @princecanuma
      @princecanuma 3 years ago

      What are positional encodings?

    • @snippletrap
      @snippletrap 3 years ago +1

      @@princecanuma The positional encoding is essentially a vector encoding of each token's index in the sequence, added to the token embeddings so the model knows where each token sits.

    • @coldblaze100
      @coldblaze100 3 years ago +3

      @@snippletrap I had a feeling it was gonna be something that simple. 🤦🏾‍♂️ AI researchers' naming conventions aren't helping the community, in terms of accessibility lmao
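
For readers of the positional-encoding sub-thread above: the encoding is not literally the raw index, but a vector derived from it, either fixed sinusoids or learned parameters. A minimal sketch of the standard 1D sinusoidal version from Attention Is All You Need follows; DETR itself uses a 2D variant over the feature-map grid, so this is illustrative only.

import torch

def sinusoidal_positions(seq_len, d_model):
    # one d_model-dimensional vector per position, built from sin/cos at different frequencies
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even feature indices
    angles = pos / (10000 ** (i / d_model))                         # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe                                                       # added to the token embeddings

pe = sinusoidal_positions(seq_len=850, d_model=256)   # e.g. one vector per flattened feature-map pixel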

    • @chuwang2125
      @chuwang2125 3 years ago

      Thank you for the one-line summary of "Pooling by Multihead Attention". This makes it 10x clearer about what exactly the decoder is doing. I was feeling that the "decoder + object seeds" is doing similar things to ROI pooling, which is gathering relevant information for a possible object. I also recommend reading the set transformer paper, which enhanced my limited knowledge of attention models. Thanks again for your comment!

  • @quebono100
    @quebono100 4 years ago +2

    I love your channel thank you soooo much

  • @TheGatoskilo
    @TheGatoskilo 1 year ago

    Thanks Yannic! Great explanation. Since the object queries are learned and I assume they remain fixed after training, why do we keep the lower self-attention part of the decoder block during inference, and not just replace it with the precomputed Q values?

  • @hackercop
    @hackercop 2 years ago

    2:47 worth pointing out that the CNN reduces the size of the image while retaining high level features and so massively speeds up computation

  • @0lec817
    @0lec817 4 years ago +5

    Hi Yannic, amazing video and great improvements in the presentation (time sections on YouTube etc.). I really like where this channel is going, keep it up.
    I read through the paper myself yesterday, as I've been working with this kind of attention for CNNs a bit, and I really liked the way you described the mechanism behind the different attention heads in such a simple and easily understandable way!
    Your idea of directly inferring bboxes from two attending points in the "attention matrix" sounds neat and hadn't crossed my mind yet. But I guess you would then probably have to use some kind of NMS again?
    One engineering problem that I came across, especially with those full (HxW)^2 attention matrices, is that they blow up your GPU memory insanely. Thus one can only use a fraction of the batch size, and a (HxW)^2 multiplication also takes forever, which is why the model takes much longer to train (and infer, I think).
    What impressed me most was that a very "unsophisticated learned upscaling and argmax over all attention maps" achieved such great results for panoptic segmentation!
    One thing that I did not quite get: can the multiple attention heads actually "communicate" with each other during the "look-up"? Going by the description in Attention Is All You Need - "we then perform the attention function in parallel, yielding d_v-dimensional output values" - and the formula Concat(head_1, ..., head_h)W^O, it looks to me like the attention heads do not share information while attending to things. Only W^O might be able, during backprop, to reweight the attention heads if they have overlapping attention regions?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      Yes I see it the same way, the individual heads do independent operations in each layer. I guess the integration of information between them would then happen in higher layers, where their signal could be aggregated in a single head there.
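
To make this concrete, here is a bare-bones sketch of multi-head attention (illustrative only, no masking or dropout): each head forms its own attention pattern independently, and information is only mixed by the output projection W_O and by subsequent layers, which matches the reading in this thread.

import torch
import torch.nn.functional as F

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    # x: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)
    seq, d_model = x.shape
    d_head = d_model // num_heads
    q = (x @ Wq).view(seq, num_heads, d_head).transpose(0, 1)   # (heads, seq, d_head)
    k = (x @ Wk).view(seq, num_heads, d_head).transpose(0, 1)
    v = (x @ Wv).view(seq, num_heads, d_head).transpose(0, 1)
    attn = F.softmax(q @ k.transpose(-2, -1) / d_head ** 0.5, dim=-1)  # per-head (seq, seq) attention
    heads = attn @ v                                             # each head is computed independently
    concat = heads.transpose(0, 1).reshape(seq, d_model)         # Concat(head_1, ..., head_h)
    return concat @ Wo                                           # W_O is the only place heads get mixed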

    • @YannicKilcher
      @YannicKilcher  4 years ago

      Also, thanks for the feedback :)

    • @gruffalosmouse107
      @gruffalosmouse107 3 years ago

      ​@@YannicKilcher The multi-head part is the only confusion I have about this great work. In NLP multi-head makes total sense: an embedding can "borrow" features/semantics from multiple words at different feature dimensions. But in CV seems it's not necessary? The authors didn't do ablation study about the number of heads. My suspicion is single head works almost as well as 8 heads. Would test it once I got a lot of GPUs...

  • @musbell
    @musbell 4 years ago +1

    Awesome!

  • @a_sobah
    @a_sobah 4 years ago +1

    Great video, thank you!

  • @maloukemallouke9735
    @maloukemallouke9735 2 years ago

    Hi, thanks Yannic for all the videos. I have a question about recognizing digits in images when they are not handwritten: how can we find digits in the street, like the numbers on buildings or cars? Thanks in advance.

  • @johngrabner
    @johngrabner 3 years ago +2

    Excellent job as usual. Congrats on your Ph.D.
    Cool trick adding the position encoding to K and Q and leaving V without position encoding. Is this unique to DETR?
    I'm guessing the decoder learns an offset from these given positions, analogous to more traditional bounding-box algorithms finding boxes relative to a fixed grid, with the extra twist that the decoder also eliminates duplicates.

    • @danielharsanyi844
      @danielharsanyi844 1 month ago

      This is the same thing I wanted to ask. Why leave out V? It's not even described in the paper.
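
For the Q/K-versus-V question in this thread: in the released DETR code the positional (and query) embeddings are added to the queries and keys at every attention layer, while the values stay position-free. A rough paraphrase of that pattern (a simplified sketch, not a verbatim copy of models/transformer.py):

import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    def __init__(self, d_model=256, nheads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nheads)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, src, pos):          # src, pos: (h*w, batch, d_model)
        q = k = src + pos                 # position is injected only into queries and keys
        attended, _ = self.self_attn(q, k, value=src)   # values carry content only, no position
        return self.norm(src + attended)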

  • @DacNguyenDW
    @DacNguyenDW 3 years ago +1

    Great!

  • @wizardOfRobots
    @wizardOfRobots 2 years ago

    So basically little people asking lots of questions... nice!
    PS. Thanks Yannic for the great analogy and insight...

  • @himanshurawlani3445
    @himanshurawlani3445 3 years ago +1

    Thank you very much for the explanation! I have a couple of questions:
    1. Can we consider object queries to be analogous to anchor boxes?
    2. Does the attention visualization highlight those parts of the image to which the network gives the highest importance while predicting?

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      1. Somewhat, but object queries are learned and initially completely independent of the datapoint.
      2. Yes, there are multiple ways, but roughly it's what you're saying

  • @jadtawil6143
    @jadtawil6143 2 years ago

    The object queries remind me of latent variables in variational architectures (VAEs, for example). In those architectures, the latent variables are constrained with a prior. Is this done for the object queries? Would that be a good idea?

  • @linusjohansson3164
    @linusjohansson3164 4 years ago

    Hi Yannic! Great video! I am working on a project, just for fun because I want to get better at deep learning, about predicting sale prices at auctions based on a number of features over time and also the state of the economy, probably represented by the stock market or GDP. So it's a time-series prediction project. I want to use transfer learning and find a good pretrained model I can use. As you seem to be very knowledgeable about state-of-the-art deep learning, I wonder if you have any idea of a model I could use? Preferably something I can use with TensorFlow.

    • @YannicKilcher
      @YannicKilcher  4 years ago

      Wow, no clue :D You might want to look for example in the ML for medicine field, because they have a lot of data over time (heart rate, etc.) or the ML for speech field if you have really high sample rates. Depending on your signal you might want to extract your own features or work with something like a fourier transform of the data. If you have very little data, it might make sense to bin it into classes, rather than use its original value. I guess the possibilities are endless, but ultimately it boils down to how much data you have, which puts a limit on how complicated of a model you can learn.

  • @vaibhavsingh1049
    @vaibhavsingh1049 4 years ago +1

    Can you do one about EfficientDet?

  • @benibachmann9274
    @benibachmann9274 4 years ago +1

    Great channel, subscribed! How does this approach compare to models optimized for size and inference speed on mobile devices, like SSD MobileNet? (See the detection model zoo on the TF GitHub.)

  • @christianjoshua8666
    @christianjoshua8666 4 years ago +8

    AI Developer:
    AI: 8:36 BIRD! BIRD! BIRD!

  • @AlexanderPacha
    @AlexanderPacha 3 years ago +1

    I'm a bit confused. At 17:17, you are drawing vertical lines, meaning that you unroll the channels (ending up with a vector of features per pixel that is fed into the transformer, "pixel by pixel"). Is that how it's being done? Or should there be horizontal lines (WH x C), where you feed one feature at a time for the entire image into the transformer?

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      Yes. If you think of text transformers as consuming one word vector per word, the analogy here is that you consume all channels of a pixel, pixel by pixel.
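
A small shape check of the "one token per pixel" reading discussed above (all dimensions are made-up examples): the CNN feature map is flattened so each pixel becomes one token, and that token's feature vector is the channel dimension.

import torch

B, C, H, W = 2, 256, 25, 34                  # batch, channels, feature-map height/width (example numbers)
feat = torch.randn(B, C, H, W)
tokens = feat.flatten(2).permute(2, 0, 1)    # (H*W, B, C): a sequence of H*W "pixel" tokens
print(tokens.shape)                          # torch.Size([850, 2, 256])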

  • @Augmented_AI
    @Augmented_AI 4 years ago +4

    Great video, very speedy :). How well does this compare to YOLOv4?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      No idea, I've never looked into it.

    • @gunslingerarthur5865
      @gunslingerarthur5865 4 years ago

      I think it might not be as good rn but the transformer part can be scaled like crazy.

  • @mariosconstantinou8271
    @mariosconstantinou8271 1 year ago

    I am having problems understanding the size of the trainable queries. I know it's a random vector, but of what size? If we want the output to be 1. bounding box (query_num, x, y, W, H) and 2. class (query_num, num_classes), will each object query be a 1x5 vector, [class, x, y, W, H]?
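
On the size question above, a sketch using the paper's default sizes as assumptions: each object query is a hidden_dim-dimensional learned vector (e.g. 256), not a [class, x, y, w, h] tuple; the class and box are produced by separate heads applied to the decoder output of each query (DETR's box head is actually a small MLP; a single Linear is used here for brevity).

import torch
import torch.nn as nn

num_queries, hidden_dim, num_classes = 100, 256, 91
decoder_out = torch.randn(num_queries, hidden_dim)    # one hidden vector per query (single image)
class_head = nn.Linear(hidden_dim, num_classes + 1)   # +1 for the "no object" class
bbox_head = nn.Linear(hidden_dim, 4)                  # (cx, cy, w, h), normalized to [0, 1]
logits = class_head(decoder_out)                      # (100, 92)
boxes = bbox_head(decoder_out).sigmoid()              # (100, 4)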

  • @Volconnh
    @Volconnh 3 years ago +1

    Has anyone tried to run this on a Jetson Nano to compare with previous approaches? How fast is it in comparison with a MobileNet SSD v2?

  • @TaherAbbasiz
    @TaherAbbasiz 1 year ago

    At 16:27 it is claimed that "the transformer is naturally a sequence processing unit" - is it? Isn't it naturally a set processing unit, and isn't this why we put a positional encoding block before it?

  • @erobusblack4856
    @erobusblack4856 1 year ago

    Can you train this on live VR/AR data?

  • @arjunpukale3310
    @arjunpukale3310 3 years ago +6

    Please make a video on training this model on our own custom datasets.

  • @chideraachinike7619
    @chideraachinike7619 4 years ago

    It's definitely not AGI, following your argument - which is true.
    It seems to do more filtering and interpolation than actual reasoning.
    I kind of feel disappointed. But this is good progress.
    I'm still an amateur in AI, by the way.

  • @JLin-xk9nf
    @JLin-xk9nf 2 years ago

    Thank you for your detailed explanation. But I still cannot follow the idea of object queries in the transformer decoder. Based on your explanation, N people are trained to each find a different region, starting from a random value. Then why don't we directly divide the image into a grid of N parts and get rid of the randomness? In object detection we don't need the stochasticity of a generator.

  • @arnavdas3139
    @arnavdas3139 3 years ago +1

    A naive question... at 39:17, are the attention maps you mention generated within the model itself, or are they fed in from outside at that stage?

  • @mathematicalninja2756
    @mathematicalninja2756 3 years ago +4

    I wonder if we can use this to generate captions from images using pure transformers.

    • @amantayal1897
      @amantayal1897 3 years ago +2

      And also for VQA - e.g. we could give the question encoding as input to the decoder.

  • @sanjaybora380
    @sanjaybora380 1 year ago

    Wow, it's the same as how human attention works:
    when we focus on one thing, we ignore other things in an image.

  • @DANstudiosable
    @DANstudiosable 2 years ago

    How are those object queries learnt?

  • @FlorianLaborde
    @FlorianLaborde 3 years ago

    I have never been so confused as when you started saying "diagonal" and then going from bottom left to top right - I'm so used to the matrix paradigm (32:40). Absolutely great otherwise.

  • @herp_derpingson
    @herp_derpingson 4 years ago

    I just realized YouTube added labels for parts of the video. I wonder what kind of AI Google will train using this data. :O
    35:20 That's a very interesting interpretation.

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      Yea I still have to provide the outline, but hopefully in the future that's done automatically, like subtitles

  • @tranquil_cove4884
    @tranquil_cove4884 3 years ago +1

    How do you make the bipartite matching loss differentiable?

    • @YannicKilcher
      @YannicKilcher  3 years ago +2

      the matching itself isn't differentiable, but the resulting differences are, so you just take that.
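
A sketch of how that works in practice (illustrative, single image, classification + L1 box cost only; the paper's cost also includes a generalized-IoU term and supervises unmatched queries as "no object"): the Hungarian matching runs outside the autograd graph, and gradients flow only through the losses computed on the matched pairs.

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def match_and_loss(pred_logits, pred_boxes, tgt_labels, tgt_boxes):
    # pred_logits: (num_queries, num_classes + 1), pred_boxes: (num_queries, 4)
    # tgt_labels: (num_objects,), tgt_boxes: (num_objects, 4)
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, tgt_labels]                        # (queries, objects) classification cost
    cost_bbox = torch.cdist(pred_boxes, tgt_boxes, p=1)      # pairwise L1 box cost
    cost = (cost_class + cost_bbox).detach().cpu().numpy()   # matching runs outside autograd
    row, col = linear_sum_assignment(cost)                   # Hungarian algorithm
    row, col = torch.as_tensor(row), torch.as_tensor(col)
    # gradients flow only through the losses on the matched prediction/target pairs
    loss_cls = F.cross_entropy(pred_logits[row], tgt_labels[col])
    loss_box = F.l1_loss(pred_boxes[row], tgt_boxes[col])
    return loss_cls + loss_box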

  • @gerardwalsh4724
    @gerardwalsh4724 4 years ago +1

    Interesting to compare to YOLOv4, which claims to get 65.7% AP50?

    • @dshlai
      @dshlai 4 years ago +1

      But YOLO can't do instance segmentation yet, so Mask R-CNN is probably a better comparison. Also, YOLO probably runs faster than either of these.

  • @mikhaildoroshenko2169
    @mikhaildoroshenko2169 4 years ago +1

    This is probably quite a stupid question, but can we just train end to end, from an image embedding to a string of symbols which contains all the necessary information for object detection? I am not arguing that it would be efficient, because of obvious problems with representing numbers as text, but it could work, right? If yes, then we could drop the requirement for a predefined maximum number of objects to detect.

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      I guess technically you could solve any problem by learning an end-to-end system to predict its output in form of a string. T5 is already doing sort-of this for text tasks, so it's not so far out there, but I think these custom approaches still work better for now.

    • @larrybird3729
      @larrybird3729 4 years ago +1

      Maybe! But getting the neural network to converge on that dataset would be a nightmare. The gradient descent algorithm only cares about one thing - "getting down that hill fast" - and with that sort of tunnel vision it can easily miss important features. But if you force gradient descent to look at the scenery as it climbs down the mountain, you might get lucky and find a helicopter 😆

    • @mikhaildoroshenko2169
      @mikhaildoroshenko2169 2 years ago

      @@YannicKilcher
      Guess it works now. :)
      Pix2seq: A Language Modeling Framework for Object Detection
      (sorry if I tagged you twice, the first comment had a Twitter link and got removed instantly.)

  • @dshlai
    @dshlai 4 years ago +3

    I wonder how the "object queries" differ from the "region proposal network" in an R-CNN detector.

    • @dshlai
      @dshlai 4 years ago

      It looks like Faster RCNN may still be better than DETR on smaller objects.

    • @FedericoBaldassarre
      @FedericoBaldassarre 4 years ago

      The first difference that comes to mind is that the RPN has a chance to look at the image before outputting any region proposal, while the object queries don't. The RPN makes suggestions like "there's something interesting at this location of this image, we should look more into it". The object queries instead are learned in an image-agnostic fashion, meaning that they look more like questions, e.g. "is there any small object in the bottom-left corner?"

  • @ruskinrajmanku2753
    @ruskinrajmanku2753 4 years ago

    If you think about it, transformers are really so much more effective than LSTMs for long sequences. The sequence here is of length WxH - that's on the order of thousands... Seriously, Attention Is All You Need was a breakthrough paper, like the one on GANs.

  • @marcgrondier398
    @marcgrondier398 4 years ago +3

    I've always wondered where we could find the code for ML research papers (In this case, we're lucky to have Yannic sharing everything)... Can anyone in the community help me out?

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      Sometimes the authors create a GitHub repo or put the code as additional files on arXiv, but mostly there's no code.

    • @convolvr
      @convolvr 4 years ago +3

      paperswithcode.com/

  • @jjmachan
    @jjmachan 4 years ago +1

    Will you do GPT-3?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      I have. Check it out :) ua-cam.com/video/SY5PvZrJhLE/v-deo.html

  • @MrDonald911
    @MrDonald911 4 years ago +2

    I understand your confusion about the object queries, since it is not clearly explained in their paper. After looking at the code, it seems the object queries are not learned; it's simply an embedding of the input (correct me if I'm wrong).
    Please see github.com/facebookresearch/detr/blob/0af41930d1b6c2244e33bbef76dff6c537dd53c0/models/detr.py#L38
    and the keyword "query_pos" in the transformer.py file.

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      The Embedding module contains internal weights that can be learned.

    • @MrDonald911
      @MrDonald911 4 years ago

      @@YannicKilcher Thanks for the answer. Do you know if these weights are learned during training, or whether there is some kind of pretraining?
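
A tiny illustration of the point in this thread: nn.Embedding is just a learnable weight matrix, so the object queries are ordinary parameters trained end-to-end with the rest of DETR, with no separate pretraining. Shapes below are illustrative.

import torch
import torch.nn as nn

num_queries, hidden_dim = 100, 256
query_embed = nn.Embedding(num_queries, hidden_dim)      # weight: (100, 256), a learnable parameter
print(query_embed.weight.requires_grad)                  # True -> updated by the optimizer during training

# At inference the same learned weights are reused for every image:
batch_size = 2
queries = query_embed.weight.unsqueeze(1).repeat(1, batch_size, 1)   # (num_queries, B, hidden_dim)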

  • @LaoZhao11
    @LaoZhao11 4 years ago +1

    I was so excited seeing the title - paper explained!
    But... WTF 40:56!! OK, a paper a day keeps hair away.