Explaining the Segment Anything Model - Network architecture, Dataset, Training

Neural Breakdown with AVB


Published 6 Nov 2024

COMMENTS • 57

@avb_fj 1 year ago
Here's me from the future posting a detailed analysis of Neural Attention:
ua-cam.com/video/frosrL1CEhw/v-deo.html
@DatuxGames 1 year ago ⁺⁴
Your videos just keep getting better and better! Editing is on point with this one. Also great topic and really valuable to have you break things down like this.
@avb_fj 1 year ago
Thank you so much! I’m learning things as I go, so I really appreciate feedback like this!
@rmayer4086 1 year ago
@avb_fj I agree with him. Your pacing is excellent and you're giving a perfect level of detail.
@man9mj 1 year ago ⁺⁵
I am flabbergasted by the quality of this content. Thank you for the effort. I just subscribed to your channel. Keep up the good work brother! We look forward to more :)
@anacaznok872 1 year ago ⁺¹
The best video on the subject. Thank you! I'll keep watching your videos
@avb_fj 1 year ago
Awesome! Welcome to the channel and I’m glad you liked the video!
@gingerderidder8665 1 year ago ⁺¹
So happy I got recommended this video. Great quality content!
@avb_fj 1 year ago
Nice! Glad you enjoyed it!
@jorgeabraham3414 1 year ago ⁺¹
this video will have tens of thousands of views in the upcoming days
@aprilaustin5569 1 month ago
Very good and clear explanation!
@SlashDL 2 months ago ⁺¹
Some more information at 10:25: in the token-to-image attention, the queries come from the prompt + output tokens and the keys and values come from the image embedding. In the image-to-token attention, the queries come from the image embedding and the keys and values come from the prompt + output tokens.
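The two attention directions described in this comment can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: a single head, no learned projections, no residual connections or layer norms, and made-up sizes (6 tokens, a 4x4 image embedding, 8 channels). The names `cross_attention`, `tokens`, and `image` are hypothetical and are not taken from the SAM code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)             # each query row sums to 1
    return weights @ keys_values, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))   # assumed: prompt + output tokens
image = rng.normal(size=(16, 8))   # assumed: flattened 4x4 image embedding

# Token-to-image: tokens are the queries, the image supplies keys and values
tokens_updated, w_t2i = cross_attention(tokens, image)

# Image-to-token: the image is the query, the tokens supply keys and values
image_updated, w_i2t = cross_attention(image, tokens_updated)
```

The only difference between the two calls is which sequence plays the query role, which is exactly the distinction the comment draws.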
@Sciencehub-oq5go 1 year ago ⁺¹
I really like this explanation. Thanks a lot!
@mohamedkarim-p7j 5 months ago ⁺¹
Thanks for sharing 👍
@hinchengchen3153 5 months ago ⁺¹
easy and short but splendid!!
@ItalianPizza64 1 year ago ⁺¹
Excellent video, thank you very much! After watching this, there's no doubt in my mind that transformer-based architectures will take over AI for computer vision
@avb_fj 1 year ago ⁺¹
Very true. Vision Transformers are definitely here to stay. The generalization power of transformers/attention is so surprising sometimes… decades of computer vision research suggested that CNNs are best for images because they can encode spatial information about the image… it’s just counterintuitive and mind-boggling that ViTs can still learn from images after flattening individual patches and losing spatial structure.
@keneth4 1 year ago ⁺²
Awesome explanation 👏🏼
@billy.n2813 1 year ago ⁺¹
Thank you for this!
@SlashDL 2 months ago ⁺¹
At 10:03, 4 new tokens are added to the sparse embeddings: 1 representing the IoU score and the other 3 representing the masks. Just a minor correction.
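To make the token bookkeeping in this correction concrete, here is a small shape-only sketch. It follows the count given in the comment (1 IoU token plus 3 mask tokens); the embedding size of 256, the two point prompts, and all variable names are assumptions for illustration, and the random values stand in for learned embeddings.

```python
import numpy as np

embed_dim = 256                                   # assumed embedding size
rng = np.random.default_rng(0)

iou_token = rng.normal(size=(1, embed_dim))       # 1 token for the IoU score
mask_tokens = rng.normal(size=(3, embed_dim))     # 3 tokens, one per output mask
sparse_prompt = rng.normal(size=(2, embed_dim))   # e.g. two point prompts

# The 4 output tokens are prepended to the sparse prompt embeddings
decoder_tokens = np.concatenate([iou_token, mask_tokens, sparse_prompt], axis=0)

# After the decoder runs, the same positions are sliced back out
iou_token_out = decoder_tokens[0]       # would feed the IoU prediction head
mask_tokens_out = decoder_tokens[1:4]   # would produce the 3 masks
```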
@willikappler1401 1 year ago
Wonderful, I really like the way you present complex topics!
@SofieSimp 1 year ago ⁺²
Nice video explaining the interactive training. I have one question: during interactive training, is the loss calculated at each step or only at the end?
To be more clear:
Step 1: I sample a point at the middle of the ground truth mask
Step 2: Feed the point as a prompt into the model
Step 3: Get the best mask from the model
Step 4: From the best mask, calculate error region and sample another positive OR negative point in the error region
Step 5: Loop from step 2 until the maximum number of iterations is reached
Do I have to calculate the loss between step 3 and step 4 and update the model before moving on to step 4, or do I calculate the loss at the end, after step 5?
@avb_fj 1 year ago ⁺¹
That's a great question. We should definitely calculate the loss between step 3 and step 4. Every iteration is regarded as an isolated training example, so basically for each training example we input an image and a prompt (with a dense mask), output the predicted mask(s), and then apply the loss between our prediction and the ground truth.
As far as "updating the model" is concerned, it is largely a design choice I think. It's not incorrect to update the weights between each iteration, but it's probably better to do gradient accumulation (basically aggregating the losses over multiple iterations) before updating the weights to get a more stable training curve.
Hope that helps!
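The loop from the question, with a loss at every iteration and a single accumulated update at the end, might look roughly like the sketch below. It is a toy under loud assumptions: the "model" is a per-pixel logistic regression rather than SAM, the loss is plain binary cross-entropy, the previous logits are treated as a constant input, and every function name here is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def features(image, pos_map, neg_map, prev_logits):
    # Per-pixel inputs: image, positive clicks, negative clicks, previous logits, bias
    return np.stack([image, pos_map, neg_map, prev_logits, np.ones_like(image)], axis=-1)

def bce_and_grad(w, feats, gt):
    logits = feats @ w                                   # (H, W)
    p = sigmoid(logits)
    eps = 1e-7
    loss = -np.mean(gt * np.log(p + eps) + (1 - gt) * np.log(1 - p + eps))
    grad = np.einsum("hw,hwk->k", p - gt, feats) / gt.size
    return float(loss), grad, logits

def sample_error_point(pred_mask, gt, rng):
    # Step 4: pick a random pixel where the prediction disagrees with the GT
    err = np.argwhere(pred_mask != gt)
    if len(err) == 0:
        return None
    y, x = err[rng.integers(len(err))]
    return int(y), int(x), bool(gt[y, x] == 1)           # positive if it was missed

def train_on_example(w, image, gt, rng, n_iters=4, lr=0.5):
    pos_map, neg_map = np.zeros_like(image), np.zeros_like(image)
    prev_logits = np.zeros_like(image)                   # empty dense prompt
    ys, xs = np.nonzero(gt)
    pos_map[ys[len(ys) // 2], xs[len(xs) // 2]] = 1.0    # Step 1: a point inside the GT
    grad_sum, losses = np.zeros_like(w), []
    for _ in range(n_iters):                             # Step 5: loop
        feats = features(image, pos_map, neg_map, prev_logits)   # Step 2
        loss, grad, logits = bce_and_grad(w, feats, gt)  # Step 3 + loss right here
        losses.append(loss)
        grad_sum += grad                                 # accumulate, do not update yet
        point = sample_error_point((logits > 0).astype(gt.dtype), gt, rng)
        if point is None:
            break
        y, x, is_positive = point
        (pos_map if is_positive else neg_map)[y, x] = 1.0
        prev_logits = logits                             # fed back as the dense prompt
    return w - lr * grad_sum / len(losses), losses       # one update per example

rng = np.random.default_rng(0)
image = np.zeros((8, 8)); image[2:6, 2:6] = 1.0
gt = image.copy()
w_new, losses = train_on_example(np.zeros(5), image, gt, rng)
```

Updating after every iteration instead would simply move the weight update inside the loop; the accumulated version is the one the reply recommends for a steadier training curve.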
@SofieSimp 1 year ago ⁺¹
@avb_fj Thanks a lot! Really great explanation!
@avb_fj 1 year ago
@SofieSimp 🙌🙌
@scifaipy9301 18 days ago
Just a quick suggestion: don't use background music. I mostly avoid videos with background music because it distracts from the informative explanations. Besides that, thanks for making videos that focus on AI research papers. Your English is very clear.
@victorbjorklund 1 year ago ⁺¹
Good quality video. You got a subscriber.
@davidyu2372 4 months ago
great video!
@VictorVelazquezEspitia 9 months ago
Hey man, congrats on the great video. Right now I am doing my thesis on SAM and this was a big help. May I ask which camera you used?
@avb_fj 8 months ago
Good old iPhone. Good luck on your thesis man!
@wkgates 1 year ago
Great explanation!
@miyutube1 8 months ago
Very good. I am using SAM and want to understand it better in order to tune the parameters, so here I am, struggling to understand your video (one of the few that actually try to explain the concepts...). What is pt in the focal loss definition?
@avb_fj 8 months ago
Could you add a timestamp?
@miyutube1 8 months ago
2:54
@avb_fj 8 months ago
@miyutube1 I see. “p” here is simply the probability output by the model for the classification task. You can find more info on page 3 of this paper:
arxiv.org/pdf/1708.02002.pdf
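For reference, the linked focal loss paper defines p_t as p when the label is 1 and 1 - p otherwise, and the loss as -alpha_t * (1 - p_t)^gamma * log(p_t). A minimal per-pixel sketch follows; the defaults gamma=2 and alpha=0.25 are the values reported in that paper and are only an assumption here, not necessarily what SAM uses.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one binary prediction.

    p: model probability for the positive class, y: ground-truth label (0 or 1).
    """
    p_t = p if y == 1 else 1.0 - p               # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)   # confident and correct: heavily down-weighted
hard = focal_loss(0.1, 1)   # confident and wrong: dominates the loss
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy.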
@prafulmathur4567 1 year ago
Hi. The explanation is awesome, but it's still not clear how SAM handles nested annotations. The GT annotations have no hierarchy defined; each part is annotated independently. So how does SAM learn whole, part, and subpart for each object?
@avb_fj 1 year ago
That’s a great question… you are right, the network does not explicitly output the part/subpart/whole stuff! The annotators are just asked to annotate in that way. I believe what ends up happening is that the network automatically learns to map each output head to one of the three labeled GT masks. Because each output head also has its own unique output token embedding, they technically learn to associate with different output masks as well. They leave it up to backpropagation and gradient descent to handle the rest…
Note that they do not provide whole/part/subpart labels to the network during either training or inference… and at inference the network doesn’t explicitly output those labels either. It just returns 3 distinct masks and their confidence scores…
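One mechanism that lets the heads specialize without labels is worth sketching: the SAM paper describes backpropagating only the minimum loss over the predicted masks for a given prompt. The toy below illustrates that selection step only. Binary cross-entropy stands in for the paper's actual mask loss, and the function names and example logits are assumptions.

```python
import numpy as np

def bce(logits, gt):
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-7
    return float(-np.mean(gt * np.log(p + eps) + (1 - gt) * np.log(1 - p + eps)))

def min_loss_over_masks(mask_logits, gt):
    """Return the smallest per-head loss and the index of the head that produced it."""
    losses = [bce(logits, gt) for logits in mask_logits]
    best = int(np.argmin(losses))
    return losses[best], best

gt = np.array([[1.0, 0.0], [0.0, 1.0]])
mask_logits = np.stack([
    np.zeros((2, 2)),        # head 0: undecided everywhere
    5.0 * (2.0 * gt - 1.0),  # head 1: closely matches this GT mask
    -5.0 * (2.0 * gt - 1.0), # head 2: the inverse of the GT mask
])
loss, winning_head = min_loss_over_masks(mask_logits, gt)
```

Only the winning head would receive a gradient from the mask loss for this example, so over many examples each head is free to drift toward a different granularity.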
@EkShunya 1 year ago ⁺¹
I like your energy.
Can you help the community with resources you refer to and channels/people you follow?
@avb_fj 1 year ago
Thanks for the comment! That’s great feedback, I’ll try to share more in the upcoming videos!
@nitinsurya1991 11 months ago
- What could be the intuition for having an MLP for IoU scores with MSE loss on top?
- From their repository, I don't see any interface for text-prompt usage. Are any examples available?
@avb_fj 11 months ago
- Predicting the IoU scores helps during inference to determine which of the three predicted masks is most likely to be the "correct" mask to show the user. The three IoU predictions are simply considered “confidence scores” for each of the three predicted masks. This allows them to rank each of the mask outputs according to how high the IoU prediction is. They are basically asking the network to output how confident it is for each of its three predictions.
For example, if one of the masks has a predicted IoU of 0.99 it suggests that the network thinks it’s strongly overlapping the queried object.
Again, note that this is all an inference-time thing. During training, we have the Ground Truth mask and use MSE loss to train the network to output the correct IoU scores. During inference, we just have the three predicted masks and the network's own confidence score as IoU for each of the three masks. Hope that helps.
- Regarding the text-prompt usage, they did not release it in their web app. They just documented it in their paper. I don't know if there are plans to release it in the future, or if there are other ways to access it.
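The training-time and inference-time roles of the IoU prediction can be shown side by side in a short sketch. The tiny masks and the `pred_iou` values below are made up for illustration; in the real model the predicted scores would come from the IoU head rather than being typed in by hand.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union) if union > 0 else 0.0

gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True                      # 4-pixel ground-truth object

masks = np.zeros((3, 4, 4), dtype=bool)
masks[0, 1:3, 1:3] = True                # exact match
masks[1, 1:3, 1:4] = True                # slightly too wide
masks[2, :, :] = True                    # covers the whole image

# Training: the GT is available, so the true IoU is the regression target
true_iou = np.array([iou(m, gt) for m in masks])
pred_iou = np.array([0.9, 0.7, 0.2])     # hypothetical IoU-head outputs
mse = float(np.mean((pred_iou - true_iou) ** 2))

# Inference: no GT, so the predicted IoU alone ranks the masks
best_mask_index = int(np.argmax(pred_iou))
```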
@nitinsurya1991 11 months ago
@avb_fj thanks. Have you tried whether the CLIP text embeddings would do the trick? Essentially, they mentioned that they did train the model with that input.
@timanb2491 10 months ago
May I ask you a question? One type of prompt is a segmentation mask. If we already have a binary segmentation mask as a prompt, why should we use SAM?
@avb_fj 10 months ago
Check out the part about interactive training at around 5:00
Basically the dense mask prompt is used during the training phase to iteratively improve the network's segmentations. Kinda like asking the model, “Hey you gave me this mask last time, but here’s another internal/external point prompt, give me an updated mask”.
During inference, we can pass an image full of zeros as the dense mask into the model (meaning we have no idea where the segmentation should be) and ask the model to update it. Once it gives an initial estimation, we might recursively pass the network’s output logits (not binary, but the prob distribution) back into it as a dense prompt to iteratively improve the predicted mask.
In other words, don’t assume that the mask we pass in as a prompt needs to be the “correct one”… it will be incorrect / a gross estimation of the correct mask, and the network's job is to iteratively improve it till it converges somewhere.
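The inference-time loop described here (start from an all-zero dense prompt, then feed the output logits back in) has a simple control flow, sketched below. `toy_predict` is a hypothetical stand-in and not a SAM function: it just blends image evidence with the previous logits so the loop has something to converge on, and point prompts are left out for brevity.

```python
import numpy as np

def toy_predict(image, prev_logits):
    """Hypothetical segmenter: combines image evidence with the previous mask logits."""
    evidence = 4.0 * (image - 0.5)
    return 0.5 * prev_logits + evidence

image = np.zeros((4, 4))
image[1:3, 1:3] = 1.0

logits = np.zeros_like(image)   # all-zero dense prompt: no prior guess
changes = []
for _ in range(5):
    new_logits = toy_predict(image, logits)
    changes.append(float(np.abs(new_logits - logits).max()))
    logits = new_logits         # pass the raw logits back, not a binarized mask

final_mask = logits > 0
```

In this toy the per-iteration change halves each round, which mirrors the idea of refining a rough estimate until it settles.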
@barbaraz5363 1 year ago
Hi, thanks for your video! It was explained so well! I have just one question about the predicted IoU score: how is it calculated during inference? In the paper they just said that it's calculated between the predicted mask and the object it covers. I wonder how they get the area of the object it covers (because basically we don't have the GT).
@avb_fj 1 year ago ⁺¹
Yeah it’s kinda tricky to understand it. During training, they have the GT, so they can calculate the IoU with the 3 predicted masks and train the model to predict IoU scores.
During inference, the three IoU predictions are simply considered as “confidence scores” for each of the three predicted masks. This allows them to rank each of the mask outputs according to how good the IoU prediction is.
For example, if one of the masks has a predicted IoU of 0.99 it suggests that the network thinks it’s strongly overlapping the queried object. That said, this is still a network’s own confidence on its own prediction, and there is no way of knowing the correct IoU score coz we won’t have the GT during inference. It’s all an additional helpful output that tells us the network’s confidence on each of the masks.
Hope that helps!
@barbaraz5363 1 year ago
@avb_fj Thanks for your quick reply! If I understand correctly, the so-called "confidence scores" during inference are in fact computed by an MLP head with input (3, embedding_dim) and some hidden layers of 256 neurons, which in the end outputs a tensor of shape (3, 1) representing the probability (0, 1) of each mask after passing through a sigmoid activation?
@avb_fj 11 months ago
@barbaraz5363 Sorry for the late response, I must've missed the notification. Fwiw, what you said makes perfect sense to me.
@Grenoble7 1 year ago ⁺²
Hello. Great dense video. One suggestion: you are a bit too fast for me; I have to pause on every slide to read it. Usually I watch videos at 2x speed, but yours is the only one I’ve seen on YouTube where I need the opposite! Maybe you could describe each slide in more detail to give us time to understand it? Just an idea.
@Ye1324 1 year ago
In what format is the mask data saved? Is it in tensors or NumPy arrays?
@avb_fj 1 year ago ⁺¹
It is mostly a design/implementation choice. Generally people would save the mask data in an image format (like PNG) or in any uint8 format (including NumPy arrays)… during training, though, we would need to load/convert them to tensors for easy gradient calculations…
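As one possible version of that round trip, the sketch below stores a binary mask as a uint8 NumPy array and converts it back to a float array ready for a loss function. The in-memory buffer stands in for a file on disk, the 0/255 encoding is just a common convention, and the final conversion to a framework tensor is left out.

```python
import io
import numpy as np

mask = np.zeros((4, 6), dtype=bool)
mask[1:3, 2:5] = True

# Store: uint8 with 0 = background, 255 = object (same values a PNG would hold)
stored = mask.astype(np.uint8) * 255
buffer = io.BytesIO()                 # stand-in for a .npy file on disk
np.save(buffer, stored)

# Load: read back and convert to float32 in {0, 1} for training
buffer.seek(0)
loaded = np.load(buffer)
train_mask = (loaded > 127).astype(np.float32)
```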
@Alice-yq6yy 3 months ago
How does SAM guess the IoU for new images when there is no ground truth available?
@avb_fj 3 months ago
During training, the ground truth images and their IoU scores are available, so we can train the SAM network to predict it using supervised training. During inference, the network predicts the segmentation masks and also the estimates of the IoU scores.
@egonvanpraet 1 year ago ⁺¹
Your content is very underrated in the algorithm. Keep making videos, they are great :) Would be great if you could explain MusicLM from Google.
@avb_fj 1 year ago
Thanks for the suggestion. I’ll add it to my bucket list for next month!