Here's me from the future posting a detailed analysis of Neural Attention:
ua-cam.com/video/frosrL1CEhw/v-deo.html
Your videos just keep getting better and better! Editing is on point with this one. Also great topic and really valuable to have you break things down like this.
Thank you so much! I’m learning things as I go, so I really appreciate feedback like this!
@@avb_fj I agree with him. Your pacing is excellent and you're giving a perfect level of detail.
I am flabbergasted by the quality of this content. Thank you for the effort. I just subscribed to your channel. Keep up the good work brother! We look forward to more :)
The best video on the subject. Thank you! I'll keep watching your videos
Awesome! Welcome to the channel and I’m glad you liked the video!
So happy I got recommended this video. Great quality content!
Nice! Glad you enjoyed it!
this video will have tens of thousands of views in the upcoming days
Very good and clear explanation!
Some more information at 10:25 - In the token to image attention, the query comes from the prompt + output tokens and the key, value comes from the image. In the image to token attention, the query comes from the image embedding and the key, value comes from the prompt + output tokens.
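A minimal sketch of the two attention directions described above, assuming single-head attention with no learned projections, layer norms, or positional encodings (all of which a real decoder would have). The names `cross_attention`, `tokens`, and `image` are illustrative only, not taken from the SAM code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Simplification (assumption): the same array serves as both keys and
    # values, and there are no learned Q/K/V projection matrices.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_queries, n_kv)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ keys_values                    # (n_queries, d)

rng = np.random.default_rng(0)
dim = 8
tokens = rng.normal(size=(7, dim))    # prompt + output tokens
image = rng.normal(size=(16, dim))    # flattened image embedding (e.g. a 4x4 grid)

# Token-to-image attention: queries come from the tokens,
# keys/values come from the image embedding.
tokens_updated = tokens + cross_attention(tokens, image)

# Image-to-token attention: queries come from the image embedding,
# keys/values come from the (updated) tokens.
image_updated = image + cross_attention(image, tokens_updated)
```

Each direction keeps the shape of whatever supplied the queries, so the tokens stay `(7, dim)` and the image embedding stays `(16, dim)`.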
I really like this explanation. Thanks a lot!
Thanks for sharing 👍
easy and short but splendid!!
Excellent video, thank you very much! After watching this, there's no doubt in my mind that transformer-based architectures will take over AI for computer vision
Very true. Vision Transformers are definitely here to stay. The generalization power of transformers/attention is so surprising sometimes… decades of computer vision research suggested that CNNs are best for images because they can encode spatial information about the image… it’s just counterintuitive and mind boggling that ViTs can still learn from images after flattening individual patches and losing spatial structure.
Awesome explanation 👏🏼
Thank you for this!
At 10:03, 4 new tokens are added to the sparse embeddings, 1 representing the IoU score, the rest of the 3 representing the masks. Just a minor correction.
Wonderful, I really like the way how you present complex topics!
Nice video explaining the interactive training. I have one question: in the interactive training, is the loss calculated at each step or only at the end?
To be more clear:
Step 1: I sample a point at the middle of the ground truth mask
Step 2: Feed the point as a prompt into the model
Step 3: Get the best mask from the model
Step 4: From the best mask, calculate error region and sample another positive OR negative point in the error region
Step 5: Loop from step 2 until the maximum number of iterations is reached
Do I have to calculate the loss between step 3 and step 4, update the model, and then move on to step 4, or do I calculate the loss at the end, after step 5?
That's a great question. We should definitely calculate the loss between step 3 and step 4. Every iteration is regarded as an isolated training example, so basically for each training example we input an image and a prompt (with a dense mask) and output predicted mask(s)... and then apply the loss over our prediction and the ground truth.
As far as "updating the model" is concerned, it is largely a design choice I think. It's not incorrect to update the weights between each iteration, but it's probably better to do gradient accumulation (basically aggregating the losses over multiple iterations) before updating the weights to get a more stable training curve.
Hope that helps!
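The loop from the question, with the loss taken right after step 3 and aggregated over iterations, could be sketched roughly as below. Everything here is a toy under stated assumptions: `toy_predict` is a made-up stand-in for the model, the "loss" is just the fraction of disagreeing pixels rather than a differentiable training loss, and averaging the per-step losses stands in for gradient accumulation.

```python
import numpy as np

def sample_correction_point(gt_mask, pred_mask, rng):
    # Step 4: pick a random pixel from the error region (where prediction
    # and ground truth disagree). Label 1 = positive point (missed object
    # pixel), label 0 = negative point (false-positive pixel).
    rows, cols = np.nonzero(gt_mask != pred_mask)
    if rows.size == 0:
        return None  # prediction already matches the ground truth
    i = rng.integers(rows.size)
    r, c = int(rows[i]), int(cols[i])
    return (r, c), int(gt_mask[r, c])

def toy_predict(shape, points, labels, radius=2):
    # Hypothetical stand-in for the model: paint a square around positive
    # points and erase a square around negative points.
    mask = np.zeros(shape, dtype=bool)
    for (r, c), label in zip(points, labels):
        r0, c0 = max(r - radius, 0), max(c - radius, 0)
        mask[r0:r + radius + 1, c0:c + radius + 1] = bool(label)
    return mask

rng = np.random.default_rng(0)
gt = np.zeros((16, 16), dtype=bool)
gt[4:12, 4:12] = True

points, labels = [(8, 8)], [1]   # step 1: a point in the middle of the GT mask
losses = []
for step in range(5):            # step 5: loop up to a maximum iteration count
    pred = toy_predict(gt.shape, points, labels)      # steps 2-3
    losses.append(float(np.mean(pred != gt)))         # loss right after step 3
    sample = sample_correction_point(gt, pred, rng)   # step 4
    if sample is None:
        break
    point, label = sample
    points.append(point)
    labels.append(label)

# Aggregate the per-iteration losses before a single weight update
# (the gradient-accumulation option mentioned above).
accumulated_loss = sum(losses) / len(losses)
```

Updating the weights inside the loop instead would simply mean using each entry of `losses` on its own.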
@@avb_fj Thanks a lot! Really great explanation!
🙌🙌@@SofieSimp
Just a quick suggestion, don't use background music. I mostly avoid videos with background music, it distracts from the informative explanations. Besides that, thanks for making videos that focus on AI research papers. Your English is very clear.
Good quality video. You got a subscriber.
great video!
Hey man, congrats on the great video. Rn I am doing my thesis on SAM and this was of help. May I ask which camera you used?
Good old iPhone. Good luck on your thesis man!
Great explanation!
Very good, I am using SAM and want to understand better to tune the parameters, thus here I am struggling to understand your video (one of the few that actually try to explain the concepts...). What is pt in the focal loss definition?
Could you add a timestamp?
2:54
@@miyutube1 I see. “p” here is simply the probability outputted by the model for the classification task. You can find more info on page 3 of this paper
arxiv.org/pdf/1708.02002.pdf
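To spell out the subscript: in the linked paper, p_t equals p when the ground-truth label is 1 and 1 - p otherwise, and the focal loss is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). A small sketch of that definition follows; the function name and the default `gamma`/`alpha` values are illustrative choices here, not settings taken from SAM.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    # p: predicted probability of the positive class, y: label in {0, 1}.
    p = np.clip(np.asarray(p, dtype=float), 1e-7, 1.0 - 1e-7)  # avoid log(0)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# A confident correct prediction is down-weighted far more than an uncertain one.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.6, 1)
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy.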
Hi, the explanation is awesome. But it's still not clear how SAM handles nested annotations. The GT annotations have no hierarchy defined; each part is independently annotated. Then how does SAM learn whole, part and subpart for each object?
That’s a great question… you are right, the network does not explicitly output the part/subpart/whole stuff! The annotators are just asked to annotate in such a way. I believe what ends up happening is that the network automatically learns to map each output head to one of the three labeled GT. Because each output head also has its own unique output token embedding, they technically learn to associate with different output masks as well. They leave it up to backpropagation and gradient descent to handle the rest…
Note that neither during inference nor training do they provide labels for whole/part/subpart to the network… and during prediction/inference too, the network doesn’t explicitly output those labels. It just returns 3 distinct masks and their confidence scores…
i like your energy.
can you help the community with resources you refer to and channels/people you follow?
Thanks for the comment! That’s great feedback, I’ll try to share more in the upcoming videos!
- What could be the intuition for having MLP for IOU scores and MSE loss on top?
- from their repository, don't see any interface of text prompt usage. Any examples available?
- Predicting the IoU scores helps during inference to determine which of the three predicted masks is most likely to be the "correct" mask to show the user. The three IOU predictions are simply considered “confidence scores” for each of the three predicted masks. This allows them to rank each of the mask outputs according to how high the IOU prediction is. They are basically asking the network to output how confident it is for each of its three predictions.
For example, if one of the masks has a predicted IOU of 0.99 it suggests that the network thinks it’s strongly overlapping the queried object.
Again, note that this is all an inference-time thing. During training, we have the Ground Truth mask and use MSE loss to train the network to output the correct IOU scores. During inference, we just have the three predicted masks and the network's own confidence score as IoU for each of the three masks. Hope that helps.
- Reg the text-prompt usage, they did not release it in their web-app. They just documented it in their paper. Don't know if there are plans to release in the future, or if there are other ways to access it.
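On the IoU point above, a toy illustration of the two phases: during training the true IoU of each predicted mask against the ground truth is the regression target for an MSE loss, and during inference the predicted IoUs are only used to rank the masks. The masks and the `predicted_iou` values below are made up for the example.

```python
import numpy as np

def mask_iou(pred, gt):
    # Intersection over union of two boolean masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True                  # 16-pixel ground-truth object

pred_masks = np.zeros((3, 8, 8), dtype=bool)
pred_masks[0, 2:6, 2:6] = True       # matches the object exactly
pred_masks[1, 2:6, 2:4] = True       # covers half of the object
pred_masks[2, :, :] = True           # covers the whole image

# Made-up outputs of the IoU head, one score per predicted mask.
predicted_iou = np.array([0.9, 0.6, 0.2])

# Training: ground truth is available, so the true IoUs are the targets.
true_iou = np.array([mask_iou(m, gt) for m in pred_masks])
mse_loss = float(np.mean((predicted_iou - true_iou) ** 2))

# Inference: no ground truth, so rank the masks by predicted IoU alone.
best_mask_index = int(np.argmax(predicted_iou))
```

Here `true_iou` comes out as 1.0, 0.5 and 0.25, and the mask shown to the user would be the one at `best_mask_index`.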
@@avb_fj thanks. Have you tried if the CLIP text embeddings would do the trick? Essentially, they mentioned they did train the model with the input.
May I ask you a question? One type of prompt is a segmentation mask. If we have a segmentation mask as a prompt, why should we use SAM? We already have a binary segmentation mask.
Check out the part about interactive training at around 5:00
Basically the dense mask prompt is used during the training phase to iteratively improve the network’s segmentations. Kinda like asking the model, “Hey you gave me this mask last time, but here’s another internal/external point prompt, give me an updated mask”.
During inference, we can pass an image full of zeros as the dense mask into the model (meaning we have no idea where the segmentation should be) and ask the model to update it. Once it gives an initial estimation, we might recursively pass the network’s output logits (not binary, but the prob distribution) back into it as a dense prompt to iteratively improve the predicted mask.
In other words, don’t assume that the mask we pass in as a prompt needs to be the “correct one”… it will be incorrect / a rough estimate of the correct mask, and the network’s job is to iteratively improve it till it converges somewhere.
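The refinement loop described in this reply could be sketched as follows. This follows the description above (start from an all-zero dense mask, then feed the previous logits back in); `predict_fn` is a hypothetical callable standing in for the model, and `toy_predict_fn` is a made-up stub so the loop can run on its own.

```python
import numpy as np

def refine(predict_fn, image, points, labels, num_rounds=3):
    # Start with an all-zero dense prompt: "no idea where the mask is".
    mask_logits = np.zeros(image.shape[:2], dtype=np.float32)
    for _ in range(num_rounds):
        # Feed the previous logits (not a thresholded binary mask) back in.
        mask_logits = predict_fn(image, points, labels, mask_logits)
    return mask_logits > 0.0   # threshold only at the very end

def toy_predict_fn(image, points, labels, prev_logits):
    # Hypothetical stub: nudges the previous logits halfway toward a
    # target that marks bright pixels as foreground. Ignores the points.
    target = np.where(image > 0.5, 4.0, -4.0).astype(np.float32)
    return 0.5 * prev_logits + 0.5 * target

image = np.zeros((6, 6), dtype=np.float32)
image[1:4, 2:5] = 1.0
final_mask = refine(toy_predict_fn, image, points=[(2, 3)], labels=[1])
```

The point of the sketch is only the data flow: the dense prompt on round k is whatever the model produced on round k - 1, so it is never assumed to be correct.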
Hi, thanks for your video! It was explained so well! I have just one question about the predicted IoU score: how is it calculated during inference? In the paper they just said that it's calculated between the predicted mask and the object it covers. I wonder how they get the area of the object it covers (because basically we don't have GT).
Yeah it’s kinda tricky to understand it. During training, they have the GT, so they can calculate the IOU with the 3 predicted masks and train the model to predict IOU scores.
During inference, the three IOU predictions are simply considered as “confidence scores” for each of the three predicted masks. This allows them to rank each of the mask outputs according to how good the iou prediction is.
For example, if one of the masks has a predicted IOU of 0.99 it suggests that the network thinks it’s strongly overlapping the queried object. That said, this is still a network’s own confidence on its own prediction, and there is no way of knowing the correct iou score coz we won’t have the GT during inference. It’s all an additional helpful output that tells us the network’s confidence on each of the masks.
Hope that helps!
@@avb_fj Thanks for your quick reply! If I understand correctly, the so-called "confidence scores" during inference are in fact calculated by an MLP head with input (3, embedding_dim) and some hidden layers of 256 neurons, and in the end it outputs a tensor (3, 1) which represents the probability (0, 1) of each mask after passing through a sigmoid activation?
Sorry for the late response, I must've missed the notification. Fwiw, what you said makes perfect sense to me.@@barbaraz5363
Hello. Great dense video. Suggestion: you are a bit too fast for me; I have to pause on every slide to read it. Usually I watch videos at 2x speed, but yours is the only one I’ve seen on YouTube where I need the opposite! Maybe you could describe each slide in more detail to give us time to understand it? Just an idea.
In what format is the mask data saved? Is it in tensors or numpy arrays?
It is mostly a design/implementation choice. Generally people would save the mask data in an image format (like png) or in any uint8 format (including numpy arrays)… during training though we would need to load/convert them to tensors for easy gradient calculations…
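One possible version of that store-then-convert round trip, using only numpy and an in-memory buffer in place of a file on disk. The 0/255 convention mirrors a grayscale PNG and is an assumption here, as is the final float array standing in for the tensor a training framework would build from it.

```python
import io
import numpy as np

mask = np.zeros((4, 6), dtype=bool)
mask[1:3, 2:5] = True

# Storage: a compact uint8 array (0 = background, 255 = object).
stored = np.where(mask, 255, 0).astype(np.uint8)
buffer = io.BytesIO()            # stands in for a file on disk
np.save(buffer, stored)

# Training time: load, binarize, and convert to float for loss computation.
buffer.seek(0)
loaded = np.load(buffer)
target = (loaded > 127).astype(np.float32)   # values are 0.0 or 1.0
```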
How does SAM guess the IoU for new images when there is no ground truth available?
During training, the ground truth masks are available, so the true IOU scores can be computed and we can train the SAM network to predict them using supervised training. During inference, the network predicts the segmentation masks and also the estimates of the IOU scores.
Your content is very underrated in the algorithm. Keep making videos, they are great :) Would be great if you could explain MusicLM from Google.
Thanks for the suggestion. I’ll add it to my bucket list for next month!