Just uploaded the second part of the series discussing Self Attention and variants. Link here:
ua-cam.com/video/4naXLhVfeho/v-deo.html
Here's future me posting the third part about Transformers:
ua-cam.com/video/0P6-6KhBmZM/v-deo.html
By far one of the best Deep Learning UA-cam channels I have ever checked out!
Can't wait for part 2 and 3, keep up the good work!
Thanks a lot! Nice timing with the comment, I just published the next video a couple hours back! 😀
please keep making these videos!! your explanations are absolutely amazing, engaging, to the point, intuitive, and very easy to understand!!!
The best explanation I've ever seen. It totally clicked for me! Thanks!!
Awesome! Glad to hear it! :)
First time ever I've commented on a video, but I just had to: the way you make complex concepts understandable is awesome! Best explanation of attention I've come across so far 👍
Nice, I am a fellow non-commenter as well! Glad to see you here, and thanks for all the appreciation!
@2:14 You've suddenly jumped from a 512-dimensional vector to a 2-dimensional vector. But how? Please explain what happened there, because I think a key portion of the video got cut out. Thanks
As I mentioned around 3:06, the 2D thing was an example. The Q/K/V embedding size is an arbitrary hyperparameter so it can be set to anything. I used the size 2 example just to illustrate how the "dot product" works since it is easy to show the cosine similarity between two 2D vectors as in 2:51.
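To make that concrete, here's a tiny sketch of the dot product / cosine similarity between two 2D vectors (the numbers are made up purely for illustration, not taken from the video):

```python
import numpy as np

# Two made-up 2D embeddings, purely to illustrate the dot product.
q = np.array([0.9, 0.1])
k = np.array([0.8, 0.3])

dot = q @ k                                            # raw dot product
cos = dot / (np.linalg.norm(q) * np.linalg.norm(k))    # cosine similarity

print(f"dot product: {dot:.3f}, cosine similarity: {cos:.3f}")
# Vectors pointing in similar directions give a high score,
# which is exactly what the attention scores measure.
```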
@@avb_fj Ah, I got it now. Thank you!
But how do you actually produce a query embedding from a query? Is there a video on building the key and query neural networks that do that? Especially interesting is the part where the query embedding is learned in a way that corresponds to the key embedding's vector coordinates. I am assuming that using the same word embeddings for both should take care of it somehow, but it would be great to see the actual technique that is used. Thank you!
So an embedding can be obtained by passing your input through a neural network. For the case of text, we can use anything from word embeddings to RNNs/LSTMs to convert input text into embeddings. In my channel there are a couple of helpful videos:
ua-cam.com/video/0P6-6KhBmZM/v-deo.html
ua-cam.com/video/uocYQH0cWTs/v-deo.html
But there are plenty of resources online too! Good luck!
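For a rough picture of how query/key embeddings could be produced from text, here's a minimal PyTorch-style sketch; the module choices and sizes are my own illustrative assumptions, not the exact setup from the videos:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, qk_dim = 10_000, 512, 64  # arbitrary hyperparameters

# Word embeddings turn token ids into dense vectors.
word_embed = nn.Embedding(vocab_size, embed_dim)

# Separate linear "query" and "key" networks project those vectors into the
# query/key space. Sharing the word embeddings underneath is what lets the
# learned query and key spaces line up during training.
query_net = nn.Linear(embed_dim, qk_dim)
key_net = nn.Linear(embed_dim, qk_dim)

tokens = torch.tensor([[12, 7, 431]])   # a made-up 3-token sentence
x = word_embed(tokens)                  # (1, 3, 512)
Q, K = query_net(x), key_net(x)         # (1, 3, 64) each
```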
These videos are prepared in a very thought-provoking way...
I think the weights/biases of the system are in the query, key, and value networks shown at @15:30, and all training occurs there. Therefore, can we say that what neural attention learns are embeddings? On the other hand, the part softmax(QK^T/sqrt(d_k))V is fixed and apparently does not learn anything during training.
Thank you very much again for these very well prepared videos.
The Query, Keys, and Values are indeed embeddings that must be optimized, by updating the Query, Key, and Value neural networks. The softmax(QK^T/sqrt(d_k))V part is the "computation graph" that inputs the embeddings and transforms them into new "contextually aware embeddings".
Consider the below example:
Say you have 2 inputs, a and b, and you want to train a neural net to predict C. You can model your network as Y = F(a) + G(b). That is, we are saying: "some function of a and some function of b will add up to be Y." We will optimize the functions F and G such that Y gets close to the target C.
1) F and G are analogous to the Q, K, V neural networks.
2) F(a) and G(b) are analogous to the Q, K, V embeddings.
3) The + sign is the computation graph that combines a & b to make a prediction Y. In the attention formula, this + sign is equivalent to softmax(Q . K^T) V.
4) Finally, since we wanted to predict C, not Y, we calculate the loss between C and Y, and then optimize the weights/biases of the neural networks F and G so that Y (our prediction) gets closer to C (our target). The gradients of the loss flow right through the + operation and through F and G.
So yeah, the input collections are passed through the Q, K, V networks to derive the Q, K, V embeddings. The softmax(...) portion is the computation that combines these embeddings. The softmax(...) operation doesn't contain any parameters to train, as you mentioned, but it forms the backbone/computation graph of how the forward pass and backward propagation work. Hope that helps.
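Here's a minimal sketch of that split in PyTorch (the layer sizes are arbitrary assumptions): the Query/Key/Value networks hold all the learnable parameters, while the softmax(QK^T/sqrt(d_k))V step is a fixed computation that gradients simply flow through.

```python
import math
import torch
import torch.nn as nn

embed_dim, d_k = 512, 64  # arbitrary sizes for illustration

# Learnable parts: the Query, Key, and Value networks.
query_net = nn.Linear(embed_dim, d_k)
key_net = nn.Linear(embed_dim, d_k)
value_net = nn.Linear(embed_dim, d_k)

x = torch.randn(1, 5, embed_dim)   # 5 input embeddings (made-up data)
Q, K, V = query_net(x), key_net(x), value_net(x)

# Fixed computation graph: no trainable parameters here, but gradients
# flow through it back into the Q/K/V networks during training.
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (1, 5, 5)
weights = torch.softmax(scores, dim=-1)
context = weights @ V                               # contextually aware embeddings
```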
@@avb_fj Thanks a lot. Right now I'm listening to the self-attention part. When I'm done with the third part, I'll come back here and read your reply again more carefully.
Excellent video. My only critique is that the concept of a hidden state is used around @13:15 without being defined. After @13:13, it becomes a little too fast and the concepts become a blur.
Appreciate all the feedback! Thanks for sharing your experience... I thought getting into LSTMs and RNNs would be a bit of a rabbit hole for this video since not all of it is relevant to the primary topic, so I stayed at the surface level with the hidden state stuff and focussed more on the "attention" portions of the video.
AMAZING TUTORIAL, I am definitely recommending your video to all my friends who want to learn Deep Learning in an easy way. Greetings from Brazil! And keep up the good work!
Thanks!! Totally made my day!
Is there a link to your contrastive learning video?
Check out the first quarter of this video: Multimodal AI from First Principles - Neural Nets that can see, hear, AND write.
ua-cam.com/video/-llkMpNH160/v-deo.html
you will be a star teacher on youtube if you keep it up (and if that is your goal). thank you this was very good, subscribed.
Wow, that's got to be one of the kindest comments I've ever received! Thanks a lot… glad you enjoyed it!
you definitely have that special "knack" some of the best teachers have, and a very soothing tone to boot. eagerly waiting for parts 2 and 3. cheers! @@avb_fj
I watched the whole series and it is a real gem. But something is missing: where are the nonlinearity, the weights, and the biases? What do we train?
The weights, biases, and non-linearity come from the Query, Key, and Value neural networks. These convert the input embeddings into the query, key, value embeddings respectively - which then go through the attention computation.
We can also add feed-forward layers after the attention layer to introduce additional transformations/non-linearity.
Other weights we train can be the initial embeddings of the input collection. Look up word embeddings, for example, which train a special embedding vector for each word in the vocabulary. There can also be separate neural networks to embed each input type. For example, suppose you are trying to learn attention between a bunch of images and a sentence. The images can have their own image-encoding neural network, and the sentence can have a text-encoding neural network. All of these nets have their own weights and biases according to whatever the end goal is. Once the forward pass is defined, we compute the loss between the network's prediction and the target. Through backpropagation, all learnable parameters then get updated.
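As a rough sketch (the sizes and module names below are my own assumptions), you can list exactly which parameters get trained; note that the attention formula itself contributes none:

```python
import torch.nn as nn

embed_dim, d_k, vocab_size = 512, 64, 10_000  # illustrative sizes

# Everything trainable lives in these modules; the attention computation
# softmax(QK^T/sqrt(d_k))V has no parameters of its own.
model = nn.ModuleDict({
    "word_embed": nn.Embedding(vocab_size, embed_dim),  # learned input embeddings
    "query_net": nn.Linear(embed_dim, d_k),
    "key_net": nn.Linear(embed_dim, d_k),
    "value_net": nn.Linear(embed_dim, d_k),
    "feed_forward": nn.Sequential(                      # extra non-linearity
        nn.Linear(d_k, 4 * d_k), nn.ReLU(), nn.Linear(4 * d_k, d_k)
    ),
})

for name, p in model.named_parameters():
    print(name, tuple(p.shape))  # these are the weights/biases backprop updates
```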
great video!! your explanations and graphics are amazing. love the content
Thanks!😊
Very Clear Explanation. Thanks for that.
Maybe an unimportant point, but @2:00 the vector Q looks like a column vector. I think it should be a row vector.
Yeah, a row vector would be more accurate for the QK^T stuff that happens later. Thanks for pointing that out.
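A quick shape check (with purely illustrative dimensions) shows why the row-vector convention works out for QK^T:

```python
import numpy as np

d_k, n_keys = 2, 4                  # tiny illustrative dimensions
Q = np.random.rand(1, d_k)          # one query as a row vector: (1, 2)
K = np.random.rand(n_keys, d_k)     # 4 keys stacked as rows:    (4, 2)

scores = Q @ K.T                    # (1, 2) @ (2, 4) -> (1, 4): one score per key
print(scores.shape)
```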
I was so confused by the dimensions that I was questioning whether I had learned matrices right, lol. I knew something didn't match. But in numpy they're columns by default, so we would also need the transpose of Q as well.
Very good explanation, thanks 😄
great explanation!
Thanks!!
I think the embeddings must be normalized for the dot product to make sense.
Is that a One Piece shirt?! Love it man! 🎉
Also, great content as always.
Haha thanks!🙏🏽
Good videos but a bit fast
Public service announcement: this overview is unintelligible as is. Sorry, going to have to downvote this; too many completely unexplained abstractions and jargon, and it glosses over the key details that are absolutely required for anyone to grasp 'attention' mechanisms. E.g., you don't even define 'mean pooling', etc. (this isn't even a common ML term; I've written a transformer and have never heard this jargon used.)