The idea of explicitly explaining and breaking the concept down into pieces, even using a whiteboard of sorts, makes the video absolutely helpful and useful! We're fed up with naive academic explanations.
Stumbled upon this gem whilst preparing a session on transformers and it has been soooo helpful, Aleksa. Just understanding the code took a few days, and then the concepts clicked. I really admire that you coded this up from scratch; that's probably the only way I would have ever understood the topic. Well done, your channel is amazing! Greetings from Switzerland!
Thank you, Natasa!! :))
Great complement to the previous Attention video, it made the training process much clearer. Thanks!
Awesome! Glad to hear that ^^
I am a genomics data scientist and I don't think I can explain how helpful I've found your videos. By far the best resource on YouTube. Your methods really suit my personal approach.
Thanks a lot Chris I really appreciate it!
Btw, how did you find my channel? That may help me better reach more people, as the channel is still in its infancy and YT is still not recommending most of my videos.
@@TheAIEpiphany I think I found it on reddit, I see you've posted some videos on r/deeplearning. It is also possible that I found your video as a recommendation after watching videos on the WelcomeAIOverlords channel.
@@christophermarais5253 Awesome, thank you!
Bro, thank you so much for this video!!! It is helping me a lot through my PhD!! haha
Your transformer videos are still great!
Such a great video and an awesome multi-head attention explanation. Thanks a lot for your effort.
Super useful! Thanks for the detailed explanation. One does not get the practical insight just by reading the paper, and reading code alone can sometimes be too obfuscating. Keep up the great work. Love it!
Thank you for making this video, I learned a lot
I wish I had this resource when I was trying to understand BERT when it came out. You make everything so clear, great job!
Glad to hear that Daven! Thank you!
Amazing video. Please make more vids like this; we need this level of detail.
Hi, this is an amazing video! It deserves more likes and views.
It helped me a lot to understand the lower-level details of the Transformer! Thank you very much!
Glad to hear that!
Really love the complexity of these architectures. But I find myself at times wondering how someone could actually have thought of this stuff. It's beyond amazing.
Also, I love your videos, man. You have helped me understand various concepts, especially Graph Neural Networks, which are beasts in terms of prior knowledge. If I have one suggestion after watching many of your videos, it would be to get a pen or a tablet so you can write things both faster and more clearly for us.
Congratulations on the channel. Very instructive. Keep on with the awesome work!
Thanks Murilo!
Great video! I dare you to take a big swig of beer each time he says 'basically'. Basically, you would be drunk by the end of the video.
Lol I know man, it's hard to do it real-time though so lots of sucky filler words 🤣
Very useful video. Thank you.
This is a great resource. Thank you so much for doing this.
We need more videos like this, so informative.😀
Videos like this make it so much easier to go through the paper/code afterwards.
Awesome that's what I love to hear!
Excellent job done! Super useful, keep it up!
Nice job. Thank you!
This is the best video ever.
Thanks, bro! Salute! I want more like this!
In Jay Alammar's blog, it says that for each head we have independent matrices Qi, Ki, Vi, where i varies from 1 to 8.
Best explanation
This was a great video. Maybe you can do some coding videos using hugging face for example.
Thanks! Hmm I wasn't planning on doing those, thanks for the feedback.
Hey! I didn't really understand what was happening at 11:18 with the pointwise-apply step. How are you multiplying the token embeddings by the Q (and K and V) matrices?
How does this multiplication result in a new 1x512 matrix?
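For anyone with the same question: assuming the projection matrices in that view are 512x512 (the pre-split view), each 1x512 token embedding is simply matrix-multiplied by them, which yields another 1x512 vector. A minimal sketch with random weights, just to show the shapes:

```python
import torch

d_model = 512
x = torch.randn(1, d_model)          # one token embedding, shape (1, 512)

# learned projection matrices (random here, purely for illustration)
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

q = x @ W_q                          # (1, 512) @ (512, 512) -> (1, 512)
k = x @ W_k
v = x @ W_v
print(q.shape, k.shape, v.shape)     # each is torch.Size([1, 512])
```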
Awesome! Just what I needed.
Thank you for your work!
Thanks for this video! I think there is a small error: the positional encodings are intended to encode the position of a token in the sequence, not in the embedding table. As such, @9:00 for "how" you'd take not the ID 2, but rather its position in the sentence, meaning 0. Or am I missing something? At least that is generally how positional encodings are used, to my understanding.
that's also how I understood it
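Same understanding here: the sinusoidal table is indexed by the token's position in the sentence (0, 1, 2, ...), not by its vocabulary ID. A small sketch of that lookup, following the formula from the paper (PyTorch used just for illustration):

```python
import torch

def sinusoidal_pe(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000**(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000**(2i / d_model))
    pos = torch.arange(max_len).unsqueeze(1).float()   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2).float()        # (d_model / 2,)
    angles = pos / (10000 ** (two_i / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=50, d_model=512)
# "how" is the first token of the sentence, so we take row 0,
# regardless of its vocabulary ID (2 in the video's example)
pe_for_how = pe[0]
```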
What is the max size of the input sentence that can be fed?
Depends on how much memory you've got - no limits from the model side
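A rough, shape-only illustration of why memory is the practical constraint: each attention head builds an n x n score matrix, so memory grows roughly quadratically with the sequence length n (the sizes below are made up):

```python
import torch

n, d_model, n_heads = 1000, 512, 8
d_head = d_model // n_heads                # 64

q = torch.randn(n_heads, n, d_head)
k = torch.randn(n_heads, n, d_head)

scores = q @ k.transpose(-2, -1)           # (8, 1000, 1000): grows as n**2
print(scores.shape)
```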
@The AI Epiphany, the video is great, however I have a small doubt. In the Multi-Head Attention module, do we use the 512-d word embedding to get a 512-d vector which is then split into 64-d chunks (assuming we have 8 heads), or do we use a separate 512x64 projection matrix per head to get a 64-dimensional vector from the 512-d word embedding? In jalammar.github.io/illustrated-transformer/ he mentions the latter, but in this video the explanation provided is the former. Can you tell which one is right?
@@TheAIEpiphany Bumping this question.
Indeed, there is a disparity between Jay Alammar's blog and your explanation of the calculations of the multi-headed attention mechanism.
Thanks!
The latter is correct, I think. It makes more sense for each head to still see the entire embedding, encoded in a simpler (lower-dimensional) form, rather than each head just getting a fraction of the embedding vector.
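For what it's worth, the two descriptions are mathematically equivalent in the usual implementations: eight separate 512x64 matrices placed side by side form exactly one 512x512 matrix whose output you then split into eight 64-d chunks. A small sketch with random weights, just to show the equivalence:

```python
import torch

d_model, n_heads = 512, 8
d_head = d_model // n_heads                 # 64
x = torch.randn(1, d_model)                 # one token embedding

# view 1: eight separate 512x64 projections, one per head (Jay Alammar's picture)
W_per_head = [torch.randn(d_model, d_head) for _ in range(n_heads)]
heads_separate = [x @ W for W in W_per_head]            # 8 tensors of shape (1, 64)

# view 2: one fused 512x512 projection whose output is split into 8 chunks of 64
W_fused = torch.cat(W_per_head, dim=1)                  # (512, 512)
heads_fused = (x @ W_fused).view(1, n_heads, d_head)    # (1, 8, 64)

# same numbers, just laid out differently
assert torch.allclose(heads_separate[0], heads_fused[:, 0])
```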
Very useful
But are the word embeddings for English and German (at 8:00) computed and trained separately and provided to the Transformer model as given dictionaries?
Hail to the almighty transformer! Make sure to check out this Jupyter notebook/repo as a supplement to this video and the paper: github.com/gordicaleksa/pytorch-original-transformer
It is strange and amazing how you can simply add the word and the positional embeddings together as input and somehow the model still works. How would the model know which part of the combined input comes from the word and which from the position?
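For context, in the paper the two are summed element-wise in the same 512-d space, and the model presumably learns to disentangle them because the positional patterns are very regular. A tiny sketch of that input step (toy token IDs, and a random table standing in for the sinusoidal one):

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 37000, 512, 50
tok_emb = nn.Embedding(vocab_size, d_model)
pos_table = torch.randn(max_len, d_model)    # stand-in for the sinusoidal table

ids = torch.tensor([[2, 8, 15]])             # e.g. "how are you" as toy token IDs
x = tok_emb(ids) + pos_table[: ids.size(1)]  # (1, 3, 512): a sum, same dimensionality
```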
Thank you for the video, it's very helpful for newcomers trying to understand the attention mechanism. I had a few questions:
1. Why is an add operation used for the positional encoding? Is there any difference compared to other operations? And does position really strictly matter, given that a similar sentence can change all the embedding vectors? e.g. "hi, how are you to day?" (in this case, all positions are changed)
2. Related to the position from question 1, is there any way to get a benefit from attention in terms of backpropagation through time, like in RNNs and LSTMs?
3. You mentioned that the final representation vector from the decoder's multi-head attention combines Q vectors from the German tokens and K/V vectors from the English tokens. But what if the number of tokens after translating is different? How can you compose the final vector?
Hey nice questions.
1. Summation is a really good way to combine features, and positional encodings do matter, since swapping the word order can sometimes change the semantics (sentence meaning).
2. There is no need to do that with transformers, unless you're asking what happens when the context gets too big to fit into memory.
3. You use the target token representations (German) as the queries and the source tokens as the keys and values, and just compute the dot products between queries and keys as usual. Nothing else changes.
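To make point 3 concrete, a shape-only sketch (random tensors, toy lengths): the output has one vector per query, i.e. per target-side token, so differing source and target lengths are not a problem.

```python
import torch
import torch.nn.functional as F

d_model = 512
tgt_len, src_len = 4, 6                 # e.g. 4 German tokens, 6 English tokens

q = torch.randn(tgt_len, d_model)       # queries from the decoder (German side)
k = torch.randn(src_len, d_model)       # keys from the encoder (English side)
v = torch.randn(src_len, d_model)       # values from the encoder (English side)

scores = q @ k.T / d_model ** 0.5       # (4, 6): each German token scores every English token
attn = F.softmax(scores, dim=-1)
out = attn @ v                          # (4, 512): one output vector per *target* token
```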
@@TheAIEpiphany thank you for your answers, they are clear!
At 14:25 you take the dot product of the Q vector for "how" against all 5 K vectors, which should result in 5 numbers. But a minute later you end up with a single number instead, 3.2. Is there some addition happening between the dot-product results?
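For anyone else pausing at the same spot: dotting one query against 5 keys does give 5 separate scores, one per key, and a single value like 3.2 would just be one entry of that score vector (no addition across keys). A toy sketch with random numbers, not the ones from the video:

```python
import torch

d_k = 64
q_how = torch.randn(d_k)            # query vector for "how"
K = torch.randn(5, d_k)             # keys for all 5 tokens

scores = K @ q_how / d_k ** 0.5     # shape (5,): one scaled score per key, nothing is summed
print(scores)                       # 5 numbers; a value like 3.2 would be just one of them
```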
Can you explain the encoder Nx parameter at 22:31?
We repeat the block Nx times, and each time we get a 512-d vector for each word (how are you to day). How do we process the outputs of the Nx repetitions?
Thanks!
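My understanding, as a sketch rather than the exact code from the repo: the N layers are simply stacked, so each layer's output (still one 512-d vector per token) becomes the next layer's input, and only the last layer's output is handed to the decoder. Using PyTorch's built-in encoder layer as a stand-in:

```python
import torch
import torch.nn as nn

d_model, n_layers = 512, 6

# stand-in encoder layers (the repo defines its own, but the wiring is the same)
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=d_model, nhead=8) for _ in range(n_layers)]
)

x = torch.randn(5, 1, d_model)   # 5 tokens ("how are you to day"), batch of 1
for layer in layers:
    x = layer(x)                 # the output of layer i becomes the input of layer i+1
# x is still (5, 1, 512); only this final output feeds the decoder's cross-attention
```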
Great job!
I think you still have room to improve.
Redo the same set of ideas but with better representations; no need for animations, just your presentation style.
Transformers are still vaguely covered on YT.
Great content, really simple and informative!
I had a question regarding the embeddings. In the paper, they say they share the weights between the input embeddings, the output embeddings, and the linear layer at the output of the decoder.
I understand from the paper that "Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens."
Did you implement this weight sharing in your implementation of the paper? Because if you don't use BPE, it seems the vocab sizes of each language are huge, and combined, even bigger...
Cheers!
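For reference (not necessarily how the repo does it), with a shared BPE vocabulary the weight sharing usually comes down to pointing all three modules at the same parameter tensor; a minimal PyTorch sketch:

```python
import torch.nn as nn

vocab_size, d_model = 37000, 512     # shared source-target BPE vocab, as in the paper

src_embed = nn.Embedding(vocab_size, d_model)
tgt_embed = nn.Embedding(vocab_size, d_model)
generator = nn.Linear(d_model, vocab_size, bias=False)   # pre-softmax projection

# tie all three to one (vocab_size, d_model) weight matrix;
# this only makes sense because source and target share the vocabulary
tgt_embed.weight = src_embed.weight
generator.weight = src_embed.weight
```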
Could you make a BPE video? Sounds interesting, right?
Thanks for the feedback I'll put that on my backlog! If more people ask for it I'll do it sooner.
Now make the backward pass 😂😂😂😂
Some steps were not entirely clear from the drawings I think.
Link the timestamp I'll be glad to clarify!
i love you