The idea of explicitly explaining and breaking the concept down into pieces, even using a whiteboard of sorts, makes the video absolutely helpful and useful! We're fed up with naive academic explanations.
Stumbled upon this gem whilst preparing a session on transformers and it has been soooo helpful, Aleksa. Just understanding the code took a few days, and then the concepts clicked. I really admire that you coded this up from scratch; that's probably the only way I would have ever understood the topic. Well done, your channel is amazing! Greetings from Switzerland!
Thank you, Natasa!! :))
Great complement to the previous Attention video, it made the training process much clearer. Thanks!
Awesome! Glad to hear that ^^
I am a genomics data scientist and I don't think I can explain how helpful I've found your videos. By far the best resource on YouTube. Your methods really suit my personal approach.
Thanks a lot Chris I really appreciate it!
Btw, how did you find my channel? That may help me better reach more people, as the channel is still in its infancy and YT is still not recommending most of my videos.
@@TheAIEpiphany I think I found it on reddit, I see you've posted some videos on r/deeplearning. It is also possible that I found your video as a recommendation after watching videos on the WelcomeAIOverlords channel.
@@christophermarais5253 Awesome, thank you!
Bro, thank you so much for this video!!! It is helping me a lot through my PhD!! haha
Your transformer videos are still great!
Such a great video and an awesome multi-head attention explanation. Thanks a lot for your effort.
Super useful! Thanks for the detailed explanation. One does not get the practical insight just by reading the paper, and reading code alone can sometimes be too obfuscating. Keep up the great work. Love it!
Thank you for making this video, I learned a lot
I wish I had this resource when I was trying to understand BERT when it came out. You make everything so clear, great job!
Glad to hear that Daven! Thank you!
Amazing video. Please make more vids like this; we need this level of detail.
Hi, this is an amazing video! It deserves more likes and views.
It helped me a lot to understand the lower-level details of the Transformer! Thank you very much!
Glad to hear that!
Really love the complexity of these architectures. But I find myself at times wondering how someone could actually have thought of this stuff. It's beyond amazing.
Also, I love your videos, man. You have helped me understand various concepts, especially Graph Neural Networks, which are beasts in terms of prior knowledge. If I have one suggestion after watching many of your videos, it would be to get a pen or a tablet so you can write things both faster and more clearly for us.
Congratulations on the channel. Very instructive. Keep on with the awesome work!
Thanks Murilo!
Great video! I dare you to take a big swig of beer each time he says 'basically'. Basically, you would be drunk by the end of the video.
Lol I know man, it's hard to do it real-time though so lots of sucky filler words 🤣
Very useful video. Thank you.
This is a great resource. Thank you so much for doing this.
We need more videos like this, so informative.😀
Videos like this make it so much easier to go through the paper/code afterwards.
Awesome that's what I love to hear!
Excellent job done! Super useful, keep it up!
Nice job. Thank you!
This is the best video ever.
Thanks, bro! Salute! I want more like this!
In Jay Alammar's blog, it says that for each head we have independent matrices Qi, Ki, Vi, where i varies from 1 to 8.
Best explanation
This was a great video. Maybe you can do some coding videos using hugging face for example.
Thanks! Hmm I wasn't planning on doing those, thanks for the feedback.
Hey! I didn't really understand what was happening at 11:18 with the pointwise-apply step. How are you multiplying the token embeddings by the Q (and K and V) matrices?
How does this multiplication result in a new 1x512 matrix?
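For anyone with the same question: assuming the projection matrices in that view are 512x512 (the pre-split view), each 1x512 token embedding is simply matrix-multiplied by them, which yields another 1x512 vector. A minimal sketch with random weights, just to show the shapes:

```python
import torch

d_model = 512
x = torch.randn(1, d_model)          # one token embedding, shape (1, 512)

# learned projection matrices (random here, purely for illustration)
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

q = x @ W_q                          # (1, 512) @ (512, 512) -> (1, 512)
k = x @ W_k
v = x @ W_v
print(q.shape, k.shape, v.shape)     # each is torch.Size([1, 512])
```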
Awesome! Just what I needed.
Thank you for your work!
Thanks for this video! I think there is a small error: the positional encodings are intended to encode the position of a token in the sequence, not in the embedding table. As such, @9:00 for "how" you'd take not the ID 2, but rather its position in the sentence, meaning 0. Or am I missing something? At least that is generally how positional encodings are used, to my understanding.
that's also how I understood it
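Same understanding here: the sinusoidal table is indexed by the token's position in the sentence (0, 1, 2, ...), not by its vocabulary ID. A small sketch of that lookup, following the formula from the paper (PyTorch used just for illustration):

```python
import torch

def sinusoidal_pe(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000**(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000**(2i / d_model))
    pos = torch.arange(max_len).unsqueeze(1).float()   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2).float()        # (d_model / 2,)
    angles = pos / (10000 ** (two_i / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=50, d_model=512)
# "how" is the first token of the sentence, so we take row 0,
# regardless of its vocabulary ID (2 in the video's example)
pe_for_how = pe[0]
```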
What is the max size of the input sentence that can be fed?
Depends on how much memory you've got - no limits from the model side
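A rough, shape-only illustration of why memory is the practical constraint: each attention head builds an n x n score matrix, so memory grows roughly quadratically with the sequence length n (the sizes below are made up):

```python
import torch

n, d_model, n_heads = 1000, 512, 8
d_head = d_model // n_heads                # 64

q = torch.randn(n_heads, n, d_head)
k = torch.randn(n_heads, n, d_head)

scores = q @ k.transpose(-2, -1)           # (8, 1000, 1000): grows as n**2
print(scores.shape)
```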
@The AI Epiphany, the video is great, however I have a small doubt. In the Multi-Head Attention module, do we use the 512-d word embedding to get a 512-d vector which is then split into 64-d chunks (assuming we have 8 heads), or do we use a separate 512x64 projection matrix per head to get a 64-dimensional vector from the 512-d word embedding? In jalammar.github.io/illustrated-transformer/ he mentions the latter, but in this video the explanation provided is the former. Can you tell which one is right?
@@TheAIEpiphany Bumping this question.
Indeed, there is a disparity between Jay Alammar's blog and your explanation of the calculations of the multi-headed attention mechanism.
Thanks!
The latter is correct, I think. It makes more sense for each head to still see the entire embedding, encoded in a simpler (lower-dimensional) form, rather than each head just getting a fraction of the embedding vector.
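For what it's worth, the two descriptions are mathematically equivalent in the usual implementations: eight separate 512x64 matrices placed side by side form exactly one 512x512 matrix whose output you then split into eight 64-d chunks. A small sketch with random weights, just to show the equivalence:

```python
import torch

d_model, n_heads = 512, 8
d_head = d_model // n_heads                 # 64
x = torch.randn(1, d_model)                 # one token embedding

# view 1: eight separate 512x64 projections, one per head (Jay Alammar's picture)
W_per_head = [torch.randn(d_model, d_head) for _ in range(n_heads)]
heads_separate = [x @ W for W in W_per_head]            # 8 tensors of shape (1, 64)

# view 2: one fused 512x512 projection whose output is split into 8 chunks of 64
W_fused = torch.cat(W_per_head, dim=1)                  # (512, 512)
heads_fused = (x @ W_fused).view(1, n_heads, d_head)    # (1, 8, 64)

# same numbers, just laid out differently
assert torch.allclose(heads_separate[0], heads_fused[:, 0])
```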
Very useful
But are the word embeddings for English and German (at 8:00) computed and trained separately and provided to the Transformer model as given dictionaries?
Hail to the almighty transformer! Make sure to check out this Jupyter notebook/repo as a supplement to this video and the paper: github.com/gordicaleksa/pytorch-original-transformer
It is strange and amazing how you can simply add the word and the positional embeddings together as input and somehow the model still works. How would the model know which part of the combined input comes from the word and which from the position?
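For context, in the paper the two are summed element-wise in the same 512-d space, and the model presumably learns to disentangle them because the positional patterns are very regular. A tiny sketch of that input step (toy token IDs, and a random table standing in for the sinusoidal one):

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 37000, 512, 50
tok_emb = nn.Embedding(vocab_size, d_model)
pos_table = torch.randn(max_len, d_model)    # stand-in for the sinusoidal table

ids = torch.tensor([[2, 8, 15]])             # e.g. "how are you" as toy token IDs
x = tok_emb(ids) + pos_table[: ids.size(1)]  # (1, 3, 512): a sum, same dimensionality
```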
Thank you for the video, it's very helpful for newcomers trying to understand the attention mechanism. I had a few questions:
1. Why is an add operation used for the positional encoding? Is there any difference compared to other operations? And does position really strictly matter, given that a similar sentence can change all the embedding vectors? e.g. "hi, how are you to day?" (in this case, all positions are changed)
2. Related to the position from question 1, is there any way to get a benefit from attention in terms of backpropagation through time, like in RNNs and LSTMs?
3. You mentioned that the final representation vector from the decoder's multi-head attention combines Q vectors from the German tokens and K/V vectors from the English tokens. But what if the number of tokens after translating is different? How can you compose the final vector?
Hey nice questions.
1. Summation is a really good way to combine features, and positional encodings do matter, since swapping the word order can sometimes change the semantics (sentence meaning).
2. There is no need to do that with transformers, unless you're asking what happens when the context gets too big to fit into memory.
3. You use the target token representations (German) as the queries and the source tokens as the keys and values, and just compute the dot products between queries and keys as usual. Nothing else changes.
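To make point 3 concrete, a shape-only sketch (random tensors, toy lengths): the output has one vector per query, i.e. per target-side token, so differing source and target lengths are not a problem.

```python
import torch
import torch.nn.functional as F

d_model = 512
tgt_len, src_len = 4, 6                 # e.g. 4 German tokens, 6 English tokens

q = torch.randn(tgt_len, d_model)       # queries from the decoder (German side)
k = torch.randn(src_len, d_model)       # keys from the encoder (English side)
v = torch.randn(src_len, d_model)       # values from the encoder (English side)

scores = q @ k.T / d_model ** 0.5       # (4, 6): each German token scores every English token
attn = F.softmax(scores, dim=-1)
out = attn @ v                          # (4, 512): one output vector per *target* token
```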
@@TheAIEpiphany thank you for your answers, they are clear!
At 14:25 you take the dot product of the Q vector for "how" against all 5 K vectors, which should result in 5 numbers. But a minute later you end up with a single number instead, 3.2. Is there some addition happening between the dot-product results?
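For anyone else pausing at the same spot: dotting one query against 5 keys does give 5 separate scores, one per key, and a single value like 3.2 would just be one entry of that score vector (no addition across keys). A toy sketch with random numbers, not the ones from the video:

```python
import torch

d_k = 64
q_how = torch.randn(d_k)            # query vector for "how"
K = torch.randn(5, d_k)             # keys for all 5 tokens

scores = K @ q_how / d_k ** 0.5     # shape (5,): one scaled score per key, nothing is summed
print(scores)                       # 5 numbers; a value like 3.2 would be just one of them
```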
Can you explain the encoder Nx parameter at 22:31?
We repeat the block Nx times, and each time we get a 512-d vector for each word (how are you to day). How do we process the outputs of the Nx repetitions?
Thanks!
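My understanding, as a sketch rather than the exact code from the repo: the N layers are simply stacked, so each layer's output (still one 512-d vector per token) becomes the next layer's input, and only the last layer's output is handed to the decoder. Using PyTorch's built-in encoder layer as a stand-in:

```python
import torch
import torch.nn as nn

d_model, n_layers = 512, 6

# stand-in encoder layers (the repo defines its own, but the wiring is the same)
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=d_model, nhead=8) for _ in range(n_layers)]
)

x = torch.randn(5, 1, d_model)   # 5 tokens ("how are you to day"), batch of 1
for layer in layers:
    x = layer(x)                 # the output of layer i becomes the input of layer i+1
# x is still (5, 1, 512); only this final output feeds the decoder's cross-attention
```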
Great job!
I think you still have room to improve.
Redo the same set of ideas but with better representations; no need for animations, just your presentation style.
Transformers are still vaguely covered on YT.
Great content, really simple and informative!
I had a question regarding the embeddings. In the paper, they say they share the weights between the input embeddings, the output embeddings, and the linear layer at the output of the decoder.
I understand from the paper that "Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens."
Did you implement this weight sharing in your implementation of the paper? Because if you don't use BPE, it seems the vocab sizes of each language are huge, and combined, even bigger...
Cheers!
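For reference (not necessarily how the repo does it), with a shared BPE vocabulary the weight sharing usually comes down to pointing all three modules at the same parameter tensor; a minimal PyTorch sketch:

```python
import torch.nn as nn

vocab_size, d_model = 37000, 512     # shared source-target BPE vocab, as in the paper

src_embed = nn.Embedding(vocab_size, d_model)
tgt_embed = nn.Embedding(vocab_size, d_model)
generator = nn.Linear(d_model, vocab_size, bias=False)   # pre-softmax projection

# tie all three to one (vocab_size, d_model) weight matrix;
# this only makes sense because source and target share the vocabulary
tgt_embed.weight = src_embed.weight
generator.weight = src_embed.weight
```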
Could you make a BPE video? Sounds interesting, right?
Thanks for the feedback I'll put that on my backlog! If more people ask for it I'll do it sooner.
Now make the backward pass 😂😂😂😂
Some steps were not entirely clear from the drawings I think.
Link the timestamp I'll be glad to clarify!
i love you