PyTorch Transformers from Scratch (Attention is all you need)
- Published 31 May 2024
- In this video we read the original transformer paper "Attention is all you need" and implement it from scratch!
Attention is all you need paper:
arxiv.org/abs/1706.03762
A good blogpost on Transformers:
www.peterbloem.nl/blog/transfo...
❤️ Support the channel ❤️
/ @aladdinpersson
Paid Courses I recommend for learning (affiliate links, no extra cost for you):
⭐ Machine Learning Specialization bit.ly/3hjTBBt
⭐ Deep Learning Specialization bit.ly/3YcUkoI
📘 MLOps Specialization bit.ly/3wibaWy
📘 GAN Specialization bit.ly/3FmnZDl
📘 NLP Specialization bit.ly/3GXoQuP
✨ Free Resources that are great:
NLP: web.stanford.edu/class/cs224n/
CV: cs231n.stanford.edu/
Deployment: fullstackdeeplearning.com/
FastAI: www.fast.ai/
💻 My Deep Learning Setup and Recording Setup:
www.amazon.com/shop/aladdinpe...
GitHub Repository:
github.com/aladdinpersson/Mac...
✅ One-Time Donations:
Paypal: bit.ly/3buoRYH
▶️ You Can Connect with me on:
Twitter - / aladdinpersson
LinkedIn - / aladdin-persson-a95384153
Github - github.com/aladdinpersson
OUTLINE:
0:00 - Introduction
0:54 - Paper Review
11:20 - Attention Mechanism
27:00 - TransformerBlock
32:18 - Encoder
38:20 - DecoderBlock
42:00 - Decoder
46:55 - Putting it together to form The Transformer
52:45 - A Small Example
54:25 - Fixing Errors
56:44 - Ending
First thanks for this amazing video, but I have one question regarding the implementation of Self Attention.
To distribute values, keys and queries to heads you just did a reshape for the input, while the original paper suggested to do projection using trainable matrices.
Am I right, or did I miss something?
@@alhasanalkhaddour434 Yes, I think he already did the projection using self.values, self.keys and self.queries, because these are linear layers. The real inputs come from the parameters passed to the forward function; see 14:43 for more details.
Why did you define self.values and self.keys in the __init__ method? They are not used at all in forward.
it would be far better if you coded with an illustration of the architecture on the side.
Sorry can you share the github link of this special code?
Attention is not all we need, this video is all we need
You're too kind :)
haha, :)
best explanation ever
Attention to this video is all you need
Attention was never enough
I have not found another tutorial this detail-oriented. Now I am completely able to understand the Transformer and the Attention Mechanism. Great work. Thank you 😊
I really appreciate you saying that, thanks a lot :)
@@AladdinPersson Hi! You missed one error in your video. In your GitHub code, you have `self.values = nn.Linear(embed_size, embed_size)`, but in your video, you used `self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)`. I couldn't reproduce your results until I noticed this discrepancy.
I watched 3 Transformer videos before this one and thought I would never understand it. Love the way you explained such a complicated topic.
This is undoubtedly one of the best transformer implementation videos I have seen. Thanks for posting such good content. Looking forward to seeing some more paper implementation videos.
This is one of the best explanation videos about a paper to code I've watched in a loong time! Congratz Aladdin dude!
I have been struggling to implement and understand custom transformer code from various sources. This was perhaps one of the best tutorials.
you're an absolute saint. idk if i can even put it into words the amount of respect and appreciation I have for you man! thank you!
This is cool. It would be helpful to have a section highlighting what parts of the dimensions should be changed if you are using a dataset of a different size or you want to change the input length. ie: keeping the architecture constant but noting how it could be used flexibly
Many thanks to you for this impressive tutorial, amazing job and outstanding explanation, and also thanks for sharing all these resources in the description.
Hi, I really like your channel. I have been learning from your tutorials for a while. Best wishes!
Love your work!! I was very confused when dealing with other tutorials... but your work made me clear about Transformer. I wish only I know you and your work.
I appreciate the kind words 🙏
In the original paper each head should have separate weights, but in your code all heads share the same weights. Here are two steps to fix it:
1. in __init__: self.queries = nn.Linear(self.embed_size, self.embed_size, bias=False) (same for key and value weights)
2. in forward: put "queries = self.queries(queries)" above "queries = queries.reshape(...)" (also same for keys and values)
Great video btw
Hey, thank you so much for bringing this to my attention ;) When reading through the paper I get the same idea that you do, namely that each head should have separate weights, and when reading blog posts like "The Annotated Transformer" he has done exactly what you describe. From the blog post www.peterbloem.nl/blog/transformers he explains narrow vs wide self attention and in his Github implementation he does similarly as I do, however I noticed now that an issue has been raised regarding the same issue you bring up: github.com/pbloem/former/issues/13.
And I agree with the point brought up there also, if each head is using same weights it doesn't feel like you can say they are different. I'm having difficulty finding other implementations, but I will keep a close look at this and if I get some more time I will try to spend more time and investigate this. I'm also a bit surprised that when training on this implementation it provides good results if I remember correctly with only 3x32x32 vs 3x256x256 parameters.
@@AladdinPersson Yes, both methods should work just fine, but I believe using separate weights for each head would give better performance without slowing down the model. It would use more memory of course, but it's almost nothing compared to the number of parameters in the feedforward sublayers.
@@sehbanomer8151 I think your implementation may still have some issues. Since each head should have separate weights, shouldn't there be eight (the number of heads) different head_dim*head_dim linear layers instead of one embed_size*embed_size linear layer? Additionally, these two implementations have a different number of parameters.
@@66dp97 The key, query & value projection of each head will project an _embed_dim_ dimensional vector to _head_dim_ dimensional space, so for each attention head, the projection matrix will have shape (head_dim, embed_dim). Fusing _n_heads_ separate linear layers into a single (embed_dim, head_dim * n_heads) linear layer is more GPU friendly, thus faster.
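The fix discussed in this thread (project with full embed_size layers first, then split into heads) can be sketched roughly as below. This is a minimal sketch, not the video's or the repository's exact code: the class name `MultiHeadSelfAttention`, the einsum strings, and the scaling by the square root of `head_dim` are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Hypothetical sketch: full-width projections first, then split into heads."""

    def __init__(self, embed_size: int, heads: int):
        super().__init__()
        assert embed_size % heads == 0, "embed_size must be divisible by heads"
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads
        # One fused layer per role; each block of head_dim output rows acts as one head's weights.
        self.queries = nn.Linear(embed_size, embed_size, bias=False)
        self.keys = nn.Linear(embed_size, embed_size, bias=False)
        self.values = nn.Linear(embed_size, embed_size, bias=False)
        self.fc_out = nn.Linear(embed_size, embed_size)

    def forward(self, values, keys, queries, mask=None):
        N = queries.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], queries.shape[1]

        # Step 2 of the fix: apply the linear layers BEFORE reshaping into heads.
        values = self.values(values).reshape(N, value_len, self.heads, self.head_dim)
        keys = self.keys(keys).reshape(N, key_len, self.heads, self.head_dim)
        queries = self.queries(queries).reshape(N, query_len, self.heads, self.head_dim)

        # Scores for every (query, key) pair in every head: (N, heads, query_len, key_len).
        energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-inf"))

        # Assumption: scale by sqrt(head_dim), the per-head key dimension d_k.
        attention = torch.softmax(energy / (self.head_dim ** 0.5), dim=3)

        # Weighted sum of values, then concatenate the heads back to embed_size.
        out = torch.einsum("nhql,nlhd->nqhd", attention, values)
        out = out.reshape(N, query_len, self.embed_size)
        return self.fc_out(out)


attn = MultiHeadSelfAttention(embed_size=256, heads=8)
x = torch.randn(2, 5, 256)
out = attn(x, x, x)  # self-attention: values, keys and queries are the same tensor
```

With embed_size=256 and 8 heads, each fused weight has shape (256, 256), which matches eight separate (32, 256) per-head matrices stacked together, as the last reply above describes.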
Compliments for the video, really gives better insight into a complex architecture. Thanks for sharing all this information.
Really thank you!!! This really helps me deeply understand Transformer!!!
It is the best description for transformer implementation.
thank you so much.
best regards.
I found this very helpful. I always used to get confused regarding the tensor sizes. Now it's all clear. Thank you very much. Also this is the first time I came across einsum. Thanks again for that too.
Appreciate the kind words 🙏
Dude, you rock! I bow to your expertise 🙏😊
Yeah, agreed. This was an extremely difficult architecture to implement, with a LOT of moving parts, but this has to be the best walkthrough out there. Sure, there are certain things like the src_mask unsqueeze that were a little tricky to visualize, but even barring that, you broke it down quite well! Thank you for this! I'm so glad that we have all of this implemented in HF/PT hahah
Great work! Really helped me. Thanks.
Superb...Hats off. Thank you for explanation.
Gonna try this for my uni assignment! Thank you
excellent video and thank u for sharing this. I have one point about implementation, in "SelfAttention" class for query, value and key matrices (linear layer) you used (head_dim, head_dim) dimension. so these matrices will be shared in all heads. I think it's better to use (embed_dim, embed_dim) matrix to map input to q, k, v vectors and reshape it to have head dimension.
I want to use the encoder of the transformer for video classification, where each frame of the video is first passed through a pretrained CNN, and the output of this acts as an embedding that is then passed as input to the encoder. Any suggestions on how to do that?
Great Job!! Thanks for the video!
excellent work mate cleared all my doubts
Very nice! Congratulations!!
Thank you very much for the info!
Great video, advanced my understanding.
making something sophisticated so easy and clear that's what I call magic. Aladdin, you are truly the magician.
First of all, thank you for the video. The most valuable thing I learned from it is how to create such a complex model step by step from the flow chart. Next, I will find out whether this self-attention model can be used for environmental pollution problems.
Great Tutorial! Thanks Aladdin
for actually training it, what would we do?
@@jushkunjuret4386 Can you specify where you're exactly getting stuck?
Very detailed and clear! Thank you very much!
Thanks a lot for the kind words🙏
This is the best way to learn, through hands on. Great video! Also may I know which font is used in this video? I noticed that your choice of font is very clean and easy to work with!
Thanks for your educational contribution! Just one question: what are the linear layers self.values, self.keys and self.queries for? These are not used inside the forward pass.
Hey Aladdin,
Really amazing videos brother!
This was the first video of yours that I stumbled upon and I fell in love with your channel.
Hey Sahil, I definitely need a refresher and go through transformers again, so I'm not sure if I will be able to give you the best answer right now. So from what I recall the most important part of the masking with regards to padding is that we make sure these are not backpropagated through. We want the network weights and embeddings etc not to learn to be associated with the padded values, and that's what we are trying to accomplish with setting it to -infinity since gradient of softmax will then be 0.
@@AladdinPersson Yeah I get the reason why we do it and the -inf setting. I had doubts with the padding that we use, I feel we need more padding to take care of the cases where both sentences are padded and then we have attention over them. I feel I have made it pretty clear in the comment above.
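To make the padding discussion concrete, here is a small sketch of a source padding mask and its effect on the softmax. It assumes a padding index of 0 and dummy all-zero scores; the mask shape (N, 1, 1, src_len) is an assumption chosen so that it broadcasts over heads and query positions.

```python
import torch

src_pad_idx = 0  # assumed padding index
src = torch.tensor([[5, 7, 9, 0, 0]])  # one sentence, last two positions are padding

# (N, 1, 1, src_len): broadcasts over heads and over every query position
src_mask = (src != src_pad_idx).unsqueeze(1).unsqueeze(2)

# Dummy attention scores of shape (N, heads, query_len, key_len)
energy = torch.zeros(1, 1, 5, 5)
energy = energy.masked_fill(src_mask == 0, float("-inf"))

# Padded KEY positions get exactly zero weight after softmax
attention = torch.softmax(energy, dim=3)
print(attention[0, 0, 0])
```

Note that this mask only hides padded keys. The rows belonging to padded query positions are still computed; a common way to keep them from influencing training is to ignore padded positions in the loss, for example with the `ignore_index` argument of `nn.CrossEntropyLoss`.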
great explanation, much more helpful than the theoretical only explanations
Thanks Aladdin. The video helped a lot.
Thanks for the helpful video. Could I interpret line 244 as the transformer being trained on 'x' and predicting the last number in the 'trg' sequences? If so, how do I find which number was predicted with the highest probability/likelihood? My goal is to use a transformer for a similar task: it will be trained on a set of sequences (like in 'x') to detect the temporal relations between different events, and predict the next events for a given sequence (say, a test sequence). Any hint on how to do that would be an immense help. Thanks for reading my comment.
One of the best resource on the internet!
The best video I watched on youtube! Why I found you so late!!!
This is the best tutorial on Transformers online. I was able to understand the nuts and bolts of it. Kudos to you!! It will be great if you can cover Graph Convolutional Networks from scratch
In the Jay Alammar blog there is no split of the embeddings in order to compute attention for each head.
fantastic, awesome videos as ever.
Thank you for your video. You did a great job! I was wondering how to train a transformer if the input form is (batch_size, sequence_length, number_of_features). Let's say number_of_features = 2 (it could be X and Y coordinates in time, for example). What impact does this type of input have on positional encoding, the masking strategy and the attention mechanism?
I don't understand, sir. Is the input embedding divided into a number of chunks, like the 256/8 you did here, where 8 is the number of attention heads? I thought that for self-attention the whole input embedding needs to be transformed into three parts, namely Q, K and V, and that this is then done 8 times in the case of 8 heads; that's why it is called multi-head. Please clarify. Regards
Thanks a lot for the video, this was great and it's helping me a lot.
Thanks for the great video.
I have one doubt though. Encoder output is fed into each decoder block. So the last encoder block is fed to each decoder block or like layer1 encoder block output is fed to layer 1 decoder block like that and so on.
wow, the best transformer tutorial I've seen
Thank you, the video was very helpful. In the end we got an output of dim (2,7,10). So why did we get the probabilities of the next 7 words? And why is the output length dependent on the number of words we feed to the decoder?
Thx for this amazing tutorial. I think the "energy" (Q * K_transpose) should be divided by the square root of head_dim instead of embedding_size.
Very good tutorial!
Just one thing though: this is not how multi-head attention is implemented in the original "Attention is all you need" paper. In the paper the input is not split into h smaller vectors, but linearly transformed h times. So there wouldn't be a reshape followed by linear(head_dim, head_dim), but rather linear(embed_size, head_dim) in each head.
Also, heads*head_dim does not have to equal embed_size, so you can have more heads. This is because in the paper the concatenated head outputs are transformed again with a jointly trained matrix WO (concatenation size x embed_size).
Thank you so much!
hey man great video. how should i remove the embedding part of this network? and replace it with just an LSTM layer. i want to use this model for time series prediction and dont need nn.embedding.
@Aladdin Persson. Thank you. Great lesson. Which IDE and theme you used ?
Great! ❤ Thanks for this master piece. Hmm, I follow you along and I have not any error when I run, since I already noted your error in code and update it 😊. Waiting for your new Video. This is the first video I go along with you. Subscribed! Bell Notification on.
It's an extremely useful video for researchers trying to implement paper codes. Do make a series implementing other Machine Learning codes described in other papers as well.
Please make a video to use this model on an actual NLP task such as translation, etc.
Thank you for saying that I really appreciate it. I have made one other video on transformers for machine translation, and I will do my best to continue making videos and to cover more advanced topics! :)
@@AladdinPersson I can't seem to find it. Can you please paste the link here, please? I'd truly appreciate it. :)
@@flamingflamingo4021 Yeah for sure: ua-cam.com/video/M6adRGJe5cQ/v-deo.html
It's the last video of an unofficial series on building Seq2Seq models for the task of machine translation. The first video was normal seq2seq, the second video was seq2seq+attention, and the last video that I linked above uses transformers. These videos were inspired a lot by Bentrevett on Github and I recommend you check him out also if you're interested in NLP :)
Excellent! Thank you so much for this! Had a small request, can you please come up with videos on BERT and controlled text generation models like PPLM? Thanks again!
Thank you for the comment! I will look into it, got a few videos that I'm planning but will come back to this in the future for sure :)
Thank you! Great explanation. I just wonder why in the attention mechanism you have to initialize self.queries, self.keys, etc. as Linear layers.
From paper, the attention mechanism is fully connected, which means you should use linear layers.
This video is all I needed
I've got a question here. In order to generate a target sentence, there should be multiple time steps, right?
The first output word from the Decoder will go through the Decoder again to generate the second output word.
I can't find where you define this in the video. Or maybe I understood it wrong.
During training everything is done in parallel (we have the entire translated target sentence) and we utilize these target masks that I talked about in the video. This is a major difference between transformer and normal Seq2Seq, where we actually send in the entire target sentence rather than word by word. When we evaluate the model you're completely right that we need to do multiple time steps (one word at a time) but this is not the case during training. In this video we kind of just do the transformer from scratch, the question you're asking is more related to actually training & evaluating transformer models. I'll try to see if I find code for what you're asking for.
So here is a full code example of using transformers (also have a separate video on it): github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/seq2seq_transformer.py
When we actually evaluate the model we need to do it time step by time step and it would look like this (translate sentence function) and I believe THIS is what you're asking for: github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/utils.py
@@AladdinPersson Thank you, the link explained it pretty well. Thank you a lot.
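The "time step by time step" evaluation described above can be sketched as a greedy decoding loop. This is a hedged illustration, not the linked `translate_sentence` function: `greedy_decode`, `sos_idx`, `eos_idx` and the toy `CountingModel` are hypothetical names, and the only assumption about the real model is that `model(src, trg)` returns scores of shape (N, trg_len, vocab_size).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def greedy_decode(model, src, sos_idx, eos_idx, max_len=50):
    """Generate one token at a time, feeding the growing output back into the decoder."""
    model.eval()
    # Start every sentence in the batch with the start-of-sentence token.
    trg = torch.full((src.shape[0], 1), sos_idx, dtype=torch.long, device=src.device)
    with torch.no_grad():
        for _ in range(max_len - 1):
            logits = model(src, trg)  # (N, current_len, vocab_size)
            # Only the last position is new; pick its most likely token.
            next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
            trg = torch.cat([trg, next_token], dim=1)
            if (next_token == eos_idx).all():
                break
    return trg


class CountingModel(nn.Module):
    """Toy stand-in for a trained transformer: always predicts (previous token + 1)."""

    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, src, trg):
        return F.one_hot((trg + 1) % self.vocab_size, self.vocab_size).float()


src = torch.tensor([[3, 4, 5]])
decoded = greedy_decode(CountingModel(vocab_size=10), src, sos_idx=1, eos_idx=4)
print(decoded)  # tensor([[1, 2, 3, 4]])
```

During training none of this looping is needed, because the whole shifted target sentence is fed in at once and the target mask keeps each position from seeing later tokens.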
why is the input embedding input is 256 dimensions? at 10:02
Hello, This is a great explanation of transformers. I have a question. How did you know that query.shape[0] would give you the number of training examples? Why is it later used in reshaping the keys, query, and values?
The first dimension is always the batch size in tensor operations. As any model is trained on batches, and the batch size is the number of samples
Superb!
I love your tutorials
I got confused a bit. What you are sending from the encoder to the decoder, Do they represent queries and keys, or keys and values??
Hi @Aladdin, can I take the attention weight values from the model?
model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
out = model(x, trg[:, :-1])
attention_weights =
I haven't tried it but you should definitely be able to take the attention weights and I've seen other people like: github.com/bentrevett/pytorch-seq2seq/blob/master/6%20-%20Attention%20is%20All%20You%20Need.ipynb
Where he uses them to create these Attention Maps but I haven't personally tried that. Let me know how it goes for you
Why do you split the embedding vector into heads instead of using the same embedding in different heads like the paper does?
It is one of the best Transformer videos I have ever seen! Thanks a lot!!!
I have a question. In the paper, for multi-head attention, Wi^Q is in the domain of R^(d_model x d_k), but Wi^Q in your implementation seems to be in the domain of R^(d_model/h x d_k) because self.queries in class SelfAttention is defined as nn.Linear(head_dim, head_dim) and the queries are reshaped before coming into the linear layer. The cases of Wi^K and Wi^V are the same as the case of Wi^Q. Do I miss something? Thanks again!
In the decoder block's forward function, why is src_mask passed in to the transformer?
Dude! you're amazing!
Hi everyone! I finished following the this tutorial to the end... But now I am confused on how to "train" and "test/predict" this model? Any help is appreciated! Thanks!
Exactly what I need
Very nice video. I have a question.
The positional encoding you used is different from the one in the paper where they use sin/cos function of word position and vector index. It seems in your code, these positional embedding will be trained unlike the one in the paper. Do you have the code for how positional encoding is done in the paper?
Yes you're right about that, if I recall I did mention it in the video but I could have missed that. There have been other questions about this as well so I might try to implement positional encoding also but as of right now I have not
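For reference, the fixed sin/cos positional encoding from the paper can be sketched as follows. This is an illustrative sketch rather than code from the video or repository; the function name `sinusoidal_positional_encoding` is hypothetical, and an even `embed_size` is assumed.

```python
import math

import torch


def sinusoidal_positional_encoding(max_len: int, embed_size: int) -> torch.Tensor:
    """Return a (max_len, embed_size) table: sin on even indices, cos on odd indices."""
    assert embed_size % 2 == 0, "this sketch assumes an even embed_size"
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # 1 / 10000^(2i / embed_size), computed in log space for numerical stability
    div_term = torch.exp(
        torch.arange(0, embed_size, 2, dtype=torch.float32)
        * (-math.log(10000.0) / embed_size)
    )  # (embed_size / 2,)
    pe = torch.zeros(max_len, embed_size)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


pe = sinusoidal_positional_encoding(max_len=100, embed_size=256)
# Usage idea: out = word_embedding(x) + pe[:seq_length]  (broadcast over the batch)
```

Unlike the learned position embedding layer used in the video, this table has no trainable parameters, so it would typically be stored as a buffer rather than as an `nn.Embedding`.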
Why do you set bias=False for nn.Linear of keys, values and queries?
If I'm using a transformer for a speech recognition task (speech-to-text), then after training the model, what should I pass as the target parameter at prediction time if I only have an audio file (not transcribed)?
did you get ur answer?
45:59: WHOA! slow that down! Pause a sec, be emphatic if we're going to change something back up there
in SelfAttention, you have not used the linears self.keys, self.values, self.queries in forward method, whats the use of those layers?
Love this content.
Thank you for recording and publishing such informative tutorials. Could self-attention be regarded as a replacement for the RNN, meaning that anything RNN could do can be substituted by using self-attention? If so, could you do a tutorial regarding how we can use self-attention to classify the text?
Yes it can, in fact many have proclaimed RNN/GRU/LSTMs are "dead" (I'm not so sure I would be that dramatic) but transformers have definitely taken over in terms of SOTA performance. I haven't personally done any projects using it to classify text so far, though.
@@AladdinPersson Okay. Thanks for your reply. I will give it a try and see how it goes.
Why are you not using `sin` or `cos` function for positional encoding?
One last question, please: what is the intuitive meaning of the source and target inputs of the transformer? Why does the model take x, trg[:, :-1]?
model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
out = model(x, trg[:, :-1])
what we could get from out?
I tried
model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
out = model(x, trg)
and got
torch.Size([2, 8, 10])
to be honest I could not interpret that :(
Okay, so if I understand your question correctly, your doubt is why we are using trg[:, :-1] instead of trg.
First:
trg[:, :-1] means every sentence in the batch, with the last token of each sentence dropped.
Second:
We do this because of how the transformer is trained. The decoder receives the target sentence shifted by one position: at position t it sees the target tokens up to t (the target mask hides the later ones) and predicts the token at t+1. So we feed in the whole sentence except the last token, and the outputs are compared against the sentence without its first token, trg[:, 1:]. During training all positions are predicted in parallel; only at inference time is the sentence generated one word at a time.
Refer to the beautiful video by Yannic Kilcher:
ua-cam.com/video/iDulhoQ2pro/v-deo.html
Hope your doubt is solved. Let me know if it's still unclear.
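A tiny sketch of the slicing may help here. The token ids below are made up, and pairing `trg[:, :-1]` as decoder input with `trg[:, 1:]` as labels is the usual teacher-forcing setup, assumed here rather than taken from the video's example.

```python
import torch

# Made-up target batch: 2 sentences of 8 token ids each
trg = torch.tensor([
    [1, 7, 4, 3, 5, 9, 2, 0],
    [1, 5, 6, 2, 4, 7, 6, 2],
])

decoder_input = trg[:, :-1]  # all sentences, last token dropped: shape (2, 7)
labels = trg[:, 1:]          # all sentences, first token dropped: shape (2, 7)

# Position t of decoder_input is paired with the token that follows it in trg
print(decoder_input[0].tolist())  # [1, 7, 4, 3, 5, 9, 2]
print(labels[0].tolist())         # [7, 4, 3, 5, 9, 2, 0]
```

This is also why feeding 7 target tokens gives an output with 7 positions, such as (2, 7, 10) for a target vocabulary of 10, while feeding the full 8-token `trg` gives (2, 8, 10): the model returns one score vector over the vocabulary per input position.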
Thanks for your tutorial! But there is one thing I can't resolve: are the embeddings split into num_heads parts along the embed_size dimension and then sent to the linear layer, OR do they go to the linear layer first and then get split into 8 heads?
should go to linear first
Thank you so much sir
Great video! But why do we add dropout after skip connection?
I can't seem to understand the necessity for self.keys, self.queries and self.values in the SelfAttention class. Am I missing something?
for the dropout in your codes, for example DecoderBlock forward, I think it should be:
query = self.norm(x + self.dropout(attention))
instead:
query = self.dropout(self.norm(attention + x))
Here is the paper quote:
"We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized."
Thanks so much for the great work!
I think you're right, I'll look into this some more soon and update the Github code :)
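The ordering proposed in this comment (dropout on the sub-layer output, then the residual add, then layer norm) can be sketched as a small helper module. The name `AddAndNorm` is hypothetical and is not part of the video's code.

```python
import torch
import torch.nn as nn


class AddAndNorm(nn.Module):
    """Hypothetical helper: LayerNorm(x + Dropout(sublayer_output)), as quoted from the paper."""

    def __init__(self, embed_size: int, dropout: float):
        super().__init__()
        self.norm = nn.LayerNorm(embed_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer_out):
        # Dropout touches only the sub-layer output; the skip connection x stays intact.
        return self.norm(x + self.dropout(sublayer_out))


add_norm = AddAndNorm(embed_size=256, dropout=0.1)
x = torch.randn(2, 7, 256)
attention = torch.randn(2, 7, 256)  # stand-in for an attention sub-layer output
query = add_norm(x, attention)
```

With the other ordering, `self.dropout(self.norm(attention + x))`, dropout also zeroes parts of the residual path and of the normalized output, which is the difference the comment points out.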
Thank you very much!
World's best YouTube channel evaaaa.
Hi, at 18:45, energy = queries* keys. Are you doing outer product?
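On the outer-product question: assuming the energy is computed with an einsum of the form `"nqhd,nkhd->nhqk"` (the exact string is an assumption here), it is a batched matrix product of the queries with the transposed keys, that is, a dot product over `head_dim` for every query/key pair, rather than an outer product. A small check:

```python
import torch

N, heads, query_len, key_len, head_dim = 2, 8, 5, 6, 32
queries = torch.randn(N, query_len, heads, head_dim)
keys = torch.randn(N, key_len, heads, head_dim)

# Sum over d: one dot product per (batch, head, query position, key position)
energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

# The same thing written as an explicit Q @ K^T per batch element and head
q = queries.permute(0, 2, 1, 3)  # (N, heads, query_len, head_dim)
k_t = keys.permute(0, 2, 3, 1)   # (N, heads, head_dim, key_len)
energy_matmul = q @ k_t

print(energy.shape)  # torch.Size([2, 8, 5, 6])
print(torch.allclose(energy, energy_matmul, atol=1e-5))
```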
Thank you for the complete video. Excellent work, if you have time, can you do some more videos on attention maps on images? like the one in the Learn to pay attention paper.
I'll look into it:)
great video! is the code in your GitHub Repository? because I can't find it there? in which folder should it be?
Thank you for your tutorial. I have a question: you said that in the encoder the value, key and query are all the same. As the paper says, value, query and key are the same in size, not in their elements. Can you please explain it a little more?
I have a question about how you calculate the attention. According to the formula, we should divide by the square root of the key dimension, right? You divided by the embedding size, so I don't understand: is it a mistake, or am I missing something? Shouldn't we divide by key_len? The paper says "The input consists of queries and keys of dimension dk and values of dimension dv". At one point in the video you said that key_len and value_len are always going to be the same, whereas in the paper it is the key and query dimensions that are always the same, and the value dimension can differ.
I will look into it:)
I am a bit confused about that in encoder block where you create the positional embedding layer. Why do we initialize that layer with max len parameter. Can you please explain it in more detail?
Sure! Sorry for the late response. When using positional embeddings (in contrast to positional encodings), one pro is that it's very simple: we just add an embedding layer for the positions, and this removes the restriction that the transformer is permutation invariant. One con of doing it this way is that we need to restrict the sentences to be within some max_length, and that's why we initialize the layer with this parameter. Essentially, we won't be able to handle any sentence longer than max_length.
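A minimal sketch of the learned positional embedding described in this reply, with made-up sizes; the variable names are illustrative and not necessarily those used in the video.

```python
import torch
import torch.nn as nn

max_length, embed_size, vocab_size = 100, 256, 10  # made-up sizes

word_embedding = nn.Embedding(vocab_size, embed_size)
# One trainable vector per position 0 .. max_length - 1
position_embedding = nn.Embedding(max_length, embed_size)

x = torch.randint(0, vocab_size, (2, 9))  # (N, seq_length) token ids
N, seq_length = x.shape

# Position ids 0 .. seq_length - 1, repeated for every sentence in the batch
positions = torch.arange(seq_length).unsqueeze(0).expand(N, seq_length)

out = word_embedding(x) + position_embedding(positions)  # (N, seq_length, embed_size)
```

Because the position table has only `max_length` rows, a sentence with `seq_length > max_length` would ask for a position id that does not exist in the table, which is the restriction mentioned above.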
i don't know if there are still people who are watching this but i have a question at code level, in the decoder method "forward" when i pass the parameters to layer i had as the fifth one "target_mask" but in the decoder block you decided to put the parameter "device". Did i miss something, is it just an error or there is an other explanation? Thanks a lot
Thanks so much!
Appreciate it :)
great video! what does the forward_expansion parameter mean?
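One common reading of `forward_expansion`, offered here as an assumption rather than a quote from the video, is that it sets how much wider the hidden layer of the position-wise feed-forward network is than `embed_size`. The paper uses a model dimension of 512 and an inner dimension of 2048, which corresponds to a factor of 4.

```python
import torch
import torch.nn as nn

embed_size, forward_expansion = 512, 4  # 512 -> 2048 -> 512, as in the paper's base model

# Assumed structure of the feed-forward sub-layer inside a transformer block
feed_forward = nn.Sequential(
    nn.Linear(embed_size, forward_expansion * embed_size),
    nn.ReLU(),
    nn.Linear(forward_expansion * embed_size, embed_size),
)

x = torch.randn(2, 7, embed_size)  # (N, seq_length, embed_size)
out = feed_forward(x)              # applied to each position independently; same shape as x
```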
Great video.
Hi Aladdin. Very nice coding!
But I am confused as to why, here, the kqv projections for the different heads seem to be shared. It seems like we should use nn.Linear(embed_dim, embed_dim), and later divide it into different heads?
Some early comments have addressed the issue.
After the decoder block we have to again pass the matrix to neural network with output set to target vocab dimension and apply softmax to get the probabilities of word right ?
nope, it says in the previous comments.
the softmax is contained in loss function(cross entropy).
If you apply softmax again, it causes the gradients to diminish.
This is what the author replied in other comments:
Thank you for the comment! First you're going to probably use CrossEntropyLoss and softmax is then included in that loss function, so you don't want to do softmax as output. I have another video where we trained the transformer model on a translation task, although to simplify I used Pytorch inbuilt transformer modules (but you can use the ones we implemented).
The shapes for sending in to cross entropy can be tricky, but let's first understand the input shapes to cross entropy by looking at something like MNIST, where it takes (N, 10) for the outputs and the targets are simply (N). In this case you will reshape so that you have (N*seq_length, vocab_size) and (N*seq_length), sort of viewing every time step as its own example.
Here is the code for that transformer model I talked about (I also have a separate video if you feel something is confusing), which you might want to take a look at: github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/seq2seq_transformer.py
I haven't done any tests, but I would imagine PyTorch's inbuilt transformer is faster, so I would follow the other video I did when you want to actually use it for training a model; this video is more about understanding the transformer.
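The reshape described in this reply can be sketched as follows. The random tensors stand in for a real model output and real labels, and using `ignore_index` to skip padding is an assumption added for illustration.

```python
import torch
import torch.nn as nn

N, seq_length, vocab_size, pad_idx = 2, 7, 10, 0  # made-up sizes

logits = torch.randn(N, seq_length, vocab_size)         # stand-in for model(x, trg[:, :-1])
labels = torch.randint(1, vocab_size, (N, seq_length))  # stand-in for trg[:, 1:]

# CrossEntropyLoss applies log-softmax internally, so the model outputs raw scores.
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

# Treat every time step as its own example: (N*seq_length, vocab_size) vs (N*seq_length,)
loss = criterion(logits.reshape(-1, vocab_size), labels.reshape(-1))
```

To read off the predicted token at each position, `logits.argmax(dim=-1)` gives the index with the highest score, and `torch.softmax(logits, dim=-1)` gives probabilities if they are needed outside the loss.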