Thank you for your straight-to-the-point, no-BS videos. Good code-alongs and commentary. But it looks like the positional encoding isn't correct (as per the paper). There is a power there: denominator = np.power(10000, 2*i/d). I get that you decided to use an exp+log pair for stability, but there is no mention of where the power went. There is also an extra layer norm after the encoder. We do "norm + add" (as you defined in the video, instead of add & norm as per the paper; you said "let's stick with it", I understand), but then we norm again after the last encoder block. Like so: layer norm + add in the last encoder block (inside the residual connection), then layer norm again (inside the Encoder class). I could be wrong, but it looks like that.
At 29:14, the part on multi-head attention: we multiply each of Q, V, K by Wq, Wv, Wk, then split them into n heads, then take the dot product and concat them again. But should we not split them first, then apply Wq_h, where Wq_h is the weight matrix for the h-th query matrix, and the same for V and K? Because it seems like we just split them, apply attention, then concat?
23:25 @umarjamilai Can you please explain how the second equation is incorporated into the calculations? I don’t believe there’s a single W^Q that you dot product. This doesn’t align with the second equation where it includes W_i^Q, W_i^K, W_i^V…
For determining the max length of the tgt sentence, I believe you should use tokenizer_tgt rather than tokenizer_src: tgt_ids = tokenizer_tgt.encode(item['translation'][config['lang_tgt']]).ids
Hi! The video is part of the "from scratch" series, so I try to make as many things as possible from scratch. The goal is to learn the underlying mechanisms.
Hello Sir, a small doubt at the 25:50 timestamp (ua-cam.com/video/ISNdQcPhsts/v-deo.htmlsi=-M9N3TPd6o1CAd2D&t=1550). In the slide you have shown that after getting Q', K' and V', the matrices are split into "h" partitions. But the second equation, for head_i, shows that head_i = Attention(Q(W^Q)_i, K(W^K)_i, V(W^V)_i). That means for every head_i there is a different W_i (for Q, K and V). But you have shown that there is only one W^Q for Q, one W^K for K and one W^V for V. As per the second equation, for head_i, I THINK: 1. For each head there will be a different W^Q, W^K and W^V. 2. There will be no split of Q', K' and V'. As per the third equation, for MultiHead: 3. The different head_i, generated by the different Ws, will be concatenated. There is no splitting concept. By the way, this is only my observation; you have done the actual research on it, so you are better placed to justify it. PLEASE check and clear up my doubts. I am really thankful to you for this nice and effective video. It is really helpful for me. ☺
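On the single W^Q versus per-head W_i^Q question raised in the comments above, here is a minimal sketch in plain Python (no PyTorch) of why the two readings can agree: if each W_i^Q is taken to be one block of columns of the big W^Q, then "project once, then split" and "project with each W_i^Q" give the same heads. The toy sizes and the helper names matmul and column_block are assumptions made for illustration, not code from the video.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def column_block(m, start, stop):
    """Return columns start..stop-1 of matrix m."""
    return [row[start:stop] for row in m]

random.seed(0)
seq_len, d_model, h = 3, 4, 2      # toy sizes, assumed for illustration
d_k = d_model // h

q = [[random.random() for _ in range(d_model)] for _ in range(seq_len)]
w_q = [[random.random() for _ in range(d_model)] for _ in range(d_model)]

# Route 1 (as described in the video): one projection Q' = Q W^Q,
# then split Q' into h heads along the feature dimension.
q_prime = matmul(q, w_q)
heads_after_split = [column_block(q_prime, i * d_k, (i + 1) * d_k) for i in range(h)]

# Route 2 (the paper's notation): one matrix W_i^Q per head,
# here taken to be the i-th block of columns of W^Q.
heads_per_matrix = [matmul(q, column_block(w_q, i * d_k, (i + 1) * d_k)) for i in range(h)]

print(heads_after_split == heads_per_matrix)
```

Under that assumption the split is only a bookkeeping choice: one large matrix multiplication replaces h small ones.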
Hello Umar, really impressive work on the Transformer. I have followed your steps in this experiment. One small thing I am not sure about: when you compute the loss you use nn.CrossEntropyLoss(), and this method already applies the softmax itself. As the documentation says: "The input is expected to contain the unnormalized logits for each class (which do not need to be positive or sum to 1, in general)". But the project method of the Transformer model you built has already applied softmax. I wonder if we should output only the logits, without this softmax, to fit nn.CrossEntropyLoss()? Thank you anyway.
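A small sketch of what is at stake in the question above, in plain Python rather than PyTorch. nn.CrossEntropyLoss is documented as combining log-softmax with negative log-likelihood, so the toy cross_entropy below mimics that. The scores and target are made-up values, and the sketch makes no claim about which normalization the video's project method actually applies: it only shows that feeding log-probabilities leaves the loss unchanged, while feeding probabilities changes it.

```python
import math

def log_softmax(xs):
    """Numerically stable log-softmax over a list of scores."""
    m = max(xs)
    log_sum = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_sum for x in xs]

def softmax(xs):
    """Softmax computed from the log-softmax."""
    return [math.exp(v) for v in log_softmax(xs)]

def cross_entropy(scores, target):
    """Log-softmax followed by negative log-likelihood of the target class."""
    return -log_softmax(scores)[target]

logits = [2.0, 0.5, -1.0]   # toy scores for 3 classes (assumed)
target = 0

loss_from_logits = cross_entropy(logits, target)
loss_from_log_probs = cross_entropy(log_softmax(logits), target)  # log-softmax applied twice
loss_from_probs = cross_entropy(softmax(logits), target)          # softmax, then log-softmax

print(loss_from_logits, loss_from_log_probs, loss_from_probs)
```

Applying log-softmax to log-probabilities is a no-op because they already sum to one in probability space, so that case is harmless. Passing plain softmax outputs flattens the distribution and gives a different loss.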
In the matrices w_k, w_q, w_v and w_o in the MultiHeadAttention module, why did you not set bias=False? Don't you need that for this to work properly?
I love your video! Just a question regarding line 20 on GitHub, in the forward method of the LayerNorm class: return self.alpha * (x - mean) / (std + self.eps) + self.bias. In the formula we have (x - mean) / sqrt(variance + self.eps). Here we're doing (std + self.eps), which is fine, but we're adding the epsilon directly, whereas in the slide's formula the epsilon sits under the square root. Won't our very small epsilon, which isn't square-rooted, cause any issues?
The epsilon is just a small constant that never changes, so you don't need to explicitly take the square root of it. It's only there so that the denominator is never zero, which would cause an error, and it is small enough not to impact the result.
Thanks so much, such a great video. Really liked it a lot. I have a small query. For ResidualConnection, the paper gives the equation "LayerNorm(x + Sublayer(x))". In the code, we have: x + self.dropout(sublayer(self.norm(x))). Why is it not self.norm(self.dropout(x + sublayer(x)))?
personally, I find that seeing someone actually code something from scratch is the best way to get a basic understanding
indeed
indeed
I don't need to see someone typing... but you might also enjoy watching the grass grow or paint dry
Yeah, and you see how these technologies work.
It's insane that in the end it looks easy, that you can do something that companies worth millions and billions of dollars do. In a small way, but the same idea in the end.
Yeah kinda ironic how that works. The simplest stuff required the most complex explanations
Hi Umar. I am a first year student at MIT who wants to do AI startups. Your explanation and comments during coding were really helpful. After spending about 10 hours on the video, I walk away with great learnings and great inspiration. Thank you so much, you are an amazing teacher!
Best of luck with your studies and thank you for your support!
I am a 3rd semester student at IIT Roorkee. I am also interested in AI startups.
Greetings from China! I am a PhD student focused on AI. Your video really helped me a lot. Thank you so much, and I hope you enjoy your life in China.
Thank you! Let's connect on LinkedIn.
I am also a Ph.D. student. This video is valuable. Many thanks!
The full code is available on GitHub: github.com/hkproj/pytorch-transformer
It also includes a Colab Notebook so you can train the model directly on Colab.
Of course nobody reinvents the wheel, so I have watched many resources about the transformer to learn how to code it. All of the code is written by me from zero except for the code to visualize the attention, which I have taken from the Harvard NLP group article about the Transformer.
I highly recommend all of you to do the same: watch my video and try to code your own version of the Transformer... that's the best way to learn it.
Another suggestion I can give is to download my git repo, run it on your computer while debugging the training and inference line by line, while trying to guess the tensor size at each step. This will make sure you understand all the operations. Plus, if some operation was not clear to you, you can just watch the variables in real time to understand the shapes involved.
Have a wonderful day!
The best video ever
Can you provide the pretrained models?
🎉 Is this the BERT architecture?
@wilfredomartel7781 It's a complete encoder-decoder model; BERT is the encoder part of this encoder-decoder model.
I love you bro
I have browsed YouTube for the perfect set of videos on the transformer, but your set of videos (the video explanation you did on the transformer architecture, and this one) is by far the best!! Take a bow, brother, you have contributed to the viewers more than you can even imagine. Really appreciate this!!!
This is the most detailed video I have seen on building a Transformer model from scratch. From the code implementation to the data processing to the visualization, the uploader really broke everything down into bite-sized pieces. Thank you!
I didn't understand anything! But I left my like.
@decarteao The guy from China is really funny about the video
Thanks a ton for making this video and all your other videos. Incredibly useful.
Thanks for your support!
Hi Umar. Absolutely amazing 🤯. Your clear breakdown and explanation of the concepts and code is just next level. Until I watched your video I had a very tentative handle on transformers. After watching I have a much better fundamental grasp of EVERY component. I can't say thank you enough. Please keep doing what you are doing.
Don't the professors at MIT teach like this?
Thank God, it's not one of those 'ML in 5 lines of Python code' or 'learn AI in 5 minutes'. Thank you. I can not imagine how much time you must have spent on making this tutorial. thank you so much. I have watched it three times already and wrote the code while watching the second time (with a lot of typos :D).
I'm not sure if it is because I have studied this content 1000000 times or not, but this is the first time that I understood the code and feel confident about it. Thanks!
Keep doing what you are doing. I really appreciate you taking so much time to spread such knowledge for free. I have been studying transformers for a long time but have never understood them so well. The theoretical explanation in the other video combined with this practical implementation: just splendid. I will be going through your other tutorials as well. I know how time-consuming it is to produce such high-level content, and all I can really say is that I am grateful for what you are doing and hope that you continue doing it. Wish you a great day!
Thank you for your kind words. I wish you a wonderful day and success for your journey in deep learning!
Thanks for your detailed tutorial. Learned a lot!
One of the best tutorials to understand and implement the Transformer model... Thank you for making such a wonderful video
One random afternoon last year I decided to watch the whole video, and now I have my own LLM with 1B parameters built with your code. Thank you so much. Don't ever stop inspiring new AI programmers! Greetings from the Philippines.
Thanks for a great video.
Loving this video (only 13 minutes in), really like you using type hints, commenting, descriptive variable names, etc. Way better coding practices than most of the ML code I've looked at.
At 13:00, for the 2nd arg of the array indexing, you could just do ":" and it would be identical.
Thank you for this comment! I'm coding along with this video and I wasn't sure if my understanding was correct. I'm glad someone else was thinking the same thing. Just to be clear, I am VERY THANKFUL for this video and am in no way complaining. I just wanted to make sure I understand because I want to fully internalize this information.
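Thank you Umar for your extraordinarily excellent work! Best transformer tutorial I have ever seen!
Bro, you saved me! I am a graduate student at the University of Science and Technology of China. By watching your videos I not only learned deep learning but also practiced my English listening 😁
You're welcome! I will be posting new videos soon, stay tuned!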
Dear Umar, your video is full of knowledge; thanks for sharing.
I really appreciate your efforts. The explanations are very clear. This is a great service for people that wish to learn the future of AI! All the best from Spain!
Best video I have ever seen on the whole of YouTube on the transformer model. Thank you so much, sir!
It feels really fantastic to watch someone write a program from the bottom up
What a WONDERFUL example of transformer! I am Chinese and I am doing my PhD program in Korea. My research is also about AI. This video helps me a lot. Thank you!
BTW, your Chinese is very good!😁😁
This is all going over my head, but I'm trying really hard to understand the process of building a transformer and implementing a system in a real-world scenario. This video is a really great reference for studying and better understanding the 'Attention Is All You Need' paper. Thank you, sensei!
This video is incredible, never understood it like this before. I will watch your next videos for sure, thank you so much!
WOW WOW WOW, though it was a bit tough for me to understand it, I was able to understand around 80 % of the code, beautiful. Thank you soo much
Just to repeat what everyone else is saying here - many thanks for an amazing explanation! Looking forward to more of your videos.
Thanks!
Really helpful video. I watched it many times. Hope you enjoy your life in China. Happy Year of the Dragon!
Thank you, boss, for the targeted poverty alleviation
Best video I came across for transformer from scratch.
I learnt a lot from following the steps in this video and creating a transformer myself, step by step!! Thank you!!
I can't possibly thank you enough for this incredibly informative video
Hello, at 51:16 why do we add normalization at the end of the encoder?
Hi! About the layer normalization, there are different opinions on where to add it in the model. I suggest you read this paper (arxiv.org/abs/2002.04745) which discusses this issue. Have a nice day!
Dear Umar - thank you so much for this amazing and very clear explanation. It has deeply helped me and many others in understanding the theoretical and practical implementation of transformers! Take a bow!
Thanks Umar for this comprehensive tutorial, after watching many videos I would say, this is AWESOME! It would be really nice if you can provide us with more tutorials on Transformers especially training them for longer sequences. :)
Hi mohamednabil374, stay tuned for my next video on the LongNet, a new transformer architecture that can scale up to 1 billion tokens.
Thanks a lot for such a detailed video. Your videos on transformer are best.
Thank you so much for taking the time to code and explain the transformer model in such detail. You are amazing and please do a series on how transformers can be used for time series anomaly detection and forecasting!
Thanks!
Umar, thank you for the amazing example and clear explanation of all your steps and actions.
Thank you for watching my video and your kind words! Subscribe for more videos coming soon!
@umarjamilai, mission completed 😎.
Already subscribed.
All the best, Umar
Thanks for making it so easy to understand. I definitely learn a lot and gain much more confidence from this!
Awesome! Highly appreciated. (Super great! Thank you so much.)
This is such great work. I don't really know how to thank you, but this is an amazing explanation of an advanced topic such as the transformer.
this is an incredible contribution to the topic
Wow, your explanation is amazing
Thanks for your video and code.
Dear Umar, thank you so so much for the video! I don't have much experience in deep learning, but your explanations are so clear and detailed that I understood almost everything 😄. It will be a great help for me at my work. Wish you all the best! ❤
Thank you for your kind words, @angelinakoval8360!
Thanks, bro. With your explanation, I was able to build the transformer model for my application. You explained it so well. Please keep doing what you are doing.
Great work!!
thank you so much for these videos
Great Job!
Hi. Just wanted to understand this: at the 18:08 mark (where you return the result from the LayerNormalization class, i.e. in the forward function), shouldn't it be `torch.sqrt(std + self.eps) + self.bias`? That's also what the formula says. Pardon me if I am missing something.
Great video nonetheless.
Hi Nikhil! The variable "std" in the denominator is already a standard deviation (it's NOT the variance), so we don't need to take the square root of it.
@umarjamilai - OK yes! My apologies. That makes sense. So then I guess the reason you're not using sqrt(σ^2 + eps) is that eps is a very small number, so it wouldn't make much difference? Meaning, sqrt(σ^2 + eps) and (std + eps) wouldn't be far from each other?
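To put a number on that last question, here is a minimal sketch comparing the two denominators in plain Python. The feature values and eps = 1e-6 are assumptions, and the population variance is used for simplicity (a framework's default std may use the n-1 correction instead).

```python
import math

def mean_and_var(xs):
    """Mean and population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var

def normalize_std_plus_eps(xs, eps=1e-6):
    """(x - mean) / (std + eps), the form discussed for the video's code."""
    mean, var = mean_and_var(xs)
    return [(x - mean) / (math.sqrt(var) + eps) for x in xs]

def normalize_sqrt_var_plus_eps(xs, eps=1e-6):
    """(x - mean) / sqrt(var + eps), the form shown in the formula."""
    mean, var = mean_and_var(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

features = [0.5, -1.2, 3.3, 0.0]   # toy feature vector (assumed)
a = normalize_std_plus_eps(features)
b = normalize_sqrt_var_plus_eps(features)
max_gap = max(abs(u - v) for u, v in zip(a, b))
print(max_gap)

# Degenerate case: all features equal, so the variance is exactly zero.
constant = [2.0, 2.0, 2.0, 2.0]
print(normalize_std_plus_eps(constant), normalize_sqrt_var_plus_eps(constant))
```

For inputs with ordinary variance the two results differ only far beyond the leading digits, and in the zero-variance case both forms avoid a division by zero, which is the job eps is there to do.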
i enjoyed the video ! now i can transform the world !
Hey there! I enjoyed watching that video, you did a wonderful job explaining everything, and I found it super easy to follow along. Overall, it was a really great experience!
It is a really amazing video. I tried understanding the code from various other YouTube channels, but was always getting confused. Thanks a lot :). Can you make a series on BERT & GPT as well, where you build these models and train them on custom data?
Hi Phanindra! I'll definitely continue making more videos. It takes a lot of time and patience to make just one video, not counting the preparation time to study the model, write the code and test it. Please share the channel and subscribe; that's the biggest motivation to continue providing high-quality content to you all.
@umarjamilai A coding example for BERT would be great!
I am at 4:27 and I'm already lost. You said that in a previous video you talked about embeddings, but the only 2 previous videos are about CLIP and wav2lip, there is no embedding. Then at 4:27 you say "a detail written in the paper" and a paper shows up. Which paper? You don't cite the paper at all.
Thank you!
Thank you, boss, for the targeted poverty alleviation 🧧
Thank you very much, this is very useful.
This is amazing thank you 🙏
At 13:13 (of 2:59:23), when we build the PositionalEncoding class,
in this line, x = x + (self.pe[:,:x.shape[1],:]).requires_grad_(False), the x.shape[1] slice looks like it has no effect in the transformer model, because when we build dataset.py we pad all the sentences to the same length, and then we pass (batch, seq_len, input_embedding_dim) into PositionalEncoding, where x.shape[1] is seq_len for every item in the batch, instead of varying with the original sentence length.
@umarjamilai. I have the same question. x.shape[1] in this case will always equal seq_len. So every time this will just return the entire pe tensor. Wondering if this is unique to this use case example??
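One case where the slice is not a no-op, sketched in plain Python with lists standing in for tensors and a single sequence instead of a batch. The assumption here is that at inference time the decoder is fed a sequence that grows by one token per step, so it is shorter than seq_len; the function names and toy sizes are made up for illustration.

```python
import math

def build_pe(seq_len, d_model):
    """Sinusoidal table with one row per position, following the paper's formula."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_positional(x, pe):
    """Add the first len(x) rows of pe to x (the role of pe[:, :x.shape[1], :])."""
    return [[xv + pv for xv, pv in zip(x_row, pe_row)]
            for x_row, pe_row in zip(x, pe[:len(x)])]

seq_len, d_model = 6, 4                                # toy sizes, assumed
pe = build_pe(seq_len, d_model)

padded = [[0.0] * d_model for _ in range(seq_len)]     # training: always seq_len rows
partial = [[0.0] * d_model for _ in range(2)]          # decoding: only 2 tokens so far

full_out = add_positional(padded, pe)
short_out = add_positional(partial, pe)
print(len(full_out), len(short_out))
```

With padded training batches the slice does return the whole table, as the comments say. With a shorter input it picks only the rows for the positions that exist, which a tensor addition needs for the shapes to match.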
Wow, super useful! Coding really helps me understand the process better than visuals.
Perfect video!! Thank you so much. I always wondered about the detailed code and its explanation, and now I understand almost all of it. Thanks :) You are the best for me!
You're welcome!
Mashallah what a video. U r an inspiration
Hidden gem!💎
Thanks!
Thank you, thank you, thank you!
Great video, you are insanely talented btw.
Note: this implementation follows the 'pre-LN' version of the transformer -- which is slightly different from the original transformer in the residual connection part. In the original block diagram, the layer normalization (LN) is applied AFTER the multi-head attention / feed-forward network. However, this code applies the LN BEFORE the multi-head attention and feed-forward network. You can see the difference by comparing the ResidualConnection forward() code with section 3.2 of the original "Attention Is All You Need" paper. This is a valid architecture too (proposed in other papers), but it is not exactly what was proposed in the original one.
Here LN is applied after the residual, which basically merges the multi-head attention / feed-forward network output with the original input. What you are saying is incorrect.
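To make the two orderings under discussion concrete, here is a minimal sketch in plain Python. It assumes the video's forward has the form x + dropout(sublayer(norm(x))), as quoted elsewhere in this thread, and it leaves dropout out. The sublayer function is a made-up stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(xs, eps=1e-6):
    """Normalize one feature vector to zero mean and unit spread."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / (std + eps) for x in xs]

def sublayer(xs):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [2.0 * x + 1.0 for x in xs]

def pre_ln(xs):
    # x + Sublayer(LayerNorm(x)): the norm runs BEFORE the sublayer,
    # and the residual sum itself is left unnormalized.
    return [x + s for x, s in zip(xs, sublayer(layer_norm(xs)))]

def post_ln(xs):
    # LayerNorm(x + Sublayer(x)): the norm runs AFTER the residual sum,
    # which is the form written in the original paper.
    return layer_norm([x + s for x, s in zip(xs, sublayer(xs))])

x = [1.0, 2.0, 4.0]
pre_out = pre_ln(x)
post_out = post_ln(x)
print(pre_out)
print(post_out)
```

The post-LN output always has zero mean because the norm is the last step. The pre-LN output does not, which is one reason a pre-LN stack is often followed by one final norm after the last block.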
Hi, I just happened to see your video. It's really amazing; your channel is so good, with valuable information. I hope you keep this up, because I really love your content.
seriously awesome
Hi, thanks for the video, it was really helpful, I do have 1 question though.
at 57:01, shouldn't the Q, K, V be encoder_output, encoder_output, x; instead of x, encoder_output, encoder_output?
If we're calculating Q@K.T, I think of that as somewhat similar to "capturing the essence of the input sentence", which would be the encoder output for both Q & K, and the V would build on top of the input sentence's "essence". Can you please elaborate on why the order of the inputs is what it is? Even in the attention paper, the first 2 inputs (Q and K) are coming from the encoder output, and the last input V comes from the decoder's self-attention output.
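For reference, section 3.2.3 of "Attention Is All You Need" states that in the encoder-decoder attention layers the queries come from the previous decoder layer, while the keys and values come from the encoder output. A minimal single-head sketch in plain Python (no masking, no learned projections, toy values assumed) shows what that ordering implies for shapes:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for a single head, without masking."""
    d_k = len(q[0])
    output = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
                  for k_row in k]
        weights = softmax(scores)   # one weight per key, i.e. per source position
        output.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                       for j in range(len(v[0]))])
    return output

d_model = 4
src_len, tgt_len = 5, 2             # toy lengths, assumed
encoder_output = [[0.1 * (r + c) for c in range(d_model)] for r in range(src_len)]
x = [[0.2 * (r - c) for c in range(d_model)] for r in range(tgt_len)]  # decoder states

cross = attention(x, encoder_output, encoder_output)   # Q = x, K = V = encoder output
print(len(cross), len(cross[0]))
```

The result has one row per decoder position, so it can be added back to x in the residual connection. Each row is a weighted mix of encoder rows, which is how the target side reads from the source sentence.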
thank you for the effort !
Doctor... you're great!
the code is really well written. very easy and nicely organized.
At 1:38:47, I can't understand why we are subtracting 2 from enc_tokens and 1 from dec_tokens. Can someone please explain this to me?
Hi! In the encoder input we add the token "SOS" (start of sentence) and the token "EOS" (end of sentence) while in the decoder we add only the "SOS". If you want to understand the reason, please watch my video on the transformer, when I talk about training and inference (last 20 minutes of the video more or less).
@umarjamilai Hey, I watched that before this one, but it seems I missed that part :D. Thank you for your kind response.
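The padding arithmetic from the question above, as a small sketch. The token ids for SOS, EOS and PAD and the function name build_example are hypothetical, and the label layout (target tokens followed by EOS) is an assumption based on the usual teacher-forcing setup rather than a quote from the video.

```python
SOS, EOS, PAD = 2, 3, 1   # hypothetical special-token ids

def build_example(src_tokens, tgt_tokens, seq_len):
    """Pad one sentence pair to seq_len for the encoder, the decoder and the label."""
    enc_pad = seq_len - len(src_tokens) - 2   # minus 2: room for SOS and EOS
    dec_pad = seq_len - len(tgt_tokens) - 1   # minus 1: room for SOS only
    if enc_pad < 0 or dec_pad < 0:
        raise ValueError("Sentence is too long for seq_len")

    encoder_input = [SOS] + src_tokens + [EOS] + [PAD] * enc_pad
    decoder_input = [SOS] + tgt_tokens + [PAD] * dec_pad
    label = tgt_tokens + [EOS] + [PAD] * dec_pad   # the one extra slot is EOS here
    return encoder_input, decoder_input, label

enc, dec, lab = build_example([10, 11, 12], [20, 21], seq_len=8)
print(enc)
print(dec)
print(lab)
```

All three lists come out with exactly seq_len entries, and the label is the decoder input shifted left by one position.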
Hi. Why are you taking the size of alpha and bias as 1 in LayerNorm? In other sources LayerNorm also has a size parameter.
Hi! Thanks for your feedback and yes, it was a mistake on my side. The LayerNorm needs to be "elementwise affine", that is, there must be one gamma parameter per feature. I have fixed the code in the repository. Have a nice day!
Good find! had the same question after watching your LLama RMSNorm code :)
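A minimal sketch of the difference between a size-1 alpha/bias and the "elementwise affine" form mentioned in the reply above, in plain Python. The parameter values are hypothetical; in a real model they would be learned.

```python
import math

def layer_norm(xs, alpha, bias, eps=1e-6):
    """Normalize one feature vector, then scale and shift each feature separately."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [a * (x - mean) / (std + eps) + b for x, a, b in zip(xs, alpha, bias)]

d_model = 4
x = [1.0, 2.0, 3.0, 4.0]

# Size-1 parameters: a single alpha and bias shared by every feature.
scalar_out = layer_norm(x, alpha=[1.0] * d_model, bias=[0.0] * d_model)

# Elementwise-affine parameters: one alpha and one bias per feature.
alpha = [1.0, 2.0, 0.5, 1.0]   # hypothetical learned values
bias = [0.0, 0.0, 0.0, 0.1]
affine_out = layer_norm(x, alpha, bias)

print(scalar_out)
print(affine_out)
```

With per-feature parameters the layer can rescale and shift each of the d_model features independently, which a single shared scalar cannot do.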
Just finished watching. Thanks so much for the detailed video. I plan to spend this weekend on coding this model. How long did it take to train on your hardware?
Hi Mohsin! Good job! It took around 3 hours to train 30 epochs on my computer. You can train even for 20 epochs to see good results.
Have a wonderful day!
@@umarjamilai What is your hardware? I just started studying deep learning a few days ago and I didn't know transformers could take this long to train.
@@NaofumiShinomiya Training time depends on the architecture of the network, on your hardware and the amount of data you use, plus other factors like learning rate, learning rate scheduler, optimizer, etc. There are many conditions to factor in.
Amazingly useful video. Thank you.
Excellent lecture. What lib or method do you recommend to parallelise your code using more than 1 GPU?
Awesome tutorial, thank you very much!
Really great explanation to understand Transformer, many thanks to you.
For time 18:07, is it (self.alpha * (x - mean) / math.sqrt(std + self.eps) + self.bias) or (self.alpha * (x - mean) / (std + self.eps) + self.bias) ?
At 1:42:03 you are using the SOS special token from the source-language tokenizer in a sentence in the target language. The tokenizers are trained on different languages, so is it correct to share special tokens between them? Won't the SOS token from the source-language tokenizer have a different idx than the SOS token from the target-language tokenizer?
Great video by the way, Thank You!
around 6:30, why does positional encoding need d_model? wouldn't seq_length suffice?
Because each vector of the positional encoding has d_model dimensions. Otherwise you wouldn't be able to add the embedding and the position vectors together, they need to have the same dimensions.
I appreciate you for this explanation. Great video!
You are a genius
This video is great! But can you explain how you convert the formula of positional embeddings into log form?
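One way to see the conversion is the identity 1 / 10000^(2i/d_model) = exp(-(2i/d_model) * ln(10000)), so the power from the paper is still there, just rewritten through exp and log. The sketch below checks this numerically in plain Python; the function names are hypothetical, d_model is assumed to be even, and this is not the video's PyTorch code.

```python
import math

def power_form(two_i, d_model):
    """Denominator term as written in the paper: 1 / 10000^(2i / d_model)."""
    return 1.0 / (10000 ** (two_i / d_model))

def exp_log_form(two_i, d_model):
    """Same term rewritten with exp and log."""
    return math.exp(two_i * (-math.log(10000.0) / d_model))

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table: sin on even indices, cos on odd indices."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for two_i in range(0, d_model, 2):
            div_term = exp_log_form(two_i, d_model)
            pe[pos][two_i] = math.sin(pos * div_term)
            pe[pos][two_i + 1] = math.cos(pos * div_term)
    return pe

for two_i in range(0, 8, 2):
    print(two_i, power_form(two_i, 8), exp_log_form(two_i, 8))

pe = positional_encoding(seq_len=4, d_model=8)
print(len(pe), len(pe[0]))  # one d_model-sized vector per position
```

The table also shows why d_model is needed: every position gets a full d_model-sized vector so it can be added to the token embedding.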
Thank you for your straight to the point no bs videos. Good code alongs and commentary.
But it looks like the positional encodings aren't correct (as per the paper). There is a power there: denominator = np.power(10000, 2*i/d). I get that you decided to use an exp+log pair for stability, but there is no mention of where the power went.
And there is an extra layer norm after the encoder. We do "norm + add" (as you defined in the video, instead of "add & norm" as per the paper; you said "let's stick with it", I understand), but then we norm again after the last encoder block. Like so:
layer norm + add (in the last encoder block, inside the residual connection) + layer norm again (inside the Encoder class)
(I could be wrong, but it looks like that.)
At 29:14, the part on multi-head attention, we multiply each of Q, V, K by Wq, Wv, Wk, then split them into n heads, then apply the dot product and concat them again. But should we not split them first and then apply Wq_h, where Wq_h is the weight matrix for the h-th query matrix, and the same for V and K? Because it seems like we just split them, apply attention, then concat.
23:25 @umarjamilai Can you please explain how the second equation is incorporated into the calculations?
I don’t believe there’s a single W^Q that you take the dot product with. This doesn’t align with the second equation, which includes W_i^Q, W_i^K, W_i^V…
For determining the max len of the tgt sentence, I believe you should point to tokenizer_tgt rather than tokenizer_src: tgt_ids = tokenizer_tgt.encode(item['translation'][config['lang_tgt']]).ids
Why can't we use PyTorch's built-in layer normalization? Why are you creating a separate class for layer normalization?
Hi! The video is part of the "from scratch" series, so I try to make as many things as possible from scratch. The goal is to learn the underlying mechanisms.
You are a great professional, thanks a ton for this
Hello Sir,
A small doubt at the 25:50 timestamp ( ua-cam.com/video/ISNdQcPhsts/v-deo.htmlsi=-M9N3TPd6o1CAd2D&t=1550 ).
In the slide you have shown that after getting Q', K' and V', the matrices are split into "h" partitions. But in the second equation, for head_i, it is shown that
head_i = Attention ( Q(W^Q)_i, K(W^K)_i, V(W^V)_i)
That means for every head_i there is different W_i (for Q, K and V)
But you have shown there is only one (W^Q) for Q, one (W^K) for K and one (W^V) for V.
As per the second equation of head_i, I THINK,
1. For each head there will be different W^Q, W^K and W^V
2. There will be no split of Q', K' and V'
As per the third equation of MultiHead
3. Different head_i, generated by the different Ws will be concatenated. No splitting concept is there.
By the way, this is only my observation; you have done the actual research on it, so you are better placed to justify it.
Please check and clear up these doubts.
I am really thankful to you for having this nice and effective video. It is really helpful for me.
☺
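The two views can be reconciled if each per-head matrix W_i^Q is taken to be the i-th block of columns of the single W^Q. Multiplying by the full matrix and then splitting the result gives the same numbers as multiplying by each block separately. The plain-Python sketch below checks this on toy integer matrices. The helper names and values are made up for illustration, and the same reasoning applies to W^K and W^V.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_block(m, start, stop):
    """Take columns start..stop-1 of a matrix."""
    return [row[start:stop] for row in m]

d_model, h = 4, 2
d_k = d_model // h

# Toy input: seq_len = 3, d_model = 4
X = [[1, 2, 3, 4],
     [0, -1, 2, 0],
     [3, 1, -2, 1]]

# One big projection matrix W^Q of shape d_model x d_model
W_q = [[1, 0, 2, -1],
       [0, 1, 1, 3],
       [2, -2, 0, 1],
       [1, 1, -1, 0]]

# View 1 (the slide): project once, then split the result into h heads
Q_prime = matmul(X, W_q)
heads_by_split = [column_block(Q_prime, i * d_k, (i + 1) * d_k) for i in range(h)]

# View 2 (the paper's head_i equation): one d_model x d_k matrix per head
heads_from_blocks = [matmul(X, column_block(W_q, i * d_k, (i + 1) * d_k))
                     for i in range(h)]

print(heads_by_split == heads_from_blocks)
```

So a single large projection followed by a split is a compact way of storing h separate per-head projections side by side.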
I'm working on Speech-to-Text conversion using Transformers, this was very helpful, but how can I change the code to be suitable for my task?
Great explanation! Thanks very much
Hello Umar, really impressive work on the Transformer. I have followed your steps in this experiment. One small thing I am not sure about: when you compute the loss you use nn.CrossEntropyLoss(), and this method already applies the softmax itself. As its documentation says: "The input is expected to contain the unnormalized logits for each class (which do not need to be positive or sum to 1, in general)". But the projection method in the Transformer model you built has already applied softmax. I wonder if we should output only the logits, without this softmax, to fit nn.CrossEntropyLoss()? Thank you anyway.
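A quick way to see why the input to a logits-based cross-entropy matters is to compute it by hand. The sketch below is plain Python, not PyTorch; cross_entropy here is a hypothetical stand-in that computes -log_softmax(input)[target] for a single example, and the toy logits are made up. It compares feeding raw logits, log-probabilities, and probabilities.

```python
import math

def log_softmax(z):
    """Numerically stable log-softmax of one vector."""
    peak = max(z)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in z))
    return [v - log_sum_exp for v in z]

def softmax(z):
    return [math.exp(v) for v in log_softmax(z)]

def cross_entropy(inputs, target):
    """Cross-entropy on unnormalized inputs for one example: -log_softmax(inputs)[target]."""
    return -log_softmax(inputs)[target]

logits = [2.0, -1.0, 0.5]
target = 0

loss_from_logits = cross_entropy(logits, target)
loss_from_log_probs = cross_entropy(log_softmax(logits), target)
loss_from_probs = cross_entropy(softmax(logits), target)

print(loss_from_logits, loss_from_log_probs, loss_from_probs)
```

In this sketch, applying log-softmax twice changes nothing, so log-probabilities give the same loss as raw logits, while plain softmax probabilities flatten the distribution and give a different, larger loss. Returning raw logits from the projection is the least surprising option either way.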
Please provide a playlist with the sequence showing where to start. If I want to go through all of your videos, where do I start?
Thanks for the suggestion. Will make one
@@umarjamilai Thanks a lot.
For the matrices w_k, w_q, w_v and w_o in the MultiHeadAttention module, why did you not set bias=False? Don't you need that for this to work properly?
I love your video! Just a question regarding line 20 on GitHub, in the forward method of the LayerNorm class:
line 20: return self.alpha * (x - mean) / (std + self.eps) + self.bias
In the formula we have (x - mean) / sqrt(variance + self.eps).
So here we're doing (std + self.eps), which is good, but we're adding the epsilon directly, whereas in the slides' formula the epsilon is under the square root. Won't our very small epsilon, which isn't square-rooted, cause any issues?
The epsilon is just a small constant that never changes, so you don't need to explicitly take a sqrt of it. It's just there so the denominator is never zero, which would cause an error, and it is small enough not to impact the result.
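The two placements of epsilon can be compared numerically. The following plain-Python sketch uses made-up function names and inputs and the population variance for simplicity; it is not the repository's code. It normalizes the same vector with (std + eps) and with sqrt(variance + eps).

```python
import math

def normalize_std_plus_eps(x, eps=1e-6):
    """(x - mean) / (std + eps), the form used in the quoted line 20."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (math.sqrt(var) + eps) for v in x]

def normalize_sqrt_var_plus_eps(x, eps=1e-6):
    """(x - mean) / sqrt(variance + eps), the form in the slides."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

x = [0.5, -1.2, 3.3, 0.7]
a = normalize_std_plus_eps(x)
b = normalize_sqrt_var_plus_eps(x)
max_diff = max(abs(p - q) for p, q in zip(a, b))
print(max_diff)  # tiny compared with the normalized values

# A constant vector has zero variance; both forms still avoid dividing by zero
constant = [2.0, 2.0, 2.0]
print(normalize_std_plus_eps(constant), normalize_sqrt_var_plus_eps(constant))
```

For inputs with a non-trivial spread the two forms agree to several decimal places. They only differ noticeably when the variance itself is of the same order as epsilon.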
Thanks so much for such a great video. I really liked it a lot. I have a small query. For ResidualConnection, the equation in the paper is "LayerNorm(x + Sublayer(x))". In the code, we have: x + self.dropout(sublayer(self.norm(x))). Why is it not self.norm(self.dropout(x + sublayer(x)))?
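The two orderings are usually called post-norm (the paper's LayerNorm(x + Sublayer(x))) and pre-norm (normalize first, then add the residual, as in the code quoted above). The plain-Python sketch below contrasts them on a toy vector. Dropout is omitted, the layer norm has no learned parameters, and the sublayer is a made-up function, so this is only an illustration of the ordering, not the video's module.

```python
import math

def layer_norm(x, eps=1e-6):
    """Parameter-free layer norm over one feature vector."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    return [(v - mean) / (std + eps) for v in x]

def pre_norm_residual(x, sublayer):
    """x + sublayer(norm(x)): the ordering in the quoted code, dropout omitted."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm_residual(x, sublayer):
    """norm(x + sublayer(x)): the ordering written in the paper."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def toy_sublayer(v):
    # Hypothetical stand-in for attention or feed-forward
    return [2.0 * e for e in v]

x = [1.0, 2.0, 4.0]
pre = pre_norm_residual(x, toy_sublayer)
post = post_norm_residual(x, toy_sublayer)

print(pre, sum(pre) / len(pre))     # keeps the mean of x: the output is not normalized
print(post, sum(post) / len(post))  # mean is approximately zero
```

In the pre-norm form the residual path carries x through unchanged, so the block's output is not itself normalized. That is one plausible reason implementations using this ordering add a final layer norm after the last block.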
Great video! Where is the residual connection calculated?