Pytorch Transformers from Scratch (Attention is all you need)

  • Published 31 May 2024
  • In this video we read the original transformer paper "Attention is all you need" and implement it from scratch!
    Attention is all you need paper:
    arxiv.org/abs/1706.03762
    A good blogpost on Transformers:
    www.peterbloem.nl/blog/transfo...
    ❤️ Support the channel ❤️
    / @aladdinpersson
    Paid Courses I recommend for learning (affiliate links, no extra cost for you):
    ⭐ Machine Learning Specialization bit.ly/3hjTBBt
    ⭐ Deep Learning Specialization bit.ly/3YcUkoI
    📘 MLOps Specialization bit.ly/3wibaWy
    📘 GAN Specialization bit.ly/3FmnZDl
    📘 NLP Specialization bit.ly/3GXoQuP
    ✨ Free Resources that are great:
    NLP: web.stanford.edu/class/cs224n/
    CV: cs231n.stanford.edu/
    Deployment: fullstackdeeplearning.com/
    FastAI: www.fast.ai/
    💻 My Deep Learning Setup and Recording Setup:
    www.amazon.com/shop/aladdinpe...
    GitHub Repository:
    github.com/aladdinpersson/Mac...
    ✅ One-Time Donations:
    Paypal: bit.ly/3buoRYH
    ▶️ You Can Connect with me on:
    Twitter - / aladdinpersson
    LinkedIn - / aladdin-persson-a95384153
    Github - github.com/aladdinpersson
    OUTLINE:
    0:00 - Introduction
    0:54 - Paper Review
    11:20 - Attention Mechanism
    27:00 - TransformerBlock
    32:18 - Encoder
    38:20 - DecoderBlock
    42:00 - Decoder
    46:55 - Putting it together to form The Transformer
    52:45 - A Small Example
    54:25 - Fixing Errors
    56:44 - Ending

COMMENTS • 316

  • @AladdinPersson
    @AladdinPersson  3 years ago +63

    Here's the outline for the video:
    0:00 - Introduction
    0:54 - Paper Review
    11:20 - Attention Mechanism
    27:00 - TransformerBlock
    32:18 - Encoder
    38:20 - DecoderBlock
    42:00 - Decoder
    46:55 - Forming The Transformer
    52:45 - A Small Example
    54:25 - Fixing Errors
    56:44 - Ending

    • @alhasanalkhaddour434
      @alhasanalkhaddour434 3 years ago

      First, thanks for this amazing video, but I have one question regarding the implementation of SelfAttention.
      To distribute the values, keys and queries across heads you just reshape the input, while the original paper suggests projecting them with trainable matrices.
      Am I right, or did I miss something?

    • @feravladimirovna1044
      @feravladimirovna1044 3 years ago

      @@alhasanalkhaddour434 Yes, I think he did the projection already using self.values, self.keys, self.queries, since these are linear layers. The real inputs come from the parameters passed to the forward function; see 14:43 for more details.

    • @riyajatar6859
      @riyajatar6859 2 years ago

      Why did you use self.values, self.keys in the __init__ method? They are not used at all in forward.

    • @devstuff2576
      @devstuff2576 2 years ago

      It would be far better if you coded with an illustration of the architecture on the side.

    • @somayehseifi8269
      @somayehseifi8269 1 year ago

      Sorry, can you share the GitHub link for this particular code?

  • @errrust
    @errrust 3 years ago +246

    Attention is not all we need, this video is all we need

  • @pratikhmanas5486
    @pratikhmanas5486 3 years ago +153

    I haven't found a tutorial this detail-oriented. Now I am completely able to understand the Transformer and the attention mechanism. Great work, thank you 😊

    • @AladdinPersson
      @AladdinPersson  3 years ago +11

      I really appreciate you saying that, thanks a lot :)

    • @NICe-wm9xn
      @NICe-wm9xn 9 months ago +1

      @@AladdinPersson Hi! You missed one error in your video. In your GitHub code, you have `self.values = nn.Linear(embed_size, embed_size)`, but in your video, you used `self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)`. I couldn't reproduce your results until I noticed this discrepancy.

  • @bhargav7476
    @bhargav7476 3 years ago +31

    I watched 3 Transformer videos before this one and thought I would never understand it. Love the way you explained such a complicated topic.

  • @nishantraj95
    @nishantraj95 2 years ago +34

    This is undoubtedly one of the best transformer implementation videos I have seen. Thanks for posting such good content. Looking forward to seeing some more paper implementation videos.

  • @TheClaxterix
    @TheClaxterix 2 years ago

    This is one of the best paper-to-code explanation videos I've watched in a long time! Congrats, Aladdin dude!

  • @niranjansitapure4740
    @niranjansitapure4740 1 year ago +2

    I have been struggling to implement and understand custom transformer code from various sources. This was perhaps one of the best tutorials.

  • @cc-to2jn
    @cc-to2jn 2 years ago +9

    You're an absolute saint. I don't know if I can even put into words the amount of respect and appreciation I have for you, man! Thank you!

  • @chefmemesupreme
    @chefmemesupreme 1 year ago +3

    This is cool. It would be helpful to have a section highlighting which parts of the dimensions should be changed if you are using a dataset of a different size or you want to change the input length, i.e. keeping the architecture constant but noting how it could be used flexibly.

  • @FawzyBasily
    @FawzyBasily 1 year ago +1

    Many thanks to you for this impressive tutorial, amazing job and outstanding explanation, and also thanks for sharing all these resources in the description.

  • @xiangzhang7723
    @xiangzhang7723 2 years ago

    Hi, I really like your channel. I have been learning from your tutorials for a while. Best wishes!

  • @user-qx3jn9ii7s
    @user-qx3jn9ii7s 3 years ago

    Love your work!! I was very confused by other tutorials... but your work made the Transformer clear to me. I only wish I had found you and your work sooner.

  • @sehbanomer8151
    @sehbanomer8151 3 years ago +56

    In the original paper each head should have separate weights, but in your code all heads share the same weights. Here are two steps to fix it:
    1. In __init__: self.queries = nn.Linear(self.embed_size, self.embed_size, bias=False) (same for the key and value weights)
    2. In forward: put "queries = self.queries(queries)" above "queries = queries.reshape(...)" (likewise for keys and values)
    Great video btw
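
Below is a minimal sketch of the fix described in this comment, assuming the video's SelfAttention interface (embed_size, heads, and a forward(values, keys, queries, mask) signature). The full-size projections are applied first and only then reshaped into heads, so each head gets its own slice of trainable weights; it also scales by sqrt(head_dim) as in the paper. This is an illustrative reconstruction, not the author's code.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Multi-head self-attention where Q/K/V are projected with full
    embed_size x embed_size weights before being split into heads."""

    def __init__(self, embed_size, heads):
        super().__init__()
        assert embed_size % heads == 0, "embed_size must be divisible by heads"
        self.heads = heads
        self.head_dim = embed_size // heads
        # Project the full embedding first, split into heads afterwards.
        self.queries = nn.Linear(embed_size, embed_size, bias=False)
        self.keys = nn.Linear(embed_size, embed_size, bias=False)
        self.values = nn.Linear(embed_size, embed_size, bias=False)
        self.fc_out = nn.Linear(embed_size, embed_size)

    def forward(self, values, keys, queries, mask=None):
        N, query_len = queries.shape[0], queries.shape[1]
        key_len, value_len = keys.shape[1], values.shape[1]

        # Project, then reshape to (N, len, heads, head_dim).
        queries = self.queries(queries).reshape(N, query_len, self.heads, self.head_dim)
        keys = self.keys(keys).reshape(N, key_len, self.heads, self.head_dim)
        values = self.values(values).reshape(N, value_len, self.heads, self.head_dim)

        # energy: (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale by sqrt(d_k), where d_k is the per-head dimension.
        attention = torch.softmax(energy / (self.head_dim ** 0.5), dim=3)

        # out: (N, query_len, heads, head_dim) -> (N, query_len, embed_size)
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values])
        out = out.reshape(N, query_len, self.heads * self.head_dim)
        return self.fc_out(out)
```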

    • @AladdinPersson
      @AladdinPersson  3 years ago +12

      Hey, thank you so much for bringing this to my attention ;) When reading the paper I get the same idea that you do, namely that each head should have separate weights, and blog posts like "The Annotated Transformer" do exactly what you describe. In the blog post www.peterbloem.nl/blog/transformers he explains narrow vs wide self-attention, and in his GitHub implementation he does it similarly to how I do; however, I noticed that an issue has been raised about the same point you bring up: github.com/pbloem/former/issues/13.
      And I agree with the point made there as well: if each head uses the same weights, it doesn't feel like you can say the heads are different. I'm having difficulty finding other implementations, but I will keep a close eye on this, and if I get some more time I will investigate it further. I'm also a bit surprised that this implementation trains to good results, if I remember correctly, with only 3x32x32 vs 3x256x256 parameters.

    • @sehbanomer8151
      @sehbanomer8151 3 years ago +4

      @@AladdinPersson Yes, both methods should work just fine, but I believe using separate weights for each head would give better performance without slowing down the model. It would use more memory of course, but it's almost nothing compared to the number of parameters in the feed-forward sublayers.

    • @66dp97
      @66dp97 2 years ago +2

      @@sehbanomer8151 I think your implementation may still have some issues. Since each head should have separate weights, shouldn't there be eight (number of heads) different head_dim x head_dim linear layers instead of one embed_size x embed_size linear layer? Additionally, these two implementations have different numbers of parameters.

    • @sehbanomer8151
      @sehbanomer8151 2 years ago +6

      @@66dp97 The key, query & value projection of each head maps an _embed_dim_-dimensional vector into a _head_dim_-dimensional space, so for each attention head the projection matrix has shape (head_dim, embed_dim). Fusing _n_heads_ separate linear layers into a single (embed_dim, head_dim * n_heads) linear layer is more GPU friendly, and thus faster.
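
A small sketch of the fusing argument above, with illustrative sizes: n_heads separate (embed_dim → head_dim) projections hold exactly as many weights as one fused (embed_dim → head_dim * n_heads) projection, which is why the fused layer is the GPU-friendly way to give each head its own weights.

```python
import torch.nn as nn

embed_dim, n_heads = 256, 8
head_dim = embed_dim // n_heads

# n_heads separate per-head projections, each mapping embed_dim -> head_dim.
separate = nn.ModuleList(
    [nn.Linear(embed_dim, head_dim, bias=False) for _ in range(n_heads)]
)

# One fused projection mapping embed_dim -> head_dim * n_heads.
fused = nn.Linear(embed_dim, head_dim * n_heads, bias=False)

n_sep = sum(p.numel() for layer in separate for p in layer.parameters())
n_fused = sum(p.numel() for p in fused.parameters())
print(n_sep, n_fused)  # 65536 65536 -- identical parameter count
```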

  • @geriskenderi
    @geriskenderi 3 years ago

    Compliments for the video, really gives better insight into a complex architecture. Thanks for sharing all this information.

  • @rayzhang2589
    @rayzhang2589 2 years ago

    Thank you so much!!! This really helped me deeply understand the Transformer!!!

  • @user-hm9sz4xg7s
    @user-hm9sz4xg7s 1 year ago

    This is the best description of a Transformer implementation.
    Thank you so much.
    Best regards.

  • @soorkie
    @soorkie 3 years ago

    I found this very helpful. I always used to get confused regarding the tensor sizes. Now it's all clear. Thank you very much. Also this is the first time I came across einsum. Thanks again for that too.

  • @srinivasvinnakota1747
    @srinivasvinnakota1747 1 year ago

    Dude, you rock! I bow to your expertise 🙏😊

  • @bingochipspass08
    @bingochipspass08 10 months ago

    Yeah, agreed, this was an extremely difficult architecture to implement, with a LOT of moving parts, but this has to be the best walkthrough out there. Sure, there are certain things like the src_mask unsqueeze that were a little tricky to visualize, but even barring that, you broke it down quite well! Thank you for this! I'm so glad that we have all of this implemented in HF/PT haha
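
For anyone else who found the src_mask unsqueeze tricky to visualize, here is a small sketch of the mask shapes. The token IDs and padding index are made up, and the construction is a reconstruction of the video's approach rather than a copy of it.

```python
import torch

pad_idx = 0
src = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0],
                    [1, 8, 7, 3, 4, 5, 6, 7, 2]])   # (N, src_len), one padded position

# (N, src_len) -> (N, 1, 1, src_len): broadcasts over heads and query positions
src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
print(src_mask.shape)  # torch.Size([2, 1, 1, 9])

# Target mask: lower-triangular, so position i only attends to positions <= i
N, trg_len = 2, 7
trg_mask = torch.tril(torch.ones(trg_len, trg_len)).expand(N, 1, trg_len, trg_len)
print(trg_mask.shape)  # torch.Size([2, 1, 7, 7])
```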

  • @zhengyuancui7837
    @zhengyuancui7837 1 year ago

    Great work! Really helped me. Thanks.

  • @parthasarathyk5476
    @parthasarathyk5476 2 years ago

    Superb... hats off. Thank you for the explanation.

  • @TomatoPua
    @TomatoPua 3 years ago

    Gonna try this for my uni assignment! Thank you

  • @alikhodabakhsh2653
    @alikhodabakhsh2653 10 months ago

    Excellent video, and thank you for sharing this. I have one point about the implementation: in the SelfAttention class you used (head_dim, head_dim) linear layers for the query, value and key matrices, so these matrices are shared across all heads. I think it's better to use an (embed_dim, embed_dim) matrix to map the input to the q, k, v vectors and then reshape to get the head dimension.

  • @arjunpukale3310
    @arjunpukale3310 3 years ago +1

    I want to use the encoder of the transformer for video classification, where each frame of the video is first passed through a pretrained CNN and its output acts as an embedding that is then passed as input to the encoder. Any suggestions on how to do that?

  • @joeyk2346
    @joeyk2346 2 years ago

    Great Job!! Thanks for the video!

  • @kolla_teja
    @kolla_teja 3 years ago

    Excellent work, mate, cleared all my doubts.

  • @MrWilliamducfer
    @MrWilliamducfer 3 years ago

    Very nice! Congratulations!!

  • @matteofabro4486
    @matteofabro4486 3 years ago +1

    Thank you very much for the info!

  • @garrettosborne4364
    @garrettosborne4364 3 years ago

    Great video, advanced my understanding.

  • @mykytahordia
    @mykytahordia 1 year ago

    Making something sophisticated so easy and clear is what I call magic. Aladdin, you are truly a magician.

  • @qiguosun129
    @qiguosun129 2 years ago

    First of all, thank you for the video. The most valuable thing I learned from it is how to build such a complex model step by step from the flow chart. Next, I will find out whether this self-attention model can be used for environmental pollution problems.

  • @gautamvashishtha3923
    @gautamvashishtha3923 1 year ago +3

    Great Tutorial! Thanks Aladdin

    • @jushkunjuret4386
      @jushkunjuret4386 11 months ago

      for actually training it, what would we do?

    • @gautamvashishtha3923
      @gautamvashishtha3923 11 months ago

      @@jushkunjuret4386 Can you specify where you're exactly getting stuck?

  • @fuat7775
    @fuat7775 3 years ago

    Very detailed and clear! Thank you very much!

  • @KelvinKongOfficial
    @KelvinKongOfficial 3 years ago

    This is the best way to learn, through hands on. Great video! Also may I know which font is used in this video? I noticed that your choice of font is very clean and easy to work with!

  • @marcopleines112
    @marcopleines112 2 years ago

    Thanks for your educational contribution! Just one question: what are the linear layers self.values, self.keys and self.queries for? These are not used inside the forward pass.

  • @SahilKhose
    @SahilKhose 3 years ago +2

    Hey Aladdin,
    Really amazing videos brother!
    This was the first video of yours that I stumbled upon and I fell in love with your channel.

    • @AladdinPersson
      @AladdinPersson  3 years ago +1

      Hey Sahil, I definitely need a refresher and to go through transformers again, so I'm not sure I can give you the best answer right now. From what I recall, the most important part of the masking with regards to padding is that we make sure the padded positions are not backpropagated through. We don't want the network weights, embeddings etc. to learn associations with the padded values, and that's what we are trying to accomplish by setting them to -infinity: the softmax output for those positions is then 0, and so is their gradient.
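
A tiny illustration of the point above, with made-up scores: filling the padded positions with a very large negative value before softmax drives their attention weights to (numerically) zero, so they contribute nothing to the output or to the gradient.

```python
import torch

scores = torch.tensor([[2.0, 1.0, 0.5, 0.3]])   # raw attention scores for one query
mask = torch.tensor([[1, 1, 1, 0]])             # last key position is padding

masked = scores.masked_fill(mask == 0, float("-1e20"))
weights = torch.softmax(masked, dim=-1)
print(weights)  # last weight is ~0, so the padded token is effectively ignored
```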

    • @SahilKhose
      @SahilKhose 3 years ago

      @@AladdinPersson Yeah I get the reason why we do it and the -inf setting. I had doubts with the padding that we use, I feel we need more padding to take care of the cases where both sentences are padded and then we have attention over them. I feel I have made it pretty clear in the comment above.

  • @user-oy1te7ls8x
    @user-oy1te7ls8x 1 year ago +1

    Great explanation, much more helpful than theory-only explanations.

  • @amankushwaha8927
    @amankushwaha8927 1 year ago

    Thanks Aladdin. The video helped a lot.

  • @rubelahmed3474
    @rubelahmed3474 2 years ago

    Thanks for the helpful video. Could I interpret it as: in line 244, the transformer is being trained on 'x' and predicting the last number in the 'trg' sequences? If so, how do I find which number was predicted with the highest probability/likelihood? My goal is to use a transformer for a similar task: it will be trained on a set of sequences (like in 'x') to detect the temporal relations between different events, and predict the next events for a given sequence (say, a test). Any hint on how to do that would be an immense help. Thanks for reading my comment.

  • @shahnawazrshaikh9108
    @shahnawazrshaikh9108 2 years ago

    One of the best resources on the internet!

  • @YL-nx3yk
    @YL-nx3yk 3 years ago +1

    The best video I've watched on YouTube! Why did I find you so late!!!

  • @thatipelli1
    @thatipelli1 3 years ago +1

    This is the best tutorial on Transformers online. I was able to understand the nuts and bolts of it. Kudos to you!! It would be great if you could also cover Graph Convolutional Networks from scratch.

  • @anas.2k866
    @anas.2k866 10 months ago

    In the Jay Alammar blog there is no split of the embeddings in order to compute attention for each head.

  • @deepshankarjha5344
    @deepshankarjha5344 3 years ago +1

    fantastic, awesome videos as ever.

  • @michaeldurand9309
    @michaeldurand9309 11 months ago +1

    Thank you for your video. You did a great job! I was wondering how to train a transformer if the input form is (batch_size, sequence_length, number_of_features). Let's say number_of_features = 2 (it could be X and Y coordinates in time, for example). What impact does this type of input have on positional encoding, the masking strategy and the attention mechanism?

  • @krishnachauhan2822
    @krishnachauhan2822 2 years ago

    I am not understanding, sir: is the input sequence divided into a number of chunks, like here where you did 256/8, with 8 being the number of attention heads? I thought that for self-attention the whole input embedding needs to be transformed into three parts, namely Q, K and V, and then each is divided 8 ways in the case of 8 heads; that's why the name is multi-head. Please clarify. Regards

  • @alfonsocvu
    @alfonsocvu 1 year ago

    Thanks a lot for the video, this was great and it's helping me a lot.

  • @sahilriders
    @sahilriders 3 years ago

    Thanks for the great video.
    I have one doubt though. The encoder output is fed into each decoder block. So is the last encoder block's output fed to every decoder block, or is the layer-1 encoder block's output fed to the layer-1 decoder block, and so on?

  • @marksaroufim
    @marksaroufim 2 years ago +1

    wow, the best transformer tutorial I've seen

  • @yashrathi6862
    @yashrathi6862 2 years ago +1

    Thank you, the video was very helpful. In the end we got an output of dim (2, 7, 10). So why did we get the probabilities of the next 7 words? And why is the output length dependent on the number of words we feed to the decoder?

  • @davidray6126
    @davidray6126 1 year ago

    Thanks for this amazing tutorial. I think the "energy" (Q * K_transpose) should be divided by the square root of head_dim instead of embed_size.
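
For reference, a small standalone sketch of the scaling this comment suggests: divide the energy by the square root of the per-head dimension (the paper's d_k). The shapes are illustrative.

```python
import torch

N, heads, query_len, key_len, head_dim = 2, 8, 7, 9, 32
queries = torch.randn(N, query_len, heads, head_dim)
keys = torch.randn(N, key_len, heads, head_dim)

energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
# Paper: softmax(QK^T / sqrt(d_k)), where d_k is head_dim, not the full embed_size.
attention = torch.softmax(energy / (head_dim ** 0.5), dim=3)
print(attention.shape)  # torch.Size([2, 8, 7, 9])
```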

  • @danielmaier6665
    @danielmaier6665 2 years ago

    Very good tutorial!
    Just one thing though: this is not how multi-head attention is implemented in the original "Attention is all you need" paper. In the paper the input is not split into h smaller vectors, but linearly transformed h times. So there wouldn't be a reshape and then linear(head_dim, head_dim), but rather linear(embed_size, head_dim) in each head.
    Also, you are not limited to heads * head_dim = embed_size, because in the paper the concatenated head outputs are transformed again by a jointly trained matrix WO (concatenation size x embed_size).

  • @shazm4020
    @shazm4020 2 years ago

    Thank you so much!

  • @raminsaljoughinejad6307
    @raminsaljoughinejad6307 3 years ago

    Hey man, great video. How should I remove the embedding part of this network and replace it with just an LSTM layer? I want to use this model for time series prediction and don't need nn.Embedding.

  • @lencazero4712
    @lencazero4712 10 months ago

    @Aladdin Persson. Thank you. Great lesson. Which IDE and theme did you use?

  • @learner3539
    @learner3539 2 years ago

    Great! ❤ Thanks for this masterpiece. I followed along and didn't get any errors when I ran it, since I had already noted the error in your code and updated it 😊. Waiting for your next video. This is the first video I've followed along with you. Subscribed! Bell notification on.

  • @flamingflamingo4021
    @flamingflamingo4021 3 years ago +1

    It's an extremely useful video for researchers trying to implement papers as code. Do make a series implementing the methods described in other machine learning papers as well.
    Please also make a video on using this model for an actual NLP task such as translation, etc.

    • @AladdinPersson
      @AladdinPersson  3 years ago +1

      Thank you for saying that I really appreciate it. I have made one other video on transformers for machine translation, and I will do my best to continue making videos and to cover more advanced topics! :)

    • @flamingflamingo4021
      @flamingflamingo4021 3 years ago

      @@AladdinPersson I can't seem to find it. Can you paste the link here, please? I'd truly appreciate it. :)

    • @AladdinPersson
      @AladdinPersson  3 years ago

      @@flamingflamingo4021 Yeah for sure: ua-cam.com/video/M6adRGJe5cQ/v-deo.html
      It's the last video of an unofficial series on building Seq2Seq models for the task of machine translation. The first video was a normal seq2seq, the second was seq2seq + attention, and the last video that I linked above uses transformers. These videos were inspired a lot by Bentrevett on GitHub, and I recommend you check him out as well if you're interested in NLP :)

  • @czarhada
    @czarhada 3 years ago +3

    Excellent! Thank you so much for this! Had a small request, can you please come up with videos on BERT and controlled text generation models like PPLM? Thanks again!

    • @AladdinPersson
      @AladdinPersson  3 years ago +1

      Thank you for the comment! I will look into it, got a few videos that I'm planning but will come back to this in the future for sure :)

  • @allessandroable
    @allessandroable 3 years ago +3

    Thank you! Great explanation. I just wonder why, in the attention mechanism, you have to initialize self.queries, self.keys etc. as Linear layers.

    • @ayyythatguy
      @ayyythatguy 2 years ago

      From the paper, the attention projections are fully connected, which means you should use linear layers.

  • @Decapodd
    @Decapodd 7 months ago +1

    This video is all I needed

  • @jianweitang4790
    @jianweitang4790 3 years ago +2

    I've got a question here. In order to generate a target sentence, there should be multiple time steps, right?
    The first output word from the decoder should go through the decoder again to generate the second output word.
    I can't find where you define this in the video. Or maybe I'm understanding it wrong.

    • @AladdinPersson
      @AladdinPersson  3 years ago +2

      During training everything is done in parallel (we have the entire translated target sentence) and we utilize the target masks that I talked about in the video. This is a major difference between the transformer and a normal Seq2Seq: we actually send in the entire target sentence rather than word by word. When we evaluate the model you're completely right that we need multiple time steps (one word at a time), but this is not the case during training. In this video we just build the transformer from scratch; the question you're asking is more about actually training & evaluating transformer models. I'll try to see if I can find code for what you're asking for.
      Here is a full code example of using transformers (I also have a separate video on it): github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/seq2seq_transformer.py
      When we actually evaluate the model we need to do it time step by time step, and it would look like the translate_sentence function here, which I believe is what you're asking for: github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/utils.py
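
A hedged sketch of the step-by-step decoding described in this reply, assuming a trained `model` with the `model(src, trg)` call signature used in the video and hypothetical `sos_idx`/`eos_idx` token indices; the `translate_sentence` utility linked above is the authoritative version.

```python
import torch

def greedy_translate(model, src, sos_idx, eos_idx, max_len=50, device="cpu"):
    """Decode one token at a time, feeding previous predictions back into the decoder."""
    model.eval()
    src = src.to(device)                                   # (1, src_len)
    outputs = [sos_idx]
    with torch.no_grad():
        for _ in range(max_len):
            trg = torch.tensor([outputs], device=device)   # (1, current_len)
            logits = model(src, trg)                       # (1, current_len, trg_vocab_size)
            next_token = logits[:, -1, :].argmax(dim=-1).item()
            outputs.append(next_token)
            if next_token == eos_idx:
                break
    return outputs  # token indices, including <sos> and (usually) <eos>
```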

    • @jianweitang4790
      @jianweitang4790 3 years ago

      @@AladdinPersson Thank you, the link explained it pretty well. Thanks a lot.

  • @MrTennis666666
    @MrTennis666666 3 years ago

    Why does the input embedding have 256 dimensions? (at 10:02)

  • @saadatkhan2791
    @saadatkhan2791 2 years ago +1

    Hello, This is a great explanation of transformers. I have a question. How did you know that query.shape[0] would give you the number of training examples? Why is it later used in reshaping the keys, query, and values?

    • @takihasan8310
      @takihasan8310 10 months ago +1

      The first dimension is the batch size: the model is trained on batches, and the batch size is the number of samples, so query.shape[0] gives the number of training examples. It is then reused when reshaping the keys, queries and values.
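
A small illustration of this reply, with made-up shapes: the first dimension of the query tensor is the batch size N, which is then reused when reshaping into heads.

```python
import torch

heads, head_dim = 8, 32
queries = torch.randn(2, 7, heads * head_dim)   # (batch, query_len, embed_size)

N = queries.shape[0]          # number of training examples in the batch
query_len = queries.shape[1]
queries = queries.reshape(N, query_len, heads, head_dim)
print(queries.shape)          # torch.Size([2, 7, 8, 32])
```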

  • @chrisogonas
    @chrisogonas 1 year ago

    Superb!

  • @nasirop7551
    @nasirop7551 3 years ago

    I love your tutorials

  • @MohamedAli-dk6cb
    @MohamedAli-dk6cb 2 months ago

    I got a bit confused. What you send from the encoder to the decoder, does it represent the queries and keys, or the keys and values??

  • @merlinchristy5178
    @merlinchristy5178 3 years ago

    Hi @Aladdin, can I take the attention weight values from
    model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
    out = model(x, trg[:, :-1])
    attention_weights =

    • @AladdinPersson
      @AladdinPersson  3 years ago +1

      I haven't tried it, but you should definitely be able to take the attention weights. I've seen other people do it, e.g.: github.com/bentrevett/pytorch-seq2seq/blob/master/6%20-%20Attention%20is%20All%20You%20Need.ipynb
      where he uses them to create attention maps, but I haven't personally tried that. Let me know how it goes for you.

  • @dunderinit
    @dunderinit 1 year ago

    Why do you split the embedding vector into heads instead of using the same embedding in different heads like the paper does?

  • @jihoonpark2404
    @jihoonpark2404 3 years ago

    It is one of the best Transformer videos I have ever seen! Thanks a lot!!!
    I have a question. In the paper, for multi-head attention, Wi^Q is in the domain of R^(d_model x d_k), but Wi^Q in your implementation seems to be in the domain of R^(d_model/h x d_k) because self.queries in class SelfAttention is defined as nn.Linear(head_dim, head_dim) and the queries are reshaped before coming into the linear layer. The cases of Wi^K and Wi^V are the same as the case of Wi^Q. Do I miss something? Thanks again!

  • @mohdkashif7295
    @mohdkashif7295 2 years ago

    In the DecoderBlock's forward function, why is src_mask passed to the transformer block?

  • @obiohagwu788
    @obiohagwu788 2 years ago

    Dude! you're amazing!

  • @siennypoole4366
    @siennypoole4366 2 years ago +1

    Hi everyone! I finished following this tutorial to the end... But now I am confused about how to "train" and "test/predict" with this model. Any help is appreciated! Thanks!

  • @kaustubhshete6250
    @kaustubhshete6250 3 years ago

    Exactly what I need

  • @anirband
    @anirband 3 years ago +1

    Very nice video. I have a question.
    The positional encoding you used is different from the one in the paper, where they use sin/cos functions of the word position and vector index. It seems that in your code these positional embeddings are trained, unlike in the paper. Do you have code for how positional encoding is done in the paper?

    • @AladdinPersson
      @AladdinPersson  3 years ago +1

      Yes, you're right about that; if I recall correctly I mentioned it in the video, but I could have missed it. There have been other questions about this as well, so I might try to implement positional encoding too, but as of right now I have not.
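
For anyone who wants the paper's fixed sin/cos positional encoding instead of the learned positional embedding used in the video, here is a minimal sketch (not the author's code; it assumes an even embed_size):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, embed_size):
    """Fixed sin/cos positional encodings from the paper, shape (max_len, embed_size)."""
    pe = torch.zeros(max_len, embed_size)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, embed_size, 2).float() * (-math.log(10000.0) / embed_size)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_positional_encoding(max_len=100, embed_size=256)
print(pe.shape)  # torch.Size([100, 256])
# Usage: x = word_embedding(tokens) + pe[: tokens.shape[1]]
# pe is fixed, so nothing about the positions is trained.
```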

  • @matejmnoucek2865
    @matejmnoucek2865 2 years ago

    Why do you set bias=False for nn.Linear of keys, values and queries?

  • @abdulrahmanadel8917
    @abdulrahmanadel8917 2 years ago +1

    If I'm using a transformer for a speech recognition task (speech-to-text), after training the model, what should I pass as the target parameter at prediction time if I only have an audio file (not transcribed)?

    • @popamaji
      @popamaji 2 years ago

      did you get ur answer?

  • @ScriptureFirst
    @ScriptureFirst 3 years ago +1

    45:59: WHOA! slow that down! Pause a sec, be emphatic if we're going to change something back up there

  • @avinashrai6725
    @avinashrai6725 1 year ago +2

    In SelfAttention, you have not used the linear layers self.keys, self.values, self.queries in the forward method. What's the use of those layers?

  • @adamtran5747
    @adamtran5747 2 years ago

    Love this content.

  • @beizhou2488
    @beizhou2488 3 years ago +2

    Thank you for recording and publishing such informative tutorials. Could self-attention be regarded as a replacement for the RNN, meaning that anything RNN could do can be substituted by using self-attention? If so, could you do a tutorial regarding how we can use self-attention to classify the text?

    • @AladdinPersson
      @AladdinPersson  3 years ago +3

      Yes it can. In fact many have proclaimed RNN/GRU/LSTMs are "dead" (I'm not so sure I would be that dramatic), but transformers have definitely taken over in terms of SOTA performance. I haven't personally done any projects using it to classify text so far, though.

    • @beizhou2488
      @beizhou2488 3 years ago

      @@AladdinPersson Okay. Thanks for your reply. I will give it a try and see how it goes.

  • @1potdish271
    @1potdish271 2 years ago

    Why are you not using `sin` or `cos` function for positional encoding?

  • @feravladimirovna1044
    @feravladimirovna1044 3 years ago +1

    One last question please: what is the intuitive meaning of the source and target inputs of the transformer? Why does the model take x, trg[:, :-1]?
    model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
    out = model(x, trg[:, :-1])
    What can we get from out?
    I tried
    model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
    out = model(x, trg)
    and got
    torch.Size([2, 8, 10])
    To be honest, I could not interpret that :(

    • @SahilKhose
      @SahilKhose 3 years ago

      Okay, so if I understand your question correctly,
      your doubt is why we are using trg[:, :-1] instead of trg.
      First:
      trg[:, :-1] means all the batches (sentences) and the entire sentences except the last word in each sentence.
      Second:
      We do this because of how the transformer model is trained. Unlike RNNs, our transformer model does not predict the entire output sentence at once; instead it predicts one word at a time. So the decoder takes in the output up to time step (t-1) and then predicts the word at time step t. Hence we provide the entire sentence except the last word, so as to predict the last word.
      Refer to the beautiful video by Yannic Kilcher:
      ua-cam.com/video/iDulhoQ2pro/v-deo.html
      Hope your doubt is solved. Let me know if it's still unclear.
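
A small sketch of the shifted-target setup described in this reply, using a made-up target batch (the <sos>=1, <eos>=2, pad=0 indices are illustrative):

```python
import torch

trg = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0],
                    [1, 8, 7, 3, 4, 5, 6, 7, 2]])   # (N, trg_len)

decoder_input = trg[:, :-1]   # fed to the decoder: everything except the last token
labels = trg[:, 1:]           # compared against the predictions: everything except <sos>

# out = model(src, decoder_input) -> (N, trg_len - 1, trg_vocab_size)
# position t of `out` is trained to predict labels[:, t], i.e. the next token.
print(decoder_input.shape, labels.shape)  # torch.Size([2, 8]) torch.Size([2, 8])
```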

  • @congcongzhang7740
    @congcongzhang7740 1 year ago

    Thanks for your tutorial! But there is one thing I can't resolve: is the embedding split into num_heads parts along the embed_size dimension and then passed through the linear layer, OR passed through the linear layer first and then split into 8 heads?

  • @PaAGadirajuSanjayVarma
    @PaAGadirajuSanjayVarma 3 years ago

    Thank you so much sir

  • @romajain2425
    @romajain2425 2 years ago

    Great video! But why do we add dropout after the skip connection?

  • @xanyula2738
    @xanyula2738 2 years ago +1

    I can't seem to understand the necessity for self.keys, self.queries and self.values in the SelfAttention class. Am I missing something?

  • @nicknguyen690
    @nicknguyen690 3 years ago +3

    For the dropout in your code, for example in the DecoderBlock forward, I think it should be:
    query = self.norm(x + self.dropout(attention))
    instead of:
    query = self.dropout(self.norm(attention + x))
    Here is the paper quote:
    "We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized."

    • @nicknguyen690
      @nicknguyen690 3 years ago +1

      Thanks so much for the great work!

    • @AladdinPersson
      @AladdinPersson  3 years ago +3

      I think you're right, I'll look into this some more soon and update the Github code :)

    • @user-mj3jb6js5d
      @user-mj3jb6js5d 2 years ago

      Thank you very much!

  • @teetanrobotics5363
    @teetanrobotics5363 3 years ago

    World's best YouTube channel evaaaa.

  • @ocean6709
    @ocean6709 11 months ago

    Hi, at 18:45, energy = queries * keys. Are you doing an outer product?

  • @Mesenqe
    @Mesenqe 3 years ago

    Thank you for the complete video. Excellent work. If you have time, can you do some more videos on attention maps for images, like in the "Learn to Pay Attention" paper?

  • @Astrovic1
    @Astrovic1 1 year ago

    Great video! Is the code in your GitHub repository? I can't find it there. In which folder should it be?

  • @somayehseifi8269
    @somayehseifi8269 1 year ago

    Thank you for your tutorial, I have a question: you said that in the encoder the value, key and query are all the same, whereas the paper says the value, query and key are only the same in size, not element-wise. Can you please explain that a little more?

  • @feravladimirovna1044
    @feravladimirovna1044 3 years ago

    I have a question, please, about calculating the attention: according to the formula we should divide by the square root of the key length, right? You divided by the embedding size, so here I did not understand: is it a mistake or am I missing something? Shouldn't we divide by key_len? In the paper it is mentioned that "The input consists of queries and keys of dimension dk and values of dimension dv". In the video you said that key_len and value_len are always going to be the same, whereas in the paper it's the opposite: key_len and query_len are always the same and value_len can differ.

  • @risheshgarg9990
    @risheshgarg9990 3 years ago

    I am a bit confused about the encoder block where you create the positional embedding layer. Why do we initialize that layer with the max_length parameter? Can you please explain it in more detail?

    • @AladdinPersson
      @AladdinPersson  3 years ago

      Sure! Sorry for the late response. When using positional embeddings (in contrast to positional encodings), one pro is that it's very simple: we just add an embedding layer for the positions, and this removes the transformer's permutation invariance. One con of doing it this way is that we need to restrict sentences to be within some max_length, and that's why we initialize the layer with this parameter. Essentially we won't be able to handle any sentence longer than that.
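
A small sketch of the learned positional embedding described in this reply, showing why max_length caps the sentence length (the vocabulary size and shapes are illustrative):

```python
import torch
import torch.nn as nn

max_length, embed_size, vocab_size = 100, 256, 1000
word_embedding = nn.Embedding(vocab_size, embed_size)
position_embedding = nn.Embedding(max_length, embed_size)   # one learned row per position

tokens = torch.randint(0, vocab_size, (2, 9))               # (N, seq_len); seq_len must be <= max_length
N, seq_len = tokens.shape
positions = torch.arange(0, seq_len).expand(N, seq_len)     # 0, 1, ..., seq_len - 1 for every example

x = word_embedding(tokens) + position_embedding(positions)
print(x.shape)  # torch.Size([2, 9, 256])
```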

  • @N3xUss99
    @N3xUss99 5 months ago

    I don't know if there are still people watching this, but I have a question at the code level: in the decoder's "forward" method, when I pass the parameters to the layer, the fifth one I have is "target_mask", but in the DecoderBlock you decided to put the parameter "device". Did I miss something, is it just an error, or is there another explanation? Thanks a lot

  • @fq6475
    @fq6475 3 years ago +1

    Thanks so much!

  • @afsarabenazir8558
    @afsarabenazir8558 4 months ago

    Great video! What does the forward_expansion parameter mean?

  • @Dhirajkumar-ls1ws
    @Dhirajkumar-ls1ws 2 years ago

    Great video.

  • @zhuchencao2527
    @zhuchencao2527 1 year ago

    Hi Aladdin. Very nice coding!
    But I am confused as to why, here, the kqv projections for the different heads seem to be shared. It seems like we should use nn.Linear(embed_dim, embed_dim), and later divide it into different heads?

    • @zimuzeng6577
      @zimuzeng6577 1 year ago

      Some early comments have addressed the issue.

  • @user-vd7im8gc2w
    @user-vd7im8gc2w 2 months ago

    After the decoder blocks, we have to pass the matrix through another layer with the output size set to the target vocab dimension and apply softmax to get the word probabilities, right?

    • @elvissun8844
      @elvissun8844 9 days ago

      Nope, as explained in earlier comments:
      the softmax is contained in the loss function (cross entropy);
      if you apply softmax again, it causes the gradients to diminish.
      This is what the author replied in another comment:
      "Thank you for the comment! First, you're probably going to use CrossEntropyLoss, and softmax is then included in that loss function, so you don't want softmax as the output. I have another video where we train the transformer model on a translation task, although to simplify I used PyTorch's built-in transformer modules (but you can use the ones we implemented).
      The shapes for sending in to cross entropy can be tricky, but let's first understand the input shapes to cross entropy by looking at something like MNIST, where it takes (N, 10) for the outputs and the targets are simply (N). In this case you will reshape so that you have (N*seq_length, vocab_size) and (N*seq_length), essentially viewing every time step as its own example.
      Here is the code for that transformer model I talked about (I also have a separate video if something feels confusing), which you might want to take a look at: github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/seq2seq_transformer.py
      I haven't done any tests, but I would imagine PyTorch's built-in transformer is faster, so I would follow the other video when you actually want to train a model; this video is more about understanding the transformer."
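
A small sketch of the loss computation described in the quoted reply, with illustrative shapes: the logits are reshaped to (N*seq_length, vocab_size), the targets to (N*seq_length), and nn.CrossEntropyLoss applies the softmax internally.

```python
import torch
import torch.nn as nn

N, seq_len, trg_vocab_size, pad_idx = 2, 7, 10, 0
logits = torch.randn(N, seq_len, trg_vocab_size)            # e.g. model(x, trg[:, :-1])
targets = torch.randint(1, trg_vocab_size, (N, seq_len))    # e.g. trg[:, 1:]

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)       # softmax happens inside the loss
loss = criterion(logits.reshape(N * seq_len, trg_vocab_size),
                 targets.reshape(N * seq_len))
print(loss.item())
```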