Multi Head Attention in Transformer Neural Networks with Code!

  • Published 4 Jun 2024
  • Let's talk about multi-head attention in transformer neural networks
    Let's understand the intuition, math and code of Self Attention in Transformer Neural Networks
    ABOUT ME
    ⭕ Subscribe: ua-cam.com/users/CodeEmporiu...
    📚 Medium Blog: / dataemporium
    💻 Github: github.com/ajhalthor
    👔 LinkedIn: / ajay-halthor-477974bb
    RESOURCES
    [ 1🔎] Code for video: github.com/ajhalthor/Transfor...
    [2 🔎] Transformer Main Paper: arxiv.org/abs/1706.03762
    [3 🔎] Bidirectional RNN Paper: deeplearning.cs.cmu.edu/F20/d...
    PLAYLISTS FROM MY CHANNEL
    ⭕ ChatGPT Playlist of all other videos: • ChatGPT
    ⭕ Transformer Neural Networks: • Natural Language Proce...
    ⭕ Convolutional Neural Networks: • Convolution Neural Net...
    ⭕ The Math You Should Know: • The Math You Should Know
    ⭕ Probability Theory for Machine Learning: • Probability Theory for...
    ⭕ Coding Machine Learning: • Code Machine Learning
    MATH COURSES (7 day free trial)
    📕 Mathematics for Machine Learning: imp.i384100.net/MathML
    📕 Calculus: imp.i384100.net/Calculus
    📕 Statistics for Data Science: imp.i384100.net/AdvancedStati...
    📕 Bayesian Statistics: imp.i384100.net/BayesianStati...
    📕 Linear Algebra: imp.i384100.net/LinearAlgebra
    📕 Probability: imp.i384100.net/Probability
    OTHER RELATED COURSES (7 day free trial)
    📕 ⭐ Deep Learning Specialization: imp.i384100.net/Deep-Learning
    📕 Python for Everybody: imp.i384100.net/python
    📕 MLOps Course: imp.i384100.net/MLOps
    📕 Natural Language Processing (NLP): imp.i384100.net/NLP
    📕 Machine Learning in Production: imp.i384100.net/MLProduction
    📕 Data Science Specialization: imp.i384100.net/DataScience
    📕 Tensorflow: imp.i384100.net/Tensorflow
    TIMESTAMPS
    0:00 Introduction
    0:33 Transformer Overview
    2:32 Multi-head attention theory
    4:35 Code Breakdown
    13:47 Final Coded Class

COMMENTS • 82

  • @Dhanush-zj7mf • 8 months ago +14

    We are very much fortunate to have all this for free. Thank You.

  • @hackie321 • 10 days ago

    Wow. You also put background music. Great work!!
    Sun rays falling on your face. Felt like God himself is teaching us Transformers.

  • @barni_7762 • 1 year ago +10

    Wow! I have watched a few other transformer explanation videos (they were shorter and yet tried to cover more content) and I honestly didn't understand anything. Your video, on the other hand, was crystal clear, and not only do I now understand how every part works, but I also have an idea of WHY it is there. Also, you were super specific about the details that are otherwise left out. Great work!

  • @romainjouhameau2764 • 1 year ago +4

    Very well explained. I really enjoy this mix between explanations and your code examples.
    Your videos are the best resources to learn about transformers.
    Really thankful for your work! Thanks a lot

  • @ulassbingol • 6 months ago +1

    This was one of the best explanations of multi-head attention. Thanks for your effort.

  • @user-fe2mj9ze5v • 7 months ago +1

    Great work. One of the clearest explanations ever of multi-head attention.

  • @vio_tio12 • 2 months ago +1

    Good job Ajay! Best explanation I have seen so far!

  • @vivekmettu9374 • 11 months ago

    Absolutely loved your explanation. Thank you for contributing!!

  • @amiralioghli8622 • 8 months ago

    Thank you so much for taking the time to code and explain the transformer model in such detail; I followed your series from zero to hero. You are amazing, and if possible, please do a series on how transformers can be used for time series anomaly detection and forecasting. It is extremely needed on YouTube!

  • @thanhtuantran7926 • 1 month ago

    I literally understood all of it, thank you so much

  • @user-wr4yl7tx3w • 1 year ago +5

    exactly the type of content needed. thanks

    • @CodeEmporium • 1 year ago

      You are so welcome! Thanks for watching!

  • @ajaytaneja111 • 1 year ago +3

    Ajay, I'm currently on a holiday and was watching your Transformer videos on my mobile whilst taking my evening coffee with my mom! And I have been doing this for the past 3 to 4 days. Today my mom, who seemed so impressed with your oratory skills, asked me if I could also lecture on a subject as spontaneously as the Ajay in the video?! Now you've started giving me a complex, dude! Ha ha.

    • @CodeEmporium • 1 year ago +1

      Hahahaha. Thanks to you and your mom for the kind words! And sorry for the tough spot :) Maybe you should show her some of your blogs since you’re pretty good at writing yourself

  • @rajv4509 • 1 year ago

    Brilliant stuff! Thanks for the time & effort you have put in to create these videos ... dhanyavadagalu :)

  • @DailySFY • 2 months ago

    Thank you for all the effort you have put in!!

  • @user-mo2wj2zu5d • 1 year ago +1

    Exactly the content I needed. Thanks very much.

  • @SarvaniChinthapalli • 1 month ago

    Great lecture..Thank you so much for this video.. Great resource..

  • @ayoghes2277 • 1 year ago

    Thank you for making this video Ajay !!

    • @CodeEmporium • 1 year ago

      My pleasure! Hope you enjoy the rest of the series!

  • @DouglasASean • 1 year ago

    Thanks for your work, much needed right now.

  • @prashantlawhatre7007 • 1 year ago +2

    5:44, we should also set `bias=False` in nn.Linear().
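
    A minimal PyTorch sketch of that suggestion, assuming the qkv projection from the video and d_model = 512 (names and sizes are illustrative):

      import torch.nn as nn

      d_model = 512                                    # assumed model dimension
      # Project embeddings to the concatenated q, k, v without a bias term,
      # as the comment suggests.
      qkv_layer = nn.Linear(d_model, 3 * d_model, bias=False)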

  • @jiefeiwang5330 • 11 months ago +4

    Really nice explanation! Just a small catch: at 13:25 I believe you need to permute the variable "values" from size [1, 8, 4, 64] to [1, 4, 8, 64] before reshaping it (Line 71). Otherwise, you are combining the same head slice from multiple words, rather than combining multiple heads of the same word.
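
    A short sketch of the reordering being described, with the shapes mentioned above ([batch, heads, seq_len, head_dim] = [1, 8, 4, 64]); the permute groups all heads of the same word before flattening:

      import torch

      values = torch.randn(1, 8, 4, 64)         # [batch, heads, seq_len, head_dim]
      values = values.permute(0, 2, 1, 3)       # [1, 4, 8, 64]: heads of the same word together
      values = values.reshape(1, 4, 8 * 64)     # [1, 4, 512]: concatenate each word's heads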

  • @prashantlawhatre7007 • 1 year ago +1

    ❤❤ Loving this series on Transformers.

    • @CodeEmporium • 1 year ago

      Thanks so much for commenting and watching! I really appreciate it

  • @prasenjitgiri919 • 1 year ago

    ty for the effort you have put in, much appreciated, but will you please explain the start token? It leaves an understanding gap for me.

  • @saikiranbondi6868 • 1 year ago

    You are wonderful, my brother. Your way of explaining is so good.

  • @surajgorai618 • 1 year ago

    Very rich content as always.. Thanks for sharing

    • @CodeEmporium • 1 year ago

      Thanks so much for commenting and watching!

  • @simonebonato5881 • 9 months ago

    Outstanding video and clear explanation!

    • @CodeEmporium • 9 months ago

      Thanks so much! Real glad this is helpful

  • @raphango • 1 year ago

    Thanks very much again! 😄

  • @chenmargalit7375 • 10 months ago +1

    Hi, thanks for the great series! Something I don't understand and I'd love to hear your opinion on:
    You say the initial input is a one-hot encoded vector which is the size of the sequence length. Let's say my vocab is 1000 (all the words I want to support) and the sequence length is 30. How do I represent one word out of 1000 in a 30-length sequence vector? The index where I put the 1 will not be correct, as it might actually be at position 500 in the real vocab tensor.

  • @paull923 • 1 year ago +2

    interesting and useful

  • @yanlu914 • 1 year ago

    After getting the values, I think we should permute them first, like before, and then reshape.

  • @davefaulkner6302 • 5 days ago

    Thanks for your efforts to explain a complicated subject. A couple of questions: did you intentionally skip the Layer Normalization, or did I miss something? Also, the final linear layer in the attention block has dimension 512 x 512 (input size, output size). Does this mean that each token (logit?) output from the attention layer is passed token-by-token through the linear layer to create a new set of tokens, that set being the size of the token sequence length? This connection between the attention output and the linear layer is baffling me. The output of the attention layer is (sequence-length x transformed-embedding-length), or (4 x 512), ignoring the batch dimension in the tensor. Yet the linear layer accepts a (1 x 512) input and yields a (1 x 512) output. So is each (1 x 512) token in the attention layer's output sequence passed one at a time through the linear layer? And does this imply that the same linear layer is used for all tokens in the sequence?
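
    On the last question: in PyTorch, nn.Linear acts on the final dimension and broadcasts over the leading ones, so the same 512 x 512 weights are applied to every position in the sequence. A minimal sketch, assuming the shapes in the question:

      import torch
      import torch.nn as nn

      attention_out = torch.randn(1, 4, 512)    # [batch, seq_len, d_model]
      out_proj = nn.Linear(512, 512)            # one shared weight matrix
      out = out_proj(attention_out)             # applied independently to each of the 4 tokens
      print(out.shape)                          # torch.Size([1, 4, 512])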

  • @ivantankoua9286 • 8 months ago

    Thanks!

  • @kollivenkatamadhukar5059 • 7 months ago

    Where can I get the theory part of it? It is good that you are explaining the code part; can you share any link where we can read the theory part as well?

  • @creativeuser9086 • 1 year ago

    What about the weights for K, V, Q for each head, as well as the output?

  • @stanislavdidenko8436 • 1 year ago +3

    Maybe you have to divide 1536 by 3 first, and then by 8. But you do it by 8 first and then by 3, which sounds like you mix up the q, k, v vector dimensions.

    • @oussamawahbi4976 • 1 year ago

      Good point, but I think because the parameters that generate q, k, v are learned, it doesn't matter which you divide by first. I could be wrong though.
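
      A sketch of the two splitting orders being discussed, assuming the video's sizes (seq_len = 4, 8 heads of 64, so the projection output width is 3 * 512 = 1536). Since the projection weights are learned, either grouping can work as long as it is read back consistently:

        import torch

        qkv = torch.randn(1, 4, 1536)             # [batch, seq_len, 3 * num_heads * head_dim]

        # Split into heads first, then chunk into q, k, v (the video's order).
        x = qkv.reshape(1, 4, 8, 3 * 64)          # [1, 4, 8, 192]
        q1, k1, v1 = x.chunk(3, dim=-1)           # each [1, 4, 8, 64]

        # Alternative: chunk into q, k, v first, then split into heads.
        q2, k2, v2 = qkv.chunk(3, dim=-1)         # each [1, 4, 512]
        q2 = q2.reshape(1, 4, 8, 64)              # [1, 4, 8, 64]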

  • @seddikboudissa8668 • 2 months ago

    Hello, good job, but I have a small misunderstanding: in the transformer paper they computed a different key and query projection for each head, whereas here you are splitting the key and query so that each head takes one split. What's the difference between the two approaches?

  • @yusuke.s2551 • 16 days ago

    Can you please make the notebooks in the repo accessible again? Most of them are not accessible right now. Thank you in advance!

  • @pi5549 • 1 year ago +1

    5:05 Why separate variables for input_dim (embedding dimension IIUC) and d_model? Aren't these always going to be the same? Would we ever want this component to spit out a contextualized-wordVector that's a different length from the input wordVector?

    • @oussamawahbi4976 • 1 year ago

      I have the same question, and I assume that most of the time input_dim should equal d_model in order to have a consistent vocabulary between the input and the output

    • @ShawnMorel • 1 year ago

      My understanding is that it sets you up to be able to choose different hyperparameters, e.g. if you want a smaller input word-embedding size but a larger internal representation. Table 3 of the original transformers paper shows a few different combinations of these parameters: arxiv.org/pdf/1706.03762.pdf
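
      A rough sketch of how keeping the two hyperparameters separate might look in a constructor (names are illustrative, not the video's exact code):

        import torch.nn as nn

        class MultiHeadAttention(nn.Module):
            def __init__(self, input_dim: int, d_model: int, num_heads: int):
                super().__init__()
                # input_dim: size of the incoming word vectors; d_model: internal width.
                # They are often equal, but keeping both lets them differ.
                self.qkv_layer = nn.Linear(input_dim, 3 * d_model)
                self.out_layer = nn.Linear(d_model, d_model)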

  • @superghettoindian01 • 1 year ago

    You are incredible, I've seen a good chunk of your videos and wanted to thank you from the bottom of my heart! With your content I feel like maybe even an idiot like me can understand it (one day, maybe? 🤔)!
    I hope you enjoy a lot of success!

    • @CodeEmporium • 1 year ago

      Super kind words. Thank you so much! I'm sure you aren't an idiot, and I hope we can all learn together!

  • @tonywang7933 • 1 year ago

    At 4:57 d_model is 512, and so is input_dim. But at 14:23 input_dim is 1024. I thought they should be the same number; are you saying you reduce the dimension of the input to the dimension of the model by some compression technique like PCA?
    At 14:23, it looks like input_dim is only used at the very beginning; once we are in the model, the input dimension is shrunk to 512.

    • @jubaerhossain1865 • 10 months ago +1

      It's not PCA. It's a dimension conversion by weight matrix multiplication. For example, to map (1x1024) -> (1x512), we need a weight matrix of size 1024x512... This is just an example, not the actual scenario demonstrated here.
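
      A tiny sketch of that point: the 1024 -> 512 mapping is a learned linear layer (a 1024 x 512 weight matrix trained with the rest of the model), not PCA. Sizes are the ones mentioned above:

        import torch
        import torch.nn as nn

        x = torch.randn(1, 4, 1024)      # [batch, seq_len, input_dim]
        proj = nn.Linear(1024, 512)      # learned weight matrix
        print(proj(x).shape)             # torch.Size([1, 4, 512])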

  • @Slayer-dan • 1 year ago +5

    You never Never disappoint bro. Vielen vielen dank!

    • @CodeEmporium • 1 year ago +1

      Thanks for the kind words and the support :)

  • @fayezalhussein7115 • 1 year ago

    Please could you explain how to implement a hybrid model (vision transformer + CNN) for an image classification task?

  • @xdhanav5449 • 3 months ago

    Wow, this is a very intuitive explanation! I have a question though. From my understanding, the attention aids the encoder and decoder blocks in the transformer to understand which words that came either before or after (sometimes) will have a strong impact on the generation of the next word, through the feedforward neural network and other processes. Given a sentence like "The cook is always teaching the assistant new techniques and giving her advice.", what is a method I could implement to determine the pronoun-profession relationships to understand that cook is not paired with "her", rather "assistant" is. I have tried two methods so far. 1. Using the pretrained contextual embeddings from BERT. 2. (relating to this video) I thought that I could almost reverse engineer the attention methods by creating an attention vector to understand what pair of pronoun-professions WOULD be relevant, through self attention. However, this method did not work as well (better than method 1) and I believe this is because the sentence structures are very nuanced, so I believe that the attention process is not actually understanding the grammatical relationships between words in the sentence. How could I achieve this: a method that could determine which of the two professions in a sentence like above are referenced by the pronoun. I hope you can see why I thought that using an attention matrix would be beneficial here because the attention would explain which profession was more important in deciding whether the pronoun would be "he" or "her". This is a brief description of what I am trying to do, so if you can, I could elaborate more about this over email or something else. Thank you in advance for your help and thanks a million for your amazing explanations of transformer processes!

    • @xdhanav5449 • 3 months ago

      I would like to add additionally that in my approach of using attention, I don't actually create query, key, value vectors. I take the embeddings, do the dot product, scale it, and use softmax to convert it into a probability distribution. Possibly this is where my approach goes wrong. The original embeddings of the words in the sentence are created from BERT, so there should already be positional encoding and other relevant things for embeddings.

  • @creativityoverload2049 • 8 months ago

    From what I understood, query, key and value are representations of the embedded word after positional embedding, each with a different purpose. But why are we dividing them into multiple heads in the first place, with 64 dimensions each, when we could just have 1 head with 512-dimensional q, k, v and then perform self attention? Even if we use multiple heads to increase context, wouldn't 8 different 512-dimensional vectors for each of q, k, v, performing self attention on each and combining them later, give us a more accurate result? I mean to say: why does the 512-dimensional representation of a word get 64-dimensional q, k, v per head?
    Someone please explain this.

  • @josephfemia8496 • 1 year ago +1

    Hello, I was wondering what the actual difference is between key and value. I'm a bit confused about the difference between "what I can offer" vs "what I actually offer".

    • @yashs761 • 1 year ago +3

      This is a great video that might help you build intuition behind the difference of query, key and value. I've linked the exact timestamp: ua-cam.com/video/QvkQ1B3FBqA/v-deo.html

    • @ShawnMorel • 1 year ago

      First, remember that what we're trying to learn is Q-weights, K-weights, V-weights such that
      - input-embedding * Q-weights = Q (a vector that can be used as a query)
      - input-embedding * K-weights = K (a vector that can be used as a key)
      - input-embedding * V-weights = V (a vector that can be used as a value)
      Linguistic / grammar intuition: let's assume that we had those Q, K and V, and we wanted to search for content for some query Q. How might we do that grammatically?
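
      A compact sketch of the projections described in this reply, plus how the resulting q, k, v are used in scaled dot-product attention (dimensions are illustrative):

        import torch
        import torch.nn as nn

        d_model = 512
        x = torch.randn(1, 4, d_model)                  # input embeddings [batch, seq, d_model]

        w_q = nn.Linear(d_model, d_model, bias=False)   # learned Q-weights
        w_k = nn.Linear(d_model, d_model, bias=False)   # learned K-weights
        w_v = nn.Linear(d_model, d_model, bias=False)   # learned V-weights
        q, k, v = w_q(x), w_k(x), w_v(x)

        # Keys are matched against queries to get attention weights, which mix the values.
        scores = torch.softmax(q @ k.transpose(-2, -1) / d_model ** 0.5, dim=-1)
        out = scores @ v                                # [1, 4, 512]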

    • @healthertsy1863 • 8 months ago +1

      @yashs761 Thank you so much, this video has helped me a lot! The lecturer is brilliant!

  • @kaitoukid1088 • 1 year ago +3

    Are you a full-time creator or do you work on AI while making digital content?

    • @CodeEmporium • 1 year ago +7

      The latter. I have a full time job as a machine learning engineer. I make content like this on the side for now :)

    • @trevorthieme5157 • 1 year ago

      ​@CodeEmporium How complex is the work you do with the AI VS. what you teach us here? Would you say it's harder to code by far or is it mostly just scaling up, reformatting, and sorting data to train the models?

    • @Stopinvadingmyhardware • 1 year ago

      @CodeEmporium Are you able to disclose your employer’s name?

  • @pi5549 • 1 year ago +1

    14:40 Your embedding dimension is 1024. So how come qkv.shape[-1] is 3x512 not 3x1024?

    • @oussamawahbi4976 • 1 year ago +1

      qkv is the result of the qkv_layer, which takes embeddings of size 1024 and has 3*d_model = 3*512 neurons; therefore the output of this layer will be of dimension (batch_size, seq_length, 3*512).
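
      In other words (a minimal sketch with the sizes from the question):

        import torch
        import torch.nn as nn

        input_dim, d_model = 1024, 512
        qkv_layer = nn.Linear(input_dim, 3 * d_model)   # 1024 in, 1536 out
        qkv = qkv_layer(torch.randn(1, 4, input_dim))
        print(qkv.shape[-1])                            # 1536 == 3 * 512, not 3 * 1024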

  • @wishIKnewHowToLove • 1 year ago

    thx)

    • @CodeEmporium • 1 year ago

      You are very welcome! Hope you enjoy your stay on the channel :)

  • @physicsphere • 1 year ago

    I just started with AI/ML a few months ago. Can you guide me on what I should learn to get a job? I like your videos.

    • @CodeEmporium • 1 year ago

      Nice! There are many answers to this, but to keep it short and effective, I would say know your fundamentals. This could be just picking one regression model (like linear regression) and understanding exactly how it works and why it works. Do the same for one classification model (like logistic regression). Look at both from the lens of code, math, and real-life problems.
      I think this is a good starting point for now. Honestly, it doesn’t exactly matter where you start as long as you start and don’t stop. I’m sure you’ll succeed!
      That said, if you are interested in the content I mentioned earlier, I should have some playlists with titles “Linear Regression “ and “Logistic Regression”. So do check them out if / when you’re interested. Hope this helps.

    • @physicsphere • 1 year ago +1

      @CodeEmporium Thanks for the reply. Sure, I will check. I am going to do some work using transformers. Your videos really help, especially the coding demonstrations...

  • @suchinthanawijesundara6464 • 1 year ago

    ❤❤

  • @stanislavdidenko8436 • 1 year ago

    Priemlemo! (Acceptable!)

  • @Handelsbilanzdefizit • 1 year ago

    But why do they do this multi-head thing? Is it to reduce computational cost? 8*(64²) < 512²

  • @kartikpodugu • 3 months ago

    I have two doubts:
    1. How are Q, K, V calculated from the input text?
    2. How are Q, K, V calculated for multiple heads?
    Can you elaborate or point me to a proper resource?

    • @naveenpoliasetty954 • 2 months ago

      Word embeddings are fed into separate linear layers (fully connected neural networks) to generate the Q, K, and V vectors. These layers project the word embeddings into a new vector space specifically designed for the attention mechanism within the transformer architecture.
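
      For the second doubt (how each head gets its own q, k, v), one common pattern, roughly what the video's code does, is a single projection whose output is then reshaped into heads. A hedged sketch with assumed sizes:

        import torch
        import torch.nn as nn

        d_model, num_heads = 512, 8
        head_dim = d_model // num_heads                   # 64
        x = torch.randn(1, 4, d_model)                    # word embeddings [batch, seq, d_model]

        qkv = nn.Linear(d_model, 3 * d_model)(x)          # one projection covering all heads
        qkv = qkv.reshape(1, 4, num_heads, 3 * head_dim)  # carve the last dim into heads
        q, k, v = qkv.chunk(3, dim=-1)                    # per-head q, k, v: each [1, 4, 8, 64]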

  • @thechoosen4240 • 4 months ago

    Good job bro, JESUS IS COMING BACK VERY SOON; WATCH AND PREPARE