Multi-Head Attention in Transformer Neural Networks with Code!
- Published 4 Jun 2024
- Let's talk about multi-head attention in transformer neural networks
Let's understand the intuition, math, and code of multi-head attention in transformer neural networks
ABOUT ME
⭕ Subscribe: ua-cam.com/users/CodeEmporiu...
📚 Medium Blog: / dataemporium
💻 Github: github.com/ajhalthor
👔 LinkedIn: / ajay-halthor-477974bb
RESOURCES
[1 🔎] Code for video: github.com/ajhalthor/Transfor...
[2 🔎] Transformer Main Paper: arxiv.org/abs/1706.03762
[3 🔎] Bidirectional RNN Paper: deeplearning.cs.cmu.edu/F20/d...
PLAYLISTS FROM MY CHANNEL
⭕ ChatGPT Playlist of all other videos: • ChatGPT
⭕ Transformer Neural Networks: • Natural Language Proce...
⭕ Convolutional Neural Networks: • Convolution Neural Net...
⭕ The Math You Should Know: • The Math You Should Know
⭕ Probability Theory for Machine Learning: • Probability Theory for...
⭕ Coding Machine Learning: • Code Machine Learning
MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: imp.i384100.net/MathML
📕 Calculus: imp.i384100.net/Calculus
📕 Statistics for Data Science: imp.i384100.net/AdvancedStati...
📕 Bayesian Statistics: imp.i384100.net/BayesianStati...
📕 Linear Algebra: imp.i384100.net/LinearAlgebra
📕 Probability: imp.i384100.net/Probability
OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: imp.i384100.net/Deep-Learning
📕 Python for Everybody: imp.i384100.net/python
📕 MLOps Course: imp.i384100.net/MLOps
📕 Natural Language Processing (NLP): imp.i384100.net/NLP
📕 Machine Learning in Production: imp.i384100.net/MLProduction
📕 Data Science Specialization: imp.i384100.net/DataScience
📕 Tensorflow: imp.i384100.net/Tensorflow
TIMESTAMPS
0:00 Introduction
0:33 Transformer Overview
2:32 Multi-head attention theory
4:35 Code Breakdown
13:47 Final Coded Class
We are very much fortunate to have all this for free. Thank You.
Wow. You also put a background music. Great work!!
Sun rays falling on your face. Felt like God himself is teaching us Transformers.
Wow! I have watched a few other transformer explanation videos (they were shorter and yet tried to cover more content) and I honestly didn't understand anything. Your video, on the other hand, was crystal clear, and not only do I now understand how every part works, but I also have an idea WHY it is there. You were also super specific about the details that are otherwise left out. Great work!
Very well explained. I really enjoy this mix between explanations and your code examples.
Your videos are the best resources to learn about transformers.
Really thankful for your work! Thanks a lot.
This was one of the best explanations of multi-head attention. Thanks for your effort.
Great work. One of the clearest explanations of multi-head attention ever.
Good job Ajay! Best explanation I have seen so far!
Absolutely loved your explanation. Thank you for contributing!!
Thank you so much for taking the time to code and explain the transformer model in such detail. I followed your series from zero to hero. You are amazing, and if possible, please do a series on how transformers can be used for time-series anomaly detection and forecasting. It is extremely needed on YouTube!
I literally understood all of it, thank you so much.
exactly the type of content needed. thanks
You are so welcome! Thanks for watching!
Ajay, I'm currently on holiday and was watching your Transformer videos on my mobile whilst taking my evening coffee with my mom! And I have been doing this for the past 3 to 4 days. Today my mom, seeming so impressed with your oratory skills, asked me if I could also lecture on a subject as spontaneously as the Ajay on the video! Now you've started giving me a complex, dude! Ha ha.
Hahahaha. Thanks to you and your mom for the kind words! And sorry for the tough spot :) Maybe you should show her some of your blogs since you’re pretty good at writing yourself
Brilliant stuff! Thanks for the time & effort you have put in to create these videos ... dhanyavadagalu :)
Thank you !! For all the effort you have put it.
Exactly the content I needed. Thanks very much.
You're welcome! :)
Great lecture..Thank you so much for this video.. Great resource..
Thank you for making this video Ajay !!
My pleasure! Hope you enjoy the rest of the series.
Thanks for your work, much needed right now.
At 5:44, we should also set `bias=False` in nn.Linear().
Really nice explanation! Just a small catch: at 13:25, I believe you need to permute the variable "values" from size [1, 8, 4, 64] to [1, 4, 8, 64] before reshaping it (line 71). Otherwise, you are combining the same part of a head from multiple words, rather than combining multiple parts of heads from the same word.
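The permute-then-reshape fix described in that comment can be sketched like this (a minimal PyTorch check; the [1, 8, 4, 64] shape comes from the comment, and the tensor contents are random, just for illustration):

```python
import torch

batch, num_heads, seq_len, head_dim = 1, 8, 4, 64
values = torch.randn(batch, num_heads, seq_len, head_dim)  # [1, 8, 4, 64]

# Reshaping directly concatenates the same head's slices across different words.
wrong = values.reshape(batch, seq_len, num_heads * head_dim)

# Permuting first brings the sequence dimension forward ([1, 4, 8, 64]),
# so each word's 8 head outputs are concatenated together.
right = values.permute(0, 2, 1, 3).reshape(batch, seq_len, num_heads * head_dim)

# Word 0's correct row is the concatenation of every head's output at position 0.
expected = torch.cat([values[0, h, 0] for h in range(num_heads)])
assert torch.equal(right[0, 0], expected)
assert not torch.equal(wrong[0, 0], expected)
```

Note that `reshape` works on the non-contiguous permuted tensor (it copies if needed), so no explicit `.contiguous()` call is required here.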
Yes I believe so too
❤❤ Loving this series on Transformers.
Thanks so much for commenting and watching! I really appreciate it
Thanks for the effort you have put in, much appreciated. But will you please explain the start token? It leaves an understanding gap for me.
You are wonderful, my brother. Your way of explaining is so good.
Thanks a lot :)
Very rich content as always.. Thanks for sharing
Thanks so much for commenting and watching!
Outstanding video and clear explanation!
Thanks so much! Real glad this is helpful
Thanks very much again! 😄
Hi, thanks for the great series! Something I don't understand, and I'd love to hear your opinion:
You say the initial input is a one-hot encoded vector the size of the sequence length. Let's say my vocab is 1000 (all the words I want to support) and the sequence length is 30. How do I represent one word out of 1000 in a vector of sequence length 30? The index where I put the 1 will not be correct, as it might actually be at position 500 in the real vocab tensor.
interesting and useful
Thanks so much
After getting values, I think we should permute first, as before, and then reshape.
Thanks for your efforts to explain a complicated subject. Couple of questions: did you intentionally skip the Layer Normalization or did I miss something? Also -- the final linear layer in the attention block has dimension 512 x 512 (input, output size). Does this mean that each token (logit?) output from the attention layer is passed token-by-token through the linear layer to create a new set of tokens, that set being of size token sequence length. This connection between the attention output and the Linear layer is baffling me. The output of the attention layer is (Sequence-length x transformed-embedding-length) or (4 x 512), ignoring batch dimension in the tensor. Yet the linear layer accepts a (1 x 512) input and yields a (1 x 512) output. So is each (1 x 512) output token in the attention layer output sequence passed one at a time through the linear layer? And does this imply that the same linear layer is used for all tokens in the sequence?
Thanks!
Where can I get the theory part? It is good that you are explaining the code, but can you share any link where we can read the theory as well?
what about the weights for K,V,Q for each head as well as the output?
Maybe you have to divide 1536 by 3 first, and then by 8. But you do it by 8 first and then by 3, which sounds like it mixes up the q, k, v vector dimensions.
Good point, but I think because the parameters that generate q, k, v are learned, it doesn't matter which you divide by first. I could be wrong, though.
Hello, good job, but I have a small misunderstanding. In the transformer paper they computed a different key, query, etc. for each head, whereas here you are splitting the key and query so each head takes a split. What's the difference between the two approaches?
Can you please make notebooks in the repo accessible again? because most of them are not accessible right now. Thank you in advance!
5:05 Why separate variables for input_dim (embedding dimension IIUC) and d_model? Aren't these always going to be the same? Would we ever want this component to spit out a contextualized-wordVector that's a different length from the input wordVector?
I have the same question, and I assume that most of the time input_dim should equal d_model in order to have a consistent vocabulary between the input and the output.
My understanding is that it sets you up to be able to choose different hyper-parameters e.g. if you want a smaller input word embedding space size but a larger internal representation. Table 3 of the original transformers paper shows a few different combinations of these parameters arxiv.org/pdf/1706.03762.pdf
You are incredible, I’ve seen a good chunk of your videos and wanted to thank you from the bottom of my heart! With your content I feel like that maybe even an idiot like me can understand it (one day - maybe? 🤔)!
I hope you enjoy a lot of success!
Super kind words. Thank you so much! I’m sure you aren’t an idiot and we hope can all learn together!
At 4:57, d_model is 512, and so is input_dim. But at 14:23, input_dim is 1024. I thought they should be the same number; are you saying you reduce the dimension of the input to the dimension of the model by some compression technique like PCA?
At 14:23, it looks like input_dim is only used at the very beginning; once we are in the model, the input dimension is shrunk to 512.
It's not PCA. It's a dimension conversion by weight-matrix multiplication. For example, to map (1x1024) -> (1x512), we need a weight matrix of 1024x512... This is just an example, not the actual scenario demonstrated here.
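A minimal PyTorch sketch of that weight-matrix projection, using just the (1x1024) -> (1x512) shapes from the example above:

```python
import torch
import torch.nn as nn

proj = nn.Linear(1024, 512, bias=False)  # learned weight matrix, stored as (512, 1024)
x = torch.randn(1, 1024)                 # a single 1024-dimensional embedding
y = proj(x)                              # equivalent to x @ proj.weight.T
print(y.shape)                           # torch.Size([1, 512])
```

Unlike PCA, nothing here is fitted to the data's covariance up front; the matrix entries are ordinary parameters learned by backpropagation.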
You never Never disappoint bro. Vielen vielen dank!
Thanks for the kind words and the support :)
Please could you explain how to implement a hybrid model (vision transformer + CNN) for an image classification task?
Wow, this is a very intuitive explanation! I have a question though. From my understanding, the attention aids the encoder and decoder blocks in the transformer to understand which words that came either before or after (sometimes) will have a strong impact on the generation of the next word, through the feedforward neural network and other processes. Given a sentence like "The cook is always teaching the assistant new techniques and giving her advice.", what is a method I could implement to determine the pronoun-profession relationships to understand that cook is not paired with "her", rather "assistant" is. I have tried two methods so far. 1. Using the pretrained contextual embeddings from BERT. 2. (relating to this video) I thought that I could almost reverse engineer the attention methods by creating an attention vector to understand what pair of pronoun-professions WOULD be relevant, through self attention. However, this method did not work as well (better than method 1) and I believe this is because the sentence structures are very nuanced, so I believe that the attention process is not actually understanding the grammatical relationships between words in the sentence. How could I achieve this: a method that could determine which of the two professions in a sentence like above are referenced by the pronoun. I hope you can see why I thought that using an attention matrix would be beneficial here because the attention would explain which profession was more important in deciding whether the pronoun would be "he" or "her". This is a brief description of what I am trying to do, so if you can, I could elaborate more about this over email or something else. Thank you in advance for your help and thanks a million for your amazing explanations of transformer processes!
I would like to add additionally that in my approach of using attention, I don't actually create query, key, value vectors. I take the embeddings, do the dot product, scale it, and use softmax to convert it into a probability distribution. Possibly this is where my approach goes wrong. The original embeddings of the words in the sentence are created from BERT, so there should already be positional encoding and other relevant things for embeddings.
From what I have tried to understand, query, key, and value are representations of the embedded word after positional encoding, each with a different purpose. But why are we dividing them into multiple heads in the first place, at 64 dimensions each, when we could just have one head with 512-dimensional q, k, v and perform self-attention? Even if we are using multiple heads to increase context, wouldn't 8 different 512-dimensional vectors for each of q, k, v, performing self-attention on each and combining them later, give a more accurate result? What I mean to ask is: why does a 512-dimensional representation of a word get 64-dimensional q, k, v per head?
Someone please explain this.
Hello, I was wondering what the actual difference is between key and value? I'm a bit confused about the difference between "what I can offer" vs. "what I actually offer".
This is a great video that might help you build intuition behind the difference of query, key and value. I've linked the exact timestamp: ua-cam.com/video/QvkQ1B3FBqA/v-deo.html
First, remember that what we're trying to learn is Q-weights, K-weights, V-weights such that
- input-embedding * Q-weights = Q (a vector that can be used as a query)
- input-embedding * K-weights = K (a vector that can be used as a key)
- input-embedding * V-weights = V (a vector that can be used as a value)
Linguistic / Grammar intuition
Let's assume that we had those Q, K, and V, and we wanted to search for content matching some query Q. How might we do that, grammatically?
@@yashs761 Thank you so much, this video has helped me a lot! The lecturer is brilliant!
Are you a full-time creator or do you work on AI while making digital content?
The latter. I have a full time job as a machine learning engineer. I make content like this on the side for now :)
@CodeEmporium How complex is the work you do with the AI VS. what you teach us here? Would you say it's harder to code by far or is it mostly just scaling up, reformatting, and sorting data to train the models?
@@CodeEmporium Are you able to disclose your employer’s name?
14:40 Your embedding dimension is 1024. So how come qkv.shape[-1] is 3x512, not 3x1024?
qkv is the result of qkv_layer, which takes embeddings of size 1024 and has 3*d_model = 3*512 output neurons; therefore the output of this layer will be of dimension (batch_size, seq_length, 3*512).
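That shape flow can be sketched roughly as follows (the names `qkv_layer`, `input_dim`, and `d_model` follow the video's code as described in these comments; the exact reshape/permute order is an assumption based on the 13:25 discussion, so treat this as a sketch rather than the actual notebook):

```python
import torch
import torch.nn as nn

batch, seq_len, input_dim, d_model, num_heads = 1, 4, 1024, 512, 8
head_dim = d_model // num_heads                 # 64

qkv_layer = nn.Linear(input_dim, 3 * d_model)   # 1024 in, 1536 out
x = torch.randn(batch, seq_len, input_dim)

qkv = qkv_layer(x)                              # (1, 4, 1536): last dim is 3*d_model, not 3*input_dim
qkv = qkv.reshape(batch, seq_len, num_heads, 3 * head_dim)
qkv = qkv.permute(0, 2, 1, 3)                   # (1, 8, 4, 192): heads before sequence
q, k, v = qkv.chunk(3, dim=-1)                  # each (1, 8, 4, 64)
```

So the 1024-dimensional input only appears at the very first projection; everything downstream works in multiples of d_model = 512.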
thx)
You are very welcome! Hope you enjoy your stay on the channel :)
I just started with AI/ML a few months ago. Can you guide me on what I should learn to get a job? I like your videos.
Nice! There are many answers to this. But to keep it short and effective, I would say know your fundamentals. This could be just picking one regression model (like linear regression) and understanding exactly how it works and why it works. Do the same for one classification model (like logistic regression). Look at both through the lens of code, math, and real-life problems.
I think this is a good starting point for now. Honestly, it doesn’t exactly matter where you start as long as you start and don’t stop. I’m sure you’ll succeed!
That said, if you are interested in the content I mentioned earlier, I should have some playlists with the titles "Linear Regression" and "Logistic Regression". So do check them out if/when you're interested. Hope this helps.
@@CodeEmporium Thanks for the reply. Sure, I will check. I am going to do some work using transformers. Your videos really help, especially the coding demonstrations...
❤❤
Thanks! :)
Acceptable!
But why do they do this multi-head thing? Is it to reduce computational cost? 8*(64²) < 512²
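A quick back-of-envelope check suggests cost is not the reason: the 8*(64²) in the question assumes each head uses a 64x64 matrix, but each head actually projects from the full 512 dimensions down to 64, so the parameter count is the same either way. The usual motivation is representational: each head can attend over a different learned subspace. A sketch of the arithmetic:

```python
d_model, num_heads = 512, 8
head_dim = d_model // num_heads  # 64

# Parameters in (for example) the query projection:
single_head_params = d_model * d_model              # one 512x512 matrix
multi_head_params = num_heads * d_model * head_dim  # eight 512x64 matrices

# The totals match: multi-head splits the projection, it does not shrink it.
assert single_head_params == multi_head_params == 262144
```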
I have two doubts:
1. How are Q, K, V calculated from the input text?
2. How are Q, K, V calculated for multiple heads?
Can you elaborate or point me to a proper resource?
Word embeddings are fed into separate linear layers (fully connected neural networks) to generate the Q, K, and V vectors. These layers project the word embeddings into a new vector space specifically designed for the attention mechanism within the transformer architecture.
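A minimal sketch of those separate linear layers (the layer names here are illustrative, not taken from the video's notebook, and the head split from the earlier comments would be applied afterwards):

```python
import torch
import torch.nn as nn

d_model = 512
w_q = nn.Linear(d_model, d_model, bias=False)  # learned projection for queries
w_k = nn.Linear(d_model, d_model, bias=False)  # learned projection for keys
w_v = nn.Linear(d_model, d_model, bias=False)  # learned projection for values

x = torch.randn(1, 4, d_model)    # embeddings after positional encoding
q, k, v = w_q(x), w_k(x), w_v(x)  # each (1, 4, 512), ready to be split into heads
```

For multiple heads, each of these (1, 4, 512) tensors is then reshaped into per-head chunks of 64 dimensions, exactly as in the shape-flow discussion above; a fused 3*d_model layer (as in the video) is just these three layers concatenated for efficiency.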
Good job bro, JESUS IS COMING BACK VERY SOON; WATCH AND PREPARE