I rarely comment on YT videos, but I wanted to say thanks. This video doesn't have all the marketing BS and provides the type of understanding I was looking for.
Gosh, imagine the day videos were ranked based on content and not fake marketing tactics 😂
Thanks for the kind words!
Beautifully explained! I want to shamelessly request you for a series where you go one step deeper to explain this beautiful architecture.
Awesome job explaining how Inference works. This clarified my confusion about most videos which largely discuss only pre-training. 🙏
Inference:
1. Tokens are generated one at a time, conditioned on the input + previous generations
2. Language modelling head converts the hidden states to logits
3. Greedy search or beam search is possible
Training:
1. Input ids: input prompt, labels: output
2. Decoder input ids are copied from the labels, prepended with the start-of-sequence token
3. Decoder generates text all at once but uses causal attention mask
to mask out future tokens from decoder input ids
4. -100 is assigned to padded positions in the labels to tell the cross-entropy function not to compute loss there (see the sketch below)
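A minimal PyTorch sketch of point 4 (toy vocabulary size, hypothetical single-example batch); `-100` works because it is the default `ignore_index` of `nn.CrossEntropyLoss`:

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: batch of 1, sequence length 4, vocabulary of 10 tokens
vocab_size = 10
logits = torch.randn(1, 4, vocab_size)    # decoder output logits: (batch, seq_len, vocab)
labels = torch.tensor([[5, 2, 7, -100]])  # last position is padding -> label set to -100

# CrossEntropyLoss skips positions whose label equals ignore_index (default: -100),
# so no loss is computed for the padded position.
loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fct(logits.view(-1, vocab_size), labels.view(-1))
print(loss)
```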
This is the best explanation I have come across so far on this particular topic (inference vs training). I hope that more videos like this are released in the future. Well done!
For someone coming from a software engineering background, this was hands down the most useful explanation of the Transformer architecture.
Yes, amazing explanation, but it is not about the Transformer architecture; it is about how the input and output differ during training and inference.
I am using the huggingface library and this video finally gave me a clear understanding of the wordings used and the transformer architecture flow. Thank you!
You are a great teacher Niels! Would really appreciate if you add more such videos on hot ML/DL topics.
This is one of the cleanest explanations of transformer inference and training on the web. Great video!
The clearest explanation of the Transformer model I have seen. Thanks Niels!
Thanks so much, you hit upon the points that are confusing for a first-time user of LLMs. Thank you!
Excellent overview of how the encoder-decoder work together. Thanks.
This is the best video on transformers. Everybody explains about the structure and attention mechanism but you choose to explain the training and inference phase. Thank you so much for this video. You are awesome 😎.
Love from India ❤
I didn't find a lot of resources that include both drawings of the process and code examples / snippets that demonstrate the drawings practically. Thank you, this helps me a lot :)
Unbelievably great and intuitive explanation. Something for us to learn. Thanks a lot, Niels.
Thanks man. We need more videos of this type.
Thanks Niels for the video. I look forward to more content on the topic.
Thank you, very well explained. Before I only had a rough understanding; now I'm much clearer on the details. Many thanks, love from China.
Niels, thank you very much for this video! It was really helpful! The concept behind Transformers is pretty complicated, but your explanation definitely helped me to understand it.
Great video, very comprehensible explanation of a complex subject.
This is one of the greatest explanations I know. Thanks!
Amazing video... it covers exactly what most other resources on this topic are missing. Keep up the great work, Niels.
@NielsRogge thanks for the super clear and helpful video! It's really one of the most clean and concise presentations I've watched on this topic! 🙌 I had a question though: At point 24:09, you are saying that *during inference* in the *last hidden state of the decoder* we get a hidden vector *for each of the decoder input ids*. In your example after 6 time steps, we have 6 decoder tokens: the start token, salut, ..., mignon, which means the last hidden state (at time step t = 6) would produce a 6 x 768 matrix. Is that true though?
I thought the last hidden state of the decoder produces the embedding of the *next token*. In other words, a 1 x 768 vector, that is later passed through a `nn.Linear(768, 50000)` layer to give us the next decoder input id. In other words, the 1 x 768 vector is passed to `nn.Linear(768, 50000)` and gives us a 1 x 50000 logit vector. But if what you say is true, then a 6 x 768 matrix is created at time step t = 6, and the end result after the last linear head would be a 6 x 50000 logit matrix. No?
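To make the shapes concrete, here is a small sketch assuming a hidden size of 768 and a 50,000-token vocabulary as in the comment: the decoder does output one hidden vector per decoder input id (so 6 x 768 after 6 steps), the LM head turns that into 6 x 50,000 logits, and only the logits of the last position are used to pick the next token.

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 768, 50_000
lm_head = nn.Linear(hidden_size, vocab_size)

# Hypothetical decoder output after 6 time steps: one hidden vector per decoder input id
last_hidden_state = torch.randn(1, 6, hidden_size)   # (batch, seq_len, hidden)

logits = lm_head(last_hidden_state)                  # (1, 6, 50000): one logit vector per position
next_token_logits = logits[:, -1, :]                 # (1, 50000): only the last position is used
next_token_id = next_token_logits.argmax(dim=-1)     # greedy choice of the next decoder input id
print(logits.shape, next_token_id.shape)
```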
Thank you Niels, this was really helpful to me for understanding this complex topic. These aspects of the model are not normally covered in most resources I've seen.
Great video! I have to say thank you. This video is just what I need: I have learned some basic ideas about word2vec, LSTMs, RNNs and the like, but I could not understand how the Transformer works or what its inputs and outputs are, and your video made all of that clear to me. Yes, some commenters said this video is "pointless", but I cannot agree with that: different audiences have different backgrounds, so it is really hard to make something that pleases everyone. Someone who lacks basic ideas like word2vec (why we use input_ids) would not be able to understand this video, and someone who is already very good at Transformers/Diffusion won't need to watch it. This video taught me how the encoder and decoder work at every single step, very detailed, really appreciated!
Very intuitive, concise explanation to a very important topic. Thank you very much !
Very nice lecture. It clarified so many concepts for me.
Thank you very much! Now I can say that I completely understand the Transformer!
Best video on AI I've seen so far. Thank you so much for making & sharing!
The only parts that might need a bit more explanation are the logits part and vector embedding creation (but the latter already has lots of content).
I really wanted this exact content and I found you, thank you.
Excellent and simple video for understanding the workings of the Transformer, thanks a lot!
Great tutorial!! It would be great if you made a video on personalizing GPT: how to keep trained data and load it for Q&A. Any recommendations?
🎯 Key Takeaways for quick navigation:
00:00 🧭 *Overview of Transformer Model Functionality*
- Provides an overview of the Transformer model.
- Discusses the distinction between using a Transformer during training versus inference.
- Highlights the importance of understanding Transformer usage for tasks like text generation.
02:05 🤖 *Tokenization Process*
- Describes the tokenization process where input text is converted into tokens.
- Explains the mapping of tokens to integer indices using vocabulary.
- Discusses the role of input IDs in feeding data to the model.
06:06 📚 *Vocabulary in Transformer Models*
- Explores the concept of vocabulary in Transformer models.
- Illustrates how tokens are mapped to integer indices in the vocabulary.
- Emphasizes the importance of vocabulary in processing text inputs for Transformer models.
07:44 🧠 *Transformer Encoder Functionality*
- Details the process of the Transformer encoder, converting tokens into embedding vectors.
- Explains how the encoder generates hidden representations of input tokens.
- Highlights the role of embedding vectors in representing input sequences.
10:45 🛠️ *Transformer Decoder Operation at Inference*
- Demonstrates how the Transformer decoder operates during inference.
- Discusses the generation process of new text using the decoder.
- Describes the utilization of cached embedding vectors for generating subsequent tokens.
23:04 🔄 *Iterative Generation Process*
- Illustrates the iterative process of token generation by the Transformer decoder.
- Explains how the decoder predicts subsequent tokens based on previous predictions.
- Discusses the termination condition of the generation process upon predicting the end-of-sequence token.
25:33 🧠 *Illustrating Inference Process with Transformers*
- At inference time, text generation with Transformer models occurs in a loop, generating one token at a time.
- Transformer models like GPT use a generation loop, allowing for flexibility in text generation.
- Different decoding strategies, such as greedy decoding and beam search, impact the text generation process.
30:59 🛠️ *Explaining Decoding Strategies for Transformers*
- Greedy decoding is a basic method where the token with the highest probability is chosen at each step.
- Beam search is a more advanced decoding strategy that considers multiple potential sequences simultaneously.
- Various decoding strategies, including beam search, are available in the `generate` method of Transformer libraries like Hugging Face's Transformers (see the sketch after this list).
31:13 🎓 *Training Process of Transformer Models*
- During training, the model learns to generate text by minimizing a loss function based on input sequences and target labels.
- Teacher forcing is used during training, where the model is provided with ground truth tokens at each step.
- The training process involves tokenizing input sequences, encoding them, and using labeled sequences to compute loss via cross-entropy calculations.
48:58 🤯 *Understanding Causal Attention Masking in Transformers*
- Causal attention masking prevents the model from "cheating" by looking into the future during training.
- At training time, the model predicts subsequent tokens based on the ground truth sequence, with the help of the causal attention mask.
- This mechanism ensures that the model generates text one step at a time during training, similar to the inference process.
Made with HARPA AI
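As a companion to the decoding-strategies bullets above, here is a short sketch using Hugging Face's `generate` method (the `t5-small` checkpoint is just an illustrative choice):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative checkpoint; any encoder-decoder model with a generate() method works similarly
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to French: My dog is cute.", return_tensors="pt")

# Greedy decoding: pick the highest-probability token at every step
greedy_ids = model.generate(**inputs, max_new_tokens=20)

# Beam search: keep the 4 most promising partial sequences at every step
beam_ids = model.generate(**inputs, max_new_tokens=20, num_beams=4)

print(tokenizer.batch_decode(greedy_ids, skip_special_tokens=True))
print(tokenizer.batch_decode(beam_ids, skip_special_tokens=True))
```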
This explanation is gold! Thank you so much! 💯
I watched the whole video and I understand now so much more.
Thank you very much for this great video! Please keep it up!
Wonderful, thank you Niels!
Amazing videos. Cleared a lot of doubts. Thanks Niels.
Very nice explanation. I request you to create a video on how an LLM can be adapted using prompt engineering and fine-tuning, and on building a new LLM, with a practical approach. ❤❤❤❤❤❤❤
Thanks Niels. Such a great explanation!
Great video! I like the pace and easy explanation on things that are not necessarily straightforward. And clean excalidraw skills 😉 Hope to see more soon
Very informative. Thanks Niels!
Nice explanation! Thank you!
Great work. Thanks a lot for this video.
I had a small doubt: during transformer inference you mentioned we stop generating the sequence when we reach the end-of-sequence (EOS) token. But during training, in the decoder_input_ids, I noticed you didn't add the EOS token to the sentence. Did I miss something here?
Hi, during training, the EOS token is indeed added to the labels (and, in turn, to the decoder input ids); I should have mentioned that!
@@NielsRogge Got it. Thanks. I believe the EOS token will be added before the padding tokens?
"start token + sentence tokens + EOS token" + padding tokens to reach the fixed sequence length. Am I correct?
@@NaveenRock1 yes correct!
@@NielsRogge Awesome. Thank you. :)
Hi @NielsRogge, I understand that the decoder input is "start token + sentence_tokens + EOS + pad_tokens" and one token gets generated as output for each token in the input. However, when calculating the loss, the output tokens are compared to "sentence_tokens + EOS + -100 -100 ...", i.e. with the start token removed, right? But that means there is 1 less token when calculating the loss compared to the tokens generated by the decoder. How is this problem resolved? Will an additional token with id -100 be appended at the end?
What is the shape of the target tensor in the training phase? (batch_size, maximum_supported_sequence_len_by_model, 50000)? (PLEASE answer, anybody)
Great explanation video, really informative!
Hi Niels,
I greatly appreciate that you've taken the time to create a fantastic summary of training and inference times from the user's perspective.
Q1: during training, do you also include the end-of-sentence token in the loss function? You haven't mentioned it, though IMHO a good model must detect the end of the translation.
Q2: why do you need to introduce padding? Everything works perfectly with arbitrary lengths of input and output sentences, which is a true beauty. Why is it needed for batch training?
Thank you.
He said in the video that padding is introduced because training is done in batches. The elements of a batch will have very different lengths. If we don't use padding, we would have to dynamically allocate memory for every element in the batch, which is not very efficient for the computation.
@@nouamaneelgueddari7518 Makes sense to me. Thanks.
Great video. The only thing that literally all videos on transformers don't mention is: how and when does backpropagation happen? I understand how it works for a simple neural network with a hidden layer, where we use gradient descent to update all the weights... but in the Transformer architecture I find it hard to visualize which numbers get updated after we calculate the loss.
Yeah, conceptually at first maybe, but I would argue the transformations themselves are not more complicated than a normal NN for classification, because it's really doing just that: predicting the most probable token from the dictionary. At least it's way easier than backprop for RNNs, LSTMs, etc.
The Transformers book from Hugging Face has a great explanation of attention, which is really all you need to know to demystify the whole Transformer architecture. And attention is really just adding a few linear projections and doing a dot product.
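On the "which numbers get updated" question: every weight in the Transformer (the token embeddings, the Q/K/V and output projections in each attention layer, the feed-forward layers, and the LM head) is an ordinary learnable parameter, and all of them receive gradients from the single cross-entropy loss. A minimal sketch of a standard training step, assuming a seq2seq checkpoint such as `t5-small` and a hypothetical one-example batch:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # every parameter is registered here

inputs = tokenizer("translate English to French: My dog is cute.", return_tensors="pt")
labels = tokenizer("Mon chien est mignon.", return_tensors="pt").input_ids

outputs = model(**inputs, labels=labels)  # forward pass computes the cross-entropy loss internally
outputs.loss.backward()                   # backprop: gradients flow into embeddings, attention, FFN, LM head
optimizer.step()                          # gradient descent update of all those weights
optimizer.zero_grad()
```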
Thank you very much for the explanation, Niels. It was excellent. I have just one question regarding 'conditioning the decoder' during inference: How exactly does it work? Does it operate in the same way it does during training, i.e., the encoder hidden states are projected into queries, keys, and values, and then the dot products between the decoder and encoder hidden states are computed to generate the new hidden states? It seems like a lot of calculations to me, and in this way the text generation process would be very slow, wouldn't it?
Thanks Niels. This is pretty useful
Unless the correct token is predicted with 100% probability, you will still have a non-zero loss.
One word, Perfect!
thank you. great explanation.
In the description around 45:00, isn't there an end token missing in the labels, which the model should predict after the last label (231)?
Loved your explanation
Excellent video. Why do we want a different post-embedding vector for the same token in the decoder versus the encoder? reference 12:34
I learned a lot thank you.
Nice explanation!
I have these doubts:
- During training, do we learn the Query, Key and Value matrices? In short, do we learn the final embeddings of the encoder through backpropagation? (See the sketch after this list.)
- During training, do we supply the encoder's final embeddings to the decoder one at a time? (Suppose we have 5 final encoder embeddings; for the first time step, do we supply only the first of the 5 embeddings to the decoder?)
- How is this architecture used in a QA model? (I am confused!!!)
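Regarding the first doubt: yes, the query/key/value projections are learned weight matrices, updated by backpropagation just like any other layer. A toy single-head self-attention sketch (hidden size 768 assumed) to illustrate:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySelfAttention(nn.Module):
    """Single-head self-attention; the Q/K/V projections are learned parameters."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, hidden_size)  # W_Q, learned via backprop
        self.k_proj = nn.Linear(hidden_size, hidden_size)  # W_K, learned via backprop
        self.v_proj = nn.Linear(hidden_size, hidden_size)  # W_V, learned via backprop
        self.scale = hidden_size ** 0.5

    def forward(self, hidden_states):
        q, k, v = self.q_proj(hidden_states), self.k_proj(hidden_states), self.v_proj(hidden_states)
        scores = torch.matmul(q, k.transpose(-1, -2)) / self.scale  # (batch, seq, seq)
        weights = F.softmax(scores, dim=-1)
        return torch.matmul(weights, v)                             # (batch, seq, hidden)

attn = ToySelfAttention()
out = attn(torch.randn(1, 5, 768))   # 5 tokens in -> 5 contextualised vectors out, all at once
print(out.shape)
```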
Transformers are "COMPLICATED"? Not really, after this video. Thanks.
It was so helpful. Could you please share the drawing notes? Thank you!
Are your notes from this video available anywhere online? Really liked the video and would love to add your notes to my personal study notes as well
Awesome! Great explanation
Great vid, thanks!
Really great video!
Thank you very much!
Great content, thank you very much for the detailed explanation :)
Thank you so Much!! Subscribed
Can you elaborate on why seemingly all new models are decoder-only, and are trained with the sole objective of next-token prediction? Does the enc-dec architecture of T5 have any advantages? And is there any reason to train in the different ways that T5 does?
Hi, great question! Encoder-decoder architectures are typically good at tasks where the goal is to predict some output given a structured input, like machine translation or text-to-SQL. One first encodes the structured input, and then uses that as condition to the decoder using cross-attention. However, nowadays you can actually perfectly do these tasks with decoder-only models as well, like ChatGPT or LLaMa. The main disadvantage of encoder-decoders is that you need to recompute the keys/values at every time step, which is why all companies are using decoder-only at the moment (much faster at inference time)
Thanks so much for the video, and answering questions! Can you explain (or provide a pointer to a paper) how the key/values can be cached to avoid recomputation in a decoder-only transformer?
Edit: I figured it out while re-watching the training part of your video, so you needn’t answer unless you think others would benefit (I wouldn’t be able to explain very well, I fear)
Don't you have to recalculate in the decoder-only architecture as well? Or is this where the non-default KV cache comes in?
Question on the tensor shapes of the Encoder that go into the Decoder during inference:
If the Encoder output is of shape (1,6,768), during cross attention, how can this be combined with the Decoder's input which is only one token in length [e.g. Shape (1,1,768)]?
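A shape-only sketch of the cross-attention step (learned projections omitted for brevity): the single decoder token supplies the query, the 6 encoder vectors supply the keys and values, and the output keeps the decoder's sequence length of 1.

```python
import torch
import torch.nn.functional as F

hidden = 768
encoder_out = torch.randn(1, 6, hidden)   # keys/values come from the encoder: (1, 6, 768)
decoder_in = torch.randn(1, 1, hidden)    # query comes from the single decoder token: (1, 1, 768)

# (In a real model, q/k/v would first go through learned linear projections.)
scores = torch.matmul(decoder_in, encoder_out.transpose(-1, -2)) / hidden ** 0.5  # (1, 1, 6)
weights = F.softmax(scores, dim=-1)            # attention over the 6 encoder positions
cross_attended = torch.matmul(weights, encoder_out)  # (1, 1, 768): decoder length is preserved
print(cross_attended.shape)
```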
Thank you
What's the tool you used for drawing these figures?
Hi Niels, great explanation on this.
I just couldn't get my head around one point. At each time step we are producing n vectors (the same number as the decoder inputs). Is it guaranteed that the previously predicted tokens' vectors won't change?
What if a decoded token's vector changes as we include more tokens in the decoder input?
Hi, do we also apply masking during inference?
Sir, can you please provide the Excalidraw notes?
Thanks for this amazing explanation.
This is sooooooo good
Thanks Niels
Can you share the Excalidraw explanation link here?
Awesome!🎉
31:02 Training
Are attention vectors used during inference?
Very nice work. Can you please make a modification to the decoder part of the TrOCR model, like replacing the language model with GPT-2?
That's a great video, I just have one question related to the video.
In translation, there could be multiple valid translations. In this example the English output could be 'Hello, my dog is cute' or 'Hi, my dog is a cute dog', etc. In a real translation product, would a metric like the BLEU score be used, and how would this score be used to evaluate and improve product quality?
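For reference, translation systems are indeed commonly evaluated with BLEU (often sacreBLEU), which supports several reference translations per sentence. A small sketch, assuming the Hugging Face `evaluate` library is installed:

```python
import evaluate

# sacreBLEU accepts multiple reference translations per prediction
bleu = evaluate.load("sacrebleu")
predictions = ["Hello, my dog is cute"]
references = [["Hello, my dog is cute", "Hi, my dog is a cute dog"]]

result = bleu.compute(predictions=predictions, references=references)
print(result["score"])  # corpus-level BLEU score (0-100)
```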
Superb video. Just a doubt: at 11:46 you mention that the decoder would use the embeddings from the encoder and the start-of-sequence token to generate the first output token. By embeddings, did you mean the key/value vectors from the last encoder stage? Also, if the encoder is being used to encode the input question, then why are GPT, Llama, etc., called decoder-only models? Thanks
Yes the embeddings from the encoder (after the last layer) are used as keys and values in the cross-attention operations of the decoder. The decoder inputs serve as queries.
Decoder-only models like ChatGPT and Llama don't have an encoder. They directly feed the text to the decoder, and only use self-attention (with a causal mask to prevent future leakage).
@@NielsRogge Thanks for the quick reply. But my confusion is that when we ask a question to GPT or Llama like "what is a transformer?", as per all the sources, including this video, they mention that decoders start with the SOS or EOS token to generate the output. But where does the decoder learn the context from? Even in this video you use the encoder to encode the input question and then pass the encoded embeddings to the decoder, right?
NICE!!!!
Hi Niels, you describe a lot of steps that are taken, but don't really explain why they are taken. It becomes a kind of magic formula. For example, you have a sentence and break it up in tokens. OK. But hang on, why break it up in tokens rather than in words? What's different? Then you look up the tokens in a dictionary to replace them by numbers. Is that because it is easier to deal with numbers than with words? Then you do "something" and each number turns into a vector of 768 numbers. What is it that you do there, and why? What is the information in the other 767 numbers and where does that information come from? What do you want it for? It would be nice if you could give the context, both the big picture and the details.
Yes good point! I indeed assume in the video that you take the architecture of the Transformer as is, without asking why it looks that way. Let me give you some pointers:
- subword tokens rather than words are used because it was proven in papers prior to the Transformer paper that they improved performance on machine translation benchmarks, see e.g. arxiv.org/abs/1609.08144.
- we deal with numbers rather than text since computers only work with numbers, we can't do linear algebra on text. Each token ID (integer) is turned into a numerical representation, also called embedding. Tokens that have a similar meaning (like "cat" and "dog") will be closer in the embedding space (when you would project these embeddings in a n-dimensional space, with n = 768 for instance). The whole idea of creating embeddings for words or subword tokens comes from the Word2Vec paper: en.wikipedia.org/wiki/Word2vec.
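A small sketch of those two steps with a Hugging Face tokenizer (the checkpoint name is only an example): the text is split into subword tokens, each token is mapped to an integer id from the vocabulary, and an embedding table inside the model turns each id into a learned 768-dimensional vector.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")   # illustrative checkpoint

text = "My dog is cute."
tokens = tokenizer.tokenize(text)       # subword tokens, e.g. ['▁My', '▁dog', '▁is', '▁cute', '.']
input_ids = tokenizer(text).input_ids   # integer indices into the tokenizer's vocabulary
print(tokens, input_ids)

# Inside the model, an embedding table maps each id to a learned 768-dimensional vector
embedding = nn.Embedding(num_embeddings=tokenizer.vocab_size, embedding_dim=768)
vectors = embedding(torch.tensor([input_ids]))   # shape: (1, num_tokens, 768)
print(vectors.shape)
```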
I like the video
Crisp and concise
Keep it up
One doubt please: does ChatGPT (a decoder-only model) also use the teacher-forcing technique during training?
Yes it does!
@@NielsRogge Thanks a lot for your reply !!
Missing softmax during training, which is mandatory to calculate the cross-entropy loss.
An unrelated question: am I understanding right that there is thus a maximum length for all these sentences, like 512 tokens? Isn't that an issue?
I think the cross-entropy loss in PyTorch (at least!) applies the softmax internally. Yes, the token limit is a limitation because of how the encoder and decoder work internally, but it can be handled when building the dataset pipeline for training and inference.
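This is indeed the case in PyTorch: `F.cross_entropy` (and `nn.CrossEntropyLoss`) expects raw logits and applies the log-softmax internally. A tiny check with hypothetical values:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(3, 10)            # 3 positions, vocab of 10, raw (un-normalised) scores
targets = torch.tensor([2, 7, 0])

# cross_entropy applies log_softmax internally ...
loss_a = F.cross_entropy(logits, targets)
# ... which is equivalent to doing it explicitly and taking the negative log-likelihood
loss_b = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

print(torch.allclose(loss_a, loss_b))  # True
```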
Perfect French :)
It seems wasteful to run the entire decoder each time, since it will do computations for all 6 positions regardless. There seems to be an opportunity to optimize this by only using the relevant part of the decoder mask in each iteration.
Yes indeed! That's where the key-value cache comes in: huggingface.co/blog/optimize-llm#32-the-key-value-cache
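A toy illustration of the key-value cache idea from that post (real implementations live inside the attention layers; this only shows the bookkeeping, with the projections faked as random tensors): the keys and values of already-processed tokens are stored and re-used, so each new generation step only computes projections for the newest token.

```python
import torch

hidden = 768
cached_keys = torch.empty(1, 0, hidden)    # grows by one position per generation step
cached_values = torch.empty(1, 0, hidden)

for step in range(3):
    # Only the newest token is projected at this step (random tensors stand in for real projections)
    new_key = torch.randn(1, 1, hidden)
    new_value = torch.randn(1, 1, hidden)

    # Append to the cache instead of recomputing keys/values for all previous tokens
    cached_keys = torch.cat([cached_keys, new_key], dim=1)
    cached_values = torch.cat([cached_values, new_value], dim=1)

    print(f"step {step}: cache holds {cached_keys.shape[1]} positions")
```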
heyy niels
Ok man, you tried, but honestly this is a totally pointless video: someone who knows what the Transformer is about learns absolutely nothing except that -100 means 'ignore', and somebody who's still trying to wrap their head around the Transformer won't understand a single piece of what you kept typing in there. There you go, it's not just a thumbs-down from me, I also took a couple of minutes to write this reply. Just try and see if you can define what the target audience of this video is, and you'll instantly see just how meaningless this video is.
Agree a little... this is good for an audience interested in using the Hugging Face library especially, but not for understanding the Transformer and attention in a generic way!