How a Transformer works at inference vs training time

  • Published Nov 25, 2024

COMMENTS • 113

  • @TempusWarrior
    @TempusWarrior 1 year ago +27

    I rarely comment on YT videos, but I wanted to say thanks. This video doesn't have all the marketing BS and provides the type of understanding I was looking for.

    • @waynelau3256
      @waynelau3256 1 year ago +6

      Gosh, imagine the day videos were ranked based on content and not fake marketing tactics 😂

    • @NielsRogge
      @NielsRogge 1 year ago +2

      Thanks for the kind words!

  • @sohelshaikhh
    @sohelshaikhh 10 months ago +8

    Beautifully explained! I'd like to shamelessly request a series where you go one step deeper into this beautiful architecture.

  • @ARATHI2000
    @ARATHI2000 2 months ago +2

    Awesome job explaining how inference works. It cleared up the confusion left by most videos, which largely discuss only pre-training. 🙏

  • @MathavRaj
    @MathavRaj 1 year ago +2

    Inference:
    1. Tokens are generated one at a time, conditioned on the input + previous generations
    2. The language modelling head converts the hidden states to logits
    3. Greedy search or beam search is possible
    Training:
    1. Input ids: input prompt, labels: output
    2. Decoder input ids are copied from the labels and prepended with the decoder start token
    3. The decoder generates text all at once, but uses a causal attention mask
    to mask out future tokens from the decoder input ids
    4. -100 is assigned to padded positions in the labels to tell the cross-entropy function not to compute loss there
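
A rough code sketch of the training-side points above, assuming the Hugging Face Transformers library and an illustrative seq2seq checkpoint ("t5-small"); the example texts are made up:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Input prompt -> input_ids, target text -> labels (teacher forcing).
inputs = tokenizer("translate English to French: My dog is cute", return_tensors="pt")
labels = tokenizer("Mon chien est mignon", return_tensors="pt").input_ids

# Padded positions (if any) get -100 so the cross-entropy loss ignores them.
labels[labels == tokenizer.pad_token_id] = -100

# The model builds the decoder input ids from the labels internally (shifted right,
# with a start token prepended), applies the causal attention mask, and returns the
# cross-entropy loss plus the logits produced by the language modelling head.
outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
print(outputs.loss)          # scalar training loss
print(outputs.logits.shape)  # (batch_size, target_length, vocab_size)
```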

  • @farrugiamarc0
    @farrugiamarc0 7 months ago +1

    This is the best explanation I have come across so far on this particular topic (inference vs training). I hope more videos like this are released in the future. Well done!

  • @kevinsummerian
    @kevinsummerian 11 months ago +2

    For someone coming from a software engineering background, this was hands down the most useful explanation of the Transformer architecture.

    • @kamal9294
      @kamal9294 1 month ago +1

      Yes, amazing explanation, but it's not really about the Transformer architecture; it's about how the inputs and outputs differ during training and inference.

  • @omgwenxx
    @omgwenxx 7 months ago

    I am using the Hugging Face library, and this video finally gave me a clear understanding of the terminology used and the Transformer architecture flow. Thank you!

  • @ashishnegi9663
    @ashishnegi9663 1 year ago +16

    You are a great teacher, Niels! I would really appreciate it if you added more such videos on hot ML/DL topics.

  • @jasonzhang5378
    @jasonzhang5378 1 year ago

    This is one of the cleanest explanations of Transformer inference and training on the web. Great video!

  • @zobinhuang3955
    @zobinhuang3955 1 year ago +1

    The clearest explanation of the Transformer model I have seen. Thanks Niels!

  • @sanjaybhatikar
    @sanjaybhatikar 10 months ago

    Thanks so much, you hit upon the points that are confusing for a first-time user of LLMs. Thank you!

  • @forecenterforcustomermanag7715
    @forecenterforcustomermanag7715 5 months ago

    Excellent overview of how the encoder and decoder work together. Thanks.

  • @shivamsengupta121
    @shivamsengupta121 10 months ago

    This is the best video on Transformers. Everybody explains the structure and attention mechanism, but you chose to explain the training and inference phases. Thank you so much for this video. You are awesome 😎.
    Love from India ❤

  • @vsucc3176
    @vsucc3176 5 months ago

    I haven't found many resources that include both drawings of the process and code examples/snippets that demonstrate the drawings practically. Thank you, this helps me a lot :)

  • @RamDhiwakarSeetharaman
    @RamDhiwakarSeetharaman 1 year ago +2

    Unbelievably great and intuitive explanation. Something for us to learn. Thanks a lot, Niels.

  • @amitsingha1637
    @amitsingha1637 11 months ago +1

    Thanks man. We need more videos of this type.

  • @lovekesh88
    @lovekesh88 8 months ago

    Thanks Niels for the video. I look forward to more content on the topic.

  • @徐迟-i2t
    @徐迟-i2t 6 months ago

    Thank you, this was explained very well. I previously had only a rough understanding; now I'm much clearer about the details. Many thanks, love from China.

  • @lucasbandeira5392
    @lucasbandeira5392 9 months ago

    Niels, thank you very much for this video! It was really helpful! The concept behind Transformers is pretty complicated, but your explanation definitely helped me to understand it.

  • @henrik-ts
    @henrik-ts 4 months ago

    Great video, very comprehensible explanation of a complex subject.

  • @thorty24
    @thorty24 7 months ago

    This is one of the greatest explanations I know. Thanks!

  • @jagadeeshm9526
    @jagadeeshm9526 1 year ago +1

    Amazing video... it covers exactly what most other resources on this topic are missing. Keep this great work going, Niels!

  • @giofou711
    @giofou711 10 months ago +1

    @NielsRogge thanks for the super clear and helpful video! It's really one of the cleanest and most concise presentations I've watched on this topic! 🙌 I had a question though: at 24:09, you say that *during inference* in the *last hidden state of the decoder* we get a hidden vector *for each of the decoder input ids*. In your example, after 6 time steps we have 6 decoder tokens: the decoder start token, salut, ..., mignon, which means the last hidden state (at time step t = 6) would produce a 6 x 768 matrix. Is that true though?
    I thought the last hidden state of the decoder produces the embedding of the *next token* only. In other words, a 1 x 768 vector that is later passed through a `nn.Linear(768, 50000)` layer to give us the next decoder input id, i.e. the 1 x 768 vector is passed to `nn.Linear(768, 50000)` and gives us a 1 x 50000 logit vector. But if what you say is true, then a 6 x 768 matrix is created at time step t = 6, and the end result after the final linear head would be a 6 x 50000 logit matrix. No?
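
A small shape check related to the question above — a sketch assuming a typical Hugging Face seq2seq model (illustrative "t5-small" checkpoint; the hidden and vocabulary sizes depend on the model, 768 and 50000 are just the numbers used in the discussion):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

encoder_inputs = tokenizer("translate English to French: My dog is cute", return_tensors="pt")
# Pretend we are at time step 6 of generation: six decoder input ids so far (dummy ids).
decoder_input_ids = torch.tensor([[0, 1, 2, 3, 4, 5]])

outputs = model(input_ids=encoder_inputs.input_ids,
                decoder_input_ids=decoder_input_ids,
                output_hidden_states=True)

# One hidden vector and one row of logits per decoder input id...
print(outputs.decoder_hidden_states[-1].shape)  # (1, 6, hidden_size)
print(outputs.logits.shape)                     # (1, 6, vocab_size)
# ...but only the last row is used to pick the next token.
next_token_logits = outputs.logits[:, -1, :]
print(next_token_logits.shape)                  # (1, vocab_size)
```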

  • @marcoxs36
    @marcoxs36 1 year ago

    Thank you Niels, this was really helpful to me for understanding this complex topic. These aspects of the model are not normally covered in most resources I've seen.

  • @zagorot
    @zagorot 1 year ago

    Great video! I have to say thank you. This video is just what I needed: I had learned some basic ideas about word2vec, LSTMs, RNNs and the like, but I couldn't understand how the Transformer works and what its inputs and outputs are, and your video made all of that clear to me. Yes, some comments call this video "pointless", but I can't agree with that: different audiences have different backgrounds, so it's really hard to make something that works for everyone! Someone lacking basic ideas like word2vec (why we use input_ids) won't be able to understand this video, and conversely someone who is already very good at Transformers/Diffusion won't need to watch it! This video taught me how the encoder and decoder work at every single step, very detailed, really appreciated!

  • @chenqu773
    @chenqu773 1 year ago

    Very intuitive, concise explanation of a very important topic. Thank you very much!

  • @mathlife5495
    @mathlife5495 1 year ago

    Very nice lecture. It clarified so many concepts for me.

  • @imatrixx572
    @imatrixx572 1 year ago

    Thank you very much! Now I can say that I completely understand the Transformer!

  • @abhikhubby
    @abhikhubby 1 year ago

    Best video on AI I've seen so far. Thank you so much for making & sharing it!
    The only parts that might need a bit more explanation are the logits area and vector embedding creation (but there is already lots of content on the latter)

  • @kamal9294
    @kamal9294 1 month ago

    I really wanted this exact content and I found you, thank you.

  • @mytr8986
    @mytr8986 1 year ago

    Excellent and simple video for understanding how the Transformer works, thanks a lot!

  • @PravasMohanty
    @PravasMohanty 1 year ago

    Great tutorial!! It would be great if you made a video on personalizing GPT: how to keep the trained data and load it for Q&A. Any recommendations?

  • @Wlodixpro
    @Wlodixpro 9 months ago

    🎯 Key Takeaways for quick navigation:
    00:00 🧭 *Overview of Transformer Model Functionality*
    - Provides an overview of the Transformer model.
    - Discusses the distinction between using a Transformer during training versus inference.
    - Highlights the importance of understanding Transformer usage for tasks like text generation.
    02:05 🤖 *Tokenization Process*
    - Describes the tokenization process where input text is converted into tokens.
    - Explains the mapping of tokens to integer indices using vocabulary.
    - Discusses the role of input IDs in feeding data to the model.
    06:06 📚 *Vocabulary in Transformer Models*
    - Explores the concept of vocabulary in Transformer models.
    - Illustrates how tokens are mapped to integer indices in the vocabulary.
    - Emphasizes the importance of vocabulary in processing text inputs for Transformer models.
    07:44 🧠 *Transformer Encoder Functionality*
    - Details the process of the Transformer encoder, converting tokens into embedding vectors.
    - Explains how the encoder generates hidden representations of input tokens.
    - Highlights the role of embedding vectors in representing input sequences.
    10:45 🛠️ *Transformer Decoder Operation at Inference*
    - Demonstrates how the Transformer decoder operates during inference.
    - Discusses the generation process of new text using the decoder.
    - Describes the utilization of cached embedding vectors for generating subsequent tokens.
    23:04 🔄 *Iterative Generation Process*
    - Illustrates the iterative process of token generation by the Transformer decoder.
    - Explains how the decoder predicts subsequent tokens based on previous predictions.
    - Discusses the termination condition of the generation process upon predicting the end-of-sequence token.
    25:33 🧠 *Illustrating Inference Process with Transformers*
    - At inference time, text generation with Transformer models occurs in a loop, generating one token at a time.
    - Transformer models like GPT use a generation loop, allowing for flexibility in text generation.
    - Different decoding strategies, such as greedy decoding and beam search, impact the text generation process.
    30:59 🛠️ *Explaining Decoding Strategies for Transformers*
    - Greedy decoding is a basic method where the token with the highest probability is chosen at each step.
    - Beam search is a more advanced decoding strategy that considers multiple potential sequences simultaneously.
    - Various decoding strategies, including beam search, are available in the `generate` method of Transformer libraries like Hugging Face's Transformers.
    31:13 🎓 *Training Process of Transformer Models*
    - During training, the model learns to generate text by minimizing a loss function based on input sequences and target labels.
    - Teacher forcing is used during training, where the model is provided with ground truth tokens at each step.
    - The training process involves tokenizing input sequences, encoding them, and using labeled sequences to compute loss via cross-entropy calculations.
    48:58 🤯 *Understanding Causal Attention Masking in Transformers*
    - Causal attention masking prevents the model from "cheating" by looking into the future during training.
    - At training time, the model predicts subsequent tokens based on the ground truth sequence, with the help of the causal attention mask.
    - This mechanism ensures that the model generates text one step at a time during training, similar to the inference process.
    Made with HARPA AI
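
A quick sketch of the decoding strategies mentioned in the summary above, using the `generate` method of the Hugging Face Transformers library (the "t5-small" checkpoint and input text are illustrative):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to French: My dog is cute", return_tensors="pt")

# Greedy decoding: at each step, take the token with the highest probability.
greedy_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# Beam search: keep the `num_beams` most likely partial sequences at each step.
beam_ids = model.generate(**inputs, max_new_tokens=20, num_beams=4, do_sample=False)

print(tokenizer.batch_decode(greedy_ids, skip_special_tokens=True))
print(tokenizer.batch_decode(beam_ids, skip_special_tokens=True))
```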

  • @trilovio
    @trilovio 9 months ago

    This explanation is gold! Thank you so much! 💯

  • @VitalContribution
    @VitalContribution 1 year ago

    I watched the whole video and I now understand so much more.
    Thank you very much for this great video! Please keep it up!

  • @fabianaltendorfer11
    @fabianaltendorfer11 1 year ago

    Wonderful, thank you Niels!

  • @nageswarsahoo1132
    @nageswarsahoo1132 1 year ago

    Amazing video. It cleared up a lot of doubts. Thanks Niels.

  • @AniketThorat-w4w
    @AniketThorat-w4w 10 months ago

    Very nice explanation. I request you to create a video on how an LLM can be adapted using prompt engineering and fine-tuning, and on generating a new LLM, with a practical approach.❤❤❤❤❤❤❤

  • @samilyalciner
    @samilyalciner 1 year ago

    Thanks Niels. Such a great explanation!

  • @thomasvrancken895
    @thomasvrancken895 1 year ago

    Great video! I like the pace and the easy explanations of things that are not necessarily straightforward. And clean Excalidraw skills 😉 Hope to see more soon.

  • @HerrBundesweit
    @HerrBundesweit 1 year ago

    Very informative. Thanks Niels!

  • @phucdoitoanable
    @phucdoitoanable 1 year ago

    Nice explanation! Thank you!

  • @NaveenRock1
    @NaveenRock1 1 year ago +2

    Great work. Thanks a lot for this video.
    I had a small doubt: during Transformer inference you mentioned that we stop generating the sequence when we reach the end-of-sequence (EOS) token. But during training, in the decoder_input_ids, I noticed you didn't add the EOS token to the sentence; did I miss something here?

    • @NielsRogge
      @NielsRogge 1 year ago +3

      Hi, during training the EOS token is indeed added to the labels (and, in turn, to the decoder input ids); I should have mentioned that!

    • @NaveenRock1
      @NaveenRock1 1 year ago +2

      @@NielsRogge Got it. Thanks. I believe the EOS token will be added before the padding tokens, i.e. sentence tokens + EOS token + padding tokens to reach the fixed sequence length. Am I correct?

    • @NielsRogge
      @NielsRogge 1 year ago +3

      @@NaveenRock1 Yes, correct!

    • @NaveenRock1
      @NaveenRock1 1 year ago +3

      @@NielsRogge Awesome. Thank you. :)

    • @maneeshbabuadhikari8651
      @maneeshbabuadhikari8651 3 months ago

      Hi @NielsRogge, I understand that the decoder input is "decoder start token + sentence tokens + EOS + pad tokens" and that one token gets generated as output for each token in the input. However, when calculating the loss, the output tokens are compared to "sentence tokens + EOS + -100 -100 ...", with the decoder start token removed, right? But that would mean there is one less token when calculating the loss than there are output tokens generated by the decoder. How is this resolved? Will an additional token with id -100 be appended at the end?
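
A minimal sketch of how the labels and decoder inputs discussed in this thread typically line up in a Hugging Face-style seq2seq setup (the token ids and special-token ids below are made up for illustration):

```python
import torch

pad_id, eos_id, start_id = 0, 1, 0   # illustrative special-token ids
sentence = [5, 6, 7]                 # dummy ids for "salut", "...", "mignon"

# Labels: sentence tokens + EOS, then -100 on the padded positions.
labels = torch.tensor([sentence + [eos_id] + [-100, -100]])

# Decoder inputs: the labels shifted one position to the right, with a start token
# prepended and -100 replaced by the pad token. Both tensors have the same length,
# so each decoder output position has exactly one label to compare against.
decoder_input_ids = torch.full_like(labels, pad_id)
decoder_input_ids[:, 1:] = labels[:, :-1]
decoder_input_ids[:, 0] = start_id
decoder_input_ids[decoder_input_ids == -100] = pad_id

print(labels)             # tensor([[   5,    6,    7,    1, -100, -100]])
print(decoder_input_ids)  # tensor([[0, 5, 6, 7, 1, 0]])
```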

  • @DmitryPesegov
    @DmitryPesegov 1 year ago

    What is the shape of the target tensor in the training phase? (batch_size, maximum_supported_sequence_len_by_model, 50000)? (PLEASE answer, anybody)

  • @minhajulhoque2113
    @minhajulhoque2113 1 year ago

    Great explanation video, really informative!

  • @zbynekba
    @zbynekba 1 year ago

    Hi Niels,
    I greatly appreciate that you've taken the time to create a fantastic summary of training and inference time from the user's perspective.
    Q1: During training, do you also include generation of the end-of-sentence token in the loss function? You haven't mentioned it, though IMHO a good model must detect the end of the translation.
    Q2: Why do you need to introduce padding? Everything works perfectly with arbitrary input and output sentence lengths, which is a true beauty. Why is padding needed for batch training?
    Thank you.

    • @nouamaneelgueddari7518
      @nouamaneelgueddari7518 1 year ago +1

      He said in the video that padding is introduced because training is done in batches. The elements of a batch will have very different lengths; if we don't use padding, we have to dynamically allocate memory for every element in the batch, which is not very efficient computationally.
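
A small illustration of the batch padding described above, with a Hugging Face tokenizer (the "t5-small" checkpoint and sentences are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

batch = ["My dog is cute", "My dog is a very cute little dog"]
encoded = tokenizer(batch, padding=True, return_tensors="pt")

# Shorter sequences are padded to the longest one, so the batch is one rectangular tensor.
print(encoded.input_ids.shape)
# The attention mask marks padded positions with 0 so attention ignores them.
print(encoded.attention_mask)
```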

    • @zbynekba
      @zbynekba 1 year ago

      @@nouamaneelgueddari7518 Makes sense to me. Thanks.

  • @mbrochh82
    @mbrochh82 1 year ago

    Great video. The one thing that literally all videos on Transformers fail to mention is: how and when does backpropagation happen? I understand how it works for a simple neural network with a hidden layer, where we use gradient descent to update all the weights, but in the Transformer architecture I find it hard to visualize which numbers get updated after we calculate the loss.

    • @jeffrey5602
      @jeffrey5602 1 year ago +1

      Yeah, conceptually at first maybe, but I would argue the transformations themselves are not more complicated than a normal NN for classification, because it's really doing just that: predicting the most probable token from the dictionary. At least it's way easier than backprop for RNNs, LSTMs, etc.
      The Transformers book from Hugging Face has a great explanation of attention, which is really all you need to know to demystify the whole Transformer architecture. And attention is really just adding a few linear projections and doing a dot product.
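
A minimal sketch of that "few linear projections and a dot product" view of (single-head) attention, with an illustrative hidden size:

```python
import torch
import torch.nn as nn

hidden_size, seq_len, batch = 768, 6, 1   # illustrative sizes
x = torch.randn(batch, seq_len, hidden_size)

w_q = nn.Linear(hidden_size, hidden_size)
w_k = nn.Linear(hidden_size, hidden_size)
w_v = nn.Linear(hidden_size, hidden_size)

q, k, v = w_q(x), w_k(x), w_v(x)                        # linear projections
scores = q @ k.transpose(-2, -1) / hidden_size ** 0.5   # scaled dot products
weights = scores.softmax(dim=-1)                        # attention weights
out = weights @ v                                       # weighted sum of the values

print(out.shape)  # (1, 6, 768)
```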

  • @lucasbandeira5392
    @lucasbandeira5392 5 months ago

    Thank you very much for the explanation, Niels. It was excellent. I have just one question regarding 'conditioning the decoder' during inference: how exactly does it work? Does it operate in the same way it does during training, i.e., the encoder hidden states are projected into queries, keys, and values, and then the dot products between the decoder and encoder hidden states are computed to generate the new hidden states? That seems like a lot of computation to me, and in that case the text generation process would be very slow, wouldn't it?

  • @kmsravindra
    @kmsravindra 1 year ago

    Thanks Niels. This is pretty useful

  • @IevaSimas
    @IevaSimas 7 months ago +1

    Unless the token is predicted with 100% probability, you will still have non-zero loss

  • @nizamphoenix
    @nizamphoenix 1 year ago

    One word, Perfect!

  • @achyutanandasahoo4775
    @achyutanandasahoo4775 1 year ago

    Thank you. Great explanation.

  • @BB-uy4bb
    @BB-uy4bb 1 year ago +1

    In the description around 45:00, isn't there an end token missing from the labels, which the model should predict after the last label (231)?

  • @muhammadramismajeedrajput5632
    @muhammadramismajeedrajput5632 7 months ago

    Loved your explanation

  • @19AKS58
    @19AKS58 20 days ago

    Excellent video. Why do we want a different post-embedding vector for the same token in the decoder versus the encoder? reference 12:34

  • @omerali3320
    @omerali3320 6 months ago

    I learned a lot, thank you.

  • @aspboss1973
    @aspboss1973 1 year ago +1

    Nice explanation!
    I have these doubts:
    - During training, do we learn the query, key and value matrices? In short, do we learn the final embeddings of the encoder through backpropagation?
    - During training, do we supply the encoder's final embeddings to the decoder one at a time? (Suppose we have 5 final encoder embeddings; for the first time step, do we supply only the first of the 5 embeddings to the decoder?)
    - How is this architecture used in a QA model? (I am confused!!!)

  • @SanKum7
    @SanKum7 6 months ago +1

    Transformers are "COMPLICATED"? Not really, after this video. Thanks.

  • @shaxy6689
    @shaxy6689 8 months ago

    It was so helpful. Could you please share the drawing notes? Thank you!

  • @sophiacas
    @sophiacas 3 months ago

    Are your notes from this video available anywhere online? Really liked the video and would love to add your notes to my personal study notes as well

  • @sebastianconrady7696
    @sebastianconrady7696 1 year ago

    Awesome! Great explanation

  • @TobiasStenzel
    @TobiasStenzel 11 months ago

    Great vid, thanks!

  • @sitrakaforler8696
    @sitrakaforler8696 1 year ago

    Really great video!
    Thank you very much!

  • @atmismahir
    @atmismahir 11 months ago

    Great content, thank you very much for the detailed explanation :)

  • @dhirajkumarsahu999
    @dhirajkumarsahu999 7 months ago

    Thank you so much!! Subscribed.

  • @robmarks6800
    @robmarks6800 1 year ago +1

    Can you elaborate on why seemingly all new models are decoder-only and are trained with the sole objective of next-token prediction? Does the encoder-decoder architecture of T5 have any advantages? And is there any reason to train in a different way than T5 does?

    • @NielsRogge
      @NielsRogge 1 year ago +1

      Hi, great question! Encoder-decoder architectures are typically good at tasks where the goal is to predict some output given a structured input, like machine translation or text-to-SQL. One first encodes the structured input, and then uses that as condition to the decoder using cross-attention. However, nowadays you can actually perfectly do these tasks with decoder-only models as well, like ChatGPT or LLaMa. The main disadvantage of encoder-decoders is that you need to recompute the keys/values at every time step, which is why all companies are using decoder-only at the moment (much faster at inference time)

    • @schwajj
      @schwajj 1 year ago

      Thanks so much for the video, and answering questions! Can you explain (or provide a pointer to a paper) how the key/values can be cached to avoid recomputation in a decoder-only transformer?
      Edit: I figured it out while re-watching the training part of your video, so you needn’t answer unless you think others would benefit (I wouldn’t be able to explain very well, I fear)

    • @robmarks6800
      @robmarks6800 1 year ago

      Don't you have to recalculate in the decoder-only architecture as well? Or is this where the non-default KV cache comes in?

  • @braunagn
    @braunagn 1 year ago

    Question on the tensor shapes of the encoder output that goes into the decoder during inference:
    If the encoder output is of shape (1, 6, 768), how can it be combined during cross-attention with the decoder's input, which is only one token in length [e.g. shape (1, 1, 768)]?

  • @syerwinD
    @syerwinD 1 year ago

    Thank you

  • @YL-ln4ls
    @YL-ln4ls 1 month ago

    What's the tool you used for drawing these figures?

  • @pulkitsingh2149
    @pulkitsingh2149 1 year ago

    Hi Niels, great explanation on this.
    I just couldn't get my head around one point: at each time step we produce n vectors (the same number as the decoder inputs). Is it guaranteed that the vectors of previously predicted tokens won't change?
    What if a decoded token's vector changes as we include more tokens in the decoder input?

  • @algorithmo134
    @algorithmo134 2 months ago

    Hi, do we also apply masking during inference?

  • @norman9174
    @norman9174 1 year ago

    Sir, can you please provide the Excalidraw notes?
    Thanks for this amazing explanation.

  • @yo-yoyo2303
    @yo-yoyo2303 1 year ago

    This is sooooooo good

  • @botfactory1510
    @botfactory1510 1 year ago

    Thanks Niels

  • @arjunwankhede3706
    @arjunwankhede3706 8 months ago

    Can you share the Excalidraw explanation link here?

  • @FalguniDasShuvo
    @FalguniDasShuvo 1 year ago

    Awesome!🎉

  • @kaustuvray5066
    @kaustuvray5066 1 year ago +1

    31:02 Training

  • @sporrow
    @sporrow 1 year ago

    Are attention vectors used during inference?

  • @mohammedal-hitawi4667
    @mohammedal-hitawi4667 1 year ago

    Very nice work! Can you please make a modification to the decoder part of the TrOCR model, like replacing the language model with GPT-2?

  • @leiyang2176
    @leiyang2176 1 year ago

    That's a great video. I just have one question related to it:
    In translation, there can be multiple valid translations. In this example the English output could be 'Hello, my dog is cute' or 'Hi, my dog is a cute dog', etc. In a real translation product, would a metric like the BLEU score be used, and how would that score be used to evaluate and improve product quality?

  • @bhujithmadav1481
    @bhujithmadav1481 7 months ago

    Superb video. Just a doubt: @11:46 you mention that the decoder would use the embeddings from the encoder and the start-of-sequence token to generate the first output token. By embeddings, did you mean the key/value vectors from the last encoder stage? Also, if an encoder is being used to encode the input question, then why are GPT, Llama, etc. called decoder-only models? Thanks

    • @NielsRogge
      @NielsRogge 7 months ago +1

      Yes the embeddings from the encoder (after the last layer) are used as keys and values in the cross-attention operations of the decoder. The decoder inputs serve as queries.
      Decoder-only models like ChatGPT and Llama don't have an encoder. They directly feed the text to the decoder, and only use self-attention (with a causal mask to prevent future leakage).
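
A tiny illustration of the causal self-attention mask mentioned in the reply above — a sketch assuming a decoder-only setup where the prompt tokens are simply part of the decoder's own input:

```python
import torch

seq_len = 5  # e.g. the number of tokens in the prompt so far
# Each position may attend to itself and to everything before it, never to the future.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# During generation, each newly produced token attends (causally) to the prompt tokens
# and to the previously generated tokens, which is where the context comes from.
```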

    • @bhujithmadav1481
      @bhujithmadav1481 7 months ago

      @@NielsRogge Thanks for the quick reply. But my confusion is that when we ask GPT or Llama a question like "what is a transformer?", all the sources, including this video, say that the decoder starts with the SOS or EOS token to generate the output. So where does the decoder get the context from? Even in this video you use the encoder to encode the input question and then pass the encoded embeddings to the decoder, right?

  • @VaibhavPatil-rx7pc
    @VaibhavPatil-rx7pc 1 year ago

    NICE!!!!

  • @andygrouwstra1384
    @andygrouwstra1384 1 year ago +1

    Hi Niels, you describe a lot of steps that are taken, but don't really explain why they are taken. It becomes a kind of magic formula. For example, you have a sentence and break it up in tokens. OK. But hang on, why break it up in tokens rather than in words? What's different? Then you look up the tokens in a dictionary to replace them by numbers. Is that because it is easier to deal with numbers than with words? Then you do "something" and each number turns into a vector of 768 numbers. What is it that you do there, and why? What is the information in the other 767 numbers and where does that information come from? What do you want it for? It would be nice if you could give the context, both the big picture and the details.

    • @NielsRogge
      @NielsRogge 1 year ago +5

      Yes good point! I indeed assume in the video that you take the architecture of the Transformer as is, without asking why it looks that way. Let me give you some pointers:
      - subword tokens rather than words are used because it was proven in papers prior to the Transformer paper that they improved performance on machine translation benchmarks, see e.g. arxiv.org/abs/1609.08144.
      - we deal with numbers rather than text since computers only work with numbers, we can't do linear algebra on text. Each token ID (integer) is turned into a numerical representation, also called embedding. Tokens that have a similar meaning (like "cat" and "dog") will be closer in the embedding space (when you would project these embeddings in a n-dimensional space, with n = 768 for instance). The whole idea of creating embeddings for words or subword tokens comes from the Word2Vec paper: en.wikipedia.org/wiki/Word2vec.
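
A small sketch of the two steps described in the pointers above: subword tokenization into integer ids, then an embedding lookup that turns each id into a vector (the "bert-base-uncased" checkpoint and the 768 dimensions are illustrative):

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("My dog is cute")         # subword tokens
input_ids = tokenizer.convert_tokens_to_ids(tokens)   # integer indices into the vocabulary
print(tokens, input_ids)

# One learnable 768-dimensional vector per vocabulary entry; the lookup turns
# each token id into its embedding vector.
embedding = nn.Embedding(tokenizer.vocab_size, 768)
vectors = embedding(torch.tensor([input_ids]))
print(vectors.shape)                                  # (1, num_tokens, 768)
```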

    • @EkShunya
      @EkShunya 1 year ago

      I like the video
      Crisp and concise
      Keep it up

  • @dhirajkumarsahu999
    @dhirajkumarsahu999 7 months ago

    One doubt please: does ChatGPT (a decoder-only model) also use the teacher forcing technique during training?

  • @37-2ensoiree7
    @37-2ensoiree7 1 year ago

    The softmax is missing during training; it's mandatory for calculating the cross-entropy loss.
    An unrelated question: am I understanding right that there is thus a maximum length for all these sentences, like 512 tokens? Isn't that an issue?

    • @navdeep8697
      @navdeep8697 1 year ago

      I think the cross-entropy loss in PyTorch (at least!) applies the softmax internally. Yes, the token limit is a sort of limitation because of how the encoder and decoder work internally, but it can be handled when building the dataset pipeline for training and inference.
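
A quick check of the point above that PyTorch's cross-entropy takes raw logits (the softmax is applied internally), plus the -100 convention mentioned elsewhere in the comments — a sketch with made-up logits and targets:

```python
import torch
import torch.nn as nn

vocab_size, seq_len = 50000, 4
logits = torch.randn(seq_len, vocab_size)    # raw logits from the language modelling head
targets = torch.tensor([231, 42, 7, -100])   # -100 marks a padded position

# CrossEntropyLoss takes raw logits (log-softmax is applied internally)
# and ignores targets equal to -100 (the default ignore_index).
loss_fn = nn.CrossEntropyLoss()
print(loss_fn(logits, targets))
```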

  • @adrienforbu5165
    @adrienforbu5165 1 year ago +1

    Perfect French :)

  • @acasualviewer5861
    @acasualviewer5861 10 months ago

    It seems wasteful to run the entire decoder each time, since it will do computations for all 6 positions regardless. There seems to be an opportunity to optimize this by only using the relevant part of the decoder mask at each iteration.

    • @NielsRogge
      @NielsRogge 10 months ago

      Yes indeed! That's where the key-value cache comes in: huggingface.co/blog/optimize-llm#32-the-key-value-cache
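
A hedged sketch of the key-value cache mentioned above, using an illustrative decoder-only checkpoint ("gpt2"): the cached keys/values let each new step feed in only the newest token instead of re-running the whole sequence:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("My dog is", return_tensors="pt").input_ids

# First step: run the full prompt and keep the cached keys/values of every layer.
out = model(input_ids, use_cache=True)
past = out.past_key_values
next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

# Next step: feed only the newly generated token together with the cache.
out = model(next_id, past_key_values=past, use_cache=True)
print(out.logits.shape)  # (1, 1, vocab_size): logits for the single new position
```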

  • @fra4897
    @fra4897 1 year ago

    heyy niels

  • @isiisorisiaint
    @isiisorisiaint 1 year ago

    OK man, you tried, but honestly this is a totally pointless video: someone who knows what the Transformer is about learns absolutely nothing except that -100 means 'ignore', and somebody who's still trying to wrap their head around the Transformer won't understand a single piece of what you kept typing in there. There you go, it's not just a thumbs-down from me, I also took a couple of minutes to write this reply. Just try to define what the target audience of this video is, and you'll instantly see just how meaningless it is.

    • @navdeep8697
      @navdeep8697 1 year ago

      Agree a little... this is good for an audience interested in using the Hugging Face library especially, but not for understanding the Transformer and attention in a generic way!