Why Transformer over Recurrent Neural Networks

  • Published 27 Aug 2024
  • #transformers #machinelearning #chatgpt #gpt #deeplearning

COMMENTS • 56

  • @IshtiaqueAman
    @IshtiaqueAman 1 year ago +94

    That's not the main reason. RNNs keep adding embeddings into the hidden state and hence overwrite information that came before, whereas in a transformer the embeddings are all there the whole time and attention can pick the ones that are important.
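
A minimal NumPy sketch of the contrast described in the comment above: the RNN squeezes the whole sequence through one recurring hidden vector, while attention keeps every token embedding available and re-weights it on demand. All dimensions and weights are toy values, not anything from the video.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))            # one embedding per token

# RNN: earlier tokens only survive inside the single hidden vector h,
# which gets partially overwritten at every step.
W_h = rng.normal(size=(d, d)) * 0.1
W_x = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(W_h @ h + W_x @ x[t])

# Attention: all token embeddings stay accessible; each position just
# re-weights them and picks out what it needs.
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)                # every token scores every token
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
context = weights @ V

print(h.shape, context.shape)                # (8,) vs (5, 8)
```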

  • @untitledc
    @untitledc 1 month ago +3

    Note that the decoder in a Transformer outputs one vector at a time as well

  • @NoahElRhandour
    @NoahElRhandour 1 year ago +29

    That was a great video!
    I find learning about such things generally easier and more interesting if they are compared to other models/ideas that are similar but not identical

    • @CodeEmporium
      @CodeEmporium 1 year ago +3

      Thank you for the kind words. And yep, agreed 👍🏽

    • @NoahElRhandour
      @NoahElRhandour 1 year ago +3

      @@CodeEmporium I guess, just like CLIP, our brains perform contrastive learning as well xd

  • @schillaci5590
    @schillaci5590 1 year ago +9

    This answered a question I didn't have. Thanks!

    • @CodeEmporium
      @CodeEmporium 1 year ago +1

      Always glad to help when not needed!

  • @brianprzezdziecki
    @brianprzezdziecki 1 year ago +21

    YouTube, recommend me more videos like this plz

  • @IgorAherne
    @IgorAherne 1 year ago +4

    I think LSTMs are more tuned towards keeping the order, because although transformers can assemble embeddings from various tokens, they don't know what follows what in a sentence.
    But perhaps with relative positional encoding they might be equipped just about enough to understand the order of sequential input (see the positional-encoding sketch after this thread).

    • @evanshlom1
      @evanshlom1 1 year ago +1

      Your comment came right before GPT blew up, so maybe you wouldn't say this anymore?
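
A sketch of the absolute (sinusoidal) positional encodings from the original Transformer paper, which are one answer to the ordering concern in the parent comment; relative positional encodings work differently, and the sizes below are illustrative only.

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Classic sin/cos positional encodings: one d_model-sized vector per position."""
    pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions
    return pe

token_embeddings = np.random.randn(6, 16)                 # 6 tokens, d_model = 16
inputs = token_embeddings + sinusoidal_positions(6, 16)   # same word at a different
                                                          # position -> different input vector
```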

  • @GregHogg
    @GregHogg 1 year ago +7

    This is a great video!!

  • @FluffyAnvil
    @FluffyAnvil 6 months ago +7

    This video is 90% wrong…

    • @ccreutzig
      @ccreutzig 6 months ago +3

      But presented confidently and getting praise. Reminds me of ChatGPT. 😂

  • @sandraviknander7898
    @sandraviknander7898 9 months ago +2

    An important caveat is that decoder transformers like the GPT models are trained autoregressively, with no context of the words coming after (a sketch of the causal mask follows this thread).

    • @sreedharsn-xw9yi
      @sreedharsn-xw9yi 9 months ago

      Yeah, its masked multi-head attention only focuses left-to-right, right?

    • @free_thinker4958
      @free_thinker4958 6 months ago +1

      @@sreedharsn-xw9yi Yes, that's decoder-only transformers, such as GPT-3.5 for example, and any text generation model
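
A toy illustration of the causal ("left-to-right") mask discussed in this thread: masked positions get zero attention weight, so a token never sees the words coming after it. The scores here are random placeholders, not outputs of a real model.

```python
import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)                  # raw attention scores
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[causal_mask] = -np.inf                               # block attention to future tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                                 # lower-triangular: row i only
                                                            # attends to positions 0..i
```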

  • @borregoayudando1481
    @borregoayudando1481 1 year ago +3

    I would like to have a more skeleton-up or foundation-up understanding (to better understand the top-down representation of the transformer). Where should I start, linear algebra?

  • @lavishly
    @lavishly 1 month ago

    This was cool, but I'm not sure it was explained correctly or I didn't understand fully. I study transformers, and the global attention mechanism does word prediction by comparing a word to every other past word and input. How does that predict future words?

  • @aron2922
    @aron2922 1 year ago +5

    You should have put LSTMs as a middle step

    • @CodeEmporium
      @CodeEmporium 1 year ago +1

      Good call. I just bundled them with Recurrent Neural Networks here

  • @kenichisegawa3981
    @kenichisegawa3981 9 months ago

    This is the best explanation of RNN vs Transformer I've ever seen. Is there a similar video like this for self-attention by any chance? Thank you

    • @CodeEmporium
      @CodeEmporium 9 months ago

      Thanks so much for the kind words. There is a full video on self-attention on the channel. Check out the first video in the playlist “Transformers from scratch” below.

  • @free_thinker4958
    @free_thinker4958 6 months ago

    The main reason is that RNNs have what we call the exploding and vanishing gradient problem.
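
A small numerical illustration of that gradient problem: backpropagation through time multiplies the gradient by the recurrent Jacobian once per step, so it shrinks or grows geometrically with sequence length. The matrices below are stand-ins, not a trained RNN.

```python
import numpy as np

for scale, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    W = scale * np.eye(4)            # stand-in for the recurrent weight matrix
    grad = np.ones(4)
    for _ in range(50):              # 50 steps of backpropagation through time
        grad = W.T @ grad
    print(f"{label}: gradient norm after 50 steps = {np.linalg.norm(grad):.3g}")
```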

  • @alfredwindslow1894
    @alfredwindslow1894 8 months ago +1

    Don't transformer models generate one token at a time? It's just that they're faster, as calculations can be done in parallel

    • @nomecriativo8433
      @nomecriativo8433 8 months ago +1

      Transformers aren't only used for text generation.
      But in the case of text generation, the model internally predicts the next token for every token on the sentence.
      E.g the model is trained to do this:
      This is an example phrase
      is an example phrase
      So the training requires a single step.
      Text generation models also have a causal mask, tokens can only attend to the tokens that come before it. So the network doesn't cheat during training.
      During inference, only one token is generated at a time, indeed.
      If I'm not mistaken, there's an optimization to avoid recalculating the previously calculated tokens.

    • @ccreutzig
      @ccreutzig 6 months ago +1

      Not all transformers use a causal mask. Encoder models like BERT usually don't - it would break the usefulness of the [CLS] token, for starters.
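
A runnable version of the shifted-target setup @nomecriativo8433 describes: training computes a next-token loss for every position in one parallel pass, while generation is sequential. The model and loss calls in the comments are placeholders, not a specific library's API.

```python
tokens = ["This", "is", "an", "example", "phrase"]
inputs, targets = tokens[:-1], tokens[1:]        # same sentence, shifted by one

for inp, tgt in zip(inputs, targets):
    print(f"after {inp!r:>10} -> predict {tgt!r}")

# Training (one parallel step over the whole sentence, placeholders only):
#   logits = model(inputs)                      # (len(inputs), vocab_size)
#   loss   = cross_entropy(logits, targets)     # one next-token loss per position
#
# Inference (genuinely one token at a time):
#   while not finished:
#       generated.append(sample(model(generated)[-1]))
```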

  • @drdca8263
    @drdca8263 15 days ago

    Aren't most of the transformers in use based on causal self-attention? That doesn't seem to have the bidirectional thing to it?

  • @jackrayner1263
    @jackrayner1263 9 months ago

    Does a decoder model share these same advantages? Without the attention mapping, wouldn't it be operating with the same context as an RNN?

  • @jugsma6676
    @jugsma6676 1 year ago +1

    Can you do a Fourier Transform replacing the attention head?

  • @vastabyss6496
    @vastabyss6496 10 months ago

    What if you wanted to train a network to take a sequence of images (like in a video) and generate what comes next? Wouldn't that be a case where RNNs and their variations like LSTMs and GRUs are better, since each image is most closely related to the images coming directly before and after it?

    • @-p2349
      @-p2349 10 months ago

      This is done by GAN networks, or generative adversarial networks. This would have two CNNs: one is a "discriminator" network and the other a "generator" network.

    • @vastabyss6496
      @vastabyss6496 9 months ago

      ​@@-p2349 I thought that GANs could only generate an image that was similar to those in the dataset (such as a dataset containing faces). Also, how would a GAN deal with the sequential nature of videos?

    • @ccreutzig
      @ccreutzig 6 months ago

      There is ViT (Vision Transformer), although that predicts parts of an image, and I've seen at least one example of ViT feeding into a Longformer network for video input. But I have no experience using it.
      GANs are not the answer to what I read in your question.

  • @Laszer271
    @Laszer271 1 year ago

    What I'm wondering is. Why do all APIs charge you credits for input tokens for transformers? For me, it shouldn't make a difference for a transformer to take 20 tokens as input or 1000 (as long as it's within its maximum context lengths). Isn't that the case that transformer always pads the input to its maximum context length anyway?

    • @ccreutzig
      @ccreutzig 6 months ago +1

      No, the attention layers usually take a padding mask into account and can use smaller matrices. It just makes the implementation a bit more involved.
      The actual cost should be roughly quadratic in your input size, but that's probably not something the marketing department would accept.
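
A toy NumPy sketch of the padding mask mentioned in the reply above: padded key positions are excluded from the softmax, so a short prompt is not processed as if it filled the maximum context length. This illustrates the mechanism only, not any provider's billing.

```python
import numpy as np

lengths, max_len = [3, 5], 5                         # two sequences padded to length 5
padding_mask = np.array([[i < n for i in range(max_len)] for n in lengths])  # True = real token

scores = np.random.randn(2, max_len, max_len)        # (batch, query, key) attention scores
scores = np.where(padding_mask[:, None, :], scores, -np.inf)   # hide padded keys

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights[0, 0], 2))                    # the two padded keys get weight 0.0
```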

  • @PhuongNguyen-gq8yq
    @PhuongNguyen-gq8yq 1 month ago

    Is this before or after Mamba?

  • @sreedharsn-xw9yi
    @sreedharsn-xw9yi 9 months ago

    How can we relate this to the masked multi-head attention concept of transformers? This video seems to conflict with that. Any expert ideas here, please?

  • @UnderstandingCode
    @UnderstandingCode 1 year ago

    Ty

  • @manikantabandla3923
    @manikantabandla3923 1 year ago +1

    But there is also a version of RNN with attention.

    • @gpt-jcommentbot4759
      @gpt-jcommentbot4759 1 year ago +1

      These RNNs are still worse than Transformers. However, there have been Transformer + LSTM combinations. Such neural networks have the theoretical potential to create chatbots with extremely long-term context, far beyond 4000 tokens, due to their recurrent nature.

  • @vtrandal
    @vtrandal 1 year ago

    Fantastic!

  • @TheScott10012
    @TheScott10012 1 year ago +1

    I respect the craft! Also, pick up a pop filter

    • @CodeEmporium
      @CodeEmporium 1 year ago +2

      I have a p-p-p-predilection for p-p-plosives

  • @wissalmasmoudi3780
    @wissalmasmoudi3780 1 year ago

    I need your help with my NARX neural network, please

  • @cxsey8587
    @cxsey8587 1 year ago +1

    Do LSTMs have any advantage over transformers?

    • @gpt-jcommentbot4759
      @gpt-jcommentbot4759 1 year ago +3

      They work better with less text data, and they also work better as decoders. While LSTMs don't have many advantages, future iterations of RNNs could lead to learning far longer-term dependencies than Transformers. I think that LSTMs are more biologically accurate than Transformers, since they incorporate time and are not layered like conventional networks, but instead are theoretically capable of simple topological structures.
      However, there have been "recurrent Transformers", which are basically Long Short-Term Memory + Transformers. The architecture is literally a transformer layer turned into a recurrent cell, along with gates inspired by the LSTM.

  • @kvlnnguyieb9522
    @kvlnnguyieb9522 7 months ago

    How about the new SSM in Mamba? Mamba is said to be better than the transformer.

  • @Userforeverneverever
    @Userforeverneverever 4 months ago

    For the algo

  • @cate9541
    @cate9541 1 year ago

    cool

  • @AshKetchumAllNow
    @AshKetchumAllNow 8 months ago +2

    No model understands

  • @sijoguntayo2282
    @sijoguntayo2282 1 year ago

    Great video! In addition to this, RNNs, due to their sequential nature, are unable to take advantage of transfer learning. Transformers do not have this limitation.