Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Paper Explained)

Поділитися
Вставка
  • Опубліковано 9 чер 2024
  • #ai #attention #transformer #deeplearning
    Transformers are famous for two things: Their superior performance and their insane requirements of compute and memory. This paper reformulates the attention mechanism in terms of kernel functions and obtains a linear formulation, which reduces these requirements. Surprisingly, this formulation also surfaces an interesting connection between autoregressive transformers and RNNs.
    OUTLINE:
    0:00 - Intro & Overview
    1:35 - Softmax Attention & Transformers
    8:40 - Quadratic Complexity of Softmax Attention
    9:40 - Generalized Attention Mechanism
    13:45 - Kernels
    20:40 - Linear Attention
    25:20 - Experiments
    28:30 - Intuition on Linear Attention
    33:55 - Connecting Autoregressive Transformers and RNNs
    41:30 - Caveats with the RNN connection
    46:00 - More Results & Conclusion
    Paper: arxiv.org/abs/2006.16236
    Website: linear-transformers.com/
    Code: github.com/idiap/fast-transfo...
    My Video on Attention: • Attention Is All You Need
    My Video on BERT: • BERT: Pre-training of ...
    Abstract:
    Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from (N2) to (N), where N is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks. Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
    Authors: Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret
    Links:
    UA-cam: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
  • Наука та технологія

КОМЕНТАРІ • 81

  • @dingleberriesify
    @dingleberriesify 3 роки тому +35

    The one thing that irks me is kernels keep being referred to as "mapping to infinite dimensional space" in these vids. While true, it's kind of unhelpful. It's probably more helpful to think of it as mapping to a (potentially infinite) space where the (potentially infinite) bases aren't restricted to be vectors. The whole trick about the Hilbert space is it's just a space with a well-defined inner product, and that inner product can be between functions etc. You could have a function mapping to a space where the bases are the sin and cosine functions, for example, and you use that space to take the inner product between different sine/cosine wave combinations. Typically the kind of space you want to map to will express something about the problem you're trying to solve.
    I bring this up because I've seen kernels mentioned twice now, and both times the explanations are kind of lacking. Not Yannic's fault, I was lucky enough to spend a bunch of time around kernel gurus that cleared up a bunch of misconceptions for me...before that, I was making similar mistakes. Content otherwise very solid, as usual!

    • @YannicKilcher
      @YannicKilcher  3 роки тому +5

      Thank you :)

    • @zhonghuahe3556
      @zhonghuahe3556 3 роки тому

      Hi, can you recommend some learning materials about "kernels"

    • @brandomiranda6703
      @brandomiranda6703 2 роки тому

      Mit’s 9.520 with Lorenzo Rosasco ;)

    • @CosmiaNebula
      @CosmiaNebula Рік тому +1

      This is a note for the mathematicians:
      When people say "vector" in computer science, they mean "an array with finitely many entries". In other words, they mean a function from {1, 2, ..., n} to some set X. This is a severe generalization/weakening of the typical vectors that we mean, where a vector lives in a vector space, where there are all kins of symmetries (the group GL(n, K)).
      So when Alex Stenlake says "they aren't restricted to be vectors", he meant "they aren't restricted to be a finite-entry array". They are in fact, *exactly* what we mean when we say "vectors in a countable, separable vector space", in other words, any vector space we use in functional analysis. So as we see, this is a severe strengthening of a severe weakening (first, the CS scientist weakens the mathematician's idea of finite dimensional vector space, then they strengthen it to return to the mathematician's idea), which is why Alex had to point it out.

  • @NicheAsQuiche
    @NicheAsQuiche 3 роки тому +61

    I love that you go over and explain the topics each time you introduce them (e.g. transformers) even though you go over them in other videos it's so useful to me to hear different explanations of it and realy clrafiy what's going on inside. Really helps me get a better understanding.

  • @florianhonicke5448
    @florianhonicke5448 3 роки тому +17

    I like that you explain some stuff again in different videos. It makes sure everyone is on track and if someone would be annoyed by the redundance the person could just fast forward.
    To me it was very helpful to get multiple explanations about transformers

  • @utku_yucel
    @utku_yucel 3 роки тому +14

    the idea of Explaining the papers is awesome! Thanks!

    • @snippletrap
      @snippletrap 3 роки тому

      It's a big part of grad school.

  • @andyblue953
    @andyblue953 3 роки тому

    Great explanation. I really like the way you break down publications to the most essential parts. Never was "reading" articles that comfortable.

  • @fotisj321
    @fotisj321 3 роки тому +3

    Thanks for making a difficult paper accessible. I especially like it that there a red threads in you selection of papers, which create almost something like a story-line, for example the one starting with the paper on the transformer architecture and the follow-up papers trying to remedy some of its shortcomings but keep its achievements.

  • @OwenCampbellMoore
    @OwenCampbellMoore 3 роки тому +4

    These are amazing, love the depth and really appreciate the amount of content!
    I know you’ve said you’re not planning on it, but I’d be super happy to support on Patreon if that helps support you in making so much content!

  • @good_user35
    @good_user35 3 роки тому +9

    At first, I thought your videos are a bit lengthy, but after watching this now I feel that I woudn't be able to understand the contents without sacrficing such an amount of time. And what you point out on the paper similarly tells me that the cost of softmax function is the needed price considering that the function can do in an unknown task. Thanks, it was good watching.

    • @YannicKilcher
      @YannicKilcher  3 роки тому +1

      also, I suggest just changing the playback speed to 2x if it's too long :)

  • @darkmythos4457
    @darkmythos4457 3 роки тому +15

    Thanks Yannic, soon enough you will be one of the requirements for ML jobs; Published at Neurips, ICML, and Yannic YT Channel.

    • @YannicKilcher
      @YannicKilcher  3 роки тому +6

      Haha :D problem is I'm very corrupt so idk how that's going to work out ;)

  • @davidsabatini9100
    @davidsabatini9100 Рік тому

    A really good video! I found it super informative, thank you!

  • @sitioprueba2855
    @sitioprueba2855 2 роки тому

    Thank you so much for all this great explinations!

  • @that_guy4690
    @that_guy4690 3 роки тому +1

    Your videos are pretty long, but they are really not long enough :). Each time the video ends, I get a feeling that I need an even deeper explanation e.g. their code review

  • @tomw4688
    @tomw4688 3 роки тому

    Good shit! This video was very helpful and many ELI5 moments useful for students. Thanks!

  • @raminrasoulinezhad
    @raminrasoulinezhad 3 роки тому

    Cool description. Thanks

  • @daryoushmehrtash7601
    @daryoushmehrtash7601 3 роки тому

    I appreciate your comment about different neural network architecture as different routing of information.

  • @veedrac
    @veedrac 3 роки тому +8

    The natural question I have after this paper: why not both? Test a model with X% compute allocated towards ‘real’ attention and 1-X% towards linear attention, and see what's best.

    • @GeekProdigyGuy
      @GeekProdigyGuy 3 роки тому +2

      You can always ensemble any two models, but unless they provide independent contributions, there isn't really any point. Plus in this case their proposed model's main advantage is reduced compute/memory, so it would seem entirely counterproductive to ensemble with plain attention.

    • @veedrac
      @veedrac 3 роки тому +4

      @@GeekProdigyGuy The idea is that traditional attention is high quality, but struggles at reaching a large context. Clearly that high quality attention is necessary on short scales, but only a small reduction in the traditional attention would be needed to allow for a linear attention that reached very large context sizes. Assuming the far context needs less character-level detail (same thesis as Compressive Transformers), the lossier linear attention should suffice there.

    • @lucidraisin
      @lucidraisin 3 роки тому +6

      Veedrac - I actually explore that here github.com/lucidrains/linear-attention-transformer

  • @noreddine
    @noreddine 3 роки тому +1

    Yannic: just lett me think what you think.
    Me: I think you're awesome

  • @welcomeaioverlords
    @welcomeaioverlords 3 місяці тому

    I'm very late, but well done. Good call out on the RNN caveat! I missed that. And one correction (or I misunderstand something): it seems that we're not actually going to a higher dimensional space through phi, but rather applying a point-wise non-linearity and maintaining the same dimensionality. Any comments on that choice? That's seemingly contrary to the point of the kernel trick as I understood it. Why not do the typical DL thing and simply make phi a learnable function?

  • @snippletrap
    @snippletrap 3 роки тому +1

    Re: causal masking implementation @38:00, input[i] != target[i] because the target is shifted by one before being passed into the model.

  • @mookee
    @mookee 3 роки тому +5

    Cool charts at 26:20. Interesting that Reformer is slower than regular transformer at shorter sequences. You would't know that just from asymptotic time complexity. Looks like the paper conveniently omits that :)
    Linear attention seems super fast tho.
    Also it's probably obvious, but two times higher slope on a log chart corresponds to a square of a function on a linear chart.

    • @YannicKilcher
      @YannicKilcher  3 роки тому

      Yep, I'm just a moron when it comes to non-linear charts :D

  • @DanielHesslow
    @DanielHesslow 3 роки тому +2

    This is super cool and I think we will see the performance improve down the line as people explore other feature mappings, elu + 1 seems a biiit arbitrary

  • @MrMIB983
    @MrMIB983 3 роки тому +1

    Love it! Universal transformer please!

  • @samm9840
    @samm9840 3 роки тому +1

    How about applying the kernel (O(n)) trick at inferencing alone after training?
    I am especially interested whether the DETR model (Yannic already did a video on that), that uses the heavy-weight transformer be modified to do object detection simply with this linear version! May be it's an idea for a new paper. Looking forward to it.

  • @Erotemic
    @Erotemic 3 роки тому +3

    That is quite an impressive speed increase, and the ability to perform RNN like inference is incredibly useful.
    I wish they investigated other kernel functions (or maybe they did and I missed it), but I'd be interested in seeing how the choice of kernel impacts the results.
    Am I understanding the RNN autoregressive-transformer equivalence correctly? Is it the case that translation transformer models are not RNNs?

    • @YannicKilcher
      @YannicKilcher  3 роки тому

      only transformers that are using the specific causal masking are RNNs

  • @muneebth5508
    @muneebth5508 3 роки тому +1

    Sir, Thank You

  • @myungchulkang5716
    @myungchulkang5716 3 роки тому +1

    thank you :)

  • @ShokhanBirlikov
    @ShokhanBirlikov 3 роки тому

    Your intuition at 31:40 is great! Maybe a silly question, but is it a possible architecture where the transformation function phi is learned as Nx(#of intermediate units) matrix of trainable parameters?

  • @andyblue953
    @andyblue953 3 роки тому

    I was wondering to which extend their idea is related to A² double-net attention? In the latter, the outer product is likewise separated into a gathering step and a distribution step, and it's likewise shown to be on par with the original transformer. Anyway, the general mathematical idea can already be found in the SVD.

  • @sajjadayobi688
    @sajjadayobi688 3 роки тому

    thank Yannic, we saw a few papers for transformers in long sequences like ReFormer, LinFormer, LongFormer but which one of them is really well in real-world
    if you make a video about comparing these papers for faster or longer Transformer I would be very thankful

  • @kiyoonyoo4761
    @kiyoonyoo4761 3 роки тому +1

    Thanks Yannic for the explanation.
    I was actually having some difficulty understanding how they used the associative property to transition from Eq. 4 -> Eq.5. Glad to see you've already made a video!
    This wasn't explained in detail on the video also, maybe because it is something trivial, but am I the only one who cannot get around why the two formulations (Eq.4 and Eq.5) are equivalent?
    The dimensions of the numerators definitely are not the same as the former is a linear combinations of column vectors, while the latter is a row vector multiplied by square matrix. Also, associative property does not seem to be true in general for vector multiplication involving vector dot products.
    Any help or explanation from anyone would be very appreciated :0

    • @YannicKilcher
      @YannicKilcher  3 роки тому

      Sometimes people are very loose with row/column vectors and don't properly apply transposes. Can you find a counter-example where the equation doesn't hold (even when glossing over transpose issues)?

    • @norik1616
      @norik1616 3 роки тому

      ​@@YannicKilcher I had hard times with the indexes as well, so I wrote it in numpy:
      It does not help with transpositions (as numpy matches them *mostly correctly* automatically), but it helps with where to put which product - the equation holds true.
      import numpy as np
      shape = 5, 7
      K = np.random.rand(*shape)
      Q = np.random.rand(*shape)
      V = np.random.rand(*shape)
      classic_attention = np.matmul(np.matmul(Q, K.T), V)
      assert classic_attention.shape == shape # stripped down classic attention
      res = []
      for i in range(shape[0]):
      res.append(sum([np.matmul(Q[i], K[j]) * V[j] for j in range(shape[0])]))
      matmul_byhand_attention = np.array(res)
      print(np.allclose(classic_attention, matmul_byhand_attention))
      res = []
      for i in range(shape[0]):
      s = sum([np.outer(K[j], V[j]) for j in range(shape[0])]) # associativity and distributivity
      res.append(np.matmul(Q[i], s))
      matmul_byhand_attention2 = np.array(res)
      print(np.allclose(classic_attention, matmul_byhand_attention2))

  • @Kerrosene
    @Kerrosene 3 роки тому +1

    I seems like the closer we are to approximating the soft-max operator with a finite dimensional feature map, the better the performance of this linear transformer would be...which is what they suggested in the conclusion..Also, they don't use positional encodings..

  • @mattiasfagerlund
    @mattiasfagerlund 3 роки тому +1

    Hi - is the reason that the softmax version is O(N^2) that softmax (in the general case) has N^2 gradients before they're collapsed back? Or is there something else going on?

    • @YannicKilcher
      @YannicKilcher  3 роки тому +1

      It's because you need to do N^2 inner products. Maybe if the softmax wasn't there, you could somehow optimize that away, so I think you do have a point

  • @rayaengazou2140
    @rayaengazou2140 Місяць тому

    Hey! that's greate !! i would like to ask about the model he used to produce results in the paper ?in github there was only the library not the model ! can you share with us the model ?

  • @RuminRoman
    @RuminRoman 3 роки тому +1

    I wonder why you do not have a video about Universal Transformers

  • @Bencurlis
    @Bencurlis 3 роки тому

    The generalized formulation of equation 3 looks like Bayes formula, has anyone noticed? Is it accidental or is there a deeper meaning here?

  • @couragefox
    @couragefox 3 роки тому +1

    commenting for algo. love your channel

  • @herp_derpingson
    @herp_derpingson 3 роки тому +2

    12:33 If Adding all the K[i] terms means that there is no way for the transformer to know the position of words in the sentence. Maybe the positional embeddings are helping it somehow. It would be interesting to see the performance of this transformer without the positional embeddings. Then again, RNNs dont have any explicit mechanism to store positions of words.

    • @YannicKilcher
      @YannicKilcher  3 роки тому +1

      True, the positional embeddings probably do a lot of work here. RNNs don't have explicit position, but they can theoretically learn to count the number of input tokens - or what's probably more relevant - learn to estimate distances between tokens, which is another thing that position embeddings in transformers are good for.

  • @bhuvanb8156
    @bhuvanb8156 3 роки тому +2

    at 14:17 numerator and denominator are same except multiplying with Vj, how's is that possible

    • @YannicKilcher
      @YannicKilcher  3 роки тому

      it characterizes a normalized distribution over V

  • @andres_pq
    @andres_pq 3 роки тому +1

    So in practice you could stack way more tranformer layers and have better performance?

  • @lucasnestler6933
    @lucasnestler6933 3 роки тому +1

    As the paper states that transformers attending to one side only are equal to RNNs, doesn't it imply that full transformers are equal to residual bidirectional RNNs?

    • @YannicKilcher
      @YannicKilcher  3 роки тому

      Only in an abstract sense. The computations are really different from a bidirectional RNN

  • @zhangcx93
    @zhangcx93 3 роки тому +3

    According to there code, the main computation is like this:
    KV = torch.einsum("nshd,nshm->nhmd", K, values)
    Z = 1/(torch.einsum("nlhd,nhd->nlh", Q, K.sum(dim=1))+self.eps)
    V = torch.einsum("nlhd,nhmd,nlh->nlhm", Q, KV, Z)
    can someone translate this einsum into a normal pytorch way?

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 3 роки тому +3

      Better off just reading up on einsum and getting familiar with it. it is by far the most readable and intuitive way of expressing these statements; and I say that as a numpy veteran of 15 years.

    • @YannicKilcher
      @YannicKilcher  3 роки тому +1

      yea the problem is that would translate to multiple matmuls, elementwise products, transposes and summations :D I also suggest you dive into einsum notation

    • @markdaoust4598
      @markdaoust4598 3 роки тому

      I came to ask “what’s the einsum for 21:03.”. Thanks!
      So.. dropping the “batch” and “head” indices it’s:
      Vld = sum_sd Qld*Ksd*Vsm / sum_sd Qld Ksd

  • @RobotProctor
    @RobotProctor 3 роки тому +1

    If Q.K is usually the highest for the same token, I wonder why Q and K need to be different at all for a transformed to work properly. Why not just output a K and a V?

    • @YannicKilcher
      @YannicKilcher  3 роки тому +1

      interesting idea. some papers fuse k and v, but I've never heard of fusing k and q

  • @minhoryu2786
    @minhoryu2786 3 роки тому +1

    Can anyone summarize difference between linformer and this paper?

    • @meerkatj9363
      @meerkatj9363 3 роки тому +10

      Linformer has a linear complexity thanks to a constant dimension of the keys and values because they are projected in a fixed lower dimension. Whereas this does not fix the dimension of the sequence, it only reorders the computation. The reordering can only be made with conditions that need the kernel thing.

  • @baqirhusain5652
    @baqirhusain5652 Рік тому

    The complexity works well when N > D^2. So the paper is saying if dimensionality is 512 my sequence length should be greater than 512^2 = 262144. Is this really practical !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

  • @seraphim9723
    @seraphim9723 3 роки тому +1

    Wait a second: Figure 5c is showing PER over wall-clock time, and despite the significant speed-up the original softmax transformer is better AND faster (at reaching any given error rate).

    • @YannicKilcher
      @YannicKilcher  3 роки тому

      Yes, that's because on this task it seems to be very important to have the quadratic attention

  • @siyn007
    @siyn007 3 роки тому +2

    speed over accuracy for me tbh. Not everyone has Google's resources to run it on 112312412312423 TPUs

  • @first-thoughtgiver-of-will2456
    @first-thoughtgiver-of-will2456 3 роки тому +1

    If I was rich I'd give you money.

  • @charudattamanwatkar8340
    @charudattamanwatkar8340 3 роки тому +8

    Who else saw balls in the thumbnail?

  • @joery8290
    @joery8290 3 роки тому +1

    I love your videos, but could you try to make them more compact? 50 minutes is a big time investment. I feel like most papers can be explained in high detail in half that time.