Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (AI Paper Explained)

  • Published 5 Jun 2024
  • #transformer #nystromer #nystromformer
    The Nyströmformer (or Nystromformer, Nyströmer, Nystromer) is a new drop-in replacement that approximates the self-attention matrix in Transformers with linear memory and time requirements. Most importantly, it uses the Nyström method to subselect (or segment-mean) queries and keys as so-called landmarks and uses those to reconstruct the inherently low-rank attention matrix. This is relevant for many areas of machine learning, especially natural language processing, where it enables longer sequences of text to be processed at once.
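
    In rough terms, with m landmark queries Q̃ and keys K̃ obtained by segment means, the approximation can be sketched as below (the notation is a simplification for illustration; the exact statement is in the paper):

      \hat{S} \;\approx\; \mathrm{softmax}\!\left(\tfrac{Q\tilde{K}^{\top}}{\sqrt{d}}\right)\,\left[\mathrm{softmax}\!\left(\tfrac{\tilde{Q}\tilde{K}^{\top}}{\sqrt{d}}\right)\right]^{+}\,\mathrm{softmax}\!\left(\tfrac{\tilde{Q}K^{\top}}{\sqrt{d}}\right)

    The three factors are n×m, m×m (pseudoinverted), and m×n, so the full n×n attention matrix is never materialized.
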
    OUTLINE:
    0:00 - Intro & Overview
    2:30 - The Quadratic Memory Bottleneck in Self-Attention
    7:20 - The Softmax Operation in Attention
    11:15 - Nyström-Approximation
    14:00 - Getting Around the Softmax Problem
    18:05 - Intuition for Landmark Method
    28:05 - Full Algorithm
    30:20 - Theoretical Guarantees
    35:55 - Avoiding the Large Attention Matrix
    36:55 - Subsampling Keys vs Negative Sampling
    43:15 - Experimental Results
    47:00 - Conclusion & Comments
    Paper: arxiv.org/abs/2102.03902
    Code: github.com/mlpen/Nystromformer
    Appendix: github.com/mlpen/Nystromforme...
    LRA Results: / 1359301186734620675
    Twitter lucidrains w/ author: / 1359597104075661312
    Twitter lucidrains w/ _clashluke: / 1359483460851802115
    Abstract:
    Transformers have emerged as a powerful tool for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism that encodes the influence or dependence of other tokens on each specific token. While beneficial, the quadratic complexity of self-attention on the input sequence length has limited its application to longer sequences -- a topic being actively studied in the community. To address this limitation, we propose Nyströmformer -- a model that exhibits favorable scalability as a function of sequence length. Our idea is based on adapting the Nyström method to approximate standard self-attention with O(n) complexity. The scalability of Nyströmformer enables application to longer sequences with thousands of tokens. We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nyströmformer performs comparably, or in a few cases, even slightly better, than standard Transformer. Our code is at this https URL.
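
    To make the O(n) scaling concrete, here is a minimal NumPy sketch of landmark attention with segment-mean landmarks. It is only an illustration, not the authors' implementation: it assumes n is divisible by m, ignores batching and multiple heads, and uses an exact pseudoinverse instead of the paper's iterative approximation.

      import numpy as np

      def softmax(x, axis=-1):
          x = x - x.max(axis=axis, keepdims=True)
          e = np.exp(x)
          return e / e.sum(axis=axis, keepdims=True)

      def nystrom_attention(Q, K, V, m=64):
          # Q, K, V: (n, d); m landmarks come from segment means over the sequence.
          n, d = Q.shape
          Q_lm = Q.reshape(m, n // m, d).mean(axis=1)   # landmark queries (m, d)
          K_lm = K.reshape(m, n // m, d).mean(axis=1)   # landmark keys    (m, d)
          F = softmax(Q @ K_lm.T / np.sqrt(d))          # (n, m)
          A = softmax(Q_lm @ K_lm.T / np.sqrt(d))       # (m, m)
          B = softmax(Q_lm @ K.T / np.sqrt(d))          # (m, n)
          # The full (n, n) attention matrix is never formed.
          return F @ np.linalg.pinv(A) @ (B @ V)        # (n, d)

    Memory here grows as O(n·m) rather than O(n²), which is the point of the method.
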
    Authors: Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh
    Links:
    TabNine Code Completion (Referral): bit.ly/tabnine-yannick
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    BiliBili: space.bilibili.com/1824646584
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 70

  • @andreassyren329
    @andreassyren329 3 years ago +56

    If nothing else, the contribution to model naming is a clear increment to SOTA.

  • @VikasSingh-jv7fn
    @VikasSingh-jv7fn 3 years ago +18

    Hello Yannic,
    Your comment about the order of operations is correct. It is one of those things where you set out to check how poorly it performs and find out that it can work empirically (at least in limited settings). The lemma is not practically useful; it merely shows that, if/when everything is idealized, the results/procedure do not lead to nonsensical conclusions. The choice of F early in the paper was made to avoid a conflict with D (D and d were both used) and E (the ones matrix).

    • @tchlux
      @tchlux 3 years ago

      What about the A^+ comment? Was that actually a typo in the paper? ua-cam.com/video/m-zrcmRd7E4/v-deo.html

    • @zhanpengzeng8592
      @zhanpengzeng8592 3 years ago +1

      @@tchlux Yes, that is a typo. We somehow left out the pseudo inverse sign.

    • @xiongyunyang9643
      @xiongyunyang9643 3 years ago +2

      @@tchlux This is a typo. We will update it. Thanks for the catch.

    • @JoelHough
      @JoelHough 3 years ago

      I have seen this exact type of lemma in many discussions about approximations. It did not seem out of place to me. It's nice to know that in the limit your approximation will agree with the ground truth, which is certainly not the case for all approximation methods.

  • @xiongyunyang9643
    @xiongyunyang9643 3 years ago +3

    Thanks for making this great video. Nice catch for the typo. We will update the draft soon.

  • @herp_derpingson
    @herp_derpingson 3 years ago +7

    0:42 Nyan-storm-former!
    .
    3:30 Time for my weekly Transformer explanation :)
    .
    27:00 That was a really sweet and easy to understand explanation.
    .
    35:00 I wonder if we can have a DNN just predict a landmark tensor..

  • @poesaste
    @poesaste 3 years ago +1

    Incredible breakdown, subbed!

  • @Anirudh-cf3oc
    @Anirudh-cf3oc 2 years ago

    Very nice explanation sir, thank you!!

  • @lucidraisin
    @lucidraisin 3 years ago +2

    Lol, unexpectedly mentioned 😅 thanks for the video!

  • @G12GilbertProduction
    @G12GilbertProduction 3 years ago

    Sweet Thursday with another sweet kind of paper. Buon appetito, Yannic! :)

  • @mdmishfaqahmed8356
    @mdmishfaqahmed8356 3 years ago +1

    The pronunciation of those author names was clutch :D

  • @mathematicalninja2756
    @mathematicalninja2756 3 years ago +10

    What I hear: nice transformer

  • @jamiekawabata7101
    @jamiekawabata7101 3 years ago

    If I lift box 1 onto shelf 1 and box 1 onto shelf 2 and box 2 onto shelf 1, then I can predict the effort in lifting box 2 onto shelf 2. Great analogy, thank you.

  • @chaitanyaparmar888
    @chaitanyaparmar888 2 years ago

    Love this video!

  • @user-qu2oz2ut2h
    @user-qu2oz2ut2h 3 years ago +4

    that's a nice trömformer

  • @scarcommander5517
    @scarcommander5517 3 years ago

    We like this transformer!

  • @RobEnglebright
    @RobEnglebright 3 years ago +5

    top work pronouncing the authors

  • @MausamJain
    @MausamJain 3 years ago +1

    How did you import the PDF into OneNote with such good quality? The printout option generally inserts very poor-quality images of the pages.

    • @YannicKilcher
      @YannicKilcher 3 years ago

      It's definitely poor for me too; it's right on the edge of being useful.

  • @benjaminho351
    @benjaminho351 3 years ago +11

    Nobody:
    Yannic: uöu 1:07

  • @mizupof
    @mizupof 3 years ago +8

    By the power of Yannic, I rename you!

  • @KennedDansker
    @KennedDansker 3 years ago

    It is F because it is forward attention, right? (Then it would fit with B being backward.) It is not entirely right (A contains part of the forward attention), but I think that is the intention.

  • @otaviodzb1
    @otaviodzb1 3 years ago +2

    One thing that I still couldn't understand is how backprop works in a transformer. Does someone have a good reference or video that explains it?

    • @pg1337ful
      @pg1337ful 3 years ago +2

      seems like you have fundamental gaps in ML.

    • @YannicKilcher
      @YannicKilcher 3 years ago +2

      It works like in any other neural network: by applying the chain rule to all involved operations.
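
      For anyone who wants to see that concretely, here is a tiny PyTorch sketch (illustrative only, not tied to this specific paper): autograd records the attention operations and backward() applies the chain rule through all of them.

        import torch

        n, d = 8, 16
        Q = torch.randn(n, d, requires_grad=True)
        K = torch.randn(n, d, requires_grad=True)
        V = torch.randn(n, d, requires_grad=True)

        attn = torch.softmax(Q @ K.T / d**0.5, dim=-1)  # (n, n) attention weights
        out = attn @ V                                   # (n, d) attended values
        out.sum().backward()     # chain rule through matmul, softmax, matmul
        print(Q.grad.shape)      # torch.Size([8, 16])
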

  • @NilabhraRoyChowdhury
    @NilabhraRoyChowdhury 3 years ago +3

    I bet you wish you could:
    from previous_videos import SelfAttention
    every time you make a video related to transformers

  • @osiris42
    @osiris42 3 years ago +1

    Does it even matter that the softmax doesn't commute, if the softmax is just a heuristic/hack in the first place?
    Or is there something inherently special about softmax in the transformer architecture?

    • @tchlux
      @tchlux 3 years ago +4

      I don't know if I'd call it "special", but I like to think of it geometrically. When you use a softmax, you make it so that the layer immediately after the softmax only has to model a "surface" that lives on the inner wedge of the unit cube (points with 1-norm equal to 1).

    • @mathematicalninja2756
      @mathematicalninja2756 3 years ago

      @@tchlux that is a good perspective
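
      A quick numerical check of that geometric picture (a small NumPy sketch; the point is only that softmax outputs are nonnegative and sum to 1, i.e. they lie on the probability simplex):

        import numpy as np

        x = np.random.randn(5) * 3       # arbitrary logits
        p = np.exp(x - x.max())
        p /= p.sum()                     # softmax
        print(p.min() >= 0.0, np.isclose(p.sum(), 1.0))  # True True
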

  • @muhammadsaadmansoor7777
    @muhammadsaadmansoor7777 3 years ago

    I was not expecting this until a month later.
    But where do the keys, queries, and values come from?

    • @IRWBRW964
      @IRWBRW964 3 years ago +1

      They are learned.

  • @JamesAwokeKnowing
    @JamesAwokeKnowing 3 years ago

    So is that like a softmax over time, where it's valid kind of because over many iterations it's pulling random samples? Well, I hope a better way is found.

  • @jonatan01i
    @jonatan01i 3 years ago +1

    I struggle to believe that it is actually named Nyströmformer.
    I'll call it the Nyströmer, as suggested, and as it should be.

  • @JamesAwokeKnowing
    @JamesAwokeKnowing 3 years ago +1

    The name was designed to sound like 'the nice transformer'. So leave the name as is.

  • @ZedaZ80
    @ZedaZ80 3 years ago

    I have no idea what most of this means, but the lemma was funny

  • @BorrWick
    @BorrWick 3 years ago +4

    Didn't this come out like yesterday??

    • @ingusmant
      @ingusmant 3 years ago +2

      And?

    • @BorrWick
      @BorrWick 3 years ago +8

      @@ingusmant Just amazed by the speed at which Yannic can read, understand, and produce these videos :o

    • @YannicKilcher
      @YannicKilcher 3 years ago +4

      You're right, it's already old now... ;)

  • @kicckicc
    @kicckicc 3 years ago +1

    Just FYI, I tried to implement this the day before yesterday but got NaN. I checked the code and realized that formula (14) isn't accurate, and also that Z0 = AS/(||AS||1 ||AS||∞) should be Z0 = transpose(AS)/(||AS||1 ||AS||∞).

    • @xiongyunyang9643
      @xiongyunyang9643 3 years ago

      You mean the NaN is from your own implementation or from ours? The accuracy of the pseudoinverse approximation with formula (14) depends on how many iterations are used. Z0 is AS^T/(||AS||1 ||AS||∞). We will fix the typo in our update.

    • @kicckicc
      @kicckicc 3 years ago

      @@xiongyunyang9643 Thanks for the reply. After I used the correct (14) and the correct Z0, the NaN is gone. Just FYI, formula (16) is also inaccurate, but that one is easy to notice.

    • @xiongyunyang9643
      @xiongyunyang9643 3 years ago

      @@kicckicc Cool. Formula (16), similar to local average pooling, is used to compute the landmarks efficiently.
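
      For readers following this thread, here is a hedged NumPy sketch of this kind of iterative pseudoinverse, using the corrected initialization Z0 = AS^T/(||AS||1 ||AS||∞) mentioned above. The nested update below is a commonly used hyper-power-style iteration for the Moore-Penrose pseudoinverse; check the updated paper for the exact form of formula (14).

        import numpy as np

        def iterative_pinv(A, iters=6):
            # Z0 = A^T / (||A||_1 * ||A||_inf), as noted in the thread above.
            Z = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
            I = np.eye(A.shape[0])
            for _ in range(iters):
                AZ = A @ Z
                # Nested update toward the Moore-Penrose pseudoinverse.
                Z = 0.25 * Z @ (13 * I - AZ @ (15 * I - AZ @ (7 * I - AZ)))
            return Z
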

  • @FreddySnijder-TheOnlyOne
    @FreddySnijder-TheOnlyOne 3 years ago +3

    Hi there!

  • @Xrey56Cheyz
    @Xrey56Cheyz 3 years ago +1

    To be honest, I expected the Performer to be the ImageNet moment for transformers, but it seems there is still a long way to go, and random Fourier features are not the best way to do it. Somewhat sad, because the Performer's idea looked so cool and well grounded :(

    • @redjammie8342
      @redjammie8342 3 years ago +3

      Big leaps come through simple ideas like ReLU, convolution, dropout, residual connections, self-attention... The moment an idea becomes too convoluted, it is less likely to be game-changing.

    • @charlesfoster6326
      @charlesfoster6326 3 years ago +1

      What are you waiting for? If anything, the transformer revolution seems like it's come with even more force and speed than ImageNet.

  • @ahmadmoussa3771
    @ahmadmoussa3771 3 years ago +3

    *The NICEtrömer*

  • @weizhu2230
    @weizhu2230 3 years ago

    OK, I vote this work down; I think the "asymmetric non-local neural network for semantic segmentation" would be a better one.

  • @lennartvandergoten6592
    @lennartvandergoten6592 3 years ago

    Greetings to my old ETH buddy Yannic, and give Jonas my best regards :-)

  • @pratik245
    @pratik245 2 years ago

    Have you ever heard of a Michelle Srivastav? More probably you would hear of a Peter Chakraborty. If you can tell me the reason, you would know a lot about caste- and region-based targeting in India.

    • @pratik245
      @pratik245 2 years ago

      So, nobody hates India when they are born, but as you keep growing you see these divisions between people, majoritarianism, government repression, targeting of the intellectual class, poverty, corruption, and then you start seeing trends in these concepts, all in the name of highly preached American democracy and capitalism... But surely everything is a joke, even misery.. Right, guys?

  • @yaaank6725
    @yaaank6725 3 years ago

    In the last Twitter chart, it's quite surprising that the Performer has the worst performance among the efficient transformers. Is this also verified on other tasks?

    • @yaaank6725
      @yaaank6725 3 years ago

      Or other people maybe..

    • @xiongyunyang9643
      @xiongyunyang9643 3 years ago

      We have released the scores on the individual LRA tasks. It will be interesting to see how the Performer works on other tasks beyond LRA.

  • @Ronnypetson
    @Ronnypetson 3 years ago

    Noice

  • @CandidDate
    @CandidDate 3 years ago +2

    I'd bet a million dollars that AGI, when discovered, uses frequencies of waves rather than any matrices.

  • @kimchi_taco
    @kimchi_taco 3 years ago

    Mathematically ugly, but somehow it works well. I don't feel good that both the Nyströmformer and the Performer rely on random sampling.

    • @xiongyunyang9643
      @xiongyunyang9643 3 years ago +1

      No, Nyströmformer does not rely on random sampling.