FNet: Mixing Tokens with Fourier Transforms (Machine Learning Research Paper Explained)

  • Published 15 May 2024
  • #fnet #attention #fourier
    Do we even need Attention? FNets completely drop the Attention mechanism in favor of a simple Fourier transform. They perform almost as well as Transformers, while drastically reducing parameter count, as well as compute and memory requirements. This highlights that a good token mixing heuristic could be as valuable as a learned attention matrix.
    OUTLINE:
    0:00 - Intro & Overview
    0:45 - Giving up on Attention
    5:00 - FNet Architecture
    9:00 - Going deeper into the Fourier Transform
    11:20 - The Importance of Mixing
    22:20 - Experimental Results
    33:00 - Conclusions & Comments
    Paper: arxiv.org/abs/2105.03824
    ADDENDUM:
    Of course, I completely forgot to discuss the connection between Fourier transforms and Convolutions, and that this might be interpreted as convolutions with very large kernels.
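    As a side note on that addendum: the convolution theorem says a pointwise product of two DFTs corresponds to a circular convolution over the sequence, so a full-length DFT-based mixer can be read as a convolution whose kernel spans the whole sequence. A minimal NumPy sketch of that identity (my own illustration, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)   # a "token" signal along the sequence axis
k = rng.standard_normal(16)   # a kernel as long as the sequence

# Convolution theorem: pointwise product in frequency = circular convolution in time.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

# Direct circular convolution for comparison.
direct = np.array([sum(x[m] * k[(n - m) % len(x)] for m in range(len(x)))
                   for n in range(len(x))])

assert np.allclose(via_fft, direct)
```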
    Abstract:
    We show that Transformer encoder architectures can be massively sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear transformations, along with simple nonlinearities in feed-forward layers, are sufficient to model semantic relationships in several text classification tasks. Perhaps most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92% of the accuracy of BERT on the GLUE benchmark, but pre-trains and runs up to seven times faster on GPUs and twice as fast on TPUs. The resulting model, which we name FNet, scales very efficiently to long inputs, matching the accuracy of the most accurate "efficient" Transformers on the Long Range Arena benchmark, but training and running faster across all sequence lengths on GPUs and relatively shorter sequence lengths on TPUs. Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes: for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
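    To make the abstract's description concrete, here is a minimal sketch of the parameter-free mixing sublayer it describes, written in NumPy as my own paraphrase rather than the authors' implementation: a DFT along the hidden dimension, a DFT along the sequence dimension, and only the real part is kept.

```python
import numpy as np

def fourier_mixing(x: np.ndarray) -> np.ndarray:
    """Parameter-free token mixing as described in the abstract: a DFT over the
    hidden dimension, a DFT over the sequence dimension, keep only the real part.
    x has shape (seq_len, d_model)."""
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=-2).real  # == np.fft.fft2(x).real

# Toy "embeddings": 8 tokens with hidden size 4; output has the same shape,
# and every output position is a fixed linear mix of all input tokens.
tokens = np.random.default_rng(0).standard_normal((8, 4))
print(fourier_mixing(tokens).shape)   # (8, 4)
```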
    Authors: James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon
    Links:
    TabNine Code Completion (Referral): bit.ly/tabnine-yannick
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    BiliBili: space.bilibili.com/1824646584
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 100

  • @YannicKilcher
    @YannicKilcher  3 years ago +25


    • @aspergale9836
      @aspergale9836 3 years ago

      4:30 - Which papers did you have in mind here?

    • @emilzakirov5173
      @emilzakirov5173 3 years ago

      @@aspergale9836 He mentions some of them later. AFAIK it's probably the "MLP-Mixer", "Pay Attention to MLPs" and "ResMLP ..." papers

    • @RobertWeikel
      @RobertWeikel 3 years ago

      KISS always wins.

    • @bediosoro7786
      @bediosoro7786 3 years ago

      Are they saying that positional encoding was useless? I'm not seeing that part in the paper. Can someone explain what I missed?

    • @ibrahimaba8966
      @ibrahimaba8966 2 years ago

      Hello Yannic, what do you think about mixing strategies like a Fourier transform for one layer and attention for the next layer, and so on?

  • @samernoureddine
    @samernoureddine 3 years ago +69

    "I know I'm a bit late with this one" - paper hasn't even been up for two weeks

  • @ChaiTimeDataScience
    @ChaiTimeDataScience 3 years ago +16

    Thank you Yannic. I know what I'm watching for Friday evening now, no more scrolling! :D

  • @antraxuran9
    @antraxuran9 3 years ago +17

    @17:20 You need the imaginary part to do the reverse, it contains the phase information!

    • @Sam-pd1be
      @Sam-pd1be 3 years ago +2

      Yeah, I don't know why the authors talk about the reverse when they drop the imaginary part. That fact makes me wonder just how relevant the Fourier transform's properties could possibly be to these results. I feel like the main reason to use it might be that we have readily available fast implementations.
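      To make the phase point concrete (a small sketch of my own, not from the video): for a real signal, keeping only the real part of the DFT cannot even distinguish the signal from its time-reversal, because reversal merely conjugates the spectrum.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(8)
x_rev = np.roll(x[::-1], 1)           # time-reversal: x_rev[n] = x[(-n) mod N]

X, X_rev = np.fft.fft(x), np.fft.fft(x_rev)

# For a real signal, reversal conjugates the spectrum, so the real parts coincide ...
assert np.allclose(X_rev, np.conj(X))
assert np.allclose(X_rev.real, X.real)
# ... even though the two signals differ: the discarded imaginary part (the phase)
# carried the information that told them apart.
assert not np.allclose(x, x_rev)
```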

  • @julianke455
    @julianke455 3 years ago +2

    Your videos are so valuable. Seriously, I've learned way more watching your explanations than listening to my profs at college. Thank you Yannic!

  • @carlossegura403
    @carlossegura403 3 years ago +4

    I was getting ready to read the paper, but then I said, "oh, I'll just wait for Yannic to explain it." Thank you!

  • @galchinsky
    @galchinsky 3 years ago +43

    I'm a guy with DSP roots, and this paper causes a lot of cringe. There was an elephant in the room (the convolution theorem) and it was totally missed. The FFT gives circular convolution, which makes little sense in an NLP context, so a natural way to improve results would be to try padding the values. Also, cuDNN does perform FFT for large kernels, breaking news. But that's not the interesting part. The only interesting part is taking the "real" part of the FFT: that is quite an exotic type of nonlinearity, resetting the phase. I wonder if it's only additional noise or really adds something new. Instead of this, there are soulless words about "mixing tokens". I hope LeCun will kick their ass not only on Twitter, because these "convolutions in a mustache" things are starting to be frightening.
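    A quick sketch of the circular-convolution and padding point above (my own illustration with an arbitrary short kernel; the paper itself does no padding): an FFT-based product wraps around the sequence ends, and zero-padding to length N + K - 1 recovers ordinary linear convolution.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(16)   # sequence
k = rng.standard_normal(5)    # hypothetical short kernel, just for illustration

# FFT product of same-length transforms = circular convolution (tail wraps around).
circular = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k, n=len(x))).real

# Zero-pad both to len(x) + len(k) - 1 to get ordinary linear convolution instead.
L = len(x) + len(k) - 1
linear = np.fft.ifft(np.fft.fft(x, n=L) * np.fft.fft(k, n=L)).real

assert np.allclose(linear, np.convolve(x, k))                  # matches direct linear conv
assert not np.allclose(circular, np.convolve(x, k)[:len(x)])   # wrap-around makes them differ
```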

    • @narsimhar.chilkuri7290
      @narsimhar.chilkuri7290 3 years ago

      so true lol

    • @narsimhar.chilkuri7290
      @narsimhar.chilkuri7290 3 years ago +3

      Although I must say that having the kernel weights analytically defined allows us to "see" the whole sequence in just one layer. This is, I believe, different from how CNN architectures do it, i.e. they use small kernels but very deep architectures to get a large receptive field.

    • @aspergale9836
      @aspergale9836 3 years ago +6

      Doesn't feel like LeCun would be too opposed.
      This seems to be the style of ML papers that's generally accepted nowadays. More rigorous works are harder to read and apply, unless they come with a ready-to-use, open-source framework. So the easier incremental works end up taking the spotlight, more often than not.
      I didn't have that many reservations about it when reading it, but one especially stood out when you mentioned it: "there are soulless words like 'mixing tokens'". That ties back to my "unsubstantiated" claim about the lack of rigor in most well-publicized papers recently.

    • @aniruddhadatta925
      @aniruddhadatta925 2 years ago

      I agree with you my comrade

    • @alexeifigueroa
      @alexeifigueroa 2 years ago +3

      Same cringe here. I wonder if they first tried abs(), then Re(), and settled for the best number. Plus, this intuition of multilayered FTs inverting each other in the sense of "duality" sounds plain wrong to me.

  • @andrewgrebenisan6141
    @andrewgrebenisan6141 3 years ago +1

    I read this last weekend! Really interesting work

  • @akimtsvigun8783
    @akimtsvigun8783 2 years ago

    Thank you, Yannic! Amazing video.

  • @sangramgaikwad7663
    @sangramgaikwad7663 3 years ago

    Thanks Yannic. Great explanation of the accuracy/speedup tradeoff. Linear transforms instead of attention mechanisms seem more practical to deploy for small-to-mid-scale datasets.

  • @kirillnovik8661
    @kirillnovik8661 3 years ago +7

    Wow, you make reading papers entertaining 👏

  • @machinelearningdojowithtim2898
    @machinelearningdojowithtim2898 3 years ago +4

    I was intrigued by Yannic's comments about the existence of a universal attention (or information routing) matrix. Try visualising the DFT matrix: it's got a cool repeating circular pattern. The most interesting thing here is that results that are not (that much) worse can be obtained this way, and that such information-transfer regularities exist in language. We can almost certainly learn a better universal matrix, but then we lose the FFT efficiency boost. Why don't we do something with neural program synthesis in this attention layer? We could learn a replacement for the FFT! Great video as always.
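    For anyone who wants to look at that matrix (a quick sketch of my own): the fixed mixing matrix applied along the sequence is just the Vandermonde matrix of roots of unity, and multiplying by it reproduces np.fft.fft.

```python
import numpy as np

N = 8
n = np.arange(N)
# DFT matrix: W[j, k] = exp(-2*pi*i*j*k/N). Plotting W.real / W.imag shows the
# repeating circular (cosine/sine) pattern mentioned above.
W = np.exp(-2j * np.pi * np.outer(n, n) / N)

x = np.random.default_rng(3).standard_normal(N)
assert np.allclose(W @ x, np.fft.fft(x))   # same mixing, O(N^2) matmul vs O(N log N) FFT
```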

  • @CristianGarcia
    @CristianGarcia 3 years ago +4

    There are 2 properties I think are still missing from these Attention-less alternatives:
    * Variable length inputs and outputs.
    * To be order agnostic.

  • @ce6535
    @ce6535 3 years ago +13

    In response to the comment "I'm even open to the idea that the Fourier transform might even be the optimal mixing technique." I think the actual optimum is something that could be trained, or at least investigated empirically. Other functions, such as Legendre polynomials or Bessel functions, are 'a' Fourier basis for different differential operators/measures. It's easy to find the functional bases using Sturm-Liouville theory. It is possible that you could optimize the basis by allowing the functions that define the ODE to become parameters to the model.

    • @ce6535
      @ce6535 3 years ago +3

      To be clear, the advantage of this method is that once you have your basis, you can bake it into the architecture. The method is trained, but the parameters you are training don't ship with the final model.

    • @alpers.2123
      @alpers.2123 3 years ago +3

      We need more weird math :)

    • @st0a
      @st0a 2 months ago

      Wow, I'm dumb.

  • @user-wg6ch7wx2x
    @user-wg6ch7wx2x 2 months ago

    Fantastic work.

  • @user-qu2oz2ut2h
    @user-qu2oz2ut2h 3 years ago +7

    I think we should investigate the application of a fractional Fourier transform instead of a regular one.
    The fractional FT is a generalization of the FT and is parametrized by an angle alpha:
    If alpha=0, the function doesn't change (skip connection).
    If alpha=pi/2, it performs a regular FT.
    If alpha=-pi/2, it performs an inverse FT.
    If alpha=pi, it gives a mirror reflection of the function.
    In consecutive transforms, the angles of the individual transforms add: pi/2 + (-pi/2) = 0 corresponds to an FT followed by an inverse FT, which is the identity transform.
    So we could use linear operations in different domains, parametrized by alpha. It could be organized as, say, a 12-channel tensor with alpha ranging from 2*pi*0/12 to 2*pi*11/12.
    Then we normalize and apply a fully connected layer to all these 12 channels and get a skip connection, a convolution, an order change, and 9 other promising operations.
    Or we could just use a linear operation and then collapse these 12 channels into one by element-wise addition.

    • @user-qu2oz2ut2h
      @user-qu2oz2ut2h 3 years ago

      If I get it right, element-wise multiplication in the fractional Fourier domain attenuates various chirps in the signal, and in the case of 2D images (tensors) you could use different transform angles for the x and y directions: for example, transform only the rows and not the columns (pi/2 along x and 0 along y).

  • @ashoknp
    @ashoknp 3 years ago +2

    Thank you Yannic!!!

  • @rabago85
    @rabago85 3 years ago +1

    Was waiting for this.

  • @SimonJackson13
    @SimonJackson13 2 years ago

    The transform is a 2D version, as the axis pair are different. Yes, the FFT is a self-inverse if the scaling is done right when the same data axis is used. The fact that convolution becomes a multiplicative filter also likely helps to extract data.

  • @SDKIM0211
    @SDKIM0211 2 years ago

    Hi Yannic, thanks for your nice explanation. May I ask which tool and device you used to record this video? (e.g., copy and paste the PDF into OneNote and scribble on it)

  • @lesumsi
    @lesumsi 3 years ago +1

    It has come to the point where my first reaction to a hyped paper is checking whether Yannic has already published a video.

  • @MrMIB983
    @MrMIB983 3 years ago +1

    Great video

  • @sampruden6684
    @sampruden6684 3 years ago +1

    Do we expect this to work equally well with ViTs, or is this Fourier magic likely to be limited to the NLP domain? That might be an obvious next paper.

  • @LukaszWiklendt
    @LukaszWiklendt 3 years ago +16

    Why do we even need a position embedding if the DFT over the tokens already provides this? Is it because they drop the imaginary component? If so, why not keep both imaginary and real and drop the position embedding?

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      Good question! Maybe worth a try.

    • @machinelearningdojowithtim2898
      @machinelearningdojowithtim2898 3 years ago +1

      I think we would lose the information after the inverse is computed

    • @EyedMoon
      @EyedMoon 3 years ago

      That was my thought too, since the imaginary component tells you where to shift the sines.

  • @sharp7j
    @sharp7j 1 year ago

    As soon as I learned about attention I thought, "Pretty sure the main benefit is the weight sharing per token, just like in convolution, combined with looking at the entire sentence per token." Turns out I was fucking right, since the linear model is about as good as BERT. I'm sure if you added a few more layers / made it fancier it would perform better. This paper is awesome and your summary is amazingly intuitive.

  • @freasy123
    @freasy123 3 years ago +2

    How about a multi-scale approach using a wavelet transform?

  • @JTMoustache
    @JTMoustache 3 years ago

    Whole sequence + fully connected + conditioned meta-parameters is all you need.

  • @sarvagyagupta1744
    @sarvagyagupta1744 2 years ago

    The Nx value in the architecture diagram: does it work in a serial fashion or in parallel? Is it replacing multi-head attention or just increasing the number of encoder layers? Because in the implementations it works in a serial fashion. So do let me know.

  • @sieyk
    @sieyk 3 years ago +1

    Glad to see attention go. I never could make sense of it! 😂

  • @MsFearco
    @MsFearco 3 years ago +1

    I love attention, but I've been saying Fourier is the key to everything since, like, forever. Do we need attention? Idk, probably yes. But Fourier is also needed. In what context can we combine these? I have no idea.

  • @sarvagyagupta1744
    @sarvagyagupta1744 3 years ago

    What is the "Type" embedding? Can I get more information on that?

  • @abhilashnandy
    @abhilashnandy 3 years ago

    Thanks, very cool! Is there any huggingface model on this?

  • @TheShadyStudios
    @TheShadyStudios 3 years ago +10

    Wassup Yannic, you're awesome

  • @Guesstahw
    @Guesstahw 3 years ago

    The way I see it, the success of attention in transformers and similar models lies in the explicitly "bilinear" nature of information flow between tokens, hence the O(n^2) issue. I don't see how replacing such a nonlinear interaction with a weighted sum (they might as well have used an MLP) could bring the same expressive power.
    On a different note, the frequencies for sequences of different lengths would mean different things, so one would probably have to resort to STFT-like transformations, which would not resolve the variable sequence length issue.

  • @chihochan7345
    @chihochan7345 3 years ago

    Eq. 3 is basically a 2D FFT and it only keeps the real part. I guess it simplifies the computation, and the real part of the FFT is related to the power spectrum. In fact, the power spectrum of the FFT is the autocorrelation. Self-attention is the softmax of the cross-correlation of signal pairs. Therefore, I think they are equivalent in some sense.
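    The power-spectrum/autocorrelation link mentioned here is the discrete Wiener-Khinchin relation; a small sketch of my own verifying it (with the caveat that FNet keeps Re(F(x)) rather than the power spectrum):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(16)
N = len(x)

# Circular autocorrelation: r[m] = sum_n x[n] * x[(n + m) % N]
r = np.array([sum(x[n] * x[(n + m) % N] for n in range(N)) for m in range(N)])

# Wiener-Khinchin: the inverse FFT of the power spectrum is the circular autocorrelation.
power_spectrum = np.abs(np.fft.fft(x)) ** 2
assert np.allclose(np.fft.ifft(power_spectrum).real, r)
```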

  • @CristianGarcia
    @CristianGarcia 3 years ago +2

    gMLP is also interesting!

  • @aspergale9836
    @aspergale9836 3 years ago

    4:30 - Which papers?

  • @mbunooo
    @mbunooo 3 years ago +2

    Excellent video!! We should note here that they are using the discrete Fourier transform… It only remains for someone to give an interpretation of how exactly the FT mixes the tokens vs. linear mixing… How does time vs. frequency apply to tokens? What does frequency mean for discrete tokens? But it looks like not even the authors have figured that out 😂😂😂

  • @ericadar
    @ericadar 3 years ago +1

    Why throw away half the Fourier output by only considering the real part? What would be the effect of doubling the number of tokens/nodes of the Fourier output layer by splitting it into real and imaginary parts?

  • @AIology2022
    @AIology2022 3 years ago +1

    So maybe wavelets could be used instead of Fourier, with the difference that the wavelet parameters can be tuned by the network.

  • @Mrbits01
    @Mrbits01 2 years ago +1

    A very nitpicky error in the video: at around 10:44, when talking about Fourier and inverse Fourier transforms, you say the "inverse transform is simply if you don't do the negative sign right here (in the complex exponential)".
    That's not entirely correct: in the inverse DFT, x_n and X_k are interchanged, and the summation goes over k = 0, ..., N-1 instead. (Although this summation thingy doesn't matter much, especially when doing the DFT of a real-valued signal, it's necessary to keep the notation uniform.)
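    A small sketch of my own making the normalization part of this correction explicit: applying the sign-flipped transform to the spectrum returns N times the original signal, while np.fft.ifft includes the 1/N factor of the proper inverse DFT.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(8)
X = np.fft.fft(x)

# Flipping the sign of the exponent (done here via conjugation) applied to the
# spectrum gives N times the original signal ...
sign_flipped = np.conj(np.fft.fft(np.conj(X)))
assert np.allclose(sign_flipped, len(x) * x)

# ... while the actual inverse DFT includes the 1/N factor.
assert np.allclose(np.fft.ifft(X), x)
```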

  • @tinski
    @tinski 3 years ago +2

    Sorry if this has been asked a million times already, but does anyone know what PDF annotation software Yannic is using? Looks so clean!

  • @G12GilbertProduction
    @G12GilbertProduction 3 years ago

    I realize yourself what it could be this O(n² × log14²) attention matrix work in superposition difference equation by VGA.

  • @christianc8265
    @christianc8265 1 year ago

    Did they completely drop the whole decoder part?

  • @aniruddhadatta925
    @aniruddhadatta925 2 years ago

    Now it's my time to shine with my ECE degree

  • @alpers.2123
    @alpers.2123 3 years ago +2

    Instead of learning to route information in quadratic complexity, can't we train a layer to output an index into a permutation table?

    • @alpers.2123
      @alpers.2123 3 years ago +1

      Partially structured data like natural languages would probably have a small subset of possible routes. So it could be discretised.

    • @dandan-gf4jk
      @dandan-gf4jk 3 years ago +1

      @@alpers.2123 No because it's not differentiable

    • @herp_derpingson
      @herp_derpingson 3 years ago

      I haven't found a good implementation of an efficient differentiable permutation algorithm.

    • @alpers.2123
      @alpers.2123 3 years ago

      Permutation is a matrix multiplication with a binary matrix

    • @dandan-gf4jk
      @dandan-gf4jk 3 years ago +1

      @@alpers.2123 Oh and where will you get the binary matrix from?

  • @Coolguydudeness1234
    @Coolguydudeness1234 3 years ago +7

    Instead of doing two 1D Fourier transforms like total virgins, they should just stack the vectors and do a 2D Fourier transform

    • @Coolguydudeness1234
      @Coolguydudeness1234 3 years ago +11

      which is equivalent to just using a CONV layer with a large kernel 🤡

    • @bertchristiaens6355
      @bertchristiaens6355 2 years ago +1

      isn't it equivalent but easier to compute? "a 2D Fourier transform is achieved by first transforming each row, i.e. replacing each row with its 1D Fourier transform. This first step yields an intermediary 'picture' in which the horizontal axis is frequency f and the vertical axis is space y. The second step is to apply 1D Fourier transform individually to the vertical line of the intermediate image. This new image will be the 2D Fourier transformation of the initial image."
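      That separability is easy to check numerically (a quick sketch of my own): a 1D FFT over each row followed by a 1D FFT over each column gives exactly the same result as a single 2D FFT over the stacked token matrix.

```python
import numpy as np

x = np.random.default_rng(6).standard_normal((8, 4))   # (tokens, hidden)

# 1D FFT over each row (hidden dim), then 1D FFT over each column (sequence dim) ...
row_then_col = np.fft.fft(np.fft.fft(x, axis=1), axis=0)

# ... is identical to a single 2D FFT over the stacked matrix.
assert np.allclose(row_then_col, np.fft.fft2(x))
```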

  • @Skythedragon
    @Skythedragon 3 years ago +1

    Next paper: Fourier transforms are all you need
    I actually wonder how long it will take to go full circle and have a "neural network training thing" that literally generates code that does a specific task?

  • @ankitaharwal5886
    @ankitaharwal5886 3 years ago +3

    I think the attention mechanism is much more than just information sharing between data points. Like, the weights in attention are computed at run time and are variable. That makes attention much more general.
    So how about we apply a Fourier transform and then an attention layer on top of that?

  • @pratik245
    @pratik245 2 years ago

    Although this is game theory, I can see that for each token the complexity is O(1), and for each attention layer above it is O(1), but real-world concepts are not so neatly segregated. But that is the mysterious part.

    • @pratik245
      @pratik245 2 years ago

      That's infinite-dimensional space

    • @pratik245
      @pratik245 2 years ago

      If we could do an inverse Fourier transform of the real world, it would tell us why we are what we are. The cure for all diseases, the solution to all mysteries.

  • @kushasareen3259
    @kushasareen3259 3 years ago +2

    Someone can correct me if I'm wrong, but applying the DFT twice and taking the real part will absolutely not give you back the real signal; however, it does something close. Taking the real part immediately throws out all of the phase information. Taking the FT twice actually returns f(-x), essentially reversing the signal (this can be seen pretty clearly from the definition of the FT and its inverse). Taking the FT 4 times, however, will give you back the signal, but I don't think this reversing really plays a role in the learning, since the signal is identically reversed each time. I think once you take the real part only, the transformation will roughly approximate this, with the phase information lost.
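    A quick numerical check of the reversal claim (my own sketch): applying the forward DFT twice to a length-N signal returns N times the index-reversed signal, and four applications return N^2 times the original.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(8)
N = len(x)

twice = np.fft.fft(np.fft.fft(x))
reversed_x = np.roll(x[::-1], 1)             # x[(-n) mod N]
assert np.allclose(twice, N * reversed_x)    # two forward DFTs reverse (and scale by N)

four_times = np.fft.fft(np.fft.fft(twice))
assert np.allclose(four_times, N**2 * x)     # four forward DFTs recover the original
```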

  • @herp_derpingson
    @herp_derpingson 3 years ago +1

    26:30 Why is FF-only worse than random?
    Also, I wonder if someone tried a similar idea before. It sounds like a very obvious thing to do.

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      I guess any mixing is still preferred. It's random, but always the same random afaik.

  • @NeoShameMan
    @NeoShameMan 3 years ago +2

    Isn't the attention mechanism with query, key, and value the evolution of the Neural Turing Machine architecture? Can someone clue me in on this?

  • @leecarraher
    @leecarraher 1 month ago

    Feels like we want locality and sparsity; why not trade the floating-point complex FFT for a Walsh-Hadamard transform?

  • @mathematicalninja2756
    @mathematicalninja2756 3 months ago

    The wavelet transform is the most effective non-parametric mixing, I bet. Wavelets are mathematical microscopes.

  • @Ronnypetson
    @Ronnypetson 3 years ago

    30:45 LOL

  • @natthaphonhongcharoen
    @natthaphonhongcharoen 3 years ago +1

    3:47 Anything but n squared? Let's go n to the power n.

  • @ecitslos
    @ecitslos 3 years ago +2

    People investigated predefined basis functions in the early days of deep learning (the paper even points that out). They performed badly (even against regular dense layers). This paper does the same thing, and again the predefined basis lost to dense layers. This paper is just getting attention (pun intended) that it does not deserve, due to it being from Google. I think the various works on local attention should be much better than the idea in this paper.

  • @user-vx1fj9uf6w
    @user-vx1fj9uf6w 1 year ago

    17:20 If you don't take the transform out to infinity, it cannot be perfectly restored.

  • @adamweishaupt7074
    @adamweishaupt7074 2 years ago

    "The verb at the end attends to the second word."
    Human language was a huge mistake.

  • @justinwhite2725
    @justinwhite2725 2 years ago

    "For the longest time." Wasn't "Attention Is All You Need" published in 2015? And wasn't FNet published in 2019? We're talking 4 years.

  • @XOPOIIIO
    @XOPOIIIO 3 years ago +1

    So attention is not what you need after all.

  • @pascalhuber157
    @pascalhuber157 3 years ago

    I don't really get the hype about this paper. Their results are apparently much worse than with attention. In my opinion this is an interesting approach, but ultimately nothing but hot air.