Do we even need Attention? FNets completely drop the Attention mechanism in favor of a simple Fourier transform. They perform almost as well as Transformers, while drastically reducing parameter count, as well as compute and memory requirements. This highlights that a good token mixing heuristic could be as valuable as a learned attention matrix.
OUTLINE:
0:00 - Intro & Overview
0:45 - Giving up on Attention
5:00 - FNet Architecture
9:00 - Going deeper into the Fourier Transform
11:20 - The Importance of Mixing
22:20 - Experimental Results
33:00 - Conclusions & Comments
Paper: arxiv.org/abs/2105.03824
ADDENDUM:
Of course, I completely forgot to discuss the connection between Fourier transforms and convolutions, and that this might be interpreted as a convolution with a very large kernel.
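For anyone curious, here is a minimal numpy sketch of that connection (the convolution theorem: pointwise multiplication in the Fourier domain equals circular convolution in the signal domain):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)   # "signal" (e.g. a token sequence)
k = rng.standard_normal(8)   # "kernel"

# Circular convolution computed directly from its definition.
direct = np.array([sum(x[m] * k[(n - m) % 8] for m in range(8)) for n in range(8)])

# The same thing via the FFT: transform, multiply pointwise, transform back.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

assert np.allclose(direct, via_fft)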
4:30 - Which papers did you have in mind here?
@@aspergale9836 he mentioned some of them later. AFAIK it's probably: "MLP-Mixer", "Pay Attention to MLPs" and "ResMLP ..." papers
KISS always wins.
Are they saying that position encoding was useless? I am not seeing that part in the paper. Can someone explain what I missed?
Hello Yannic, what do you think about mixing strategies like a Fourier transform for one layer and attention for the next layer, and so on?
@17:20 You need the imaginary part to do the reverse, it contains the phase information!
Yeah, I don't know why the authors talk about the reverse when they drop the imaginary part. That fact makes me wonder just how relevant the Fourier transform's properties could possibly be to these results. I feel like the main reason to use it might be that we have readily available fast implementations.
"I know I'm a bit late with this one" - paper hasn't even been up for two weeks
Thank you Yannic. I know what I'm watching for Friday evening now, no more scrolling! :D
I was getting ready to read the paper, but then I said, "oh, I'll just wait for Yannic to explain it." Thank you!
In response to the comment "I'm even open to the idea that the Fourier transform might even be the optimal mixing technique." I think the actual optimum is something that could be trained, or at least investigated empirically. Other functions, such as Legendre polynomials or Bessel functions, are 'a' Fourier basis for different differential operators/measures. It's easy to find the functional bases using Sturm-Liouville theory. It is possible that you could optimize the basis by allowing the functions that define the ODE to become parameters to the model.
To be clear, the advantage of this method is that once you have your basis, you can bake it into the architecture. The method is trained, but the parameters you are training don't ship with the final model.
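As a toy illustration of swapping in a different fixed basis (my own sketch, not something from the paper), one could replace the DFT matrix with, say, a Legendre-polynomial mixing matrix:

import numpy as np

seq_len = 16
# Sample Legendre polynomials P_0 .. P_{seq_len-1} on a grid over [-1, 1].
grid = np.linspace(-1.0, 1.0, seq_len)
basis = np.polynomial.legendre.legvander(grid, seq_len - 1)   # (seq_len, seq_len)

# Fixed token mixing: each output position is a basis-weighted sum of all tokens.
tokens = np.random.randn(seq_len, 32)   # (sequence, hidden)
mixed = basis.T @ tokens

The basis matrix is fixed at inference time, exactly in the spirit of "the parameters you train don't ship with the final model".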
We need more weird math:)
Wow, I'm dumb.
Your videos are so valuable. Seriously, I've learned way more watching your explanations than listening to my profs at college. Thank you Yannic!
I'm a guy with DSP roots, and this paper causes a lot of cringe. There was an elephant in the room (the convolution theorem) and it was totally missed. The FFT gives you circular convolution, which makes little sense in an NLP context, so a natural way to improve the results would be to try padding the values. Also, cuDNN already performs FFT-based convolution for large kernels, breaking news. But that's not actually the interesting part. The only interesting part is taking the real part of the FFT output. That looks like quite an exotic type of non-linearity: resetting the phase. I wonder if it's only additional noise, or really adds something new. Instead of this, there are soulless words about "mixing tokens". I hope LeCun will kick their ass not only on Twitter, because these "convolutions in a moustache" things are starting to be frightening.
so true lol
Although I must say that having the kernel weights analytically defined allows us to "see" the whole sequence in just one layer. This is, I believe, different from how CNN architectures do it, i.e. they use small kernels but very deep architectures to get a large receptive field.
Doesn't feel like LeCun would be too opposed.
This seems to be the style of ML papers that's generally accepted nowadays. More rigorous works are harder to read and apply, unless they come with a ready-to-use, open source framework. So the easier incremental works end up taking the spotlight, more often than not.
I didn't have that many reservations about it when reading it, but one especially stood out when you mentioned it: "there are soulless words like 'mixing tokens'". That ties back to my "unsubstantiated" claim on lack of rigor in most well-publicized papers recently.
I agree with you my comrade
Same cringe here. I wonder if they first tried abs(), then Re(), and settled for whichever gave the best number. Plus, this intuition of multi-layered FTs inverting each other in the sense of "duality" sounds plain wrong to me.
I think that we should investigate the application of a fractional Fourier transform instead of a regular one.
The fractional FT is a generalization of the FT and is parametrized by an angle alpha.
If alpha=0, the function doesn't change (skip connection).
If alpha=pi/2, it performs the regular FT.
If alpha=-pi/2, it performs the inverse FT.
If alpha=pi, it gives a mirror reflection of the function.
In consecutive transforms, the angles of the individual transforms add: pi/2 + (-pi/2) = 0 corresponds to an FT followed by an inverse FT, which is the identity transform.
So we could use linear operations in different domains, parametrized by alpha. It could be organized as, say, a 12-channel tensor with alpha ranging from 2*pi*0/12 to 2*pi*11/12,
then we normalize and apply a fully connected layer to all these 12 channels and get a skip connection, convolution, order reversal and 9 other promising operations,
or we could just use a linear operation and then collapse these 12 channels into one by element-wise addition (see the sketch below).
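Here is a rough sketch of that bank-of-angles idea (my own toy construction; I'm using scipy's generic fractional_matrix_power as a quick stand-in, which is not the canonical discrete fractional Fourier transform):

import numpy as np
from scipy.linalg import fractional_matrix_power

seq_len = 8
# Unitary DFT matrix, i.e. the alpha = pi/2 transform.
F = np.fft.fft(np.eye(seq_len)) / np.sqrt(seq_len)

# A small bank of "angles": identity (alpha = 0), half transform, full DFT.
bank = [np.eye(seq_len), fractional_matrix_power(F, 0.5), F]

tokens = np.random.randn(seq_len, 16)            # (sequence, hidden)
channels = [M @ tokens for M in bank]            # one mixed copy per angle

# Collapse the channels by addition (a learned linear layer could go here).
mixed = sum(channels) / len(channels)

# Angles add: composing two half transforms gives back the full DFT.
assert np.allclose(bank[1] @ bank[1], F)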
If I get it right, element-wise multiplication in the fractional Fourier domain attenuates various chirps in the signal, and in the case of 2D images (tensors) you could use different transform angles for the x and y directions: for example, transform only the rows and not the columns (pi/2 along x and 0 along y).
I was intrigued by Yannic's comments about the existence of a universal attention (or information routing) matrix. Try visualising the DFT matrix, it's got a cool repeating circular pattern. The most interesting thing here is that results which are not (that much) worse can be obtained this way, and that such information transfer regularities exist in language. We can almost certainly learn a better universal matrix, but then we lose the FFT efficiency boost. Why don't we do something with neural program synthesis in this attention layer? We could learn a replacement for the FFT! Great video as always.
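If you want to look at that pattern yourself, a minimal sketch (assuming numpy and matplotlib):

import numpy as np
import matplotlib.pyplot as plt

n = 64
# Applying the DFT is just multiplying by this fixed n x n matrix.
dft_matrix = np.fft.fft(np.eye(n))

plt.imshow(dft_matrix.real, cmap="RdBu")
plt.title("Real part of the 64x64 DFT matrix")
plt.show()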
As soon as I learned about attention I thought, "Pretty sure the main benefit is the weight sharing per token, just like in convolution, combined with looking at the entire sentence per token." Turns out I was fucking right, since the linear model is about as good as BERT, and I'm sure if you added a few more layers or made it fancier it would perform better. This paper is awesome and your summary is amazingly intuitive.
There are 2 properties I think are still missing from these Attention-less alternatives:
* Variable length inputs and outputs.
* To be order agnostic.
Wow, you make reading papers entertaining 👏
It has come to a point, where my first reaction to a hyped paper is checking whether Yannic has already published a video.
How about a multi-scale approach by using wavelet transform?
I read this last weekend! Really interesting work
Why do we even need a position embedding if the DFT over the tokens already provides this? Is it because they drop the imaginary component? If so, why not keep both imaginary and real and drop the position embedding?
Good question! Maybe worth a try.
I think we would lose the information after the inverse is computed
That was my thought too, since the Imaginary component tells you where to shift the sines
The transform is a 2D version, since the axis pair is different. Yes, the FFT is essentially its own inverse (up to scaling and a reversal) when applied along the same data axis. The fact that convolution becomes multiplication by a filter also likely helps to extract data.
Thanks Yannic. Great explanation of the trade-off between accuracy and speedup. Linear transforms instead of attention mechanisms seem more practical to deploy for small-to-mid-scale datasets.
A very nitpicky error in the video: at around 10:44, when talking about Fourier and inverse Fourier transforms, you say "the inverse transform is simply if you don't do the negative sign right here (the complex exponential)".
That's not entirely correct: in the inverse DFT, x_n and X_k are interchanged, and the summation goes over k=0,...,N-1 instead. (Although this summation thingy doesn't matter much, especially when doing the DFT of a real-valued signal, it's necessary to keep the notation consistent.)
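For reference, the standard DFT pair (sign and normalization conventions vary between textbooks):

X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad
x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{+2\pi i k n / N}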
Whole sequence + fully connected + conditioned meta-parameters is all you need.
Do we expect this to work equally well with ViTs, or is this Fourier magic likely to be limited to the NLP domain? That might be an obvious next paper.
Thank you, Yannic! Amazing video.
So maybe a wavelet transform could be used instead of the Fourier transform, with the difference that the wavelet parameters can be tuned by the network.
I think the attention mechanism is much more than just information sharing between data points. The weights in attention are computed at run time and are input-dependent. That makes attention much more general.
So, how about we apply a Fourier transform and then an attention layer on top of that?
Sorry, if this has been asked a million times already, but does anyone know what pdf annotation software Yannic is using? Looks so clean!
OneNote
Why throw away half the Fourier output by only considering the real part? What would be the effect of later doubling the number of tokens/nodes by splitting the Fourier output into real and imaginary parts?
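One simple way to keep both halves without complex-valued weights (a hypothetical variant, not what the paper does) would be to concatenate them along the hidden dimension:

import numpy as np

tokens = np.random.randn(128, 64)        # (sequence, hidden)
spectrum = np.fft.fft(tokens, axis=0)    # mix along the sequence axis

# Keep real and imaginary parts as separate features instead of dropping one.
mixed = np.concatenate([spectrum.real, spectrum.imag], axis=-1)   # (128, 128)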
I love attention, but I've been saying Fourier is the key to everything since, like, forever. Do we need attention? Idk. Probably yes. But Fourier is also needed. In what context can we combine these? I have no idea.
Instead of learning to route information in quadratic complexity, can't we train a layer to output an index into a permutation table?
Partially structured data like natural language would probably have a small subset of possible routes. So it can be discretised.
@@alpers.2123 No because it's not differentiable
I haven't found a good implementation of an efficient differentiable permutation algorithm.
Permutation is a matrix multiplication with a binary matrix
@@alpers.2123 Oh and where will you get the binary matrix from?
26:30 Why is FF-only worse than random?
Also, I wonder if someone tried a similar idea before. It sounds like a very obvious thing to do.
I guess any mixing is still preferred. It's random, but always the same random afaik.
What is the "Type" embedding? Can I get more information on that?
Glad to see attention go. I never could make sense of it! 😂
The way I see it, the success of attention in transformers and similar models lies in the explicitly "bilinear" nature of the information flow between tokens, hence the O(n^2) issue. I don't see how replacing such a nonlinear interaction with a weighted sum (they might as well have used an MLP) could bring the same expressive power.
On a different note, the frequencies would mean different things for sequences of different lengths, so one would probably have to resort to STFT-like transformations, which still would not resolve the variable sequence length.
4:30 - Which papers?
Someone can correct me if I'm wrong, but applying the DFT twice and taking the real part will absolutely not give you back the real signal; however, it does something close. Taking the real part immediately throws out all of the phase information. Taking the FT twice actually returns f(-x), essentially reversing the signal (this can be seen pretty clearly from the definition of the FT and its inverse). Taking the FT 4 times, however, gives you back the signal, but I don't think this reversing really plays a role in the learning, since the signal is identically reversed each time. I think once you take only the real part, the transformation will roughly approximate this, with the phase information lost.
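A quick numpy check of the reversal property, and of how much the real part alone throws away:

import numpy as np

x = np.random.randn(16)

# The FT applied twice returns N * x(-n), i.e. the scaled, reversed signal.
twice = np.fft.fft(np.fft.fft(x))
reversed_x = np.roll(x[::-1], 1)          # index 0 stays put, the rest flips
assert np.allclose(twice / 16, reversed_x)

# Four applications recover the original signal (up to a factor of N^2).
four_times = np.fft.fft(np.fft.fft(twice))
assert np.allclose(four_times / 16**2, x)

# Keeping only the real part of a single FT discards the phase, so you
# cannot get the original signal back from it.
assert not np.allclose(np.fft.ifft(np.fft.fft(x).real), x)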
Thank you Yannic!!!
Eq. 3 is basically a 2D FFT, and it only keeps the real part. I guess it simplifies the computation, and the real part of the FFT is related to the power spectrum. In fact, the power spectrum is the Fourier transform of the autocorrelation. Self-attention is the softmax of the cross-correlation of signal pairs. Therefore, I think they are equivalent in some sense.
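For concreteness, if I read Eq. 3 right, the FNet mixing step (the real part of a 2D DFT over the sequence and hidden dimensions) is just a couple of numpy lines:

import numpy as np

x = np.random.randn(128, 64)    # (sequence, hidden)

# DFT along the hidden dim, DFT along the sequence dim, keep the real part.
mixed = np.fft.fft(np.fft.fft(x, axis=-1), axis=0).real

# Equivalent to the real part of a 2D DFT.
assert np.allclose(mixed, np.fft.fft2(x).real)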
Hi Yannic, thanks for your nice explanation. May I ask which tool and device you used to record this video? (e.g., copy and paste the PDF into OneNote and scribble on it)
Fantastic work.
Instead of doing two 1D Fourier transforms like total virgins, they should just stack the vectors and do a 2D Fourier transform
which is equivalent to just using a CONV layer with a large kernel 🤡
Isn't it equivalent but easier to compute? "A 2D Fourier transform is achieved by first transforming each row, i.e. replacing each row with its 1D Fourier transform. This first step yields an intermediary 'picture' in which the horizontal axis is frequency f and the vertical axis is space y. The second step is to apply the 1D Fourier transform individually to each vertical line of the intermediate image. This new image will be the 2D Fourier transform of the initial image."
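Yes, the 2D transform is separable, which is easy to check (assuming numpy):

import numpy as np

img = np.random.randn(32, 32)

# Row-wise 1D FFT, then column-wise 1D FFT...
rows_then_cols = np.fft.fft(np.fft.fft(img, axis=1), axis=0)

# ...is exactly the 2D FFT.
assert np.allclose(rows_then_cols, np.fft.fft2(img))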
Although this is game theory, I can see that for each token the complexity is O(1), and for each attention layer above it is O(1), but real-world concepts are not so neatly segregated. But that is the mysterious part.
That's an infinite-dimensional space.
We could, if possible, do an inverse Fourier transform of the real world; it would tell us why we are what we are. The cure for all diseases, the solution to all mysteries.
The Nx in the architecture diagram: does it work in a serial fashion or in parallel? Is it replacing multi-head attention or just increasing the number of encoder layers? Because in the implementations it works in a serial fashion. So do let me know.
did they completely drop the whole decoder part?
Wassup Yannic, you're awesome
Thanks, very cool! Is there any Hugging Face model for this?
3:47 Anything but n squared? Let's go n power n.
Next paper: Fourier transforms are all you need
I actually wonder, how long will it take to go full circle and have a "neural network training thing" that literally generates code that does a specific task?
Was waiting for this.
Excellent video!! We should note here that they are using the discrete Fourier transform… The only thing left is for someone to give an interpretation of how exactly that FT mixes the tokens vs. linear mixing… How does the time vs. frequency picture apply to tokens? What does frequency mean for discrete tokens? But it looks like not even the authors have figured that out 😂😂😂
gMLP is also interesting!
I realize yourself what it could be this O(n² × log14²) attention matrix work in superposition difference equation by VGA.
People investigated predefined basis functions in the early days of deep learning (the paper even points that out). They performed badly (even against regular dense layers). This paper does the same thing, and again the predefined basis lost to dense layers. This paper is just getting attention (pun intended) that it does not deserve, because it is from Google. I think the various works on local attention should be much better than the idea in this paper.
Did you implain for this a Turing machine algorithm?
Isn't the attention mechanism, with query, key, and value, an evolution of the Neural Turing Machine architecture? Can someone clue me in on this?
Now it's my time to shine with my ECE degree.
Great video
17:20 If you don't take the transform out to infinity, it cannot be perfectly restored.
Feels like we want locality and sparsity; why not trade the floating-point complex FFT for the Walsh-Hadamard transform?
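A tiny sketch of what that swap could look like (my own toy example, assuming scipy; the Walsh-Hadamard transform mixes tokens with only +/-1 coefficients, no complex arithmetic):

import numpy as np
from scipy.linalg import hadamard

seq_len = 128                             # must be a power of two
tokens = np.random.randn(seq_len, 64)     # (sequence, hidden)

# Fixed +/-1 mixing matrix instead of the complex-valued DFT matrix.
H = hadamard(seq_len)
mixed = (H @ tokens) / np.sqrt(seq_len)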
The wavelet transform is the most effective non-parametric mixing, I bet. Wavelets are mathematical microscopes.
'For the longest time'? Wasn't 'Attention Is All You Need' published in 2017? And wasn't FNet published in 2021? We're talking 4 years.
"The verb at the end attends to the second word."
Human language was a huge mistake.
30:45 LOL
So attention is not what you need after all.
I don't really get the hype about this paper. Their results are apparently much worse than with attention. In my opinion this is an interesting approach, but ultimately nothing but hot air.