The one thing that irks me is that kernels keep being referred to as "mapping to infinite-dimensional space" in these vids. While true, it's kind of unhelpful. It's probably more helpful to think of it as mapping to a (potentially infinite) space where the (potentially infinite) bases aren't restricted to be vectors. The whole trick about the Hilbert space is that it's just a space with a well-defined inner product, and that inner product can be between functions, etc. You could have a function mapping to a space where the bases are the sine and cosine functions, for example, and use that space to take the inner product between different sine/cosine wave combinations. Typically the kind of space you want to map to will express something about the problem you're trying to solve.
I bring this up because I've seen kernels mentioned twice now, and both times the explanations are kind of lacking. Not Yannic's fault - I was lucky enough to spend a bunch of time around kernel gurus who cleared up a bunch of misconceptions for me... before that, I was making similar mistakes. Content otherwise very solid, as usual!
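A concrete toy example of that "a kernel is just an inner product in some feature space" view (my own illustration, not from the video): the degree-2 polynomial kernel equals an ordinary dot product after mapping into the space of degree-2 monomials.

import numpy as np

def phi(x):
    # explicit feature map for a 2-d input: (x1^2, x2^2, sqrt(2)*x1*x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(np.dot(x, y)**2)           # kernel evaluated directly: (x . y)^2
print(np.dot(phi(x), phi(y)))    # same number via an ordinary inner product of features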
Thank you :)
Hi, can you recommend some learning materials about "kernels"
Mit’s 9.520 with Lorenzo Rosasco ;)
This is a note for the mathematicians:
When people say "vector" in computer science, they mean "an array with finitely many entries". In other words, they mean a function from {1, 2, ..., n} to some set X. This is a severe generalization/weakening of the typical vectors that we mean, where a vector lives in a vector space with all kinds of symmetries (the group GL(n, K)).
So when Alex Stenlake says "they aren't restricted to be vectors", he means "they aren't restricted to be a finite-entry array". They are, in fact, *exactly* what we mean when we say "vectors in a countable, separable vector space" - in other words, any vector space we use in functional analysis. So, as we see, this is a severe strengthening of a severe weakening (first, the computer scientist weakens the mathematician's idea of a finite-dimensional vector space, then they strengthen it to return to the mathematician's idea), which is why Alex had to point it out.
This probably comes from the idea that, practically, these bases are computed via an infinite polynomial series that can only be approximated in practice using techniques like random Fourier sampling, no?
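For reference, here is a minimal random Fourier features sketch (my own toy code, hypothetical sizes, in the style of Rahimi & Recht) of that idea: a finite random feature map whose ordinary dot product approximates the RBF kernel's infinite-dimensional inner product.

import numpy as np

rng = np.random.default_rng(0)
d, n_features, sigma = 5, 10000, 1.0

W = rng.normal(0.0, 1.0 / sigma, size=(n_features, d))   # random frequencies
b = rng.uniform(0.0, 2 * np.pi, size=n_features)         # random phases

def z(x):
    # finite feature map whose inner product approximates the RBF kernel
    return np.sqrt(2.0 / n_features) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - y)**2) / (2 * sigma**2))
print(exact, z(x) @ z(y))   # the two numbers should be close (error shrinks with n_features)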
I love that you go over and explain the topics each time you introduce them (e.g. transformers), even though you cover them in other videos. It's so useful to me to hear different explanations and really clarify what's going on inside. Really helps me get a better understanding.
Haha I was writing almost the same
Me too, I third this ^
I like that you explain some stuff again in different videos. It makes sure everyone is on track, and if someone is annoyed by the redundancy they can just fast forward.
To me it was very helpful to get multiple explanations about transformers
The idea of explaining the papers is awesome! Thanks!
It's a big part of grad school.
At first I thought your videos were a bit lengthy, but after watching this I feel I wouldn't be able to understand the content without sacrificing that amount of time. And what you point out about the paper similarly tells me that the cost of the softmax is the price you pay for what that function can do on an unknown task. Thanks, it was good watching.
also, I suggest just changing the playback speed to 2x if it's too long :)
Thanks for making a difficult paper accessible. I especially like that there are red threads in your selection of papers, which create almost something like a story line, for example the one starting with the paper on the transformer architecture and the follow-up papers trying to remedy some of its shortcomings while keeping its achievements.
Thanks Yannic, soon enough you will be one of the requirements for ML jobs; Published at Neurips, ICML, and Yannic YT Channel.
Haha :D problem is I'm very corrupt so idk how that's going to work out ;)
The natural question I have after this paper: why not both? Test a model with X% of compute allocated towards 'real' attention and (100-X)% towards linear attention, and see what's best.
You can always ensemble any two models, but unless they provide independent contributions, there isn't really any point. Plus in this case their proposed model's main advantage is reduced compute/memory, so it would seem entirely counterproductive to ensemble with plain attention.
@@GeekProdigyGuy The idea is that traditional attention is high quality, but struggles at reaching a large context. Clearly that high quality attention is necessary on short scales, but only a small reduction in the traditional attention would be needed to allow for a linear attention that reached very large context sizes. Assuming the far context needs less character-level detail (same thesis as Compressive Transformers), the lossier linear attention should suffice there.
Veedrac - I actually explore that here github.com/lucidrains/linear-attention-transformer
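For anyone curious what the "why not both" idea could look like, here's a rough single-layer sketch (my own toy code, not from any of these repos; shapes, the non-causal case, and the elu+1 feature map are all assumptions) that gives some heads softmax attention and the rest linear attention:

import torch
import torch.nn.functional as F

def softmax_head(q, k, v):                       # q, k: (L, D); v: (L, M)
    return F.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1) @ v

def linear_head(q, k, v, eps=1e-6):
    q, k = F.elu(q) + 1, F.elu(k) + 1            # the paper's elu+1 feature map
    kv = k.t() @ v                               # (D, M), cost independent of L^2
    z = q @ k.sum(dim=0) + eps                   # (L,) normalizer
    return (q @ kv) / z.unsqueeze(-1)

L, D = 128, 64
q, k, v = torch.rand(L, D), torch.rand(L, D), torch.rand(L, D)
mixed = torch.cat([softmax_head(q, k, v), linear_head(q, k, v)], dim=-1)   # (L, 2*D)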
These are amazing, love the depth and really appreciate the amount of content!
I know you’ve said you’re not planning on it, but I’d be super happy to support on Patreon if that helps support you in making so much content!
Great explanation. I really like the way you break down publications to the most essential parts. Never was "reading" articles that comfortable.
Re: causal masking implementation @38:00, input[i] != target[i] because the target is shifted by one before being passed into the model.
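In other words, for next-token prediction (toy illustration):

tokens = ["<s>", "the", "cat", "sat", "</s>"]
inputs, targets = tokens[:-1], tokens[1:]   # target[i] is the token that follows input[i]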
I'm very late, but well done. Good call out on the RNN caveat! I missed that. And one correction (or I misunderstand something): it seems that we're not actually going to a higher dimensional space through phi, but rather applying a point-wise non-linearity and maintaining the same dimensionality. Any comments on that choice? That's seemingly contrary to the point of the kernel trick as I understood it. Why not do the typical DL thing and simply make phi a learnable function?
That is quite an impressive speed increase, and the ability to perform RNN like inference is incredibly useful.
I wish they investigated other kernel functions (or maybe they did and I missed it), but I'd be interested in seeing how the choice of kernel impacts the results.
Am I understanding the RNN autoregressive-transformer equivalence correctly? Is it the case that translation transformer models are not RNNs?
only transformers that are using the specific causal masking are RNNs
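Roughly, the causally masked case can be run as a recurrence over the sequence, which is the sense in which it's an RNN. A minimal one-head sketch (my own code, shapes assumed, using the paper's elu+1 feature map): you only carry a matrix state S and a normalizer state z.

import torch
import torch.nn.functional as F

def phi(x):
    return F.elu(x) + 1                        # the paper's feature map

D, M, T = 8, 8, 16                             # key dim, value dim, sequence length
S = torch.zeros(D, M)                          # running sum of phi(k_j) v_j^T for j <= t
z = torch.zeros(D)                             # running sum of phi(k_j) for j <= t
outputs = []
for t in range(T):
    q, k, v = torch.rand(D), torch.rand(D), torch.rand(M)   # stand-ins for the projections at step t
    S = S + torch.outer(phi(k), v)
    z = z + phi(k)
    outputs.append(phi(q) @ S / (phi(q) @ z + 1e-6))         # attention output for position t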
I appreciate your comment about different neural network architectures being different routings of information.
Cool charts at 26:20. Interesting that Reformer is slower than the regular transformer at shorter sequences. You wouldn't know that just from the asymptotic time complexity. Looks like the paper conveniently omits that :)
Linear attention seems super fast tho.
Also it's probably obvious, but two times higher slope on a log chart corresponds to a square of a function on a linear chart.
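Spelling that out (my own note): a power law becomes a straight line on log-log axes whose slope is the exponent,
\[ y = c\,x^k \;\Longrightarrow\; \log y = \log c + k\,\log x, \]
so the quadratic-time curve (k = 2) has exactly twice the slope of the linear-time one (k = 1).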
Yep, I'm just a moron when it comes to non-linear charts :D
Your videos are pretty long, but they are really not long enough :). Each time a video ends, I get the feeling that I need an even deeper explanation, e.g. a review of their code.
It seems like the closer we are to approximating the softmax operator with a finite-dimensional feature map, the better the performance of this linear transformer would be... which is what they suggested in the conclusion. Also, they don't use positional encodings.
Yannic: just let me know what you think.
Me: I think you're awesome
This is super cool and I think we will see the performance improve down the line as people explore other feature mappings, elu + 1 seems a biiit arbitrary
How about applying the kernel (O(n)) trick at inference time only, after training?
I am especially interested in whether the DETR model (Yannic already did a video on that), which uses the heavyweight transformer, could be modified to do object detection simply with this linear version! Maybe it's an idea for a new paper. Looking forward to it.
Thank you so much for all these great explanations!
Thanks Yannic for the explanation.
I was actually having some difficulty understanding how they used the associative property to transition from Eq. 4 to Eq. 5. Glad to see you've already made a video!
This wasn't explained in detail in the video either, maybe because it is something trivial, but am I the only one who can't get their head around why the two formulations (Eq. 4 and Eq. 5) are equivalent?
The dimensions of the numerators definitely don't look the same: the former is a linear combination of column vectors, while the latter is a row vector multiplied by a square matrix. Also, the associative property does not seem to hold in general for vector multiplication involving dot products.
Any help or explanation from anyone would be very appreciated :0
Sometimes people are very loose with row/column vectors and don't properly apply transposes. Can you find a counter-example where the equation doesn't hold (even when glossing over transpose issues)?
@@YannicKilcher I had a hard time with the indices as well, so I wrote it in numpy:
It does not help with transpositions (numpy handles the orientation of 1-D arrays automatically), but it helps with where to put which product - the equation holds true.
import numpy as np
shape = 5, 7
K = np.random.rand(*shape)
Q = np.random.rand(*shape)
V = np.random.rand(*shape)
classic_attention = np.matmul(np.matmul(Q, K.T), V)
assert classic_attention.shape == shape # stripped down classic attention
res = []
for i in range(shape[0]):
res.append(sum([np.matmul(Q[i], K[j]) * V[j] for j in range(shape[0])]))
matmul_byhand_attention = np.array(res)
print(np.allclose(classic_attention, matmul_byhand_attention))
res = []
for i in range(shape[0]):
s = sum([np.outer(K[j], V[j]) for j in range(shape[0])]) # associativity and distributivity
res.append(np.matmul(Q[i], s))
matmul_byhand_attention2 = np.array(res)
print(np.allclose(classic_attention, matmul_byhand_attention2))
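In symbols, the step that the second loop verifies is just the following (my own summary): q_i^T k_j is a scalar, so it can be moved inside the sum, and k_j v_j^T is exactly the outer product computed by np.outer above,
\[ \sum_j (q_i^\top k_j)\, v_j^\top \;=\; \sum_j q_i^\top \big(k_j v_j^\top\big) \;=\; q_i^\top \Big(\sum_j k_j v_j^\top\Big). \]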
12:33 Adding all the K[i] terms means that there is no way for the transformer to know the position of words in the sentence. Maybe the positional embeddings are helping it somehow. It would be interesting to see the performance of this transformer without the positional embeddings. Then again, RNNs don't have any explicit mechanism to store positions of words.
True, the positional embeddings probably do a lot of work here. RNNs don't have explicit position, but they can theoretically learn to count the number of input tokens - or what's probably more relevant - learn to estimate distances between tokens, which is another thing that position embeddings in transformers are good for.
Love it! Universal transformer please!
A really good video! I found it super informative, thank you!
As the paper states that transformers attending to one side only are equal to RNNs, doesn't it imply that full transformers are equal to residual bidirectional RNNs?
Only in an abstract sense. The computations are really different from a bidirectional RNN
I wonder why you do not have a video about Universal Transformers
Good shit! This video was very helpful and many ELI5 moments useful for students. Thanks!
If Q.K is usually the highest for the same token, I wonder why Q and K need to be different at all for a transformer to work properly. Why not just output a K and a V?
interesting idea. some papers fuse k and v, but I've never heard of fusing k and q
Hi - is the reason that the softmax version is O(N^2) that softmax (in the general case) has N^2 gradients before they're collapsed back? Or is there something else going on?
It's because you need to do N^2 inner products. Maybe if the softmax wasn't there, you could somehow optimize that away, so I think you do have a point
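To spell out the counting (a rough sketch, with N the sequence length and D the key/value dimension): softmax attention has to materialize the full N x N similarity matrix before the row-wise softmax, while dropping the softmax lets you regroup the product,
\[ \underbrace{\operatorname{softmax}(QK^\top)\,V}_{O(N^2 D)} \quad \text{vs.} \quad \underbrace{\phi(Q)\,\big(\phi(K)^\top V\big)}_{O(N D^2)}, \]
since \(\phi(K)^\top V\) is only D x D.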
According to their code, the main computation is like this:
KV = torch.einsum("nshd,nshm->nhmd", K, values)
Z = 1/(torch.einsum("nlhd,nhd->nlh", Q, K.sum(dim=1))+self.eps)
V = torch.einsum("nlhd,nhmd,nlh->nlhm", Q, KV, Z)
Can someone translate these einsums into normal PyTorch operations?
Better off just reading up on einsum and getting familiar with it. It is by far the most readable and intuitive way of expressing these statements, and I say that as a numpy veteran of 15 years.
Yeah, the problem is that it would translate into multiple matmuls, elementwise products, transposes and summations :D I also suggest you dive into einsum notation.
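That said, here's one possible permute/matmul translation of those three einsums (a sketch with hypothetical sizes: batch n, sequence l/s, heads h, key dim d, value dim m), mainly to show what they're doing:

import torch

N, L, H, D, M = 2, 10, 4, 8, 16
Q = torch.rand(N, L, H, D)
K = torch.rand(N, L, H, D)
values = torch.rand(N, L, H, M)
eps = 1e-6

# einsum version from the comment above
KV = torch.einsum("nshd,nshm->nhmd", K, values)
Z = 1 / (torch.einsum("nlhd,nhd->nlh", Q, K.sum(dim=1)) + eps)
out_einsum = torch.einsum("nlhd,nhmd,nlh->nlhm", Q, KV, Z)

# the same computation with permute + matmul
Kp = K.permute(0, 2, 3, 1)                               # (N, H, D, L)
Vp = values.permute(0, 2, 1, 3)                          # (N, H, L, M)
KV2 = Kp @ Vp                                            # (N, H, D, M): sum over the sequence index
Qp = Q.permute(0, 2, 1, 3)                               # (N, H, L, D)
num = Qp @ KV2                                           # (N, H, L, M)
den = Qp @ K.sum(dim=1).unsqueeze(-1)                    # (N, H, L, 1)
out_matmul = (num / (den + eps)).permute(0, 2, 1, 3)     # back to (N, L, H, M)

print(torch.allclose(out_einsum, out_matmul, atol=1e-5))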
I came to ask “what’s the einsum for 21:03.”. Thanks!
So... dropping the "batch" and "head" indices, it's:
V'_lm = sum_{s,d} Q_ld * K_sd * V_sm / sum_{s,d} Q_ld * K_sd
So in practice you could stack way more transformer layers and get better performance?
that's always the goal :D
I was wondering to what extent their idea is related to A² double-net attention? In the latter, the outer product is likewise separated into a gathering step and a distribution step, and it's likewise shown to be on par with the original transformer. Anyway, the general mathematical idea can already be found in the SVD.
Cool description. Thanks
Thanks Yannic. We've seen a few papers on transformers for long sequences, like Reformer, Linformer, and Longformer, but which of them really works well in the real world?
If you made a video comparing these papers on faster or longer transformers, I would be very thankful.
At 14:17 the numerator and denominator are the same except for multiplying with V_j - how is that possible?
it characterizes a normalized distribution over V
Your intuition at 31:40 is great! Maybe a silly question, but would an architecture be possible where the transformation function phi is learned as an N x (# of intermediate units) matrix of trainable parameters?
sure, I don't see why not
The generalized formulation of equation 3 looks like Bayes formula, has anyone noticed? Is it accidental or is there a deeper meaning here?
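For what it's worth, the resemblance is probably just the "normalize by the sum" structure: the generalized form can be read as an expectation of the values under a similarity-weighted distribution over positions (my own reading, not a claim from the paper),
\[ V'_i = \frac{\sum_j \operatorname{sim}(Q_i, K_j)\, V_j}{\sum_j \operatorname{sim}(Q_i, K_j)} = \mathbb{E}_{p(j \mid i)}[V_j], \qquad p(j \mid i) = \frac{\operatorname{sim}(Q_i, K_j)}{\sum_{j'} \operatorname{sim}(Q_i, K_{j'})}. \]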
Hey, that's great!! I would like to ask about the model used to produce the results in the paper. On GitHub there is only the library, not the model. Can you share the model with us?
The complexity works well when N > D^2. So the paper is saying that if the dimensionality is 512, my sequence length should be greater than 512^2 = 262144. Is this really practical?!
Wait a second: Figure 5c is showing PER over wall-clock time, and despite the significant speed-up the original softmax transformer is better AND faster (at reaching any given error rate).
Yes, that's because on this task it seems to be very important to have the quadratic attention
commenting for algo. love your channel
Can anyone summarize the difference between Linformer and this paper?
Linformer gets linear complexity thanks to a constant dimension for the keys and values, because they are projected into a fixed lower dimension. This paper, on the other hand, does not fix the dimension of the sequence; it only reorders the computation, and that reordering can only be done under the conditions the kernel formulation provides.
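Roughly side by side (my own summary; E and F are Linformer's learned projections along the sequence dimension, phi is the feature map here):
\[ \text{Linformer: } \operatorname{softmax}\!\Big(\tfrac{Q\,(EK)^\top}{\sqrt{d}}\Big)(FV) \;\sim\; O(Nkd), \qquad \text{this paper: } \frac{\phi(Q)\big(\phi(K)^\top V\big)}{\phi(Q)\,\phi(K)^\top \mathbf{1}} \;\sim\; O(Nd^2). \]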
Sir, Thank You
speed over accuracy for me tbh. Not everyone has Google's resources to run it on 112312412312423 TPUs
That's a lot of damage.
I mean TPUs
thank you :)
If I was rich I'd give you money.
Who else saw balls in the thumbnail?
Now, that I see it. I cannot unsee it.
Now we all see the message
I love your videos, but could you try to make them more compact? 50 minutes is a big time investment. I feel like most papers can be explained in high detail in half that time.