∞-former: Infinite Memory Transformer (aka Infty-Former / Infinity-Former, Research Paper Explained)

  • Published 5 Jun 2024
  • #inftyformer #infinityformer #transformer
    Vanilla Transformers are excellent sequence models, but suffer from very harsh constraints on the length of the sequences they can process. Several attempts have been made to extend the Transformer's sequence length, but few have successfully gone beyond a constant-factor improvement. This paper presents a method, based on continuous attention mechanisms, to attend to an unbounded past sequence by representing the past as a continuous signal rather than a sequence. This enables the Infty-Former to effectively enrich the current context with global information, which increases performance on long-range dependencies in sequence tasks. Further, the paper presents the concept of sticky memories, which highlights past events of particular importance and elevates their representation in the long-term memory.
    OUTLINE:
    0:00 - Intro & Overview
    1:10 - Sponsor Spot: Weights & Biases
    3:35 - Problem Statement
    8:00 - Continuous Attention Mechanism
    16:25 - Unbounded Memory via concatenation & contraction
    18:05 - Does this make sense?
    20:25 - How the Long-Term Memory is used in an attention layer
    27:40 - Entire Architecture Recap
    29:30 - Sticky Memories by Importance Sampling
    31:25 - Commentary: Pros and cons of using heuristics
    32:30 - Experiments & Results
    Paper: arxiv.org/abs/2109.00301
    Sponsor: Weights & Biases
    wandb.me/start
    Abstract:
    Transformers struggle when attending to long contexts, since the amount of computation grows with the context length, and therefore they cannot model long-term memories effectively. Several variations have been proposed to alleviate this problem, but they all have a finite memory capacity, being forced to drop old information. In this paper, we propose the ∞-former, which extends the vanilla transformer with an unbounded long-term memory. By making use of a continuous-space attention mechanism to attend over the long-term memory, the ∞-former's attention complexity becomes independent of the context length. Thus, it is able to model arbitrarily long contexts and maintain "sticky memories" while keeping a fixed computation budget. Experiments on a synthetic sorting task demonstrate the ability of the ∞-former to retain information from long sequences. We also perform experiments on language modeling, by training a model from scratch and by fine-tuning a pre-trained language model, which show benefits of unbounded long-term memories.
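    For a concrete picture of the two steps described above (fitting the past to a continuous signal, then reading it with a Gaussian attention density), here is a minimal numpy sketch. It is not the authors' code: the function names are made up, the paper's closed-form Gaussian integrals are replaced by a simple numerical grid, and the query-dependent prediction of the attention's mean and variance is left out.
```python
import numpy as np

def rbf_basis(t, centers, width):
    # psi_j(t) = exp(-(t - mu_j)^2 / (2 * width^2)) for every center mu_j
    return np.exp(-(t[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

def fit_memory(X, num_basis=64, width=0.02, ridge=1e-3):
    """Fit coefficients B so that x(t) ~= B^T psi(t) on t in [0, 1] (ridge regression)."""
    L, _ = X.shape
    t = np.linspace(0, 1, L)                   # positions of the L past vectors
    centers = np.linspace(0, 1, num_basis)
    F = rbf_basis(t, centers, width)           # (L, num_basis) design matrix
    B = np.linalg.solve(F.T @ F + ridge * np.eye(num_basis), F.T @ X)  # (num_basis, d)
    return B, centers

def continuous_attention(B, centers, width, mu, sigma, grid=1000):
    """Context vector c = E_{t ~ N(mu, sigma^2)}[x(t)], approximated on a grid."""
    t = np.linspace(0, 1, grid)
    density = np.exp(-(t - mu) ** 2 / (2 * sigma ** 2))
    density /= density.sum()                   # discretized Gaussian attention density
    signal = rbf_basis(t, centers, width) @ B  # (grid, d) reconstructed signal
    return density @ signal                    # (d,) context vector

# Toy usage: 512 past embeddings of width 8, read with attention centered at t = 0.25.
X = np.random.randn(512, 8)
B, centers = fit_memory(X)
c = continuous_attention(B, centers, width=0.02, mu=0.25, sigma=0.05)
print(B.shape, c.shape)   # (64, 8) (8,) -- memory size is independent of sequence length
```
    The point of the construction is visible in the shapes: the whole long-term memory lives in the (num_basis x d) coefficient matrix B, whose size does not grow with the number of past tokens.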
    Authors: Pedro Henrique Martins, Zita Marinho, André F. T. Martins
    Links:
    TabNine Code Completion (Referral): bit.ly/tabnine-yannick
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    BiliBili: space.bilibili.com/1824646584
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 81

  • @mgostIH
    @mgostIH 2 years ago +99

    Cool fact: the infinity in Infinity Former stands for the number of papers about transformer variations that get published!

    • @carlossegura403
      @carlossegura403 2 years ago

      Seriously. But I can't complain; NLP has progressed drastically thanks to the popularity of the Transformer / GPT. A few years back, NLP was slow, tedious, and had many architectures for each sub-specialized problem.

    • @L9X
      @L9X 2 years ago +2

      I thought the infinity stood for the number of variations of attention we have come up with, which aren't really even attention, but we call them attention because it's cool. (Like seriously, why the hell do we call modern "Gated Linear Units" an attention mechanism? It seems anything that involves a sigmoid or softmax applied to some vector then multiplied by some other vector is called attention these days.)

  • @alpers.2123
    @alpers.2123 2 years ago +13

    Learning is compression

  • @user-ey2vv1dl3n
    @user-ey2vv1dl3n 2 years ago +1

    nice format!

  • @JuanCamiloGamboaHiguera
    @JuanCamiloGamboaHiguera 2 years ago +3

    You could argue that when training the model, if the embedding space is learned, the model would learn to map the training sequences to an embedding space where they can be represented with continuous signals. Also, this might just be a limitation of the choice of basis functions but there is no reason why you couldn't have different types of basis functions in there (sawtooth, triangle, square, gaussians, etc) at the same time.
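    As a toy illustration of the mixed-basis point (purely a sketch; the paper itself uses Gaussian RBFs, and all names below are made up), the same ridge-regression fit goes through unchanged if the design matrix stacks several basis-function families:
```python
import numpy as np

t = np.linspace(0, 1, 200)[:, None]        # positions of 200 past vectors
c = np.linspace(0, 1, 16)[None, :]         # 16 centers per basis family
w = 0.05                                   # shared width

gauss    = np.exp(-(t - c) ** 2 / (2 * w ** 2))
triangle = np.clip(1 - np.abs(t - c) / w, 0, None)
square   = (np.abs(t - c) < w).astype(float)

F = np.hstack([gauss, triangle, square])   # (200, 48) mixed design matrix
X = np.cumsum(np.random.randn(200, 8), axis=0)          # toy "past hidden states"
B = np.linalg.solve(F.T @ F + 1e-3 * np.eye(F.shape[1]), F.T @ X)
print(np.mean((F @ B - X) ** 2))           # reconstruction error with the mixed dictionary
```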

  • @geoffreysworkaccount160
    @geoffreysworkaccount160 2 years ago

    Thank you for this.

  • @samuelelwell7575
    @samuelelwell7575 2 years ago

    Hey Yannic, your videos are awesome! Do you know if you'll be doing one on the AlphaFold 2 paper?

  • @user-xs9ey2rd5h
    @user-xs9ey2rd5h 2 years ago +3

    I think this model would work really well in reinforcement learning, so I'm really curious as to how well it performs in scenarios that weren't described in the paper. So I'd love to see you throwing some problems at it and seeing how well it works.

  • @theohintemann9374
    @theohintemann9374 2 years ago

    Thanks - good work

  • @diagorasofmel0s
    @diagorasofmel0s 2 years ago +1

    keep up the good work!

  • @lirothen
    @lirothen 2 years ago +1

    Feels like the compressing & appending of long-term memory in figure 2 should be applied to the attention

  • @L9X
    @L9X 2 years ago

    Man, I love your videos

  • @umabrahma2019
    @umabrahma2019 2 years ago

    Amazing

  • @mgostIH
    @mgostIH 2 years ago +11

    17:43 I don't quite agree with this; the basis functions don't seem to be biased towards storing more recent information. It seems like the precision you get is uniform across the entire time domain.
    20:00 It might be that the network learns to map embeddings in a way to help the interpolation in performing better, but I think the most important question is how well does this approach scale compared to storing the discrete representations themselves and whether tasks like NLP may benefit more than other general time sequence predictions.
    I think it would be cool to see a learned interpolator too, although the performance might be very bad 🤔

    • @michaelparis6039
      @michaelparis6039 2 years ago +1

      Exactly, this parameter tau gives you the control needed to equally weight newly learned and already-known information

    • @ademord
      @ademord 2 years ago

      Can I ask you for feedback on my master's thesis?

  • @sophiez7952
    @sophiez7952 2 years ago

    Hi, I can learn a lot from you. You are great, thanks. Have a wonderful day!

  • @freemind.d2714
    @freemind.d2714 2 years ago +8

    Infinite Memory Transformer -> Informer

    • @vaibhavbansal9358
      @vaibhavbansal9358 2 years ago +2

      Yeah, it would have been a great name for it, but someone already took the name for a transformer architecture for time-series forecasting, published in early 2021.

  • @agentds1624
    @agentds1624 2 years ago +1

    Who or what is the name in the last sentence of the video? I understand "lucid rains" (and so do the automatic subtitles).

  • @michaelparis6039
    @michaelparis6039 2 years ago +4

    18:47 The reason (why a learned embedding could be modeled as continuous) could be that language "feels" mostly fluid and continuous - or wouldn't you say?

    • @user-rh8hi4ph4b
      @user-rh8hi4ph4b 2 years ago +3

      I don't think that language either is or feels continuous at all. I think the reason this approximation works here is that the feature width is large enough that there is enough redundancy in the hidden representations for the model to reconstruct some of the lost information when attending to/reading the long-term memory.
      Also note how the compression is applied to each feature dimension separately. For any element of the original sequence, the lossiness of the compression might have been particularly unfavorable to one dimension but particularly gentle in another. In total, even with the lossiness and the seemingly arbitrary assumption of continuity, after the compression there's still enough information for the model to extract relevant features, and that only gets better the larger you make the feature width.
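      A quick toy check of the per-dimension argument (a sketch with made-up names, not the paper's code): the ridge-regression compression is solved independently for every feature dimension, and the reconstruction error it leaves behind indeed varies from dimension to dimension.
```python
import numpy as np

L, d, N = 256, 16, 32                      # sequence length, feature width, number of basis functions
t = np.linspace(0, 1, L)
centers = np.linspace(0, 1, N)
F = np.exp(-(t[:, None] - centers[None, :]) ** 2 / (2 * 0.03 ** 2))   # (L, N) Gaussian RBFs

X = np.cumsum(np.random.randn(L, d), axis=0)                          # toy "hidden states"
B = np.linalg.solve(F.T @ F + 1e-3 * np.eye(N), F.T @ X)              # one fit per dimension
X_hat = F @ B                                                         # lossy reconstruction

per_dim_err = ((X - X_hat) ** 2).mean(axis=0)
print(per_dim_err.min(), per_dim_err.max())   # the loss is spread unevenly across dimensions
```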

  • @HD3.0
    @HD3.0 2 years ago

    Nice

  • @nocturnomedieval
    @nocturnomedieval 2 years ago +6

    During my PhD studies I came across the not widely known and sometimes surprising dark-magic-sorcery fact that Markov chains can have memory, and, even more, it can be infinite. I was wondering when this would reach the AI niche. I'm a believer that information theory can synergise pretty well with DL.

    • @rpcruz
      @rpcruz 2 years ago

      Markov chains have very strong assumptions. Namely, that the next step depends only on the current step.

  • @vaibhavbansal9358
    @vaibhavbansal9358 2 years ago +10

    Why does this paper feel like a 'Perceiver IO, language modelling, and the Fourier transform walk into a bar' thing? 😛
    Though, great video, once again! 😄

  • @florianro.9185
    @florianro.9185 2 years ago +2

    MLFlow is a good alternative to weights & biases in my opinion :)

  • @starkest
    @starkest 2 years ago +1

    sorry, what was the reference at 36:28 where the implementation will be available?

    • @Anoyzify
      @Anoyzify 2 years ago

      Also want to know this.

  • @Kram1032
    @Kram1032 2 years ago +7

    Can't wait for somebody to come up with the idea to "just learn the attention functions" by using an entire fully connected DNN dedicated to just the attention mechanism to encode arbitrary functions - and then for somebody to switch that out with first a CNN and then a transformer to get a second order attention-attention
    ... and then just continue from there. Construct the infinite tower of attentions all the way down
    Also: try the same but in fourier space!
    More seriously though, the smoothing could work out if you allow, like, deeper lookups. I'm imagining something like the network going "oh I remember there was something relevant roughly in the first quarter of the book", at which point this information can be unpacked and attended to at higher resolution. A sort of binary-search-ish thing could be possible that way, assuming you can't just hold everything in RAM, but you *can* have access to it on your hard drive and it might be fast enough to retrieve that way.
    In that case, smoothing the signal first might make sense. If you can somehow reconstruct the deeper attention once you see what actually lies there again.

    • @Virsconte
      @Virsconte 2 years ago +1

      Before he said they were using RBFs, I just assumed it would be using Fourier series. I wonder if you could reorder the dimensions of the embeddings to try to get rid of higher frequency components. Basically looking at all of the tokens in your dataset (or some subset) and looking at what dimensions correlate the most.

    • @oncedidactic
      @oncedidactic 2 years ago +1

      @@Virsconte This seems to make good intuitive sense as well as attractive from the engineering perspective. Like, you don't need to remember the exact sentence structure of a story years later, just the major nouns and verbs, and hence a simplified but effective description of events. It's another tradeoff for compression, but seems reasonable.

  • @oncedidactic
    @oncedidactic 2 years ago +3

    I forget which other transformer paper you did, but it brought up the idea of why not use Fourier transforms to define attention. The idea at that point being that it's not the exact form of the attention that matters, since the learning modulates it, but just some mixing in general.
    This one gets me thinking: if we want to dabble in heuristic-compression hell for a good tradeoff against incalculable backprop, why not use Fourier for the long-term signal memory (instead of RBFs) and also for the attention learner (instead of whatever is du jour)? Like, signals are all waves anyway; tokens are projections of whatever the "upstream" process was that generated them. It's not too crazy to think that the compression lossiness against tokens might actually overlap well with the generating function making the tokens, or at least its relevant features that you're trying to learn anyway.
    Promise I'm not trying to superficially conflate two areas here where "hey look, it's a curvy continuous thing". More a remark on the artificiality of tokens as learnable data.
    I guess another thing you could say here is: if not Fourier or another good jack-of-all-trades compression system, what gimmick is supposed to work best? It can't be that we're just hunting for the right gimmick that runs well on our chips.
    Forgot to say: totally agree, super shady to call it infinite with such a modest performance bump for a "new architecture". And skeezy to outsource to the heuristic.

  • @RuminRoman
    @RuminRoman 2 years ago

    Yannic, but you can place the infinite Universe in your finite head. Even if you have forgotten something, you can read the appropriate paper, or talk to a specialist, or expand your brain with new modules yourself, with your own money which you yourself have earned. So we need a model that can read and talk and make money and expand itself with its own money. Positive cash flow.

  • @JTMoustache
    @JTMoustache 2 years ago +1

    Rather than using a decomposition based on RBFs, they could have used a more classic and richer wavelet decomposition. The structure of the wavelet output should also make more sense to the network, imho.

    • @ademord
      @ademord 2 years ago

      Can I ask you for feedback on my master's thesis? 🧐

  • @Anujkumar-my1wi
    @Anujkumar-my1wi 2 years ago +1

    Hey, can you clear up my confusion about why the mathematical model of an artificial neuron subjects the input data x to an affine transformation defined by W, followed by a non-linear transformation, i.e. nonlinear_fun(weights*inputs+bias)? Why not the other way around: a non-linear transformation on the input and then an affine transformation on the transformed input, i.e. nonlinear_fun(inputs)*weights+bias?
    Also, the mathematical model of an artificial neural net is written as weights*nonlinear_fun(weights*inputs+bias)+bias. Isn't that the output of the net? Shouldn't it be nonlinear_fun(weights*nonlinear_fun(weights*inputs+bias)+bias), or is it because the activation function of the output neuron is linear?
    EDIT: I mean, shouldn't the mathematical model of a single neuron and of an artificial neural net have the same form?

    • @rpcruz
      @rpcruz 2 years ago +1

      For most problems you do "nonlinear_fun(weights*nonlinear_fun(weights*inputs+bias)+bias)". But for some problems, when the output is a linear regression, then the last nonlinear_fun is the identity function, so you can omit it.

    • @Anujkumar-my1wi
      @Anujkumar-my1wi 2 years ago

      ​@@rpcruz Thanks ,but about the first ,do you have any thoughts on that i.e
      why the mathematical model of an artificial neuron subjects the input data x to an affine transformation defined by W, followed by a non-linear transformation, i.e. nonlinear_fun(weights*inputs+bias), rather than a non-linear transformation on the input followed by an affine transformation, i.e. nonlinear_fun(inputs)*weights+bias?

    • @drdca8263
      @drdca8263 2 years ago +1

      @@Anujkumar-my1wi So, take the example of a fully connected feed-forward network. In the way this is usually done, you apply a matrix to the input vector and then apply the non-linearity to each coordinate separately.
      If you also mean to have the non-linearity apply to each coordinate separately, then, uh, the non-linearity doesn't get to use the mixing from the different inputs.
      If you add more layers, then the two should be equivalent, but this is basically just the same as "apply this non-linearity at the start of your network, and then continue as normal".
      But that's kinda pointless, I think?
      I am not experienced in ML, so take this with a grain of salt.

    • @Anujkumar-my1wi
      @Anujkumar-my1wi 2 years ago

      @@drdca8263 Thanks

    • @mgostIH
      @mgostIH 2 years ago

      Doing a non-linear operation first will lose information about the inputs. For example, if your inputs are [1., -1.] and you apply a ReLU before doing any matrix multiplication, you will lose the -1, as it'll become a 0.
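      A tiny numerical check of this point (illustrative only, not from the paper or the video):
```python
import numpy as np

x = np.array([1.0, -1.0])
W = np.array([[1.0, 1.0]])                 # a weight row that sums the two inputs
relu = lambda v: np.maximum(v, 0.0)

print(W @ x)        # affine first: [0.] -- the -1 still influences the result
print(W @ relu(x))  # non-linearity first: [1.] -- the -1 was destroyed before any mixing
```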

  • @nikitastaf1996
    @nikitastaf1996 2 years ago +1

    I don't know if you did it intentionally or not, but the image for the problem statement chapter is the Weights & Biases ad.

  • @draxd3045
    @draxd3045 2 years ago +1

    I like the sound of the helicopter

  • @srh80
    @srh80 2 years ago +3

    I think the next team who writes a transformer paper needs to donate a dollar to the swear jar.

  • @rochne
    @rochne 2 years ago

    "Informer" would have been the best choice of name IMO 🙂

  • @jean-baptistedelabroise5391
    @jean-baptistedelabroise5391 2 years ago

    It is weird to limit it to a Gaussian if you make this into a BERT-type model, as the BERT attention for the CLS token will often attend to every SEP token...

  • @TheGallowtree
    @TheGallowtree 2 years ago +1

    I can't believe no-one spotted the typo in equation 8.

  • @hosseinmobahi4841
    @hosseinmobahi4841 2 years ago

    Why are you recording yourself on a rooftop? :) Love the location.

    • @Metaloid-wv4kz
      @Metaloid-wv4kz 2 years ago +2

      He's trying to go to infinity and beyond; he's reforming his thinking and having a watershed moment, bro! Or he's just scared; we all need a rooftop moment.

  • @norik1616
    @norik1616 2 years ago +2

    Best W&B ad ever

  • @pensiveintrovert4318
    @pensiveintrovert4318 2 years ago +6

    Basically quantization.

  • @TheGodSaw
    @TheGodSaw 2 years ago

    I think you meant to say "brings about the same problems as LSTMs, namely you get an angry post from Schmidhuber".

  • @IvanHe-gc7bf
    @IvanHe-gc7bf 1 month ago

    I think it should be called "Continuous Attention Is All You Need".

  • @kanfoosj
    @kanfoosj 2 years ago +8

    I prefer calling it "nifty-former"

  • @NeoShameMan
    @NeoShameMan 2 years ago +1

    I was wondering when such an architecture would emerge. Back with the release of AI Dungeon, I was like: what if we compressed the previous half of the working memory into a summary, so that the sliding working memory retains more information for continuity? It's a function that the language model could already do back then.

  • @jawadmansoor6064
    @jawadmansoor6064 2 years ago +3

    I think the best transformer was the stack-former or add-former (where you can stack layers without increasing complexity quadratically).

    • @__E__
      @__E__ 2 years ago

      Which papers exactly are you referring to? I can't find them by googling addformer or stackformer.

    • @jawadmansoor6064
      @jawadmansoor6064 2 years ago +1

      @@__E__ I am sorry, I have a habit of modifying names (as they sound to me for fun). I actually meant fastformer (the additive attention paper). "Additive attention CAN BE all you need".

    • @__E__
      @__E__ 2 years ago +1

      @@jawadmansoor6064 It's alright man :)
      But I'm not sure I got this paper right, because afaik it's not about stacking layers but rather about having a global key and a global query per layer; the quadratic stuff happens inside a layer, not because you stack them.

    • @jawadmansoor6064
      @jawadmansoor6064 2 years ago

      @@__E__ Transformers are ordinarily expensive due to quadratic complexity; if you stack layers, the big-O would still be quadratic (that being the most expensive operation). However, this paper suggests that you only compute Q, K, and V once (Q once, K twice and V thrice, I forget which was which), hence the maximum complexity depends only linearly on the number of tokens.
      What I mean by a layer is that all the operations from the "global" key/query computation form one layer, and you can stack the same operation on top of it. (It is difficult to explain in a few words; better yet, please refer to the video by Yannic Kilcher.) If you still don't get it after watching the video, then do ask again and I will write an explanation of it in a few days (motivated by you, though I will make it public), insha Allah.

  • @ericadar
    @ericadar 2 years ago +7

    Why not treat the problem of compressing the embedding vectors as a learnable task, or at least show that their choice (ridge regression using RBFs) is superior to other lossy compression heuristics? Seems arbitrary.

    • @bigbuckey7687
      @bigbuckey7687 2 years ago +2

      The problem with learning compression is now you have to learn through time, and you run into the classic problems that LSTMs had to solve, like the vanishing/exploding gradient problem, not to mention that you can't parallelize training anymore. But yeah, I agree their choices seem arbitrary and not justified (says someone who hasn't read the paper myself).

    • @mgostIH
      @mgostIH 2 years ago

      @@bigbuckey7687 > The problem with learning compression is now you have to learn through time
      I don't think it's true in this case: you only need to learn something that's good at interpolating points, there's no variation in the training regime that would be caused by the amount of tokens.

    • @bigbuckey7687
      @bigbuckey7687 2 years ago +1

      @@mgostIH Well we have plenty of algorithms to simply interpolate points, there's no need to learn that. But if we did make some custom interpolation that was learned, then since this architecture samples the past interpolations the loss would depend on those previous iterations, a.k.a learning through time. This is in the same idea as what an LSTM does (with major differences of course). Not sure what you mean by "there's no variation in the training regime that would be caused by the amount of tokens".

    • @mgostIH
      @mgostIH 2 years ago

      @@bigbuckey7687
      > Well we have plenty of algorithms to simply interpolate points, there's no need to learn that.
      A lot of algorithms for interpolation make specific assumptions about the data; given that there's a strong link between compression and intelligence (Marcus Hutter), I would start thinking much more about new neural approaches for this sort of problem.
      > since this architecture samples the past interpolations the loss would depend on those previous iterations, a.k.a learning through time
      But there are no "previous iterations" in this case; in training models like these, you only need a single forward pass to give all the tokens in a sentence their continuous representation and then interpolate between them.
      > Not sure what you mean by "there's no variation in the training regime that would be caused by the amount of tokens".
      An example could be training a SIREN network to fit the points you are given: their amount doesn't change anything about how you train the network, but you still get out an interpolation that doesn't depend on the amount of tokens you had.

    • @bigbuckey7687
      @bigbuckey7687 2 years ago

      @@mgostIH Ah ok, now I get your point. As long as the learned interpolation only depends on the input and not on hidden states updated through time, the iterations don't depend on each other. Thanks for the clarification.

  • @ngginger2944
    @ngginger2944 7 months ago

    This paper reminds me of VQ-VAE

  • @L9X
    @L9X 2 years ago

    W&B gang

  • @sergiomanuel2206
    @sergiomanuel2206 2 years ago

    Hello there!!

  • @minos99
    @minos99 2 years ago

    This is one of the xformers I'm gonna forget. Continuous representation is welcome but I don't think they justify their choice enough. The model feels too unjustifiably handcrafted.

  • @brandonmckinzie2737
    @brandonmckinzie2737 2 years ago

    Cool idea. Really misleading title.

  • @umabh2339
    @umabh2339 2 years ago

    Nice