Sparse is Enough in Scaling Transformers (aka Terraformer) | ML Research Paper Explained
- Published 31 May 2024
- #scalingtransformers #terraformer #sparsity
Transformers keep pushing the state of the art in language and other domains, mainly due to their ability to scale to ever more parameters. However, this scaling has made it prohibitively expensive to run a lot of inference requests against a Transformer, both in terms of compute and memory requirements. Scaling Transformers are a new kind of architecture that leverages sparsity in the Transformer blocks to massively speed up inference, and by including additional ideas from other architectures, the authors create the Terraformer, which is fast, accurate, and consumes very little memory.
OUTLINE:
0:00 - Intro & Overview
4:10 - Recap: Transformer stack
6:55 - Sparse Feedforward layer
19:20 - Sparse QKV Layer
43:55 - Terraformer architecture
55:05 - Experimental Results & Conclusion
Paper: arxiv.org/abs/2111.12763
Code: github.com/google/trax/blob/master/trax/examples/Terraformer_from_scratch.ipynb
Abstract:
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer as we scale up the model size. Surprisingly, the sparse layers are enough to obtain the same perplexity as the standard Transformer with the same number of parameters. We also integrate with prior sparsity approaches to attention and enable fast inference on long sequences even with limited memory. This results in performance competitive to the state-of-the-art on long text summarization.
Authors: Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
LinkedIn: / ykilcher
BiliBili: space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Great explanation! About 41:10 and F: you're right, I've made a typo in the paper; "F=3" should be placed in the previous line, as it describes a conv-layer, not a mult-layer. Sorry!
I think it would be great for you to make a video reply to this answering some of the questions / assumptions he brings up in the summary.
Wait. It's called Terraformer and they don't use the Earth Mover's Distance?
Wow, the reversible block is literally a Feistel network, but over the reals instead of Z2.
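For anyone curious, here is a minimal NumPy sketch of that observation (my own illustration, not the paper's trax implementation): like a Feistel network, the input is split into two halves, but the XOR over Z2 is replaced by addition over the reals, so the block is exactly invertible and activations can be recomputed instead of stored.

import numpy as np

def f(x):  # stand-in for the attention sub-layer
    return np.tanh(x)

def g(x):  # stand-in for the feed-forward sub-layer
    return np.tanh(x)

def reversible_forward(x1, x2):
    # Feistel-like coupling: each half is updated from the other.
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    # Inputs are recovered exactly from the outputs, so no activations
    # need to be kept in memory for the backward pass.
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)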
The hundreds of things done in this paper make my head hurt. :(
I see a lot of similarities to Jeff Hawkins' explanation of how their architecture works (activate, don't activate, etc.)
Can you share a link? :)
It's already known that large language models can be put through substantial distillation without substantial loss. And that is with dense models, which are sort of like the drunk looking for his keys underneath the light post, due to the availability of GPUs and the lack of sparse-optimized hardware.
hey i remember your comment about the manhattan project
has anyone ever told you that you’re based
@@flightrisk7566 what was his comment? Couldn't find it with a Google search
check the top comments on Yannic’s video about learning rate grafting, that’s where you can find it
and then if i didn’t misremember that come tell this man he’s based
@@flightrisk7566 Found it thanks. Yeah based on his comments this guy's thoughts seem based in reality
@@bobspianosbffl “this guys thoughts seem based on reality.” only _observed_ reality 👀 theoretical justification????
/s
Has anyone tried biasing the network towards relatively high activations? So the network can descend into zeroing out rows instead of having an inductive bias linear layer that estimates which rows are zeros?
An L1 regularization does that, but as far as I know this never really worked well in DL.
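To make that idea concrete, here is a rough, hypothetical sketch of an activation-L1 penalty on a feed-forward layer (the function name and the 1e-3 weight are made up for illustration, not from the paper): the penalty pushes hidden activations toward zero, instead of a separate controller predicting which rows to skip.

import numpy as np

def ffn_with_l1_penalty(x, W1, W2, l1_weight=1e-3):
    # x: (d_model,), W1: (d_model, d_ff), W2: (d_ff, d_model)
    h = np.maximum(0.0, x @ W1)            # ReLU hidden activations
    out = h @ W2
    penalty = l1_weight * np.abs(h).sum()  # added to the training loss to encourage zeros
    return out, penalty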
Great Videos! I would really like to hear your opinion on that paper: Efficiently Modeling Long Sequences with Structured State Spaces. Metrics look quite promising :-)
You should start a pattern of one sentence summaries
I wanted to mention, even though I am subscribed, that there's almost no reason to subscribe from the user end anymore. YouTube has become completely useless in terms of letting me know who out of my subscriptions has released a video.
I literally have to go to subscriptions and then click the all button and then turn my phone sideways and then scroll through looking for blue dots.
I hate YouTube so deeply.
Turn on bell notifications (all)?
Would you say innovation is plateauing @yannic?
a bit, but we're probably still doing fine :)
We are not plateauing, we are Grokking ;)
If your prediction of q×k is not perfect, then why bother calculating higher rank products?
Use the low rank "prediction" values directly, and choose only the highest x% of the products to sum up the values.
Wouldn't that also work?
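As a rough illustration of that suggestion (my own sketch, not the paper's method): score queries against keys through a cheap low-rank projection, keep only the top x% of scores per query, and sum the values over just those entries.

import numpy as np

def lowrank_topk_attention(Q, K, V, P, keep_frac=0.1):
    # Q: (n, d), K: (m, d), V: (m, d_v), P: (d, r) with r << d (low-rank projection)
    scores = (Q @ P) @ (K @ P).T               # cheap approximation of Q @ K.T
    k = max(1, int(keep_frac * K.shape[0]))    # keep the top x% of keys per query
    out = np.zeros((Q.shape[0], V.shape[1]))
    for i, row in enumerate(scores):
        top = np.argpartition(row, -k)[-k:]    # indices of the k largest approximate scores
        w = np.exp(row[top] - row[top].max())  # softmax over the kept scores only
        w /= w.sum()
        out[i] = w @ V[top]
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(8, 64)), rng.normal(size=(32, 64)), rng.normal(size=(32, 64))
P = rng.normal(size=(64, 8))
print(lowrank_topk_attention(Q, K, V, P).shape)  # (8, 64)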
That dumb hand drawn graph made me subscribe heh. I don't know why I wasn't subscribed before since your videos show up in my feed anyway lmao.
Do temporal fusion transformers ! Many thanks in advance!
This feels like squeezing toothpaste out to make sure there's none left. Not much fun, but at least you can brush your teeth.
I thought this had to do with the robots that transform and shoot each other
Is this related to compressed sensing?
First!
second
I see 17B parameters. Ok I close video