Sparse is Enough in Scaling Transformers (aka Terraformer) | ML Research Paper Explained
- Published 31 May 2024
- #scalingtransformers #terraformer #sparsity
Transformers keep pushing the state of the art in language and other domains, mainly due to their ability to scale to ever more parameters. However, this scaling has made it prohibitively expensive to run a lot of inference requests against a Transformer, both in terms of compute and memory requirements. Scaling Transformers are a new kind of architecture that leverages sparsity in the Transformer blocks to massively speed up inference, and by including additional ideas from other architectures, the authors create the Terraformer, which is fast, accurate, and consumes very little memory.
OUTLINE:
0:00 - Intro & Overview
4:10 - Recap: Transformer stack
6:55 - Sparse Feedforward layer
19:20 - Sparse QKV Layer
43:55 - Terraformer architecture
55:05 - Experimental Results & Conclusion
Paper: arxiv.org/abs/2111.12763
Code: github.com/google/trax/blob/master/trax/examples/Terraformer_from_scratch.ipynb
Abstract:
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer as we scale up the model size. Surprisingly, the sparse layers are enough to obtain the same perplexity as the standard Transformer with the same number of parameters. We also integrate with prior sparsity approaches to attention and enable fast inference on long sequences even with limited memory. This results in performance competitive to the state-of-the-art on long text summarization.
Authors: Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
LinkedIn: / ykilcher
BiliBili: space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Great explanation! About 41:10 and F: you're right, I've made a typo in the paper; "F=3" should be placed in the previous line, as it describes a conv-layer, not a mult-layer. Sorry!
I think it would be great for you to make a video reply to this answering some of the questions / assumptions he brings up in the summary.
Wait. It's called Terraformer and they don't use the Earth Mover's Distance?
Wow, the reversible block is literally a Feistel network, but over the reals instead of Z2.
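For anyone curious, here is a minimal NumPy sketch of that observation (my own illustration, not the paper's trax implementation): like a Feistel network, the input is split into two halves, but the XOR over Z2 is replaced by addition over the reals, so the block is exactly invertible and activations can be recomputed instead of stored.

import numpy as np

def f(x):  # stand-in for the attention sub-layer
    return np.tanh(x)

def g(x):  # stand-in for the feed-forward sub-layer
    return np.tanh(x)

def reversible_forward(x1, x2):
    # Feistel-like coupling: each half is updated from the other.
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    # Inputs are recovered exactly from the outputs, so no activations
    # need to be kept in memory for the backward pass.
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)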
The hundreds of things done in this paper make my head hurt. :(
I see a lot of similarities to Jeff Hawkins' explanation of how their architecture works (activate, don't activate, etc.)
Can you share a link? :)
It's already known that large language models can be put through substantial distillation without substantial loss. And that is with dense models, which are sort of like the drunk looking for his keys underneath the light post, due to the availability of GPUs and the lack of sparse-optimized hardware.
hey i remember your comment about the manhattan project
has anyone ever told you that you’re based
@@flightrisk7566 what was his comment? Couldn't find it with a Google search
check the top comments on Yannic’s video about learning rate grafting, that’s where you can find it
and then if i didn’t misremember that come tell this man he’s based
@@flightrisk7566 Found it thanks. Yeah based on his comments this guy's thoughts seem based in reality
@@bobspianosbffl “this guys thoughts seem based on reality.” only _observed_ reality 👀 theoretical justification????
/s
Has anyone tried biasing the network towards relatively high activations? So the network can descend into zeroing out rows instead of having an inductive bias linear layer that estimates which rows are zeros?
An L1 regularization does that, but as far as I know this never really worked well in DL.
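To make that idea concrete, here is a rough, hypothetical sketch of an activation-L1 penalty on a feed-forward layer (the function name and the 1e-3 weight are made up for illustration, not from the paper): the penalty pushes hidden activations toward zero, instead of a separate controller predicting which rows to skip.

import numpy as np

def ffn_with_l1_penalty(x, W1, W2, l1_weight=1e-3):
    # x: (d_model,), W1: (d_model, d_ff), W2: (d_ff, d_model)
    h = np.maximum(0.0, x @ W1)            # ReLU hidden activations
    out = h @ W2
    penalty = l1_weight * np.abs(h).sum()  # added to the training loss to encourage zeros
    return out, penalty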
Great Videos! I would really like to hear your opinion on that paper: Efficiently Modeling Long Sequences with Structured State Spaces. Metrics look quite promising :-)
You should start a pattern of one sentence summaries
I wanted to mention, even though I am subscribed, that there's almost no reason to subscribe from the user end anymore. YouTube has become completely useless in terms of letting me know who out of my subscriptions has released a video.
I literally have to go to subscriptions and then click the all button and then turn my phone sideways and then scroll through looking for blue dots.
I hate YouTube so deeply.
Turn on bell notifications (all)?
Would you say innovation is plateauing @yannic?
a bit, but we're probably still doing fine :)
We are not plateauing, we are Grokking ;)
If your prediction of q×k is not perfect, then why bother calculating higher rank products?
Use the low rank "prediction" values directly, and choose only the highest x% of the products to sum up the values.
Wouldn't that also work?
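As a rough illustration of that suggestion (my own sketch, not the paper's method): score queries against keys through a cheap low-rank projection, keep only the top x% of scores per query, and sum the values over just those entries.

import numpy as np

def lowrank_topk_attention(Q, K, V, P, keep_frac=0.1):
    # Q: (n, d), K: (m, d), V: (m, d_v), P: (d, r) with r << d (low-rank projection)
    scores = (Q @ P) @ (K @ P).T               # cheap approximation of Q @ K.T
    k = max(1, int(keep_frac * K.shape[0]))    # keep the top x% of keys per query
    out = np.zeros((Q.shape[0], V.shape[1]))
    for i, row in enumerate(scores):
        top = np.argpartition(row, -k)[-k:]    # indices of the k largest approximate scores
        w = np.exp(row[top] - row[top].max())  # softmax over the kept scores only
        w /= w.sum()
        out[i] = w @ V[top]
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(8, 64)), rng.normal(size=(32, 64)), rng.normal(size=(32, 64))
P = rng.normal(size=(64, 8))
print(lowrank_topk_attention(Q, K, V, P).shape)  # (8, 64)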
That dumb hand drawn graph made me subscribe heh. I don't know why I wasn't subscribed before since your videos show up in my feed anyway lmao.
Do temporal fusion transformers ! Many thanks in advance!
This feels like squeezing toothpaste out to make sure there's none left. Not much fun, but at least you can brush your teeth.
I thought this had to do with the robots that transform and shoot each other
Is this related to compressed sensing?
First!
second
I see 17B parameters. Ok I close video