GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Paper Explained)
- Published Jun 8, 2024
- Google builds a 600 billion parameter transformer to do massively multilingual, massive machine translation. Interestingly, the larger model scale does not come from increasing depth of the transformer, but from increasing width in the feedforward layers, combined with a hard routing to parallelize computations on up to 2048 TPUs. A very detailed engineering paper!
OUTLINE:
0:00 - Intro & Overview
4:10 - Main Results
5:10 - Mixture-of-Experts
16:00 - Difference to Scaling Classic Transformers
18:50 - Backpropagation in Mixture-of-Experts
20:05 - MoE Routing Algorithm in GShard
38:20 - GShard Einsum Examples
47:40 - Massively Multilingual Translation
56:00 - Results
1:11:30 - Conclusion & Comments
ERRATA:
I said the computation of MoE scales linearly, but actually, it's sub(!)-linear.
Paper: arxiv.org/abs/2006.16668
Abstract:
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
Authors:
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher - Science & Technology
How is it sub-linear, again? The distribution, computation, and gathering all sound like linear-scaling ops.
Because that part is parallelizable, they don't include it. The distribution itself is O(sqrt(D)). They have a detailed section in the paper.
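To make the scaling argument concrete, here's a rough, illustrative sketch (sizes are made up, not from the paper): with top-2 routing, each token only ever visits two experts, so per-token compute stays flat while the total parameter count grows linearly with the number of experts.

```python
# Illustrative sketch (made-up sizes, not from the paper): in a
# Mixture-of-Experts layer with top-2 routing, per-token compute is
# fixed by the expert size, while total parameters grow with the
# number of experts.
d_model, d_ff = 1024, 4096

for num_experts in (64, 512, 2048):
    # each expert is a 2-layer feedforward block: d_model -> d_ff -> d_model
    total_params = num_experts * 2 * d_model * d_ff
    # top-2 routing: 2 experts, 2 matmuls each, 2*d_model*d_ff FLOPs per matmul
    per_token_flops = 2 * 2 * (2 * d_model * d_ff)
    print(f"{num_experts=} {total_params=} {per_token_flops=}")
```

The printed per-token FLOPs are identical across all three rows, which is the crux of the sub-linearity claim: adding experts adds parameters but not per-token compute (the cross-device distribution cost is the part the reply above says is analyzed separately).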
What about the speed? Let's say you have two models with the same number of params, one without MoE and the other with MoE. Would the model with MoE be faster?
If only human experts cooperated this well.
OpenAI: This is GPT3 and it costs me everything.
Google: What? Just 175B, here you go GShard
the open AI model is dense
this is a sparse model. You cant compare them that way.
Do you have a Patreon account or some other way to support your educational work here on YouTube?
I'm thinking about it :) Thanks for the compliment though
I'll gladly watch 1h+ videos of papers in the higher weight class. If the ideas are good, but long and many of them, there is no problem for me to keep watching.
Einstein Summation is such an awesome tool
Noam Shazeer sees every paper as an opportunity to spread the good word about einsum and I love it :D
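For anyone who hasn't used it: einsum expresses tensor contractions by naming the indices, which is why the paper can write things like MoE dispatch as one-liners. A minimal NumPy sketch (shapes and names are illustrative, not taken from the paper):

```python
import numpy as np

# Batched matrix multiply: for each batch b, C[b] = A[b] @ B[b]
A = np.random.rand(4, 3, 5)   # (batch, rows, inner)
B = np.random.rand(4, 5, 2)   # (batch, inner, cols)
C = np.einsum('bij,bjk->bik', A, B)

# MoE-style dispatch: scatter tokens to experts with a one-hot mask,
# the kind of contraction the paper writes as a single einsum.
tokens = np.random.rand(8, 16)                      # (tokens, d_model)
mask = np.eye(4)[np.random.randint(0, 4, size=8)]   # (tokens, experts), one-hot
dispatched = np.einsum('td,te->etd', tokens, mask)  # (experts, tokens, d_model)
```

The index string is the whole API: repeated letters are summed over, and the letters after `->` pick the output layout, so a reshuffle like token-to-expert dispatch needs no transpose/reshape gymnastics.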
Learning to Combine Top-Down and Bottom-Up Signals
in Recurrent Neural Networks with Attention over Modules
This is another good paper to review!
The bigger the networks the longer the video. Gonna need some next gen sentiment analysis algorithms pretty soon.
Hi Yannic, it would be great if you could make a video on tensors and their dynamics. This einsum notation amazed me, I would like to learn more about it. Thanks for your efforts and cya.
Very nice! Thank you!
Thanks for taking some of your time to analyse the paper.
I find that when Yannic is excited, that's when I enjoy the videos the most. This video 🔥
Wait until you see my Minecraft Let's play!
I'm surprised that some org hasn't made a distributed machine learning server farm à la Folding@Home. I'm sure there are plenty of people who would contribute GPU time. There are a lot of 1080 Ti and 2080 Ti GPUs in the world (for games), and those owners are the kind that have a few TB of space on HDs and fast internet connections.
Even if there were inefficiencies and overhead, it'd allow researchers to do interesting work. It'd be cool to be able to say: I helped with this model where a voice was talking naturally, this image model that generates exact faces, this language model where I can ask it questions and get surprisingly sophisticated answers.
As always, great work. This one was massive, and I will need to revisit it to fully understand everything that's going on.
I think these TPUs and their low-latency interconnection are unmatched even if you were to connect all gamers of the world. Sadly.
My prediction for the next contestant in the NLP body part measuring contest is NVIDIA with their A100s, which I think will actually be interesting, since with TF32 they can dynamically adjust for sparse matrices. My assumption is that within the M4 dataset there is a lot of sparsity that can be addressed to massively reduce the compute required.
Yes that would be a game changer
When he said "body part measuring contest"... lol. I seriously had to stop the video to settle down.
First off, I always enjoy a good engineering paper.
Second, I have been thinking about expert sub-net segmentation for a while now, in the context of solving multitask NNs and the catastrophic forgetting problem.
I guess it shouldn't be too hard to apply an additional contrastive loss to the attention mechanism. Maybe use the magnitude of movement of the parameters for the contrastive loss. Maybe if the NN is just using a part of the experts, and then a new task comes along, the Transformer could switch to different experts without forgetting the first task, because the experts for the first task are not overwritten.
But I am just thinking out loud at this point.
Indeed. I think with a library like this, we'll see lots of such ideas being implemented!
Someday we need a cost benefit analysis paper for these giant models.
Hi Yannic, do you know what tool people use to make the transformer diagrams? I use yEd for making graphical flow diagrams but it's not as slick looking.
app.diagrams.net
Nice!
Great!!
Uh... GPT-3 had already been trained at that time, and it is a major success now... I wish I could test GShard too.
Wouldn't it be nice to see translation examples of the different models regarding the claimed quality improvement?
Yes, that's why I said I feel the task itself is almost like an afterthought in this paper :)
@@YannicKilcher Are you saying that they don't give these examples?
The question is for those who are versed in the topic: how many Nvidia A100 GPUs are required to train this model and how long will it take?
(Let's say you have this huge dataset).
Yannic, thanks a lot for this video!
Gee, since you put out the question in the open it makes me so antsy to just solve it. I guess my engineering brain just can't help it....
and the answer is... figure it out yourself.
@@thomasachee463 You just wasted a minute of your time
If we are sharding across different machines, then it will add more bottlenecks in IO & latency, right?
Me: But how do they backprop through the hard routing?!?
Yannic: ...to backprop through the hard routing, they just add some noise...
Me: [thoughtful contemplation] Well...of course that's how they do it, totally obvious.
I must be missing something, but if using 128 TPUs is equivalent to 6 TPU-core-years, then I would expect it to take 17 days of training, not the 6 days or so in the graph. Can anyone enlighten me?
EDIT: looks at graph again...oh, now I get it!
1:04:34 For training very deep Transformers, see arxiv.org/pdf/2003.04887.pdf
How do we know that the experts are actually doing different things? How much expressiveness are we buying vs. GPT-3? (Haven't finished the video yet)
We don't, but it's safe to assume so if we gain in performance.
Can I run this model on my toaster? I really want to test this out myself!
yes, it runs on Toast Processing Units (TPUs)
Is this available for download? Some API maybe?
Are there examples of text generated by the transformer?
Not in the paper
how do they backpropagate through the hard routing?
Like you backpropagate through dropout.
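To make the dropout analogy concrete, here is a minimal forward-pass sketch (illustrative, not the paper's exact algorithm; all names and sizes are made up): the top-1 choice itself is a hard, non-differentiable decision, but the expert's output is scaled by the soft gate value, and gradients flow through that scaling, just as dropout's mask is fixed while gradients flow through the surviving units.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One token's routing. The argmax is a hard, non-differentiable
# decision (like sampling a dropout mask); the soft gate value that
# multiplies the expert output is what carries the gradient.
d_model, num_experts = 4, 3
token = np.random.rand(d_model)
W_gate = np.random.rand(d_model, num_experts)  # illustrative gating weights
gates = softmax(token @ W_gate)                # soft, differentiable scores
top = int(np.argmax(gates))                    # hard routing decision
expert_out = np.tanh(token)                    # stand-in for the expert FFN
output = gates[top] * expert_out               # gradient flows via gates[top]
```

Because the gate value scales the output, the gating weights still receive a useful gradient signal even though the discrete expert choice does not; the noise mentioned in the video encourages exploration of experts that would otherwise never be picked.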
Does each attention mechanism of the transformer layers have the same q, k, and v weights? Asking for a friend hehe
Within the same layer, the parameters are replicated, but the different layer parameters are different.
@@YannicKilcher Thanks Yannic!
can I zip the 6B components to make it fit on my 16GB
I think you need to zip twice to get it down really small
It's so big you need the best hydraulic press you can get to compress it.
waiting for the code release
I just managed to train a 1024x1024 GAN with a loss of 0.05 on a GTX 1080 Ti, and my GPU is now DEAD!
I'm really wondering about its carbon footprint.
I am almost certain that the number of authors of a paper scales (sub?)linearly with model size.
There is a line in the paper, "Neural networks demand non-negligible amounts of computation power.". I mean the authors are not wrong :)
I was a bit curious about what Broader Impact could the authors conjure for this paper XD
I guess the evil people could shard their models to 666 machines!
RIP academia
"Someone made the powerpoint mistake" LOL
There seems to be hysteria against algorithmic information theory in language modeling.
OpenAI boasts a 175 BILLION parameter model.
Now Google boasts a 600 BILLION parameter model.
Now, I wouldn't call this "anti-AIT" if it weren't for the fact that these papers don't even attempt to estimate the actual information content of these parameters. Instead, they seem to take _pride_ in the obviously-inflated "parameter count" (and we haven't even gotten into measuring the complexity of the parameterized code).
We're just seeing what happens when we scale current algorithms. No one is stealing the other science from researchers!
I agree, but it is rather peculiar that the more outrageously overparameterized a model is, the better it performs, don't you think so? :)
@Big Deeper So long as they have _accurate_ measures of "perplexity" there is a direct relationship to AIT, hence avoidance of overfitting. The problem* is that perplexity measures can be contaminated by the corpus itself, and the likelihood of that happening increases with the comprehensiveness of the corpus. Indeed, such contamination has already been detected once. Moreover, the likelihood of it happening in an undetectable way also increases with the size of the corpus.
This is why Marcus Hutter uses the size of the losslessly compressed corpus (including the de/compression algorithm), as explained in the two questions starting with this one in his FAQ:
www.hutter1.net/prize/hfaq.htm#xvalid
*There are other problems with perplexity that are avoided by using the size of an executable archive of the data as the benchmark, such as bringing the size of the code base into the same units (bits) with the size of the losslessly compressed data, but that gets washed out as the size of the compressed corpus gets into the multigigabyte range.
@@YannicKilcher It doesn't surprise me* because, as I said, there is no attempt to compress the parameters to estimate their actual information content -- that combined with the fact that the approaches being taken are within the same general paradigm of machine learning -- so one may well expect the compression ratio of inflated parameters to deflated parameters to be similarly invariant.
*Nor would it surprise me to find that they aren't suffering from a great deal of overfitting, since one of the virtues of recent machine learning approaches is that they tend to be quite compressible using straightforward if not brute-force statistical techniques. But wouldn't it be nice to have a benchmark that guaranteed protection against these things, so we can actually make principled model (if not ML algorithm) selection decisions (given the same general class of compute resource)?
Having these gigacorps out there throwing around vast piles of iron isn't exactly inspiring a new generation of innovative thinkers, now is it?
Academia can hit back with smarter models instead of bigger models
9:42 "I don't know other languages" just wow. Giggle on.
*Out of context:*
Maybe could have a drama video on this: www.reddit.com/r/MachineLearning/comments/hiv3vf/d_the_machine_learning_community_has_a_toxicity/
Bigger models, bigger problems
Wish they used giga-parameters instead of billions; it's too confusing how the English-speaking world does billions wrong... they're really just milliards.
I like the short scale much better than the non-English long scale. It's just more logical.
@@TheXergon I find billion=million^2, trillion=million^3 etc. more logical, but that's not the point. It's ambiguous, and we do have an alternative in mega, giga, tera etc.
I would not be so enthusiastic about this new “research” from Google. The goal of this research is to try to improve the accuracy of the Google Translate engine, as estimated using this BLEU score. I must admit that I don’t know much about the BLEU score... But what I know is that the translation engine developed by DeepL is *much* better than the one provided by Google (I know that because I am using it every day: DeepL is incredibly better than Google). So, I fail to see the point of this paper. For me, it’s just one of these gigantic useless architectures that fails to deliver any benefits. Actually, it’s even worse than that: not only is this useless, it’s also the source of huge CO2 pollution.
About CO2 pollution: This scientific paper:
download.timi.eu/docs/ThirdParty/Energy_and_Policy_Considerations_for_Deep_Learning_in_NLP.pdf
…explains that the CO2 footprint of training one “tiny” deep learning model that has *only* 213 million parameters is equivalent to all the CO2 gas emitted by five normal cars over THEIR ENTIRE LIFETIME (more precisely, it’s 5.5 cars: (39+78468+192+626155)/126000 ≈ 5.5). In other words, each time you train a tiny NLP model, you create the same amount of pollution as 5.5 cars over their whole lifetime.
I have difficulty imagining the CO2 pollution that Google created when they developed their “small” 37.5-billion-parameter model. Yes, that’s 37500 million parameters or, in other words, equivalent to the pollution of *at minimum* 985 cars (985 ≈ 5.5*37500/213). To get this number, I made the optimistic assumption that the pollution only grows linearly with the number of parameters. This is really optimistic. In reality, the growth is nearly exponential, so it’s *much* worse.
What’s even worse is that Google did not stop at a 37.5-billion-parameter model. They made a 150-billion-parameter model, a 600-billion-parameter model, and they attempted to make a 1000-billion-parameter model (this attempt failed, but it still produced a staggering amount of CO2).
It’s difficult to grasp the scale of the CO2 pollution produced by this stupid, useless NLP infrastructure. It’s roughly equivalent to the LIFETIME pollution of 50,000 cars, just for one stupid experiment (and a failed experiment at that, because DeepL is much better).
It might be time to re-think what a good data science architecture looks like. A little bit of efficiency would really not hurt.
I was surprised to learn that inference (i.e. the front-end) of translation / search / etc. actually takes orders of magnitude more CO2 than training one of these models (because once you've trained it, you can use it repeatedly). That's also true for DeepL.
I don't think that Google is using any vast networks for this task atm; inference of such huge models would consume too much energy and too many TPU cycles. I suppose the current Google Translate engine is using neural networks from 2018/early 2019 only.