GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Paper Explained)

  • Published 8 Jun 2024
  • Google builds a 600-billion-parameter transformer to do massively multilingual, massive machine translation. Interestingly, the larger model scale does not come from increasing the depth of the transformer, but from increasing the width of the feedforward layers, combined with hard routing to parallelize computation across up to 2048 TPUs. A very detailed engineering paper!
    OUTLINE:
    0:00 - Intro & Overview
    4:10 - Main Results
    5:10 - Mixture-of-Experts
    16:00 - Difference to Scaling Classic Transformers
    18:50 - Backpropagation in Mixture-of-Experts
    20:05 - MoE Routing Algorithm in GShard
    38:20 - GShard Einsum Examples
    47:40 - Massively Multilingual Translation
    56:00 - Results
    1:11:30 - Conclusion & Comments
    ERRATA:
    I said the computation of MoE scales linearly, but actually, it's sub(!)-linear.
    Paper: arxiv.org/abs/2006.16668
    Abstract:
    Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
    Authors:
    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen
    Links:
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
  • Science & Technology
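
    The abstract's "lightweight annotation APIs" idea can be sketched runnably: JAX's GSPMD partitioner descends from GShard's XLA extension, so the annotate-and-let-the-compiler-propagate workflow looks roughly like the following in modern JAX (mesh axis names and shapes are illustrative assumptions, not GShard's actual TF API).

    ```python
    import numpy as np
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    # Build a 1-D device mesh over whatever devices are available.
    mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

    # Annotate one tensor: shard the batch dimension, replicate the feature dim.
    x = jnp.ones((8, 16))
    x = jax.device_put(x, NamedSharding(mesh, P("data", None)))

    # Compiled code then runs SPMD; XLA propagates the sharding downstream.
    y = jax.jit(lambda t: t * 2)(x)
    print(y.sharding)
    ```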

COMMENTS • 82

  • @YannicKilcher
    @YannicKilcher  3 years ago +15

    ERRATA:
    I said the computation of MoE scales linearly, but actually, it's sub(!)-linear.

    • @socratic-programmer
      @socratic-programmer 3 years ago

      How is it sub-linear, again? The distribution, computation, and gathering all sound like linear-scaling ops.

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      Because that part is parallelizable, they don't include it. The distribution itself is O(sqrt(D)). They have a detailed section on it in the paper.
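
      (A back-of-envelope sketch of one piece of that argument, with my own illustrative dimensions rather than the paper's accounting: under top-2 routing, per-token expert compute stays constant as the expert count grows, while a dense model matched for parameter count scales linearly; the routing/dispatch term mentioned above is what grows sub-linearly.)

      ```python
      # Per-token FFN FLOPs: a dense FFN widened to match E experts' parameters
      # vs. an MoE where each token only visits its top-2 experts.
      d_model, d_ff = 1024, 8192  # illustrative transformer dimensions

      def ffn_flops(width_multiplier):
          # two matmuls (d_model -> d_ff * mult -> d_model), ~2 FLOPs per MAC
          return 2 * 2 * d_model * d_ff * width_multiplier

      for E in (8, 128, 2048):
          print(f"E={E:5d}  dense={ffn_flops(E):.2e}  top-2 MoE={ffn_flops(2):.2e}")
      ```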

    • @gaceladri
      @gaceladri 3 years ago

      What about the speed? Say you have two models with the same number of params, one without MoE and the other with MoE. Would the MoE model be faster?

    • @veedrac
      @veedrac 3 years ago +1

      If only human experts cooperated this well.

  • @sahiluppal3093
    @sahiluppal3093 3 years ago +15

    OpenAI: This is GPT3 and it costs me everything.
    Google: What? Just 175B, here you go GShard

    • @xsuploader
      @xsuploader 3 years ago +1

      The OpenAI model is dense; this is a sparse model. You can't compare them that way.

  • @kappadistributive
    @kappadistributive 3 years ago +23

    Do you have a Patreon account or some other way to support your educational work here on YouTube?

    • @YannicKilcher
      @YannicKilcher  3 years ago +6

      I'm thinking about it :) Thanks for the compliment though

  • @norik1616
    @norik1616 3 years ago +1

    I'll gladly watch 1h+ videos on papers in the higher weight class. If the ideas are good, it's no problem for me to keep watching, even when the videos are long and there are many of them.

  • @majork.651
    @majork.651 3 years ago +12

    Einstein Summation is such an awesome tool

    • @MadisonMay-X
      @MadisonMay-X 3 years ago +1

      Noam Shazeer sees every paper as an opportunity to spread the good word about einsum and I love it :D
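
      (For the curious: a minimal NumPy sketch of a GShard-style dispatch einsum. The shapes and index letters are illustrative assumptions, not the paper's exact code.)

      ```python
      import numpy as np

      # Route each token's representation into per-expert buffers in one
      # contraction. g=groups, s=tokens per group, e=experts, c=capacity,
      # m=model dimension.
      G, S, E, C, M = 2, 4, 8, 2, 16

      inputs = np.random.randn(G, S, M)      # token representations
      dispatch = np.random.rand(G, S, E, C)  # routing weights (toy values)

      # (G,S,E,C) x (G,S,M) -> (E,G,C,M): tokens land in expert buffers
      expert_inputs = np.einsum("gsec,gsm->egcm", dispatch, inputs)
      print(expert_inputs.shape)  # (8, 2, 2, 16)
      ```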

  • @teslaonly2136
    @teslaonly2136 3 years ago +1

    Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention over Modules
    This is another good paper to review!

  • @qeter129
    @qeter129 3 years ago +7

    The bigger the networks, the longer the video. Gonna need some next-gen sentiment analysis algorithms pretty soon.

  • @AbgezocktXD
    @AbgezocktXD 3 years ago +4

    Hi Yannic, it would be great if you could make a video on tensors and their dynamics. This einsum notation amazed me, I would like to learn more about it. Thanks for your efforts and cya.

  • @DimaLepikhin
    @DimaLepikhin 3 years ago +5

    Very nice! Thank you!

  • @Fine_Mouche
    @Fine_Mouche 2 years ago

    Thanks for taking some of your time to analyse the paper.

  • @dmitrysamoylenko6775
    @dmitrysamoylenko6775 3 years ago +2

    I find that when Yannic is excited, that's when I enjoy the videos the most. This video 🔥

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      Wait until you see my Minecraft Let's play!

  • @slackstation
    @slackstation 3 years ago +1

    I'm surprised that some org hasn't made a distributed machine learning server farm à la Folding@Home. I'm sure there are plenty of people who would contribute GPU time. There are a lot of 1080 Ti and 2080 Ti GPUs in the world (for games), and those owners are the kind that have a few TB of space on HDs and fast internet connections.
    Even if there were inefficiencies and overhead, it would allow researchers to do interesting work. It'd be cool to be able to say: I helped with this model where a voice talks naturally, this image model that generates exact faces, this language model where I can ask it questions and get surprisingly sophisticated answers.
    As always, great work. This one was massive and I will need to revisit it to fully understand everything that's going on.

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      I think these TPUs and their low-latency interconnect are unmatched even if you were to connect all the gamers of the world. Sadly.

  • @tringuyen4386
    @tringuyen4386 3 years ago +5

    My prediction for the next contestant in the NLP body-part measuring contest is NVIDIA with their A100s, which I think will actually be interesting since TF32 can dynamically adjust for sparse matrices. My assumption is that within the M4 dataset there is a lot of sparsity that can be exploited to massively reduce the compute required.

    • @YannicKilcher
      @YannicKilcher  3 years ago

      Yes that would be a game changer

    • @vikramsinghnimbalkar8502
      @vikramsinghnimbalkar8502 3 years ago

      When he said "body part measuring contest"... lol. I seriously had to stop the video to settle down.

  • @XMaster96DE
    @XMaster96DE 3 years ago +2

    First off, I always enjoy a good engineering paper.
    Second, I have been thinking about expert sub-net segmentation for a while now, in the context of multi-task NNs and solving the catastrophic forgetting problem.
    I guess it shouldn't be too hard to apply an additional contrastive loss to the attention mechanism. Maybe use the magnitude of parameter movement for the contrastive loss. Maybe if the NN uses only a subset of the experts, then when a new task comes along the Transformer could switch to different experts without forgetting the first task, because the experts for the first task are not overwritten.
    But I am just thinking out loud at this point.

    • @YannicKilcher
      @YannicKilcher  3 years ago

      Indeed. I think with a library like this, we'll see lots of such ideas being implemented!

  • @dshlai
    @dshlai 3 years ago

    Someday we need a cost-benefit analysis paper for these giant models.

  • @DavenH
    @DavenH 3 years ago

    Hi Yannic, do you know what tool people use to make the transformer diagrams? I use yEd for making graphical flow diagrams, but it's not as slick-looking.

  • @florianhonicke5448
    @florianhonicke5448 3 years ago +1

    Nice!

  • @junaidahmed4682
    @junaidahmed4682 3 years ago +1

    Great!!

  • @KlausRosenberg-et2xv
    @KlausRosenberg-et2xv 1 year ago

    Uh... GPT-3 had already been trained at that time, and it is a major success now... I wish I could test GShard too.

  • @silberlinie
    @silberlinie 3 years ago +2

    Wouldn't it be nice to see translation examples
    of the different models regarding the claimed
    quality improvement?

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      Yes, that's why I said I feel the task itself is almost like an afterthought in this paper :)

    • @silberlinie
      @silberlinie 3 years ago

      @@YannicKilcher Are you saying that they don't give these examples?

  • @MrSchweppes
    @MrSchweppes 3 years ago +2

    A question for those who are versed in the topic: how many Nvidia A100 GPUs would be required to train this model, and how long would it take?
    (Let's say you have this huge dataset.)
    Yannic, thanks a lot for this video!

    • @thomasachee463
      @thomasachee463 3 years ago +1

      Gee, since you put out the question in the open it makes me so antsy to just solve it. I guess my engineering brain just can't help it....
      and the answer is... figure it out yourself.

    • @DavenH
      @DavenH 3 years ago +3

      @@thomasachee463 You just wasted a minute of your time

  • @Basant5911
    @Basant5911 9 months ago

    If we are sharding across different machines, won't that add more of a bottleneck in IO and latency?

  • @ta6847
    @ta6847 3 years ago +2

    Me: But how do they backprop through the hard routing?!?
    Yannic: ...to backprop through the hard routing, they just add some noise...
    Me: [thoughtful contemplation] Well...of course that's how they do it, totally obvious.

  • @MMc9081
    @MMc9081 3 years ago +1

    I must be missing something, but if using 128 TPUs is equivalent to 6 TPU-core-years, then I would expect it to take 17 days of training, not the 6 days or so in the graph. Can anyone enlighten me?
    EDIT: looks at graph again...oh, now I get it!
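
    (For reference, the arithmetic behind the 17-day figure, treating one TPU as one core:)

    $$\frac{6\ \text{core-years} \times 365\ \text{days/year}}{128\ \text{cores}} \approx 17\ \text{days}$$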

  • @CalclaviaProductions
    @CalclaviaProductions 3 years ago +1

    1:04:34 For training very deep Transformers, see arxiv.org/pdf/2003.04887.pdf

  • @guillaumewenzek4210
    @guillaumewenzek4210 3 years ago

    How do we know that the experts are actually doing different things? How much expressiveness are we buying vs. GPT-3? (Haven't finished the video yet.)

    • @YannicKilcher
      @YannicKilcher  3 years ago

      We don't, but it's safe to assume so if we gain in performance.

  • @WannabePianistSurya
    @WannabePianistSurya 3 years ago +3

    Can I run this model on my toaster? I really want to test this out myself!

    • @YannicKilcher
      @YannicKilcher  3 years ago +8

      yes, it runs on Toast Processing Units (TPUs)

  • @marianofares9694
    @marianofares9694 3 years ago

    Is this available for download? Some API maybe?

  • @diophantine1598
    @diophantine1598 3 years ago +1

    Are there examples of text generated by the transformer?

  • @petroschristodoulou7987
    @petroschristodoulou7987 3 years ago

    How do they backpropagate through the hard routing?

    • @YannicKilcher
      @YannicKilcher  3 years ago

      Like you backpropagate through dropout.
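
      (A minimal sketch of that analogy, my own toy code rather than the paper's implementation: the discrete expert choice gets no gradient, just like a dropout mask, while the continuous gate value scaling the expert's output carries the gradient.)

      ```python
      import numpy as np

      # Toy top-1 MoE layer. The argmax selection is treated as a constant
      # (no gradient), like a dropout mask; gradients reach the gating weights
      # only through the continuous probability scaling the chosen expert.
      def moe_forward(x, w_gate, experts):
          logits = x @ w_gate                  # one gating logit per expert
          probs = np.exp(logits) / np.exp(logits).sum()
          k = int(np.argmax(probs))            # hard routing: non-differentiable
          return probs[k] * experts[k](x)      # differentiable path via probs[k]

      # Hypothetical usage with two toy "experts"
      x = np.random.randn(4)
      w_gate = np.random.randn(4, 2)
      experts = [lambda v: 2 * v, lambda v: v + 1]
      print(moe_forward(x, w_gate, experts))
      ```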

  • @andres_pq
    @andres_pq 3 years ago

    Do the attention mechanisms of the transformer layers all have the same q, k and v weights? Asking for a friend hehe

    • @YannicKilcher
      @YannicKilcher  3 years ago

      Within the same layer, the parameters are replicated, but different layers have different parameters.

    • @andres_pq
      @andres_pq 3 years ago

      @@YannicKilcher Thanks Yannic!

  • @thevivekmathema
    @thevivekmathema 3 years ago +1

    can I zip the 6B components to make it fit on my 16GB

    • @YannicKilcher
      @YannicKilcher  3 years ago

      I think you need to zip twice to get it down really small

    • @CosmiaNebula
      @CosmiaNebula 3 years ago

      It's so big you need the best hydraulic press you can get to compress it.

  • @jinlaizhang312
    @jinlaizhang312 3 years ago

    waiting for the code release

  • @thevivekmathema
    @thevivekmathema 3 years ago +2

    I just managed to train a 1024x1024 GAN to a loss of 0.05 on a GTX 1080 Ti, and my GPU is now DEAD!

  • @dwarakasharma
    @dwarakasharma 3 years ago +1

    I'm really wondering about its carbon footprint.

  • @herp_derpingson
    @herp_derpingson 3 years ago +1

    I am almost certain that the number of authors of a paper scales (sub?)linearly with model size.
    .
    There is a line in the paper, "Neural networks demand non-negligible amounts of computation power." I mean, the authors are not wrong :)
    .
    I was a bit curious about what Broader Impact the authors could conjure up for this paper XD

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      I guess the evil people could shard their models to 666 machines!

  • @DeepGamingAI
    @DeepGamingAI 3 years ago +4

    RIP academia

  • @LouisChiaki
    @LouisChiaki 3 years ago

    "Someone made the powerpoint mistake" LOL

  • @jabowery
    @jabowery 3 years ago

    There seems to be hysteria against algorithmic information theory in language modeling.
    OpenAI boasts a 175 BILLION parameter model.
    Now Google boasts a 600 BILLION parameter model.
    Now, I wouldn't call this "anti-AIT" if it weren't for the fact that these papers don't even attempt to estimate the actual information content of these parameters. Instead, they seem to take _pride_ in the obviously-inflated "parameter count" (and we haven't even gotten into measuring the complexity of the parameterized code).

    • @Marcos10PT
      @Marcos10PT 3 years ago +3

      We're just seeing what happens when we scale current algorithms. No one is stealing the other science from researchers!

    • @YannicKilcher
      @YannicKilcher  3 years ago

      I agree, but it is rather peculiar that the more outrageously overparameterized a model is, the better it performs, don't you think so? :)

    • @jabowery
      @jabowery 3 years ago

      @Big Deeper So long as they have _accurate_ measures of "perplexity" there is a direct relationship to AIT, hence avoidance of overfitting. The problem* is that perplexity measures can be contaminated by the corpus itself, and the likelihood of that happening increases with the comprehensiveness of the corpus. Indeed, such contamination has already been detected once. Moreover, the likelihood of it happening in an undetectable way also increases with the size of the corpus.
      This is why Marcus Hutter uses the size of the losslessly compressed corpus (including the de/compression algorithm), as explained in the two questions starting with this one in his FAQ:
      www.hutter1.net/prize/hfaq.htm#xvalid
      *There are other problems with perplexity that are avoided by using the size of an executable archive of the data as the benchmark, such as bringing the size of the code base into the same units (bits) with the size of the losslessly compressed data, but that gets washed out as the size of the compressed corpus gets into the multigigabyte range.

    • @jabowery
      @jabowery 3 years ago

      @@YannicKilcher It doesn't surprise me* because, as I said, there is no attempt to compress the parameters to estimate their actual information content -- that combined with the fact that the approaches being taken are within the same general paradigm of machine learning -- so one may well expect the compression ratio of inflated parameters to deflated parameters to be similarly invariant.
      *Nor would it surprise me to find that they aren't suffering from a great deal of overfitting, since one of the virtues of recent machine learning approaches is that they tend to be quite compressible using straightforward if not brute-force statistical techniques. But wouldn't it be nice to have a benchmark that guaranteed protection against these things, so we can actually make principled model if not ML-algorithm selection decisions (given the same general class of compute resource)?
      Having these gigacorps out there throwing around vast piles of iron isn't exactly inspiring a new generation of innovative thinkers, now is it?

  • @dshlai
    @dshlai 3 years ago

    Academia can hit back with smarter models instead of bigger models

  • @petros_adamopoulos
    @petros_adamopoulos 3 years ago

    9:42 "I don't know other languages" just wow. Giggle on.

  • @viniciusarruda2515
    @viniciusarruda2515 3 years ago +1

    *Out of context:*
    Maybe you could do a drama video on this: www.reddit.com/r/MachineLearning/comments/hiv3vf/d_the_machine_learning_community_has_a_toxicity/

  • @DistortedV12
    @DistortedV12 3 years ago

    Bigger models, bigger problems

  • @WHORTH-TheUglyDucklingOfFORTH
    @WHORTH-TheUglyDucklingOfFORTH 3 years ago

    Wish they used gigaparameters instead of billions; it's too confusing how the English-speaking world does billions wrong... they're really just milliards.

    • @TheXergon
      @TheXergon 3 years ago

      I like the short scale much better than the non-English long scale. It's just more logical.

    • @WHORTH-TheUglyDucklingOfFORTH
      @WHORTH-TheUglyDucklingOfFORTH 3 years ago

      @@TheXergon I find billion = million^2, trillion = million^3, etc. more logical, but that's not the point. It's ambiguous, and we do have an alternative in mega, giga, tera, etc.

  • @frankvandenberghen6024
    @frankvandenberghen6024 3 years ago +2

    I would not be so enthusiastic about this new "research" from Google. The goal of this research is to try to improve the accuracy of the Google Translate engine, as estimated using the BLEU score. I must admit that I don't know much about the BLEU score... But what I do know is that the translation engine developed by DeepL is *much* better than the one provided by Google (I know that because I use it every day: DeepL is incredibly better than Google). So I fail to see the point of this paper. For me, it's just one of these gigantic useless architectures that fails to deliver any benefits. Actually, it's even worse than that: not only is this useless, it's also the source of huge CO2 pollution.
    About CO2 pollution: this scientific paper:
    download.timi.eu/docs/ThirdParty/Energy_and_Policy_Considerations_for_Deep_Learning_in_NLP.pdf
    ...explains that the CO2 footprint of training one "tiny" deep learning model that has *only* 213 million parameters is equivalent to all the CO2 emitted by five normal cars over THEIR ENTIRE LIFETIMES (more precisely, 5.5 cars: (39+78468+192+626155)/126000 ≈ 5.5). In other words, each time you train a tiny NLP model, you create the same amount of pollution as 5.5 cars over their whole lifetimes.
    I have difficulty imagining the CO2 pollution Google created when they developed their "small" 37.5-billion-parameter model. Yes, that's 37500 million parameters or, in other words, equivalent to the pollution of at *minimum* 985 cars (985 ≈ 5.5*37500/213). To get this number, I made the optimistic assumption that pollution grows only linearly with the number of parameters. That is really optimistic; in reality, the growth is nearly exponential, so it's *much* worse.
    What's even worse is that Google did not stop at a 37.5-billion-parameter model. They made a 150-billion-parameter model, a 600-billion-parameter model, and they attempted a 1000-billion-parameter model (the attempt failed, but it still produced a staggering amount of CO2).
    It's difficult to grasp the measure of the CO2 pollution produced by this stupid, useless NLP infrastructure. It's roughly equivalent to the LIFETIME pollution of 50,000 cars, just for one stupid experiment (and a failed experiment at that, because DeepL is much better).
    It might be time to re-think what a good data science architecture looks like. A little bit of efficiency would really not hurt.
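
    (Spelling out the comment's own extrapolation, taking its linear-scaling assumption at face value:)

    $$\frac{39+78468+192+626155}{126000} \approx 5.59\ \text{cars}, \qquad 5.59 \times \frac{37500}{213} \approx 985\ \text{cars}$$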

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      I was surprised to learn that inference (i.e. the front-end) for translation / search / etc. actually emits orders of magnitude more CO2 than training one of these models (because once you've trained it, you use it over and over). That's also true for DeepL.

    • @TheXergon
      @TheXergon 3 years ago

      I don't think Google is using any vast networks for this task atm; inference with such huge models would consume too much energy and too many TPU cycles. I suppose the current Google Translate engine is only using neural networks from 2018 / beginning of 2019.