Mixtral of Experts (Paper Explained)

  • Published 17 May 2024
  • #mixtral #mistral #chatgpt
    OUTLINE:
    0:00 - Introduction
    3:00 - Mixture of Experts
    6:00 - Classic Transformer Blocks
    11:15 - Expert Routing
    17:00 - Sparse Expert Routing
    22:00 - Expert Parallelism
    25:00 - Experimental Results
    31:30 - Routing Analysis
    33:20 - Conclusion
    Paper: arxiv.org/abs/2401.04088
    Abstract:
    We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
    Authors: Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
    Links:
    Homepage: ykilcher.com
    Merch: ykilcher.com/merch
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: ykilcher.com/discord
    LinkedIn: / ykilcher
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 113

  • @YannicKilcher
    @YannicKilcher  4 months ago +13

    OUTLINE:
    0:00 - Introduction
    3:00 - Mixture of Experts
    6:00 - Classic Transformer Blocks
    11:15 - Expert Routing
    17:00 - Sparse Expert Routing
    22:00 - Expert Parallelism
    25:00 - Experimental Results
    31:30 - Routing Analysis
    33:20 - Conclusion

    • @genegray9895
      @genegray9895 4 months ago

      You skipped the appendix! Figure 10 was the best part of the whole paper.

  • @theosalmon
    @theosalmon 4 months ago +134

    You know what they say. We only use 10% of our MOE.

    • @DecentralisedGames
      @DecentralisedGames 4 months ago

      Is MoE able to analyse its own output with a rating system?

    • @user-pn7by3wn7z
      @user-pn7by3wn7z 4 months ago

      Soon @@DecentralisedGames

    • @desrucca
      @desrucca 23 days ago +1

      What would happen if we used 100% of our MoE capacity?

    • @theosalmon
      @theosalmon 23 days ago

      @@desrucca slow garbage?

  • @MultiMojo
    @MultiMojo 4 months ago +12

    Yannic paper review = automatic like. Keep 'em coming!

  • @AM-yk5yd
    @AM-yk5yd 4 months ago +12

    ~1:30 ~29:30 Mistral was secretive about training data from the start; even when asked directly on Discord, their answer was a non-answer.
    ~4:00 Yeah, Mixtral has 46.7B params (according to the HF param counter, which counts the data in the safetensors).
    ~19:30 Yeah, G(x) is n-dimensional, so G(x)_i is a scalar. It would be "fun" if each value inside the input token vector were passed through every possible pair of experts and the output vector consisted of values where each value got routed through its own TopK(2) experts (so o11 = e1(x1), o12 = e2(x1); y = [o11[0], o12[1]], where eN is expert N, x1 the input, y the output).
    ~31:30 If they looked at the validation split of The Pile, it's safe to assume they trained on it too.
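
    A minimal sketch of the top-2 routing being discussed here (layer sizes and names are illustrative, and the experts are plain two-layer MLPs rather than Mixtral's SwiGLU blocks):

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoEBlock(nn.Module):
        """Illustrative top-2 mixture-of-experts FFN, not the reference implementation."""
        def __init__(self, d_model=4096, d_ff=14336, n_experts=8, top_k=2):
            super().__init__()
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])
            self.router = nn.Linear(d_model, n_experts, bias=False)
            self.top_k = top_k

        def forward(self, x):                                 # x: (tokens, d_model)
            logits = self.router(x)                           # G(x) logits, one per expert
            top_vals, top_idx = logits.topk(self.top_k, dim=-1)
            gates = F.softmax(top_vals, dim=-1)               # scalar weight G(x)_i per chosen expert
            out = torch.zeros_like(x)
            for slot in range(self.top_k):                    # weighted sum of the 2 expert outputs
                for e in range(len(self.experts)):
                    mask = top_idx[:, slot] == e
                    if mask.any():
                        out[mask] += gates[mask, slot:slot+1] * self.experts[e](x[mask])
            return out
    ```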

  • @iknoorsingh7454
    @iknoorsingh7454 4 months ago +5

    Great work Yannic. Keep it up! :)

  • @erikdahlen2588
    @erikdahlen2588 4 months ago +3

    Thanks. I really like this format and the length was perfect too 😎

  • @axe863
    @axe863 24 days ago

    I love the reutilization of older modes of modeling in novel ways.

  • @stephaneduhamel7706
    @stephaneduhamel7706 4 months ago +48

    I think the concept of "clown car of experts" that hobbyists came up with from this might have some potential. It's about merging different feed-forward networks from existing pre-trained models together as experts, and just training the routing network to adapt to the experts. I played around with some, and it seems to work pretty well, much better than old-school merges.
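
    A rough sketch of that "clown car" setup, assuming you already have compatible FFN modules lifted from donor models (the function name is hypothetical):

    ```python
    import torch.nn as nn

    def build_clown_car_layer(donor_ffns, d_model):
        """Wrap FFNs taken from existing pre-trained dense models as frozen experts;
        only the router is left trainable, as the comment describes."""
        router = nn.Linear(d_model, len(donor_ffns), bias=False)
        experts = nn.ModuleList(donor_ffns)
        for p in experts.parameters():
            p.requires_grad_(False)          # keep the donor experts fixed
        return router, experts               # drop into a top-2 block like the sketch above

    # Only the router's parameters would go to the optimizer, e.g.:
    # optimizer = torch.optim.AdamW(router.parameters(), lr=1e-4)
    ```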

    • @viewspan
      @viewspan 4 months ago +11

      That is pretty close to what Mistral did for 8x7B: they took their 7B model's FFNs, duplicated each 8 times, added the routing gate, and ran gradient descent. Someone made an analysis of weight correlation that indicated that.
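
      A sketch of the kind of weight-correlation check that could back this up (the module path in the usage comment is an assumption about the Hugging Face Mixtral layout):

      ```python
      import torch
      import torch.nn.functional as F

      def expert_weight_similarity(experts):
          """Pairwise cosine similarity between flattened expert weights in one layer.
          Values near 1 would be consistent with experts initialized as copies of one FFN."""
          flat = [torch.cat([p.detach().flatten() for p in e.parameters()]) for e in experts]
          n = len(flat)
          sims = torch.zeros(n, n)
          for i in range(n):
              for j in range(n):
                  sims[i, j] = F.cosine_similarity(flat[i], flat[j], dim=0)
          return sims

      # e.g. expert_weight_similarity(model.model.layers[0].block_sparse_moe.experts)
      # (assumed transformers module path for MixtralForCausalLM)
      ```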

    • @stephaneduhamel7706
      @stephaneduhamel7706 4 months ago +9

      @@viewspan Ah, so they initialized the weights of each expert with copies of an existing foundation model, and then kept training the whole thing on generalist data. It's not quite the same as fitting already-tuned specialist models in a clown car, but close enough.

    • @TheEbbemonster
      @TheEbbemonster 4 months ago

      This is very similar to what banks do when combining logistic regressions modularly, as complex models cannot be explained 🤔

  • @siddharth-gandhi
    @siddharth-gandhi 4 months ago +30

    Thanks for your service Yannic! Any chance you'll do a video on DPO? Seems promising, would love to see your explanation/take.

    • @grokkinghumans
      @grokkinghumans 2 months ago

      Austin Deep Learning released a video on the Self-Rewarding Language Models paper a couple of weeks ago, which covers DPO. It's available on YouTube.

  • @snoosnoo6381
    @snoosnoo6381 4 months ago +8

    I was expecting to see some patterns arise where different "experts" would gain expertise in different tasks. What if, instead of using the router, the "expert" is chosen randomly? That'd clearly demonstrate whether any expertise is truly emerging.
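
    A sketch of that ablation, assuming an MoE block shaped like the earlier sketch (a `router` linear layer plus an `experts` list): feed random logits instead of the learned ones and compare perplexity against normal routing.

    ```python
    import torch
    import torch.nn.functional as F

    def forward_with_random_routing(moe_block, x, top_k=2):
        """Route each token to top_k uniformly random experts, ignoring the learned router."""
        n_experts = len(moe_block.experts)
        logits = torch.rand(x.shape[0], n_experts, device=x.device)   # random "router" scores
        top_vals, top_idx = logits.topk(top_k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(top_k):
            for e in range(n_experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot:slot+1] * moe_block.experts[e](x[mask])
        return out
    ```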

  • @theosalmon
    @theosalmon 4 months ago +9

    How much farther down the road can we go with splitting up the processing of a model? MoE allows me to run it in system RAM on CPU, 4x faster than if the processing were monolithic. I'm wondering if one should run a gargantuan model in 1 TB of RAM on an ancient surplus server, but have the model split into a couple hundred parts, only a couple of which run at once.

  • @HarshGuptahargup
    @HarshGuptahargup 4 months ago

    This is very cool, and very fast, because a lot fewer params are being used at any one time.

  • @tenaciousscaler5149
    @tenaciousscaler5149 4 months ago +8

    What do you think the training of the router might look like?

  • @mahab944
    @mahab944 4 months ago

    Thanks for the great video!
    It would also be good if you could go over the Mixtral code, or just the routing part, in a video.

  • @CalogeroZarbo
    @CalogeroZarbo 2 months ago

    Yannic, that's an amazing explanation.

  • @user-nn8lw6oj5c
    @user-nn8lw6oj5c 3 months ago

    Neat catch on G(x_i) being an n-dim output :)

  • @BobaQueenPanda
    @BobaQueenPanda 2 months ago +1

    Really sad that data is not only getting gated but now also being omitted. This will not lead to much progress outside of closed industry shops.

  • @dennisestenson7820
    @dennisestenson7820 4 months ago +13

    29:00, I think the input dataset is as important as the model design. Without good input, even the best models would fail to be interesting. With good data, even a poor model can perform in interesting ways. I totally agree that these researchers need to release their dataset, since without it they're not actually providing their methods, which is what's needed for reproducibility.

    • @Hexanitrobenzene
      @Hexanitrobenzene 4 months ago +4

      The Phi class of models from Bubeck's lab ("Textbooks Are All You Need") shows that data is probably even more important than architecture.

  • @kaikapioka9711
    @kaikapioka9711 4 months ago +1

    thx!

  • @morancium
    @morancium 4 months ago

    Awesome

  • @GrifinsBrother
    @GrifinsBrother 4 months ago

    Thanks.

  • @jonathandawson3091
    @jonathandawson3091 4 months ago

    Love it that they did not mention the data source. Yes this is bad in terms of lack of openness, but the benefits outweigh the cost. At least there is a model and a paper.

  • @user-nv6te6fw6y
    @user-nv6te6fw6y 3 months ago +1

    The video is really great. I learned a lot. You mentioned that "I've made videos in the past on mixture of experts, expert routing, etc." Could you please paste the links to those videos so that I can learn more? Thanks sooo much.

  • @SinanAkkoyun
    @SinanAkkoyun 4 months ago

    Tysm for the video! I didn't quite understand how the expert outputs get combined?

  • @wolpumba4099
    @wolpumba4099 4 months ago +4

    *Summary*
    *Introduction to Mixtral of Experts Model*
    - 0:00 Discussion about the Mixtral of Experts model, built on the Mistral 7B architecture.
    - 0:30 The paper is nicknamed "Don't Say Data" due to its lack of information on training data sources.
    *Analysis of Data Source Disclosure Trends*
    - 0:49 Observation of trends in professional criticism regarding AI training data sources.
    - 1:40 Introduction of Mistral AI, a startup with an open-source approach, and its comparison to other AI startups.
    *Overview of Mixtral Model and Its Features*
    - 2:42 Explanation that Mixtral 8x7B is a Transformer with a mixture of experts architecture.
    - 3:04 Mixtral model's performance, parameter count, and comparison with other models like Llama 2 70B and GPT 3.5.
    - 4:16 Description of expert routing in the model, allowing the use of a subset of parameters per token for optimization.
    - 5:02 Details of the model's decoder-only architecture and feature of picking from distinct parameter groups.
    *Training Data and Multilingual Pre-training*
    - 5:25 Mention of multilingual data used in pre-training the Mixtral model, without specific details on the data sources.
    *Understanding Mixture of Experts in Transformer Models*
    - 5:58 Explanation of the core components of classic Transformer models, focusing on attention and feed-forward layers.
    - 8:17 Insight into the feed-forward network's role and parameter distribution in Transformer models.
    - 11:15 Introduction to the concept of mixture of experts and its transformative effect on feed-forward networks.
    - 12:56 Explanation of sparse mixture of experts and the role of a routing neural network in the process.
    *Routing Mechanism in Mixtral Model*
    - 15:03 Explanation of the weighted sum process in routing tokens to experts.
    - 15:41 The routing network uses the input signal to determine the computation path for each token.
    - 16:03 Analogy of distributing people to jobs based on their attributes to explain the routing process.
    - 16:32 The routing function (F) is a small neural network determining the routing of tokens.
    - 17:00 Discussion of the sparse expert routing mechanism and its computational efficiency.
    *Details of the Mixture of Experts Mechanism*
    - 17:37 Absence of entropy regularization in routing, which is often found in initial mixture of experts papers.
    - 18:02 EI denotes the output of each expert, with n representing the number of experts.
    - 18:35 Clarification on a potential error in the paper regarding the output of the gating network.
    - 19:43 The gating network involves a linear feed-forward layer.
    *Model Parameterization and Efficiency*
    - 20:11 Distinction between total and active parameter count in the model.
    - 20:57 Explanation of processing each token individually through the feed-forward stage.
    - 21:45 Discussion on the active parameter count and its dependence on the number of experts considered per token.
    - 22:25 Description of expert parallelism for high throughput, involving different GPUs for each expert.
    *Experimental Results and Performance Analysis*
    - 25:03 Overview of experimental results comparing the Mixtral model with other models like Llama 2 and GPT 3.5.
    - 26:08 Discussion on dynamic selection of active parameters for each token.
    - 26:47 Results showing the model's capability in reasoning and retrieval tasks.
    - 27:24 Analysis of perplexity decrease in relation to context length, emphasizing the importance of smart context selection.
    - 28:46 Skepticism about the usefulness of bias benchmarks.
    - 28:57 Mention of supervised fine-tuning on an instruction dataset and paired feedback dataset.
    - 29:39 Commentary on the model's release under Apache License and its impact on the community.
    *Reflections on the Release and Impact of the Model*
    - 30:17 Discussion on the significance of releasing the model under a fully open license.
    - 30:39 Appreciation for the model's release strategy, highlighting its impact on the community.
    *Speculations on Business Strategy and Data Set Disclosure*
    - 30:45 Speculation on the business value or risk related to the lack of disclosure about the training dataset.
    - 31:05 Possibility that withholding dataset details might be a strategy to provoke critics or simply a choice to not disclose obvious sources.
    *Routing Analysis in Mixtral Model*
    - 31:31 Analysis of how tokens are routed to different experts in the Mixtral model.
    - 31:45 Observation that there are no obvious patterns in expert assignments based on topics.
    - 32:02 Notable regularities like consecutive tokens being assigned to the same expert and certain patterns in Python code token routing.
    - 32:15 Consideration that the routing patterns might be either non-semantic or too complex for human interpretation.
    *Conclusion and Future Outlook*
    - 33:23 Mention of additional analysis available in the paper's appendix.
    - 33:36 Positive outlook on the model's open-source release and its potential for new applications.
    - 33:55 Discussion on the non-disclosure of the training data as a potentially smart but non-scientific approach.
    - 34:12 Invitation for feedback on the best applications of Mixtral and anticipation for future open-source AI developments.
    Disclaimer: I used ChatGPT-4 to summarize the video transcript. This method may make mistakes in recognizing words.

  • @MasamuneX
    @MasamuneX 4 months ago +2

    It's almost like your brain has areas that are good at some things and bad at others, and we use different parts at different times for different tasks, instead of using the emotional part for basic math.

  • @bajdoub
    @bajdoub 4 months ago +5

    I find the "8 experts" terminology and "8x7B" notation quite confusing and misleading. I cannot say how many professional practitioners think it's "8 expert models" collaborating in an ensemble-like way. It's actually 8 expert modules *per layer*, and there are 32 layers, so a total of 32×8 = 256 independent "experts", not 8 experts. Plus, if you really think of an end-to-end processing/activation path as one "expert", each token can take (8 choose 2) = 28 possible expert pairs per layer, and there are 32 layers, so the total number of expert paths a single token can take is 28^32. So in reality, there are 28^32 end-to-end "expert" activation paths for each token. All in all, it's either 8×32 = 256 experts or 28^32 experts, but definitely not 8 experts in this model.
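
    The arithmetic checks out; a quick sanity check:

    ```python
    import math

    experts_per_layer, layers, k = 8, 32, 2
    print(experts_per_layer * layers)        # 256 expert FFN modules in total
    pairs = math.comb(experts_per_layer, k)
    print(pairs)                             # 28 possible expert pairs per layer
    print(pairs ** layers)                   # 28**32 distinct per-token routing paths
    ```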

    • @AM-yk5yd
      @AM-yk5yd 4 months ago +3

      I can say how many professional practitioners think it's "8 expert models" collaborating in an ensemble-like way.
      Zero. Not "near zero", but exactly zero professional practitioners.
      MoE layers are not a novel concept.
      Switch Transformers came out 3 years ago. It was not a novel idea then either.
      Shazeer's MoE layers paper (a major influence on Switch Transformers and other MoE papers) came out 7 years ago.
      Fast Feedforward Networks came out in August 2023, about half a year ago.
      This leaves no confusion for any professional, as they can read papers. Or for anyone who can read the Mixtral release blog ("Mixtral is a sparse mixture-of-experts network. It is a decoder-only model where the feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the “experts”) to process the token and combine their output additively." - that's 3 sentences to clear up the confusion).
      But let's assume professional practitioners are illiterate when it comes to papers and even reading three sentences is too much to ask.
      If they heard "mixture of experts" for the first time, the first thing they would ask is "why is it called a mixture rather than an ensemble, a term that probably predates my birth?" Their first assumption would be that there is maybe more to it. So they would go to the source code. The Transformers library is not exactly hard to read.
      This leaves no confusion for practitioners, as they can read (and edit) code.
      The only confusion is for people who can't read papers, can't read blogs, can't read source code and, on top of that, have never heard of MoE before.
      Bleeding-edge research should not care about wanna-be "professional practitioner" LARPers who adamantly refuse to learn, or about illiterate fake AI gurus.

  • @dibbidydoo4318
    @dibbidydoo4318 4 months ago +14

    Can you do the MoE-Mamba paper?
    MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

    • @kaikapioka9711
      @kaikapioka9711 4 months ago

      Finally someone did it?
      Edit: I just read it. I hope they try the DeepSeek architecture. I think they're referring to that in the future work section.

    • @axelmarora6743
      @axelmarora6743 17 days ago

      Moe Mamba? No
      Mo Mamba? nonono
      Moma mba? sigh

  • @nathanbanks2354
    @nathanbanks2354 4 months ago +1

    I'm glad it can be sharded (see 22:18). In my case, I was thinking of getting a couple of cheap 24GB GPUs instead of one expensive 48GB card so I can run Mixtral with 4-bit quantization at high speed on a cheaper computer. I was impressed that Mixtral ran tolerably fast using CPU only with ollama on a five-year-old quad-core laptop with 64GB of RAM. However, Mistral runs 10x as fast on the same system because it can use the 16GB GPU.
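
    For reference, a minimal sketch of the 4-bit route via transformers + bitsandbytes (the repo id is assumed; exact memory use and speed depend on your hardware and library versions):

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"   # assumed Hugging Face repo id
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb,
        device_map="auto",                  # shard across whatever GPUs/CPU RAM is available
    )
    inputs = tok("Mixture of experts means", return_tensors="pt").to(model.device)
    print(tok.decode(model.generate(**inputs, max_new_tokens=30)[0]))
    ```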

  • @vladimirtchuiev2218
    @vladimirtchuiev2218 4 months ago +1

    Having played with sparse transformers in my own chess bot: a large model only partially performs like a smaller model. You still have to keep all the weights in (GPU) memory, and that limits your batch sizes for training and context lengths for inference.

    • @phaZZi6461
      @phaZZi6461 2 months ago

      Yep, pretty sure the 8x7B requires the extra engineering (special routing, batching) to take advantage of it; otherwise it behaves more like a 56B.

  • @Timotheeee1
    @Timotheeee1 4 months ago +6

    Can you cover the paper "Exponentially Faster Language Modelling"? I think the fast feed forward has a lot of potential for decoder-only models

  • @DecentralisedGames
    @DecentralisedGames 4 months ago

    Sir, do you believe that it's more likely for someone to give up asking if they think it's you in public, due to asking so many prior mistakenly? Hence the whole purpose of the sunglasses, you're probably thinking.

  • @macadeliccc2942
    @macadeliccc2942 4 months ago +3

    I have used the clown car of experts approach to train multilingual models that evaluate pretty well

  • @aevans1645
    @aevans1645 4 months ago +1

    Doesn't "top k" cause problems for gradient descent? Make it hard to train

  • @rick-kv1gl
    @rick-kv1gl 4 months ago

    ur da boss

  • @IsaiahGossner
    @IsaiahGossner 4 months ago +3

    Thanks for the overview of this paper! I read it at one point because I was impressed with the results of the model itself, but it's always good to get a second look at it. Would that second look be a "Mixture of Experts"...?
    Regardless, for anyone looking to do some homework and followup reading:
    "Approximating Two-Layer Feedforward Networks for Efficient Transformers" has some pretty interesting findings about routing and scaling of MoE overall (apparently softmax is basically evil)
    "Efficient shallow learning mechanism as an alternative to deep learning" argues that the brain isn't really comparable to a "deep" neural network and that a "wider" network may be more ideal for complex ideas. Depending on how far one was willing to stretch it, one could argue that MoE is an extension of or step in that direction.
    I'd be really curious to see an overview of "Inferring neural activity before plasticity as a foundation for learning beyond backpropagation", though (unrelated to MoE). It's apparently an alternative to backpropagation that is more tolerant of higher learning rates and deeper networks while giving less overshoot and catastrophic forgetting. I'm having some difficulty getting through the paper because it's quite dense, but I think it's a really different look at training.

  • @user-bd8jb7ln5g
    @user-bd8jb7ln5g 4 months ago +4

    What would be great is if they released a 4x7B model focusing on reasoning, coding, math, science.

    • @awee1234
      @awee1234 3 months ago +1

      Some music expert must not be missing! But I get your point

    • @adelhassan7997
      @adelhassan7997 3 months ago +2

      Why not just have 4 fine-tunes of Mistral-7b?

    • @user-bd8jb7ln5g
      @user-bd8jb7ln5g 3 months ago

      @@adelhassan7997 Funny, over the last few days I was just thinking about a model router dynamically loading LoRA adapters to solve specific problems.

  • @lukeskywalker7029
    @lukeskywalker7029 4 months ago +4

    Please use a dark mode reader :D Thanks for all your great content though!

  • @auresdz701
    @auresdz701 4 months ago +2

    How can they backprop if they use argmax/top-k to select the experts? Weird.
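
    The top-k *indices* themselves are not differentiated; gradients flow through the softmax over the selected logits, so the router only gets a learning signal via the experts it actually picked (most implementations add an auxiliary load-balancing loss on top). A tiny autograd check with made-up shapes:

    ```python
    import torch
    import torch.nn.functional as F

    logits = torch.randn(1, 8, requires_grad=True)       # router output for one token
    top_vals, top_idx = logits.topk(2, dim=-1)           # hard selection (indices not differentiated)
    gates = F.softmax(top_vals, dim=-1)                  # differentiable mixing weights
    expert_outs = torch.randn(1, 8, 16)                  # stand-in expert outputs
    chosen = expert_outs.gather(1, top_idx.unsqueeze(-1).expand(-1, -1, 16))
    y = (gates.unsqueeze(-1) * chosen).sum(dim=1)
    y.sum().backward()
    print(logits.grad)                                   # nonzero only at the two selected experts
    ```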

  • @marsandbars
    @marsandbars 4 months ago

    I couldn't wait to try this on my 3090, but then I realized the relatively low active parameter count was essentially a performance optimization and not a RAM usage optimization. This thing loads in as an 83B parameter model; the authors proudly claim that it can load onto a single A100. Still cool they released it like they did, but you'd have to have a ton of RAM/VRAM somewhere in your system to even load this thing.

    • @keypey8256
      @keypey8256 4 months ago

      Just wondering, wouldn't it be possible to just load the router, check which experts we need and load the experts?

    • @drdca8263
      @drdca8263 4 months ago

      @@keypey8256 I imagine it is possible to do that, but swapping out which expert(s) you have loaded, for each token, *for each layer*, would be rather slow?
      But I don’t know what I’m talking about.

    • @keypey8256
      @keypey8256 4 months ago

      @@drdca8263 oh I forgot you need to do it for each token

    • @clray123
      @clray123 4 months ago +3

      A 4-bit quantized version is usable on a 3090 (with llama.cpp).

  • @InventiveDingo
    @InventiveDingo 3 months ago +1

    Thanks for the great breakdown, very interesting with lots of detail and context 👍🏻 Just want to say, about bias - it's not just the EU. Some of us really do care! Someone, rightly or not, is going to use LLMs to make decisions that affect our lives, and it'd be nice if that was as fair as possible. Perhaps it won't affect you, but some of us have good reason to think it might affect us. Peace.

  • @daviddanju8493
    @daviddanju8493 4 months ago +1

    Hey fellow followers: are there other channels like Yannic's that do high quality paper reviews?

  • @astronemir
    @astronemir 4 months ago +1

    I do wonder if this architecture creates experts in certain areas that are well covered by training but will perform really badly in novel cases.
    Can't test it, because they don't release the data sources. I've been using it for a while and running into weird issues.

    • @awee1234
      @awee1234 3 months ago +1

      As for example?

  • @scottmiller2591
    @scottmiller2591 4 months ago +1

    "Follow the science - it needs to be reproducible."
    "This is reproducible."
    "We're suing you."

  • @apoorvumang
    @apoorvumang 4 months ago +1

    11:24 "Hey, weight a second"

  • @marshallmcluhan33
    @marshallmcluhan33 4 months ago

    7:33 it's like American Football

  • @saidtaghadouini6225
    @saidtaghadouini6225 4 months ago +3

    Why is the vocab size only 32k?

    • @Neomadra
      @Neomadra 4 months ago

      I think he briefly confused the context window with the vocab size, or he just gave a random number in the same ballpark.

    • @Nutzername9898
      @Nutzername9898 4 months ago

      @Neomadra No, I think that is actually the vocabulary size.

    • @dennisestenson7820
      @dennisestenson7820 4 months ago +3

      @@Neomadra, the paper says vocab_size=32000 @ 10:00.

    • @AM-yk5yd
      @AM-yk5yd 4 months ago +2

      Probably no reason other than that the original LLaMA used it. It's actually a decrease from previous models like OPT, which used a vocab of around 50k.

  • @evennot
    @evennot 4 months ago

    It's strange that they didn't consider the transformer's states for routing. Basically, experts have no say in whether they are "confident that the input makes sense", and routing is reduced to a basic classification.
    Also, experts should "disagree" if they actually represent different aspects of the model and aren't just redundant copies.

  • @zyzhang1130
    @zyzhang1130 4 months ago +1

    The MoE architecture kinda goes against the idea of emergent intelligence, doesn't it?

    • @piratepartyftw
      @piratepartyftw 4 months ago

      Not really. It's still all one model.

    • @zyzhang1130
      @zyzhang1130 4 months ago +1

      @@piratepartyftw A specialised expert architecture outperforms one monolithic model. I'd say it does question the premise of emergence.

    • @zyzhang1130
      @zyzhang1130 2 months ago

      Update: upon watching some paper discussion recordings, it seems MoE only outperforms monolithic LLMs with the same effective parameter count (i.e., the number of active parameters at inference time). So this architecture is really just there to help reduce computational cost.

  • @john_blues
    @john_blues 4 months ago +8

    Summary: "We did some things, with some stuff. Here's the results."

    • @TheSyborgue
      @TheSyborgue 4 months ago +11

      U summed up every single research paper, hats off

  • @DanielYokomizo
    @DanielYokomizo 4 months ago +1

    It's not open source, it's freeware. There is no actual reproducible source involved anywhere. 29:43

    • @datrumart
      @datrumart 4 months ago +1

      Can you finetune freeware?

    • @clray123
      @clray123 4 months ago

      Uh, there are open source implementations of the described architecture, both in Python and in C. And the input data is published under an "open source" license, which governs what you may or may not do with this weird binary "source code" (such as republish your own version after finetuning).

  • @IvarDaigon
    @IvarDaigon 4 months ago +1

    That last part is interesting: they are unable to predict which tokens go to which "experts".
    I would have assumed that each 7B model was fine-tuned with a specific set of data relating to a particular field of knowledge.
    Maybe "experts" isn't the right term to use if they have no idea what each one is actually good at.
    Maybe they should call it Mixture of Peers.

    • @clray123
      @clray123 4 months ago

      You would have assumed wrong, and the unfortunate term "experts" is a historical accident of wishful thinking (like so many other terms in AI).

  • @unvergebeneid
    @unvergebeneid 4 months ago +2

    Can we try to find a middle ground between the "AI is evil because it's stealing from artists" and the dismissive "haters gonna hate" kind of stance taken in this video? There are legitimate concerns that need to be sorted out, both legally and ethically. To pretend that there aren't is just as disingenuous as pretending human brains aren't trained on copyrighted data.

    • @YannicKilcher
      @YannicKilcher  4 months ago +7

      I'm not saying there is no middle ground, I'm saying the grifters have now made this their main talking point. Rest assured, they do not care a bit about the actual artists.

    • @unvergebeneid
      @unvergebeneid 4 months ago +1

      @@YannicKilcher Sure, but while reacting to one extreme position by taking the extreme opposite position might be tempting, I think we should ultimately strive for nuance, even if it feels like we're the only ones. We're usually not, BTW; the extreme voices are just the loudest.

  • @ddd12343
    @ddd12343 4 months ago +5

    Is it just me, or are all these benchmarks basically useless and don't tell you how well the benchmarked models can actually help people with their everyday tasks? I tested the top scoring models a few months ago and (apart from GPT of course) they generated complete bullshit. I think what the scientific community should do now is to improve the idea of using Elo rating to score models, because it's the only way found so far to measure the actual usefulness of LLMs. LLM Arena does that, but the user interface is terrible and I don't think anyone is using it in actual daily work, which is what we want to measure.
    I think we need some open source initiative to improve this idea and finally make a proper LLM benchmark. Yannic, maybe OpenAssistant part 2? :)

  • @MADjaHEAD
    @MADjaHEAD 4 months ago +1

    Is the routing subnet G(x) trained separately or together with the rest? I'm a bit confused about how it works: if G(x) is updated as training progresses, then the route for a given token can change midway; therefore the training done before the switch would no longer be relevant. No?

    • @AM-yk5yd
      @AM-yk5yd 4 months ago +2

      Together. As Yannic said, they didn't specify it in the paper. The HF implementation of Mixtral uses the load balancer from Switch Transformers during training to penalize experts that take too much onto themselves.
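
      A simplified sketch of that Switch-Transformer-style balancing term (the real implementation also handles top-k details and attention masks):

      ```python
      import torch.nn.functional as F

      def load_balancing_loss(router_logits, n_experts=8, top_k=2):
          """Push both the mean routing probability and the actual fraction of tokens
          dispatched to each expert toward the uniform 1/n_experts."""
          probs = F.softmax(router_logits, dim=-1)              # (tokens, n_experts)
          _, top_idx = router_logits.topk(top_k, dim=-1)        # experts chosen per token
          dispatch = F.one_hot(top_idx, n_experts).float().max(dim=1).values
          tokens_per_expert = dispatch.mean(dim=0)              # fraction of tokens per expert
          prob_per_expert = probs.mean(dim=0)                   # mean router probability per expert
          return n_experts * (tokens_per_expert * prob_per_expert).sum()
      ```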

  • @crassflam8830
    @crassflam8830 4 months ago

    Hype

  • @willian_z
    @willian_z 4 months ago +4

    "bias benchmarks: who cares"

  • @gwynanjeanettecaparida4676
    @gwynanjeanettecaparida4676 4 months ago

    first

  • @MasamuneX
    @MasamuneX 4 months ago

    My name is Kanye West and I'm here to design an LSTM with the percent change of the closing value as the key predicted

  • @mattanimation
    @mattanimation 4 months ago

    We got MoE, but where are Curly and Larry?

  • @berry4862
    @berry4862 4 months ago

    You could highlight more clearly that we don't know how to train such things. We essentially don't know how biological neurons learn. Therefore this hardware will be useless for learning and prediction.

    • @clray123
      @clray123 4 months ago +1

      1. These are not biological neurons. 2. We (obviously) know how to train such (artificial) things.

  • @ernststravoblofeld
    @ernststravoblofeld 3 months ago

    Maybe they are going too far in taking design out of the process? What if you have a router that is trained, maybe by humans, to recognize topics, and then train small transformers in various useful topics? This makes the system more expandable and upgradeable.

  • @EternalKernel
    @EternalKernel 4 місяці тому +3

    Also don't forget 28:40 - Yannic Kilcher reveals that he doesn't care about biases against minorities, and that he thinks there is a "collective consciousness" that agrees with him?
    Now I feel like shit for suggesting your channel to my friends. I wish people would just admit it with like a face tat or something when they don't care about reducing racism etc.
    Please tell me I'm wrong Yannic! Tell me you do care?

    • @dropthebassesp
      @dropthebassesp 4 months ago +1

      I was also weirded out by that. It made me proud of being European after his comment lmao

    • @YannicKilcher
      @YannicKilcher  4 months ago +10

      My point was not that I don't care; my point was that the professional complainers crowd has moved on from pushing societal biases of models as their main talking point.

    • @hieroben
      @hieroben 4 months ago +1

      @@YannicKilcher But then it would actually be a good thing that Mistral AI investigates bias, because they don't follow the "professional complainers crowd" but look into problems that we experts actually (still) care about, right? ;)

    • @Phasma6969
      @Phasma6969 4 months ago +2

      No. @@hieroben

    • @Phasma6969
      @Phasma6969 4 months ago +8

      This is a channel about technology and bleeding-edge research, not politics. Please, disrespectfully shush.