02:58 Architecture Changes
04:52 Multi-head Latent Attention (MLA)
13:19 DeepSeekMoE
28:17 Multi-Token Prediction (MTP)
34:08 Other advancements
37:09 Training Infrastructure Improvements : DualPipe
52:10 Training Infrastructure Improvements : FP8
51:30 I think what this means is that the first and last layers are both on GPU:0 and GPU:8, and the middle GPUs have the middle layers, so the data can do a forward/backward pass in each direction of the pipeline.
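For anyone trying to picture it, here's a toy sketch of one way such a placement could look (not DeepSeek's actual DualPipe code; the layer count and rank count are made up), with each rank holding a chunk from the front of the model and a chunk from the back:

```python
# Toy sketch (not DeepSeek's DualPipe code): each pipeline rank gets one chunk
# of layers from the front of the model and one chunk from the back, so
# micro-batches can be fed through the pipeline from both ends.
NUM_LAYERS = 64   # hypothetical layer count
NUM_RANKS = 8     # hypothetical number of pipeline GPUs

chunk = NUM_LAYERS // (2 * NUM_RANKS)   # each rank holds two chunks

placement = {}
for rank in range(NUM_RANKS):
    front = list(range(rank * chunk, (rank + 1) * chunk))
    back = list(range(NUM_LAYERS - (rank + 1) * chunk, NUM_LAYERS - rank * chunk))
    placement[rank] = front + back

print(placement[0])   # [0, 1, 2, 3, 60, 61, 62, 63]  -> first and last layers on the same rank
print(placement[7])   # [28, 29, 30, 31, 32, 33, 34, 35]  -> the middle of the model
```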
Oh yea that makes a lot of sense!
Nice walkthrough!
Thank you!
I wonder, in MoE, when they say that they are only using a few of the experts at one time, what does that mean? Does it mean they load the expert model into RAM at that time and use it (sounds too slow to be the case), or do they route it to another GPU that has the expert preloaded?
If it's the latter, what do they do when several runs want the same expert? Or when nobody wants that expert? Is that wasted compute?
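To make the "experts preloaded on different GPUs" picture concrete, here's a toy single-process sketch (made-up sizes, not DeepSeek's dispatch code) that just groups tokens by the expert the router picked; the uneven group sizes are exactly the load-imbalance worry in the question:

```python
import torch

# Toy, single-process illustration: tokens get grouped by the expert the router
# picked for them; in an expert-parallel setup each group would be sent to the
# GPU that holds that expert's weights.
num_tokens, num_experts, top_k = 16, 4, 2
router_logits = torch.randn(num_tokens, num_experts)
weights, chosen = router_logits.softmax(-1).topk(top_k, dim=-1)   # (tokens, top_k)

for expert_id in range(num_experts):
    token_ids = (chosen == expert_id).any(dim=-1).nonzero(as_tuple=True)[0]
    # If nobody picked this expert, its GPU sits idle for the step; if everyone
    # picked it, that GPU becomes the bottleneck -- hence the balancing tricks.
    print(f"expert {expert_id}: {token_ids.numel()} tokens routed to it")
```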
Thanks
Great breakdown as always. I do have a few questions that the papers didn't clear up for me-- Why do they have multiple "shared experts?" Since they get summed element-wise, isn't that equivalent to just having one, just with a different total weight in the final sum? Or is there some nuance I'm missing there?
An interesting note: The shared experts seem oddly similar to the gating mechanism used in the "memory layers at scale" paper, just with element-wise multiplication instead of addition. I have to wonder if there would be some advantage to having both... More expressive, somehow?
Also, just because I'm thinking about this and it's bugging me: If you were to add "no-op" experts to a DeepSeekMoE layer like in the "mixture of depths" paper, that would look like an expert that always returned 0, right? Not the identity, because with the residual/skip connection we already have the identity baked in (in a way, the identity is a "shared expert"). So a "no-op" would simply represent... letting the model strategically select fewer experts?
But it also makes me think back to "memory layers at scale" and "tokenformer" papers where there were some benefits from letting the model learn hardcoded keys and/or values... If you're including a no-op, why not include some experts that aren't a layer at all, but a *learned vector* that doesn't depend on the input at all? And for that matter, why are experts always so uniform? Why not mix in some MLPs with all those linear layers for experts? Couldn't variety in expert architecture even further enhance specialization? If the balancing strategies and auxiliary losses prevent it from relying too much on any specific experts it should be able to make good use of experts with varying capacity and/or design, right?
For the first point, I would guess that it functions as a souped-up general expert that's got more parameters but can be treated as a multiple of the regular experts, which makes coding it in PyTorch or whatever easier.
In practice, they only use one shared expert. Since each expert is an MLP (ff --> act --> ff) and not a linear function, summing several shared experts isn't literally the same as having just one, though in practice I doubt using more than one will help performance much.
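For concreteness, here's a minimal toy sketch of the shared-expert idea (my own code, not the DeepSeek implementation): the shared MLP runs on every token and its output is added to the weighted sum of the routed experts, which is also why two nonlinear shared experts wouldn't be redundant the way two shared linear layers would:

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Toy DeepSeekMoE-style layer: shared expert(s) always run, routed experts are top-k."""
    def __init__(self, d_model=64, d_hidden=128, n_routed=8, top_k=2, n_shared=1):
        super().__init__()
        mlp = lambda: nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))
        self.shared = nn.ModuleList([mlp() for _ in range(n_shared)])
        self.experts = nn.ModuleList([mlp() for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        shared_out = sum(s(x) for s in self.shared)     # shared expert(s) see every token
        w, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)   # (tokens, top_k)
        routed_out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):   # naive loops; real code uses gather/scatter
                mask = idx[:, k] == e
                if mask.any():
                    routed_out[mask] += w[mask, k, None] * expert(x[mask])
        return shared_out + routed_out                  # MLPs are nonlinear, so two shared experts != one

print(ToyMoE()(torch.randn(10, 64)).shape)   # torch.Size([10, 64])
```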
I don't think adding a "memory layers at scale" layer would help much: that paper basically has an MLP that only takes a subset of the output functions, and since we are already getting sparsity from MoE, a memory layer is just another way of only using a subset of the allocated params, just less expressive than MoE.
I do wonder if adding a no-op would help the model out (which would be an expert that just returns zero)? Or perhaps k no-op experts; the model is forced to choose k experts and cannot do the identity unless those experts sum to zero. I think the problem would come down to balancing: how would you weight this no-op expert? It could be as simple as not adding a bias. Additionally, the weight for this term is a bit strange. Say we chose the no-op with a weight of 0.1; then the sum consists of everything else plus 0.1*0. I'm thinking this will lead to problems, (1) because there is no gradient signal for this expert, and (2) because the weighted sum no longer has a cumulative weight total of 1. Could be worth a try since it probably isn't too difficult to add!
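To make the gradient point concrete, here's a tiny toy sketch (not from the paper) of a zero-output no-op expert: whatever routing weight it receives, its contribution is zero, and the weights on the real experts no longer sum to 1:

```python
import torch
import torch.nn as nn

# Toy illustration: a "no-op" expert that always outputs zero.
class NoOpExpert(nn.Module):
    def forward(self, x):
        return torch.zeros_like(x)          # 0.1 * 0 = 0, contributes nothing

experts = nn.ModuleList([nn.Linear(8, 8), nn.Linear(8, 8), NoOpExpert()])
gate = torch.tensor([0.6, 0.3, 0.1])        # say the router picked the no-op with weight 0.1

x = torch.randn(4, 8)
out = sum(w * e(x) for w, e in zip(gate, experts))
out.sum().backward()

print(experts[0].weight.grad is not None)   # True: the real experts get a gradient
print(gate[:2].sum())                       # tensor(0.9000): effective weight on real experts < 1
```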
The "memory layers at scale" and "tokenformer" papers were basically MLPs that only took a subset of the output functions. A dense version of this is just an MLP which we get with the shared experts. I think adding vectors that are just learned like in the tokenformer and memory layers papers is good in terms of benchmarks, but in reality it feels liek the model is just over fitting and storing data.
I think the experts are so uniform just because it's easy and it works. I think it would be really interesting to have a variety of experts: for example, maybe one is SwiGLU while another is just an affine transform, and another is a SiLU MLP. In practice I am unsure if this would be better, but I would be really interested to see what results from a mix of expert types. Since they don't use an auxiliary loss, some experts should still be utilized more than others, which is a good thing since we want specialization, just not one expert that always has tokens sent to it, since that would defeat the purpose of MoE. Looking at the output scores could tell us what type of MLP the model likes most.
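Since it's cheap to prototype, here's a rough toy sketch (nothing from the paper; all sizes made up) of a mixed pool of expert types, with a usage count as the "look at the output scores" check:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 32

class SwiGLU(nn.Module):                    # one possible expert type
    def __init__(self, d, h=64):
        super().__init__()
        self.gate, self.up, self.down = nn.Linear(d, h), nn.Linear(d, h), nn.Linear(h, d)
    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

experts = nn.ModuleList([
    SwiGLU(d),                                                      # SwiGLU expert
    nn.Linear(d, d),                                                # plain affine expert
    nn.Sequential(nn.Linear(d, 64), nn.SiLU(), nn.Linear(64, d)),   # SiLU MLP expert
])
router = nn.Linear(d, len(experts), bias=False)

x = torch.randn(128, d)
choice = router(x).argmax(-1)               # top-1 routing, just for the illustration
usage = torch.bincount(choice, minlength=len(experts))
print(usage)   # how often the (untrained) router picks each expert type
```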
Hope this answers all your questions!
@gabrielmongaras thanks! My comment was pretty stream-of-consciousness, so I appreciate the thorough response. I feel like a mixed-architecture MoE could be a fun way to see different types of expert "compete" with one another. If we pit KAN experts against MLP experts, which does the model prefer to use, and does diversity like that help or hinder? Could be a fun experiment. Someday soon I hope to have the hardware to test that out myself haha.
I do agree the no-op expert might lead to vanishing gradients to the experts, since the weight of the other inputs wouldn't sum to 1 anymore-- if you used a learned vector instead of 0, that provides somewhere for the gradient to go, but as you said makes the model very vulnerable to memorization. Maybe a learnable scalar? I'll keep thinking.
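A learned vector in that slot is basically just an nn.Parameter, so the "somewhere for the gradient to go" version is tiny to sketch (toy code, purely illustrative):

```python
import torch
import torch.nn as nn

class LearnedVectorExpert(nn.Module):
    """Toy 'expert' that ignores its input and returns a single learned vector."""
    def __init__(self, d):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(d))   # could just as well be a learnable scalar

    def forward(self, x):                       # x: (tokens, d)
        return self.v.expand_as(x)              # same vector for every token

expert = LearnedVectorExpert(8)
out = 0.1 * expert(torch.randn(4, 8))           # routing weight of 0.1, like the no-op example
out.sum().backward()
print(expert.v.grad)                            # non-zero: the gradient has somewhere to go
```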
Fascinating stuff all around.
I think it would be cool to see KAN vs MLP experts and what the model ends up preferring! Too bad I don't have 1000 gpus to test this.
Can you do the same for Zero and R1?
Who are you?
The scare is that it's China lol.