I like this part of the internet
you from tpot?
Yes I agree
Read the paper and then saw your flowchart. This really helped a lot in understanding the workflow. Thanks.
Glad it was! Don't forget to check out the other videos in the description to get the full context!
@@deeplearningexplained Yannic's video really helps with the RL part. Thanks for the recommendation.
That map is lit! It's easy to follow the big picture.
Yes, the map should have been included directly in the paper.
Would have made this already great paper awesome.
This was very well explained from a layperson’s perspective. I’m not an expert in this field, but very curious about it. You did a great job breaking things down, and I’m excited to go back to the paper and read it, and maybe understand more of it. Like others have mentioned, I’d love to see you explain the formula with code, so I could follow along at home. Cheers!
I'm glad you found it useful!
I've finished a formula and code walk through of GRPO over here:
ua-cam.com/video/Yi1UCrAsf4o/v-deo.html
I'm using HuggingFace implementation of it with their GRPOTrainer!
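If you want to follow along at home before watching, here is a minimal sketch of what a GRPO run looks like with TRL's GRPOTrainer (the prompts, the brevity reward and the Qwen checkpoint below are placeholder assumptions on my part, not the exact setup from the walkthrough):

```python
# Minimal sketch of GRPO fine-tuning with HuggingFace TRL.
# Assumptions: trl and datasets are installed; dataset, reward and model are toy placeholders.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a dataset with a "prompt" column.
train_dataset = Dataset.from_dict(
    {"prompt": ["What is 2 + 2?", "What is 15 * 7?"]}
)

def brevity_reward(completions, **kwargs):
    # Placeholder rule-based reward: slightly favor shorter completions.
    return [-len(c) / 100.0 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # any causal LM checkpoint (placeholder)
    reward_funcs=brevity_reward,          # rule-based reward, no learned reward model needed
    args=GRPOConfig(output_dir="grpo-demo", num_generations=4),
    train_dataset=train_dataset,
)
trainer.train()
```

The key point is that the reward can be any plain Python function scoring the sampled completions, which is what makes rule-based rewards like the ones in the paper easy to plug in.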
Thank you! This is extremely helpful. Now binging through all of your videos!
I'm glad you found it helpful!🌹
This waaaaas awesome, I'm learning ML and you break things down soo well, thank you.
Hey thanks for the kind feedback! I’m glad the content was useful :)
Valar Morghulis. Thanks for explaining this paper! I will have a paper reading session about DeepSeek R1 in my office this coming Friday. This really helps to understand it better.
Hope your reading session is fruitful!
Valar morghulis!
the road map is neat!
Amazing video 👏🗣️ just subscribed, and totally looking forward to watching many many more of your videos!!
Hey there, thanks for the kind words and glad to have you as a subscriber! 🌹
Precious analysis material from the early days of humans taming LLMs.
Great and simple explanation!
I don't comment on YouTube often, but I just discovered your channel and it's impeccable. Subscribed right away; keep it up, this is great
Ah thanks, that's super kind. I'm really glad the videos are useful!
Excellent breakdown
Great explainer video! Thanks! The reason smaller models don’t gain as much from RL compared to larger models is that they lack the “capacity” (the number of parameters needed) to model reasoning
Very interesting thought! The weird bit is that they do gain this capacity with the same number of parameters when they are fine-tuned through distillation!
@deeplearningexplained good point! Almost like the small models have trouble discovering the reasoning by themselves but can easily replicate it once discovered. I think it has to do with the fact that bigger, overparameterized models have a higher probability of developing additional subnetwork representations, i.e. extra capacity for discovery. Then the smaller model can use heuristics or simpler principles to replicate it.
I like that interpretation. Generally, larger models seem to behave a tad differently than smaller models in terms of emergent capabilities. They even do in-context learning differently than smaller models and are able to "learn on the fly".
Best discovery in terms of LLM channels in a while! Great content!
Ah thanks, very kind of you!
You are the only one on youtube explaining the maths behind this monster AI, what in the world??
There are actually two that I found that do this quite well, check them out:
📌 ua-cam.com/video/XMnxKGVnEUc/v-deo.html&ab_channel=UmarJamil
📌 ua-cam.com/video/bAWV_yrqx4w/v-deo.html&ab_channel=YannicKilcher
Thank you so much!! Super helpful
Excellent break down.
Thanks, glad it’s useful!
a big W for Yachine
Great video! Any editor / tool recommendations for reading papers? The one you have here looks great!
Hey thanks!
The one I'm using is TLDRAW, it's a very simple whiteboard that I can draw on.
Other than that I'm using the Firefox reader.
I was waiting for it
Don't forget to check out these two other videos for complementary understanding:
📌 ua-cam.com/video/XMnxKGVnEUc/v-deo.html&ab_channel=UmarJamil
📌 ua-cam.com/video/bAWV_yrqx4w/v-deo.html&ab_channel=YannicKilcher
@ Actually I'm halfway through the second video you proposed 😂😂
@ haha keep watching it! :)
Thank you brother..... we enjoyed it
Glad you did :)
Thanks for the explanation! Great video!
Amazing explanation. Thank you
@10:02 Are you sure? All other terms are positive, and this KL divergence is negative, so when we minimize the loss, this divergence actually goes up, so it seems to me that it encourages the model to be different from the reference model.
Great question, it's maximizing the objective function, not minimizing it: "[...] optimizes the policy model 𝜋𝜃 by maximizing the following objective."
The min is for choosing between the unclipped ratio*advantage and the clipped ratio*advantage.
@@deeplearningexplained Sorry, yes you are right, it's not minimizing a loss function, it's maximizing the objective function. I was wrong, please disregard my earlier wording.
@@DigitalAlligator no worries, thank you for your question because I also got confused the first time I read it.
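For anyone else who trips on this, here is the objective as I read it in the paper (my own transcription, so double-check it against the GRPO equations in the paper):

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}_{q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
\Bigg[ \frac{1}{G} \sum_{i=1}^{G}
\Big( \min\Big( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\;
\mathrm{clip}\Big( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \Big) A_i \Big)
- \beta\, \mathbb{D}_{\mathrm{KL}}\big( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big) \Big) \Bigg],
\qquad
A_i = \frac{r_i - \mathrm{mean}(\{r_1, \ldots, r_G\})}{\mathrm{std}(\{r_1, \ldots, r_G\})}
```

Since the KL term enters with a minus sign, maximizing the objective pushes the KL down, which is what keeps the policy close to the reference model.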
Congratulations!
best explanation
Thanks a lot!! Good explanation.
I have a comp Eng undergrad degree, but what kind of math do I need to learn to be able to make sense of these math formulas? They are complete Greek to me. What am I missing :(
Awesome question. They are Greek to you because you read the Greek letters in your head, not what they actually mean.
Check out this video I made on how to read deep learning math (or other very dense math) easily: ua-cam.com/video/YXWxVxQ6AeY/v-deo.html
I assume the harmlessness weights are how it censors certain topics, like a certain place on a specific date.
Yes, this is strongly implied in the paper, where the harmlessness training hurts performance on a Chinese benchmark.
Are the policy updates updating a separate policy NN or directly the parameters of the underlying pretrained model?
great video, thanks!
Thanks for the wonderful explanation. What paper reader are you using?
Thanks, I'm using TLDRAW!
Sick!
I have a question about "We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples." At this step, is the DeepSeek-V3-Base being fine-tuned the original DeepSeek-V3-Base, or the DeepSeek-V3-Base after Cold Start and Reasoning-oriented Reinforcement Learning?
I asked the above question to DeepSeek, and it replied as below:
At the step where it mentions "We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples," the model being fine-tuned is not the original DeepSeek-V3-Base. Instead, it is the DeepSeek-V3-Base after the Cold Start and Reasoning-oriented Reinforcement Learning stages.
I don't know if the answer is correct. Could you help double-check? Thanks a lot!
Very good question, it’s the original DeepSeek-V3-Base! It’s quite confusing.
At this point, all other models used in the R1 path were used to create the 800K dataset in some sort of way.
Thank you very much
Your map is superb! Where can I download it? Thanks.
It is really useful. It's not mine though, I found it here: www.reddit.com/r/LocalLLaMA/comments/1i66j4f/deepseekr1_training_pipeline_visualized/
Could I get your picture of the R1 training workflow graph?
For sure, I found it over here: www.reddit.com/r/LocalLLaMA/comments/1i66j4f/deepseekr1_training_pipeline_visualized/
Thank you for the explanation. So, one can't do distillation from OpenAI without access to their models' logits.
No, they can't, but they can use OpenAI to generate reasoning and non-reasoning data, which, as we have seen in the paper, is an important step in the R1 pipeline.
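To make that concrete, here is a rough sketch of that kind of data distillation: collect (prompt, teacher-output) pairs through the API, then run plain supervised fine-tuning on them. The teacher and student model names, prompts and output directory are made-up placeholders, not what DeepSeek actually did:

```python
# Rough sketch of distillation without logits: supervised fine-tuning on
# teacher-generated text. Model names, prompts and paths are illustrative only.
from datasets import Dataset
from openai import OpenAI
from trl import SFTConfig, SFTTrainer

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompts = ["Prove that the sum of two even numbers is even."]

rows = []
for p in prompts:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder teacher model
        messages=[{"role": "user", "content": p}],
    )
    # We only ever see the generated text, never the teacher's logits.
    rows.append({"prompt": p, "completion": resp.choices[0].message.content})

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",    # student model (placeholder)
    train_dataset=Dataset.from_list(rows),  # prompt/completion pairs from the teacher
    args=SFTConfig(output_dir="distill-demo"),
)
trainer.train()
```

So the "distillation" here is just cross-entropy training on the teacher's text, which is why access to the logits isn't required.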
Great video, thanks.
Now I believe people in China are indeed good at math and very generous.
Is the GRPO reward for any text between the tags? Is it applied after the whole sequence is generated, or as soon as text appears in that area?
Great question. The details are a bit vague on that front for the formatting, but I believe it's a full reward loop.
I.e., you need an answer that is verifiable so the reward signal can be propagated back to the whole sequence at the same time.
Some bits of the reward pertain to what's in the think tag (like the consistency and formatting rewards). Others, like the accuracy reward, check the answer only.
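For the curious, here is a hypothetical sketch of what such rule-based rewards could look like; the paper doesn't publish the exact rules, so the <think>/<answer> template and the scoring below are my own assumptions:

```python
# Hypothetical sketch of two rule-based rewards in the spirit of the paper
# (format + accuracy). The template regexes and scores are assumptions.
import re

def format_reward(completion: str) -> float:
    # Reward completions that follow a <think>...</think><answer>...</answer> template.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    # Only the final answer is checked, not the reasoning inside the think tags.
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

completion = "<think>2 + 2 = 4</think> <answer>4</answer>"
print(format_reward(completion), accuracy_reward(completion, "4"))  # 1.0 1.0
```

Both scores are computed once the whole sequence is available, which matches the "full reward loop" reading above.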
How good is the 1.5B version at, let's say, top high-school math?
Thanks for the overview, I was too lazy to read it 😝
Glad it was useful! Do read the paper though, it’s quite well written!
🐐
I wonder if mixing languages allows it to think in subtle, slightly different ways belonging to different cultures/languages? And that's why aligning it to stick to one language resulted in a slight degradation in performance.
That part is one of the most fascinating. I think it has to do with how the knowledge is encoded within its weights.
A concept might be easier for the model to reach with some token sequences in Chinese, while others might be easier in English.
It wouldn't surprise me if some of the token sequences aren't even readable, but are rather broad ideas stitched together.
Please do the detailed math walkthrough on paper like you suggested.
Yes, I’m preparing a detailed breakdown of GRPO and I’ll try to get some code to follow along too.
Kache - yacineMTB?
Haha no, I’m a Yacine, but a different one!
is this kache on X?
Haha no, I’m a different Yacine 😅
Can you share the chart in the video?
Yes for sure, it’s here:
www.reddit.com/r/LocalLLaMA/comments/1i66j4f/deepseekr1_training_pipeline_visualized/
Your channel's name fits quite well with the new AI model, coincidence?
haha, yes we both are deep!
Thanks a lot! Don't forget to make some nice videos in French for us too.
Haha, I'll try! Most of my audience is English-speaking, but I haven't forgotten my French and Quebecois viewers!
Yacine, would it be possible to get on a zoom call with you to discuss about AI research?
Hey there, for sure.
Shoot me an email at mail@yacinemahdid.com and I'll organize it.
@ thanks will do.
Wow, I didn't know Jon Snow was also an AI expert
😂😂😂
Well, eh... I can only read a, b, c. 😅
Haha yeah it’s a bit difficult to read the GRPO formula.
If you are interested in improving your math reading skills, I've got a video that covers the technique I use for complicated formulas:
m.ua-cam.com/video/YXWxVxQ6AeY/v-deo.html
Lazy video: if you want to teach, translate the formula into code to demonstrate it
That’s a good idea, thanks for the feedback!