DeepSeek R1 Theory Overview | GRPO + RL + SFT

  • Published Feb 7, 2025

COMMENTS • 103

  • @danielhemmati
    @danielhemmati 8 days ago +89

    I like this part of the internet

  • @adityavipradas3252
    @adityavipradas3252 4 days ago +8

    Read the paper and then saw your flowchart. This really helped a lot in understanding the workflow. Thanks.

    • @deeplearningexplained
      @deeplearningexplained  3 days ago +2

      Glad it was! Don't forget to check out the other videos in the description for the full context!

    • @adityavipradas3252
      @adityavipradas3252 1 day ago +1

      @@deeplearningexplained Yannic's video really helps with the RL part. Thanks for the recommendation.

  • @sheldonsebastian7232
    @sheldonsebastian7232 8 days ago +13

    That map is lit! It's easy to follow the big picture.

    • @deeplearningexplained
      @deeplearningexplained  8 days ago +3

      Yes, the map should have been included directly in the paper.
      Would have made this already great paper awesome.

  • @Maicolacola
    @Maicolacola 2 days ago +1

    This was very well explained from a layperson’s perspective. I’m not an expert in this field, but very curious about it. You did a great job breaking things down, and I’m excited to go back to the paper and read it, and maybe understand more of it. Like others have mentioned, I’d love to see you explain the formula with code, so I could follow along at home. Cheers!

    • @deeplearningexplained
      @deeplearningexplained  2 days ago

      I'm glad you found it useful!
      I've finished a formula and code walk-through of GRPO over here:
      ua-cam.com/video/Yi1UCrAsf4o/v-deo.html
      I'm using the Hugging Face implementation of it with their GRPOTrainer!
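      For anyone who wants to try it before watching, here's a minimal sketch of what that looks like with trl's GRPOTrainer (the model name, dataset, and length-based reward are placeholders I picked for illustration, and argument names can shift between trl versions):

          # pip install trl datasets
          from datasets import load_dataset
          from trl import GRPOConfig, GRPOTrainer

          # Any dataset with a "prompt" column works; this one is just an example.
          dataset = load_dataset("trl-lib/tldr", split="train")

          # Toy rule-based reward: favor completions close to 200 characters.
          def reward_len(completions, **kwargs):
              return [-abs(200 - len(c)) for c in completions]

          args = GRPOConfig(output_dir="grpo-demo", logging_steps=10)
          trainer = GRPOTrainer(
              model="Qwen/Qwen2-0.5B-Instruct",  # small model so it runs on modest hardware
              reward_funcs=reward_len,
              args=args,
              train_dataset=dataset,
          )
          trainer.train()

      Swap the toy reward for your own verifiable reward function(s) and you get the basic GRPO training loop.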

  • @ansitun
    @ansitun 2 days ago +1

    Thank you! This is extremely helpful. Now binging through all of your videos!

  • @Dresstosweatdottv
    @Dresstosweatdottv 6 days ago +4

    This was awesome, I'm learning ML and you break things down so well, thank you.

    • @deeplearningexplained
      @deeplearningexplained  6 days ago

      Hey thanks for the kind feedback! I’m glad the content was useful :)

  • @dheocahyo7721
    @dheocahyo7721 4 days ago +1

    Valar Morghulis. Thanks for explaining this paper! I will have a paper reading session about DeepSeek R1 in my office this coming Friday. This really helps to understand it better.

  • @jiaxinkou5654
    @jiaxinkou5654 6 days ago +7

    The road map is neat!

  • @fsaudm
    @fsaudm 5 days ago +3

    Amazing video 👏🗣️ just subscribed, and totally looking forward to watching many many more of your videos!!

    • @deeplearningexplained
      @deeplearningexplained  5 days ago

      Hey there, thanks for the kind words and glad to have you as a subscriber! 🌹

  • @kylezou7040
    @kylezou7040 2 days ago +2

    Precious analysis material from the early days of humans taming LLMs.

  • @AshaVishwanathan-u9o
    @AshaVishwanathan-u9o 2 days ago +1

    Great and simple explanation!

  • @ElianHerby
    @ElianHerby 7 days ago +4

    I don't comment often on YouTube, but I just discovered your channel and it's impeccable. Subscribed right away, keep it up, it's great.

    • @deeplearningexplained
      @deeplearningexplained  6 days ago

      Ah, thanks, that's really kind. I'm glad the videos are useful!

  • @stephanembatchou5300
    @stephanembatchou5300 4 days ago +1

    Excellent breakdown

  • @continuallearning0
    @continuallearning0 5 days ago +6

    Great explainer video! Thanks! The reason why the smaller models don't gain as much from RL compared to larger models is that they lack the "capacity" (number of parameters needed) to model reasoning.

    • @deeplearningexplained
      @deeplearningexplained  5 days ago +3

      Very interesting thought! The weird bit is that they do gain this capacity with the same number of parameters when they are fine-tuned through distillation!

    • @continuallearning0
      @continuallearning0 5 days ago +3

      @deeplearningexplained Good point! Almost like the small models have trouble discovering the reasoning by themselves but can easily replicate it once discovered. I think it has to do with the fact that bigger, overparameterized models have a higher probability of developing additional subnetwork representations, extra capacity to discover with. The smaller model can then use heuristics or simpler principles to replicate it.

    • @deeplearningexplained
      @deeplearningexplained  4 days ago +1

      I like that interpretation. Generally, larger models seem to behave a tad differently than smaller models in terms of emergent capabilities. They even do in-context learning differently than smaller models and are able to "learn on the fly".

  • @MrMoonsilver
    @MrMoonsilver 5 days ago +3

    Best discovery in terms of LLM channels in a while! Great content!

  • @ben8718
    @ben8718 6 days ago +2

    You are the only one on YouTube explaining the maths behind this monster AI, what in the world??

    • @deeplearningexplained
      @deeplearningexplained  4 days ago +1

      There are actually two that I found that do this quite well, check them out:
      📌 ua-cam.com/video/XMnxKGVnEUc/v-deo.html&ab_channel=UmarJamil
      📌 ua-cam.com/video/bAWV_yrqx4w/v-deo.html&ab_channel=YannicKilcher

  • @GradientChunk
    @GradientChunk 6 days ago +2

    Thank you so much!! Super helpful

  • @philtrem
    @philtrem 6 days ago +3

    Excellent breakdown.

  • @clipstok788
    @clipstok788 8 days ago +7

    A big W for Yacine

  • @albitaulla1448
    @albitaulla1448 5 days ago +4

    Great video! Any editor/tool recommendations for reading papers? The one you have here looks great!

    • @deeplearningexplained
      @deeplearningexplained  5 days ago

      Hey, thanks!
      The one I'm using is TLDRAW, a very simple whiteboard that I can draw on.
      Other than that, I'm using the Firefox reader.

  • @ELum6perML-d4e
    @ELum6perML-d4e 8 days ago +4

    I was waiting for it

    • @deeplearningexplained
      @deeplearningexplained  8 days ago

      Don't forget to check out these two other videos for complementary understanding:
      📌 ua-cam.com/video/XMnxKGVnEUc/v-deo.html&ab_channel=UmarJamil
      📌 ua-cam.com/video/bAWV_yrqx4w/v-deo.html&ab_channel=YannicKilcher

    • @ELum6perML-d4e
      @ELum6perML-d4e 8 days ago +1

      @ Actually I'm halfway through the second video you proposed 😂😂

    • @deeplearningexplained
      @deeplearningexplained  7 days ago

      @ haha keep watching it! :)

  • @eagle43257
    @eagle43257 6 days ago +2

    Thank you brother..... we enjoyed it.

  • @AntonioMartinezRamirez85
    @AntonioMartinezRamirez85 6 days ago +1

    Thanks for the explanation! Great video!

  • @teddyperera8531
    @teddyperera8531 6 days ago +2

    Amazing explanation. Thank you

  • @DigitalAlligator
    @DigitalAlligator 5 days ago +2

    @10:02 Are you sure? All other terms are positive, and this KL divergence is negative, so when minimizing the loss, this divergence actually goes up, so it seems to me that it encourages the model to be different from the reference model.

    • @deeplearningexplained
      @deeplearningexplained  5 days ago +1

      Great question, it's maximizing the objective function, not minimizing it: "[...] optimizes the policy model 𝜋𝜃 by maximizing the following objective."
      The min is for choosing either the normal policy*advantage or the clipped policy*advantage.
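      For reference, this is roughly how the objective reads in the paper (my own LaTeX transcription, so the notation may differ slightly from the original):

          \mathcal{J}_{\mathrm{GRPO}}(\theta) =
            \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
            \left[ \frac{1}{G} \sum_{i=1}^{G} \left(
              \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\
                \mathrm{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\ 1-\varepsilon,\ 1+\varepsilon \right) A_i \right)
              - \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right)
            \right) \right],
          \qquad
          A_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}

      Since the whole expression is maximized, the -β·D_KL term is what keeps the policy close to the reference model rather than pushing it away.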

    • @DigitalAlligator
      @DigitalAlligator 4 days ago +1

      @@deeplearningexplained Sorry, yes you are right, this is not minimizing a loss function, it is maximizing the objective function. I was wrong, please correct my words.

    • @deeplearningexplained
      @deeplearningexplained  4 days ago

      @@DigitalAlligator No worries, thank you for your question, because I also got confused the first time I read it.

  • @jairguedesferreira
    @jairguedesferreira 7 days ago +1

    Congratulations!

  • @fintech1378
    @fintech1378 7 days ago +3

    Best explanation

  • @lojian
    @lojian 4 days ago +1

    Thanks a lot!! Good explanation.

  • @mostinho7
    @mostinho7 7 days ago +11

    I have a comp Eng undergrad degree, but what kind of math do I need to learn to be able to make sense of these math formulas? They are complete Greek to me. What am I missing :(

    • @deeplearningexplained
      @deeplearningexplained  7 days ago +11

      Awesome question, they are Greek to you because you say the Greek letters in your head, not what they actually mean.
      Check out this video I made on how to read deep learning math (or other very dense math) easily: ua-cam.com/video/YXWxVxQ6AeY/v-deo.html

  • @JohnNauman
    @JohnNauman 3 days ago +1

    I assume the harmlessness weights are how it censors certain topics, like a certain place on a specific date.

    • @deeplearningexplained
      @deeplearningexplained  3 days ago

      Yes, this is strongly implied in the paper, where the harmlessness training is hurting a Chinese benchmark.

  • @krpcannon123
    @krpcannon123 1 day ago

    Are the policy updates updating a separate policy NN or directly the parameters of the underlying pretrained model?

  • @owenbianchi6729
    @owenbianchi6729 6 days ago +1

    Great video, thanks!

  • @jasper4803
    @jasper4803 5 days ago +3

    Thanks for the wonderful explanation. What paper reader are you using?

  • @Clipaholick
    @Clipaholick 8 days ago +2

    Sick!

  • @祖国翔
    @祖国翔 7 days ago +3

    I have a doubt related to "We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples." At this step, is the DeepSeek-V3-Base being fine-tuned the original DeepSeek-V3-Base, or the DeepSeek-V3-Base after Cold Start and Reasoning-oriented Reinforcement Learning?
    I asked the above doubt to DeepSeek, and it replied as below:
    At the step where it mentions "We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples," the model being fine-tuned is not the original DeepSeek-V3-Base. Instead, it is the DeepSeek-V3-Base after the Cold Start and Reasoning-oriented Reinforcement Learning stages.
    I don't know if the answer is correct. Could you help double check? Thanks a lot!

    • @deeplearningexplained
      @deeplearningexplained  6 days ago +1

      Very good question, it's the original DeepSeek-V3-Base! It's quite confusing.
      At this point, all the other models used in the R1 path were used to create the 800K dataset in some way.
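      If it helps to see the step concretely, here is a minimal sketch of that SFT stage with trl's SFTTrainer (my own illustration, not DeepSeek's training code: the dataset name is a placeholder standing in for the ~800k curated samples, and I use a small stand-in model instead of DeepSeek-V3-Base so it actually runs):

          # pip install trl datasets
          from datasets import load_dataset
          from trl import SFTConfig, SFTTrainer

          # Placeholder dataset with a "text" column, standing in for the curated
          # ~800k reasoning + non-reasoning samples from the paper.
          dataset = load_dataset("my-org/curated-800k-sft", split="train")

          args = SFTConfig(
              output_dir="sft-demo",
              num_train_epochs=2,  # the paper fine-tunes for two epochs
          )
          trainer = SFTTrainer(
              model="Qwen/Qwen2-0.5B",  # small stand-in; the paper starts from the original DeepSeek-V3-Base
              args=args,
              train_dataset=dataset,
          )
          trainer.train()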

  • @SevenErhan
    @SevenErhan 5 days ago +1

    Thank you very much

  • @caseyyeow1649
    @caseyyeow1649 2 days ago +1

    Your map is superb! Where can I download it? Thanks.

    • @deeplearningexplained
      @deeplearningexplained  2 days ago

      It is really useful! It's not mine though, I found it here: www.reddit.com/r/LocalLLaMA/comments/1i66j4f/deepseekr1_training_pipeline_visualized/

  • @enlyly7510
    @enlyly7510 2 days ago +1

    Could I get your picture of the R1 training workflow graph?

    • @deeplearningexplained
      @deeplearningexplained  2 days ago

      For sure, I found it over here: www.reddit.com/r/LocalLLaMA/comments/1i66j4f/deepseekr1_training_pipeline_visualized/

  • @quippy8402
    @quippy8402 5 days ago +1

    Thank you for the explanation. So, one can't do distillation from OpenAI without access to their models' logits.

    • @deeplearningexplained
      @deeplearningexplained  5 days ago +2

      No, they can't, but they can use OpenAI to generate reasoning and non-reasoning data, which, as we have seen in the paper, is an important step in the pipeline for R1.
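      To make the distinction concrete, here is a small sketch (my own illustration, not anything from the paper): classic soft-label distillation needs the teacher's full logits, which a closed API doesn't return, so what's left is fine-tuning on the teacher's generated text instead.

          import torch.nn.functional as F

          def soft_label_distillation_loss(student_logits, teacher_logits, T=2.0):
              # KL divergence between teacher and student distributions over the
              # vocabulary. This is what requires access to the teacher's logits.
              log_p_student = F.log_softmax(student_logits / T, dim=-1)
              p_teacher = F.softmax(teacher_logits / T, dim=-1)
              return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

          # Without logits, "distillation" becomes supervised fine-tuning on the
          # teacher's sampled outputs (hard labels), which is the same
          # SFT-on-generated-samples recipe the paper uses to distill R1 into
          # smaller models.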

  • @SuperSoliton
    @SuperSoliton 3 days ago +1

    Great video, thanks.
    Now I believe people in China are indeed good at math and very generous.

  • @johngrabner
    @johngrabner 7 days ago +2

    Is the GRPO reward for any text between the tags? Is this done after the whole sequence is generated, or as soon as text appears in that area?

    • @deeplearningexplained
      @deeplearningexplained  7 days ago

      Great question, the details are a bit vague on that front for the formatting, but I believe it's a full reward loop.
      That is, you need an answer that is verifiable for the reward signal to be propagated back to the whole sequence at the same time.
      Some bits of the reward pertain to what's in the think tag (like the consistency and formatting rewards). Others, like accuracy, check the answer only.
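      As a rough illustration (my own sketch, not the paper's actual implementation), a rule-based reward along those lines can only be scored once the whole completion is available:

          import re

          THINK_ANSWER = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

          def rule_based_reward(completion: str, reference_answer: str) -> float:
              # Scored on the finished sequence: a format reward for the tag
              # structure, plus an accuracy reward for a verifiable final answer.
              reward = 0.0
              match = THINK_ANSWER.search(completion)
              if match:
                  reward += 0.5  # format reward: tags present and in order
                  if match.group(2).strip() == reference_answer.strip():
                      reward += 1.0  # accuracy reward: final answer matches
              return reward

          # Both checks need the complete text, so the reward is assigned after
          # generation finishes and then credited to the whole sequence.
          print(rule_based_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))  # 1.5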

  • @kushkaptain4205
    @kushkaptain4205 1 day ago

    How good is the 1.5B version at, let's say, top high school math?

  • @larjunmnath
    @larjunmnath 7 days ago +1

    Thanks for the overview, I was too lazy to read it 😝

    • @deeplearningexplained
      @deeplearningexplained  6 days ago

      Glad it was useful! Do read the paper though, it’s quite well written!

  • @BizRid
    @BizRid 3 days ago +1

    🐐

  • @jebprime
    @jebprime 6 days ago +1

    I wonder if mixing languages allows it to think in subtle, slightly different ways belonging to different cultures/languages? And that's why aligning it to stick to one language resulted in a slight degradation in performance.

    • @deeplearningexplained
      @deeplearningexplained  6 days ago

      That part is one of the most fascinating. I think it has to do with how the knowledge is encoded within its weights.
      A concept might be easier for the model to reach with some token sequence belonging to Chinese, while others might be easier in English.
      It wouldn't surprise me if some of the token sequences aren't even readable, but are more like broad ideas stitched together.

  • @fintech1378
    @fintech1378 7 days ago +1

    Please do the detailed math via the paper like you suggested.

    • @deeplearningexplained
      @deeplearningexplained  6 days ago +1

      Yes, I’m preparing a detailed breakdown of GRPO and I’ll try to get some code to follow along too.

  • @xinformatics
    @xinformatics 6 days ago +3

    Kache - yacineMTB?

  • @fintech1378
    @fintech1378 7 days ago +2

    Is this kache on X?

  • @xiakj
    @xiakj 7 days ago +1

    Can you share the chart in the video?

    • @deeplearningexplained
      @deeplearningexplained  6 days ago +1

      Yes for sure, it’s here:
      www.reddit.com/r/LocalLLaMA/comments/1i66j4f/deepseekr1_training_pipeline_visualized/

  • @ben8718
    @ben8718 6 days ago +1

    Your channel's name fits quite well with the new AI model, coincidence?

  • @loicndo8469
    @loicndo8469 7 days ago +1

    Great, thanks! Don't forget to make some nice videos in French for us too.

    • @deeplearningexplained
      @deeplearningexplained  7 days ago +1

      Haha, I'll try! Most of my audience is English-speaking, but I haven't forgotten my French and Québécois viewers!

  • @amunif_
    @amunif_ 7 days ago +1

    Yacine, would it be possible to get on a Zoom call with you to discuss AI research?

    • @deeplearningexplained
      @deeplearningexplained  7 days ago +1

      Hey there, for sure.
      Shoot me an email at mail@yacinemahdid.com and I'll organize it.

    • @amunif_
      @amunif_ 7 days ago

      @ Thanks, will do.

  • @DigitalAlligator
    @DigitalAlligator 5 days ago +1

    Wow, I didn't know Jon Snow is also an AI expert.

  • @beaniegamer9163
    @beaniegamer9163 6 days ago +1

    Well, eh... I can only read a b c. 😅

    • @deeplearningexplained
      @deeplearningexplained  6 days ago

      Haha, yeah, it's a bit difficult to read the GRPO formula.
      If you are interested in improving your math reading skills, I've got a video that covers the technique I use for complicated formulas:
      m.ua-cam.com/video/YXWxVxQ6AeY/v-deo.html

  • @ps3301
    @ps3301 7 days ago +1

    Lazy video: if you want to teach, translate the formula into code to demonstrate it.