GRPO Crash Course: Fine-Tuning DeepSeek for MATH!

  • Published 9 Feb 2025
  • I'm happy to share my latest tutorial on Group Relative Policy Optimization (GRPO)! In this video, I break down GRPO in a way that's easy to understand, even if you're new to reinforcement learning. I explain the core concepts using simple language and visuals, aiming for that ELI5 (Explain Like I'm 5) level of clarity. No complex math or jargon here - just the essential ideas behind this powerful technique.
    But that's not all! I also dive into a practical demonstration of how to fine-tune a distilled DeepSeek model using the International Mathematical Olympiad (IMO) dataset from Kaggle. I walk you through the entire process, step by step, showing how I improved the model's mathematical reasoning abilities. I cover everything from setting up your environment to evaluating the results. You'll see firsthand how GRPO can be applied to enhance LLMs for complex tasks like solving IMO-level problems (a short code sketch of the core idea follows after the links below).
    I believe this video will be incredibly valuable for anyone interested in AI, machine learning, and especially those looking to improve LLMs for mathematical tasks.
    If you found this video helpful, please give it a thumbs up! I really appreciate your support. Let me know what you think in the comments below - I'd love to hear your questions and feedback. And don't forget to subscribe to my channel for more tutorials on AI, machine learning, and other exciting topics. Your subscription helps me create more content like this! Thanks for watching!
    GitHub Repo: github.com/AIA...
    DeepSeek Research Paper: arxiv.org/pdf/...
    Unsloth Notebooks: docs.unsloth.a...
    Kaggle Dataset: www.kaggle.com...
    Join this channel to get access to perks:
    / @aianytime
    To further support the channel, you can contribute via the following methods:
    Bitcoin Address: 32zhmo5T9jvu8gJDGW3LTuKBM1KPMHoCsW
    UPI: sonu1000raw@ybl
    #grpo #deepseek #ai
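
To make the core idea mentioned in the description concrete: GRPO samples a group of completions for the same prompt, scores each with a reward, and normalizes those rewards within the group to get per-completion advantages, with no separate value model. Below is a minimal sketch in plain Python, assuming a simple 1/0 correctness reward; the function name and epsilon are illustrative and not taken from the video or repo.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize a group of rewards so each completion is scored
    relative to its siblings sampled from the same prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 completions for one math prompt, rewarded 1.0 if the
# final answer is correct and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# Correct completions get positive advantages, incorrect ones negative.
```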

COMMENTS • 13

  • @rc_381
    @rc_381 1 day ago +2

    Good to know someone is actually explaining the concept... And applying it to a use case ❤

    • @AIAnytime
      @AIAnytime 1 day ago +1

      Appreciate it! Glad you find it useful!

  • @kumargaurav2170
    @kumargaurav2170 15 hours ago

    Great content 👌

  • @pocketai-p2w
    @pocketai-p2w 21 hours ago +1

    Bro, I watched the whole thing but didn't understand anything 😢

  • @tusharplug
    @tusharplug 1 day ago +1

    I wonder if GRPO-trained models actually perform better than SFT ones

    • @rc_381
      @rc_381 1 day ago

      @@tusharplug DeepSeek R1 used a bunch of techniques, including cold start, supervised fine-tuning (SFT), and Group Relative Policy Optimization (GRPO) with reinforcement learning. SFT is pretty standard - nothing groundbreaking there - while GRPO can be used in lots of different situations. The most interesting part about DeepSeek R1 is how they put all of these together in the architecture... I hope I answered it 😅

    • @tusharplug
      @tusharplug 1 day ago

      @rc_381 I understand, you mean we can experiment with the pipeline ourselves to see what we get

    • @rc_381
      @rc_381 1 day ago

      @@tusharplug SFT is not some innovative thing, is what I mean... GRPO is innovative, and the two are too different to even compare. If you want, you can play with the DeepSeek R1 architecture on smaller chain-of-thought (CoT) datasets using some supervised fine-tuning

    • @tusharplug
      @tusharplug 1 day ago

      @@rc_381 Yeah, I kind of get it. My doubt was: if we had a scenario where we trained one copy of a model with SFT and another copy of the same model with GRPO for a common use case, which one would perform better, the SFT-only one or the GRPO-only one?

    • @rc_381
      @rc_381 1 day ago

      @@tusharplug Hey...I think there's a misunderstanding about how SFT and GRPO work. SFT is a training process where you fine-tune a model on a labeled dataset. GRPO is a reinforcement learning algorithm that optimizes the model's behavior based on rewards. They're not alternatives, but rather often used together. You typically use SFT first to get the model to a good starting point, and then use GRPO to fine-tune its responses based on human feedback or specific goals. So, it's not a question of which performs better, but how they can be used in conjunction.
      I hope you understood it...
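
To make the "SFT first, then GRPO" point above concrete, here is a minimal sketch assuming the GRPOTrainer API from Hugging Face TRL, which the Unsloth notebooks linked in the description build on. The dataset file, column names, reward rule, model choice, and hyperparameters are illustrative assumptions, not taken from the video.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Assumed dataset with a "prompt" column of math problems and an
# "answer" column holding the reference final answer (illustrative names).
dataset = load_dataset("json", data_files="imo_problems.jsonl", split="train")

def correctness_reward(completions, answer, **kwargs):
    # Reward 1.0 when the completion contains the reference answer, else 0.0.
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # already SFT'd; GRPO runs on top
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="grpo-math", num_generations=4,
                    max_completion_length=512, learning_rate=1e-6),
    train_dataset=dataset,
)
trainer.train()
```

The point the sketch illustrates is the ordering: the distilled checkpoint already went through supervised fine-tuning, and GRPO then adjusts it using only a reward signal over sampled completions.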

  • @snsa_kscc
    @snsa_kscc 8 hours ago

    I appreciate your effort immensely bro, but for the love of god, invest some money into a decent mic. Sound is more critical than video.

  • @Chadpritai
    @Chadpritai 1 day ago +1

    Audio is bad