Proximal Policy Optimization (PPO) - How to train Large Language Models

  • Published 2 May 2024
  • Reinforcement Learning with Human Feedback (RLHF) is a method used for training Large Language Models (LLMs). At the heart of RLHF lies a very powerful reinforcement learning method called Proximal Policy Optimization. Learn about it in this simple video!
    This is the first in a series of 3 videos dedicated to the reinforcement learning methods used for training LLMs.
    Full Playlist: • RLHF for training Lang...
    Video 0 (Optional): Introduction to deep reinforcement learning • A friendly introductio...
    Video 1 (This one): Proximal Policy Optimization
    Video 2: Reinforcement Learning with Human Feedback • Reinforcement Learning...
    Video 3 (Coming soon!): Direct Preference Optimization (DPO)
    00:00 Introduction
    01:25 Gridworld
    03:10 States and Actions
    04:01 Values
    07:30 Policy
    09:39 Neural Networks
    16:14 Training the value neural network (Gain)
    22:50 Training the policy neural network (Surrogate Objective Function)
    33:38 Clipping the surrogate objective function
    36:49 Summary
    Get the Grokking Machine Learning book!
    manning.com/books/grokking-ma...
    Discount code (40%): serranoyt
    (Use the discount code at checkout)
  • Science & Technology

COMMENTS • 42

  • @RunchuTian 24 days ago +2

    Thank you! Your explanation of PPO is SO explicit.

  • @texwiller7577 26 days ago +2

    Probably the best explanation of PPO ever.

  • @dasistdiewahrheit9585 3 months ago +6

    I love your clear examples and how you reduce them to the essentials.

  • @KumR 3 months ago +3

    Looking forward to this.

  • @ahmedshmels8866 3 months ago +3

    Crystal clear!

  • @Wenbobobo 3 months ago

    I love your clear teaching, which is both easy to understand and in-depth. I'll recommend it to friends, and I'm hoping for the next RLHF video!

  • @itsSandraKublik 3 months ago +3

    Loved it ❤ I need to rewatch it a few more times now, but it's getting much, much clearer thanks to you!

  • @user-pe4xm7cq5z 3 months ago

    You're the best!!! Absolutely love all your ML vids!

  • @sethjchandler 1 month ago

    Extraordinarily lucid. Thanks!

  • @jff711 2 months ago

    Thank you for the time and effort you put into preparing this useful video and explaining it.

  • @sergioa.serrano7993 3 months ago

    Excellent explanation, professor!!

  • @limal8012 26 days ago +1

    Thank you for your video, which provided a great explanation of PPO.❤

  • @mavichovizana5460 1 month ago

    great example and clear explanation!

  • @sachinsarathe1143 2 months ago

    You are a genius, man... The way you explain things in an easy-to-understand way is mind-blowing. Love you a lot :)

  • @gemini_537 8 days ago

    You are a genius!!

  • @wirotep.1210 14 days ago

    THE BEST on PPO.

  • @learnenglishwithmovie8485 23 days ago

    Since I am familiar with RL concepts, it was boring at the beginning, but it finished awesome. Thanks!

  • @sarthak.AiMLDL 3 months ago +3

    Sorry bae, can't talk right now, Luis dropped another masterpiece and I had to watch it first... :)

  • @itsSandraKublik 3 months ago +1

    Let's goooo!

  • @HenriqueSousa-ub5en 26 days ago

    I think that in the case of the policy loss you want to maximize it instead of minimizing it, since a positive gain means you need to increase the weights that contribute to increasing the probability of the considered action, and therefore you should do gradient ascent on the weights.
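    One way to reconcile the two views: libraries typically minimize a loss, so the clipped surrogate objective is negated before being handed to the optimizer, which is equivalent to gradient ascent on the objective. A minimal PyTorch-style sketch, with hypothetical names (ratio, advantage, ppo_policy_loss) that are not from the video:

        import torch

        def ppo_policy_loss(ratio, advantage, eps=0.2):
            # ratio = pi_new(a|s) / pi_old(a|s); advantage = the estimated gain
            unclipped = ratio * advantage
            clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
            surrogate = torch.min(unclipped, clipped).mean()  # objective to maximize
            return -surrogate  # minimizing this performs gradient ascent on the surrogate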

  • @gemini_537 7 days ago

    Gemini: This video is about Proximal Policy Optimization (PPO) and its applications in training large language models. The speaker, Luis Serrano, starts the video by explaining what Proximal Policy Optimization (PPO) is and why it is important in reinforcement learning. Then, he dives into the details of PPO with a grid world example.
    Here are the key points of the video:
    * Proximal Policy Optimization (PPO) is a method commonly used in reinforcement learning. [1]
    * It is especially important for training large language models. [1]
    * In reinforcement learning, an agent learns through trial and error in an environment. The agent receives rewards for good actions and penalties for bad actions. [1]
    * The goal is to train the agent to take actions that maximize the total reward it receives. [1]
    * PPO uses two neural networks: a value network and a policy network. [2]
    * The value network estimates the long-term value of being in a particular state. [2]
    * The policy network determines the action the agent should take in a given state. [2]
    * PPO trains the value network and policy network simultaneously. [2]
    * The speaker uses a grid world example to illustrate the concepts of states, actions, values, and policy. [2,3,4,5]
    * In the grid world example, the agent is a small orange ball that moves around a grid. [2]
    * The goal of the agent is to get as many points as possible. [2]
    * The agent receives points by landing on squares with money and avoids squares with dragons. [2]
    * The speaker explains how to calculate the value of each state in the grid world. [4]
    * The value of a state is the maximum expected reward the agent can get from that state. [4]
    * The speaker also explains how to determine the best policy (i.e., the best action to take) for each state in the grid world. [5]
    * Once the value and policy are determined for all states, the agent can start acting in the environment. [5]
    * PPO uses a clipped surrogate objective function to train the policy network. [8,9]
    * This function helps to ensure that the policy updates are stable and do not diverge too much. [8,9]
    Overall, this video provides a clear and concise explanation of Proximal Policy Optimization (PPO) with a focus on its application in training large language models.
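    The "value of a state" point in the summary can be made concrete with a tiny, hypothetical one-dimensional gridworld in plain Python. The rewards, discount factor, and layout below are invented for illustration and are not the grid from the video; each state's value converges to the best discounted total reward reachable from it:

        # Hypothetical 1-D gridworld: a "dragon" (penalty) at index 2, "money" (reward) at index 4.
        # Moving left or right is deterministic; the reward is collected on arrival.
        rewards = [0, 0, -5, 0, 10]
        gamma = 0.9  # discount factor (assumed for this sketch)

        values = [0.0] * len(rewards)
        for _ in range(100):  # value iteration: repeat until the values stop changing
            for s in range(len(rewards)):
                neighbors = [n for n in (s - 1, s + 1) if 0 <= n < len(rewards)]
                # value of s = best (reward on arrival + discounted value of the next state)
                values[s] = max(rewards[n] + gamma * values[n] for n in neighbors)

        print([round(v, 2) for v in values])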

  • @KumR 3 months ago +2

    Excellent session, Luis. Can we have a similar one for DPO as well?

    • @SerranoAcademy 3 months ago +4

      Thank you! Yes, after this comes RLHF and then DPO.

    • @KumR 3 months ago

      Can't wait...

  • @synchro-dentally1965 3 months ago

    Might be a novel approach for robotics: represent paths as Gaussian splats and use spherical harmonics as the "recommended" directions within those splats to reach a goal/endpoint.

  • @JaniMikaelOllenberg 2 months ago

    Wow, this video is so awesome! Your book link doesn't seem to work in the description :)

    • @SerranoAcademy 1 month ago +1

      Thank you, and thanks so much for pointing it out! Just fixed it.

  • @chenqu773 3 months ago

    Thank you, Luis! One thing I don't catch, though: why decrease the policy when the value needs to go down, and increase the policy when the value goes up? I can't see a coupling between the trend of the value and that of the policy.

    • @SerranoAcademy 3 months ago +1

      Great question! Yeah, I also found that part a bit mysterious. My guess is that since we're training both the value and policy NNs at the same time, they kind of capture similar information. So if the value NN underestimated the value of a state, then it's likely that the policy NN also underestimates the probabilities of getting to that state. So as we increase the value estimate, we should also increase the probability estimate.
      But if you have any other thoughts, let me know; I'm still trying to wrap my head around it…

  • @jimshtepa5423 3 months ago +1

    I am wondering what level of expertise and knowledge one must have to be able to notice that the impulse must be taken into account when the probability is adjusted, omg :0 Even if I lived 2 lives spanning 200 years, I would never realize that the impulse must be taken into account.

  • @cromi4194 2 months ago

    Am I correct in pointing out that the loss is the negative of this expectation? Loss is always something we want to decrease, so is this the gain without the minus?

  • @guzh 2 months ago

    L_policy^CLIP seems to be incorrect. What is ρ? The min of clip() is always the lower bound. Can you give a reference?
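    For reference, the clipped surrogate objective in the original PPO paper (Schulman et al., 2017, "Proximal Policy Optimization Algorithms") is written with the probability ratio r_t(θ), presumably the same quantity denoted ρ in the video:

        r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
        \qquad
        L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]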

  • @fgh680 3 months ago +1

    Is clipping done to avoid vanishing/exploding gradients?

    • @SerranoAcademy 3 months ago

      Great question, yes absolutely! If the gradient is too big or too small, then that messes up the training, and that's why we clip it to something in the middle.

    • @TerryE-mo2ky 2 months ago

      @SerranoAcademy It seems to me that the lower bound of the probability ratio is not determined by the clipping function, since the min function takes the minimum of the probability ratio and the result of the clipping function. So if epsilon is 0.3 and the probability ratio is 0.2, the lower bound of the clipping function will be 1 - 0.3 = 0.7, and min(0.2, 0.7) = 0.2.
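      The arithmetic above can be checked directly in plain Python (the advantage is assumed positive and set to 1 for illustration):

          eps = 0.3
          ratio = 0.2
          advantage = 1.0
          clipped_ratio = min(max(ratio, 1 - eps), 1 + eps)  # clip(0.2, 0.7, 1.3) = 0.7
          objective = min(ratio * advantage, clipped_ratio * advantage)
          print(clipped_ratio, objective)  # 0.7 0.2 -> the unclipped term is the one kept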

  • @jimshtepa5423 3 months ago

    When explaining the formula with mathematical notation, where exactly is the notation for summing the values for each step, just as you summed them a little earlier, before explaining the math formula?

    • @SerranoAcademy 3 months ago +1

      Great question! You mean in the surrogate objective function? Yes, I skimmed over that part, but at the end, when you see the expected value sign, it means we're looking at the average of the function over the different actions taken along a path.
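      In symbols, that empirical expectation is just an average of the per-step terms over a collected trajectory of length T:

          \hat{\mathbb{E}}_t\big[ f_t \big] \approx \frac{1}{T} \sum_{t=0}^{T-1} f_t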

  • @jimshtepa5423 3 months ago +3

    The musical inserts between concepts are too loud and too long in this ever-decreasing-attention-span world. The presentation and material are amazing. Thank you!