Reinforcement Learning with Human Feedback - How to train and fine-tune Transformer Models

  • Published 17 May 2024
  • Reinforcement Learning with Human Feedback (RLHF) is a method used for training Large Language Models (LLMs). At the heart of RLHF lies a very powerful reinforcement learning method called Proximal Policy Optimization. Learn about it in this simple video!
    This is the second in a series of 3 videos dedicated to the reinforcement learning methods used for training LLMs.
    Full Playlist: • RLHF for training Lang...
    Video 0 (Optional): Introduction to deep reinforcement learning • A friendly introductio...
    Video 1: Proximal Policy Optimization • Proximal Policy Optimi...
    Video 2 (This one): Reinforcement Learning with Human Feedback
    Video 3 (Coming soon!): Direct Preference Optimization (DPO)
    00:00 Introduction
    00:48 Intro to Reinforcement Learning (RL)
    02:47 Intro to Proximal Policy Optimization (PPO)
    04:17 Intro to Large Language Models (LLMs)
    06:50 Reinforcement Learning with Human Feedback (RLHF)
    13:08 Interpretation of the Neural Networks
    14:36 Conclusion
    Get the Grokking Machine Learning book!
    manning.com/books/grokking-ma...
    Discount code (40%): serranoyt
    (Use the discount code at checkout)
  • Science & Technology

COMMENTS • 35

  • @gemini_537
    @gemini_537 23 days ago +1

    You are a genius at explaining complex concepts in simple terms!

  • @hoseinalavi3916
    @hoseinalavi3916 13 days ago +1

    Your explanation is so great. Keep going, my friend. I am waiting for your next video.

  • @testme2026
    @testme2026 3 months ago +1

    Seriously mate, you are annoyingly good! Like off-the-charts amazing! Thank you so much, Luis Serrano.

  • @jff711
    @jff711 2 months ago +3

    Thank you! Looking forward to watching your DPO video.

  • @sainulia
    @sainulia 1 month ago +1

    Amazing explanation!

  • @asma5179
    @asma5179 2 months ago +1

    Thanks a lot for sharing your knowledge

  • @itsSandraKublik
    @itsSandraKublik 2 months ago

    Such a great video! ❤ So intuitive as always ❤

  • @sgrimm7346
    @sgrimm7346 3 months ago

    Great presentation, thank you.

  • @dragolov
    @dragolov 3 months ago

    Deep respect, Luis Serrano!

    • @SerranoAcademy
      @SerranoAcademy 3 months ago +1

      Thank you Ivan! Deep respect to you too!

  • @KumR
    @KumR 3 months ago

    Thanks a lot Luis.

  • @HamidrezaFarhidzadeh
    @HamidrezaFarhidzadeh 2 months ago

    You are amazing. Thanks

  • @omsaikommawar
    @omsaikommawar 3 months ago

    Waiting for your video for a very long time 😁

    • @SerranoAcademy
      @SerranoAcademy 3 months ago

      Thank you! Finally here! There's one on DPO coming out soon too!

  • @Murphyalex
    @Murphyalex 25 days ago

    Great explanation. Loved the Simpsons reference 🤣

    • @SerranoAcademy
      @SerranoAcademy 25 days ago

      LOL! Yay, someone got the reference!!! :)

  • @gemini_537
    @gemini_537 23 days ago

    Gemini: This video is about reinforcement learning with human feedback (RLHF), a method used to train large language models (LLMs). Specifically, it covers how to fine-tune LLMs after they've been trained.
    Here are the key points of the video:
    * **Reinforcement learning with human feedback (RLHF):**
      * RLHF is a method for training LLMs.
      * It involves human annotators rating the responses generated by a large language model to a specific prompt.
      * The LLM is then trained to get high scores from the human annotators.
    * **Review of reinforcement learning (RL):**
      * The video reviews the basics of RL using a grid-world example.
      * An agent moves around a grid trying to collect points and avoid getting eaten by a dragon.
      * The agent learns the optimal policy through trial and error, which is to move towards the squares with the most points.
      * A value neural network and a policy neural network are introduced to approximate the values and the policy, respectively.
    * **Proximal Policy Optimization (PPO):**
      * PPO is an algorithm for training RL agents.
      * It approximates the value and policy functions using neural networks.
      * The agent learns by moving around the state space and getting points based on the actions it takes.
    * **Transformers:**
      * Transformers are neural networks that are used to generate text.
      * They are trained on a massive amount of text data.
      * They generate text one word at a time by predicting the next word in a sequence.
    * **Fine-tuning Transformers with RLHF:**
      * The core idea of RLHF is to combine RL with human feedback to fine-tune Transformers.
      * Imagine the agent is moving around a grid of sentences, adding one word at a time.
      * The goal is to generate coherent sentences.
      * The agent generates multiple possible continuations for a sentence.
      * Human annotators then rate these continuations, and the agent is trained to favor generating the higher-rated continuations.
      * In essence, the value neural network mimics the human evaluator, assigning scores to responses, while the policy neural network learns the probabilities of transitioning between states (sentences), which is similar to what Transformers do.
    The video concludes by mentioning that this is the third video in a series of four about reinforcement learning.
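
To make the fine-tuning loop summarized above concrete, here is a minimal, hypothetical sketch of a single PPO-style RLHF update in PyTorch. The tiny linear layers, variable names, and toy tensors are illustrative stand-ins for the real Transformer, reference model, and reward model; this is not code from the video.

```python
# Minimal sketch of one RLHF policy-update step (PPO-style), for illustration only.
import torch
import torch.nn.functional as F

vocab_size, feat_dim, batch = 100, 8, 4

policy = torch.nn.Linear(feat_dim, vocab_size)   # stand-in for the LLM being fine-tuned
ref    = torch.nn.Linear(feat_dim, vocab_size)   # stand-in for a frozen copy of the original LLM
reward = torch.nn.Linear(feat_dim, 1)            # stand-in for the model trained on human rankings

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Toy "states": features encoding the prompt plus the partial response,
# and the token ids that were actually sampled at each of these states.
states  = torch.randn(batch, feat_dim)
actions = torch.randint(0, vocab_size, (batch,))

# Log-probability of each sampled token under the current policy.
logp = F.log_softmax(policy(states), dim=-1).gather(1, actions[:, None]).squeeze(1)
logp_old = logp.detach()                         # snapshot taken before the update

with torch.no_grad():                            # reference and reward models stay frozen
    logp_ref = F.log_softmax(ref(states), dim=-1).gather(1, actions[:, None]).squeeze(1)
    scores   = reward(states).squeeze(1)         # human-preference score for the response

# Reward-model score minus a KL-style penalty that keeps the policy near the reference model.
beta = 0.1
advantage = scores - beta * (logp.detach() - logp_ref)

# PPO clipped surrogate objective (negated, since optimizers minimize).
eps   = 0.2
ratio = torch.exp(logp - logp_old)
loss  = -torch.min(ratio * advantage,
                   torch.clamp(ratio, 1 - eps, 1 + eps) * advantage).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The two ingredients to notice are the frozen reference model, whose log-probabilities feed a KL-style penalty so the fine-tuned model does not drift too far from the original LLM, and the clipped probability ratio, which limits how much the policy can change in a single update.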

  • @RunchuTian
    @RunchuTian 1 month ago

    Some questions about RLHF:
    - The value model in RLHF is not like the typical PPO value model, which assigns a value to each move on the grid. The value model in RLHF assigns a value only to a complete chain of moves, so it is really more like a 'reward model'.
    - What does the loss function for the policy model in RLHF look like? Does it still follow the one from PPO, or does it add some changes?
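
For context on the second question (general RLHF background, not something stated in the video): the policy model is typically updated with the same clipped surrogate objective as in standard PPO, applied to the per-token probability ratios, while the reward signal combines the reward-model score with a KL penalty that keeps the fine-tuned policy close to the frozen reference model:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
$$

where the advantage \(\hat{A}_t\) is computed from a reward of the form \(R = r_{\text{RM}}(x, y) - \beta \log\frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\), with \(r_{\text{RM}}\) the learned reward model (trained on human rankings) and \(\pi_{\text{ref}}\) the frozen pre-fine-tuning model. The objective is maximized; training code usually minimizes its negative.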

  • @tutolopezgonzalez1106
    @tutolopezgonzalez1106 3 months ago

    Love your videos ❤ Thank you for sharing and bringing us light. Would you explain how RLHF is relevant to aligning AI systems?

    • @SerranoAcademy
      @SerranoAcademy 3 months ago

      Thank you so much, I'm glad you liked it! Yes, great question! Here I mostly talked about fine-tuning, which is about getting the model to give accurate responses, whether in general or for a specific dataset. Aligning models goes deeper, as it requires them to be ethical, responsible, etc. I would say that in general the process is similar; the difference lies in the goals being pursued. But I don't think there's a huge difference in the reward model, etc. I'll still check, and if there's a big difference, I'll add it in the next video. Thanks for the suggestion!

  • @somerset006
    @somerset006 3 months ago

    Thanks for the great video! Is it part of a playlist? You seem to be missing a playlist of the 4 videos at the end of this one.

    • @SerranoAcademy
      @SerranoAcademy 3 months ago

      Thanks for pointing it out! Yes I forgot that part, I'll add it now!

    • @SerranoAcademy
      @SerranoAcademy 3 months ago +2

      And it's been added! Here's the playlist (1 more video to come)
      ua-cam.com/play/PLs8w1Cdi-zvYviYYw_V3qe6SINReGF5M-.html

  • @lrostagno2000
    @lrostagno2000 1 month ago

    Love the video, but please could you remove the loud music at the beginning of the sections?

  • @meme31382
    @meme31382 3 months ago +1

    Thanks for the great video! Can you make one explaining graph neural networks? Thanks in advance.

    • @SerranoAcademy
      @SerranoAcademy 3 months ago

      Thanks for the message and the suggestion! Yes that topic has been coming up, and it looks super interesting!

  • @jeffpan2785
    @jeffpan2785 1 month ago

    Could you please give an introduction to DPO (Direct Preference Optimization) as well? Thanks a lot!

    • @SerranoAcademy
      @SerranoAcademy 1 month ago

      Thanks! Absolutely, I'm working on a DPO video, but tbh, I haven't yet fully understood the loss function the way I want to. I'll get it out hopefully very soon!

  • @pushpakagrawal7292
    @pushpakagrawal7292 2 months ago

    Great! When is DPO coming?

  • @romanemul1
    @romanemul1 3 months ago

    So it's actually just pushing the learning of the model in the right direction?

    • @SerranoAcademy
      @SerranoAcademy 3 months ago

      Great question, exactly! The model is trained, but this improves the results.