Reinforcement Learning: ChatGPT and RLHF

  • Published May 15, 2024
  • Reinforcement Learning from human feedback, and how it's used to help train large language models like ChatGPT.
    Part 3 of RL from scratch series.
    0:00 - intro
    0:06 - large language models
    0:35 - learning to tell jokes
    1:13 - fine tuning with better data
    1:26 - positive and negative examples
    2:03 - reinforcement learning for LLMs
    3:00 - labeling fewer examples
    3:56 - reward networks
    5:08 - summing it up
    5:23 - variants
    5:57 - chatGPT, Bard, Claude, Llama
    6:09 - finally, a good joke!

COMMENTS • 11

  • @user-cm5es5kk7j 15 days ago +1

    Helped me a lot, can't wait to see more.

  • @pegasusbupt 7 months ago +2

    Amazing content! Please keep them coming!

  • @jasonpmorrison 7 months ago +1

    Super helpful - thank you for this series!

  • @ireoluwaTH 9 months ago +1

    Welcome back!
    Hope to see more of these videos.

  • @tuulymusic3856 1 month ago +1

    Please come back, your videos are great!

  • @RaulMartinezRME 9 months ago +1

    Great content!!

  • @0xeb- 9 months ago +1

    Good teaching.

  • @0xeb- 9 months ago +1

    How long does it take to train a reward network? And how reliable would it be?

  • @vamsinadh100 6 months ago +1

    You are the Best

  • @stayhappy-forever 18 days ago +1

    Come back :(

  • @onhazrat 9 months ago

    🎯 Key Takeaways for quick navigation:
    00:00 🤖 Reinforcement learning improves large language models like ChatGPT.
    00:25 🃏 Large language models face issues like bias, errors, and quality.
    01:11 📊 Training data quality impacts results; removing bad jokes might help.
    01:55 🧩 Training on both good and bad jokes improves language models.
    02:38 🔄 Language models are policies, reinforcement learning uses policy gradient.
    03:08 🎯 Reinforcement Learning from Human Feedback (RLHF) challenges data acquisition.
    03:35 🤔 RLHF theory: Language model might already know jokes' boundary.
    04:18 🏆 Training a reward network predicts human ratings for model's output.
    04:47 🔄 Reward network is a modified language model for predicting ratings.
    05:14 📝 Approach: Humans write text, train reward network, refine model with RL.
    05:57 ⚖️ Systems convert comparisons to ratings for reward network training.
    06:11 😄 RLHF successfully improves language models, including humor.
    Made with HARPA AI
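
The step at 05:57 ("systems convert comparisons to ratings for reward network training") can be sketched as a Bradley-Terry-style preference loss: given pairs of outputs where a human picked one over the other, train a reward function so the chosen output scores higher. The sketch below is a toy assumption standing in for the video's setup: outputs are plain feature vectors and the reward network is a single linear layer, whereas a real RLHF reward model reuses the language model's transformer with a scalar head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: each model output is an 8-dim feature vector, and the
# reward network is linear, r(x) = w . x  (hypothetical simplification).
DIM = 8
N_PAIRS = 200

# Synthetic "true" preference direction, used only to generate labels.
true_w = rng.normal(size=DIM)

# Humans compared two outputs per example rather than scoring each one.
chosen = rng.normal(size=(N_PAIRS, DIM))
rejected = rng.normal(size=(N_PAIRS, DIM))
# Relabel so "chosen" really is the preferred output under true_w.
swap = chosen @ true_w < rejected @ true_w
chosen[swap], rejected[swap] = rejected[swap].copy(), chosen[swap].copy()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Bradley-Terry loss: minimize -log sigmoid(r_chosen - r_rejected),
# i.e. push the reward margin of the preferred output upward.
w = np.zeros(DIM)
lr = 0.1
for _ in range(500):
    margin = chosen @ w - rejected @ w                  # r_chosen - r_rejected
    grad = -((1 - sigmoid(margin))[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

# After training, the reward network should rank the human-preferred
# output higher on most pairs.
accuracy = np.mean(chosen @ w > rejected @ w)
print(f"pairwise accuracy: {accuracy:.2f}")
```

Once such a reward network exists, it replaces the human rater in the loop: the language model (the policy) is then fine-tuned with a policy-gradient method against the learned reward, which is the pipeline the takeaways at 05:14 describe.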