Reinforcement Learning with Human Feedback - How to train and fine-tune Transformer Models
- Published 17 May 2024
- Reinforcement Learning with Human Feedback (RLHF) is a method used for training Large Language Models (LLMs). In the heart of RLHF lies a very powerful reinforcement learning method called Proximal Policy Optimization. Learn about it in this simple video!
This is the second video in a series of 3 dedicated to the reinforcement learning methods used for training LLMs.
Full Playlist: • RLHF for training Lang...
Video 0 (Optional): Introduction to deep reinforcement learning • A friendly introductio...
Video 1: Proximal Policy Optimization • Proximal Policy Optimi...
Video 2 (This one): Reinforcement Learning with Human Feedback
Video 3 (Coming soon!): Direct Preference Optimization (DPO)
00:00 Introduction
00:48 Intro to Reinforcement Learning (RL)
02:47 Intro to Proximal Policy Optimization (PPO)
04:17 Intro to Large Language Models (LLMs)
06:50 Reinforcement Learning with Human Feedback (RLHF)
13:08 Interpretation of the Neural Networks
14:36 Conclusion
Get the Grokking Machine Learning book!
manning.com/books/grokking-ma...
Discount code (40%): serranoyt
(Use the discount code at checkout) - Science & Technology
You are a genius of explaining complex concepts with simple terms!
Your explanation is so great. Keep going on my friend. I am waiting for your next video.
Seriously mate, you are annoyingly good! like off the charts amazing! Thank you so much Luis Serrano.
Thank you! Looking forward to watching your DPO video.
Amazing explanation!
Thanks a lot for sharing your knowledge
Such a great video! ❤ So intuitive as always ❤
Yayyyy! Thank you Sandra!!!! 🤗
Great presentation, thank you.
Thank you! Glad you liked it! :)
Deep respect, Luis Serrano!
Thank you Ivan! Deep respect to you too!
Thanks a lot Luis.
Thank you @KumR!
You are amazing. Thanks
Been waiting for your video for a very long time 😁
Thank you! Finally here! There's one on DPO coming out soon too!
Great explanation. Loved the Simpsons reference 🤣
LOL! Yay, someone got the reference!!! :)
Gemini: This video is about reinforcement learning with human feedback (RLHF), a method used to train large language models (LLMs). Specifically, it covers how to fine-tune LLMs after they've been trained.
Here are the key points of the video:
* **Reinforcement learning (RL) with human feedback (RLHF):**
* RLHF is a method for training LLMs.
* It involves human annotators rating the responses generated by a large language model to a specific prompt.
* The LLM is then trained to get high scores from the human annotators.
* **Review of Reinforcement Learning (RL):**
* The video reviews the basics of RL using a grid world example.
* An agent moves around a grid trying to collect points and avoid getting eaten by a dragon.
* The agent learns the optimal policy through trial and error, which is to move towards the squares with the most points.
* A value neural network and a policy neural network are introduced to approximate the values and the policy, respectively.
* **Proximal Policy Optimization (PPO):**
* PPO is an algorithm for training RL agents.
* It approximates the value and policy functions using neural networks.
* The agent learns by moving around the state space and getting points based on the actions it takes.
* **Transformers:**
* Transformers are neural networks that are used to generate text.
* They are trained on a massive amount of text data.
* They generate text one word at a time by predicting the next word in a sequence.
* **Fine-tuning Transformers with RLHF:**
* The core idea of RLHF is to combine RL with human feedback to fine-tune Transformers.
* Imagine the agent is moving around a grid of sentences, adding one word at a time.
* The goal is to generate coherent sentences.
* The agent generates multiple possible continuations for a sentence.
* Human annotators then rate these continuations, and the agent is trained to favor generating the higher-rated continuations.
* In essence, the value neural network mimics the human evaluator, assigning scores to responses, while the policy neural network learns the probabilities of transitioning between states (sentences), which is similar to what Transformers do.
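The fine-tuning loop the summary describes can be sketched in miniature. Everything below is a hypothetical stand-in: the candidate continuations and their reward scores are made up, and a simple exponentiated update replaces the full PPO machinery, just to show the "favor the higher-rated continuations" idea.

```python
import math

# A heavily simplified sketch of the RLHF update described above.
# The "policy" is a probability table over two candidate continuations
# (hypothetical names), and the reward scores stand in for the human
# annotator / reward model.
policy = {"continuation_a": 0.5, "continuation_b": 0.5}
reward = {"continuation_a": 1.0, "continuation_b": 4.0}  # humans prefer b

lr = 0.1  # learning rate
for _ in range(20):
    # Average reward under the current policy serves as a baseline.
    baseline = sum(policy[c] * reward[c] for c in policy)
    # Exponentiated update: boost continuations that beat the baseline
    # (a crude stand-in for PPO's policy-gradient step).
    policy = {c: p * math.exp(lr * (reward[c] - baseline))
              for c, p in policy.items()}
    # Renormalize so the probabilities sum to 1 again.
    total = sum(policy.values())
    policy = {c: p / total for c, p in policy.items()}
```

After a few iterations the policy concentrates almost all of its probability mass on the higher-rated continuation, which is exactly the behaviour the summary describes: the model is trained to favor the continuations human annotators scored highest.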
The video concludes by mentioning that this is the third video in a series of four about reinforcement learning.
Some Questions for RLHF
- The value model in RLHF is not like the typical PPO value model, which assigns a value to each move on the grid. The value model in RLHF only assigns a value to a complete chain of moves, so it is really more like a 'reward model'.
- What does the loss function for the policy model in RLHF look like? Does it still follow the one in PPO, or does it add some changes?
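On the second question: in standard RLHF the policy loss does keep PPO's clipped surrogate, typically with one addition, a KL penalty that keeps the fine-tuned model close to the frozen reference model. A hedged sketch of that shape (the function name and the `eps`/`beta` values are illustrative defaults, not taken from the video):

```python
def rlhf_policy_loss(ratio, advantage, kl_to_reference, eps=0.2, beta=0.02):
    """One common form of the RLHF policy loss (illustrative).

    ratio:            pi_new(token) / pi_old(token) for the sampled token
    advantage:        advantage estimate built from the reward signal
    kl_to_reference:  KL divergence from the frozen reference model
    """
    # PPO's clipped surrogate: take the more pessimistic of the raw and
    # clipped terms so one update can't move the policy too far.
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    # Minimize the negative surrogate; penalize drifting from the reference.
    return -surrogate + beta * kl_to_reference

# With a large ratio, the clip caps the incentive at (1 + eps) * advantage:
loss = rlhf_policy_loss(ratio=2.0, advantage=1.0, kl_to_reference=0.0)
```

The clipping keeps each update small, and the KL term is what prevents the fine-tuned language model from forgetting how to write fluent text while chasing reward.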
Love your videos ❤ Thank you for sharing and bringing us light. Would you explain how RLHF is relevant to aligning AI systems?
Thank you so much, I'm glad you liked it! Yes, great question! Here I mostly talked about fine-tuning, which is about making models give accurate responses, whether in general or for a specific dataset. Aligning them goes deeper, as it requires them to be ethical, responsible, etc. I would say that in general the process is similar; the difference lies in the goals being pursued. But I don't think there's a huge difference in the reward model, etc. I'll still check, and if there's a big difference, I'll add it in the next video. Thanks for the suggestion!
Thanks for the great video! Is it part of a playlist? You seem to be missing a playlist of the 4 videos at the end of this one.
Thanks for pointing it out! Yes I forgot that part, I'll add it now!
And it's been added! Here's the playlist (1 more video to come)
ua-cam.com/play/PLs8w1Cdi-zvYviYYw_V3qe6SINReGF5M-.html
Love the video, but could you please remove the loud music at the beginning of the sections?
Thanks for the great video! Could you make one explaining graph neural networks? Thanks in advance.
Thanks for the message and the suggestion! Yes that topic has been coming up, and it looks super interesting!
Could you please give an introduction to DPO (Direct Preference Optimization) as well? Thanks a lot!
Thanks! Absolutely, I'm working on a DPO video, but tbh, I haven't yet fully understood the loss function the way I want to. I'll get it out hopefully very soon!
Great! When is DPO coming?
Thanks! Soon, working on it :)
So it's actually just pushing the model's learning in the right direction?
Great question, exactly! The model is already trained, and this improves its results.