You are a genius at explaining complex concepts in simple terms!
I am literally crying, what a wonderful explanation 😭
Seriously mate, you are annoyingly good! like off the charts amazing! Thank you so much Luis Serrano.
Thank you! Looking forward to watching your DPO video.
Your explanation is so great. Keep going, my friend. I am waiting for your next video.
Gemini: This video is about reinforcement learning from human feedback (RLHF), a method used to train large language models (LLMs). Specifically, it covers how to fine-tune LLMs after their initial training.
Here are the key points of the video:
* **Reinforcement learning from human feedback (RLHF):**
* RLHF is a method for training LLMs.
* It involves human annotators rating the responses generated by a large language model to a specific prompt.
* The LLM is then trained to get high scores from the human annotators.
* **Review of Reinforcement Learning (RL):**
* The video reviews the basics of RL using a grid world example.
* An agent moves around a grid trying to collect points and avoid getting eaten by a dragon.
* Through trial and error, the agent learns the optimal policy: move toward the squares with the most points.
* A value neural network and a policy neural network are introduced to approximate the state values and the policy, respectively.
* **Proximal Policy Optimization (PPO):**
* PPO is an algorithm for training RL agents.
* It approximates the value and policy functions using neural networks.
* The agent learns by moving around the state space and getting points based on the actions it takes.
* **Transformers:**
* Transformers are neural networks that are used to generate text.
* They are trained on a massive amount of text data.
* They generate text one word at a time by predicting the next word in a sequence.
* **Fine-tuning Transformers with RLHF:**
* The core idea of RLHF is to combine RL with human feedback to fine-tune Transformers.
* Imagine the agent is moving around a grid of sentences, adding one word at a time.
* The goal is to generate coherent sentences.
* The agent generates multiple possible continuations for a sentence.
* Human annotators then rate these continuations, and the agent is trained to favor generating the higher-rated continuations.
* In essence, the value neural network mimics the human evaluator, assigning scores to responses, while the policy neural network learns the probabilities of transitioning between states (sentences), much like a Transformer does (a toy sketch of this loop follows the summary).
The video concludes by mentioning that this is the third video in a series of four about reinforcement learning.
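To make the loop described in the summary concrete, here is a minimal, self-contained toy sketch (not the video's code): a softmax policy builds a "sentence" one token at a time, a hand-written reward function stands in for the human annotators, and a REINFORCE update pushes the policy toward higher-rated continuations. The vocabulary, reward, and hyperparameters are all invented for illustration.

```python
# Toy RLHF loop: sample a continuation, score it with a stand-in "reward
# model", and nudge the policy toward higher-scored continuations.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["the", "cat", "sat", "ran", "mat", "."]
SEQ_LEN = 4

# Policy: one softmax over the vocabulary per position (a crude stand-in
# for a Transformer's next-word distribution).
logits = np.zeros((SEQ_LEN, len(VOCAB)))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward_model(tokens):
    """Stand-in for the human annotators: reward the 'coherent' sentence."""
    target = ["the", "cat", "sat", "."]
    return float(sum(a == b for a, b in zip(tokens, target)))

lr = 0.5
for step in range(300):
    # Sample a continuation one token at a time.
    chosen = [rng.choice(len(VOCAB), p=softmax(logits[pos])) for pos in range(SEQ_LEN)]
    tokens = [VOCAB[i] for i in chosen]

    # Score the complete sequence, then raise the log-probability of each
    # chosen token in proportion to the (baselined) reward: REINFORCE.
    advantage = reward_model(tokens) - 2.0   # crude constant baseline
    for pos, i in enumerate(chosen):
        p = softmax(logits[pos])
        grad = -p
        grad[i] += 1.0                       # d log pi(i) / d logits
        logits[pos] += lr * advantage * grad

print("greedy sample:", [VOCAB[int(np.argmax(row))] for row in logits])
```

With these settings the greedy sample should converge to the high-reward continuation, which is exactly the "favor the higher-rated continuations" step the summary describes.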
Thank you Luis Serrano for this super explanatory video
Amazing explanation!
I would like to say thank you for the wonderful video. I want to learn reinforcement learning for my future studies in the field of robotics. I see that you only have 4 videos about RL, and I am hungry for more. I find your videos easier to understand because you explain so well. Please add more RL videos. Thank you 🙏
Thanks for the great video! Is it a part of a playlist? You seem to be missing a playlist of the 4 videos at the end of this one.
Thanks for pointing it out! Yes I forgot that part, I'll add it now!
And it's been added! Here's the playlist (1 more video to come)
ua-cam.com/play/PLs8w1Cdi-zvYviYYw_V3qe6SINReGF5M-.html
Some questions about RLHF:
- The value model in RLHF is not like the typical PPO value model, which assigns a value to each square of the grid. The value model in RLHF only assigns a value to a complete chain of moves, so it is really more like a 'reward model'.
- What does the loss function for the policy model look like in RLHF? Does it still follow the one from PPO, or does it change?
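On the first point: yes, the network that scores a complete response is standardly called the reward model, and full PPO-based RLHF (e.g., InstructGPT) usually also trains a separate per-token value model (the critic) on top of it. On the second: the policy loss is typically the same clipped surrogate objective as in plain PPO; the main RLHF-specific change is a per-token KL penalty against the frozen reference model, folded into the reward. A minimal PyTorch sketch, with all tensor names illustrative:

```python
import torch

def ppo_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
    ratio = torch.exp(logp_new - logp_old)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def rlhf_token_rewards(rm_score, logp_policy, logp_ref, kl_coef=0.1):
    """RLHF twist: the reward model scores only the finished response, while
    a per-token KL penalty keeps the policy near the reference model."""
    kl = logp_policy - logp_ref              # per-token log-ratio estimate of KL
    rewards = -kl_coef * kl                  # penalize drifting from the reference
    rewards[..., -1] = rewards[..., -1] + rm_score  # RM score at the last token
    return rewards
```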
Could you please give an introduction to DPO (Direct Preference Optimization) as well? Thanks a lot!
Thanks! Absolutely, I'm working on a DPO video, but tbh, I haven't yet fully understood the loss function the way I want to. I'll get it out hopefully very soon!
Thanks for the great video! Can you make one explaining graph neural networks? Thanks in advance.
Thanks for the message and the suggestion! Yes that topic has been coming up, and it looks super interesting!
Deep respect, Luis Serrano!
Thank you Ivan! Deep respect to you too!
Why do you need the value neural network? Why can't you train the policy neural network alone?
Is it because the value neural network lets you replace the human evaluator and get more training samples for the policy network without needing human input?
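One more reason, beyond replacing the human evaluator: in actor-critic methods like PPO, the value network provides a baseline, so the policy is updated with the advantage (return minus the value estimate) instead of the raw return, which gives a much lower-variance gradient. A tiny illustrative snippet (all numbers invented):

```python
import torch

returns = torch.tensor([9.0, 11.0, 10.5, 8.5])   # noisy sampled episode returns
values  = torch.tensor([10.2, 10.4, 9.8, 9.6])   # value net's per-state estimates

# Without a baseline, every weight below is positive, so REINFORCE pushes up
# the probability of *every* sampled action, good or bad.
print("raw-return weights:", returns)

# With the baseline, only better-than-expected actions are reinforced and
# worse-than-expected ones are suppressed: only the "surprise" drives the update.
advantages = returns - values
print("advantage weights: ", advantages)
```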
Thanks a lot for sharing your knowledge
Such a great video! ❤ So intuitive as always ❤
Yayyyy! Thank you Sandra!!!! 🤗
Great! When is DPO coming?
Thanks! Soon, working on it :)
Love the video, but could you please remove the loud music at the beginning of the sections?
Great presentation, thank you.
Thank you! Glad you liked it! :)
I've been waiting for your video for a very long time 😁
Thank you! Finally here! There's one on DPO coming out soon too!
Great explanation. Loved the Simpsons reference 🤣
LOL! Yay, someone got the reference!!! :)
Direct Preference Optimization (DPO) video??
Working on it, almost there! :)
Love your videos ❤ thank you for sharing and bringing us light. Would you explain how rlhf is relevant to aligning AI systems?
Thank you so much, I'm glad you liked it! Yes, great question! Here I mostly talked about fine-tuning, which is about getting the model to give accurate responses, whether in general or for a specific dataset. Aligning models goes deeper, since it requires them to be ethical, responsible, etc. I would say the overall process is similar; the difference lies in the goals being optimized. I don't think there's a huge difference in the reward model, etc., but I'll check, and if there is a big difference, I'll cover it in the next video. Thanks for the suggestion!
Thanks a lot Luis.
Thank you @KumR!
So it's actually just pushing the model's learning in the right direction?
Great question, exactly! The model is already trained, but this improves its results.
You are amazing. Thanks
Where is the DPO video? 🥹
Thanks for your interest! I’m working on it, but it still hasn’t fully clicked in my head. Hopefully soon! ☺️
@SerranoAcademy Please let us know when it comes out! We are all waiting for it; it's very informative.
Hello! DPO is almost ready, coming out in a few days!