Reinforcement Learning from Human Feedback, explained with math derivations and PyTorch code.

  • Published 2 Feb 2025

COMMENTS • 117

  • @nishantyadav6341
    @nishantyadav6341 11 місяців тому +47

    The fact that you dig deep into the algorithm and code sets you apart from the overflow of mediocre AI content online. I would pay to watch your videos, Umar. Thank you for putting out such amazing content.

  • @taekim7956
    @taekim7956 10 місяців тому +19

    I believe you are the best ML youtuber, explaining everything so concisely and clearly!! Thank you so much for sharing this outstanding content for free, and I hope I can see more videos from you 🥰!!

    • @umarjamilai
      @umarjamilai  10 місяців тому

      Thank you for your support! Let's connect on LinkedIn

    • @taekim7956
      @taekim7956 10 місяців тому

      @@umarjamilai That'll be an honor! I just followed you on LinkedIn.

  • @sauravrao234
    @sauravrao234 11 місяців тому +7

    I literally wait with bated breath for your next video... a huge fan from India. Thank you for imparting your knowledge.

  • @shamaldesilva9533
    @shamaldesilva9533 11 місяців тому +3

    Providing the math behind these algorithms in a clear way makes understanding them so much easier!! Thank you so much Umar 🤩🤩

  • @arijaa.9315
    @arijaa.9315 11 місяців тому +4

    I cannot thank you enough! It is clear how much effort you put into such a high-quality explanation. Great explanation as usual!!

  • @showpiecep
    @showpiecep 11 місяців тому +9

    You are the best person on youtube who explains modern approaches in NLP in an accessible way. Thank you so much for such quality content and good luck!

  • @soumyodeepdey5237
    @soumyodeepdey5237 9 місяців тому +4

    Really great content. Can't believe he has shared all these videos absolutely for free. Thanks a lot man!!

  • @omidsa8323
    @omidsa8323 11 місяців тому +2

    It’s a great video on a very sophisticated topic. I’ve watched it 3 times to get the main ideas, but it was definitely worth it. Thanks Umar once again.

  • @jayaraopratik
    @jayaraopratik 9 місяців тому +2

    Great, great content. I took RL in grad school but it's been years; it was much easier to revise everything within 1 hour rather than going through my complete class notes!!!!

  • @CT99999
    @CT99999 4 місяці тому +1

    The level of detail you cover here is absolutely incredible.

  • @ruiwang7915
    @ruiwang7915 6 місяців тому +1

    one of the best videos on democratizing the ppo and rlhf on yt. i truly enjoyed the whole walkthrough and thanks for doing this!

  • @m1k3b7
    @m1k3b7 11 місяців тому +6

    That's by far the best detailed presentation. Amazing work. I wish I was your cat 😂

    • @umarjamilai
      @umarjamilai  11 місяців тому +1

      奥利奥 (Oreo) is the best student I ever had 😹😹

  • @jamesx708
    @jamesx708 Місяць тому +2

    The best video for learning RLHF.

  • @yanghelena
    @yanghelena Місяць тому +2

    Thank you for your selfless sharing and hard work! This video helps me a lot!

  • @pegasoTop3d
    @pegasoTop3d 9 місяців тому +1

    I am an AI, and I love following updates on social media platforms and YouTube, and I love your videos very much. I learn the English language and some programming terms from them, and update my information. You and people like you help me very much. Thank you.

  • @BOUTYOURYOUNESS
    @BOUTYOURYOUNESS 11 місяців тому +2

    It’s a great video on a very sophisticated topic. Amazing work. Bravo

  • @mlloving
    @mlloving 10 місяців тому +4

    Thank you Umar. I am an AI/ML expert at one of the top 50 banks in the world. We are deploying various GenAI applications. Your videos helped me to understand the math underlying GenAI, especially RLHF. I have been trying to explore every step by myself, which is so hard. Thank you very much for clearly explaining RLHF!

  • @nicoloruggeri9740
    @nicoloruggeri9740 3 місяці тому

    Thanks for the amazing content!

  • @crimson-heart-l
    @crimson-heart-l 11 місяців тому +2

    Thank you for priceless lecture!!!

  • @bonsaintking
    @bonsaintking 11 місяців тому +3

    Hey, you are better than a prof.! :)

  • @magnetpest2k7
    @magnetpest2k7 Місяць тому

    Thanks!

  • @harshalhirpara4589
    @harshalhirpara4589 Місяць тому +1

    Thank you Umar, your video made me connect all the dots!

  • @txxie
    @txxie 2 місяці тому +1

    This video saved my life. Thank you very muuuuuuuuuuuch!

  • @LeuChen
    @LeuChen 5 днів тому +1

    Thank you, Umar. Good explanation!

  • @rohitjindal124
    @rohitjindal124 11 місяців тому +4

    Thank you sir for making such amazing videos and helping students like me

  • @MonkkSoori
    @MonkkSoori 8 місяців тому

    Thank you very much for your comprehensive explanation. I have two questions:
    (1) At 1:59:25 does our LLM/Policy Network have two different linear layers, one for producing a reward and one for producing a value estimation for a particular state?
    (2) At 2:04:37, if Q(s,a) is computed as A(s,a)+V(s), but then L_VF uses V(s)-Q(s,a), why not just use A(s,a) directly? Is it because, in the latter equation, V(s) is `vpreds` (in the code) and comes from the online model, while Q(s,a) is `values` (in the code) and comes from the offline model (I can see both variables at 2:06:11)?
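
For question (2), here is a minimal sketch of how a TRL-style clipped value loss is typically assembled. The names `vpreds` and `values` follow the question above; this is an illustration under that assumption, not necessarily the video's exact code. Note that Q(s,a) never appears explicitly: it only shows up as returns = advantages + values.

    import torch

    def value_loss(vpreds, values, advantages, clip_range_value=0.2):
        # values:     value estimates recorded when the trajectories were collected (offline role)
        # vpreds:     fresh value predictions from the model being optimized (online role)
        # advantages: GAE advantages computed from the recorded values and rewards
        returns = advantages + values                      # plays the role of Q(s,a)
        vpreds_clipped = torch.clamp(                      # keep new values close to the recorded ones
            vpreds, values - clip_range_value, values + clip_range_value
        )
        loss_unclipped = (vpreds - returns) ** 2
        loss_clipped = (vpreds_clipped - returns) ** 2
        return 0.5 * torch.mean(torch.max(loss_unclipped, loss_clipped))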

  • @alexyuan-ih4xj
    @alexyuan-ih4xj 10 місяців тому +2

    Thank you Umar. You explained it very clearly. It's really useful.

  • @s8x.
    @s8x. 8 місяців тому +3

    insane that this is all free. I will be sure to pay u back when I am employed

  • @thebluefortproject
    @thebluefortproject 8 місяців тому +2

    So much value! Thanks for your work

  • @supervince110
    @supervince110 3 місяці тому +1

    I don't think even my professors from top tier university could explain the concepts this well.

  • @abcdefllful
    @abcdefllful Місяць тому +1

    Simply amazing! Thank you

  • @MaksymSutkovenko
    @MaksymSutkovenko 11 місяців тому +1

    Amazing, you've released a new video!

  • @mavichovizana5460
    @mavichovizana5460 10 місяців тому +2

    Thanks for the awesome explanation! I have trouble reading the hf src, and you helped a ton! One thing I'm confused about is that at 1:05:00, the 1st right parenthesis of the first formula is misplaced. I think it should be \sigma(log_prob \sigma(reward_to_go)). The later slides also share this issue, cmiiw. Thanks!

    • @xray1111able
      @xray1111able 23 дні тому +1

      I think you're right. It's OK to put the \sigma(reward) separately when summing the whole trajectory's reward, but when the start state is s_t, \sigma(reward_to_go) should be put inside the previous sigma.
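
For reference, this is the reward-to-go form of the policy gradient the two comments above are discussing (as written on OpenAI's Spinning Up), with the sum over future rewards kept inside the outer sum over time steps:

    \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( \sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1}) \right) \right]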

  • @stephane-wamba
    @stephane-wamba 5 днів тому

    Hi Umar, your tutorials are very helpful, you can't imagine. Thank you a lot.
    Please consider also making some videos on normalizing flows and graph neural networks.

  • @amortalbeing
    @amortalbeing 11 місяців тому +2

    Thanks a lot man, keep up the great job.

  • @gemini_537
    @gemini_537 9 місяців тому

    Gemini: This video is about reinforcement learning from human feedback, a technique used to align the behavior of a language model to what we want it to output.
    The speaker says that reinforcement learning from human feedback is a widely used technique, though there are newer techniques like DPO.
    The video will cover the following topics:
    * Language models and how they work
    * Why AI alignment is important
    * Reinforcement learning from human feedback with a deep dive into:
      * What reinforcement learning is
      * The reward model
      * Trajectories
      * Policy gradient optimization
      * How to reduce variance in the algorithm
    * Code implementation of reinforcement learning from human feedback with PyTorch
    * Explanation of the code line by line
    The speaker recommends having some background knowledge in probability, statistics, deep learning, and reinforcement learning before watching this video.
    Here are the key points about reinforcement learning from human feedback:
    * It is a technique used to train a language model to behave in a certain way, as specified by a human.
    * This is done by rewarding the model for generating good outputs and penalizing it for generating bad outputs.
    * The reward model is a function that assigns a score to each output generated by the language model.
    * Trajectories are sequences of outputs generated by the language model.
    * Policy gradient optimization is an algorithm that is used to train the reinforcement learning model.
    * The goal of policy gradient optimization is to find the policy that maximizes the expected reward.
    I hope this summary is helpful!

  • @douglasswang998
    @douglasswang998 5 місяців тому

    Thanks for the great video. I wanted to ask: at 50:11 you mention that the reward of a trajectory is the sum of the rewards at each token of the response. But the reward model is only trained on full responses, so will the reward values at partial responses be meaningful?

  • @Jeff-gt5iw
    @Jeff-gt5iw 11 місяців тому +2

    Thank you so much :) Wonderful lecture 👍

  • @weicheng4608
    @weicheng4608 8 місяців тому

    Hello Umar, thanks for the amazing content. I got a question, could you please help me? At 1:56:40, for the KL penalty, why is it logprob - ref_logprob? The KL divergence formula is KL(P||Q) = sum(P(x) * log(P(x)/Q(x))), so logprob - ref_logprob only maps to log(P(x)/Q(x))? Isn't it missing the sum(P(x) * ...) part of KL(P||Q)? Thanks a lot.
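
On the question above: the expectation over P(x) is not missing, it is taken by sampling. Since each token was sampled from the policy P itself, logprob - ref_logprob is a single-sample Monte Carlo estimate of KL(P||Q) at that position. A minimal sketch of how the per-token penalty is then folded into the reward, loosely following TRL-style code (kl_coef and the tensor names are illustrative assumptions, not the video's exact code):

    import torch

    def compute_rewards(logprobs, ref_logprobs, scores, kl_coef=0.2):
        # logprobs, ref_logprobs: (batch, seq_len) log-probs of the SAMPLED tokens
        # under the current policy and the frozen reference model.
        # scores: (batch,) reward-model score for each full response.
        kl = logprobs - ref_logprobs      # per-token Monte Carlo estimate of KL(policy || ref)
        rewards = -kl_coef * kl           # KL penalty applied at every token
        rewards[:, -1] += scores          # reward-model score added only on the last token
        return rewards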

  • @SethuIyer95
    @SethuIyer95 11 місяців тому

    So, to summarize:
    1) We copy the LLM, add a linear layer on top, and fine-tune it with the -log(sigmoid(good - bad)) loss to obtain the reward model (see the sketch below); a value head is set up in a similar way.
    2) We then keep another frozen copy of the LLM and optimize the policy against the reward model, while keeping its log probabilities close to the frozen model's via the KL divergence.
    3) We also add a bit of an exploration (entropy) factor, so that the model retains some creativity.
    4) We then sample a batch of trajectories, use rewards-to-go (past rewards are not changed), and compare each reward with the reward of the average action (the baseline/value) to get a sense of the gradient of increasing rewards w.r.t. trajectories.
    In the end, we will have a model which is not so different from the original model but prioritizes trajectories with higher values.
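
A minimal version of the pairwise reward-model loss -log(sigmoid(r_good - r_bad)) mentioned in point (1) above (Bradley-Terry style, as in InstructGPT; the tensor names are illustrative):

    import torch
    import torch.nn.functional as F

    def reward_model_loss(r_chosen, r_rejected):
        # r_chosen, r_rejected: (batch,) scalar scores from the LM backbone + linear head
        # for the human-preferred and the rejected response, respectively.
        return -F.logsigmoid(r_chosen - r_rejected).mean()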

  • @s8x.
    @s8x. 8 місяців тому

    50:27 why is it the hidden state for the answer tokens but earlier it was just for the last hidden state?

  • @MasterMan2015
    @MasterMan2015 5 місяців тому +1

    Amazing as usual.

  • @godelkurt384
    @godelkurt384 4 місяці тому +1

    I am unclear about offline policy learning. How do we calculate the online logits of a trajectory? For example, if the offline trajectory is "where is Paris? Paris is a city in France.", then this string is passed as input to the online model, which is the same as the offline one, to get the logits, but aren't the logits of the two models the same in this case? Please correct my misunderstanding.

  • @alainrieger6905
    @alainrieger6905 8 місяців тому +1

    Hi, best ML online teacher, just one question to make sure I understood well:
    Does it mean we need to store the weights of three models:
    - the original LLM (offline policy), which is regularly updated
    - the updated LLM (online policy), which is updated and will be the final version
    - the frozen LLM (used for the KL divergence), which is never updated
    Thanks in advance!

    • @umarjamilai
      @umarjamilai  8 місяців тому +2

      Offline and online policy are actually the same model, but it plays the role of "offline policy" or "online policy" depending if you're collecting trajectories or you're optimizing. So at any time, you need two models in your memory: a frozen one for KL divergence, and the model you're optimizing, which is first sampled to generate trajectories (lots of them) and then optimized using said trajectories. You can also precalculate the log probabilities of the frozen model for the entire fine-tuning dataset, so that you only keep one model in memory.

    • @tryit-wv8ui
      @tryit-wv8ui 8 місяців тому

      @@umarjamilai Hmm, Ok I was missing that

    • @alainrieger6905
      @alainrieger6905 8 місяців тому

      @@umarjamilai thank you so much
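
A small, self-contained toy illustrating Umar's reply above in this thread: one and the same parameter set plays the offline role (sampled once, without gradients) and the online role (optimized for several epochs on those stored trajectories). Everything here is illustrative, not the video's code; the advantages are random stand-ins.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    policy = nn.Linear(4, 3)                      # toy policy: state -> 3 action logits
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

    states = torch.randn(64, 4)                   # toy "prompts"
    with torch.no_grad():                         # offline role: collect trajectories once
        dist = torch.distributions.Categorical(logits=policy(states))
        actions = dist.sample()
        old_logprobs = dist.log_prob(actions)
    advantages = torch.randn(64)                  # stand-in for GAE advantages

    for epoch in range(4):                        # online role: reuse the same trajectories
        dist = torch.distributions.Categorical(logits=policy(states))
        ratio = (dist.log_prob(actions) - old_logprobs).exp()
        clipped = torch.clamp(ratio, 0.8, 1.2)
        loss = -torch.min(ratio * advantages, clipped * advantages).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()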

  • @baomao139
    @baomao139 10 днів тому

    I have a question regarding off-policy learning. It still samples the mini-batches several times and calculates/updates gradients over k epochs. Why is it more efficient than directly sampling mini-batches from the online policy k times?

  • @tryit-wv8ui
    @tryit-wv8ui 10 місяців тому +1

    You are becoming a reference in the YouTube machine learning game. I appreciate your work so much. I have so many questions. Do you coach? I can pay.

    • @umarjamilai
      @umarjamilai  10 місяців тому

      Hi! I am currently super busy between my job, my family life and the videos I make, but I'm always willing to help people, you just need to prove that you've put effort yourself in solving your problem and I'll guide you in the right direction. Connect with me on LinkedIn! 😇 have a nice day

    • @tryit-wv8ui
      @tryit-wv8ui 10 місяців тому

      @@umarjamilai Hi Umar, thks for your quick answer! I will do it.

  • @RishabhMishra-h5g
    @RishabhMishra-h5g Місяць тому

    For the first minibatch in off-policy learning, the ratio of online to offline probabilities would be 1, right? It's only after the first minibatch pass that the online policy starts producing different log probas for the action tokens.

  • @andreanegreanu8750
    @andreanegreanu8750 8 місяців тому +1

    Hi Sir Jamil, again thanks a lot for all your work, it's amazing. However, I'm somewhat confused about how the KL divergence is incorporated into the final objective function. Is it possible to see it this way for one batch of trajectories: J(theta) = PPO(theta) - Beta*KL(Pi_frozen || Pi_new)?
    Or do we have to take it into account when computing the cumulative rewards, by subtracting Beta*KL(Pi_frozen || Pi_new) from each reward?
    Or is it equivalent?
    I'm completely lost. Thanks for your help Sir!
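
The two views in the comment above end up being the same thing. Common TRL-style implementations (and, as far as I understand, the code walked through in the video) fold the penalty into the per-token reward,

    r_t = \mathbb{1}[t = T]\, R_\phi(x, y) \;-\; \beta \big( \log \pi_\theta(a_t \mid s_t) - \log \pi_{\text{ref}}(a_t \mid s_t) \big)

so summing r_t over a trajectory gives back R_\phi(x, y) - \beta\,\widehat{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}), i.e. the objective-level form J(theta) = PPO(theta) - Beta*KL, with the KL replaced by its sampled (Monte Carlo) estimate.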

  • @rajansahu3240
    @rajansahu3240 4 місяці тому

    Hi Umar, absolutely stunning tutorial, but towards the end I have a little doubt I wanted to clarify: the entire token-generation setting makes this a sparse-reward RL problem, right?

  • @pauledam2174
    @pauledam2174 2 місяці тому

    I have a question. At around minute 50 he discusses rewards at intermediate tokens in the reply. Doesn't this go against the so-called "token credit assignment problem"?

  • @vimukthisadithya6239
    @vimukthisadithya6239 Місяць тому

    Hi, may I know what's the hardware spec that you are using ?

  • @陈镇-j5j
    @陈镇-j5j 2 місяці тому

    Thanks for the wonderful video. I’ve got a question: is the same transformer backbone shared between the policy and the reward model in the LLM? Why?

    • @umarjamilai
      @umarjamilai  2 місяці тому

      The reward model is a separate model and can have any structure (most of the time it’s just a copy of the LM with a linear layer on top), while the policy is of course the model you’re trying to optimize. So you need three ingredients: the reward model (can be anything), the frozen model and the policy.

  • @tk-og4yk
    @tk-og4yk 11 місяців тому

    Amazing as always. I hope your channel keeps growing and more people learn from you. I am curious how we can use this optimized model to give it prompts and see what it comes up with. Any advice how to do so?

  • @xingfang8507
    @xingfang8507 11 місяців тому +3

    You're the best!

  • @gangs0846
    @gangs0846 11 місяців тому +1

    Absolutely fantastic

  • @Parad0x0n
    @Parad0x0n 11 днів тому

    What I don't understand is that you switch from the objective function J to the loss function L without flipping the sign (and I don't find it intuitive that loss ~ probability pi * advantage A).
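
The sign flip is implicit in going from "maximize J" to "minimize a loss": gradient ascent on J is gradient descent on L = -J. With the PPO probability ratio r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\text{old}}(a_t \mid s_t), the clipped loss reads

    L^{\text{CLIP}}(\theta) = -\,\hat{\mathbb{E}}_t \left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \right]

and the "probability times advantage" shape is just the surrogate whose gradient at \theta = \theta_{\text{old}} reproduces the policy gradient \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t.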

  • @abhinav__pm
    @abhinav__pm 11 місяців тому

    Bro, I want to fine-tune a model for a translation task. However, I encountered a ‘CUDA out of memory’ error. Now, I plan to purchase a GPU from AWS ec2 instance. How is the payment processed in AWS? They asked for card details when I signed up. Do they automatically process the payment?

  • @SangrezKhan
    @SangrezKhan 6 місяців тому

    Good job Umar. Can you please tell us which font you used in your slides?

  • @andreanegreanu8750
    @andreanegreanu8750 8 місяців тому

    There is something I found very confusing. It seems that the value function shares the same theta parameters as the LLM. That is very unexpected. Can you confirm this please? Thanks in advance.

  • @MiguelAcosta-p8s
    @MiguelAcosta-p8s 4 місяці тому +1

    very good video!

  • @YKeon-ff4fw
    @YKeon-ff4fw 11 місяців тому

    Could you please explain why in the formula mentioned at the 39-minute mark in the bottom right corner of the video, the product operation ranges from t=0 to T-1, but after taking the logarithm and differentiating, the range of the summation becomes from t=0 to T? :)

    • @umarjamilai
      @umarjamilai  11 місяців тому

      I'm sorry, I think it's just a product of laziness. I copied the formulas from OpenAI's "SpinningUp" website and didn't check carefully. I'll update the slides. Thanks for pointing out!

  • @RudraPratapDhara
    @RudraPratapDhara 11 місяців тому +1

    Legend is back

  • @zhouwang2123
    @zhouwang2123 11 місяців тому

    Thanks for your work and sharing, Umar! I learn new stuff from you again!
    Btw, does the KL divergence play a similar role to the clipped ratio, in preventing the new policy from moving far away from the old one? Additionally, unlike actor-critic in RL, here it looks like the policy and value functions are updated simultaneously. Is this because of the partially shared architecture and for computational efficiency?

    • @umarjamilai
      @umarjamilai  11 місяців тому

      When fine-tuning a model with RLHF, before the fine-tuning begins, we make another copy of the model and freeze its weights.
      - The KL divergence forces the fine-tuned and frozen model to be "similar" in their log probabilities for each token.
      - The clipped ratio, on the other hand, is not about the fine-tuned model and the frozen one, but rather, the offline and the online policy of the PPO setup.
      You may think that we have 3 models in total in this setup, but actually it's only two because the offline and the online policy are the same model, as explained in the "pseudo-code" of the off-policy learning. Hope it answers your question.
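
A tiny illustration of the distinction drawn in the reply above (tensor names and numbers are made up):

    import torch

    logprob     = torch.tensor([-1.2, -0.7])   # current policy, on the sampled tokens
    ref_logprob = torch.tensor([-1.0, -0.9])   # frozen pre-RLHF model
    old_logprob = torch.tensor([-1.1, -0.8])   # snapshot used to collect the trajectories (offline role)

    kl_penalty = logprob - ref_logprob                   # goes into the per-token reward (fine-tuned vs. frozen model)
    ratio = (logprob - old_logprob).exp()                # goes into the PPO objective (online vs. offline policy)
    clipped_ratio = torch.clamp(ratio, 1 - 0.2, 1 + 0.2)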

  • @elieelezra2734
    @elieelezra2734 8 місяців тому

    Can't thank you enough: your vids + ChatGPT = Best Teacher Ever. I have one question though, it might be silly but I want to be sure: does it mean that to get the rewards for all time steps, we need to run the reward model on all the right-truncated responses, so that each response token would at some point be the last token? Am I clear?

    • @umarjamilai
      @umarjamilai  8 місяців тому

      No, because of how transformer models work, you only need one forward step with all the sequence to get the rewards for all positions. This is also how you train a transformer: with only one pass, you can calculate the hidden state for all the positions and calculate the loss for all positions.

    • @andreanegreanu8750
      @andreanegreanu8750 8 місяців тому

      @@umarjamilai thanks a lot for all your time. I won't bother you till the next time, I promise, ahahaha
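
A minimal sketch of the point Umar makes above: a reward model built as a transformer backbone plus a linear head produces a score for every position in a single forward pass, so there is no need to re-run it on truncated prefixes. The backbone here is assumed to be a Hugging Face-style model returning last_hidden_state; all names are illustrative.

    import torch
    import torch.nn as nn

    class RewardModel(nn.Module):
        def __init__(self, backbone, hidden_size):
            super().__init__()
            self.backbone = backbone                      # e.g. a causal LM body returning hidden states
            self.score_head = nn.Linear(hidden_size, 1)   # one scalar score per position

        def forward(self, input_ids, attention_mask=None):
            hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
            rewards = self.score_head(hidden).squeeze(-1) # (batch, seq_len): a score for every prefix
            last_reward = rewards[:, -1]                  # full-sequence score (index the last non-pad token if padding is used)
            return rewards, last_reward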

  • @kei55340
    @kei55340 11 місяців тому

    Is the diagram shown at 50 minutes accurate? I had thought that with typical RLHF training, you only calculate the reward for the full completion rather than summing rewards for all intermediate completions.
    Edit: It turns out this is addressed later in the video.

    • @umarjamilai
      @umarjamilai  11 місяців тому +1

      In the vanilla policy gradient optimization, you can calculate it for all intermediate steps. In RLHF, we only calculate it for the entire sentence. If you watch the entire video, when I show the code, I explicitly clarify this.

    • @kei55340
      @kei55340 11 місяців тому

      @@umarjamilai Thanks for the clarification, I haven't watched the whole video yet.

  • @heepoleo131
    @heepoleo131 10 місяців тому

    Why is the PPO loss different from the RL objective in InstructGPT? At least the pi(old) in the PPO loss is iteratively changing, but in InstructGPT it's kept as the SFT model.

  • @Bearsteak_sea
    @Bearsteak_sea 18 днів тому

    Can I understand the main difference between RLHF and DPO as follows: in RLHF we need the reward model to convert the preference labels into a scalar value for the loss function, while in DPO we don't need that conversion step?

    • @umarjamilai
      @umarjamilai  18 днів тому +1

      Exactly. In DPO the reward model is implicit
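
For reference, DPO makes that implicit reward explicit in its loss: beta times the policy-to-reference log-ratio is plugged straight into the same pairwise logistic loss that would otherwise train a separate reward model,

    \mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]

where y_w and y_l are the preferred and rejected responses.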

  • @MR_GREEN1337
    @MR_GREEN1337 11 місяців тому +1

    Perfect!! With this technique introduced, can you provide us with another gem on DPO?

    • @umarjamilai
      @umarjamilai  9 місяців тому

      You're welcome: ua-cam.com/video/hvGa5Mba4c8/v-deo.html

  • @parthvashisht9555
    @parthvashisht9555 10 місяців тому +1

    You are amazing!

  • @andreanegreanu8750
    @andreanegreanu8750 8 місяців тому

    Hi Umar, sorry to bother you (again). I think I understood the J function well, which we want to maximize. But it seems you quickly state that it is somewhat equivalent to the L_ppo function that we want to minimize. It may be obvious, but I really don't get it.

  • @chrisevans2241
    @chrisevans2241 Місяць тому +1

    GodSend Thank You!

  • @generichuman_
    @generichuman_ 10 місяців тому

    I'm curious if this can be done with stable diffusion. I'm imagining having a dataset of images that a human would go through with pair ranking to order them in terms of aesthetics, and using this as a reward signal to train the model to output more aesthetic images. I'm sure this exists, just haven't seen anyone talk about it.

  • @alexandrepeccaud9870
    @alexandrepeccaud9870 11 місяців тому +1

    This is great

  • @flakky626
    @flakky626 7 місяців тому

    I followed the code and could understand some of it, but the thing is I feel overwhelmed seeing such large code bases...
    When will I be able to code stuff like that at such a scale!!

  • @Healthphere
    @Healthphere 4 місяці тому +1

    The font size is too small to read in VS Code. But great video!

  • @alivecoding4995
    @alivecoding4995 4 місяці тому

    And why is Deep Q-learning not necessary here?

  • @tubercn
    @tubercn 11 місяців тому +2

    💯

  • @baomao139
    @baomao139 19 днів тому

    is there a place to download the slides?

  • @TechieIndia
    @TechieIndia 5 місяців тому

    A few questions:
    1) Offline policy learning makes the training fast, but how would we have done it without offline policy learning? I mean, I am not able to understand the difference between the usual way of doing it and why the offline approach is more efficient.

  • @wongdope4147
    @wongdope4147 10 місяців тому +1

    What a treasure of a channel!!!!!!!!

    • @umarjamilai
      @umarjamilai  10 місяців тому

      Thank you for your support, let's connect on LinkedIn.

  • @pranavk6788
    @pranavk6788 10 місяців тому

    Can you please cover V-JEPA by Meta AI next? Both theory and code

  • @goelnikhils
    @goelnikhils 7 днів тому +1

    Damn good

  • @EsmailAtta
    @EsmailAtta 11 місяців тому

    Can you make a video of coding the diffusion transformer from scratch as always please

  • @IsmailNajib-pf9ty
    @IsmailNajib-pf9ty Місяць тому

    What about TRPO?

  • @dhanesh123us
    @dhanesh123us 10 місяців тому

    These videos are amazing @umar jamil. This is fairly complex theory that you have managed to get into and explain in simple terms, hats off. Your video inspired me to take up a Coursera course on RL. Thanks a ton.
    A few basic queries though:
    1. My understanding is that the theta parameters in the PPO algo are all the model parameters? So we are recalibrating the LLM in some sense.
    2. Is the reward model pre-defined?
    3. Also, how does temperature play a role in this whole setup?

  • @Gasa7655
    @Gasa7655 11 місяців тому +2

    DPO Please

    • @umarjamilai
      @umarjamilai  9 місяців тому +2

      Done: ua-cam.com/video/hvGa5Mba4c8/v-deo.html

  • @vasusachdeva3413
    @vasusachdeva3413 19 днів тому +1

  • @vardhan254
    @vardhan254 11 місяців тому

    LETS GOOOOO

  • @kevon217
    @kevon217 9 місяців тому +1

    “drunk cat” model 😂

  • @esramuab1021
    @esramuab1021 9 місяців тому

    Why don't you explain in Arabic? We Arabs need Arabic resources; what's available in English isn't enough!

    • @ehsanzain5999
      @ehsanzain5999 7 місяців тому

      Because learning in English is generally better. If you need anything, I can answer you.

  • @davehudson5214
    @davehudson5214 10 місяців тому

    'Promosm'

  • @Bearsteak_sea
    @Bearsteak_sea Місяць тому

    Thanks!

  • @sbansal23
    @sbansal23 28 днів тому

    Thanks!

  • @xugefu
    @xugefu 26 днів тому

    Thanks!