Your explanation is helpful
Hi Mr. Serrano! I am taking your Coursera course on linear algebra for machine learning at the moment and I am having so much fun! You are a brilliant teacher, and I just wanted to say thank you! I wish more teachers would bring theoretical mathematics down to a more practical level. Obviously loving the very expensive fruit examples :)
Thank you so much @Cathiina, what an honor to be part of your learning journey, and I’m glad you like the expensive fruit examples! :)
Thank you very much for the video!
Do I understand correctly that RLHF still has some advantages, namely that by using it we can gather a small amount of human preference data, and then, after training a reward model on that data, the reward model can itself evaluate many more new examples?
So by training the reward model, we basically get a free human annotator that can rate endless new examples.
In the case of DPO, however, we only have the initial human preference data and that's it.
Thanks for sharing. Is there any hands-on resource to try DPO?
Great video as always. I have a question: in practice, which one works best, DPO or RLHF?
Thank you! From what I've heard, DPO works better, as it trains the network directly instead of using RL and two networks.
@SerranoAcademy Thank you, sir, for the great work. Your Coursera courses have been awesome.
Really love the way you broke down the DPO loss; this direct way is much more welcome by my brain :). Just one question on the video: I am wondering how important it is to choose the initial transformer carefully. I suspect that if it is very bad at the task, then we will have to change the initial response a lot, but because the loss function prevents us from changing too much in one iteration, we will need to perform a lot of tiny changes toward the good answer, making the training extremely long. Am I right?
Thank you, great question! This method is used for fine-tuning, not specifically for training. In other words, it's crucial that we start with a fully trained model. For training, you'd use normal backpropagation on the transformer, and lots of data.
Once the LLM is trained and very trusted, then you use DPO (or RLHF) to fine-tune it (meaning, post-train it to get from good to great). So we should assume that the model is as trained as it can be, and that's why we trust the LLM and only try to change it marginally.
If we were to use this method to train a model that's not fully trained... I'm not 100% sure it would work. It may or may not, but we'd still have to penalize the KL divergence much less. Also, human feedback gives a lot less data than scraping the whole internet, so I would still not use this as a training method, more as a refinement.
Let me know if you have more questions!
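In case a concrete sketch helps, here is a tiny runnable illustration of the setup described in this reply: start from an already trained model, keep a frozen copy as the reference, and fine-tune only the policy. The nn.Linear module is just a toy stand-in for the trained LLM, not a real implementation.

```python
import copy
import torch
import torch.nn as nn

# Toy stand-in for an already trained LLM; a real setup would load a
# pretrained transformer here instead of a single linear layer.
policy = nn.Linear(8, 8)

# Keep a frozen copy as the reference model (pi_ref). It never changes,
# so the fine-tuning loss can measure how far the policy drifts from
# the trusted starting point.
reference = copy.deepcopy(policy)
for p in reference.parameters():
    p.requires_grad_(False)

# Only the policy's parameters are updated during DPO fine-tuning.
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)
```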
@SerranoAcademy Thanks for the answer, I understand it better now. I forgot that this design is for fine-tuning.
@SerranoAcademy Thank you, that was also one of my doubts: the transformer should already be well trained so that we can use DPO 😅
I'm a little confused about one thing: the reward function, even in the Bradley-Terry model, is based on the human-given scores for individual context-prediction pairs, right? And πθ is the probability from the current iteration of the network, and πRef is the probability from the original, untuned network?
So then after that "mathematical manipulation", how does the human-given set of scores become represented by the network's predictions all of a sudden?
Same, I was also wondering about that. I think it may look incomplete because this is the DPO loss for just one training pair at a time, and the human evaluator keeps rating responses during training to find better ones 😅. It looks tedious, but I think the main idea is that we only train one neural network at a time. If I've got it wrong, please correct me.
@peace-it4rg If you find an answer, please ping me, thanks!!
I'm also confused here. It seems that in DPO, reward is still an *input* to the Bradley-Terry probabilities. I thought the reason RLHF trained a reward model was to abstract this human preference so that it can be applied to data not explicitly rated by humans. How does representing that reward in the form of a probability obviate the need for abstraction?
In RLHF:
- Train a reward model to predict human preferences
- Use that reward model to guide policy optimization on new examples
- Requires two separate models (policy and reward)

In DPO:
- Takes human preference data directly
- Uses Bradley-Terry to convert preference scores into probabilities
- Optimizes the model to directly match these preference probabilities while staying close to the reference model
- Only needs one model

The key insight is that DPO doesn't try to explicitly generalize the reward function. Instead, it relies on two mechanisms for generalization:
- The model's own ability to generalize from the preference data through direct optimization
- The KL divergence term that keeps the model close to the reference model's behavior on unseen examples

So while DPO does use human rewards as input, it doesn't need a separate reward model because it:
- Directly incorporates the preferences into the loss function
- Uses the KL divergence constraint to prevent overfitting to just the preference data
- Lets the language model itself learn to generalize preferences through its parameters

The tradeoff is that DPO may require more direct preference data upfront, but it's more efficient by avoiding the need to train and maintain a separate reward model.
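In case it helps to see the above concretely, here is a minimal PyTorch-style sketch of the DPO loss on a batch of preference pairs. The function name, the beta value, and the toy log-probabilities are made up for illustration; a real implementation would compute the log-probabilities by running the policy and the frozen reference model over the chosen and rejected answers.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is the summed log-probability of the chosen / rejected
    answer under the trainable policy (pi_theta) or the frozen reference
    model (pi_ref). beta controls how strongly we penalize drifting away
    from the reference model.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each answer
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)

    # Negative Bradley-Terry log-likelihood that the chosen answer
    # beats the rejected one
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up log-probabilities (illustrative only)
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss)  # lower loss means the preferences are better matched
```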
Thanks for the simplified explanation. Awesome as always.
The book link in the description is not working.
Thank you so much! And thanks for letting me know, I’ll fix it
I think the DPO main equation shown in the video should actually be the PPO main equation.
Did anyone else expect something different from softmax for the Bradley-Terry model, like I did? 😅
lol, I was expecting something different too initially 🤣
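For anyone reading along, a tiny sketch of why it turns out to be softmax: for two answers, the Bradley-Terry probability is a softmax over their reward scores, which reduces to a sigmoid of the reward difference. The reward values below are made up just to show the equivalence.

```python
import math

def bradley_terry(r1: float, r2: float) -> float:
    # Softmax form: P(answer 1 beats answer 2) = exp(r1) / (exp(r1) + exp(r2))
    return math.exp(r1) / (math.exp(r1) + math.exp(r2))

def sigmoid_form(r1: float, r2: float) -> float:
    # Equivalent sigmoid form: 1 / (1 + exp(-(r1 - r2)))
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))

print(bradley_terry(2.0, 0.5))  # ~0.8176
print(sigmoid_form(2.0, 0.5))   # same value
```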
It's kinda hard to remember all of these formulas and it's demotivating me from further learning.
You do not have to remember those formulas. You only have to understand the logic in them.
Thanks for your comment @VerdonTrigance! I also can't remember these formulas, since to me, they are the worst way to convey information. That's why I like to see it with examples. If you understand the example and the idea underneath, then you understand the concept. Don't worry about the formulas.
Agreed @javiergimenezmoya86!