Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math

  • Published 30 May 2024
  • In this video I will explain Direct Preference Optimization (DPO), an alignment technique for language models introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model".
    I start by introducing language models and how they are used for text generation. After briefly introducing the topic of AI alignment, I review Reinforcement Learning (RL), a topic that is necessary for understanding the reward model and its loss function.
    I derive step by step the loss function of the reward model under the Bradley-Terry model of preferences, a derivation that is missing in the DPO paper.
    Using the Bradley-Terry model, I build the loss of the DPO algorithm, not only explaining its mathematical derivation but also giving intuition on how it works (the resulting formulas are summarized at the end of this description).
    In the last part, I describe how to use the loss practically, that is, how to calculate the log probabilities using a Transformer model, by showing how it is implemented in the Hugging Face library.
    DPO paper: Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S. and Finn, C., 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36. arxiv.org/abs/2305.18290
    If you're interested in how to derive the optimal solution to the RL constrained optimization problem, I highly recommend the following paper (Appendix A, equation 36):
    Peng XB, Kumar A, Zhang G, Levine S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177. 2019 Oct 1. arxiv.org/abs/1910.00177
    Slides PDF: github.com/hkproj/dpo-notes
    Chapters
    00:00:00 - Introduction
    00:02:10 - Intro to Language Models
    00:04:08 - AI Alignment
    00:05:11 - Intro to RL
    00:08:19 - RL for Language Models
    00:10:44 - Reward model
    00:13:07 - The Bradley-Terry model
    00:21:34 - Optimization Objective
    00:29:52 - DPO: deriving its loss
    00:41:05 - Computing the log probabilities
    00:47:27 - Conclusion
  • Science & Technology
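
    For reference, here is a summary of the main formulas derived in the video, following the notation of the DPO paper: x is the prompt, y_w the preferred ("winning") answer, y_l the rejected ("losing") answer, sigma the sigmoid function, beta a hyperparameter controlling the deviation from the reference model, and pi_ref the frozen reference model.

    % Bradley-Terry model: probability that y_w is preferred over y_l, given a reward r(x, y)
    P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)
                            = \frac{\exp r(x, y_w)}{\exp r(x, y_w) + \exp r(x, y_l)}

    % Loss of the reward model r_\phi (negative log-likelihood under the Bradley-Terry model)
    \mathcal{L}_R(r_\phi) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
        \Bigl[ \log \sigma\bigl( r_\phi(x, y_w) - r_\phi(x, y_l) \bigr) \Bigr]

    % DPO loss: the reward is reparameterized through the policy \pi_\theta and the reference \pi_{\mathrm{ref}}
    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
        -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
        \Bigl[ \log \sigma\Bigl( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                               - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Bigr) \Bigr]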

COMMENTS • 51

  • @nwanted
    @nwanted 3 days ago +1

    Thanks so much Umar, I always learn a lot from your videos!

  • @Patrick-wn6uj
    @Patrick-wn6uj 1 month ago +7

    The legend returns, always excited for your videos. I am an international student at Shanghai Jiao Tong University. Your videos have given me a very strong foundation in transformers. Many blessings your way

    • @umarjamilai
      @umarjamilai  1 month ago +3

      Let's connect on LinkedIn; I have a small WeChat group you can join.

    • @user-kg9zs1xh3u
      @user-kg9zs1xh3u 1 month ago

      @@umarjamilai I'd like to join too

    • @user-kg9zs1xh3u
      @user-kg9zs1xh3u 1 month ago

      @@umarjamilai I also saw that you have an account on Bilibili

  • @sauravrao234
    @sauravrao234 1 month ago +4

    I humbly request you to make videos on how to build a career in machine learning and AI. I am a huge fan of your videos and I thank you for all the knowledge that you have shared.

    • @umarjamilai
      @umarjamilai  1 month ago +5

      Hi! I will for sure make a video in the future about my personal journey. I hope that can help more people in navigating their own journeys. Have a nice day!

  • @user-hd7xp1qg3j
    @user-hd7xp1qg3j 1 month ago +5

    The legend is back, the GOAT. If my guess is right, the next one will be ORPO or Q*

    • @umarjamilai
      @umarjamilai  1 month ago +13

      Actually, the next video is going to be a totally new topic not related specifically to language models. Stay tuned!

    • @olympus8903
      @olympus8903 1 month ago

      @@umarjamilai waiting

  • @mahdisalmani6955
    @mahdisalmani6955 15 days ago +1

    Thank you very much for this video, please make ORPO as well.

  • @mlloving
    @mlloving 1 month ago +2

    Thank you! It's a very clear explanation. It helps with reading the original paper. Looking forward to new topics.

  • @olympus8903
    @olympus8903 1 month ago +1

    My kind request: please increase the volume a little bit, just a little bit. Otherwise your videos are outstanding. Best I can say.

  • @luxorska5143
    @luxorska5143 1 month ago +4

    Wow, your explanation is so clear and complete... you are a godsend, keep doing it. You're a phenomenon

  • @631kw
    @631kw 1 month ago +2

    Thanks for making these videos. Concise and clear

  • @vanmira
    @vanmira 29 days ago +1

    These lectures are amazing. Thank you!

  • @kmalhotra3096
    @kmalhotra3096 1 month ago +2

    Amazing! Great job once again!

  • @lukeskywalker7029
    @lukeskywalker7029 1 month ago

    New video 🎉 Can't wait to watch, although I've been using DPO in production for a while now!

  • @mrsmurf911
    @mrsmurf911 15 days ago +1

    Love from India sir, you are a legend 😊😊

  • @SaiKiran-jc8yp
    @SaiKiran-jc8yp 1 month ago +1

    Best explanation so far !!!!...

  • @abdullahalsaadi5991
    @abdullahalsaadi5991 1 month ago

    Amazing explanation. Would it be possible to make a video on the theory and implementation of automatic differentiation (autograd)?

  • @tuanduc4892
    @tuanduc4892 23 days ago

    Thanks for your lecture. I wonder if you could explain vision-language models.

  • @elieelezra2734
    @elieelezra2734 1 day ago

    Hello Umar,
    Great as usual. However, why do you say at 46:11 that you need to sum the log probabilities up? The objective function is the expectation of the logarithm of the difference of two weighted log-probability ratios. I don't get what you want to sum up exactly. Thank you
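
    A note on this point, since the same question comes up often: in the DPO loss, pi(y|x) is the probability of the entire response, i.e. the product of the per-token probabilities, so its logarithm is the sum of the per-token log probabilities; that sum is what is being computed in the "Computing the log probabilities" chapter. Below is a minimal sketch of this computation with a Hugging Face causal language model (illustrative only; it is not the actual Hugging Face/TRL code, and the function and tensor names are assumptions):

    import torch
    import torch.nn.functional as F

    def sequence_log_prob(model, input_ids, attention_mask, labels):
        # input_ids: (batch, seq_len) tokens of prompt + response
        # labels:    (batch, seq_len) copy of input_ids with prompt and padding
        #            positions set to -100, so only response tokens count
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        logits = logits[:, :-1, :]        # token at position t is predicted from positions < t
        labels = labels[:, 1:].clone()
        mask = labels != -100
        labels[labels == -100] = 0        # dummy index so gather() is valid
        per_token_logps = torch.gather(
            F.log_softmax(logits, dim=-1), dim=2, index=labels.unsqueeze(2)
        ).squeeze(2)
        # log pi(y|x) = sum of the per-token log probabilities of the response
        return (per_token_logps * mask).sum(dim=-1)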

  • @jak-zee
    @jak-zee 1 month ago +1

    Enjoyed the style in which the video is presented. Which video editor/tools do you use to make your videos? Thanks.

    • @umarjamilai
      @umarjamilai  1 month ago +1

      I use PowerPoint for the slides and Adobe Premiere for video editing.

    • @jak-zee
      @jak-zee 1 month ago

      @@umarjamilai What do you use to draw on your slides? I am assuming you connected an iPad to your screen.

  • @vardhan254
    @vardhan254 1 month ago +1

    Love your videos, Umar!!

  • @AptCyborg
    @AptCyborg 1 month ago

    Amazing video! Please do one on SPIN (Self-Play Fine-Tuning) as well

  • @TemporaryForstudy
    @TemporaryForstudy 1 month ago +1

    Great video. Love from India.

  • @sidward
    @sidward 1 month ago +2

    Thanks for the great video! Very intuitive explanation and particular thanks for the code examples. Question: at 37:41, how do we know that solving the optimization problem will yield pi_*? Is there a guaranteed unique solution?

    • @umarjamilai
      @umarjamilai  1 month ago +1

      Please check the paper I linked in the description for a complete derivation of the formula. It is also done in the DPO paper, but in my opinion the other paper is better suited for this particular derivation.
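
      For reference, the closed-form solution being referred to, as given in the DPO paper and derived in Appendix A (equation 36) of the AWR paper linked in the description, is the following (beta is the KL-constraint strength and Z(x) a normalization constant):

      % Optimal policy of the KL-constrained reward-maximization problem
      \pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Bigl(\tfrac{1}{\beta}\, r(x, y)\Bigr),
      \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Bigl(\tfrac{1}{\beta}\, r(x, y)\Bigr)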

  • @tommysnowy3068
    @tommysnowy3068 1 month ago

    Amazing video. Would it be possible for you to explain video-transformers or potential guesses at how Sora works? Another exciting idea is explaining GFlowNets

  • @ernestbeckham2921
    @ernestbeckham2921 1 month ago

    Thank you. Can you make a video about liquid neural networks?

  • @user-if9tm1co9e
    @user-if9tm1co9e 1 month ago

    Great explanation, thanks. How about the recent work KTO: Model Alignment as Prospect Theoretic Optimization? Can you compare it with DPO? 😁

  • @mohammadsarhangzadeh8820
    @mohammadsarhangzadeh8820 26 days ago

    I love your videos so much. Please make a video about Mamba or Mamba Vision.

    • @umarjamilai
      @umarjamilai  26 days ago

      There's already a video about Mamba, check it out

  • @OGIMxGaMeR
    @OGIMxGaMeR 1 month ago

    Thank you very much for the explanation.
    I had one question: are preference datasets always made of two and only two answers?

    • @umarjamilai
      @umarjamilai  1 month ago

      According to the Hugging Face library, yes: it looks like you need a dataset with a prompt and two answers, one called the "chosen" one and the other the "rejected" one. I'm pretty sure there are ways to convert more than two preferences into a dataset of pairs.
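
      As a rough illustration of the format described above, a preference dataset for DPO is typically a list of rows, each with a prompt, a "chosen" answer and a "rejected" answer (the exact column names expected by the Hugging Face TRL DPOTrainer should be checked in its documentation; the example below is a sketch with made-up data):

      from datasets import Dataset

      # One prompt with one preferred ("chosen") and one rejected answer per row.
      preference_data = Dataset.from_list([
          {
              "prompt": "Where is Shanghai?",
              "chosen": "Shanghai is a city in China.",
              "rejected": "Shanghai is a city on the Moon.",
          },
          # Rankings over more than two answers can be flattened into
          # one (chosen, rejected) pair per row.
      ])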

    • @OGIMxGaMeR
      @OGIMxGaMeR 1 month ago

      @@umarjamilai Thank you! Yes, of course. I am just wondering why it wouldn't help to have more than one rejected answer per accepted one. I guess the formula does not consider this case, but it may add value.

  • @lokeshreddypolu250
    @lokeshreddypolu250 3 days ago

    Thanks for the video. Do you know of any way to create a dataset for DPO training? I currently have only question-answer pairs. Is it fine if I take y_w as the answer and y_l as some random text (which would obviously have lower preference than the answer) and then train on that?

    • @lokeshreddypolu250
      @lokeshreddypolu250 3 days ago

      The potential problem that I think could happen is that having random text may decrease the loss while the policy may not even change much.

  • @ai.mlvprasad
    @ai.mlvprasad 1 month ago

    What presentation software are you using, sir?

  • @nguyenhuuuc2311
    @nguyenhuuuc2311 1 month ago

    Hi Umar,
    If I use LoRA for fine-tuning a chat model with the DPO loss, what should I use as the reference model?
    - The chat model with LoRA applied
    - Or the chat model itself, without LoRA?

    • @umarjamilai
      @umarjamilai  1 month ago

      Considering that LoRA is just a way to "store" fine-tuned weights with a smaller computation/memory footprint, the model WITHOUT LoRA should be used as the reference model.
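
      A sketch of one way to put this into practice: with PEFT/LoRA the adapters can be switched off temporarily, so the frozen base model can serve as the reference model without keeping a second full copy in memory. This is only an illustration under the assumption that the policy is a peft.PeftModel and that a helper like sequence_log_prob (as sketched in an earlier note) is available:

      import torch

      def policy_and_reference_logps(policy, input_ids, attention_mask, labels):
          # Log probabilities under the fine-tuned policy (base weights + LoRA adapters)
          logps_policy = sequence_log_prob(policy, input_ids, attention_mask, labels)
          # Log probabilities under the reference model: disable the LoRA adapters
          # so only the original, frozen base weights are used
          with torch.no_grad(), policy.disable_adapter():
              logps_ref = sequence_log_prob(policy, input_ids, attention_mask, labels)
          return logps_policy, logps_ref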

    • @nguyenhuuuc2311
      @nguyenhuuuc2311 1 month ago

      @@umarjamilai With my limited GPU, I can only fine-tune by combining a 4-bit-quantized model + LoRA. Surprisingly, using just the 4-bit model leads to NaN weight updates after one batch. But once LoRA is added, my loss updates smoothly without any problems.

    • @nguyenhuuuc2311
      @nguyenhuuuc2311 1 month ago

      Thank you SO much for the quick answer and your excellent video. I got the hang of the DPO loss and was able to implement the DPO loss + training loop in vanilla PyTorch code.
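
      For anyone attempting the same exercise, here is a minimal sketch of the DPO loss in plain PyTorch, assuming the four per-sequence log probabilities (policy and reference, chosen and rejected) have already been computed, e.g. with a function like the one sketched earlier; the names and the default beta are illustrative:

      import torch.nn.functional as F

      def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps, beta=0.1):
          # beta * [ log pi(y_w|x)/pi_ref(y_w|x) - log pi(y_l|x)/pi_ref(y_l|x) ]
          chosen_logratio = policy_chosen_logps - ref_chosen_logps
          rejected_logratio = policy_rejected_logps - ref_rejected_logps
          logits = beta * (chosen_logratio - rejected_logratio)
          # DPO loss: -log sigmoid(logits), averaged over the batch
          return -F.logsigmoid(logits).mean()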

  • @trungquang1581
    @trungquang1581 1 month ago

    Thank you so much for your effort! Could you make a video about tokenizers like BPE and SentencePiece from scratch? I would really appreciate it!

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 1 month ago +1

    Thanks!

  • @samiloom8565
    @samiloom8565 1 month ago +1

    I enjoy your videos, Umar, on my phone while commuting or sitting in a coffee shop. Only the small font on a phone is tiring me; if you make it a bit bigger, that would be better.

    • @umarjamilai
      @umarjamilai  1 month ago +1

      Sorry for the trouble, I'll keep it in mind for the next videos!

  • @kevon217
    @kevon217 1 month ago

    “digital biscuits”, lol