ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained)
- Published 30 Apr 2024
- Paper: arxiv.org/abs/2403.07691
Abstract:
While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on AlpacaEval2.0 (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B).
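To make the "monolithic" objective above concrete, here is a minimal PyTorch sketch of how the ORPO loss can be assembled. This is only my reading of the paper, not the authors' released code: the lambda weight, the use of mean per-token (length-normalized) log-probabilities, and all function and argument names are assumptions.

import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    # chosen_logps / rejected_logps: mean per-token log-probabilities of the
    # preferred (y_w) and dispreferred (y_l) responses under the model being
    # trained, shape (batch,). No reference model is involved anywhere.

    # SFT term: ordinary negative log-likelihood on the preferred response
    sft_loss = -chosen_logps.mean()

    # log odds(y|x) = log p - log(1 - p), with p = exp(mean log-prob)
    def log_odds(logps):
        return logps - torch.log1p(-torch.exp(logps))

    # L_OR = -log sigmoid( log( odds(y_w|x) / odds(y_l|x) ) )
    log_odds_ratio = log_odds(chosen_logps) - log_odds(rejected_logps)
    or_loss = -F.logsigmoid(log_odds_ratio).mean()

    # Monolithic objective: SFT plus a small odds-ratio penalty on y_l
    return sft_loss + lam * or_loss

The odds-ratio term only adds the "minor penalty for the disfavored generation style" the abstract mentions, which is why a small lambda relative to the SFT term is enough.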
Authors: Jiwoo Hong, Noah Lee, James Thorne
Links:
Homepage: ykilcher.com
Merch: ykilcher.com/merch
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
LinkedIn: / ykilcher
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Science & Technology
Glad you're back to technical content this time. Any AI YouTuber can give us the latest AI news, but you're just about the only one who can give technical insight into the stories.
6 videos in 7 days, I'm having a holiday and this is such a perfect-timing treat.
Thank you for being awesome Yannic, I send people from the classes that I "TA" for to you because you're reliably strong with your analysis.
The main loss function (7) looks like it can be meaningfully simplified with school-level math.
Lor = -log(sigm( log ( odds(y_w|x) / odds(y_l|x)))), where sigm(a) = 1/(1 + exp(-a)) = exp(a) / (1 + exp(a))
Let's assume that both odds(y_w|x) and odds(y_l|x) are positive (because softmax)
By plugging in the sigmoid, we get
Lor = - log (exp(log(odds(y_w|x) / odds(y_l|x) )) / (1 + exp(log(odds(y_w|x) / odds(y_l|x)))) )
Note that exp(log(odds(y_w|x) / odds(y_l|x))) = odds(y_w|x) / odds(y_l|x). We use this to simplify:
Lor = - log( [odds(y_w|x) / odds(y_l|x)] / (1 + odds(y_w|x) / odds(y_l|x)) )
Finally, multiply both numerator and denominator by odds(y_l|x) to get
Lor = - log(odds(y_w|x) / (odds(y_w|x) + odds(y_l|x)) )
Intuitively, this is the negative log-probability of (the odds of good response) / (odds of good response + odds of bad response ).
If you minimize the average loss over multiple texts, it's the same as maximizing the probability (in the above sense) that the model picks the winning response in every (winning, losing) pair. (A quick numeric check of this simplification is sketched after the replies below.)
Good job! I suppose you mean `odds(y_l|x)` instead of `odds(y_l)` in the final equation.
@peterszilvasi752 thanks! good catch :) /* fixed the previous comment */
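For anyone who wants to sanity-check the simplification above, a quick numeric comparison in plain Python (arbitrary made-up odds values, nothing taken from the paper):

import math

def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))

# arbitrary positive odds for the winning (y_w) and losing (y_l) responses
for odds_w, odds_l in [(3.0, 1.0), (0.2, 5.0), (1e-3, 2.7)]:
    original   = -math.log(sigm(math.log(odds_w / odds_l)))
    simplified = -math.log(odds_w / (odds_w + odds_l))
    print(original, simplified)  # the two values agree up to float rounding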
very cool! thank you for this
26:30 that 'really?' and the following struggle with basic math is WAAAAY too relatable
I really like the more technical content from you. I usually read tech news on Telegram, and your ML News episodes are great but fairly plain and simple. Paper explanations like this have real impact on the DS community: such videos seed new ideas and deepen understanding of the field for those who are trying to dive in. Of course it's less popular because the material is harder for the audience, but it's much more interesting. So thank you for this format.
Great to see research from my homeland of South Korea represented!
Woo allegiance to tribes!!... .. ..
do you know Seoul?
There is only one Korea
Nice I was waiting for this after you mentioned ORPO in ML News :))
Thanks for explaining basic terms along with the more complex stuff, for dilettantes like myself. Cheers.
Thx again yan! 🎉
You are on fire!
That log of probability is also a power transform often used to narrow or widen a distribution.
You should make a video just focusing on the log and explaining its role in neural networks.
Thank you Mr Kilcher for delving into the paper ORPO: Monolithic Preference Optimization without Reference Model
Keep them comin
Nice!
"Specifically, 1 - p(y|x) in the denominators amplifies the gradients when the corresponding side of the likelihood p(y|x) is low". I think that (1 - p(y|x)) have two different meanings here: it can be the result of differentiation by coincidence and also the "corresponding side" of the likelihood, i.e., 1 - p(y|x). So, when it says the "corresponding side" of p(y|x) is low, it means that 1 - p(y|x) is low.
I liked the self-deprecation at 32:00 haha
What's going on, is it Yannic bonanza time of the year?! Loving these addicting videos
Wow good timing to go on youtube
great! now apply ORPO to a reward model and round we go!
27:57
“the corresponding side”
Maybe they mistakenly switched the w and l conditions in the denominators?
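To put numbers on both comments above about the (1 - p(y|x)) factors, here is a small sympy check of the partial derivatives of L_OR with respect to the two sequence likelihoods. The closed forms being compared against are my own derivation, not something stated in the paper, so treat them as an assumption to verify rather than the authors' claim:

import sympy as sp

pw, pl = sp.symbols("p_w p_l", positive=True)  # likelihoods of y_w and y_l

odds = lambda p: p / (1 - p)
z = sp.log(odds(pw)) - sp.log(odds(pl))        # log odds ratio
L_or = -sp.log(1 / (1 + sp.exp(-z)))           # -log sigmoid(z)

# hand-derived gradients: a shared factor (1 - sigmoid(z)) divided by p*(1-p),
# so each side's gradient does carry a (1 - p(y|x)) in its denominator
shared = odds(pl) / (odds(pw) + odds(pl))      # equals 1 - sigmoid(z)
grad_w = -shared / (pw * (1 - pw))
grad_l = shared / (pl * (1 - pl))

# compare the hand-derived forms against sympy's derivatives at a few points
for vw, vl in [(0.2, 0.1), (0.7, 0.3), (0.05, 0.6)]:
    point = {pw: vw, pl: vl}
    assert abs(float((sp.diff(L_or, pw) - grad_w).subs(point))) < 1e-10
    assert abs(float((sp.diff(L_or, pl) - grad_l).subs(point))) < 1e-10
print("hand-derived gradients match")

Whether that (1 - p) in the denominator "amplifies" the gradient in the way the paper describes is exactly the interpretation question raised above; this check only confirms where the factor sits.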
Would be interesting to see how it compares to KTO. I would guess that KTO outperforms it and is easier to implement, as you don't need pairs of inputs.
There seems to be a conceptual problem: where are the preferences coming from, given that they are expressed over multiple responses to the same prompt? Suppose we wish to fine-tune a foundation model for chat; we would not have the preferences before having done SFT and gathered some responses to prompts in the chat-template format. That would force us to do SFT first and then SFT + odds-ratio loss. Doable, but surely not a single-pass approach.
Where do y_w and y_l come from? Are they from the training dataset, or does the LLM being trained generate them, with humans or reward models labelling them as W and L?
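As far as I can tell from the abstract, y_w and y_l come from a pre-collected preference dataset (UltraFeedback), not from generations of the model being trained; the model just scores both during fine-tuning. A rough sketch of what one training record looks like, with hypothetical field names and made-up content:

# hypothetical example of one preference record as ORPO would consume it:
# a prompt x, a preferred response y_w ("chosen"), and a dispreferred
# response y_l ("rejected"), all fixed before training starts
preference_record = {
    "prompt": "Explain the difference between a list and a tuple in Python.",
    "chosen": "A list is mutable, so you can append or remove items; "
              "a tuple is immutable and is often used for fixed-size records.",
    "rejected": "They are basically the same thing, just spelled differently.",
}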
0:52 : I wish we had a different term for this other than “alignment”
"Preference tuning" is used to describe it pretty often
@TheRyulord thanks!
makes me think of PPO
The comparison at the end between OR and PR should also discuss the influence of the log sigmoid, no? And, more importantly, what the gradients for the winning and losing outputs would actually look like with these simulated pairs... It feels a bit hand-wavy why the log sigmoid of the OR should be the target...
I don't even know what the title of this video means 😵💫. But I'm going to watch anyway.
Can you do one on Kolmogorov-Arnold Network from MIT
Yannic, can you do the xLSTM paper?
I feel like AI models have gotten more stale and same-y ever since RLHF became the norm. Playing around with GPT-3 was wild times. Hopefully alignment moves in a direction with more diverse ranges of responses in the future, and less censorship in domains where it's not needed.
LLMs are what Machine Learning has always been: input output. Quality data makes the cake…. no matter how many fancy mixers you bring to the table.
why hat, indeed
*posts videos almost every day*
*KAN paper dropped, disappears for 2 weeks*
I hope you're alright man 🫂🤗