Objective Mismatch in Reinforcement Learning from Human Feedback
- Published 10 Feb 2025
Abstract:
Reinforcement learning from human feedback (RLHF) has been shown to be a powerful framework for data-efficient fine-tuning of large machine learning models toward human preferences. RLHF is a compelling candidate for tasks where quantifying goals in a closed-form expression is challenging, enabling progress on tasks such as reducing hate speech in text or cultivating specific styles of images. While RLHF has been shown to be instrumental in recent successes with large language models (LLMs) for chat, its experimental setup is known to have a set of unaddressed limitations. The talk will center on the objective mismatch issue in RLHF, also referred to as proxy objectives (drawing inspiration from the analogous issue in model-based RL), which arises when one objective is optimized in the hope that a second, often uncorrelated, metric will also be optimized. In RLHF, the directly optimized metric is the reward from the preference model, which is assumed to be correlated with downstream user preferences or LLM benchmarks. This mismatch can be compounded with repeated training and releases of models, which act as an outer loop of feedback in the optimization process.
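To make the mismatch concrete, here is a minimal synthetic sketch (an illustration, not material from the talk): a policy is tuned by gradient ascent on a proxy reward that approximates the true ("gold") objective only near the starting point. The functions gold_reward and proxy_reward and all constants are illustrative assumptions.

```python
# Illustrative sketch of objective mismatch (not code from the talk):
# a "policy" parameter vector is tuned by gradient ascent on a proxy
# reward that is only a local approximation of the true objective.
import numpy as np

rng = np.random.default_rng(0)
d = 8
theta = np.zeros(d)            # policy parameters, starting at the reference policy
w = rng.normal(size=d)         # direction that both rewards initially favor

def gold_reward(theta):
    # Stand-in for the true downstream objective: improves along w but
    # degrades as the policy drifts far from the initial distribution.
    return w @ theta - 0.1 * np.sum(theta**2) ** 1.5

def proxy_reward(theta):
    # Stand-in for the learned preference model: a linear fit that is
    # accurate near theta = 0 but increasingly wrong far away.
    return w @ theta

lr = 0.05
for step in range(200):
    theta += lr * w            # gradient of the proxy reward
    if step % 40 == 0:
        print(f"step {step:3d}  proxy {proxy_reward(theta):9.2f}  "
              f"gold {gold_reward(theta):9.2f}")
```

Running this, the proxy reward rises monotonically while the gold reward peaks early and then falls: optimizing the proxy past the region where it correlates with the true objective is the mismatch the talk describes.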
This talk will focus on how such a mismatch can appear in the optimization process by retracing the construction of an RLHF training framework for chatbots. It will detail experimental signs of the mismatch and potential evaluation tools for mitigating its effects. The talk will conclude with future research directions for more directly optimizing downstream tasks.
Bio:
Nathan Lambert is a research scientist on AI2's AllenNLP team. At the time of this talk, Nathan was a Research Scientist and RLHF team lead at HuggingFace. He received his PhD from the University of California, Berkeley, working at the intersection of machine learning and robotics. He was advised by Professor Kristofer Pister in the Berkeley Autonomous Microsystems Lab and Roberto Calandra at Meta AI Research, and he interned at Facebook AI and DeepMind during his PhD. Nathan was awarded the UC Berkeley EECS Demetri Angelakos Memorial Achievement Award for Altruism for his efforts to improve community norms.