John Schulman - Reinforcement Learning from Human Feedback: Progress and Challenges

  • Published 29 Jun 2024
  • EECS Colloquium
    Wednesday, April 19, 2023
    Banatao Auditorium
    5-6p

COMMENTS • 23

  • @Jack-vv7zb
    @Jack-vv7zb 1 year ago +46

    Here is a summary of the key points from the talk:
    • John discusses the issue of hallucination and factual accuracy with large language models. He argues that behavior cloning or supervised learning is not enough to avoid the hallucination problem. Reinforcement learning from human feedback can help improve the model's truthfulness and ability to express uncertainty.
    • His conceptual model is that language models have some internal "knowledge graph" stored in their weights. Fine-tuning allows the model to output correct answers based on that knowledge. But it also leads the model to hallucinate when it lacks knowledge.
    • John claims that models do know about their own uncertainty based on the probability distributions they output (see the sketch after this summary). However, incentivizing the model to properly express uncertainty in words remains an open problem. The current reward-model methodology does not perfectly capture hedging and uncertainty.
    • Retrieval and citing external sources can help improve verifiability and fact-checking during training. John discusses models that can browse the web to answer technical questions, citing relevant sources.
    • Open problems include how to train models to express uncertainty in natural language, how to go beyond what human labelers can easily verify, and how to optimize for true knowledge rather than human approval.
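
    A minimal illustration of the probability-distribution point above (not from the talk; "gpt2", Hugging Face transformers, and the example prompt are arbitrary stand-ins): score a candidate answer by the log-probabilities the model itself assigns to its tokens. A low average log-probability is one crude signal that the model is unsure, even before any RLHF is applied.

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2")
        model.eval()

        prompt = "Q: Who wrote the novel Middlemarch?\nA:"
        answer = " George Eliot"

        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        answer_ids = tokenizer(answer, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, answer_ids], dim=1)

        with torch.no_grad():
            log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)

        # Logits at position t - 1 predict the token at position t, so read off
        # the log-probability of each answer token given everything before it.
        answer_positions = range(prompt_ids.shape[1], input_ids.shape[1])
        token_logps = [log_probs[0, t - 1, input_ids[0, t]].item()
                       for t in answer_positions]
        print("mean answer log-prob:", sum(token_logps) / len(token_logps))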

    • @mirwox
      @mirwox 1 year ago

      Thanks for the great summary!

    • @arirahikkala
      @arirahikkala 1 year ago +10

      Main point that was missed in the summary IMO: The finetune has to match what the model knows.
      If you tell it to say facts it doesn't know (even if the labeller does), you're teaching it to make stuff up, and if you tell it to say "I don't know" when it actually does know the answer (even if the labeller doesn't), you're teaching it to withhold knowledge.

  • @cassanolucas
    @cassanolucas 1 year ago +1

    Thank you for sharing!

  • @AlgoNudger
    @AlgoNudger 1 year ago

    Thanks.

  • @blueblimp
    @blueblimp 1 year ago

    Thanks for uploading. It's a very clear and interesting talk.

  • @buzzkidclub1627
    @buzzkidclub1627 1 year ago

    ❤Thx

  • @karanbirchahal3268
    @karanbirchahal3268 1 year ago

    Wow

  • @gsm1
    @gsm1 1 year ago +2

    If I'm understanding correctly, there needs to be some RL agent whose job is to fine-tune the model.

  • @johan.j.bergman
    @johan.j.bergman 10 months ago

    Open problem: valuing the weight of evidence. Just because something is widely believed and repeated doesn't mean it's true.

  • @royalfalcon2021
    @royalfalcon2021 6 months ago

    There is some high-frequency sound coming from the video.
    I don't know if anyone else can hear it.

  • @devaggarwal1220
    @devaggarwal1220 1 year ago +1

    How can I access this inner monologue with the ChatGPT browsing plugin?

  • @wege8409
    @wege8409 5 months ago

    Great lecture, but there is some kind of buzzing. Is there any way to EQ the audio?

  • @HenryMilner
    @HenryMilner 1 year ago +24

    The talk content starts around 16:14.

  • @mbrochh82
    @mbrochh82 1 year ago +6

    Here is a ChatGPT summary of John's talk:
    - Welcome to the fifth seminar in the Berkeley AI series, hosted by Ken and featuring John Schulman, a Berkeley graduate and co-founder of OpenAI.
    - John is the inventor of modern deep learning based policy gradient algorithms, including Trust Region Policy Optimization and Proximal Policy Optimization.
    - John's talk focuses on the problem of truthfulness in language models, which often make things up convincingly.
    - John proposes a conceptual model of what's going on when two neural nets are used for question answering tasks, which involves a knowledge graph stored in the weights of the neural net.
    - John claims that any attempt to train a model with behavior cloning will result in a hallucination problem, as the correct target depends on the knowledge in the network, which is unknown to the experimenter.
    - John suggests that reinforcement learning may be part of the solution to fixing the truthfulness problem.
    - Language models can be trained to output their state of knowledge with the correct amount of hedging and expressing uncertainty.
    - Models can be trained to minimize log loss, which is a proper scoring rule, and this results in a model that is calibrated and can output reasonable probabilities (see the numerical check after this list).
    - Models can be trained with RL from human feedback to learn when to say 'I don't know' and how much to hedge.
    - ChatGPT is an instruction following model from OpenAI that uses a similar methodology with RL from human feedback.
    - Evaluations of the model show that it is improving on factuality metrics.
    - Retrieval in the language model context means accessing an external source of knowledge.
    - Retrieval is important for verifiability, as it allows humans to easily check the accuracy of a model's answer.
    - WebGPT was a project that focused on a narrower type of question answering, where the model would do research online and answer questions.
    - ChatGPT is an alpha product that uses the same methods as WebGPT, but only browses when it doesn't know the answer.
    - An open problem is how to incentivize the model to accurately express its uncertainty in words.
    - Another open problem is how to go beyond what labelers can easily do, as it is hard to check a long answer about a technical or niche subject.
    - John discussed the idea that it is often easier to verify that a solution is correct than to generate a correct solution.
    - He discussed the P versus NP problem and how a weak agent can provide an incentive to a strong agent to solve a hard class of problems.
    - He discussed the idea of delegating tasks and using mechanism design to set up incentives.
    - He discussed the difficulty of rating answers when there is no fixed answer and the idea of redirecting the question to a more factual one.
    - He discussed the idea of using an inner monologue format for interpretability and the potential theoretical concerns with it.
    - He discussed the difference in capabilities of the model when the knowledge is inside or outside of the model.
    - He discussed the conflict between training the model not to withhold information in open domain contexts and not producing unsupported information in closed domain contexts.
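
    A small numerical check of the proper-scoring-rule point above (mine, not from the talk; the true probability 0.7 is arbitrary): if an event really occurs with probability p, the expected log loss is minimized by reporting q = p, which is why log-loss training pushes a model toward calibrated probabilities.

        import numpy as np

        # Expected log loss when the true probability of the event is p
        # and the model reports probability q.
        def expected_log_loss(p, q):
            return -(p * np.log(q) + (1 - p) * np.log(1 - q))

        p = 0.7                           # arbitrary "true" probability
        qs = np.linspace(0.01, 0.99, 99)  # candidate reported probabilities
        losses = [expected_log_loss(p, q) for q in qs]
        print(round(float(qs[int(np.argmin(losses))]), 2))  # 0.7: honest reporting minimizes the loss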

    • @KathyiaS
      @KathyiaS 1 year ago +1

      If you listen closely, the "knowledge graph" bit is an analogy. GPT does not contain or rely on a knowledge graph in the proper sense.

  • @avimohan6594
    @avimohan6594 1 year ago +12

    John Schulman starts his talk at 18:18

  • @moisesespiritosanto2195
    @moisesespiritosanto2195 1 year ago

    Hi, I'm from São Paulo!

  • @DistortedV12
    @DistortedV12 1 year ago +10

    This is making me think…what if OpenAI already* solved “Open problem III”

    • @thundie
      @thundie 1 year ago

      There's no doubt they're running experiments on it predicting world facts. It's a wonderful research direction.

  • @metalim
    @metalim 3 months ago

    video is interlaced? wtf? replace potato with a camera already
