Where did OpenAI say they didn't use tree search? I think they do use tree search, specifically MCTS, for generating the synthetic data for o1; at inference time they don't use tree search. The magic is in creating the synthetic data: they take a variety of paths from the tree search, including some wrong ones, and chain them together with phrases like "but wait, the above is getting me stuck. Let's try this instead" before jumping to another branch of the tree (which frequently does lead to the correct answer).
The key, in my opinion, is MCTS + "let's verify step by step": they linearize the MCTS thought chains and train on that. Somewhere in there RL is also used as another key ingredient.
Looking forward to hearing your thoughts.
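A minimal sketch of what that linearization step might look like, purely as an illustration: the `Node` class, the `solved` flag, and the exact backtrack phrase are my own assumptions, not anything OpenAI has confirmed. It walks a toy search tree depth-first, keeps dead-end branches in the output, and splices in the backtrack phrase before moving to the branch that verifies.
```python
# Hypothetical sketch of the "linearize the search" idea. The node structure
# and backtrack phrase are illustrative assumptions, not OpenAI's pipeline.
from dataclasses import dataclass, field

@dataclass
class Node:
    step: str                                 # one reasoning step in natural language
    children: list = field(default_factory=list)
    solved: bool = False                      # does this node reach a verified answer?

BACKTRACK = "But wait, the above is getting me stuck. Let's try this instead."

def linearize(node: Node, trace: list) -> bool:
    """Depth-first walk that keeps wrong branches in the trace, joined by a
    backtrack phrase, and stops once a solved branch is reached."""
    trace.append(node.step)
    if node.solved and not node.children:
        return True
    for i, child in enumerate(node.children):
        if i > 0:
            trace.append(BACKTRACK)           # we came back from a dead end
        if linearize(child, trace):
            return True
    return False

# Tiny example tree: one dead-end branch, then the branch that verifies.
root = Node("Let's verify step by step. Try factoring the expression.",
            children=[
                Node("Assume x = 2... that contradicts the constraint."),
                Node("Instead, substitute x = 3, which satisfies every equation.",
                     solved=True),
            ])

trace = []
linearize(root, trace)
print("\n".join(trace))   # the linearized chain you'd train on
```
The printed trace is the kind of linearized chain-with-backtracking text one could imagine fine-tuning on.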
One more thing: take a look at Sasha Rush's video "Speculations on o1", where he describes four possible approaches and explains the stream-of-search approach. There are a number of problems with that approach, such as collapse and loss of generality (as you noted experiencing yourself), but their "secret sauce" could really just be a lot of hard work to overcome these issues and scale up the techniques.
Great content, thanks so much for sharing all of your videos!
Prompt that makes 4o behave like o1:
```
[Problem]
Do it out by the following format, taking care to reflect, verify, clarify all assumptions:
###Thoughts###
...
###Final Answer###
```
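For anyone who wants to try that prompt programmatically, here is a minimal sketch using the official `openai` Python client (v1+); the model name and the example problem are placeholders, swap in your own.
```python
# Minimal usage sketch of the prompt template above. The model name and the
# example problem are illustrative only.
from openai import OpenAI

TEMPLATE = """{problem}

Do it out by the following format, taking care to reflect, verify, clarify all assumptions:
###Thoughts###
...
###Final Answer###"""

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": TEMPLATE.format(problem="What is 17 * 24?")}],
)
print(response.choices[0].message.content)
```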
What is the purpose of generating synthetic data from the model and then using it to improve that same model? Wouldn't the synthetic data contain exactly the same biases as the model? How do you remove the inherent bias? More importantly, if it can already produce expert data, why fine-tune it on that data again, given the model was already able to produce it?
Does this feel like CoT or ReAct with extra steps?
@_PranavDesai You can actually use chain-of-thought prompting to get the model to output more detailed steps, which it may not do natively because web data isn't usually in that format.
That understanding of reasoning steps can be transferred across domains by fine-tuning on it, resulting in a model that can do reasoning/chain of thought natively, without the prompt.
In most cases, you have a ground-truth dataset to check whether the answer obtained by reasoning is correct, so you can be more assured (though not 100%) that the model is generating the right reasoning traces.
Btw, I myself do not believe models can actually reason like humans, but these reasoning traces serve as a chain of thought that guides better generation, so they play an important role.
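A rough sketch of that filtering idea (keep only the traces whose final answer matches the ground truth, then fine-tune on those), in the spirit of STaR / rejection sampling. `generate_cot` is a placeholder for whatever sampling call you use, not a real API, and the answer format is assumed.
```python
# Illustrative sketch: filter sampled reasoning traces by answer correctness.
import re

def extract_final_answer(trace: str) -> str:
    """Pull whatever follows 'Final Answer:' out of a reasoning trace."""
    match = re.search(r"Final Answer:\s*(.+)", trace)
    return match.group(1).strip() if match else ""

def build_finetune_set(problems, generate_cot, samples_per_problem=4):
    """Keep only traces whose final answer matches the ground truth.

    problems: iterable of (problem_text, gold_answer) pairs
    generate_cot: callable that samples one reasoning trace for a problem
    """
    kept = []
    for problem, gold in problems:
        for _ in range(samples_per_problem):
            trace = generate_cot(problem)            # sample one CoT from the model
            if extract_final_answer(trace) == gold:  # verifiable-answer filter
                kept.append({"prompt": problem, "completion": trace})
                break                                # one good trace is enough
    return kept
```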
One important reason for producing synthetic data from the model itself is that it reflects the model's own knowledge; otherwise you would be feeding in knowledge from another source that the model knows nothing about. Since we want models to be honest, meaning they should learn what they do and don't know, this self-generated data is the best way to make them hallucinate less.