We're in a spot where a serious person can seriously say "it's SIMPLY the model talking to itself until it solves the problem", and we enthusiasts shrug and move along. What a time to be alive.
But there is so much more to problem-solving than recursive iteration, isn't there? Humans solve problems using hypermodalities. Bodily sensations, sounds, smells, the gut biome, and emotional states all impact how we think. Then there are the more or less understood “a-ha!” moments, or trial-and-error lucky guesses where intuitive judgment makes the call. We also have subconscious processing during sleep tackling the most difficult problems we are stuck on, accompanied by cerebrospinal fluid flushing over our brain tissue. Then there are hungover days when creativity takes the lead for some (e.g., Hemingway). Good luck trying to introduce a central nervous system depressant like alcohol into an LLM and then get the best out of it, lol. I can only imagine how difficult it is to capture all these nuances in current or future LLM architectures. It almost seems like we need something else to augment LLMs with.
I love the fact that not only does this research exist, but someone went through the effort to distil it in such an intelligible way. Thank you!
Coming back a month later, this is still the goat
(We've gotten more of a consensus since then about what's happening, but who's to say that some of these strategies aren't (or couldn't be) used at train time [though, in usual OAI fashion, it's probably just the simple thing scaled up -- RL with ORMs]).
such a good overview - thank you for the insights, quite instructive and accessible
Very interesting summary, thanks a lot. My intuition is that evaluation/testing is where we can grow; that's where the low-hanging fruit is.
Hey, great video! I've been trying to wrap my head around O1 for a while, and this really helped me put things into perspective. I'm surprised I haven't seen more discussion about using special tokens for reasoning. It seems like trying to generate these really long, abstract sequences for reasoning can be difficult and hard to evaluate. I have a strong feeling that we could make LLMs more stable by using special tokens as anchors to keep them from going down the wrong path during reasoning.
Thank you so much for such an informative video 🙏🙏.
Thank you so much for your detailed information. 🙏
Thanks for creating this video
"Stream of Search" + "Let's Verify Step by Step" has looked the most likely to me. It might be that they just put their heads down, worked really hard to solve the collapse problems, and optimized for generalizability.
Regardless, amazing overview, thanks a bunch for sharing
Shouldn't the equation at 18:07 be E_(y~p(·|z_(1:t),x))[Ver(y)]? That is, with z_(1:t) added to the conditioning in the subscript of the expectation.
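For readers following this comment: the expectation it proposes, the expected verifier score of the final answer y given the question x and the reasoning prefix z_(1:t), can be approximated by sampling completions and averaging. Below is a minimal sketch of that Monte Carlo estimate. The names `estimate_value`, `toy_sampler`, and `toy_verifier` are hypothetical, and the toy sampler is a made-up stand-in for a real model, so this illustrates the quantity only and makes no claim about how o1 computes it.

```python
import random
from typing import Callable, Sequence


def estimate_value(
    sample_answer: Callable[[str, Sequence[str], random.Random], str],
    verify: Callable[[str], float],
    x: str,
    z_prefix: Sequence[str],
    n_rollouts: int = 2000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of E_{y ~ p(. | z_{1:t}, x)}[Ver(y)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        y = sample_answer(x, z_prefix, rng)  # draw y ~ p(. | z_{1:t}, x)
        total += verify(y)                   # score it with Ver(y)
    return total / n_rollouts


def toy_sampler(x: str, z_prefix: Sequence[str], rng: random.Random) -> str:
    # Assumption for illustration only: every reasoning step already taken
    # raises the chance that the final answer is correct.
    p_correct = min(1.0, 0.2 + 0.3 * len(z_prefix))
    return "42" if rng.random() < p_correct else "0"


def toy_verifier(y: str) -> float:
    # Automatic verifier: 1.0 for the known correct answer, 0.0 otherwise.
    return 1.0 if y == "42" else 0.0


v_no_steps = estimate_value(toy_sampler, toy_verifier, "toy question", [])
v_two_steps = estimate_value(
    toy_sampler, toy_verifier, "toy question", ["step 1", "step 2"]
)
print(v_no_steps, v_two_steps)
```

The only point of the toy is that the estimate depends on the prefix: with z_(1:t) in the conditioning, the same question gets a different value after different partial chains of thought.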
This is probably the 3rd time I've watched this since you posted it. I still can't shake it, you capture the 5Ws+H perfectly imo. It seems to have been overlooked in general, as I keep seeing people conflating things across models and architectures, which I find annoying. It could be a byproduct of the development cadence or the many hype trains littering the internet, but I always feel this is so significant that it's worth trying to portray it in such discussions.
I have an itching question and I'm curious what your thoughts are on it. Thinking beyond what you've covered: in the same manner that LLMs derive language patterns that are then leveraged, do you think that o1, judging by its TTC graphs and results, is deriving a more general heuristic pattern from the process, in addition to supplying the output? If so, wouldn't these general heuristics apply to different scopes and across domains, at least eventually?
For search, it is important to search over ideas. Not letters or tokens or words or sentences or paragraphs, but ideas. So an LLM needs to be able to output a token that says it has finished laying out an idea, and thus that a new idea can begin at this point. If an LLM is constantly interrupted at the lower levels, it can never fully finish the idea. That would also help battle the combinatorial explosion that makes search at lower levels intractable. It's like a human chess player who only considers a few moves vs. a brute-force algorithm that considers millions of moves that lead nowhere.
Agreed. Lots of choices though in how to actually build that. Need steps that cause tangible progress.
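To make the idea-level search suggestion above concrete, here is a small sketch. It assumes a hypothetical end-of-idea token `<eoi>` (not a token any real model is known to emit) and compares how many candidate sequences a search would face when branching on every token versus branching once per finished idea. The function names and all numbers are illustrative.

```python
from typing import List

END_OF_IDEA = "<eoi>"  # hypothetical special token, for illustration only


def split_ideas(tokens: List[str], end_token: str = END_OF_IDEA) -> List[List[str]]:
    """Group a token stream into ideas, cutting at each end-of-idea token."""
    ideas: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        if tok == end_token:
            if current:
                ideas.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        ideas.append(current)  # trailing idea that was never closed
    return ideas


def search_space_size(branching: int, depth: int) -> int:
    """Number of distinct paths when branching `branching` ways at each of `depth` levels."""
    return branching ** depth


trace = ["try", "x=2", "<eoi>", "check", "<eoi>", "so", "answer", "4"]
ideas = split_ideas(trace)

# Toy numbers: a 10-word vocabulary over 6 tokens, versus 3 candidate ideas
# at each of 2 idea boundaries.
token_level = search_space_size(branching=10, depth=6)
idea_level = search_space_size(branching=3, depth=2)
print(ideas, token_level, idea_level)
```

As the reply notes, the hard part is not the bookkeeping but deciding what counts as a step that makes tangible progress; the sketch only shows why coarser units shrink the search.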
This is fantastic work❤!
The x-axis of the o1 test-time compute plot is on a log scale, which means you need exponentially more compute for each linear improvement, so progress will grind to a halt.
Hence the 7 Tril bet
They apparently only just started scaling this. For example, there’s no reason that this couldn’t be applied to writing other than the fact that it is difficult to craft a reward signal for it. Saying that they’ll quickly hit a wall now would be like saying the same when we were at GPT-2. Sure, it’ll eventually happen, but we’re a ways off from it happening.
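The log-scale point in this thread can be worked through with a little arithmetic. The sketch below assumes accuracy is a straight line in log10(compute); the intercept and slope are invented for illustration and are not read off the o1 plot. Under that assumption, a fixed gain in accuracy always costs the same multiplicative factor in compute.

```python
import math


def accuracy(compute: float, intercept: float = 20.0, slope: float = 15.0) -> float:
    """Assumed log-linear trend: accuracy = intercept + slope * log10(compute)."""
    return intercept + slope * math.log10(compute)


def compute_multiplier(gain: float, slope: float = 15.0) -> float:
    """Factor by which compute must grow to gain `gain` accuracy points."""
    return 10 ** (gain / slope)


for c in (1, 10, 100, 1000):
    print(c, accuracy(c))

print(compute_multiplier(15.0))  # one slope's worth of accuracy
print(compute_multiplier(30.0))  # two slopes' worth
```

Both sides of the thread are compatible with this picture: each additional step of accuracy costs a constant factor more compute, and whether that counts as grinding to a halt depends on how many more factors of ten are affordable.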
Cool lecture, thanks!
🎉❤terrific video, thank you
I think the claim that it doesn't learn from expert examples is a stretch. They could have helped fine-tune the CoT mechanism by having people write out their thought processes while solving problems, especially for math and coding. Edit: I see it's addressed at 20:30.
Yeah I agree that there are expert examples somewhere in the training procedure. Wanted to emphasize that these play less of a role than I would have assumed before diving into this area (if you believe the OAI comments).
@DistortedV12 I think that to achieve scale, the data has to be generated by the model itself via a step-by-step prompt, and the correctness of the solution has to be easy to verify. For example, AIME problems have an integer answer between 0 and 999. One can then use process and advantage rewards on such a dataset.
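The self-generated, easily verified data described in this reply can be sketched as a simple filter: sample step-by-step solutions, extract the final integer, and keep only the ones that match the known answer. Everything below is hypothetical scaffolding: `toy_generate` stands in for a real model call, the "Answer:" format is an assumed convention, and only outcome checking is shown (the process and advantage rewards mentioned in the comment are not).

```python
import random
import re
from typing import Callable, List, Optional


def extract_answer(solution: str) -> Optional[int]:
    """Return the final 1-3 digit integer after 'Answer:', or None if absent."""
    match = re.search(r"Answer:\s*(\d{1,3})\s*$", solution)
    if match is None:
        return None
    return int(match.group(1))  # at most three digits, so within 0..999


def collect_verified(
    problem: str,
    reference: int,
    generate: Callable[[str, random.Random], str],
    n_samples: int,
    rng: random.Random,
) -> List[str]:
    """Keep only sampled solutions whose final answer matches the reference."""
    kept = []
    for _ in range(n_samples):
        solution = generate(problem, rng)
        if extract_answer(solution) == reference:
            kept.append(solution)
    return kept


def toy_generate(problem: str, rng: random.Random) -> str:
    # Stand-in for a model prompted to reason step by step; guesses are made up.
    guess = rng.choice([204, 17, 1000])
    return f"Step 1: set up. Step 2: compute.\nAnswer: {guess}"


kept = collect_verified(
    "toy AIME-style problem", 204, toy_generate, n_samples=30, rng=random.Random(0)
)
print(len(kept), "verified solutions kept out of 30")
```

The kept solutions would then serve as training data; the appeal of integer-answer problems is that the check is a string comparison rather than a judgment call.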
Brilliant!
That is awesome. It saved me lots of time. I am trying to use some of these techniques for the AIMO Kaggle contest. If anyone is interested drop me a message.
Did he mention that they use reasoning tokens?
Oh no I forgot to mention that! In my notation the reasoning token is how you know to move from z to y. It's kind of implied by the color changing from green to red.
Goat
Thinking LLMs from Meta, LLM-Berry, and the ARC-AGI paper from MIT on test-time training: can someone (an LLM), ideally Noam Brown or otherwise, comment on how these are related to what is discussed here?
* Thinking LLMs is quite related. It uses an LLM as the verifier (I was emphasizing automatic verifiers in this talk).
* LLM-Berry is an effort to do an MCTS-style search on existing Llama models without learning.
* The ARC-AGI paper that came out today seems really neat! They do SGD at test time, so it's pretty different from these methods, which only do CoT at test time.
@srush_nlp Thank you so much for responding to my questions! Great talk; I liked how you pointed out the core problem so other researchers can focus their efforts.
I find this ridiculous and remarkably improbable. Did you see the missed space in the example CoT from o1? That matches Sam Altman’s laid-back writing style; he’s clearly writing all the CoT at test time by hand.
Has to be process reward
Yeah, it definitely seems like that is part of the equation. The question is whether that is everything.
'The innovations driving rapid AI research' [9:41-12:34]
Test compute capability is still constrained by the data used for the RL training, which is harder to curate. You can give a D student an infinite amount of time on an exam and he is certainly not going to get an A.
Depends entirely on the verifier and the test.
But synthetic data can solve this constraint. Just have increasingly capable models create more synthetic data to allow further reinforcement learning, and so on.
@haiderameer9473 No, it doesn't, as it's still combinatorics at work; D -> A remains a challenge. No amount of recursive repetition in one domain, even over a seemingly infinite window of time, will make you an expert in another that you know little about.
@mossglow
I was really thinking about this, but it does seem to be working and producing better results.
I came to the idea that maybe models at or below GPT-4 are not the best distillations of all the knowledge they've been trained on, and that steering them toward paths that align with the intended results requires further optimizing that distillation toward an outlier reasoning that is better than the average reasoning. Basically, trying to distill toward expert-level human language proficiency. This should exist within the LLM's corpus of knowledge; it's just lost in an ocean of data.
I certainly don't have any idea what I'm talking about, I just follow A.I. news.
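One way to put numbers on the D-student debate in this thread: if each attempt is correct independently with probability p, and a perfect verifier can recognize a correct attempt, then the chance that at least one of N attempts succeeds is 1 - (1 - p)^N. Both independence and the perfect verifier are simplifying assumptions, and the function name is made up for this sketch.

```python
def best_of_n_success(p_single: float, n: int) -> float:
    """P(at least one of n independent attempts is correct), given a perfect verifier."""
    return 1.0 - (1.0 - p_single) ** n


# A student who is sometimes right benefits a lot from more attempts...
print(best_of_n_success(0.1, 1))
print(best_of_n_success(0.1, 50))

# ...but one who is never right gains nothing, however long the exam.
print(best_of_n_success(0.0, 1_000_000))
```

Under these assumptions both sides have a point: extra test-time compute amplifies whatever nonzero chance the model already has, which is why the outcome depends on the verifier and the test, but it cannot create ability where p is zero.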