So basically an AlphaGo Master (?) architecture? It seems like AlphaGo Zero was kind of an appendix, in the sense that it just got rid of its planner-driven System 2 in favor of a hugely overgrown System 1. That's good enough against humans, who can't possibly analyse that many possibilities either and who often revert to System 1 themselves. But maybe that's actually an inferior architecture for generalization, at least until somebody actually makes progress on NN-driven System 2s.
I think you have shown that LLMs cannot reason or plan by your definition of planning. But they can compose essays, and doesn't the very act of composing an essay involve a kind of planning -- the organization of ideas, breaking them down into paragraphs, and expressing them through carefully chosen words? They seem to be doing planning, just in the domain of words and linguistically expressed ideas.
What you describe can be extracted statistically: given enough essay training data, you can learn where to put which words to produce something that looks like a convincing essay, without ever really thinking about creating an essay.
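To make that concrete, here's a toy sketch (purely illustrative, nobody's actual model): a trigram model that "places words" from co-occurrence counts alone, with no outline, goal, or plan behind it.

```python
# Toy illustration: a trigram model that "places words" purely from
# co-occurrence counts. It has no plan or intent; it only knows which
# word tends to follow which pair of words in its training text.
import random
from collections import defaultdict

training_text = (
    "the essay begins with a thesis . the thesis is developed in "
    "paragraphs . each paragraph supports the thesis . the essay "
    "ends with a conclusion ."
).split()

# Count every observed continuation of each two-word context.
continuations = defaultdict(list)
for a, b, c in zip(training_text, training_text[1:], training_text[2:]):
    continuations[(a, b)].append(c)

# Generate by repeatedly sampling a plausible next word.
state = ("the", "essay")
output = list(state)
for _ in range(15):
    next_word = random.choice(continuations.get(state, ["."]))
    output.append(next_word)
    state = (state[1], next_word)

print(" ".join(output))
```

Scale the same idea up by many orders of magnitude and the output can look organized without any explicit planning step ever happening.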
Prof. Rao, I've had a short discussion with Liron Shapira, and we were wondering if you feel strongly enough about this argument to make a prediction about what GPT-5 *won't* be able to do. Assuming GPT-5 is just a bigger transformer with more training data, more parameters, and better RLHF, would you predict that it still won't be able to solve your Randomized Mystery Blocksworld problems past, say, 10%?
Does solving 10% of the problems make it impressive?
@@billykotsos4642 Maybe not impressive, but it would be surprising. At 20:16, Rao shows that GPT-4 only gets to 2% on the Randomized Mystery Blocksworld, while humans solve it at close to 100%. Going from 2% to 10% would at least be a bit of a signal that there's more to transformer-based LLMs than expected.
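For anyone unfamiliar with the benchmark: as I understand it, the "Mystery" variants keep the planning problem identical but consistently rename the actions and objects, so surface-level pattern matching stops helping. A rough sketch of the idea (the token names below are made up, not the paper's actual obfuscation):

```python
# Rough sketch of the "Mystery Blocksworld" idea: consistently rename
# the domain vocabulary with meaningless tokens. The underlying
# planning problem is unchanged; only the surface form differs.
# (Token names here are invented for illustration.)
import random

plain = "unstack A from B . put-down A . pick-up C . stack C on B ."
vocabulary = ["unstack", "from", "put-down", "pick-up", "stack", "on",
              "A", "B", "C"]

rng = random.Random(0)
obfuscated = [f"tok{n}" for n in rng.sample(range(100), len(vocabulary))]
rename = dict(zip(vocabulary, obfuscated))

mystery = " ".join(rename.get(word, word) for word in plain.split())
print(plain)
print(mystery)
```

A solver that actually reasons over the action semantics does equally well on both versions, which is why the 2% vs. near-100% gap is the telling number.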
@@BrianPeiris I wonder how o1 would fare here.
@@billykotsos4642 Indeed, or o3 for that matter. I would also like to see updated stats on the Blocksworld problems, but the ARC-AGI scores for o3 are pretty surprising. Chollet thinks that ARC-AGI-2 will bring the scores down considerably though, so it's possible that Blocksworld is still a challenge.
@@BrianPeiris I just had a look, and there is a new paper on arXiv by the author covering o1-preview. It seems there is a significant step up compared to plain LLMs (the paper calls o1-like models LRMs, 'Large Reasoning Models'). I need to go through the paper thoroughly though… planning is clearly something that simple LLMs, and even LRMs, can't do out of the box. It would also be great to see how the DeepSeek models fare on these benchmarks.