Speculations on Test-Time Scaling (o1)

  • Published Jan 15, 2025

COMMENTS • 41

  • @test-sc2iy
    @test-sc2iy 2 months ago +44

    we're in a spot where a serious person can seriously say "it's SIMPLY the model talking to itself until it solves the problem", and we enthusiasts shrug and move along. What a time to be alive.

    • @mossglow
      @mossglow 2 months ago

      But there is so much more to problem-solving than recursive iteration, isn't there? Humans solve problems using hypermodalities. Bodily sensations, sounds, smells, the gut microbiome, and emotional states all impact how we think. Then there are the more or less understood "a-ha!" moments or trial-and-error lucky guesses where intuitive judgment makes the call. We also have subconscious processing during sleep tackling the most difficult problems we are stuck on, accompanied by cerebrospinal fluid flushing over our brain tissue. Then there are hungover days when creativity takes the lead for some (e.g., Hemingway). Good luck trying to introduce a central nervous system depressant like alcohol into an LLM and then get the best out of it, lol. I can only imagine how difficult it is to capture all these nuances in current or future LLM architectures. It almost seems like we need something else to augment LLMs with.

  • @JustSayin24
    @JustSayin24 1 month ago +6

    I love the fact that not only does this research exist, but someone went through the effort to distil it in such an intelligible way. Thank you!

  • @420_gunna
    @420_gunna 25 days ago +1

    Coming back a month later, this is still the goat
    (We've gotten more of a consensus since then about what's happening, but who's to say that some of these strategies aren't (or couldn't be) used at train time [though, in usual OAI fashion, it's probably just the simple thing scaled up -- RL with ORMs].)

  • @familiabartolome9725
    @familiabartolome9725 2 months ago +3

    such a good overview - thank you for the insights, quite instructive and accessible

  • @DanielBonaker
    @DanielBonaker 2 months ago +5

    Very interesting summary, thanks a lot. My intuition is that evaluation/test time is where we can grow / where the low-hanging fruit is.

  • @소금-v8z
    @소금-v8z 29 days ago

    Hey, great video! I've been trying to wrap my head around O1 for a while, and this really helped me put things into perspective. I'm surprised I haven't seen more discussion about using special tokens for reasoning. It seems like trying to generate these really long, abstract sequences for reasoning can be difficult and hard to evaluate. I have a strong feeling that we could make LLMs more stable by using special tokens as anchors to keep them from going down the wrong path during reasoning.
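The anchor-token idea in the comment above can be sketched as follows. Everything here is a hypothetical illustration (the `<step>` token, the toy scorer), not anything o1 is known to use: split a reasoning chain at special step tokens so a verifier can score and cut it at step boundaries instead of judging one long free-form sequence.

```python
# Hypothetical sketch: special tokens as anchors for pruning a reasoning chain.

def split_on_anchors(chain: str, anchor: str = "<step>") -> list[str]:
    """Split a reasoning chain into steps at anchor tokens."""
    return [s.strip() for s in chain.split(anchor) if s.strip()]

def prune_at_anchors(chain: str, score_step, threshold: float = 0.5) -> str:
    """Keep the longest prefix of steps whose scores stay above threshold."""
    kept = []
    for step in split_on_anchors(chain):
        if score_step(step) < threshold:
            break  # cut here and resample, instead of drifting down a bad path
        kept.append(step)
    return " <step> ".join(kept)

# Toy scorer: pretend steps containing "guess" are low-confidence.
chain = "Let x = 3 <step> Then x^2 = 9 <step> guess the answer is 7"
print(prune_at_anchors(chain, lambda s: 0.1 if "guess" in s else 0.9))
# -> "Let x = 3 <step> Then x^2 = 9"
```

In a real system the scorer would be a learned process reward model; the anchor token just gives it well-defined places to intervene.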

  • @sanesanyo
    @sanesanyo 2 months ago +4

    Thank you so much for such an informative video 🙏🙏.

  • @jaewooklee5844
    @jaewooklee5844 1 month ago

    Thank you so much for your detailed information. 🙏

  • @openroomxyz
    @openroomxyz 2 months ago +4

    Thanks for creating this video

  • @drhxa
    @drhxa 2 months ago +5

    Stream of Search + Let's Verify Step by Step has looked the most likely to me. It might be that they just put their heads down, worked really hard to solve the collapse problems, and optimized generalizability.
    Regardless, amazing overview, thanks a bunch for sharing

  • @burnytech
    @burnytech 3 days ago

    Shouldn't the equation at 18:07 be E_{y ~ p(· | z_{1:t}, x)}[Ver(y)]? I.e., adding z_{1:t} to the subscript of the expectation.
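For reference, the corrected expectation the commenter proposes, written out cleanly (this is a reconstruction of the comment's notation, not the slide itself):

```latex
% Expected verifier score of an answer y sampled from the model,
% conditioned on the prompt x and the reasoning prefix z_{1:t}:
\mathbb{E}_{\,y \sim p(\cdot \mid z_{1:t},\, x)}\bigl[\mathrm{Ver}(y)\bigr]
```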

  • @Charles-Darwin
    @Charles-Darwin 23 days ago

    This is probably the 3rd time I've watched this since you posted it. I still can't shake it; you capture the 5Ws+H perfectly imo. It seems to have been overlooked in general, as I keep seeing people conflate things across models and architectures, which I find annoying. It could be a byproduct of the development cadence or the many hype trains littering the internet, but I always feel this is significant enough to be worth bringing up in such discussions.
    I have an itching question that I'm curious about your thoughts on. Thinking beyond what you've covered: in the same manner that LLMs derive these language patterns and are then leveraged, do you think that o1, in its TTC graphs and results, derives a more general heuristic pattern from the process, along with supplying the output? If so, wouldn't these general heuristics effectively apply to different scopes and across domains, at least eventually?

  • @HansKonrad-ln1cg
    @HansKonrad-ln1cg 2 months ago +2

    For search it is important to search over ideas: not letters or tokens or words or sentences or paragraphs, but ideas. So an LLM needs to be able to output a token that says it has finished laying out an idea, and thus a new idea can begin at that point. If an LLM is constantly interrupted at the lower levels, it can never fully finish the idea. That would also help battle the combinatorial explosion that makes search at lower levels intractable. It's like a human chess player who only considers a few moves vs. a brute-force algorithm that considers millions of moves that lead nowhere.

    • @srush_nlp
      @srush_nlp  2 months ago +1

      Agreed. Lots of choices though in how to actually build that. Need steps that cause tangible progress.
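The idea-level search discussed in this thread can be sketched as a plain beam search whose atomic unit is a complete idea (as if each were terminated by an end-of-idea token). `propose_ideas` and `score` are hypothetical stand-ins for a model and a heuristic, not any real API:

```python
# Hypothetical sketch: beam search over whole ideas instead of tokens.

def idea_beam_search(start, propose_ideas, score, beam_width=2, depth=3):
    """Expand paths one complete idea at a time, keeping the best few."""
    beam = [[start]]
    for _ in range(depth):
        candidates = []
        for path in beam:
            for idea in propose_ideas(path):
                candidates.append(path + [idea])
        # Pruning at the idea level keeps the combinatorial tree tractable.
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beam[0]

# Toy domain: "ideas" are numbers; the goal is a path summing close to 10.
best = idea_beam_search(
    0,
    propose_ideas=lambda path: [1, 3, 5],
    score=lambda path: -abs(10 - sum(path)),
)
print(best)  # start state plus three ideas, summing near 10
```

The point of the sketch is only that the branching factor is over ideas (three here), not over every token, which is the comment's contrast with brute-force low-level search.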

  • @theK594
    @theK594 2 months ago +1

    This is fantastic work❤!

  • @SLAM2977
    @SLAM2977 1 month ago +2

    The o1 test-time compute plot's x-axis is on a log scale, which means you need exponentially more compute for each linear improvement, so it will grind to a halt.

    • @francisco444
      @francisco444 1 month ago +2

      Hence the 7 Tril bet

    • @diophantine1598
      @diophantine1598 1 month ago

      They apparently only just started scaling this. For example, there’s no reason that this couldn’t be applied to writing other than the fact that it is difficult to craft a reward signal for it. Saying that they’ll quickly hit a wall now would be like saying the same when we were at GPT-2. Sure, it’ll eventually happen, but we’re a ways off from it happening.
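The log-scale point in this thread can be made concrete with a toy accuracy curve: if accuracy grows like a + b·log10(compute), each additional accuracy point costs a constant multiplicative factor of compute. The constants here are illustrative, not o1's actual curve:

```python
import math

# Toy illustration: linear gains on a log-x plot mean multiplicative
# compute costs. With b = 1 per decade, +1 accuracy point costs 10x compute.

def accuracy(compute: float, a: float = 20.0, b: float = 1.0) -> float:
    return a + b * math.log10(compute)

for compute in [1e3, 1e4, 1e5, 1e6]:
    print(f"compute={compute:>9.0f}  accuracy={accuracy(compute):.1f}")
# Each row adds 1.0 accuracy point while compute grows 10x.
```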

  • @wiktorm9858
    @wiktorm9858 1 month ago

    Cool lecture, thanks!

  • @mindhoc
    @mindhoc 1 month ago

    🎉❤terrific video, thank you

  • @DistortedV12
    @DistortedV12 2 months ago

    I think "not following from expert examples" is a stretch. They could have helped finetune the CoT mechanism by having people write out their thought processes while solving problems, especially for math and coding. Edit: I see it addressed at 20:30.

    • @srush_nlp
      @srush_nlp  2 months ago +2

      Yeah I agree that there are expert examples somewhere in the training procedure. Wanted to emphasize that these play less of a role than I would have assumed before diving into this area (if you believe the OAI comments).

    • @tankieslayer6927
      @tankieslayer6927 2 months ago +3

      @@DistortedV12 I think to achieve scale, the data has to be generated by the model itself via a step-by-step prompt, and the correctness of the solution has to be easily verified. For example, AIME problems have an integer solution between 0 and 999. One can then use process and advantage rewards on such a dataset.

  • @wwkk4964
    @wwkk4964 2 months ago

    Brilliant!

  • @vaioslaschos
    @vaioslaschos 2 months ago

    That is awesome. It saved me lots of time. I am trying to use some of these techniques for the AIMO Kaggle contest. If anyone is interested drop me a message.

  • @DistortedV12
    @DistortedV12 2 months ago +1

    Did he mention that they use reasoning tokens?

    • @srush_nlp
      @srush_nlp  2 months ago +2

      Oh no I forgot to mention that! In my notation the reasoning token is how you know to move from z to y. It's kind of implied by the color changing from green to red.

  • @420_gunna
    @420_gunna 2 months ago +1

    Goat

  • @DistortedV12
    @DistortedV12 2 months ago

    Thinking LLMs from Meta, LLM-Berry, the ARC-AGI paper from MIT on test-time training. Can someone (ideally Noam Brown, or an LLM otherwise) comment on how these are related to what is discussed here?

    • @srush_nlp
      @srush_nlp  2 months ago +1

      * Thinking LLMs is quite related. It uses an LLM as the verifier (I was emphasizing automatic verifiers in this talk).
      * LLM-Berry is an effort to do an MCTS-style search on existing Llama models without learning.
      * The ARC-AGI paper that came out today seems really neat! They do SGD at test time, so it's pretty different from these methods, which only do CoT at test time.

    • @DistortedV12
      @DistortedV12 2 months ago

      @@srush_nlp thank you so much for responding to my questions! Great talk; I liked how you pointed out the core problem so other researchers can focus their efforts.

  • @novantha1
    @novantha1 2 months ago +1

    I find this ridiculous and remarkably improbable. Did you see the missed space in the example CoT from o1? That matches Sam Altman's laid-back writing style; he's clearly writing all the CoT at test time by hand.

  • @NerdCrusader
    @NerdCrusader 2 months ago +2

    Has to be process reward

    • @srush_nlp
      @srush_nlp  2 months ago +1

      Yeah, it definitely seems like that is part of the equation. The question is whether that is everything.

  • @ZenBen_the_Elder
    @ZenBen_the_Elder 1 day ago

    'The innovations driving rapid AI research' [9:41-12:34]

  • @tankieslayer6927
    @tankieslayer6927 2 months ago +6

    Test-time compute capability is still constrained by the data used for RL training, which is harder to curate. You can give a D student an infinite amount of time on an exam and he is certainly not going to get an A.

    • @wwkk4964
      @wwkk4964 2 months ago +2

      Depends entirely on the verifier and the test.

    • @haiderameer9473
      @haiderameer9473 2 months ago

      But synthetic data can remove this constraint. Just have increasingly capable models create more synthetic data to allow further reinforcement learning, and so on.

    • @mossglow
      @mossglow 2 months ago

      @@haiderameer9473 No it can't, as it's still combinatorics at work; D -> A remains a challenge. No amount of recursive repetition in one domain over even a seemingly infinite window of time will make you an expert in another that you know little about.

    • @Asmodeus.q
      @Asmodeus.q 1 month ago

      @@mossglow
      I was really wondering about this, but since it seems to be working and producing better results,
      I came to the idea that maybe GPT-4-and-later models are not the best distillations of all the knowledge they've been trained on, and that further distillation toward paths that align with the intended results requires optimizing toward outlier reasoning that is better than the average reasoning. Basically, trying to distill toward expert-level human language proficiency. This should exist within the LLM's corpus of knowledge; it's just lost in an ocean of data.
      I certainly don't have any idea what I'm talking about; I just follow AI news.