Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Paper)

  • Published 19 Jan 2025

COMMENTS • 81

  • @mshonle
    @mshonle 3 months ago +75

    I tried reading this paper three times but then decided it would have been more optimal if they doubled the number of scientists writing it…

    • @guccigav7912
      @guccigav7912 3 months ago +3

      lol same

    • @ultrasound1459
      @ultrasound1459 3 months ago +5

      They didn't share any code 🔴❌️

    • @csabaczcsomps7655
      @csabaczcsomps7655 3 months ago

      A neural network is a procedure for processing stimuli, not messages as in OOP. A message goes to one object; a stimulus goes to all objects and is processed in every node. Imagine you have one variable that goes into one expression. A stimulus is one value that goes into all expressions of all nodes. It's a new way to compute, closer to real neurons. How to implement it is the work in progress now.

  • @tingtingin
    @tingtingin 3 months ago +31

    He's alive!

  • @kevon217
    @kevon217 3 months ago +1

    Love your paper breakdowns. Always learn a lot. Appreciate it!

  • @juanjesusligero391
    @juanjesusligero391 3 months ago +4

    Glad to see another video of yours, thank you Yannic! :D
    I really miss your ML News, I hope you make some more of them one of these days ^^

  • @JumpDiffusion
    @JumpDiffusion 3 months ago +5

    There is a paper by Christopher Re and co. about scaling inference via random sampling; they demonstrate scaling all the way up to saturating MATH and other benchmarks. They also come up with scaling laws for inference.

    • @谢安-k6t
      @谢安-k6t 19 days ago

      I guess you're talking about the Large Language Monkeys paper. That one is actually quite pointless: the vast majority of answers in the dataset are positive integers below 1000, and they let the model attempt up to 10,000 times, so of course some of the guesses will be identical to the correct answer. It basically says nothing.
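Whatever one makes of that benchmark, repeated-sampling results like these are usually reported with the unbiased pass@k estimator popularized by the Codex evaluation paper: given n attempts of which c are correct, it gives the probability that a random subset of k attempts contains at least one correct answer. A minimal implementation:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: chance that at least one of k attempts, drawn
    without replacement from n attempts of which c are correct, succeeds."""
    if n - c < k:
        return 1.0  # fewer wrong attempts than draws: a hit is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10,000 attempts, a single lucky guess saturates pass@10000...
print(pass_at_k(10_000, 1, 10_000))  # 1.0
# ...but the same lucky guess contributes little at small k:
print(pass_at_k(100, 1, 10))
```

This is why coverage curves over thousands of samples can look impressive even when per-sample accuracy is tiny, which is the commenter's complaint.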

  • @MADjaHEAD
    @MADjaHEAD 3 months ago +3

    I was missing you! Hope to see more from you

  • @pumozavr
    @pumozavr 3 months ago +1

    In Figure 2, beam search refers to the "standard" beam search, without refinement. You simply sample intermediate steps from a "standard" LLM (one that might not have self-refinement capabilities) and see what the best intermediate solutions are using the verifier. A PRM-based verifier will give you a score for the current step (the steps are delimited in a way that the PRM understands, e.g. through new lines), and the scores for the single steps are then combined (using average, min, ...) into a score for the whole intermediate solution. You can then pick the solution(s) with the highest score, expand on it, and iterate until you reach one or ideally multiple final solutions from which you can again pick using the verifier. That's my understanding.
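The procedure described in this comment can be sketched in a few lines. Everything here is a hypothetical stand-in: `sample_next_step` plays the role of an LLM extending a partial solution by one newline-delimited step, and `prm_score` plays the role of a PRM scoring the latest step in context.

```python
import random

def sample_next_step(partial_steps):
    # Stand-in for sampling one more solution step from the LLM.
    return partial_steps + [f"step-{len(partial_steps)}-{random.randint(0, 9)}"]

def prm_score(steps):
    # Stand-in for a PRM score of the *last* step given the prefix.
    return random.random()

def solution_score(step_scores, reduce="min"):
    # Combine per-step PRM scores into one score for the partial solution.
    return min(step_scores) if reduce == "min" else sum(step_scores) / len(step_scores)

def beam_search(beam_width=4, expand=4, depth=3):
    beams = [([], [])]  # each beam is (steps so far, per-step PRM scores)
    for _ in range(depth):
        candidates = []
        for steps, scores in beams:
            for _ in range(expand):  # expand each surviving beam
                new_steps = sample_next_step(steps)
                new_scores = scores + [prm_score(new_steps)]
                candidates.append((new_steps, new_scores))
        # keep the beam_width best partial solutions under the verifier
        candidates.sort(key=lambda c: solution_score(c[1]), reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # best full solution according to the PRM

best_steps, best_scores = beam_search()
print(len(best_steps))
```

The `reduce="min"` choice mirrors the common practice of scoring a solution by its weakest step; swapping in the average is the one-line change shown.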

  • @Mordenor
    @Mordenor 3 months ago +3

    Thank You Mr Yannic For Explaining This Wonderful Paper About LLM Scaling

  • @kikijuju4809
    @kikijuju4809 3 months ago +16

    Long time no see

  • @erv993
    @erv993 3 months ago +1

    The king is back!

  • @daniele81
    @daniele81 3 months ago +2

    There are no error bars in Figure 4. How would you know whether any of these methods performs significantly better than the others? Looks like bad statistics to me.

  • @03Krikri
    @03Krikri 3 months ago

    Thanks for your critical review, was very insightful

  • @LatteDeCoder
    @LatteDeCoder 3 months ago +1

    this work seems to build upon another recent work, "Recursive Introspection: Teaching Language Model Agents How to Self-Improve," which has code available...

  • @MasamuneX
    @MasamuneX 3 months ago +6

    What if we use Monte Carlo tree search on tree-of-thought LLMs, then keep only the highest-quality output, train a new foundation model on that synthetic data, and repeat until ASI?

    • @montymemoladi8067
      @montymemoladi8067 3 months ago +3

      Sounds like a promising approach, and I think it's reasonably close to what the big labs are planning to do

    • @AtAtaylor
      @AtAtaylor 3 months ago +1

      People have already done this

    • @scoffpickle9655
      @scoffpickle9655 3 months ago +1

      Or just use something similar to Thinker: Learning to Plan and Act to kind of predict a few tokens ahead, which might increase quality

    • @Adhil_parammel
      @Adhil_parammel 3 months ago

      An oracle to guide it would be required to reach ASI.

    • @keypey8256
      @keypey8256 3 months ago

      I'm guessing they trained o1 in a similar manner. Maybe a slightly different algorithm, a different tree search technique, or a slightly different way of generating output, but the general idea is probably the same.
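The loop described in this thread is roughly expert iteration: search harder than the model, keep only outputs that pass a quality filter, retrain on them, repeat. A toy sketch of the outer loop, where every function is a hypothetical stand-in (a "model" is just a prompt-to-quality-score function, not a real LLM API):

```python
import random

def tree_search(model, prompt, budget=16):
    # Stand-in for MCTS / tree-of-thought: draw `budget` candidates and
    # keep the best-scoring one, i.e. search harder than one forward pass.
    return max(model(prompt) + random.random() for _ in range(budget))

def quality_filter(score, threshold=0.9):
    # Keep only the highest-quality outputs for the synthetic dataset.
    return score >= threshold

def retrain(model, data):
    # Stand-in for fine-tuning: nudge the model toward its filtered outputs.
    boost = 0.01 * len(data)
    return lambda prompt: model(prompt) + boost

model = lambda prompt: 0.1  # base "model": maps a prompt to a quality score
prompts = [f"q{i}" for i in range(8)]

for generation in range(5):
    synthetic = [s for p in prompts if quality_filter(s := tree_search(model, p))]
    if synthetic:
        model = retrain(model, synthetic)
```

The point of the sketch is the structure (search, filter, retrain, repeat), not the numbers; whether such a loop keeps improving rather than collapsing onto its own outputs is exactly the open question.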

  • @ChocolateMilkCultLeader
    @ChocolateMilkCultLeader 3 months ago +3

    My goat is back

  • @makhalid1999
    @makhalid1999 3 months ago +1

    Can't you review Computer Vision papers too? 😞

  • @googleyoutubechannel8554
    @googleyoutubechannel8554 3 months ago

    Welcome back! I'm not convinced their definition of 'difficulty' is interesting or helpful either, but isn't it entirely unsurprising that LLMs 'think' in a different way than humans?

  • @existenceisillusion6528
    @existenceisillusion6528 3 months ago +1

    Are we sure a* is not a typo that should have been y*?
    Also, best-of-weighted-N beam majority?

  • @EkShunya
    @EkShunya 3 months ago +1

    welcome back

  • @keypey8256
    @keypey8256 3 months ago

    41:15 isn't it, at this point, manual overfitting of the architecture to the dataset?

  • @andytroo
    @andytroo 3 months ago

    How does resampling the output of an LLM and taking the most frequent answer differ from running with temp=0?

    • @ArtOfTheProblem
      @ArtOfTheProblem 3 months ago

      I think performance breaks down at temp 0, so you get much less exploration. Especially with ambiguous questions you get more stability with a majority vote, plus a confidence metric.
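A minimal sketch of the difference: greedy decoding (temperature 0) returns one deterministic answer, while self-consistency samples many answers at temperature > 0 and takes the majority. `sample_answer` below is a purely hypothetical answer distribution where the correct answer is only the most likely single draw, so any one sample can be wrong but the majority rarely is.

```python
import random
from collections import Counter

def sample_answer():
    # Hypothetical sampled-LLM answer distribution: "42" is correct and
    # most likely per-sample, but a single draw is wrong half the time.
    return random.choices(["42", "41", "43"], weights=[0.5, 0.3, 0.2])[0]

def majority_vote(n=301):
    votes = Counter(sample_answer() for _ in range(n))
    answer, count = votes.most_common(1)[0]
    confidence = count / n  # the vote share doubles as a confidence metric
    return answer, confidence

random.seed(0)
answer, confidence = majority_vote()
print(answer, round(confidence, 2))
```

With temperature 0 you would instead get a single fixed answer and no spread of votes at all, hence no confidence signal.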

  • @akanjiemmanuel4807
    @akanjiemmanuel4807 3 months ago +3

    Interesting paper

  • @benedictsmith2415
    @benedictsmith2415 3 months ago +1

    Equation 1 just serves as a theoretical foundation for the "compute-optimal" concept, but it cannot be directly used for optimization because:
    Intractability: finding the truly optimal hyperparameters θ across all possible prompts and compute budgets a*(q) would require an exhaustive search.
    Unknown ground truth: in a real-world setting, we don't know the ground-truth correct answer y*(q) for an unseen prompt, so directly optimizing the indicator function is impossible.
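For reference, the equation under discussion (Equation 1 of the paper, reproduced here from memory, so treat the exact notation as approximate):

```latex
\theta^{*}_{q, y^{*}(q)}(N)
  = \operatorname*{arg\,max}_{\theta}
    \; \mathbb{E}_{y \sim \mathrm{Target}(\theta, N, q)}
    \left[ \mathbb{1}\{\, y = y^{*}(q) \,\} \right]
```

That is, for prompt q and compute budget N, pick the test-time strategy θ that maximizes the probability that a sampled final answer y matches the ground truth y*(q); both objections apply because the argmax ranges over an open-ended strategy space and the indicator requires knowing y*(q).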

  • @Veptis
    @Veptis 3 months ago

    Will have to check the whole video later, but I think IBM had a somewhat similar paper recently, about the learning rate changing based on epoch/mini-batch performance on the benchmark or something. It's called "scheduler" something.

  • @wurstelei1356
    @wurstelei1356 3 months ago

    It seems to me, according to the graphs: the harder the question, the more luck is needed to get the right answer.

  • @MinecraftJuiceHD
    @MinecraftJuiceHD 3 months ago

    Isn't beam search done per token? Why does Yannic say that they grade the answers?

    • @benedictsmith2415
      @benedictsmith2415 3 months ago

      He's misunderstood it: the whole point of the beam search here is that it guides the generation process by making step-wise decisions based on the PRM's evaluation. It's more about strategically navigating the search space than explicitly modifying the output distribution or altering already-generated outputs.

    • @MinecraftJuiceHD
      @MinecraftJuiceHD 3 months ago

      @benedictsmith2415 So I understood it right? The beam search is done token by token and evaluated at intermediate steps?

    • @benedictsmith2415
      @benedictsmith2415 3 months ago

      @MinecraftJuiceHD Correct

  • @LysergicKids
    @LysergicKids 3 months ago +2

    It can't be: a new paper that's not 98% marketing wank? Is the world healing, brothers?

  • @mike___-fi5kp
    @mike___-fi5kp 3 months ago

    long time no see

  • @youssefdirani
    @youssefdirani 2 months ago

    Why not just open-source Gemini and ChatGPT?

  • @bjarke7886
    @bjarke7886 3 months ago

    Please cover ESM3

  • @gileneusz
    @gileneusz 3 months ago +1

    he's the best

  • @TheAIEpiphany
    @TheAIEpiphany 3 months ago

    21:48 What can be unburdened by what has been

  • @aa-xn5hc
    @aa-xn5hc 3 months ago

    Please bring the news back!

  • @KadeemSometimes
    @KadeemSometimes 3 months ago +1

    Nice

  • @sushantpenshanwar
    @sushantpenshanwar 2 months ago

    Rant was good Lol

  • @DaRealCodeBlack
    @DaRealCodeBlack 3 months ago

    Chinese and Indian software engineers and computer scientists are "killin da game" when it comes to all things high-tech in coding, AI, and other complicated domains in our field. Hats off to them!

  • @TheTheeliman
    @TheTheeliman 3 months ago

    Too many concepts, zero lines of code. DeepMind should let me fine-tune my Llama/Gemma with this approach.

  • @RickeyBowers
    @RickeyBowers 3 months ago

    Completely worthless if the model has no concept of the test-time trajectory.

  • @islandfireballkill
    @islandfireballkill 3 months ago +6

    Wake up, babe. New Yannic video just dropped.

  • @burnytech
    @burnytech 25 days ago

  • @张默涵-x3z
    @张默涵-x3z 2 months ago

    Awesome

  • @nineteenfortyeight
    @nineteenfortyeight 3 months ago +3

    Why in the name of all that's holy are we asking an LLM to do arithmetic?? 😭

    • @hunterkudo9832
      @hunterkudo9832 3 months ago +4

      Because being able to do arithmetic is a good indicator of being able to reason. We want LLMs to be good reasoners because a lot of tasks in the real world will require LLMs and soon AI agents to reason like a human can.

    • @HUEHUEUHEPony
      @HUEHUEUHEPony 3 months ago +1

      Because not all of us are interested in roleplay slop

  • @csabaczcsomps7655
    @csabaczcsomps7655 3 months ago

    Think what you want. When a kid sees you put down one apple and then one more, he will answer that we have 2. So we write 1+1=2. After that he will take the notation as always true without recalling the apple scene. This means some training needs two modules: video, then video-to-notation association. And using the notation alone is probably a third step. My noob opinion.

  • @ozordiprince9405
    @ozordiprince9405 3 months ago

    200 views in 15 minutes. Bro fell off

  • @fontenbleau
    @fontenbleau 3 months ago +3

    Python is just a dead-end pathway. One guy on YouTube writes neural networks in low-level Assembly, and it's 500 times faster than PyTorch on 1 CPU core on the same task. We need a full rewrite of networks and models.

    • @scoffpickle9655
      @scoffpickle9655 3 months ago +2

      Please tell me who made that. It seems so interesting

    • @scoffpickle9655
      @scoffpickle9655 3 months ago +2

      Also yeah, C or C++ is better for actually useful and fast models, python is good for modularity and prototyping but god it is so fucking slow

    • @biomerl
      @biomerl 3 months ago +9

      Wat? 99 percent of training is done on the GPU, which is already C++

    • @scoffpickle9655
      @scoffpickle9655 3 months ago

      @biomerl Yeah, sorry, I don't have much knowledge of low-level ML

    • @kennycommentsofficial
      @kennycommentsofficial 3 months ago

      @scoffpickle9655 Easiest starting place is to search YouTube for matrix multiplication with CUDA (basically just C code)

  • @imaspacecreature
    @imaspacecreature 3 months ago

    The Travis Bickle of AI!