Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Paper)

  • Published Nov 25, 2024

COMMENTS • 79

  • @mshonle
    @mshonle 1 month ago +71

    I tried reading this paper three times but then decided it would have been more optimal if they doubled the number of scientists writing it…

    • @guccigav7912
      @guccigav7912 1 month ago +3

      lol same

    • @ultrasound1459
      @ultrasound1459 1 month ago +5

      They didn't share any code 🔴❌️

    • @csabaczcsomps7655
      @csabaczcsomps7655 1 month ago

      A neural network is a procedure for processing stimuli, not messages as in OOP. A message goes to one object; a stimulus goes to all objects and is processed in every node. Imagine you have one variable going into one expression: a stimulus is one value going into the expressions of all nodes. It's a new way to compute, closer to real neurons. How to implement it is the work in progress now.

  • @tingtingin
    @tingtingin 1 month ago +30

    He's alive!

  • @JumpDiffusion
    @JumpDiffusion 1 month ago +5

    There is a paper by Christopher Re and co. about scaling inference via random sampling; they demonstrate scaling all the way up to saturating MATH and other benchmarks. They also come up with scaling laws for inference.

  • @kevon217
    @kevon217 1 month ago +1

    Love your paper breakdowns. Always learn a lot. Appreciate it!

  • @pumozavr
    @pumozavr 1 month ago +1

    In Figure 2, beam search refers to the "standard" beam search, without refinement. You simply sample intermediate steps from a "standard" LLM (one that might not have self-refinement capabilities) and see what the best intermediate solutions are using the verifier. A PRM-based verifier will give you a score for the current step (the steps are delimited in a way that the PRM understands, e.g. through new lines), and the scores for the single steps are then combined (using average, min, ...) into a score for the whole intermediate solution. You can then pick the solution(s) with the highest score, expand on it, and iterate until you reach one or ideally multiple final solutions from which you can again pick using the verifier. That's my understanding.
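    That reading of Figure 2 can be sketched in a few lines. Everything below is a toy stand-in (the paper's code was not released): the "LLM" just emits random steps, the "PRM" just reads scores off them, and per-step scores are combined with `min` or mean exactly as described above.

    ```python
    import itertools
    import random

    def propose_steps(prefix, k, rng):
        """Toy stand-in for sampling k candidate next steps from an LLM.
        Each 'step' is just a random score in [0, 1] appended to the prefix."""
        return [prefix + (rng.random(),) for _ in range(k)]

    def prm_scores(solution):
        """Toy stand-in for a PRM: one score per delimited step."""
        return list(solution)

    def aggregate(step_scores, how="min"):
        """Combine per-step scores into a score for the partial solution."""
        return min(step_scores) if how == "min" else sum(step_scores) / len(step_scores)

    def beam_search(beam_width=4, expand_k=4, depth=3, seed=0):
        rng = random.Random(seed)
        beams = [()]  # start from the empty partial solution
        for _ in range(depth):
            # expand every beam, then keep the top-scoring partial solutions
            candidates = list(itertools.chain.from_iterable(
                propose_steps(b, expand_k, rng) for b in beams))
            candidates.sort(key=lambda s: aggregate(prm_scores(s)), reverse=True)
            beams = candidates[:beam_width]
        return beams  # final solutions; pick among these with the verifier again

    finalists = beam_search()
    ```

    No refinement anywhere: the base LLM only ever extends prefixes, and the PRM only ever ranks them.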

  • @kikijuju4809
    @kikijuju4809 1 month ago +16

    Long time no see

  • @MADjaHEAD
    @MADjaHEAD 1 month ago +2

    I was missing you! Hope to see more from you

  • @daniele81
    @daniele81 1 month ago +2

    There are no error bars in Figure 4. How would you know whether any of these methods performs significantly better than the others? Looks like bad stats to me

  • @erv993
    @erv993 1 month ago +1

    The king is back!

  • @LatteDeCoder
    @LatteDeCoder 1 month ago +1

    this work seems to build upon another recent work, "Recursive Introspection: Teaching Language Model Agents How to Self-Improve," which has code available...

  • @existenceisillusion6528
    @existenceisillusion6528 1 month ago +1

    Are we sure a* is not a typo that should have been y*?
    Also, best-of-weighted-N beam majority?

  • @Mordenor
    @Mordenor 1 month ago +3

    Thank You Mr Yannic For Explaining This Wonderful Paper About LLM Scaling

  • @03Krikri
    @03Krikri 1 month ago

    Thanks for your critical review, was very insightful

  • @googleyoutubechannel8554
    @googleyoutubechannel8554 1 month ago

    Welcome back! I'm not convinced their definition of 'difficulty' is interesting or helpful either, but isn't it entirely unsurprising that LLMs 'think' in a different way than humans?

  • @ChocolateMilkCultLeader
    @ChocolateMilkCultLeader 1 month ago +3

    My goat is back

  • @juanjesusligero391
    @juanjesusligero391 1 month ago +3

    Glad to see another video of yours, thank you Yannic! :D
    I really miss your ML News, I hope you make some more of them one of these days ^^

  • @MasamuneX
    @MasamuneX 1 month ago +6

    What if we used Monte Carlo tree search on tree-of-thought LLMs, kept only the highest-quality output, trained a new foundation model on that synthetic data, and repeated until ASI?

    • @montymemoladi8067
      @montymemoladi8067 1 month ago +3

      Sounds like a promising approach, and I think it's reasonably close to what the big labs are planning to do

    • @AtAtaylor
      @AtAtaylor 1 month ago +1

      People have already done this

    • @scoffpickle9655
      @scoffpickle9655 1 month ago +1

      Or just use something similar to "Thinker: Learning to Plan and Act" to kinda predict a few tokens ahead, which might increase quality

    • @Adhil_parammel
      @Adhil_parammel 1 month ago

      An oracle is required to guide it and reach ASI.

    • @keypey8256
      @keypey8256 1 month ago

      I'm guessing they trained o1 in a similar manner. Maybe slightly different algorithm, different tree search technique or maybe slightly different way of generating output, but the general idea is probably the same.
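    The loop this thread is circling, sometimes called expert iteration (search, keep the best outputs, retrain, repeat), can be sketched with toy stubs. Every function here is a hypothetical stand-in, not anything from the paper or from o1:

    ```python
    import random

    def search_generate(model, prompt, rng, k=8):
        """Hypothetical stand-in for tree search (e.g. MCTS) over LLM outputs:
        returns k candidate answers with scores from some verifier."""
        return [(f"{prompt}:candidate{i}", rng.random()) for i in range(k)]

    def finetune(model, examples):
        """Hypothetical stand-in for fine-tuning: the 'model' here is just
        the accumulated list of synthetic training examples."""
        return model + examples

    def expert_iteration(prompts, rounds=3, keep_top=2, seed=0):
        rng = random.Random(seed)
        model = []
        for _ in range(rounds):
            best = []
            for p in prompts:
                scored = search_generate(model, p, rng)
                scored.sort(key=lambda t: t[1], reverse=True)
                best += [ans for ans, _ in scored[:keep_top]]  # keep highest quality
            model = finetune(model, best)  # retrain on the synthetic data
        return model

    synthetic = expert_iteration(["q1", "q2"])
    ```

    Each round distills search-time compute into training data; whether repeating this converges to anything useful is exactly the open question in the thread.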

  • @Veptis
    @Veptis 1 month ago

    Will have to check the whole video later. But I think IBM had a somewhat similar paper recently, about the training rate changing based on epoch/mini-batch performance on the benchmark or something. It's called "scheduler" something

  • @EkShunya
    @EkShunya 1 month ago +1

    welcome back

  • @keypey8256
    @keypey8256 1 month ago

    41:15 isn't it at this point a manual overfitting of architecture to the dataset?

  • @makhalid1999
    @makhalid1999 1 month ago

    Can't you review Computer Vision papers too? 😞

  • @benedictsmith2415
    @benedictsmith2415 1 month ago +1

    Equation 1 just serves as a theoretical foundation for the "compute-optimal" concept, but it cannot be directly used for optimization because:
    Intractability: finding the truly optimal hyperparameters θ across all possible prompts and compute budgets a*(q) would require an exhaustive search.
    Unknown ground truth: in a real-world setting, we don't know the ground-truth correct answer y*(q) for an unseen prompt, so directly optimizing the indicator function is impossible.
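    For reference, the compute-optimal objective this comment describes can be written out (my reconstruction from the discussion, not a verbatim copy of the paper; Target(θ, N, q) denotes the output distribution induced by strategy hyperparameters θ under budget N for prompt q):

    ```latex
    \theta^{*}_{q,\,y^{*}(q)}(N)
      = \operatorname*{arg\,max}_{\theta}\;
        \mathbb{E}_{y \sim \mathrm{Target}(\theta,\,N,\,q)}
        \bigl[\, \mathbb{1}_{\{\,y \,=\, y^{*}(q)\,\}} \,\bigr]
    ```

    Both objections above land on terms of this expression: the argmax over θ is the intractable search, and the indicator needs the unknown y*(q).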

  • @akanjiemmanuel4807
    @akanjiemmanuel4807 1 month ago +3

    Interesting paper

  • @andytroo
    @andytroo 1 month ago

    how does resampling the output of a LLM and taking the most frequent differ from running with temp=0 ?

    • @ArtOfTheProblem
      @ArtOfTheProblem 1 month ago

      I think performance breaks down at temp 0, so you get much less exploration. Especially with ambiguous questions, you get more stability with a majority vote, plus a confidence metric
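      The contrast can be sketched with a toy sampler (everything here, including the answer distribution, is a made-up stand-in for an LLM, not the paper's setup):

      ```python
      import random
      from collections import Counter

      def sample_answer(temperature: float, rng: random.Random) -> str:
          """Toy stand-in for an LLM. At temperature 0 decoding is greedy and
          deterministic (here, a wrong answer); at higher temperature the model
          explores and hits the right answer more often than any wrong one."""
          if temperature == 0.0:
              return "41"  # greedy: always the single most likely (wrong) answer
          return rng.choices(["42", "41", "40"], weights=[0.5, 0.3, 0.2])[0]

      def majority_vote(n: int, temperature: float, seed: int = 0) -> tuple:
          """Self-consistency: resample n answers, return (winner, vote share)."""
          rng = random.Random(seed)
          votes = Counter(sample_answer(temperature, rng) for _ in range(n))
          answer, count = votes.most_common(1)[0]
          return answer, count / n  # the vote share doubles as a confidence metric

      greedy = sample_answer(0.0, random.Random(0))
      voted, confidence = majority_vote(n=500, temperature=0.8)
      ```

      So resampling plus majority vote is not equivalent to temp=0: greedy decoding returns one fixed completion, while voting recovers the modal answer across many explorations and hands you a confidence score for free.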

  • @LysergicKids
    @LysergicKids 1 month ago +2

    It can't be, a new paper that's not 98% marketing wank? Is the world healing, brothers

  • @bjarke7886
    @bjarke7886 1 month ago

    Please cover ESM3

  • @gileneusz
    @gileneusz 1 month ago +1

    he's the best

  • @youssefdirani
    @youssefdirani 13 days ago

    why not just open source Gemini and chatgpt ?

  • @MinecraftJuiceHD
    @MinecraftJuiceHD 1 month ago

    Isn't beam search done per token? Why does Yannic say that they grade the answers?

    • @benedictsmith2415
      @benedictsmith2415 1 month ago

      He's misunderstood it. The whole point of the beam search here is that it guides the generation process by making step-wise decisions based on the PRM's evaluation. It's about strategically navigating the search space rather than explicitly modifying the output distribution or altering already-generated outputs

    • @MinecraftJuiceHD
      @MinecraftJuiceHD 1 month ago

      @@benedictsmith2415 So I understood it right? The beam search is done token by token and evaluated at intermediate steps?

    • @benedictsmith2415
      @benedictsmith2415 1 month ago

      @@MinecraftJuiceHD correct

  • @wurstelei1356
    @wurstelei1356 1 month ago

    It seems to me, according to the graphs: the harder the question, the more luck is involved in getting the right answer.

  • @mike___-fi5kp
    @mike___-fi5kp 1 month ago

    long time no see

  • @DaRealCodeBlack
    @DaRealCodeBlack 1 month ago

    Chinese and Indian software engineers and computer scientists are "killin da game" when it comes to all things high tech in coding Ai and other complicated domains in our field. Hats off to them!

  • @aa-xn5hc
    @aa-xn5hc 1 month ago

    Please bring the news back!

  • @KadeemSometimes
    @KadeemSometimes 1 month ago +1

    Nice

  • @TheAIEpiphany
    @TheAIEpiphany 1 month ago

    21:48 What can be unburdened by what has been

  • @sushantpenshanwar
    @sushantpenshanwar 15 days ago

    Rant was good Lol

  • @TheTheeliman
    @TheTheeliman 1 month ago

    Too many concepts, zero lines of code. DeepMind should let me fine-tune my Llama/Gemma with this approach

  • @张默涵-x3z
    @张默涵-x3z 16 days ago

    Awesome

  • @nineteenfortyeight
    @nineteenfortyeight 1 month ago +3

    Why in the name of all that's holy are we asking an LLM to do arithmetic?? 😭

    • @hunterkudo9832
      @hunterkudo9832 1 month ago +3

      Because being able to do arithmetic is a good indicator of being able to reason. We want LLMs to be good reasoners because a lot of tasks in the real world will require LLMs and soon AI agents to reason like a human can.

    • @HUEHUEUHEPony
      @HUEHUEUHEPony 1 month ago +1

      Because not all of us are interested in roleplay slop

  • @RickeyBowers
    @RickeyBowers 1 month ago

    Completely worthless if the model has no concept of the test-time trajectory.

  • @csabaczcsomps7655
    @csabaczcsomps7655 1 month ago

    I think it's what you want. When a kid sees you put down one apple and then one more, he will answer that we have 2. So we write 1+1=2. Then he will always take the notation as true, without recalling the apple scene. This means some training needs two modules: video, then video-notation association. And probably using the notation is a third step. My noob opinion.

  • @islandfireballkill
    @islandfireballkill 1 month ago +6

    Wake up, babe. New Yannic video just dropped.

  • @ozordiprince9405
    @ozordiprince9405 1 month ago

    200 views in 15 minutes. Bro fell off

  • @fontenbleau
    @fontenbleau 1 month ago +3

    Python is just a dead-end pathway. One guy on YouTube writes neural networks in Assembly, a low-level language, and it's 500 times faster than PyTorch on one CPU core on the same task. We need a full rewrite of networks and models.

    • @scoffpickle9655
      @scoffpickle9655 1 month ago +2

      Please tell me who made that. It seems so interesting

    • @scoffpickle9655
      @scoffpickle9655 1 month ago +2

      Also yeah, C or C++ is better for actually useful and fast models; Python is good for modularity and prototyping, but god it is so slow

    • @biomerl
      @biomerl 1 month ago +9

      What? 99 percent of training is done on the GPU, which is already C++

    • @scoffpickle9655
      @scoffpickle9655 1 month ago

      @biomerl Yeah, sorry, I don't have much knowledge of low-level ML

    • @kennycommentsofficial
      @kennycommentsofficial 1 month ago

      @@scoffpickle9655 The easiest starting place is to search YouTube for matrix multiplication with CUDA (basically just C code)

  • @imaspacecreature
    @imaspacecreature 1 month ago

    The Travis Bickle of AI!