I tried reading this paper three times but then decided it would have been more optimal if they doubled the number of scientists writing it…
lol same
They didn't share any code 🔴❌️
A neural network is a procedure for processing stimuli. Not messages as in OOP: a message goes to one object, while a stimulus goes to all objects and is processed in every node. Imagine you have one variable that goes into one expression; a stimulus is one value that goes into all the expressions of all the nodes. It's a new way to compute, closer to real neurons. How to implement it is the work in progress, now.
He's alive!
There is a paper by Christopher Re and co. about scaling inference via random sampling; they demonstrate scaling all the way up to saturating MATH and other benchmarks. They also come up with scaling laws for inference.
Love your paper breakdowns. Always learn a lot. Appreciate it!
In Figure 2, beam search refers to the "standard" beam search, without refinement. You simply sample intermediate steps from a "standard" LLM (one that might not have self-refinement capabilities) and see what the best intermediate solutions are using the verifier. A PRM-based verifier will give you a score for the current step (the steps are delimited in a way that the PRM understands, e.g. through new lines), and the scores for the single steps are then combined (using average, min, ...) into a score for the whole intermediate solution. You can then pick the solution(s) with the highest score, expand on it, and iterate until you reach one or ideally multiple final solutions from which you can again pick using the verifier. That's my understanding.
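If it helps to see the mechanics, here is a minimal Python sketch of that loop; generate_steps and prm_score are hypothetical stand-ins for the sampler and the process reward model (the paper shipped no code), not the authors' actual implementation:

import random

def generate_steps(prefix, k):
    # Hypothetical stand-in: sample k candidate next steps from the LLM.
    return [prefix + ["step-%d" % random.randint(0, 999)] for _ in range(k)]

def prm_score(steps):
    # Hypothetical stand-in: the PRM scores each step; aggregate (here: min)
    # the per-step scores into one score for the whole partial solution.
    return min(random.random() for _ in steps)

def beam_search(beam_width=4, expand_per_beam=4, max_depth=8):
    beams = [[]]  # each beam is the list of solution steps so far
    for _ in range(max_depth):
        # Expand every surviving partial solution by sampling next steps.
        candidates = [c for b in beams for c in generate_steps(b, expand_per_beam)]
        # Keep only the beam_width partial solutions the verifier likes best.
        candidates.sort(key=prm_score, reverse=True)
        beams = candidates[:beam_width]
    # Final pick: let the verifier choose among the completed solutions.
    return max(beams, key=prm_score)

The real thing would sample from an LLM and call a trained PRM, but the control flow is just this: expand, score, prune, repeat.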
Long time no see
I was missing you! Hope to see more from you
There are no error bars in Figure 4. How would you know if any of these different methods performs significantly better than the others? Looks like bad stats to me.
The king is back!
This work seems to build upon another recent work, "Recursive Introspection: Teaching Language Model Agents How to Self-Improve," which has code available...
Thanks for the hint.
Are we sure a* is not a typo that should have been y*?
Also, best of weighted N beam majority?
Thank You Mr Yannic For Explaining This Wonderful Paper About LLM Scaling
Thanks for your critical review, was very insightful
Welcome back! I'm not convinced their definition of 'difficulty' is interesting or helpful either, but isn't it entirely unsurprising that LLMs 'think' in a different way than humans?
My goat is back
Glad to see another video of yours, thank you Yannic! :D
I really miss your ML News, I hope you make some more of them one of these days ^^
What if we use Monte Carlo tree search on tree-of-thought LLMs, then keep only the highest-quality outputs, train a new foundation model on that synthetic data, and repeat until ASI? (Rough sketch of the loop at the end of this thread.)
Sounds like a promising approach, and I think it's reasonably close to what the big labs are planning to do.
People have already done this
Or just use something similar to Thinker: Learning to Plan and Act to kinda predict a few tokens ahead, which might increase quality.
An oracle to guide it would be required to reach ASI.
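In pseudocode, the proposal in this thread is roughly the loop below; every function is a hypothetical placeholder (no such pipeline has been released), and nothing guarantees the loop converges to anything, let alone ASI:

def search_best_outputs(model, prompts):
    # Hypothetical: run MCTS / tree-of-thought search per prompt, keep top output.
    return [(p, model(p)) for p in prompts]

def train_new_model(pairs):
    # Hypothetical: train a fresh foundation model on synthetic (prompt, output) pairs.
    answers = dict(pairs)
    return lambda p: answers.get(p, "")

def self_improve(model, prompts, rounds=3):
    for _ in range(rounds):
        synthetic = search_best_outputs(model, prompts)  # expensive search at generation time
        model = train_new_model(synthetic)               # distill search results into weights
    return model

print(self_improve(str.upper, ["what is 1+1?"])("what is 1+1?"))  # toy usage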
I'm guessing they trained o1 in a similar manner. Maybe slightly different algorithm, different tree search technique or maybe slightly different way of generating output, but the general idea is probably the same.
Will have to check the whole video later. But I think IBM had a somewhat similar paper recently, about the learning rate changing based on epoch/mini-batch performance on the benchmark or something. It's called "scheduler" something.
welcome back
41:15 Isn't this, at this point, manually overfitting the architecture to the dataset?
Can't you review Computer Vision papers too? 😞
Equation 1 just serves as a theoretical foundation for the "compute-optimal" concept, but it cannot be directly used for optimization because:
Intractability: finding the truly optimal hyperparameters θ*_{q,a*(q)}(N) across all possible prompts and compute budgets would require an exhaustive search.
Unknown ground truth: in a real-world setting, we don't know the ground-truth correct answer y*(q) for an unseen prompt, so directly optimizing the indicator function is impossible.
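For reference, Equation 1 as I remember it (paraphrasing from memory, so treat the notation as approximate; the a*(q) subscript is the suspected y*(q) typo asked about above):

\theta^{*}_{q, a^{*}(q)}(N)
  = \arg\max_{\theta} \;
    \mathbb{E}_{y \sim \mathrm{Target}(\theta, N, q)}
    \left[ \mathbb{1}\{ y = y^{*}(q) \} \right]

Here Target(θ, N, q) is the distribution over outputs induced by test-time strategy θ under compute budget N on prompt q. Both objections fall straight out of this form: you cannot search over all θ, and you do not have y*(q) at test time.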
Interesting paper
How does resampling the output of an LLM and taking the most frequent answer differ from running with temp=0?
I think performance breaks down at temp 0, so you get much less exploration. Especially with ambiguous questions, you get more stability with majority vote, plus a confidence metric.
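Concretely, temp=0 gives you a single greedy decode, while majority vote aggregates many stochastic samples; a minimal sketch, with sample_answer as a hypothetical stand-in for a temperature > 0 LLM call:

from collections import Counter
import random

def sample_answer(prompt):
    # Hypothetical stand-in for one stochastic (temperature > 0) LLM call.
    return random.choice(["42", "42", "41"])  # noisy, but biased toward the right answer

def majority_vote(prompt, n=32):
    votes = Counter(sample_answer(prompt) for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n  # the vote share doubles as a confidence metric

print(majority_vote("What is 6 * 7?"))

A greedy decode gives you one sample and no notion of how peaked the answer distribution is; the vote fraction is exactly the confidence metric mentioned above.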
It can't be, a new paper that's not 98% marketing wank? Is the world healing, brothers?
Please cover ESM3
he's the best
Why not just open-source Gemini and ChatGPT?
Isn't beam search done per token? Why does Yannic say that they grade the answers?
He's misunderstood it - the whole point of the beam search here is that it guides the generation process by making step-wise decisions based on the PRM's evaluation. It's more about strategically navigating the search space than explicitly modifying the output distribution or altering already generated outputs.
@@benedictsmith2415 So I understood it right, right? The beam search is done token by token and evaluated at intermediate steps?
@@MinecraftJuiceHD correct
It seems to me, according to the graphs: the harder the question, the more luck you need to get the right answer.
long time no see
Chinese and Indian software engineers and computer scientists are "killin da game" when it comes to all things high-tech in coding, AI, and other complicated domains in our field. Hats off to them!
Please bring the news back!
Nice
21:48 What can be unburdened by what has been
Rant was good Lol
Too many concepts, zero lines of code. DeepMind should let me fine-tune my Llama/Gemma with this approach.
Awesome!
Why in the name of all that's holy are we asking an LLM to do arithmetic?? 😭
Because being able to do arithmetic is a good indicator of being able to reason. We want LLMs to be good reasoners because a lot of tasks in the real world will require LLMs and soon AI agents to reason like a human can.
Because not all of us are interested in roleplay slop
Completely worthless if the model has no concept of the test-time trajectory.
I think that's what you want. When a kid sees you put down one apple and then one more, he will answer that we have 2. So we write 1+1=2. After that he will take the notation as always true without recalling the apple video. This means some training needs two modules: video, then video-to-notation association. And probably using the notation alone is a third step. My noob opinion.
Wake up, babe. New Yannic video just dropped.
200 views in 15 minutes. Bro fell off
Python is just a dead-end pathway. One guy on YouTube writes a neural network in low-level Assembly and it's 500 times faster than PyTorch on 1 CPU core on the same task. We need a full rewrite of networks and models.
Please tell me who made that. It seems so interesting
Also yeah, C or C++ is better for actually useful and fast models; Python is good for modularity and prototyping, but god it is so fucking slow.
What? 99 percent of training is done on the GPU, which is already C++.
@biomerl Yeah, sorry, I don't have much knowledge of low-level ML.
@@scoffpickle9655 The easiest starting place is to search YouTube for matrix multiplication with CUDA (basically just C code).
The Travis Bickle of AI!