These videos are like attending class at Oxford. I love these things. Thank you.
Wow, thank you!
Just the fact that it can play chess is so much more impressive than the fact that it did not win against a level 5 trained computer algorithm. To me it shows that these agents are perfectly capable of automating relatively simple tasks.
Recently I was thinking about chess, agents, and strategy games overall, and I realized that if I want to use an agent for chess, it should call a deep learning model that was trained on chess and then just handle the response, so the LLM is used only for the user's input and output.
Thanks for the information at the end about good and bad use cases. It helps cut through the hype.
you are always clear, honest and forthright. Lovable :)
Not with clickbait titles like this.
@JayDee-b5u To be fair, it would be a very good video that truly answers that question, given the disparate audience. Perhaps “LLMs fight it out in a 64-square smackdown arena” is more accurate ;)
Thanks for the support.
Curious to see the performance if the LLM has vision, plus a scratchpad and memory.
Great knowledge of the rules of AI agents. Probing alternative means' power reveals what they can and can't do.
well F Done ! I did see the datasets for this on HF ! - Also your agent was quite good too !
It would be interesting to know what the boost to the Elo of the MoA LLM was versus its Elo as a single LLM.
I didn't measure it, but if I had to guess I would say it was negligible.
@Data-Centric Fair enough. I think I understand your argument in this video. However, are a lot of agent features like chess, or are they like Minecraft? Remember how they got GPT-4 to learn how to play it by getting it to make its own tools and commands that it could recall and that seemed to work? Maybe agents are more like that, as it seemed it could manage Minecraft, or perhaps somewhere in between Minecraft and chess.
Good work! I love how you explain the code and provide the GitHub repo where we can find it.
If you think about it, I know very few chess players who are able to play a chess game without seeing the board after like 8-10 moves. Would you? I wouldn’t at all, but I wouldn’t ever make the same mistakes you demonstrated if I could see the board.
Yeah, I probably wouldn't be able to remember the board state after a few moves of blindfolded chess (unless the previous moves were all book moves). I wonder how the bot would fare if there were a second agent that summarized the board state and included that in the context.
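A note on what "summarizing the board state into the context" could look like in practice. This is only a minimal sketch, not the code from the video: it assumes the current position is already available as a FEN string (for example from a chess library), and the function names `fen_to_board_text` and `build_context` are made up for illustration. It renders the position as a plain text grid that could be prepended to the prompt instead of asking the model to track the board from the move list alone.

```python
def fen_to_board_text(fen: str) -> str:
    """Render the piece-placement field of a FEN string as an 8x8 text grid."""
    ranks = fen.split()[0].split("/")
    if len(ranks) != 8:
        raise ValueError("FEN placement field must contain 8 ranks")
    rows = []
    for rank_number, rank in zip(range(8, 0, -1), ranks):
        squares = []
        for ch in rank:
            if ch.isdigit():
                # A digit means that many consecutive empty squares.
                squares.extend("." * int(ch))
            else:
                # Uppercase letters are White pieces, lowercase are Black.
                squares.append(ch)
        if len(squares) != 8:
            raise ValueError(f"rank {rank_number} does not describe 8 squares")
        rows.append(f"{rank_number} " + " ".join(squares))
    rows.append("  a b c d e f g h")
    return "\n".join(rows)


def build_context(fen: str, moves_so_far: list[str]) -> str:
    """Hypothetical helper: combine move history and a board picture into prompt text."""
    side_to_move = "White" if fen.split()[1] == "w" else "Black"
    return (
        f"Moves so far: {' '.join(moves_so_far) or '(none)'}\n"
        f"Current board ({side_to_move} to move):\n"
        f"{fen_to_board_text(fen)}"
    )


if __name__ == "__main__":
    # Position after 1. e4
    fen_after_e4 = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1"
    print(build_context(fen_after_e4, ["e4"]))
```

Whether this actually helps a model play better is an open question; it only removes the need to reconstruct the position from memory.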
lol: even me, I was a chess master but now it's gone! ... Chess is something that your brain will lose! It's not the same as finger memory!
Even after ten years I can jump into any Xbox game (Call of Duty etc.) and be a winner! .. Chess takes a couple of months to wake back up!
Hi! Thanks for the video and the code. Is there any reason you decided to separate the white and black moves in the prompt instead of using the "standard" format, e.g., 1. e4 e5 2. Nf3 Nf6, etc.? Since this is more common in books and websites, it could be easier for the models to parse. Just speculation; I may try this later if I find some time.
Thanks for watching. No particular reason I decided on that; I doubt there would be much of an uplift in performance from changing the representation of the board/moves. But let me know if you try it and do get an uplift.
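For anyone who wants to try the experiment, here is a minimal sketch of the conversion. It assumes the moves are held as two separate lists of SAN strings, one per side; that is a guess at the layout, not the repo's actual data structure, and `to_movetext` is a made-up name.

```python
from itertools import zip_longest


def to_movetext(white_moves: list[str], black_moves: list[str]) -> str:
    """Interleave per-side move lists into numbered movetext like '1. e4 e5 2. Nf3 Nf6'."""
    # White moves first, so White has either the same number of moves as Black
    # or exactly one more.
    if not (len(black_moves) <= len(white_moves) <= len(black_moves) + 1):
        raise ValueError("move lists are inconsistent with alternating turns")
    parts = []
    pairs = zip_longest(white_moves, black_moves)
    for number, (white, black) in enumerate(pairs, start=1):
        if black is None:
            parts.append(f"{number}. {white}")
        else:
            parts.append(f"{number}. {white} {black}")
    return " ".join(parts)


if __name__ == "__main__":
    print(to_movetext(["e4", "Nf3"], ["e5", "Nf6"]))  # 1. e4 e5 2. Nf3 Nf6
```

Swapping this string into the prompt in place of the separated lists would make the two representations directly comparable over the same games.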
or any research which is unusual; this can even include historical research where papers on very specific subjects are very limited and difficult to find. Also anything that basically falls into edge or outlier cases. Also in code: when you are coding anything novel, the usefulness of LLM-based tools drops dramatically.
Great idea as a test!
Hey sorry I’ve been absent lately. I’m traveling. Thanks for looking at my pull requests and being active with your community!
Thanks for the support!
Is it possible to have a short Zoom call with you about this topic?
I offer consultancy/development services. You can book it through my consulting link in the description to this video.
Can you give me a review of the Codestral LLM on Ollama?
I use AI to code and build web applications.
My RAM is a little low (32 GB) to run Codestral as smoothly as the others, like Llama 3! How much potential does Codestral have? And can it at least beat GPT-3.5?
This is a good demonstration of how not to use agents. As there is practically an infinite number of chess moves at any point, aren't we just asking the LLM for a random next move? Although LLMs can't do random; they should just return the closest similar example from their training data.
I'm looking forward to the arrival of a Dell RTX PC and testing your videos out locally.
I think AI agents, like coders, should write a test for the solution before generating it. They can test a solution using any of: a calculator, writing code and running it, a custom function tool (i.e. is this a valid chess move), local RAG, web search from a quality source, simulation, Monte Carlo tree search (for chess, etc.), subdividing it and testing the parts, testing with a different LLM, or human verification.
Interesting solution regarding your chess approach, however one might say there's no use for the LLM there at all because the algo is doing 99% of the chess. I assume by valid chess move you mean good (correct me if I'm wrong). I think in this case, the LLM still wouldn't know what a valid chess move is.
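To make the "test before accepting" idea concrete, here is a rough sketch of a propose-and-verify loop. Everything in it is a stand-in: `propose` represents an LLM call, `is_valid` represents a custom function tool, and the toy `LEGAL_MOVES` set is hard-coded for illustration rather than computed from a real position. As the reply above points out, a check like this only establishes that a move is legal, not that it is good.

```python
from typing import Callable, Iterable, Optional


def first_valid(
    propose: Callable[[str], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for candidates until one passes the validator or attempts run out."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = propose(feedback)
        if is_valid(candidate):
            return candidate
        # Feed the rejection back so the next proposal can take it into account.
        feedback = f"'{candidate}' was rejected by the validator; try a different move."
    return None


# Toy stand-ins (assumptions for the demo, not a real engine or model).
LEGAL_MOVES = {"e4", "d4", "Nf3", "c4"}


def make_scripted_proposer(answers: Iterable[str]) -> Callable[[str], str]:
    """Return a fake 'LLM' that replays a fixed list of answers in order."""
    remaining = iter(answers)

    def propose(feedback: str) -> str:
        return next(remaining)

    return propose


if __name__ == "__main__":
    proposer = make_scripted_proposer(["Ke2", "e4"])
    print(first_valid(proposer, LEGAL_MOVES.__contains__))  # e4
```

In a real setup the validator would come from a chess library's legal-move generator, and a quality check would need an engine or a search on top of that.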
Your video is truly shocking. I never would have imagined a major LLM could so quickly make such trivial and direct reasoning errors worthy of a quasi-beginner.
I actually think you just provided a clear demonstration that there is barely an ounce of general reasoning in an LLM. We think there is because the language is logical and our prompts are recurring, but this is wrong.
In fact, it doesn't seem capable of isolating key pieces in a layout and analyzing the impact of their movement. As soon as the game develops a little, it no longer understands anything.
No chess player analyzes the potential movement of all pieces on the board. We know in a few seconds how to identify the main threats or opportunities and we figure out the few resulting options.
Maybe training the model with good move/bad move examples starting from a random layout would help it isolate key pieces in a layout, but I’m not even sure about that.
I love your content, and this video is no exception. That said, I think you are drawing overly broad conclusions about an LLM’s ability to reason in the face of new circumstances/material (versus merely parroting back aspects of its training data) based on the very specific type of “reasoning” required for chess. There are lots of types of reasoning that LLMs are terrible at. Chess requires a very specific type of thinking/planning that an autoregressive model is simply not well equipped to do. Namely, it must not only identify what seem to be the most promising next moves based on the current state and on what the model already knows (its training data, which informs its ‘intuition’), but it must then explore all the possibilities from that hypothetical state, then repeat the same exercise with another potential state. This is a highly systematic type of exploration that algorithms like MCTS are designed to perform and autoregressive GPTs are not. With an infinite context window and infinite max_tokens, the model could perhaps talk through the possibilities, but that’s not how people do it. And it would be hopelessly inefficient. People visualize the configurations to think through the implications. They don’t verbalize it. More fundamentally, the addition of chess-like methodical exploratory thinking capabilities (MCTS-like systematic exploratory thinking) would address a big deficit that LLMs have. But this is only one form of reasoning. I don’t think we can generalize from this that LLMs don’t reason.
In what area do you think LLMs can shine in `reasoning`? Your answer is spot on, and I would appreciate it if you elaborated more.
Thank you for your feedback. I found your thoughts engaging and I broadly agree with you. My aim with this video was to demonstrate how LLM capabilities break down when asked to reason. I believe that what LLMs currently do is not reasoning at all, though I admit I've used that word to describe agent behaviour (for convenience's sake).
I chose chess specifically because I believe it's a good way to visualise this concept. The chess boards displayed alongside the agent's "reasoning" trace demonstrate this quite well.
The game complexity of chess is so vast that we know many chess scenarios simply don't exist in the training data. If LLMs truly "understood" the chess scenarios they had been trained on, that understanding could be transferred to new board states. LLMs attempt this by predicting the next token based on what they've already encountered; as you quite rightly pointed out, this next-token prediction isn't sufficient to play chess competently.
I find your point about infinite context interesting, but I still believe it wouldn't "know" the best move to make even if it could walk through all chess scenarios from a given board state. Generating a set of possible moves is obviously within an LLM's capabilities, but knowing which is the best of that set would require an understanding of how each move brings you closer to the goal of checkmate. This isn't something that autoregressive next-token prediction is well-suited for. Then again, if all possible outcomes were in the training data, it could predict the best move, but this still isn't reasoning, or is it?
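To illustrate the gap between listing candidate moves and knowing which one is best, here is a small sketch using a toy game instead of chess. It is plain exhaustive game-tree search (not MCTS, and not anything from the video) on a simple Nim variant, chosen as an assumption for the example: players alternately take 1 to 3 stones, and whoever takes the last stone wins. The point is that the "best" move only falls out after systematically exploring what follows each candidate.

```python
from functools import lru_cache
from typing import Optional

TAKES = (1, 2, 3)  # legal amounts to remove in this toy Nim variant


@lru_cache(maxsize=None)
def can_win(stones: int) -> bool:
    """True if the player to move can force a win with `stones` left."""
    # With 0 stones there is no move, so any() over nothing is False:
    # the opponent took the last stone and the player to move has lost.
    return any(not can_win(stones - take) for take in TAKES if take <= stones)


def candidate_moves(stones: int) -> list[int]:
    """Generating legal moves is the easy part."""
    return [take for take in TAKES if take <= stones]


def best_move(stones: int) -> Optional[int]:
    """Pick a move that leaves the opponent in a losing position, if one exists."""
    for take in candidate_moves(stones):
        if not can_win(stones - take):
            return take
    return None  # every move loses against perfect play


if __name__ == "__main__":
    for stones in (4, 5, 7):
        print(stones, candidate_moves(stones), best_move(stones))
```

Chess is far too large for this kind of exhaustive search, which is why engines rely on pruning, evaluation functions, or sampling-based methods such as MCTS; the sketch only shows the kind of systematic exploration being discussed.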
Won’t the LLM’s explanation just be pure hallucination to justify whatever move was played? Shouldn’t it reanalyse the board and its plan to make it useful?
I don't think it is capable of this. I tried this with my approach, but I appreciate that my prompting is likely suboptimal.
The end of the video convinced me that it would not work, because we would just emulate a pseudo-search that will never be able to compare with stuff like Monte Carlo tree search.
But it was mostly to think about what could trigger hallucinations or not.
Cool video, especially as someone who really enjoys chess. Obviously, chess is not an LLM's strong suit, but I was surprised just how poorly multiple agents did.
I think the LLMs mainly know the semantic relationships of words and sentences, embeddings etc. Chess is not that, so much.
It's like using a wrench to write a book. Makes no sense. Now ask Stockfish to make a financial report by providing it with data, then compare it with LLMs.
The aim of the video was to show where the "reasoning" capabilities of LLMs break down.