Please, please put links to papers in the text section of the videos.
Some context: this happens to people too. They need to order the rules to come to the correct conclusion. The random-word rules look like hallucinations to humans as well, so it is normal that the LLM hallucinates in this case. It is a training-data problem (because people prefer chronological order) and a random-junk problem, since it looks like a hallucination anyway.
I do wonder how a single particular human would fare under such a deep analysis of their cognition. Imagine all the little flaws, quirks and imperfections of your own cognition being analysed to death like this. I feel it would be fascinating and revealing.
Seems like you can mitigate this with a good prompt. The trick is to ask the LLM to reorder the problem, then answer the reordered problem. Like so: "You are an 'Order of Operations Detection' module within an AI reasoning system. Your job is to meticulously analyze text for temporal and causal indicators, clarifying the sequence of events or operations within a problem statement. Your output is always a reordered sequence of events that is in correct chronological order as determined by the temporal and causal indicators in the text. Label this output as 'reordered_problem'. Once this is done, solve the problem statement labelled 'reordered_problem'."
Triage the following prompt chronologically, then answer it:
@@RPG_Guy-fx8ns The problem with that prompt is that the LLM will struggle to know which version of the problem to solve. I have much higher success giving the updated problem a label and instructing the agent to solve the labelled version.
@@PrincessKushana so something like:
The following is Prompt A. Triage Prompt A chronologically into Prompt B, then answer Prompt B:
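A rough sketch of this two-pass "reorder, then solve" idea, assuming the OpenAI Python SDK is available; the model name, prompt wording and function name are placeholders rather than anything from the video or paper:

```python
# Two-pass prompting: first ask the model to rewrite the problem in
# chronological order as "Prompt B", then solve only the labelled version.
from openai import OpenAI

client = OpenAI()

REORDER_SYSTEM = (
    "You are an 'Order of Operations Detection' module. Analyze the text for "
    "temporal and causal indicators and rewrite it so the events are in "
    "chronological order. Output only the rewritten text, labelled 'Prompt B'."
)

def reorder_then_solve(problem: str, model: str = "gpt-4-turbo") -> str:
    # Pass 1: produce the chronologically reordered problem (Prompt B).
    reordered = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REORDER_SYSTEM},
            {"role": "user", "content": f"Prompt A:\n{problem}"},
        ],
    ).choices[0].message.content

    # Pass 2: solve the labelled, reordered version and ignore the original.
    solved = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Solve the problem labelled 'Prompt B'. Ignore any other version."},
            {"role": "user", "content": reordered},
        ],
    ).choices[0].message.content
    return solved
```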
It is almost a similar problem to the one discovered last year, where LLMs trained on A -> B fail when asked about B -> A (the paper "The Reversal Curse").
@ 8:45 the language model actually got it right, because it didn't make the assumption that he only went home once. The way it's worded, no money was lost on his way home from withdrawing the 1,000; rather, the money was lost on his way home after converting his bills to five-dollar bills. A better answer would have been to point out the ambiguity in the wording and give both mathematical possibilities, but whoever scored that answer as wrong was making an unsupported assumption.
In the first example it looks like the LLM is ignoring the punctuation, so "The rink also has yellow cars." becomes "The rink has yellow cars, they have 3 times the number of blue cars."
If you do RAG chunking at the paragraph level rather than the sentence level, and then sort your paragraphs into chronological order, this may reduce the problem.
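A minimal sketch of the paragraph-level chunking part of that idea; the chronological sort is left as a stub, since it would need a temporal/causal tagger or an extra LLM call, and the example document (including the blue-car count) is made up:

```python
# Paragraph-level chunking keeps sentences like "The rink also has yellow cars."
# attached to the sentences that resolve their pronouns.
def chunk_paragraphs(text: str) -> list[str]:
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def sort_chronologically(paragraphs: list[str]) -> list[str]:
    # Placeholder: in practice, score each paragraph with temporal/causal cues
    # (dates, "then", "after", ...) and sort by that score.
    return paragraphs

doc = (
    "There are 40 blue cars at the rink.\n\n"
    "The rink also has yellow cars. They have 3 times the number of blue cars."
)
for chunk in sort_chronologically(chunk_paragraphs(doc)):
    print(chunk)
```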
Really appreciate your vids!! Thanks!
Looking at your examples, I think the outcome is incredibly positive, not negative. LLMs have been absolutely incompetent at math logic. It's fun to play with, but the answers are usually terrible. These examples show, again, that prompt engineering is very important, and maybe with the right prompt the LLM that has always been stupid at these things might not be quite as stupid as we thought…for math logic problems.
Perhaps it can be fixed by automatically putting premises in a linear order (self-prompting)?
What if we include these reordered examples in the training itself? Or add examples in the prompt?
I don't think this will happen if GPT-4 or other LLMs are trained on a propositional logic corpus. I had suggested that we include all types of logic textbooks, including modal and fuzzy logic, in the original corpus. That would certainly take care of the logical elements in the LLM. Maybe train a smaller model on logic and then merge it with standard models.
that's a phenomenal paper!!!
Shows that we need to include logical reasoning and advanced logic textbooks in the corpus.
Besides the hallucination, the sensitivity to ordering makes me wonder whether we have unrealistic expectations in getting an autoregressive engine to do logical reasoning. I suspect the future is in some kind of LLM + deterministic framework, as suggested by AlphaGeometry. In that case, could the hybrid framework be the LLM rewriting the sentences to fit into Prolog, then evaluating them from there? It would be another case of "tool use", which is better behaved.
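A toy illustration of that hybrid idea, with sympy standing in for Prolog so the example stays in plain Python: the LLM would translate each sentence into a declarative constraint, and the deterministic solver does not care what order the constraints arrive in. The blue-car count of 40 is invented; only the "3 times" relation comes from the example discussed above.

```python
# Declarative constraints are order-insensitive: shuffling the premises
# changes nothing for the solver.
from sympy import Eq, solve, symbols

blue, yellow = symbols("blue yellow")

# "There are 40 blue cars at the rink."            -> Eq(blue, 40)
# "The yellow cars are 3 times the blue cars."     -> Eq(yellow, 3 * blue)
constraints = [Eq(yellow, 3 * blue), Eq(blue, 40)]  # deliberately "out of order"

print(solve(constraints, [blue, yellow]))  # {blue: 40, yellow: 120}
```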
LLMs have a non-monotonic temporal logic, the same as natural language has. Nothing to be surprised about. It has very little in common with Aristotelian logic. If you are interested in it, read Paul Hertz, Gerhard Gentzen, and so on, up to Dov Gabbay.
the word "They" in the moved line being read
"These yellow cars" have 3 times the number of blue cars.
An ambiguity of speech sometimes we make in real life,
and is due to the english grammar, and applies to even technical truncations, short forms,,,,
but its that LLM will blunder such cases without any inkling
or regret
It would be interesting to see how an agent system performs that first identifies semantic primitives, then divides them into relevant and irrelevant statements, and then uses the selected statements to reason with.
Thank you, best paper since the dawn of GPT-3.
Since this is the case, what impact might this have on the use of DSPy, which relies, as far as I can tell, on the LLM's logical operations being accurate?
Any chance of a video on how to perform Logic Filtering for RAG?
In principle it is easy. The authors argue that this behaviour is not a mistake in the LLM's processing; the LLM is simply learning from human datasets: from all the books written, from all the social media posts (maybe a significant portion of the free internet and all human conversations). It is the preferred way we humans seem to think: in a linear fashion.
Therefore, if we want to construct a theoretical intelligence with higher logical complexity, we have at minimum two options:
A - create a logical training dataset with higher reasoning complexities, then pre-train your new LLM on this specific dataset. Hint: not a synthetic dataset, since GPT-4 Turbo is linearly limited, given its training data.
B - if we have non-linear or higher-order complexities, write code to reduce our higher complexities to the simpler, human-based complexity that GPT-4 can handle. Hint: it is not the case that a complexity-level-3 task can be reduced to three complexity-level-1 tasks. Smile.
So Google is already working on it, and maybe I have a hint for you in my next video .....
Seems like the required training data could be synthetically generated in linear logical order and then shuffled/modified algorithmically.
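A minimal sketch of that pipeline, with a made-up problem template: generate a toy multi-step problem whose premises are in causal order, then shuffle the premises algorithmically while keeping the question last.

```python
import random

def make_linear_problem(seed=None):
    """Build a toy multi-step word problem with premises in causal order.
    The name and quantities are invented for illustration."""
    rng = random.Random(seed)
    start = rng.randint(20, 50)
    given_away = rng.randint(1, 10)
    multiplier = rng.randint(2, 4)
    premises = [
        f"Alex starts with {start} marbles.",
        f"Alex gives away {given_away} marbles.",
        f"Then Alex's collection grows to {multiplier} times what is left.",
    ]
    question = "How many marbles does Alex have now?"
    answer = (start - given_away) * multiplier
    return premises, question, answer

def shuffle_premises(premises, question, rng=None):
    """The perturbed variant: same facts, shuffled premise order, question last."""
    rng = rng or random.Random()
    shuffled = premises[:]
    rng.shuffle(shuffled)
    return " ".join(shuffled) + " " + question

premises, question, answer = make_linear_problem(seed=0)
print("original:", " ".join(premises), question)
print("shuffled:", shuffle_premises(premises, question, random.Random(1)))
print("answer  :", answer)
```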
Really interesting.
Still, I think an argument can be made for interpretation, especially regarding the bill example.
Lost bills returning home -> was he at the bank when he exchanged the 20-dollar bills for 5-dollar bills? One could argue that he could not have done this at home, and then he might have lost ten 5-dollar bills.
For some of the other examples, however, including the ones with logical premises, the ordering should not have made a difference.
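For concreteness, the two readings of that bill problem, with assumed figures ($1,000 withdrawn in $20 bills, 10 bills lost); the exact numbers of the benchmark problem aren't quoted in this thread:

```python
# Two readings of the ambiguous "lost bills while getting home" wording.
withdrawn = 1_000
lost_bills = 10

# Reading A: the bills are lost on the way home from the bank, before the
# conversion, so each lost bill is a $20 bill.
remaining_a = withdrawn - lost_bills * 20   # 800

# Reading B (the LLM's reading): he converts to $5 bills first and loses the
# bills on a later trip home, so each lost bill is a $5 bill.
remaining_b = withdrawn - lost_bills * 5    # 950

print(remaining_a, remaining_b)
```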
We already have logical reasoning tools that surpassed human ability decades ago: SMT solvers, Prolog and its variants, HOL and other proof assistants. Just let LLMs use them. I had quite a bit of success just providing a small 7B model with access to a Prolog interpreter.
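A small sketch of what that can look like, assuming SWI-Prolog plus the pyswip bindings are installed; the rule and facts are made-up stand-ins for the paper's rule sets, asserted in a deliberately scrambled order:

```python
# Hand the order-sensitive part to a Prolog engine: assertion order of facts
# and rules does not change what is provable.
from pyswip import Prolog

prolog = Prolog()

# Facts first, rule last; Prolog does not care.
prolog.assertz("rough(alan)")
prolog.assertz("big(bob)")
prolog.assertz("blue(X) :- rough(X)")

# The query gives the same answer regardless of assertion order.
print(list(prolog.query("blue(alan)")))   # non-empty list -> provable
print(list(prolog.query("blue(bob)")))    # []             -> not provable
```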
Thanks, interesting stuff! I am going to test whether the logical ordering of my Python functions has an impact on how well the LLM works with the codebase. What do you think?
These flaws are suspiciously reminiscent of human flaws.
It seems GPT-4 Turbo's performance did not degrade very much: its accuracy drops from 94.1% to 85.0% on the full dataset and from 100% to 89.9% on the subset where the original problems are solved correctly. This would still be nearly human, since for the GSM8K dataset "A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning." I would think "bright" would mean in the top 25% or even top 10%?
That’s almost a tripling of errors. I wouldn’t use an LLM to manage my finances on that basis. AGI via LLMs now looks further away than ever.
The prompt whitespace and punctuation issue was bad enough, but this is far worse.
You pretend that they won't be able to solve this issue. Most people don't even manage their finances, let alone use an LLM to manage them. All intelligent people understand that you use an LLM in addition to your own intelligence, not as a substitute for it. @@damianlewis7550
Wonder how DeepSeek Math RL (7B) would do.
need a logic balancer…
I asked a few large models, and they said I can.
Try to solve such tasks just with your brain. Then you will understand these LLMs much better.
For the dollar bills one, I got the same wrong answer when I read from left to right, not gonna lie.
I did the same. I think that might be the point, though. As humans, we tend to work through things in order, so the majority of our written work follows this pattern. LLMs, trained on human data, learn that this order is important and follow it. This leads to mistakes when rules are out of order. Maybe, then, generating synthetic data that is out of order and training on it will help with this issue?
From @DonaldKronos (above): "The way it's worded, no money was lost on his way home from withdrawing the 1,000, but rather the money was lost on his way home after converting his bills to five-dollar bills." You and the person below (and the LLM) were correct.
@@EdJones1960 Interesting. I think the "while getting home" clause in the lost money step is supposed to match up with the "after getting home" step. That gives the context to put the steps back in the correct order. For the lost money step to come at the end, he would need to leave the house again to satisfy the "while getting home" part. I guess one could argue it makes no sense to convert money at your house, so he must have left to do that.
I call it cognitive lambda calculus, and it's only bad because the people writing the chatbots are bad at this. This is the kind of thing you have to smoke 3 joints to understand lmao. Attention doesn't have enough parentheses.
FASCINATING, need an AI to keep up.
Are we surprised? I'm not.
If you're surprised about this then you believe things about LLMs that aren't true.