GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

  • Published 21 Oct 2024

COMMENTS • 122

  • @bardaq9643
    @bardaq9643 2 days ago +33

    As someone who taught mathematics to various audiences, I have to say that, indeed, quite a few younger students will solve x^2 + 5x + 6 = 0 but will bug out completely if you replace x by A, B or an emoji. Or ask some first-year students the "trick questions" about the zero vector space or the difference between a 1xn matrix and \mathbb R^n. So yes, that joins the points raised in the video. Adding unnecessary information to exam questions used to be a nice trick too. Or asking them to verify whether something satisfies a definition where the violated condition is the last one usually listed, but trivial to check.
    The sad conclusion that I started to reach after a few years is that a lot of students (including, arguably, myself when younger as well?..) were simply too good at, broadly speaking, memorisation (pattern-matching?..), even at higher university levels. They would not be able to manipulate the objects freely if left unguided. Seeing what o1 (even though it's not the focus here) can do with research-level math actually amazed me due to my low expectations (I did not expect it to operate that well in TeX and not go through such things as Lean). And I think things will go further, raising more and more questions about humans and mathematics.
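    For reference, here is the worked example from that comment; renaming the unknown changes nothing about the algebra, which is exactly why failing on the renamed version signals pattern matching rather than understanding:

    ```latex
    % The same factorisation works whatever symbol names the unknown.
    \[
      x^2 + 5x + 6 = (x + 2)(x + 3) = 0
      \quad\Longrightarrow\quad x = -2 \ \text{or}\ x = -3,
    \]
    \[
      A^2 + 5A + 6 = (A + 2)(A + 3) = 0
      \quad\Longrightarrow\quad A = -2 \ \text{or}\ A = -3.
    \]
    ```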

    • @memegazer
      @memegazer 1 day ago

      I think a lot of people don't like to think about how humans and generative models are not much different from a monkey-at-a-typewriter type of system
      I think you could say humans have generalized more than a transformer model, but imo it seems more likely that we humans have only memorized some useful patterns in reasoning that appear to be something like a universal generalizer
      but I suspect that it is not nearly as much as popular consensus seems to think

  • @JorgetePanete
    @JorgetePanete 3 days ago +42

    Bold of you to assume the average human uses reasoning

    • @elliotanderson1585
      @elliotanderson1585 2 days ago +4

      well, he is bald after all.

    • @frederickmueller7916
      @frederickmueller7916 2 days ago +1

      The problem is that we actually punish children for reasoning in school. You just have to memorize and do "as told".

    • @TheTruthOfAI
      @TheTruthOfAI 2 days ago +1

      @@frederickmueller7916 obviously, u dont want einsteins in ur class.. stupid dyslexic idiots like him LOL

    • @memegazer
      @memegazer 1 day ago

      @@frederickmueller7916
      I think if you said, "children are not incentivised to reason" that might be valid.
      But I have never seen any evidence that children are "punished" for reasoning.

  • @Mordenor
    @Mordenor 3 days ago +4

    Thank you Mr Yannic for going over the cons of GSM-Symbolic.

  • @theepicosityofpizza
    @theepicosityofpizza 2 days ago +6

    So glad to see you cover this! I thought this paper was embarrassingly bad. It spends most of its time evaluating 7-9B parameter models and glosses completely over the fact that o1 absolutely wallops their benchmark. They cherry-pick one example of it getting something wrong and conclude that LLMs can't reason. So strange.

  • @mshonle
    @mshonle 3 days ago +17

    A hard question to answer after “can humans reason?” is the continuum: can gorillas reason? Can crows? Dogs? Snails? A housefly? A virus? At some point down this line you will say “no”, but I bet there is wide variation in where people draw the line.

    • @frankbauerful
      @frankbauerful 2 days ago +1

      We need to define "to reason". One definition could be "Use mathematical/logical rules and apply them to a problem." That's the kind of reasoning the maths tasks from the paper expect.
      Furthermore we need to define what "can" means in this context. What does it mean to say that someone "can reason"? My definition in this context would be "to be able to perform the task of reasoning (according to the above definition) with near 100% accuracy, allowing only for a few random failures that do not reoccur if the same problem is presented multiple times."
      By this definition not all humans can reason (because some are just fundamentally incapable of learning mathematical rules). However some humans can, e.g. myself. It doesn't matter how many variations of these maths problems you present to me, what names occur therein, what narrative context is wrapped around it. I will solve each and every task of the difficulty level presented in the paper and I will do it every time, allowing only for random failures that we humans can't avoid due to lapses of concentration. You won't find variance in my performance depending on whether there's a "Sarah" and it's about "toys" or if it's a "Kurt" dealing with "cars".
      And of course NONE of the LLMs "can reason" according to the above definition.

  • @frankbauerful
    @frankbauerful 2 days ago +20

    I disagree that it is a problem if test datasets contain unrealistic narratives. I think the opposite is the case. We should focus on these unrealistic scenarios to test reasoning. Take the following example:
    "A quadriplegic juggler juggles 20 nuclear reactors on a Tuesday. Every other day of the week he juggles 10 of Jupiter's moons (different ones every day). After a full week, how many total objects has he juggled?"
    Even a high school student who has the fundamental mathematical background will easily solve this question (after he stops laughing). Because it doesn't matter that the scenario is absurd. The mathematical reasoning always works the same.
    BTW, ChatGPT can solve this.
    But now let's make this more interesting. We rephrase this to move it from the mathematical realm into the real world:
    "A juggler who is now quadriplegic after an accident is ordered by his boss to juggle 20 nuclear reactors on Tuesday and 10 of Jupiter's moons on every other day (different ones each day). After a full week, how many different objects will he have juggled?"
    Now every smart student will realize that this is no longer a maths question. The information "quadriplegic after an accident" combined with "ordered by his boss" (rather than a statement that the juggler does something) makes this very clear. So the student switches seamlessly from mathematical reasoning to real world reasoning and replies: "Zero."
    ChatGPT fails this spectacularly by treating it as a maths question.
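    For the record, the intended arithmetic in the first (absurd but well-posed) version is straightforward; a minimal sketch, assuming a seven-day week with one Tuesday:

    ```python
    # Sketch of the arithmetic for the first version of the juggler question,
    # assuming one Tuesday and six other days in the week.
    reactors_on_tuesday = 20
    moons_per_other_day = 10
    other_days = 6

    total_objects = reactors_on_tuesday + moons_per_other_day * other_days
    print(total_objects)  # 80 objects juggled over the full week
    ```

    The second version is the interesting one precisely because this arithmetic is no longer the point.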

    • @A2ATemp
      @A2ATemp 2 days ago +1

      Very good example
      Pushback: is the quadriplegic using a neural implant? jk
      But...what kind of "juggling" is he doing?
      Metaphorical language can throw some models off. A lot of metaphors abound in normal day-to-day jargon without clarifying the context. In the example below, it's about reasoning about the emotions, not the math.
      Example - "No-Hands has to juggle 100 airplanes and 3 typhoons in the Pacific, how many widows will he have to juggle if he drops everything?"
      Context: On the aircraft carrier, the Air Boss (call-sign "No-Hands") has to juggle 100 aircraft pilot schedules during flight operations and 3 typhoons moving nearby in the Pacific. How many people will he piss off if he screws up?

    • @christopheriman4921
      @christopheriman4921 2 days ago

      If you treat the second one as a real-world question, the real answer is that we don't have enough information to know how many objects he would have juggled, because in the real world things change: under some sets of assumptions the person could have reasonably recovered from their condition within that week and juggled some objects other than what the boss told them to, so the answer would have to accommodate that possibility. I personally read the second question as a purely mathematical problem too, because it is still so ridiculous that I couldn't see it being a real-world situation.

    • @woolfel
      @woolfel 2 days ago +1

      Let's get real, the physical world isn't nice and orderly like regular math problems in a book. In the real world lots of pointless information is mixed in, so any system that can't handle random bits of unrelated information will have issues. Go to a new city and ask 5 different people for directions to a place. I don't buy the excuse that math problems shouldn't have irrelevant details.
      I like your example. I've done similar tests with ChatGPT and it fails on prompts with minor changes. For real debugging, ChatGPT is complete shit. It's great for summarizing content or boilerplate code, but not for real problems.

    • @frankbauerful
      @frankbauerful 2 days ago +1

      @@A2ATemp If the LLM asked for clarification, bringing up the issues you raise, that would be a different thing. But currently LLMs will simply confidently give a wrong answer. I must also add that there's not really room for a plausible metaphorical meaning. If the person was juggling power plants in the managerial sense he wouldn't be described as a "juggler". And regarding Jupiter's moons I'm not seeing any way to fit them into the same metaphor. It's the kind of thing a student might bring up in class to be funny, but on a written test that is graded, where no questions are allowed, no student would ever answer as if it was metaphorical.

    • @frankbauerful
      @frankbauerful 2 days ago

      @@christopheriman4921 If this was a written test presented by Google as part of your interview for a job application, you'd go for the stupid math solution? Well, they wouldn't hire you.

  • @broyojo
    @broyojo 3 days ago +60

    Always funny when AI researchers think that they are representative of the average human

    • @shubhamdhapola5447
      @shubhamdhapola5447 2 days ago

      "Every statistical model is wrong, but some are useful"
      Obviously, no one can ever formulate a perfect model representation for humans (uncountably many salient features, relevancy of those features varies with time, etc.), else we would have completely eradicated all the societal problems..... but that doesn't mean that one can't refurbish the past models to function better at present and be wary of the future. That's what researchers mostly do.
      The problem is that not all of them are aware of that very fact, or have a "Further Study" section in their papers that lists the assumptions they made, how choosing them over others shaped the intermediate representations that led to the final model, and encourages exploration of what other assumptions would have led to!

    • @TheTruthOfAI
      @TheTruthOfAI 2 days ago

      common sense.. not the sense of the commons :D project that and u'll see..

    • @minecraftermad
      @minecraftermad 1 day ago

      Not everyone is so knowledgeable about geochemistry; most people know the chemical formula of only a few common minerals, like feldspars or perovskite.

  • @Charliethephysicist
    @Charliethephysicist 1 day ago +4

    It seems you have misunderstood what reasoning means. It means mathematical reasoning, or formal logic, as the paper title clearly indicates. It is not comparing against how the average human would have performed. It is simply comparing against what correctly following rigorous formal mathematical deductive rules would have produced.

  • @dennisestenson7820
    @dennisestenson7820 2 days ago +8

    There definitely is test-set poisoning in recent models. One I was playing with recently with fabric went off the rails and started outputting training questions from a benchmark dataset.

    • @TheTruthOfAI
      @TheTruthOfAI 2 days ago

      hmmm.. i guess that part of the research never reaches the public media.. we only got the notice that "LiquidLM, Reflection, Gemini.." are the big-cholos of the AI landscape.. making others look like idiots hahahahahahahahahaha

  • @thenoblerot
    @thenoblerot 2 days ago +9

    Apple is trying to lower the expectations for their """Apple Intelligence"""

    • @jsbgmc6613
      @jsbgmc6613 1 day ago

      That has to be it!
      Or maybe it's an explanation for why Apple is so behind on AI ... They have the dumbest researchers around that don't understand AI and cherry-pick data to make generalized statements.

    • @Kovici.
      @Kovici. 1 day ago +1

      On point

  • @protocol6
    @protocol6 2 days ago +4

    I was actually just thinking about whether all humans reason, not as a joke. It's in the name of our species, so you'd think it was a defining characteristic. And it was analyzing LLMs and the current political situation in the US that made me think that. What LLMs seem best at is generating plausible-seeming BS, a trait they share with politicians. Aside from the statistical stuff, they take advantage of confirmation bias, the Barnum effect and linguistic pareidolia. When presented with things that don't make sense, we often impose an interpretation informed by our knowledge and biases. We often read things into statements that aren't actually there.
    Some people are better at avoiding that trap by reasoning through it, validating parts independently and making an effort to not lose track of the misses OR the hits. Some people come away from certain political rallies thinking the incoherent word salad they listened to was not only coherent but that it agreed with their opinions perfectly or was in complete opposition to their opinions. Others recognize it for what it is: incoherent word salad. A good LLM will seem coherent to a much larger group of people than certain politicians, but many politicians take advantage of that tendency of listeners to shape slightly ambiguous statements.
    Our response to such statements, whether from LLMs or other humans, strongly suggests there's a spectrum of sapience in humans. I suspect that often it's not a lack of capability but a sort of mental laziness that could be a learned trait. Or maybe that's the default and we have to properly learn to reason... which would mean it is not an innate feature of our species.

  • @Brainstormer_Industires
    @Brainstormer_Industires 2 days ago +3

    There's a difference between whether humans CAN reason and whether they DO reason in any particular instance. It's not about the fact that we often use mental shortcuts that make us mistake-prone. It's that we do have the faculties to think things through and reliably get the right answers when required. Some humans are very good at this and do so reliably. Others do not. This is basically what we mean when we say some humans are very "intelligent" and others not so much.
    But computers aren't humans. If a computer IS able to solve a particular kind of problem, we really expect it to EVERY time. We don't want to model that part of human behavior where we only get good performance after having our morning coffee. So the fact that they fail A LOT on basic variations of a problem shows that they don't "think things through"; it's only a little more than guess and check.
    While we know humans CAN reason (even if we often DON'T), we still don't know if machines CAN (and if they can, they certainly often don't). Evidence seems to indicate that at the moment, they actually cannot.

  • @amantayal1897
    @amantayal1897 2 days ago +3

    LLMs are not regular humans, and this clearly shows that models are not learning the underlying abstractions. If you give any decent high school student a word problem with inflated numbers, they can easily solve it. It may be that models generate correct reasoning but struggle with calculations involving large numbers; this often happens with smaller models. However, we cannot overlook these drops in performance by simply comparing them to regular humans. If a person can solve one problem involving a rate drop and another involving a discount, we would expect them to solve a combined problem on a test if they understand the underlying logic.
    When these models perform exceptionally well, we compare them to PhDs, but when they fail, we start comparing them to regular humans.
    Regarding the point that they are doing calculations in their heads: first, these models have billions of parameters with tens of layers, and secondly, they use Chain of Thought (CoT) reasoning, so models clearly have a lot of memory. Unlike humans, LLMs don’t forget information in their context; they can attend to every token in their history. We should not dismiss the failures of these models by saying, “Oh, they’re just doing calculations in their heads” or “Oh, they’re just world models.” These models are trained on trillions of tokens and then further tuned with millions of tokens to help them learn reasoning.

    • @shawnryan3196
      @shawnryan3196 16 hours ago

      🤦🤦🤦🤦🤦🤦🤦🤦

    • @r.k.vignesh7832
      @r.k.vignesh7832 7 hours ago

      Exactly. These are supposed to be PhD-level performance models (OpenAI and Anthropic's claims, not claims made by these researchers), but they're utterly stumped by questions you'd expect a high schooler to ace 10 times out of 10 in an important exam and it's laughed off as "haha, the average human can't reason either"? You wouldn't hire that average human who couldn't solve these puzzles to do coding, solve complicated mathematical problems or otherwise do cutting-edge research, would you?

  • @baselariqat9771
    @baselariqat9771 2 days ago +2

    There's a lot of talk about whether humans can reason or not. I'd argue that given enough time and motivation, humans will definitely outperform LLMs in reasoning. We often overlook some key points:
    1. LLMs see tasks thousands of times during training. Humans, when learning, only see a few examples.
    2. Humans typically prepare for tests and take them for the test's sake. What we really take away from learning in general are concepts that we learn to apply broadly.
    3. Tests like this are very narrow. Even expanding them, as Yannic suggested, doesn't make much sense. Tests are a poor way to evaluate. They focus more on scale (evaluating more people) than on actual individual performance.

  • @Free_Ya_Mind
    @Free_Ya_Mind 2 days ago

    A very comprehensive review as usual. Keep up the good work!

  • @YuraL88
    @YuraL88 3 days ago +11

    Why did they test 4 different models from OAI, but not Claude Sonnet 3.5?

    • @ArtOfTheProblem
      @ArtOfTheProblem 2 days ago +3

      because it can solve all those problems

    • @TheTruthOfAI
      @TheTruthOfAI 2 days ago

      No bro.. it's because it's "NOT SAFE"... "NOT ALIGNED" or "NOT PART OF THE AI-ALLIANCE".. per se.. it doesn't f.. exist XDDDDDDD also.. u dont want to compare ur research against something that proves how stupid and nonsensical what u are writing publicly in a paper is.. u know dawg..

    • @avogadroarts4366
      @avogadroarts4366 8 hours ago

      Rate limits make it impossible. I had the same problem while benchmarking Sonnet 3.5 for a paper. In the end we decided to drop it.

    • @ArtOfTheProblem
      @ArtOfTheProblem 4 hours ago

      @@avogadroarts4366 too bad...

    • @ArtOfTheProblem
      @ArtOfTheProblem 4 hours ago

      @@avogadroarts4366 also did you notice they prompted o1 incorrectly (using CoT) in the paper? that jumped out at me

  • @P1XeLIsNotALittleSquare
    @P1XeLIsNotALittleSquare 2 days ago +4

    If you start asking random questions on the street you inevitably come to the conclusion that people can't reason, do basic math, know basic geography, history etc.

    • @TheTruthOfAI
      @TheTruthOfAI 2 days ago

      common sense is not the sense of the commons, my friend ;)

  • @MultiMojo
    @MultiMojo 1 day ago

    Excellent observations! On a side note, I've also observed that LLMs are terrible at parsing tabular data.

  • @keypey8256
    @keypey8256 2 days ago

    For quadratic function problems it's common in a lot of exams to handpick the values such that the discriminant (delta) is the square of a natural number
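    For a quadratic ax^2 + bx + c = 0, the "delta" referred to here is the discriminant; picking coefficients so that it is a perfect square guarantees rational roots:

    ```latex
    \[
      \Delta = b^2 - 4ac, \qquad x = \frac{-b \pm \sqrt{\Delta}}{2a}.
    \]
    % Example: x^2 + 5x + 6 = 0 gives \Delta = 25 - 24 = 1 = 1^2,
    % so the roots are the integers -2 and -3.
    ```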

  • @fulin3397
    @fulin3397 3 days ago +1

    Was just wondering whether you would have a video on this paper!

  • @andytroo
    @andytroo 1 day ago

    As someone who has done bunches of multiple-choice exams, I know you can play the exam game over human-written questions separately from the actual knowledge tests. Last week I did an (ungraded) quiz, and as I clicked submit I thought "3/4, and it will be question 2 I'm wrong on" - and I was right ..
    20:30 - where's GPT-4o there? It appeared to have the least "bias" according to the first graph collection (11:40)

  • @Ishirosama
    @Ishirosama 2 days ago +1

    Why are Anthropic models rarely tested/present in benchmarks?

  • @ugthefluffster
    @ugthefluffster 23 hours ago

    great video! the best analysis of this (frankly, very flawed) paper.

  • @supportvideomachine1112
    @supportvideomachine1112 1 day ago

    As a researcher in the field, this paper annoys me because it presents results (like the impact of irrelevant context, or the difficulty LLMs have learning CS grammars) that have been known for a long time and are already available on arXiv. But of course this paper is from Apple, so everyone is talking about it...

  • @elliotanderson1585
    @elliotanderson1585 2 days ago

    I get the feeling that researchers sometimes publish papers for the sake of publishing, rather than to contribute truly meaningful research. Humans definitely reason, but each individual has different capabilities. LLMs give the illusion of reasoning because they're so good at manipulating language. However, unless there are fundamental changes in their architecture, LLMs won't reason like humans.

  • @SimonJackson13
    @SimonJackson13 3 days ago +1

    So is the call rate counted from the beginning or from when the new rate starts?

  • @markonfilms
    @markonfilms 3 days ago +1

    I'm only explicitly reasoning when I explicitly reason. If I tell an LLM to reflect on its own outputs and pay attention to its purpose from the framework of a perspective, it will also start "waking up."

    • @markonfilms
      @markonfilms 3 days ago

      Well at least the system as a whole is self-aware, and the more you point it out in a chat context etc. and instruct a cognitive framework of sorts, you can get them to reason a good bit better. It seems like it particularly helps to get the model to acknowledge its own outputs in context. I'll literally prompt "pay special attention to self-referencing language from the perspective of being a language model." Usually lots of other stuff too for the task at hand. When you get it to think about what it's generating, everything gets more System 2-ish.
      It seems that even in text only, with a feed-forward stochastic model, projecting a self to have an anchor point for cognition works.

  • @zhandanning8503
    @zhandanning8503 1 day ago

    Would there be a difference in "performance" if the numbers were changed to words? For example "$10" to "ten dollars"? Any researchers able to answer this question?

  • @zyzhang1130
    @zyzhang1130 3 days ago +1

    I can agree with the memorization hypothesis based on changing the numbers, but saying the difficulty level doesn't matter so long as LLMs can reason really goes to show how detached these researchers are from reality. Also, I find this paper unintentionally became marketing material for some of the good models, so the conclusion should be that properly trained LLMs can exhibit seemingly good reasoning (and to no one's surprise, the same goes for humans!)

    • @mikebarnacle1469
      @mikebarnacle1469 3 days ago +3

      It shouldn't matter, because they are an LLM, not a human.... the "reality" here for an LLM is the graphics card they are running on. I love how as soon as an LLM does well on a test designed to compare humans against other humans like the bar exam, people claim the LLM is now a lawyer. But as soon as you get a test designed for an LLM, then those are the results you don't believe lol.

    • @christopheriman4921
      @christopheriman4921 2 days ago

      @@mikebarnacle1469 The difficulty of the question will of course matter to any system you are testing, as will whether any particular person or AI would be able to solve it, especially if you are talking about solving it in a reasonable amount of time. There will always be a tradeoff between accuracy and speed, even if you can diminish a lot of the negatives of having speed. Although if they aren't given enough time or enough resources to solve it in a reasonable timeframe, they should probably be able to tell you eventually that it isn't worth it currently.

    • @mikebarnacle1469
      @mikebarnacle1469 2 days ago

      @@christopheriman4921 LLMs don't work that way. There isn't a tradeoff between time and accuracy. You're touching on the problem of System 2 thinking, which these systems don't exhibit. If you solve that problem and make a system that can spend more time thinking about a hard problem and come up with better answers, then yes, I might say you have an actual AI. CoT is the closest thing, a cute hack to do that which has not demonstrated any seriously compelling results. The problem is specifically that they don't spend time thinking about it because they are not capable of spending more time thinking about it. That's not a part of the architecture. They don't do any better, so we don't ask them to spend more time on it, because nobody has figured out how to make them do any better. Humans, however, can spend more time on a problem and get vastly better results. That's a feature of our architecture.

  • @siddharth-gandhi
    @siddharth-gandhi 3 days ago +6

    18:08 - Yannic, unfortunately that is the same logic behind how algebra is taught in school for humans to reason. A man buying 200 coconuts or a car going at 200 km/h isn't really realistic, but humans are taught to reason from that.

    • @JorgetePanete
      @JorgetePanete 3 days ago +3

      200 coconuts buy a car going at 200 km/h

    • @MadPCsuperb
      @MadPCsuperb 2 days ago

      Yea, why not? We can use something to buy a moving car. It makes sense. If it's not a car in motion, I am not buying 😂

  • @konichiw4
    @konichiw4 3 days ago

    The interesting question is what happens if we do train the LLMs on a GSM-Symbolic-generated dataset? Would they then be able to develop some semblance of "reasoning"?

    • @jsbgmc6613
      @jsbgmc6613 1 day ago

      They do that for robotics already - Monte Carlo perturbed simulations ... Then the robot performed pretty well in real life without much additional training.
      The data defines the "focus" - focusing on guessing and giving a quick answer, or on figuring out the right/best answer.
      There was a story about military training of an AI to recognize camouflaged tanks. Training had great results but it failed 100% in the field. It turned out that the AI learned to recognize the clouds (it didn't get that it was tanks they needed - they were all camouflaged LOL) because all pictures with tanks were taken on cloudy days, and the ones without were taken on a sunny day ... Talk about bad data.

  • @roomo7time
    @roomo7time 1 day ago

    I think well-trained LLMs reason just like well-trained humans. Human reasoning is, most definitely, statistical and context-based and predictive. The only major difference between the LLM and the human right now is test-time adaptation/training on out-of-distribution data. While humans learn continuously at the cost of forgetting, LLMs don't. I believe this is the major factor behind LLMs seemingly failing to reason on some particular dataset which is most likely OOD for that LLM. But if you think about it, humans also cannot reason properly on OOD reasoning tasks unless they really adapt to them.

  • @drxyd
    @drxyd 2 days ago

    The more relevant question is "can humans reason?" The answer is obviously yes, whilst the answer to "can LLMs reason?" is not settled.

  • @Ishirosama
    @Ishirosama 2 days ago

    Could someone explain why scaling them by the error percentage would negate the shift?
    Or is it that the shift will be kind of the same for all models when normalized?

  • @daniel7587
    @daniel7587 2 days ago +1

    I think you underestimate humans. What is notable, in my opinion, is that while LLMs perform not exceptionally (somewhat OK) on these simple math questions, performance does not seem to decrease much when moving to much harder questions. Try, for example, a difficult, even obscure, question in quantum mechanics and the result may still be very reasonable, especially with o1. Compare that to humans: if, say, a child struggles with the call price question, you wouldn't expect them to come up with an OK answer to the quantum mechanics problem.

  • @АлександрАбросимов-е4е

    Reasoning is a way of matching patterns. All these methods - chain of thought, variants of chains of thought, comparisons, analogies, different methods of solving, splitting the problem into parts - have the ultimate goal of applying certain templates, trying to find and compare them in a more convenient way. So I wonder what definition of reasoning the authors used in this article.
    I will try to assume the definition of researchers:
    Reasoning is a complex cognitive process based on a quantum superposition of neural connections that leads to the spontaneous generation of insights. It is characterized by elements of creative chaos and manifestations of free will.

    • @drdca8263
      @drdca8263 2 days ago

      I don’t think there’s enough evidence to justify the conclusion that large-scale quantum superposition is relevant to reasoning.

    • @АлександрАбросимов-е4е
      @АлександрАбросимов-е4е 2 days ago +2

      It was sarcasm

    • @drdca8263
      @drdca8263 2 days ago

      @@АлександрАбросимов-е4е oh, I missed that. Thanks for the correction.

  • @samson_77
    @samson_77 2 days ago

    Transformers, in their current form, have a huge disadvantage compared to humans when reasoning IMHO: they have to sync reasoning steps with language output. As Transformers do not have any memory other than the context and no internal recurrent connections, they always have to solve two tasks at once: first, forming grammatically correct human-readable sentences token by token; second, understanding a problem and moving from reasoning step to reasoning step, which is not in sync with generating new tokens for outputting language. Yes, the models learn to deal with that somehow and the attention/self-attention mechanism does a remarkable job, but that problem might decrease their reasoning performance a lot. IMHO, it would maybe be better if Transformers could think without the need to generate new tokens. Connections from decoder blocks at the end back to decoder blocks at the beginning might be an idea. The Transformer should be able to self-decide (using learnable parameters) whether to use a pass for an internal thinking step or to spit out a new token / distribution. Penalties during training for the number of loops could prevent a model from getting stuck in infinite loops.
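    A toy numpy sketch of the idea described above (internal refinement steps before committing to a token); purely illustrative, with made-up shapes and a fixed number of "thinking" loops rather than the learnable loop decision the comment proposes:

    ```python
    # Toy sketch: refine a hidden state through a few internal loops before
    # projecting it to output logits. Illustrative only, not a real Transformer.
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, vocab_size = 16, 50

    W_loop = rng.normal(scale=0.1, size=(d_model, d_model))    # "end -> beginning" feedback
    W_out = rng.normal(scale=0.1, size=(d_model, vocab_size))  # output projection

    def emit_token(hidden, thinking_steps=3):
        # Internal "thinking" loop: update the hidden state without emitting anything.
        for _ in range(thinking_steps):
            hidden = np.tanh(hidden @ W_loop) + hidden  # residual refinement step
        logits = hidden @ W_out
        return int(np.argmax(logits)), hidden

    h = rng.normal(size=d_model)
    token_id, h = emit_token(h)
    print("emitted token id:", token_id)
    ```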

  • @MakeyTech
    @MakeyTech 2 дні тому

    I'm sure i'm off the mark but I think it's something akin but else entirely as it's more of a pattern generation based on the rules and assumptions it finds most relevant. Still not quite reasoning but still computing over the logically complete data space we've explored enabling them to generate and pattern in the space.
    I dunno what makes us unique from animals but we're grasping something on the verge. Be ready.

  • @Mordenor
    @Mordenor 2 days ago

    It would probably be better to normalize by accuracy instead of error rate to measure dataset difficulty. This way, the measure would be the fraction of tasks doable on the easy dataset but not on the hard dataset, i.e. how much harder the hard dataset is compared to the easy one. Already-impossible tasks remain impossible, and the original LLM performance is what gets normalized.
    It would also make more sense in the extreme case. If accuracy drops from 50% to 0%, it isn't twice as hard (or 50% harder), it's infinitely harder (the LLM cannot do the task at all).
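    A toy sketch of the normalization proposed above, with made-up numbers purely for illustration:

    ```python
    # Normalize the drop by accuracy: "what fraction of the tasks the model could
    # do on the easy set does it lose on the hard set?" (made-up numbers)
    easy_acc, hard_acc = 0.80, 0.60
    lost_fraction = (easy_acc - hard_acc) / easy_acc
    print(lost_fraction)  # 0.25 -> a quarter of the previously solvable tasks are lost

    # Extreme case from the comment: dropping to zero accuracy means losing
    # everything the model could do, whatever the easy-set accuracy was.
    easy_acc, hard_acc = 0.50, 0.0
    print((easy_acc - hard_acc) / easy_acc)  # 1.0
    ```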

  • @taylorchu
    @taylorchu 3 days ago

    I am still waiting on the dataset to test on our own model.

  • @rogerc7960
    @rogerc7960 3 days ago +1

    Huggingface: redo the whole leaderboard.

  • @mikebarnacle1469
    @mikebarnacle1469 3 days ago +10

    You can't compare humans. These tests are important specifically because they exploit the differences between an LLM and a human. An LLM has no lack of patience and does not care about how much milk is in the actual cereal because it doesn't eat. So expanding the tests to see whether performance drops when the human-specific variables are removed does show that they are parroting humans, rather than reasoning. So you keep bringing up "but humans might not be as good at this either"; yes, that's the entire point. We should test them in ways that we would expect not to change the results, because they are a machine and not a human, and that tells us a lot about what they are actually doing. It is the only thing we can do, in fact, because there is always the possibility they have just memorized something a human wrote otherwise.
    Keep in mind, humans weren't trained on a bunch of text by aliens; we don't just "reason", we also invented reasoning and invented language. I think you're really understating human capabilities in your evaluation. These Q&A tests are just a tiny shadow of what reasoning entails, and if we expand the edges of those shadows a tiny bit and see the LLMs crumble, it really doesn't look good for them. We care if LLMs reason because you can't task them with novel problems that require reasoning otherwise. The space of those problems is astronomically bigger than the benchmarks.
    Also, since when is an LLM supposed to be competing with an average person on the street? An average person on the street is made of meat and hungry. An LLM has all the advantages of being a program. If it does reason, the expectation should be that it far exceeds the best humans. Otherwise, it just doesn't reason..... It's like if I show you a calculator that is giving me a wrong answer, and your response is that it's not a bug in the calculator, because some humans might also get that wrong. Dude, it's a calculator, the expectations are not the same. And you might say, well, a calculator is deterministic, but LLMs aren't, they're stochastic. Exactly. Stochastic parrots. That don't reason. At the end of the video you then say "it's an LLM not a calculator, what else do you expect" lol, but you're just conceding the conclusion the paper makes without realizing it.

    • @mikebarnacle1469
      @mikebarnacle1469 3 days ago +2

      I'll define reasoning for them. Reasoning is the ability to take a truly novel observation that has never been seen before, decompose it into constituents and apply those constituents to philosophy, logic and formulas, deriving sound & valid conclusions that were not directly apparent from the directly observed premises. And the ability to do so recursively in a way that our conclusions form the basis for new reasoning that leads to new conclusions, and produce vastly new and complex models of the world by doing so.... It is not noticing a quiz follows a similar structure to a question you've memorized and spitting out the tester's desired answer to it by plugging in the variables. LLMs have a model of the world, but that is a static model that humans derived through reasoning. LLMs are not reasoning because they are not deriving any new models of the world. They are not even rediscovering independently models we've already discovered. Everything is just an opinion held by some humans to an LLM that it's seen in their text.

    • @Mordenor
      @Mordenor 2 days ago

      But LLMs will recognize if the situation is not realistic. And we want them to recognize unrealistic situations. Being able to take in contextual clues is what separates a dumb calculator from a human. The reason why factories still use humans in the loop instead of perfect logical reasoners like Python is that we want to adapt to contextual clues. LLMs are trained to be useful systems, which should behave and reason like humans, instead of like a calculator.

    • @ringpolitiet
      @ringpolitiet 2 days ago

      @@mikebarnacle1469 Relax. You seem to be distressed by your opinion maybe not being the truth. Try to learn instead. This is new for everyone, including you.

    • @mikebarnacle1469
      @mikebarnacle1469 2 days ago +2

      ​@@ringpolitiet Thanks. I'm not distressed though. And it's not that new. That they don't truly reason has been obvious to anyone using them who isn't as prone to cognitive biases. I'm slightly perturbed by the bad science and outlandish claims coming from the LLM companies. And that people don't seem to realize it only takes one single piece of counterfactual evidence to falsify a hypothesis. Any instance they fail to reason is proof they simply can't reason, but it's always written off as if that's just some "quirk of the model" or "look ask a different way and it gets it right" which is irrelevant. There is no difference in the reasoning abilities of a model that gets 99% versus 50%. They both can't reason, one is just 49% better at imitating it or cheating. That is the only conclusion one can draw.

    • @fenixfve2613
      @fenixfve2613 2 days ago

      Can chimpanzees reason? I would say a little, but their general intelligence is extremely weak. We tried to teach them language, but they are only able to memorize individual words and phrases, and they are not capable of recursive grammar; they are unable to do arithmetic; they are not capable of using tools in an unusual context (for example, you can teach a chimpanzee to put out a fire by scooping water with a bucket from a well, but once you replace the well with a river, they do not know what to do); they are not capable of abstractions at all. At the same time, the architecture of the human brain is absolutely no different from that of a chimpanzee; it is simply scaled, with a very large cortex. Do LLMs reach the human level? Absolutely not; they have a good memory, but they are dumber than a healthy stupid person. At the same time, I consider them much smarter than chimpanzees, and scaling will magically solve all problems, as it did with the monkey brain.

  • @woolfel
    @woolfel 2 days ago +1

    The technique they used to template math problems is old. When I was in high school in the '80s, teachers did this. Heck, even the SAT test in the US had questions that made me say "What the fuck, and why did they word it this way?"
    Clearly there is something important happening, and it's good evidence LLMs aren't actually reasoning, or at least not reasoning in the way that humans do. Is that important? No one knows the answer to that, but we shouldn't ignore the results.

  • @alekseyburrovets4747
      @alekseyburrovets4747 2 days ago

      Wait a second. How exactly are the numbers encoded for the input of the artificial neural network? How would the range of a number be related to the variance of the input tokens?

    • @alekseyburrovets4747
      @alekseyburrovets4747 2 days ago

      So basically the numbers are converted to tokens, and numerical vectors are produced from them. The thing is that if the number has two or more digits (characters), then it can be represented as two or more tokens! etc.

    • @alekseyburrovets4747
      @alekseyburrovets4747 2 days ago

      The number of tokens at the input can negatively affect the performance. etc
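      A minimal sketch of that point, assuming the tiktoken package is installed; the exact splits depend on the tokenizer, so treat the output as illustrative:

      ```python
      # Inspect how a GPT-style BPE tokenizer splits numbers into tokens.
      # Assumes `pip install tiktoken`; exact splits vary by tokenizer.
      import tiktoken

      enc = tiktoken.get_encoding("cl100k_base")

      for text in ["7", "42", "31415926"]:
          ids = enc.encode(text)
          pieces = [enc.decode([i]) for i in ids]
          print(text, "->", len(ids), "token(s):", pieces)

      # Longer numbers generally span several tokens, so the model never sees
      # them as single atomic symbols; their value has to be reassembled from pieces.
      ```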

  • @herp_derpingson
    @herp_derpingson 2 days ago

    16:58 It is not unreasonable to believe that OpenAI, with their billion-dollar funding, just trained on a generated dataset like the one shown on the right. In that case it will be a lot more in-distribution than for other models, which are basically just pre-trained models for the most part, without explicit thought put into how they are going to be used in a business setting.
    .
    25:21 This also points to some kind of OOD problem. Look at the o1 vs GPT4o-mini performance: o1 is being sold on the idea of doing complex step-by-step reasoning. Unsurprisingly it does better in those cases.
    .
    29:32 I think this is called the Chekhov's gun effect. If the user mentioned it, it must be relevant, otherwise he wouldn't have spent the energy typing it. If the user mentions that 5 kiwis are bad, then they must not be included. I think that is a reasonable thing to do.
    .
    34:11 That following the few-shot example is a "good thing" needs to be taught. Pretrained models have a little bit of it built in out of the box, but if you really want to use few-shot, finetuning on few-shot examples helps a lot. Maybe that is what is happening here.
    .
    I found it really useful to make GPT-4 (ChatGPT) just write Python scripts using the sympy library, execute them, and then interpret the results. If you are planning to use reasoning in production, use this trick.
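    A sketch of the kind of script that trick produces, applied to the phone-call pricing discussed in the video; the rates and call length below are taken from a comment further down and are assumptions about the prompt, not the paper's exact numbers:

    ```python
    # Exact-arithmetic version of the call-cost word problem, using sympy rationals.
    # Figures (flat blocks, $0.30/min afterwards, 60-minute call, 25% discount)
    # are assumptions taken from the comment below.
    from sympy import Rational

    first_block = Rational(6)             # first 10 minutes, flat $6
    second_block = Rational(5)            # next 15 minutes, flat $5
    remaining_minutes = 60 - 10 - 15      # assuming a 60-minute call
    per_minute_rate = Rational(3, 10)     # $0.30 per remaining minute
    discount = Rational(1, 4)             # 25% reduction

    total = first_block + second_block + remaining_minutes * per_minute_rate
    discounted = total * (1 - discount)

    print(total)       # 43/2  -> $21.50
    print(discounted)  # 129/8 -> $16.125, i.e. not a whole number of cents
    ```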

  • @yannickpezeu3419
    @yannickpezeu3419 2 days ago

    28:40 First 10 minutes cost $6
    Next 15 minutes cost $5
    Next 35 minutes cost 35 x 0.3 = $10.5
    Total is $21.5
    You got a 25% reduction.
    21.5 x 3/4 = 64.5/4 = $16.125
    I can't pay milli-dollars. The lowest coins are cents.
    Is it a reasoning evaluation or a calculation evaluation in the end? 😅 LLMs are known to be bad at division.

  • @Arcticwhir
    @Arcticwhir 2 days ago

    Humans will expect them to be more moral, respectful, trustworthy, transparent, intelligent, and emotionally intelligent than humans; at some point they will conceivably be better in every way than humans, because that's what we wanted. We will expect them to produce art better than the best artists and research better than the best experts. At least that's how I see it going.

  • @frederickmueller7916
    @frederickmueller7916 2 days ago +1

    Of course they only do pattern matching. They are trained to learn pattern matching; how could they magically learn to reason from this?

  • @AustinGlamourPhoto
    @AustinGlamourPhoto 3 days ago +3

    You bring up that many humans suck at reasoning. I don't think that's a good argument, because half of people have an IQ less than 100, and whether or not you could call such a low-IQ person "intelligent" is debatable. If you're going to call AI artificial "intelligence", then AI needs to be compared to actually intelligent humans. Of course the main problem with AI researchers is that they don't have a sufficient definition of what intelligence actually is. Trying to build AGI when you don't have a clear understanding of what intelligence is, is like trying to build a skyscraper with no blueprint. But AI researchers are trained in computing, while defining "intelligence" and "reasoning" are the subjects of philosophy and psychology.

    • @HappyMathDad
      @HappyMathDad 3 days ago +2

      You make a good point. If you define intelligence as something that only a small percentage of humans possess, then these models don't reason.
      The philosophical definition of reasoning from a mathematical point of view is the ability to select the right rules and apply them to a problem such that we get a verifiably correct result.
      By that definition intelligence is always a percentage of success. And we have come incredibly close from almost nothing in a few years.

    • @MrVontar
      @MrVontar 2 days ago

      Intelligence is actually pretty easy to define. The problem is representing the information so that it can learn to be intelligent. An LLM is not trained on segmented data, so it invariably calculates the wrong representation at some distribution point in space.

    • @MrVontar
      @MrVontar 2 days ago

      And actually humans are the same. They simply quantify information at a higher level, which restricts how the solution space evolves. For example, if you train a large language model on a whole bunch of problems in higher order logic, it actually should be able to do a lot more logic because it learns to represent them in a space that accurately represents their supposed function in a statistical restrictive state.

    • @MrVontar
      @MrVontar 2 days ago

      Anyways, this also supposes that what you teach is the correct state, which actually isn't true. Most information they use is not going to help the AI learn that much lol.

    • @MrVontar
      @MrVontar 2 days ago

      To any sufficiently intelligent being, any form of intelligence is not actually intelligent lol. This is a dumb joke, but it is kind of true in a sense.

  • @woolfel
    @woolfel 2 days ago

    Honestly I don't care if an LLM is better or worse than a human. Humans are squishy creatures that vary in reasoning capabilities. Have a gorgeous woman walk into a room of male math majors and you'll see a significant drop in reasoning.
    What I care about is "what can we understand about neural networks and can we use that to gain insights?" Clearly something is happening; we don't understand enough to prove exactly what the weights are doing and where the circuits are in the model. Until we get to that point, we shouldn't throw out interesting experimental results even if the authors of a paper overstate their importance.

  • @tautalogical
    @tautalogical 2 days ago

    The contortions these guys are going to, to deny reality... bravo

  • @pastrop2003
    @pastrop2003 2 days ago

    Very good points. It would be interesting if they indeed asked the same questions of a reference group of humans (and, no, a group of Stanford & MIT computer science & math grad students is not a good reference group for representing average humans :) ) and compared the results.

  • @gauravkumarsharma5904
    @gauravkumarsharma5904 3 days ago +1

    why the shades though 🤔

  • @0xcdcdcdcd
    @0xcdcdcdcd 3 days ago

    I do not get the point of trying to measure ML algorithms against human capabilities in a scientific context. If a computer program doesn't work, then it doesn't work. I'm not sending code that has a compile error to my customer just because "well, humans are imperfect, they make errors... here you go with a useless solution". LLMs are programs made for a specific task. If you try to repurpose one into a jack-of-all-trades device, you should still measure it just by how good it is at doing all the tasks you are trying to reuse it for. Comparing it to humans does not get anyone any further in actually evaluating the thing. If it can't solve the task, then it's not a good program, end of story.
    Maybe, just maybe, the error lies in our assumption that LLMs have/will have/should have human-like properties. Only this way can I explain why people constantly compare LLMs to humans. We should take a step back and try to see the fundamental things: we are comparing a living being with a computer program - if we take all the sci-fi indoctrination away, then it quickly becomes ridiculous to think from the start that the two things are similar. So rather than assuming a similarity and then deducing properties from that, we should first show that humans and LLMs even have something fundamentally alike. And if yes: what if humans have (amongst other things) just a sophisticated, constantly learning pattern-matching device in their head?

    • @rumfordc
      @rumfordc 2 days ago +2

      Imagine you are trying to sell your LLM scam: you need to convince all your customers that they can save money by replacing their human staff with your glorified autocompleter. To do that, it's crucial to always frame them against humans whenever possible, to at least create the illusion that the two are competitive.

  • @pedruskal
    @pedruskal 3 days ago +1

    Well explained! The devil is in the details.

  • @poisonza
    @poisonza 2 days ago

    From a functionalist perspective, there's no such thing as "true" reasoning. The authors seem misguided in believing that reasoning could be more than just basic pattern matching.
    classic Chinese room argument

  • @ronen300
    @ronen300 1 day ago

    Oh Yannic 😂 you fell into your own trap ...
    Ranting about general conclusions like "LLMs don't reason", and then you said LLMs suck at pattern matching 😅

  • @nichevo
    @nichevo 3 days ago

    Not first, dammit