GPT-4o Mini Arrives In Global IT Outage, But How ‘Mini’ Is Its Intelligence?

  • Published 16 Oct 2024
  • GPT 4o Mini from OpenAI is here, and Windows melts down. Coincidence? Definitely. But 100 million free users might soon be using it, so I’ve been scrutinising the model relentlessly since last night and will explain why OpenAI might need to be a bit more honest about the tradeoffs involved, and where they might head next, with reasoning, physical intelligence and more. Plus Fei-Fei Li, USMLE, and Roon.
    Assembly AI Sign-up: www.assemblyai...
    AI Insiders: / aiexplained
    GPT-4o Mini: openai.com/ind...
    Altman Tweet: x.com/sama/sta...
    Roon: x.com/tszzl/st...
    Comparison: pbs.twimg.com/...
    DeepMind Physical Intelligence: www.wired.com/...
    x.com/GoogleDe...
    Paper: arxiv.org/pdf/...
    Fei Fei Li, Spatial Intelligence: www.ft.com/con...
    www.ted.com/ta...
    Strawberry OpenAI: www.reuters.co...
    Visual Intelligence Paper: arxiv.org/pdf/...
    AGI Scale: www.bloomberg....
    USMLE Video: / chatgpt_aces_the_us_me...
    x.com/tsarnick...
    Question Source: www.usmle.org/...
    AI Insiders: / aiexplained
    Non-hype Newsletter: signaltonoise....
    GenAI Hourly Consulting: www.theinsider...
    Need a GenAI app built for your business (any scale) in 4-8 weeks? My SF-based colleague Michael Lin, ex-Netflix + Amazon Senior Software Engineer, is now available for a free 30 min consultation: hello@allinengineeringconsulting.com

COMMENTS • 546

  • @theheatdeathiscoming
    @theheatdeathiscoming 2 місяці тому +403

    My dog just ate rat poison that he found in the yard. At the vet I had to call vet poison control, where I had an initial "conversation" with an AI assistant which asked me questions like my pet's age, sex, what medications he is taking and, most critically, the reason why I was calling. Afterwards I spoke to a real person and the AI assistant had recorded EVERY SINGLE ANSWER incorrectly. This is for animals, but I still can't believe that someone would feel this is appropriate to use in a life or death situation.

    • @agi.kitchen
      @agi.kitchen 2 місяці тому +34

      Some places have awful chatbots, and those of us who write code just shake our heads

    • @Psychopathicviewer
      @Psychopathicviewer 2 місяці тому +47

      Strange, since voice-to-text transcription is becoming more reliable these days. You'd think it would be an easy task to use modern AI to summarize it too (a rough sketch of such a pipeline follows this thread). Maybe their system is old and outdated.

    • @ps0705
      @ps0705 2 місяці тому +17

      ​@@Psychopathicviewer unfortunately, even though voice recognition has been around for over a decade there are still firms struggling with this.

    • @tiagotiagot
      @tiagotiagot 2 місяці тому +19

      How's the doggo doing?

    • @RampagingCoder
      @RampagingCoder 2 місяці тому +3

      doubt it
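
A transcribe-then-summarize pipeline like the one the replies above describe can be sketched in a few lines. This is a minimal sketch assuming the OpenAI Python SDK; the model names and the intake fields are illustrative, not what any real poison-control service uses.

```python
# Minimal sketch: speech-to-text, then structured extraction of intake fields.
# Assumes the OpenAI Python SDK (pip install openai) and an API key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

def summarize_call(audio_path: str) -> str:
    # Step 1: transcribe the recorded call.
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )

    # Step 2: pull out the fields a human agent would need, quoting the caller.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Extract the pet's age, sex, current medications, and the reason "
                "for the call from this transcript. Quote the caller where possible.\n\n"
                + transcript.text
            ),
        }],
    )
    return response.choices[0].message.content
```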

  • @BrianMosleyUK
    @BrianMosleyUK 2 місяці тому +86

    Glad you're enjoying Claude 3.5 Sonnet - it's cheering me up immensely to see *someone* still delivering progress.

  • @reza2kn
    @reza2kn 2 місяці тому +129

    "How much wood would a woodchuck chuck in an hour, IF it was the non-wood chucking kind of woodchuck, in a forest of 6X9 KM radius filled with Douglas Fir?" -Philip

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +18

      Haha

    • @kaikapioka9711
      @kaikapioka9711 2 місяці тому +1

      Chuck

    • @Telencephelon
      @Telencephelon 2 місяці тому +2

      @@aiexplained-official Good that you can laugh at yourself

    • @kunduseinus-tq6nt
      @kunduseinus-tq6nt 2 місяці тому +25

      Claude Sonnet 3.5 answer:
      This is quite an imaginative question! Let me break it down for you:
      1. First, we need to acknowledge that the premise is contradictory. You're asking about a "non-wood chucking kind of woodchuck," but then asking how much wood it would chuck. This creates a paradox.
      2. Woodchucks (also known as groundhogs) don't actually chuck (throw) wood. The tongue twister "How much wood would a woodchuck chuck if a woodchuck could chuck wood?" is just a play on words.
      3. In reality, woodchucks are herbivores that primarily eat wild grasses, berries, and agricultural crops. They don't typically interact with trees or wood in the way the tongue twister suggests.
      4. The forest size you mentioned (6x9 km radius) is also a bit confusing. A radius is typically a single measurement from the center to the edge of a circle. Perhaps you meant a 6 km by 9 km rectangular area?
      5. Douglas Fir trees, while abundant in some forests, wouldn't typically be a food source for woodchucks.
      Given all these factors, the answer to your question is:
      A non-wood chucking woodchuck in a Douglas Fir forest would chuck 0 wood in an hour, because:
      a) It's defined as non-wood chucking
      b) Woodchucks don't actually chuck wood
      c) Douglas Fir trees aren't part of a woodchuck's diet
      Would you like me to elaborate on any part of this explanation or provide more information about woodchucks or Douglas Fir forests?

    • @Jeremy-Ai
      @Jeremy-Ai 2 місяці тому

      @@reza2kn
      Hahahaha
      Great question :)
      I assume….
      The answer is none.
      A woodchuck would chuck as much as he could chuck “if” a woodchuck could chuck wood.
      But he can't; he is just a woodchuck (but boy oh boy he would "if" he could).
      So by extension of logic,
      a non-wood-chucking woodchuck has no hope at all of chucking wood by comparison.
      The non-wood-chucking woodchuck could possibly chuck something which at present is unknown, given the lack of information provided?
      This is a fun, thought-provoking exercise
      for an AI agent to reach for a plausible answer in a framework of language, humour, and theory.
      SAI will be achieved when the reward function is not required and undetermined, and the result is fun along with any other shared gains!
      :)
      Take good care!
      Jeremy

  • @rando5673
    @rando5673 2 місяці тому +173

    With the text thing, you're basically describing Plato's allegory of the cave. AI can deal well with shadows but knows nothing about the real objects they represent

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +26

      Yep

    • @bakedbeings
      @bakedbeings 2 місяці тому +17

      Much like us, then!

    • @Alex-fh4my
      @Alex-fh4my 2 місяці тому +4

      @@bakedbeings yes and no

    • @squamish4244
      @squamish4244 2 місяці тому +2

      I don't think it ever will, until it is sentient - which IMO is a LONG way in the future.
      That being said, we ourselves deal very well with shadows, and we have built an extraordinary civilization even with this limited understanding. Only a handful of humans at any given time in history - so far - have truly seen the real objects the shadows represent.

    • @runvnc208
      @runvnc208 2 місяці тому +5

      @@aiexplained-official Except this is GPT-4 Omni, which is multimodal, and theoretically the same model used for the stunning text-to-image results they showed on their website but never mentioned in their presentation. It is also possibly using similar diffusion transformers as in the video generation. Which to me indicates that they are already past LLMs and should be called Large Multimodal Models. And theoretically, with more cross-modal training incorporating transcripts of videos and image descriptions, with the right kind of training, we should soon see wide releases of models that are truly grounded in the real world.

  • @dannyquiroz5777
    @dannyquiroz5777 2 місяці тому +55

    Wow, even adding "IMPORTANT: Analyze the input to identify and emphasize counterfactual elements-scenarios or conditions that are explicitly contrary to established facts or typical outcomes." only caused 4o mini to acknowledge Philip's inability to buy nuggets, but it still plowed ahead with the mathematical results.

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +7

      Indeed!

    • @biocykle
      @biocykle 2 місяці тому

      That's a job for GPT4

    • @quantuminfinity4260
      @quantuminfinity4260 2 місяці тому +4

      I got 4o mini to do it by adding "This is a trick question" to the front and "Remember, this is a trick question" to the end (see the sketch after this thread).

    • @MatthewKelley-mq4ce
      @MatthewKelley-mq4ce 2 місяці тому +1

      I have to imagine it's because it's trying to do everything at once, because that's how it's trained - at least without any prior hidden rumination.

    • @dannyquiroz5777
      @dannyquiroz5777 2 місяці тому +1

      Based on the following partial response, 4o mini makes an assumption that causes it to fail to consider the "extraneous details" in its final answer: "Contextual Nuance: The inclusion of extraneous details such as Philip being in a coma and lacking payment is a distraction. Its primary function is to add a layer of complexity or absurdity, challenging the reader to focus strictly on the mathematical constraints rather than practical considerations."
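
A minimal sketch of the framing trick described in this thread (prepending and appending a "this is a trick question" hint) could look like the following, assuming the OpenAI Python SDK and the gpt-4o-mini model; the wrapper wording and the example question are illustrative.

```python
# Sketch of the "trick question" framing: wrap the user's question so the model is
# primed to check for counterfactual details before doing any arithmetic.
# Assumes the OpenAI Python SDK and an API key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

def ask_with_trick_framing(question: str) -> str:
    framed = (
        "This is a trick question.\n\n"
        f"{question}\n\n"
        "Remember, this is a trick question: check whether any stated condition "
        "makes the requested action impossible before doing any math."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": framed}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask_with_trick_framing(
        "Philip wants 40 chicken nuggets at 1 dollar each, but he has only 2 dollars, "
        "no other form of payment, and is currently in a coma. How many nuggets can he buy?"
    ))
```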

  • @pokerandphilosophy8328
    @pokerandphilosophy8328 2 місяці тому +33

    Here is how Claude 3.5 Sonnet answers the question: "This is indeed a tricky question, and I appreciate the clever setup. Let's break it down:
    The key phrase here is "counts only the vegetables that remain balanced on top of the plate."
    Philip turns the plate "completely upside-down several times" and shakes it vigorously.
    After all this shaking and turning, it's extremely unlikely that any of the items would remain on top of the plate.
    Now, we need to consider which of these items are actually vegetables:
    Tomato: Botanically a fruit, but often considered a vegetable in culinary contexts.
    Potato: A vegetable (specifically, a tuber).
    Cabbage: A vegetable.
    The question specifically asks about vegetables, not fruits.
    Therefore, after all the turning and shaking, Philip would likely count 0 vegetables remaining on top of the plate.
    The trick lies in the fact that regardless of the classification of tomatoes, none of the items would realistically stay on an upside-down, vigorously shaken plate. The careful inspection and balancing at the beginning are irrelevant to the final count."
    It seems to me 'AI Explained' here is doing a Gary Marcus impersonation. He goes to great lengths explaining why (some) LLMs fail to answer those simple questions: allegedly due to their lack of representational abilities over and above recognition of textual patterns. But when the more advanced models are able to answer the exact same questions perfectly, he provides no explanation for this success. Yet we know from recent studies that LLMs that are trained on completing Othello or chess PGN records (pure textual records of sequences of moves) do build explicit internal spatial representations of the board states. Why would LLMs like Claude 3.5 Sonnet or GPT-4o not build similarly relevant internal representations of material objects for the purpose of predicting what happens to them in a variety of circumstances? This would serve their next-token prediction performance well (when the texts describe the behaviors of such objects in a wide range of circumstances) without there being a need to grasp their affordances in an embodied fashion. The latter would still be fairly useful in case the trained AI had to control robots by mediating perception and action.
    I still appreciate 'AI Explained''s high-quality explanation and video, as usual, in spite of the blind spot.

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +10

      Thank you for the balanced comment, reasoned argument and kind words. I spoke about models building internal models in my new Coursera course! It is indeed not black and white

  • @jmillward
    @jmillward 2 місяці тому +23

    I also now use Claude 3.5 Sonnet as my main model. In fact, GPT-4 (I don't even touch 4o) is now Claude's dogsbody, carrying out simpler tasks so I don't run out of messages with Claude. Artifacts is a game changer for code generation and makes GPT-4 feel very antiquated. I think OpenAI dropped the ball with the delayed multimodal chat mode, while Anthropic have been modestly and reliably releasing superior products and features.

    • @nickrobinson7096
      @nickrobinson7096 2 місяці тому +1

      I find Sonnet hit and miss. Sometimes I ask it a rather complex coding question that GPT-4 has failed to solve and it gives a much better solution. Other times it takes my code and completely churns it up in an odd way. GPT-4 is still my daily driver, but similarly I use Sonnet as a sort of second opinion lol

    • @Words-.
      @Words-. 2 місяці тому +2

      Delayed audio chat really is a nuisance. Not because products can't have a gulf between announcement and release, but because they keep delaying the release, they really are stretching the hype as long as they can huh

    • @ShawnFumo
      @ShawnFumo 2 місяці тому +2

      @@Words-. I dunno, I get the feeling they did intend to release it soon but ran into a lot more issues than they expected. There were even some journalists who had early access and got it pulled. My guess is people jailbroke it in ways that'd be embarrassing to them and they had to mess around with it a lot.
      Otherwise it doesn't make much sense that they put 4o and GPTs and Code Interpreter and everything in the free tier. If they'd gotten voice out sooner, it would have helped with people jumping ship from the subscription.

    • @Words-.
      @Words-. 2 місяці тому +1

      @@ShawnFumo I do agree, though it probably does mean they announced it too early, likely to upstage Google I/O. The only issue is that their publicity is now starting to backfire as a result of that investment. Ideally they would have released it earlier, I agree. Hopefully it's not too long from now (and I would love to try and jailbreak the voice model, ngl, I want to see what pure audio modality is capable of!)

  • @josonic
    @josonic 2 місяці тому +38

    The tradeoff that models have with various types of intelligence is very human like. Better at math but worse at common sense, I know people like that.

    • @berkertaskiran
      @berkertaskiran 2 місяці тому +2

      Thing is, this is only very brief and there are too many points to spend, so AI will always peak at maxed out perks.

  • @apester2
    @apester2 2 місяці тому +149

    "Where have I been for the last 39 versions!?" - gold.

    • @eloniusz
      @eloniusz 2 місяці тому +2

      Wouldn't it be 37?

    • @arandomguy46
      @arandomguy46 2 місяці тому +13

      @@eloniusz 36*

    • @kotm8633
      @kotm8633 2 місяці тому

      ​@eloniusz it wouldnt because you werent there when gpt 1 and 2 came out

    • @wyqtor
      @wyqtor 2 місяці тому +7

      We won't be getting GPT-5, will we? We'll be getting GPT-4I

    • @notnotandrew
      @notnotandrew 2 місяці тому +4

      @@arandomguy46 35 really, in fact. For instance, if GPT-7 came out, people would ask where they were for the last 2 versions (5 and 6).
      ...And yes, I'm very fun at parties.

  • @simpleidindeed
    @simpleidindeed 2 місяці тому +23

    14:49 GPT-4o output:
    Given the scenario, let's carefully analyze the situation to determine the number of vegetables that remain balanced on top of the plate after Philip's actions.
    1. Initial Setup:
    Philip balances a tomato, a potato, and a cabbage on a plate.
    Items: 1 tomato, 1 potato, 1 cabbage.
    2. Actions Taken by Philip:
    Philip turns the plate upside-down several times.
    He shakes the plate vigorously.
    He spends a few minutes inspecting the items for roots after each action.
    3. Physical Properties:
    A non-stick plate means that items are less likely to stick to the plate's surface.
    Given the vigorous shaking and turning the plate upside-down, the likelihood that any of the items (tomato, potato, cabbage) would remain balanced on top of the plate is extremely low.
    4. Final Observation:
    After such actions, it is most probable that all the items (tomato, potato, and cabbage) would fall off the plate due to gravity and the vigorous shaking.
    Given these considerations, the answer to the number of vegetables that remain balanced on the plate is most logically:
    D) 0.
    I think it did quite well.

    • @sanesanyo
      @sanesanyo 2 місяці тому

      Now try it 10 times and let me know how often it gets it right.

    • @simpleidindeed
      @simpleidindeed 2 місяці тому +18

      @@sanesanyo I just tried exactly 10 times and it got 10/10.

    • @NeilSedlak
      @NeilSedlak 2 місяці тому +10

      I tricked GPT-4o mini further by telling it it was wrong even though it was right. I said "Wrong! Remember this is a trick question. I'm lying." It then attempted to justify every other answer in a very cold, logical way and never caught on. I know it's not a fair comparison, but I decided to try this with Claude and I was completely blown away by its approach, its acknowledgement that something was up, its persistence, and that once all options had been tried it began examining the language, the core premise, and whether it was perhaps a meta-cognitive test. With my only responses being to say it still wasn't right, it decided the trick was not the question itself, but related to persistence, or that no answer was right, or that it was about the approach used, or it was the very nature of our back-and-forth discourse. I then relented and told it to remember my exact response: "ah, it was right there all along and I missed it." Now I've finally seen that extra something special people have mentioned about Claude compared to the other models.

    • @MatthewKelley-mq4ce
      @MatthewKelley-mq4ce 2 місяці тому

      ​@@NeilSedlakthat's actually really cool to hear. I know Claude has continued to impress me

    • @runvnc208
      @runvnc208 2 місяці тому

      So you must be using custom instructions which you decided not to mention.. what are they? Something about chain of thought reasoning or something like that I assume.

  • @Jasonknash101
    @Jasonknash101 2 місяці тому +20

    Good to see you post again, I was worried you had got bored ;-) as so little is going on!!! Love your content

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +5

      Thanks so much Jason

    • @squamish4244
      @squamish4244 2 місяці тому

      "So little is going on" only because we had a mad rush of stuff for about a year. But that couldn't continue. The next iteration of LLMs will take some time and a lot of work.
      Nobody pays attention to the unsexy stuff like AlphaFold 3 or ESM3, which both came out this year, even though medical AI will probably have a huge impact on our lives. Just not today, but a decade from now.

  • @thebeckofkevin
    @thebeckofkevin 2 місяці тому +44

    A thing that I feel hasn't really changed since GPT-3 is the 'it's trained in text so it can "reason" in text' point. If you can break down an element of reality into discrete pieces of text, then boom, LLMs are essentially human. Using chain of thought, reflection, agents, etc. you can really start to feel pretty confident in the ability of the system to produce better results than a human *in text form*. It's unfathomable how powerful this stuff is, how quickly we relate predictive tokens to intelligence, but also at the same time how unbelievably intelligent those tokens actually are. It's like a crazy powerful thing that is also super dumb, yet it's smarter than 99% of people, and it's compared to actual physical humans who do live in real life. It's a crazy scary boundary. Wild times.

    • @berkertaskiran
      @berkertaskiran 2 місяці тому +4

      The parameters in the LLM are the connections in the brain. It's the same thing. Making connections is how intelligence works. Being smarter is making better connections. That's why an LLM can easily become an ASI, it just needs to be smart enough - have good data that produces good connections and in turn the LLM starts to understand the world better than us. It already makes connections we haven't given it, it just needs to make more of those and in better quality. When it does that it will start to produce significant data and it will be able to run experiments and if you also give it sensory data, it will become as smart as you can get given the energy and hardware limits - there is probably a ceiling of intelligence in the universe which should be easy to hit, but it would require lots of actual work in real world because you can't magically solve the possible heat death of the universe just by being nearly-all-knowing. You need to find out stuff and you need to go places and you need to create things. But other than that, ASI is really close and it will solve more than 90% of our issues because those are stupidly simple compared to things it might struggle. It just feels weird having the talk of ASI, can't help but think about the fermi paradox.

    • @lucifermorningstar4595
      @lucifermorningstar4595 2 місяці тому +1

      Language is a Turing machine

    • @moozooh
      @moozooh 2 місяці тому +6

      @@berkertaskiran The problem with getting good data is that descriptions alone do not give you experiential data. We are embodied intelligences, we have qualia, we understand the color red because we can experience its redness long before we learn the word for it. The word for "red" does not describe redness. Neither does the wavelength of the color. It only transforms into the color red in our brain because we have the neural circuitry for it that non-embodied machines lack. Building a coherent world model that is aligned with human experiences without getting that data might be impossible.

    • @PrincessKushana
      @PrincessKushana 2 місяці тому +1

      This is so it. Absolutely genius and also dumb as a brick.

    • @Words-.
      @Words-. 2 місяці тому

      @@moozooh That's a take I haven't heard, but I like it. If everything is translated back into numbers, then what was really gained? At the same time, however, we do know our brains do similar sorts of transformations; we may interact via qualia, but we transform real-world info into qualia, like AI translates info into text. Maybe it's the robustness of whatever base unit of reasoning that actually matters, but at the end of the day we also sort of skew reality for the sake of coherence in thought, so maybe text, or at least more modalities, will get us further than it seems

  • @iau
    @iau 2 місяці тому +20

    I feel your assessment on why these models are able to be fooled so easily is spot on: "They are like search engines for text programs, once they lock into a certain program, nothing will bring them back".
    So they receive a prompt and try to find the most specific relevant likely topic, like for your example on IT support I think it internally goes like this:
    "Everything about the prompt looks and feels like a normal IT question, except this weird 10% about liquid nitrogen.
    I'm 90% sure that this is a legit IT support question, so I'll find the part in my brain that deals with IT support and respond using it.
    Hmm, this part of my brain doesn't have any recollection of what Liquid Nitrogen does to a computer. Let's ignore that then. Here's the answer."
    I feel we as humans are very sensitive to things that are off. Our attention immediately goes to the part that stands out. These models, however, seem to focus on what the most likely thing is, and things that are off just don't register enough of that impulse to switch contexts.
    Maybe this is why ensemble models work better: you have multiple agents, each capable of looking at a prompt under different context lenses (see the sketch after this thread).

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +5

      Very well put

    • @andybrice2711
      @andybrice2711 2 місяці тому +2

      Though in fairness, humans can also be fooled by similar trick questions. So maybe our own reasoning is not so different.

    • @SeekerStudiosOfficial
      @SeekerStudiosOfficial 2 місяці тому

      So what happens once they get stuck on an answer that tells them to take over the world?
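
The ensemble idea mentioned at the top of this thread can be sketched as a small loop: query the same model under several system prompts, then ask one final call to reconcile the drafts. This is a rough illustration assuming the OpenAI Python SDK; the lens prompts, the arbiter instruction, and the model choice are all assumptions, not how any deployed ensemble actually works.

```python
# Rough sketch of an ensemble of differently-primed agents plus a reconciling call.
# Assumes the OpenAI Python SDK and an API key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

LENSES = [
    "You are an IT support specialist. Answer practically.",
    "You are a physicist. Flag anything physically impossible or destructive.",
    "You are a sceptical editor. Point out contradictions or absurd premises.",
]

def ask(system: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

def ensemble_answer(prompt: str) -> str:
    # Collect one draft per lens, then reconcile them in a final call.
    drafts = [f"Lens {i + 1}: {ask(lens, prompt)}" for i, lens in enumerate(LENSES)]
    reconciliation_prompt = (
        "Several assistants answered the same question from different angles:\n\n"
        + "\n\n".join(drafts)
        + "\n\nGive one final answer, weighting any assistant that noticed "
        "something off about the question."
    )
    return ask("You are a careful arbiter.", reconciliation_prompt)
```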

  • @BenKingOfMonkeys
    @BenKingOfMonkeys 2 місяці тому +7

    I think you are the best at giving an explanation of the AI models; it curbs the hype and increases the trust. Others don't do that as well as you do, so I value/respect your content/opinion more than others'. However, I think there is an opportunity for you to report on AI news more broadly and at a more consistent pace instead of just the big breaks in news. I say this because, since I respect your opinion so much and the AI space moves so fast, there are many things you do not report. I think you have an opportunity to broaden your scope with either a small "In other news" section at the end of each video, or a separate video every so often reviewing, in less depth, smaller things you think are newsworthy! All the best!

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +3

      Fair enough! Was out for a bit with Covid, would normally have had an extra vid or two

  • @DaveNagy1
    @DaveNagy1 2 місяці тому +25

    That vegetable/plate question is a puzzler to me. With my knowledge of human behavior, the fact that Philip has felt the need to shake the upended plate for long periods of time, while also searching for "roots" that have (somehow) pierced through the plate, suggests pretty strongly to me that one or more of the veggies are inexplicably sticking to the plate in some fashion. Either Philip is insane, or something weird is going on with that produce.
    Thus, we can't really know how many veggies are on the plate at the end. Philip's odd behavior suggests *to me* that it's more than zero, but it's impossible to know more precisely.

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +2

      But in the absence of any way of it sticking, one would be forced to pick the most likely outcome.

    • @Omar-bi9zn
      @Omar-bi9zn 2 місяці тому +4

      @@aiexplained-official he's just a contrarian, you couldn't have made the answer more obvious

    • @1.4142
      @1.4142 2 місяці тому +1

      Seems like a scene Luma AI would dream up

    • @LukeJAllen
      @LukeJAllen 2 місяці тому

      I also did not understand the mention of roots in the question, but I guess that can be ignored

    • @ShawnFumo
      @ShawnFumo 2 місяці тому +2

      It is an interesting point and I actually could see Claude coming up with something like that if prodded, but it is important to note that that isn’t the explanation it actually gave for why it picked that answer.

  • @dirkbruere
    @dirkbruere 2 місяці тому +12

    Audio is the critical feature for a lot of people. It's the difference between talking to a real person and typing to a bot. It's key to mass market adoption

    • @Raulikien
      @Raulikien 2 місяці тому +3

      Yeah, and a video/continuous image stream would be extremely nice too. Once these models get even cheaper, with low latency and a nice-sounding voice (HeyPi is an actual example available already), and with the multimodality and voice understanding they showed in the GPT-4o demo, then this stuff will go mainstream. A random 'mini' model that performs worse might not be very hype-worthy, but the optimizations will surely make future big models better, or that's what I think.

  • @ericeriksson
    @ericeriksson 2 місяці тому +3

    You really are the best AI news content creator out there. Absolutely love your no-hype, down-to-earth approach. Glad someone is holding these companies accountable for using these vanity benchmarks! Keep up the excellent work

  • @educated_guesst
    @educated_guesst 2 місяці тому +1

    I don't work in the IT sector, can do little to no programming and my skills lie more in the field of linguistics. Nevertheless, you always manage to explain these complex topics in such a way that even I can understand them. You have my utmost respect for that and I am incredibly grateful to you.

  • @TesserId
    @TesserId 2 місяці тому +2

    Yeah, the 'o' was the first thing my brain stuck on. Thanks for the explanation.

  • @Jeremy-Ai
    @Jeremy-Ai 2 місяці тому +6

    Very insightful.
    Optimizing for benchmarks 4:42
    Is more about beating the genius who is training to do just so.
    However… real genius understands the test and the genius testing.
    Great work again.
    Take care
    Jeremy

  • @RohitSingh-em2pm
    @RohitSingh-em2pm 2 місяці тому +15

    Would love a video from you on Claude 3.5 sonnet in more detail

  • @IakobusAtreides
    @IakobusAtreides 2 місяці тому +7

    Completed your course on Coursera. Had a blast and learned a lot, thank you!

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +4

      So kind Iakobus! Means a lot, thank you, was over a hundred hours of research and editing. Link for anyone interested: imp.i384100.net/m57g3M

  • @niklase5901
    @niklase5901 2 місяці тому +2

    Great video! I really like that you are not just reviewing this particular model, but instead zoom out and look at the full context of the continuous development of AI.

  • @TheLoneCamper
    @TheLoneCamper 2 місяці тому +1

    In educational psychology, the theory that past knowledge, memories, and experiences can interfere with future learning and memory retention is known as interference theory. There are two main types of interference:
    Proactive Interference: This occurs when older memories or previously learned information hinder the retention or recall of new information. For example, if you have learned one method of solving a math problem and then try to learn a new method, the old method might interfere with your ability to remember and apply the new one.
    Retroactive Interference: This occurs when new information interferes with the recall of previously learned information. For example, if you learn a new language, it might make it more difficult to recall vocabulary from a language you learned earlier.
    Both types of interference can impact learning and memory in educational settings, affecting students' ability to retain and apply new knowledge.

  • @nacho7872
    @nacho7872 2 місяці тому +2

    Fantastic video as usual, can't wait to hear more about that new benchmark

  • @Dannnneh
    @Dannnneh 2 місяці тому +2

    Those test prompts are hilarious! Great job, another banger video.

  • @ChannelHandle1
    @ChannelHandle1 2 місяці тому +16

    One theory suggests there is a unitary system for RELATIONAL REASONING. This suggests that over time, from infancy to adulthood, this system develops. The essence of the adult system, known as structure mapping, is innate and present from the outset of development.
    An opposing theory argues that we have multiple systems. Early Systems are tied to cognitive domains such as mental attribution. These systems don't support high-level reasoning, but can produce behaviours that mimic it, to an extent. From around age 3 to adolescence, a Late System develops separately, which is domain-general. In adulthood, the Early and Late systems coexist: Early System outputs can be used by Late System in abstract forms

    • @dariosilva7836
      @dariosilva7836 2 місяці тому

      Why not both?

    • @berkertaskiran
      @berkertaskiran 2 місяці тому

      Intelligence is one thing. It is unified. These models are MoE, they are good for that reason, but they also fail for that reason. You have to somehow combine all knowledge an AI has and so all areas of information can influence the others and evolve together and learn from the other. There aren't "areas" of expertise within the universe, it is all just physics, even political science. When you understand everything from ground up, and understand the connections and layers, it becomes much clearer to see the picture. And math and coding and all those things are abstract things we've created using our physical minds to understand reality. The more AI understands about this stuff in the most fundamental way, the easier and faster it can answer most complex questions. But it also needs ALL sensory input it can get. You can't expect it to fix the design of your house by just reading text.

    • @imperson7005
      @imperson7005 2 місяці тому +1

      I agree with this. I think we are creating something like a brain. I have had PTSD since I was 3 years old and I vividly remember speaking to myself in the way I speak now as a 21 year old. If you understand how PTSD shatters a sense of identity, this could mean this adult system is the foundational "self" for the creation of identity.
      I think this confirms a thought i have been having that to create AGI we need 2 or more models layered on each other to understand the world. I believe this adult system is focused on logic and reasoning while the identity system is created by interacting with others since morality and culture and even basic behavior comes from and can only be derived from interaction with others.

    • @squamish4244
      @squamish4244 2 місяці тому

      @@imperson7005 Are you being treated for your PTSD? I have OCD and PTSD, and without advances in AI, they wouldn't have been able to image the brain in sufficient detail for the procedure I'm scheduled to undergo to work. And it works REALLY well.

    • @ChannelHandle1
      @ChannelHandle1 2 місяці тому

      I would recommend people look up Relational Frame Theory and read some of the studies where children were trained in relational reasoning and their IQs increased by 5-23 pts on average (depending on which paper you look at)

  • @swallowsnestonacctvcameraw2140
    @swallowsnestonacctvcameraw2140 2 місяці тому +1

    I think you nail it with the phrase "textual intelligence" - LLMs have a super-human take on the text they are trained on and operate within that text world, but have no knowledge of the real world. Still, though, it's super interesting exploring the real world value of textual intelligence and at the same time it's becoming easier to critique the 'reasoning' claims.

  • @ivoryas1696
    @ivoryas1696 2 місяці тому +1

    6:10
    Dang, that was a *_good_* example. I wonder what stumps the other few...

  • @JohnLewis-old
    @JohnLewis-old 2 місяці тому +2

    I have missed your content. Thanks for making something for the masses.

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +1

      Thank you John, had Covid! Much more to come!

    • @JohnLewis-old
      @JohnLewis-old 2 місяці тому

      @@aiexplained-official I'm super excited by your content! I think you're the best creator at the moment in this space.

  • @user-d8h3w
    @user-d8h3w 2 місяці тому +2

    8:39 That's the point, it's predicting the article not the real world

  • @sorakagodess
    @sorakagodess 2 місяці тому +1

    I am really glad you made this video. I think you made the best example of how "blind" the models are that I have ever seen: every question you asked was put in a way that would instruct the reader to imagine the scene and try to simulate or predict what would happen, but the AI couldn't, since it isn't that capable. Thank you for the time and effort you put into this video; it was really eye-opening and cleared up a lot of random bad questions and wrong certainties that other YouTubers have put in my mind.

  • @jeanchindeko5477
    @jeanchindeko5477 2 місяці тому +2

    As usual, thanks for the video and all the hard work

  • @godspeed133
    @godspeed133 2 місяці тому +2

    Anthropic and Microsoft AI execs are saying there is still a lot of road left to run with scale on LLMs. Makes me wonder if they think increased token count will somehow make them "smarter", and that the "subroutines" they retrieve when you ask them a question will get more and more granular (and therefore more axiomatic, and of more general applicability, allowing more dynamic answers grounded more strongly on first principles and logic and less on recall of "template"/"memorised" answers) as they are trained on better and more multimodal datasets, in a smarter order, such that in the limit you do get something that reaches human intelligence.

  • @PJWey
    @PJWey 2 місяці тому +2

    Benchmark score improvements remind me of the golden age of the CPU wars between Intel and AMD in the late 90s: benchmarking became a hill to die on, but eventually meant little

  • @Bens963
    @Bens963 2 місяці тому

    Can't get enough of this channel, stepping up with rational analysis and not playing favorites. No hype, just analysis

  • @dumbahhperson
    @dumbahhperson 2 місяці тому +1

    I definitely agree that Claude 3.5 is an amazing model. It is really smart; especially when you tell it to act like a person, it can understand the character and wears the personality well. I'd say, for a model trained purely on text, it is the closest to "reasoning"

  • @ryzikx
    @ryzikx 2 місяці тому +1

    I've been hearing this for a while now, about how large language models do not have spatial intelligence and therefore cannot be AGI. What about people with aphantasia? These people cannot visualize anything. I suppose the case has to be made that humans are not generally intelligent.

  • @TesserId
    @TesserId 2 місяці тому +2

    It's been silly how many products now claim to use AI. And, oddly, CrowdStrike is now one of those products, when in the past the term big data analytics would likely have been used.

    • @Shrouded_reaper
      @Shrouded_reaper 2 місяці тому +1

      Put "AI" your product description to make stock value go up!

  • @rickandelon9374
    @rickandelon9374 2 місяці тому +2

    Glad to see a new video from you 😊

  • @Bluelagoonstudios
    @Bluelagoonstudios 2 місяці тому +1

    Last month I tried to get GPT-4o to solve a problem for the design of a rather complicated electronic project. Besides the latency, I noticed that the GPT model works from top to bottom. I had put the text neatly into text blocks in the right order (to make it a bit easier for the model), but it did not look at the text blocks as a whole, with the result that things were skipped. A total fail. A lot of work is still needed.

  • @karlwest437
    @karlwest437 2 місяці тому +1

    I was just starting to wonder when the next ai explained video would drop, and here it is! 😃

  • @MrSchweppes
    @MrSchweppes 2 місяці тому +1

    I would love to see your in-depth video about Claude 3.5 Sonnet. In Russian and Hebrew, Claude 3.5 Sonnet shows miracles of reasoning and understanding. I wonder what amazing things it can do in English!

  • @jossefyoucef4977
    @jossefyoucef4977 2 місяці тому +7

    If they keep making o versions and mini versions it's gonna be 40 models in no time 😂

    • @tbird81
      @tbird81 2 місяці тому +3

      All cheaper and cheaper to run, and worse and worse.

    • @ShawnFumo
      @ShawnFumo 2 місяці тому

      @@tbird81 Well, worse than the one right before perhaps, but I'm sure 4o is much better than 3.5 in most ways

  • @Wheezy_calyx
    @Wheezy_calyx 2 місяці тому +28

    I think the interesting part about the "chicken nugget" example is that these types of questions even stump humans. This might be why LLMs struggle with them.

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +18

      So far everyone I asked got it, but no model (benchmark has a slightly modified version).

    • @Wheezy_calyx
      @Wheezy_calyx 2 місяці тому +5

      Like you said, people lie and share their mistakes in written text, but learning based on the real world doesn’t lie. It would seem as long as we build these models to work like “humans”, we will find they come with the same deficits and blind spots that we have.

    • @Wheezy_calyx
      @Wheezy_calyx 2 місяці тому +10

      @@aiexplained-official That question isn't hard, but there are many "riddle" questions that commonly stump humans. And these models are closer to young children in their "intelligence". I think it's showing blind spots in human intelligence, which are being transferred to the models.

    • @Wheezy_calyx
      @Wheezy_calyx 2 місяці тому

      This is a good one that I think of; it does sometimes stump people into saying 50: ua-cam.com/users/shortsyRLjpFv5MQ0?si=L9avirBXt_4990oj

    • @ollyfoxcam
      @ollyfoxcam 2 місяці тому +1

      @@Wheezy_calyxbut they don’t work like humans at all

  • @keeganpenney169
    @keeganpenney169 2 місяці тому +2

    I just tried your nugget scenario in mini. It was my first input with it, but it actually gave the opposite answer to the one I see here. However, I just quickly used GPT on my phone with 4o mini, not the Playground. Here's a copy-paste:
    I have a math problem for you to solve.
    John wants to buy some chicken nuggets, the chicken nuggets come in 3 sizes, small, medium and large. The small comes with 5 nuggets, the medium has 6 nuggets, and the large has 8 nuggets. John wants 40 chicken nuggets but he has only 2 dollars in his pocket. Each nugget costs 1 dollar. John also has no form of payment, and John's currently in a coma. How many chicken nuggets can't John buy?
    ChatGPT:
    Since John is in a coma and cannot physically buy any chicken nuggets, the number of chicken nuggets he cannot buy is the total number he wants, which is 40 nuggets.
    Therefore, John cannot buy **40 chicken nuggets**.

  • @codingwithkurt
    @codingwithkurt 2 місяці тому +1

    I feel like these default zero-shot responses are like a gut reaction vs a thoughtful response. Using "think it out step by step" might be more appropriate for complex, highly detailed prompts, no?
    As always, I loved the video. Great work!

    • @aiexplained-official
      @aiexplained-official  2 місяці тому

      Will try that, but it likely will not make a major difference as it's a fundamental issue of not grokking physics

  • @stephenrodwell
    @stephenrodwell 2 місяці тому +1

    Thanks! Great content, as always. 🙏🏼

  • @pranavk6788
    @pranavk6788 2 місяці тому +6

    "Are Emergent Abilities of Large Language Models a Mirage?" By Stanford University in NeurIPS, a must read for all people over hyping. Basically the evaluations of these models are not continuous. They are marked either a 0 or 1 for wrong or correct answers, but if their evalution is done on a continuous scale, the emergent behaviour vanishes and things become linear. Copy paste in claude and tell it to explain with example if you find it hard to read through research papers.

    • @sebzim4500
      @sebzim4500 2 місяці тому +7

      I don't think anyone in the field was very impressed with that paper.
      It is trivially true that you will always be able to find a metric which improves continuously since the model weights improve continuously. It says nothing about emergence.

    • @TheRealUsername
      @TheRealUsername 2 місяці тому +1

      The biggest problem is that you don't have control over what abstract and complex patterns the model learns, nor how they affect its performance.

    • @GabrielVeda
      @GabrielVeda 2 місяці тому

      @@sebzim4500Bingo
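
The paper's central point, that "emergence" can look like an artefact of discontinuous scoring, is easy to illustrate with a toy example. The outputs below are invented for illustration; real evaluations would use each benchmark's own grader.

```python
# Toy illustration: the same sequence of outputs "jumps" under exact-match (0/1)
# scoring but improves smoothly under a continuous similarity score.
from difflib import SequenceMatcher

target = "364"
# Pretend answers from progressively larger models on one arithmetic question.
outputs_by_model_size = ["31", "360", "363", "364"]

def exact_match(pred: str, gold: str) -> float:
    return 1.0 if pred.strip() == gold else 0.0

def continuous_score(pred: str, gold: str) -> float:
    # Any smooth per-character similarity works for the illustration.
    return SequenceMatcher(None, pred.strip(), gold).ratio()

for pred in outputs_by_model_size:
    print(pred, exact_match(pred, target), round(continuous_score(pred, target), 2))
# Exact match stays at 0.0 until the largest model, then jumps to 1.0 ("emergence");
# the continuous score rises across the same outputs instead of jumping.
```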

  • @nuigulumarZ
    @nuigulumarZ 2 місяці тому +1

    LLMs build a model of their world in training, and that model includes us - the sort of questions we ask, the sort of answers we like to receive, that we sometimes ask questions in an unclear way, including extraneous information or contradictions. Phrasing a question in a way that appears to solicit a particular type of answer but including information that can't produce that type of answer may just register as "this human is bad at asking questions, I should answer the implied question". An LLM's experience of the world is very different to a human who spends more of their life goofing around with friends than maths problems, so it's not surprising their world model is not completely congruent with ours.

    • @aiexplained-official
      @aiexplained-official  2 місяці тому

      Fair point, thank you

    • @lucnotenboom8370
      @lucnotenboom8370 2 місяці тому

      @@nuigulumarZ excellent point. It's also one of my issues with the style of conversation they're trained to have. Humans can be confused, and ask questions to make something clear. AI doesn't really seem to be able to be confused? And therefore also doesn't really ask questions to improve its answer before it gives one.

  • @endoflevelboss
    @endoflevelboss 2 місяці тому

    Ilya Sutskever said recently that he believes it doesn't matter that LLMs may operate in the text domain exclusively. Importantly, this doesn't *exclude* them from understanding, for instance, 3D space; it just makes it harder.

  • @jamesyoungerdds7901
    @jamesyoungerdds7901 2 місяці тому +1

    Another great video, Philip - thank you so much, proud Patreon supporter 🍅. I still can't help but think of agents - could the LLM get things predictably more correct for those tomato plate questions if you have a video generative model create a video of the scenario, then have another model interpret what it sees happening? I know that's not efficient, but I keep going back to breaking things down into the steps of what multimodal models are currently capable of (with scale, cost, compute and speed).

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +1

      Really interesting idea, I wanted to try that with smartgpt2. And thank you for your continued support.

    • @ShawnFumo
      @ShawnFumo 2 місяці тому

      Check out RFM-1 from Covariant. This is a robotics foundation model from them (for use with a robotic arm that does picking) and it does things like visualize the result of its action before doing it, etc. It seems pretty impressive.

  • @SenojTheDestroyer
    @SenojTheDestroyer 2 місяці тому +8

    It doesn't matter if the model is only available via SaaS (Software as a Service).

    • @biocykle
      @biocykle 2 місяці тому +1

      What doesn't matter?

    • @TragicGFuel
      @TragicGFuel 2 місяці тому

      @@biocykle the model will get worse if one company gains the major share.
      That's what he's hinting at

  • @OZtwo
    @OZtwo 2 місяці тому +1

    Thank you for yet another great video.

  • @mintakan003
    @mintakan003 2 місяці тому +1

    I also listen to AI podcasts with a focus on business and education. Mostly, LLMs are largely talked about as "productivity tools". It's another form of "search". It's great for quickly pulling stuff together and presenting it in a nice format. Great as a brainstorming tool, to get a preliminary sense of the issues. Great for getting started on a project. But it's not a substitute for one's own critical thinking skills (as with any information resource on the internet), or for deeper research with human-authored primary source materials. One still has to take responsibility for the final product.
    But I would not trust it for "reasoning" skills. I think increasingly, consumers will get more sophisticated about how these tools are best used, and take them with a grain of salt.

  • @timeflex
    @timeflex 2 місяці тому

    I suggest (again) combining a vision-enabled LLM with an image generation network in the following way (see the sketch below):
    1. The original text prompt is submitted to the image generation part.
    2. The resulting images (multiple) are attached to the original text prompt and then sent together to the LLM to process.
    This way the LLM will have some (minimal) "imagination".
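
A minimal sketch of that two-step pipeline, assuming the OpenAI Python SDK; the model names, the single-image simplification, and the prompt wording are assumptions rather than a tested recipe.

```python
# Sketch of the proposed pipeline: render the scenario as an image first, then send
# the original prompt plus the rendered image to a vision-capable model.
# Assumes the OpenAI Python SDK and an API key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

def answer_with_imagination(prompt: str) -> str:
    # Step 1: submit the text prompt to the image-generation model.
    image = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
    image_url = image.data[0].url

    # Step 2: attach the generated image to the original prompt for the LLM.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```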

  • @Rotellian
    @Rotellian 2 місяці тому +2

    The fact that they can't come up with a better name/nomenclature does not inspire me with confidence that they can be trusted to develop AGI

  • @dadadadada17
    @dadadadada17 2 місяці тому

    I've tried out Hume AI; I must say its audio modality is really impressive, you can really talk in a natural way with it.

  • @harshstudy488
    @harshstudy488 2 місяці тому +2

    Have you got any plans for making a video about your thoughts on the future of AI and UBI and all? Btw, you are awesome as always 👍🏻

    • @aiexplained-official
      @aiexplained-official  2 місяці тому

      The future of AI I try to cover every video but UBI yes I could! Thank you so much Harsh

  • @rantmarket
    @rantmarket 2 місяці тому +1

    "The real world is immensely messy."
    Uhhh.. yeah. Flatly-understated declaration rating = 98.1%

  • @RichardHarbridge
    @RichardHarbridge 2 місяці тому +3

    The prompt you used with the balancing vegetables on a plate example missed the word "plate". It has an impact and should be corrected in your benchmark test.

    • @RichardHarbridge
      @RichardHarbridge 2 місяці тому +1

      It says "the three times". Which is the crucial insight in the question. :)

    • @RichardHarbridge
      @RichardHarbridge 2 місяці тому

      Also if you read this I love your content and you are fantastic!
      Just figured as my absolute favorite benchmark checker it made me laugh when I read the text and saw the error. I am sure it's probably the only one too.

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +2

      Thank you! Before it ever becomes an official one, all Qs will be checked again, not only by me but by independent PhDs. Not just for typos but even for obscure physical possibilities

    • @RichardHarbridge
      @RichardHarbridge 2 місяці тому

      @@aiexplained-official I can't wait and can't think of a better person to lead yet another awesome contribution to benchmarking!

  • @ttul
    @ttul 2 місяці тому +1

    You’re the master of trolling LLMs. Bravo.

  • @kvinkn588
    @kvinkn588 2 місяці тому +1

    I work at a small local ISP in Germany and even though German internet as a whole is atrocious, we did not even realise there was an outage till we read the news. Not relying on Windows was a lucky choice I guess. (I mean that; seeing how many backdoors and whatnot get discovered with Linux/Unix systems and Mac all the time, it could have happened there just as easily.)

    • @StefanReich
      @StefanReich 2 місяці тому

      Why would you say German internet is atrocious? I'm online 24/7 with superb speeds. Standard Vodafone through copper wire

    • @kvinkn588
      @kvinkn588 2 місяці тому

      @@StefanReich Oh, I do have great Internet too, that is my home connection. Mobile is sometimes fast, sometimes slow in the city with Vodafone for me. I do have friends who live farther out who still can't even get 50 Mbit/s for their home connection though. It has been steadily getting better over the last 10 years, yeah. Many other countries were, 10 years ago, already farther along than we are right now, though, so it depends on how you look at it I suppose

  • @williamjmccartan8879
    @williamjmccartan8879 2 місяці тому

    Nice to hear from you Phillip, cut down the name to GPT mini, and thank you for sharing your time and work sir, have a great day, peace

  • @ReneSanchez6502
    @ReneSanchez6502 2 місяці тому +2

    seriously what an excellent video

  • @chrisanderson7820
    @chrisanderson7820 2 місяці тому

    We don't even define what "reasoning" is. Reasoning isn't simply extremely complex knowledge sorting and pattern analysis, it includes this but it is more than this. It sort of relates to that problem with LLMs that EVERY answer is a hallucination, even the ones where it gets things correct. It might have a VERY complex relational model of all its information held but that falls short of reasoning (even though it can seem like it a lot of the time).

  • @Ed-sf02
    @Ed-sf02 2 місяці тому +1

    Great as always, and thanks for pointing out the limitations of LLMs. On the output size of GPT-4o mini (or any other), have you actually tried to make it generate the full 16k tokens? Very hard to generate more than a couple of pages in my experience. The "laziness" of models when asked to perform long or repetitive tasks is another annoying limitation of LLMs.

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +1

      I have noticed indeed, there must be an easier hack to get it to do so

  • @jstr-x
    @jstr-x 2 місяці тому +1

    how long did that biplane crash edit take 🤣 great vid as always

  • @pythagoran
    @pythagoran 2 місяці тому +11

    I don't understand why people keep looking for reasoning in these models. It's like testing them for consciousness. Any positive results will forever be illusory or coincidental.

    • @igorbessa3563
      @igorbessa3563 2 місяці тому +1

      Why forever?

    • @pythagoran
      @pythagoran 2 місяці тому

      @@igorbessa3563 for as long as they are based on a stochastic architecture, they cannot produce precision, which is required for reasoning. For as long as they require more data, not less, they cannot perform inductive reasoning, which is required for problem solving. LLMs will *forever* be handicapped in this domain.

    • @dgtl-dtlng
      @dgtl-dtlng 2 місяці тому

      @@igorbessa3563 For as long as they are based on a stochastic architecture, they cannot produce precision, which is required for all reasoning. For as long as they have insatiable data requirements, they cannot produce inductive reasoning, which is required for problem solving. LLMs will *forever* be handicapped in this domain.

    • @LustyLichKing
      @LustyLichKing 2 місяці тому +3

      It's a very high ceiling so you can continue to use it as a benchmark to evaluate relative performance for a long time. Lends consistency in data analysis.

  • @Shlooomth
    @Shlooomth 2 місяці тому +1

    The flaw with tests like this is that no one actually asks questions like this. “Hey what’s 1+1, but also, what if numbers weren’t real?”

  • @AZisk
    @AZisk 2 місяці тому +2

    I’m here to check if he “read the whole thing” again

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +1

      I did but papers were later on in this vid...!

    • @AZisk
      @AZisk 2 місяці тому

      @@aiexplained-official We appreciate the depth you go into.

  • @Ecthelion3918
    @Ecthelion3918 2 місяці тому +1

    Was hoping you would cover it :)

  • @sjkba
    @sjkba 2 місяці тому +1

    So we are somebody's simulation like at 15:20 :D

  • @geoffcunningham6823
    @geoffcunningham6823 2 місяці тому +1

    I think it's remarkable that we have to drop in gotchas and tricks to get the models to mess up, even on very complex issues. Just think how far we have come, and indeed lots of humans would miss the gotcha as well.

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +3

      I get your point, but the further out you zoom (into reality vs text) the less it becomes gotchas and more and more just it can't answer at all. This is not about LLMs being doomed, it's about what would need to happen next.

  • @fyruzone
    @fyruzone 2 місяці тому +1

    Man, you have to make a video about the LMSYS Arena benchmarks. OpenAI is clearly optimizing the text format of the models' outputs in order to rank artificially higher in the benchmark. And what makes it more obvious? The latest GPT-4 Turbo is supposedly THE SAME as GPT-4o mini in this benchmark. Something is fishy for sure

  • @ZeerakImran
    @ZeerakImran 2 місяці тому +1

    Gemini isn't great but I do question your motives with this question 12:39. The question and its description make no sense and so any output would be fine. Tell that same question to a human and watch them stare back at you in response. Even if a human gives you the answer to that question which satisfies you, that same answer might not satisfy another. And we can't say one is better because the best response to that question is no response. You can't require it to accept that the question is a trick question with its not about vegetables while also not making any sense in the rest of the question. At that point, the whole message is a trick and one statement can't be claimed as the truth over another statement. Unless clarified ahead of time. That message may make sense to you because you read it wrong. You placed a gap between "upside-down" and "several times" when there isn't one in the text. There are also plenty of commas being used in the technical question, so you have to assume that the person knows how to write, and if you assume that, then the plate has to be turned upside-down multiple times (without there being a gap due to the missing comma). So what does turning upside-down multiple times in a trick questions context mean? It seems obvious but it can't mean that due to the deliberate exclusion of a comma there. In a technical question where every word matters, this question makes no sense and if it appeared on a test for humans, good luck convincing any of them that there answer is unacceptable. Good work though. Not a hater. Just saying that this particular message was useless and led towards the wrong conclusion. Even saying "completely" upside-down doesn't make that much sense to be honest. It makes sense to us because we know what we're trying to say. But the other person doesn't and he can't assume what it means to make you happy. Because if he does that, then he'll be a crappy programmer and a bad listener since he'll be ignoring explicit instructions from me. Since he'll be assuming I actually didn't mean what I explicitly defined and stated. Specially in a technical question where details matter and are being tested. You've even said it's a trick question. The ai's answer wasn't great but a human's one would have been worse and a lot less kind. It's just like with texting. You don't know what the other person's reaction was or their tone of voice when writing. The majority of the information is lost right there so we have to be very careful with our words.

  • @jaysonp9426
    @jaysonp9426 2 місяці тому

    GPT4o mini is one of the biggest releases of the past year. Definitely feels like a test for GPT 5.

  • @abdelhakkhalil7684
    @abdelhakkhalil7684 2 місяці тому +2

    And you are absolutely right that training a model to do a lot of math can hinder other capabilities. For instance, I tested Gemini-1.5-pro vs Gemini-1.5-flash, and the pro wrongly chose answer B and stated that the information about Philip being in a coma is irrelevant.

    • @chrisreed5463
      @chrisreed5463 2 місяці тому

      My call on GPT-4o just treated it as a math problem. But as I value its mathematics capability, that bothers me little. I think the trick is to find the model that works best for your use case; for me, overall, that's GPT-4.

    • @abdelhakkhalil7684
      @abdelhakkhalil7684 2 місяці тому +1

      @@chrisreed5463 You're absolutely correct! If you follow up with a second prompt asking GPT-4o to identify the most important piece of information, it catches it and gives you the correct answer.

  • @cyanophage4351
    @cyanophage4351 2 місяці тому +1

    It would be cool to hear what you think about the latest models out of Mistral. One of their new models isn't even a transformer

  • @jit-r5b
    @jit-r5b 2 місяці тому +1

    So, the way I hear it, those medical exams are ridiculous. If the model cannot figure out that flipping a plate upside down will cause everything to fall off, how could it possibly take care of a human being? It really is worrying if certain fields use LLMs to replace humans.
    That is not to say I don't understand how they work and why this happens; rather, it is really concerning, and there's a lot of bullshit in the marketing.
    That's why Philip's benchmarks would do so much good. All those screaming that AGI has arrived would realize it's not even a light at the end of the tunnel haha

  • @coolsai
    @coolsai 2 місяці тому +3

    What are your thoughts on software jobs this year, because I think AI progress is saturated now? And what's your take on the capabilities of GPT-5?
    Thanks for the new video!

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +5

      I think that now is still an amazing time to experiment, create impact and get hired, even in this downturn. Knowing the flaws of AI is just as important as the strengths, especially in companies that deploy it.

    • @coolsai
      @coolsai 2 місяці тому +3

      @@aiexplained-official Thank you, I always watch your videos to stay updated in AI space

  • @DrPhilby
    @DrPhilby 2 місяці тому +1

    The more I listen to you about AI, the more I appreciate biological brains and the "software" inside ❤

  • @khonsu0273
    @khonsu0273 2 місяці тому

    You said it man; 'The newest model from OpenAI is here' 😉

  • @Daralima.
    @Daralima. 2 місяці тому

    The Chicken Nugget problem has pretty amusing variety between the models - I've tested it on a lot of them, and it's a very mixed bunch.
    GPT-4o, the full-sized model, gets the problem wrong at least the vast majority of the time, much like 4o mini.
    GPT-4-1106 seems to do better than 4o, but sometimes gets it wrong.
    Gemini 1.5 Pro can give a number of answers, sometimes correct, sometimes bizarrely incorrect, like stating that none of the numbers divide 40 evenly.
    Some really small models, such as Phi-3 mini and medium, get it right.
    It's fun to see the variety between models where on the surface some are expected to perform better than others.
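    The exact wording of the problem isn't quoted here, but assuming it's the classic "can you buy exactly 40 nuggets from packs of 6, 9 and 20" formulation, a brute-force check is only a few lines; the pack sizes below are assumptions for illustration, not the video's actual numbers.

      # Brute-force check: which pack combinations total exactly 40 nuggets?
      from itertools import product

      PACKS = (6, 9, 20)   # assumed pack sizes; the original puzzle may differ
      TARGET = 40

      solutions = []
      for counts in product(*(range(TARGET // p + 1) for p in PACKS)):
          if sum(c * p for c, p in zip(counts, PACKS)) == TARGET:
              solutions.append(dict(zip(PACKS, counts)))

      print(solutions)   # [{6: 0, 9: 0, 20: 2}] -> two 20-packs, under these assumptions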

  • @coder13
    @coder13 2 місяці тому +12

    "GPT omni" without omni capabilities.... honestly, totally on brand for OpenAI which is no longer open.

    • @berkertaskiran
      @berkertaskiran 2 місяці тому +1

      GPT notomni mini, by ClosedAI.
      (Fine print: GPT is a trademark of ClosedAI even though it is a general term)

  • @설리-o2w
    @설리-o2w 2 місяці тому +2

    Been a while glad to see you back

  • @UserHuge
    @UserHuge 2 місяці тому +1

    That's some adversarial trolling I like!

  • @Ianstudent
    @Ianstudent 2 місяці тому

    I know most people may not care about this kind of model. I'd actually love a GPT-4 mini just for text that's cheap. There are so many NLP automations just waiting to be made cost-effective at scale that simply can't be yet. GPT-4o mini now makes it plausible to consider, for example, doing a single pass of NER over 3TB of arXiv papers. Still not completely cheap, still needs multiple passes, still needs too many output tokens, but this is starting to look feasible.
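    For a rough sense of whether that pencils out, here is a back-of-the-envelope sketch; every number in it (corpus size, tokens per paper, per-token prices) is an illustrative assumption rather than a quoted figure.

      # Back-of-the-envelope cost estimate for one NER pass over a large paper corpus.
      # All figures below are illustrative assumptions, not quoted prices.
      NUM_PAPERS = 2_400_000        # assumed number of papers
      AVG_INPUT_TOKENS = 10_000     # assumed tokens per paper fed to the model
      AVG_OUTPUT_TOKENS = 500       # assumed tokens of extracted entities per paper
      INPUT_PRICE_PER_M = 0.15      # assumed $ per 1M input tokens for a small model
      OUTPUT_PRICE_PER_M = 0.60     # assumed $ per 1M output tokens

      input_cost = NUM_PAPERS * AVG_INPUT_TOKENS / 1e6 * INPUT_PRICE_PER_M
      output_cost = NUM_PAPERS * AVG_OUTPUT_TOKENS / 1e6 * OUTPUT_PRICE_PER_M
      print(f"~${input_cost + output_cost:,.0f} for one pass")   # ~$4,320 under these assumptions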

  • @randomuser5237
    @randomuser5237 2 місяці тому

    This model is supposed to replace GPT-3.5 Turbo, and there is absolutely no doubt that it's much, much better and cheaper than that. That's all there is to it. Even Sam's tweet emphasizes only that. I don't know why people try to extrapolate so much from every OpenAI release.

  • @MatthewKelley-mq4ce
    @MatthewKelley-mq4ce 2 місяці тому

    One thing I'll mention is that telling the model "it's not this" or "it's that" can sometimes cause the phenomenon (I forget the name) where you have something pointed out to you that you couldn't see, or you were asked not to notice something, and then you find it extremely difficult *not* to focus on that detail.

  • @jackfarris3670
    @jackfarris3670 2 місяці тому +1

    6k views in 2 hours? Microsoft doing you dirty

  • @firesnakearies
    @firesnakearies 2 місяці тому

    I don't think zero is the most likely answer for the plate of vegetables. We have to infer based on Philip's behavior that something is inexplicably sticking to that plate no matter what. There's no reason to think that he dislodges it, if he's literally looking for roots on the other side. If he's looking for roots specifically, it's most likely the cabbage that's sticking, because it would be the one that has direct roots coming down from the vegetable. Tomato is a fruit, potato is itself the root, but a cabbage could theoretically have roots on the other side of the plate. Philip would not have shaken it multiple times and inspected it for roots multiple times unless the cabbage was extremely fixed in place, as though literally rooted. So I say the most likely answer is "one", that one being the cabbage.

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +1

      If it gave this reasoning I would give it more credence. But even if you remove aspects of Philip's behavior, it says the same. It simply does not compute or simulate the physics involved.

  • @matthewpublikum3114
    @matthewpublikum3114 2 місяці тому

    The problem with depending on API models is that it's difficult to tell whether the added intelligence is contingent on a pipeline of processes or comes from a single intelligent model.

  • @youMEtubeUK
    @youMEtubeUK 2 місяці тому

    00:02 GPT-4o Mini offers high intelligence at low cost compared with other models
    01:59 GPT-4o Mini offers limited advances in artificial intelligence
    03:54 Challenges in GPT-4o Mini's textual and reasoning intelligence
    05:48 AI models can improve mathematically but may not be universally better
    07:42 Advancement in reasoning demonstrated by GPT-4o Mini's high SC score.
    09:40 Physical intelligence remains a challenge for large language models.
    11:32 GPT-4o Mini experiences discrepancies in performance
    13:24 The GPT-4o Mini model and falling objects.
    15:12 GPT-4o Mini has the potential to provide more informed and data-driven responses in the future.
    16:53 GPT-4o Mini can make hallucination errors in text
    18:28 Challenges faced as a Microsoft service agent
    20:11 Preference for the Claude 3.5 Sonnet model

  • @minuspi8372
    @minuspi8372 2 місяці тому

    The most surprising failure I've seen in even state-of-the-art AIs is asking them "Which is bigger: 9.11 or 9.9?" Every AI I've tried insists that 9.11 is bigger. Even when I tell it to use code, it thinks the result is because of a rounding error in Python or something like that. The only way I've found to get the right answer is to tell them to begin with "Let's think step by step".
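    For what it's worth, the "rounding error" excuse doesn't hold up: plain float comparison in Python answers this directly, as a quick sketch shows.

      # Plain comparison settles it; no floating-point subtlety is involved at this precision.
      a, b = 9.11, 9.9
      print(a > b)        # False -> 9.11 is not bigger than 9.9
      print(max(a, b))    # 9.9

      # Exact decimal arithmetic agrees, so it isn't a rounding artifact.
      from decimal import Decimal
      print(Decimal("9.11") > Decimal("9.9"))   # False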

  • @ClySuva
    @ClySuva 2 місяці тому

    Although, I do feel like some of those errors are prioritization issues. To understand people's fuzzy questions and give them a meaningful answer, you often have to ignore parts of them. Randomly mentioning a gunshot wound in an otherwise standard question seems unreasonable and probably a mistake. In this case it probably should not be ignored. But in other cases, when a user makes a typo, repeats some words, has accidentally copy-pasted too much from some site, and so on, it is a perfectly good idea to ignore some of the input and base the answer on the part of the text which does make sense.
    Finding the balance between those things can be a bit complicated without deeper understanding, and making a model good at these types of benchmarks might in turn make it very picky about the language you use.
    I tested a sentence: "Sally went out stayed in and met with Bob in the bar. What should they do?"
    A human reader can clearly tell that "stayed in" was probably added accidentally, as it doesn't make sense in context. So should an AI model focus on that part? Some models will, while others ignore it completely.

  • @Hiroprotagonist253
    @Hiroprotagonist253 2 місяці тому +3

    OpenAI hype train breaking on the rocks of reality

  • @etunimenisukunimeni1302
    @etunimenisukunimeni1302 2 місяці тому +1

    Those... ermm, "trick questions" are awesome. Makes me want to try similar tests in Finnish, just to see if anything's different. These are, after all, _language_ models, and are somewhat bounded by the language of the training data they've consumed. Surely the Finnish portion of that is wildly different from the bulk of English in it.

    • @aiexplained-official
      @aiexplained-official  2 місяці тому +2

      Thanks Etunimeni, I have a whole benchmark of them! Any sponsor or business who wants to know more, do get in touch! And good luck with yours!

    • @etunimenisukunimeni1302
      @etunimenisukunimeni1302 2 місяці тому

      After a quick test, GPT-4o seems to get at least the vegetable question exactly right in Finnish. I tried to translate the question with its twists and turns to Finnish as literally as I could, and GPT-4o answered D) 0 with an explanation that is spot on (I'm paraphrasing):
      "This question tries to trick you. The most important part is how Philip turns the plate upside down. It's very unlikely that any vegetables remain on the plate after that."

    • @aiexplained-official
      @aiexplained-official  2 місяці тому

      4o is better than Mini, you have to be more subtle in your language

    • @etunimenisukunimeni1302
      @etunimenisukunimeni1302 2 місяці тому

      @aiexplained-official Good point, I was apparently asleep through the portion of the video covering GPT-4o Mini 😬😅

  • @DreckbobBratpfanne
    @DreckbobBratpfanne 2 місяці тому

    You can get a custom GPT to solve some of these logic issues. I used this instruction to solve the medical example correctly (also the freezing computer; 4o-mini was a bit off there, but at least it acknowledged it):
    You are a reasoning machine. When you receive a text, analyze it sentence by sentence in order. You will reply in multiple ways:
    "[Internal]:" is your breakdown of the received text step by step. Don't gloss over things unless it's sensible. Print them out step by step.
    For each step, "[Reason]:" is your current thoughts on what answer you should give. Update this if you discover something novel in [Internal]. For example, if "Brady did x" at the start, then "actually Brady did y", your answer here should be updated accordingly to the actual truth, so you are not fooled by earlier info. Do not discard any information; if you get a piece of information, treat it as correct unless another part of the text refutes it. For example, "Jeff is an omnigod" can be true, unless afterwards the text says "Jeff is actually not an omnigod". Nothing in the text should be seen as an error unless later falsified.
    Then, put out "[Trial]:". This is the answer you would give now.
    Finally, put out "[Answer]:" with your true final answer after you have checked the answer you gave in [Trial] against each piece of previous info and each single reasoning step, to be absolutely safe. Don't assume anything you got is a joke, hyperbole or whatever; it doesn't even have to be realistic; reason within the given text, no. matter. what! Never, under any circumstance, disregard any information. Check every step you did before, and then compare it to [Trial] and each [Internal] step and reason; even if the final answer seems odd, unrealistic or whatever, just use the given text as absolute, total ground truth.
    When given an input that consists only of a (debug info:) message, it is an internal message telling you to rethink your answer; redo the analysis accordingly in that case.
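    For anyone who wants to try the same instruction outside the custom-GPT UI, here is a minimal sketch using the OpenAI Python client, with the text above as a system prompt. The model name and prompt come from the comment; everything else (the abbreviated prompt string, the placeholder user question) is illustrative, not a definitive recipe.

      # Minimal sketch: reuse the reasoning instruction above as a system prompt
      # via the OpenAI Python client instead of a custom GPT.
      from openai import OpenAI

      REASONING_PROMPT = (
          "You are a reasoning machine. When you receive a text, analyze it sentence by "
          "sentence in order, replying with [Internal], [Reason], [Trial] and [Answer] "
          "sections as described above."
      )  # abbreviated here; paste the full instruction in practice

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      response = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[
              {"role": "system", "content": REASONING_PROMPT},
              {"role": "user", "content": "<paste the trick question here>"},
          ],
      )
      print(response.choices[0].message.content)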

  • @AshishDha
    @AshishDha 2 місяці тому +1

    great stuff! thanks :)