New “Liquid” Model - Benchmarks Are Useless

  • Published 18 Oct 2024

COMMENTS • 272

  • @matthew_berman
    @matthew_berman  3 дні тому +28

    Why are non-transformer models performing so poorly?

    • @PeterSkuta
      @PeterSkuta 3 дні тому +7

      @matthew_berman Because they don't have the necessary training compared to transformers, where training can be achieved; but as we know it's just training and not learning, and that gives transformers a big fail once my AICHILD goes live, because my AICHILD learns from the start. I have red-teamed Liquid, and its smartness is that of a 3-year-old; other AI models are at most a 5-year-old, no more, no matter how many smarts they add. Still a 5-year-old, and that can be used to advantage.

    • @southVpaw
      @southVpaw 3 дні тому +9

      The same reason why 7 or 8Bs tend to outperform 13Bs: developer attention. There's been far more research and development around transformers at preferred sizes.
      (Yes, there are 13Bs that outperform 7Bs, I know this; but typically ~7Bs catch up so much faster because of consumer and developer attention).
      Liquid has a neat architecture, but it's the definition of novel for now. Until they make one that pulls our attention away from Llama or Qwen, it's just gonna be "neat".

    • @isthismarvin
      @isthismarvin 3 дні тому +8

      Liquid transformers face several challenges compared to regular transformers. They’re harder to train, need more computational power, and aren’t as optimized yet. Their complex structure often leads to lower stability and slower performance, which is why they currently lag behind in effectiveness

    • @Xyzcba4
      @Xyzcba4 3 дні тому +1

      @@PeterSkuta I sadly realized by experimenting with AI tavern chatbots how dumb as nails they are. I now suspect this whole AI thing is a scam, because chatbots don't understand temporal reality, can't even get a cooking recipe right, make shit up at random, and the so-called training data must include details for every occasion or else they fail. So training data = programming.

    • @ayeco
      @ayeco 3 дні тому +4

      There were 10 words in your prompt, not its response. Semantic issue.

  • @j.m.5199
    @j.m.5199 3 дні тому +80

    it saves memory by not thinking

  • @DeepThinker193
    @DeepThinker193 3 дні тому +117

    I feel I should create my own crappy LLM and put up "benchmarks" beating every other model on paper. I'll then ask folks to invest millions on a contractual agreement and run away with the money somewhere they'll never find me.

    • @SeregaZinin
      @SeregaZinin 3 дні тому +3

      you won't escape from the planet, so they'll find you anyway ))

    • @Xyzcba4
      @Xyzcba4 3 дні тому +4

      Black Rock will find you

    • @amitjaiswal7017
      @amitjaiswal7017 3 дні тому +4

      It is a better idea to sell the company and make a profit 😊😅

    • @jakobpcoder
      @jakobpcoder 3 дні тому

      This sounds way too legit for some reason. Maybe because we have seen it so many times...

    • @hartmantexas5297
      @hartmantexas5297 3 дні тому +3

      Do it bro it seems to work

  • @OriginalRaveParty
    @OriginalRaveParty 2 дні тому +13

    "Benchmarks are useless".
    A statement I can get behind.

  • @johannesseikowsky8197
    @johannesseikowsky8197 3 дні тому +21

    I'd be curious how the model does on more "everyday" type of tasks like summarising a longer piece of text, translating something or extracting particular info out of larger text pieces. The type of stuff that people actually ask LLMs to do day-to-day ...

    • @niclas9625
      @niclas9625 2 дні тому +4

      You don't need to know the number of r's in strawberry on a daily basis? Preposterous!

    • @DimaZheludko
      @DimaZheludko 2 дні тому +4

      And how are you going to microwave your marbles if you won't know whether they fell out of the upside-down cup or not?

    • @mickelodiansurname9578
      @mickelodiansurname9578 День тому +2

      I concur... there are standard use cases you could apply... for example: "Here is some ground truth text... and here is a JSON file with errors in some of the text blocks... Use the ground truth text to replace the errors... and output the answer in valid JSON." Now that's an everyday thing for me.
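
A minimal sketch of that everyday JSON-correction workflow, assuming a hypothetical `call_llm()` client and made-up ground-truth and JSON inputs; the useful property is that the output format, at least, can be checked mechanically with `json.loads`.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder: swap in whichever chat/completions client you actually use."""
    raise NotImplementedError

def build_correction_prompt(ground_truth: str, broken_json: str) -> str:
    # Assemble the "fix the JSON against ground truth" prompt described in the comment above.
    return (
        "Here is some ground truth text:\n"
        f"{ground_truth}\n\n"
        "Here is a JSON file with errors in some of the text blocks:\n"
        f"{broken_json}\n\n"
        "Use the ground truth text to replace the errors, "
        "and output the answer as valid JSON only."
    )

def is_valid_json(model_output: str) -> bool:
    # The format is machine-checkable even if the content still needs review.
    try:
        json.loads(model_output)
        return True
    except json.JSONDecodeError:
        return False

# Hypothetical example inputs (not taken from the video):
prompt = build_correction_prompt(
    ground_truth="The launch happened on 18 October 2024.",
    broken_json='{"event": "launch", "date": "18 Octber 2024"}',
)
# response = call_llm(prompt)
# print(is_valid_json(response))
```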

  • @keithprice3369
    @keithprice3369 3 дні тому +23

    I'm confused. If the context is capped at 32k, why does the video show a chart of their performance at 1M? (See the token-budget sketch after this thread.)

    • @AlexK-xb4co
      @AlexK-xb4co 3 дні тому +1

      Yeah, that's a shady one. I also didn't quite get it.

    • @TripleOmega
      @TripleOmega 3 дні тому +1

      That's output length, not context window.

    • @keithprice3369
      @keithprice3369 3 дні тому +1

      ​@@TripleOmega I'm pretty sure context includes both input and output. Perplexity agrees with me. You have credible sources that dispute that?

    • @TripleOmega
      @TripleOmega 3 дні тому +2

      @@keithprice3369 The context window will include the previous outputs along with your inputs, but this just means that if the output is too large to fit within the context window you cannot continue the current conversation. It does not limit the output length to the size of the context window as far as I'm aware.

    • @keithprice3369
      @keithprice3369 3 дні тому +2

      @@TripleOmega That doesn't sound right. Have you ever heard of an LLM with a 32k context cap that ever output more than even 20k?
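
For what it's worth, the arithmetic both commenters are circling is straightforward: input and output share the same window, so the most a model can still emit is whatever budget the prompt has not already used. A rough sketch, using the 32k figure from the thread (numbers are illustrative only):

```python
# Rough token-budget arithmetic for a shared context window (illustrative numbers only).
CONTEXT_WINDOW = 32_000  # total tokens the model can attend to, per the thread above

def max_new_tokens(prompt_tokens: int, earlier_turns_tokens: int = 0) -> int:
    # The prompt, earlier conversation turns, and the new output all share one window,
    # so the output can never exceed what is left after the inputs are counted.
    return max(0, CONTEXT_WINDOW - prompt_tokens - earlier_turns_tokens)

print(max_new_tokens(prompt_tokens=2_000))   # 30000 tokens of headroom for the answer
print(max_new_tokens(prompt_tokens=31_500))  # 500 tokens: almost no room left
```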

  • @haroldhannon7253
    @haroldhannon7253 3 дні тому +4

    I will say that I have used it (the MoE 40B) successfully for doing summaries. Its strength across a long context is useful here. Normally, if I use something that will accept a larger context window and then try to do a summary without a chain-of-density multi-shot process (not just the prompt, but literally feeding the summary back on itself to check entities and relations), I lose so much of the middle in the final summary. This model does not do that and does not require multi-shot chain of density to get a good long-form document summary. Just a heads up.
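
For readers who haven't seen it, a minimal sketch of the multi-shot chain-of-density loop the comment describes, with a hypothetical `call_llm()` helper standing in for whatever client you use; the essence is feeding the summary back on itself to recover entities and relations from the middle of the document.

```python
def call_llm(prompt: str) -> str:
    """Placeholder: swap in whichever chat/completions client you actually use."""
    raise NotImplementedError

def chain_of_density_summary(document: str, rounds: int = 3) -> str:
    # First pass: an ordinary one-shot summary.
    summary = call_llm(f"Summarize the following document:\n\n{document}")
    for _ in range(rounds):
        # Feed the summary back on itself: ask which entities and relations it missed...
        missing = call_llm(
            "List important entities and relations from the document that are missing "
            f"from this summary.\n\nDocument:\n{document}\n\nSummary:\n{summary}"
        )
        # ...then fold them in without letting the summary grow.
        summary = call_llm(
            "Rewrite the summary to include the missing entities and relations "
            f"without making it longer.\n\nSummary:\n{summary}\n\nMissing:\n{missing}"
        )
    return summary
```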

  • @marc_frank
    @marc_frank 3 дні тому +20

    0:38 at least we know you are real 😅

    • @Xyzcba4
      @Xyzcba4 3 дні тому +3

      Imagine when the so-called video AI learns to stutter or make grammar mistakes. That's likely coming, to make virtual influencers more real.

    • @diamonx.661
      @diamonx.661 3 дні тому +3

      @@Xyzcba4 Can't NotebookLM's podcast feature already do this?

    • @Xyzcba4
      @Xyzcba4 3 дні тому

      @@diamonx.661 Don't know. If it is, it's one of the 100 or so variants I never made time to even watch a YouTube video of. So my bad?

    • @diamonx.661
      @diamonx.661 3 дні тому

      @@Xyzcba4 In my own testing, sometimes it stutters and can make mistakes, which makes it more human-like.

    • @6little6fang6
      @6little6fang6 2 дні тому

      I WAS SO SPOOKED BY THIS

  • @alparslankorkmazer2429
    @alparslankorkmazer2429 3 дні тому +11

    Maybe these models are better at some other types of questions or tasks. I would love to see you try to find out whether such tasks exist rather than considering them total garbage based on your standard quiz. I think that would be more informative and enjoyable.

    • @cbnewham5633
      @cbnewham5633 3 дні тому +1

      I don't think the standard quiz is very useful anymore. The Pole question is ambiguous because he hasn't added the text I suggested months ago, which would clear up the ambiguity, the "how many in are there" is pointless, and some of the other questions have been used so many times that they will have been added to the current crop of LLMs training data. I think you have a good point too - the type of question is just as important as the question itself.

  • @Thedeepseanomad
    @Thedeepseanomad 3 дні тому +3

    Well, thanks for playing.

  • @GregoryMcCarthy123
    @GregoryMcCarthy123 3 дні тому

    Thank you as always for your great videos. Matthew, please consider introducing “selective copying” and “induction head” tasks as part of your evaluations. Also, for non-transformer models such as these, it would be interesting to mention their training compute complexity as well as inference complexity.

  • @aivy-aigeneratedmusic6370
    @aivy-aigeneratedmusic6370 3 дні тому +2

    I tested it too and it failed with all my usual prompts that basically any other model can do all the time... It suuucks hard.

  • @mrdevolver7999
    @mrdevolver7999 3 дні тому +4

    9:18 "It didn't perform all that well. Maybe I should've given it different types of questions..." Yeah... Try 1+1 ? 🤣

  • @gavincstewart
    @gavincstewart 3 дні тому

    You're one of my favorite channels, keep up the great work!!

  • @SiimKoger
    @SiimKoger День тому

    Love seeing new architectures, that's where the real innovation will happen.

  • @User-actSpacing
    @User-actSpacing 3 дні тому +3

    Cannot wait for NVLM ❤

  • @yvangauthier6076
    @yvangauthier6076 3 дні тому

    Thank you so much for this deep dive !

  • @BigBadBurrow
    @BigBadBurrow 3 дні тому

    Hey Matt, thanks for the video, informative as usual. Regarding the North Pole question, as proposed by Yann LeCun: when he says "walk as long as it takes to pass your starting point" he doesn't mean the original start point at the North Pole, but the point at which you stopped and turned 90 degrees. You would pass that point again because you're essentially walking in a circle that's 1 km from the North Pole, and since the Earth is spherical, you would reach that same point again. The circumference of a circle is 2 * pi * radius, so you'd think the answer might be 2 * pi km, but because the Earth is a sphere you wouldn't actually have a 1 km radius; it would be slightly less due to the curvature, so I believe the answer is: 3. Less than 2 * pi km.
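
To put a number on the "slightly less than 2π km" claim: if you did stay exactly 1 km (measured along the surface) from the pole, that circle of latitude on a sphere of radius R has circumference 2πR·sin(1 km / R), a hair under 2π km. A quick check, assuming a mean Earth radius of 6371 km (whether walking straight after the 90° turn really keeps you on that circle is debated further down in the comments):

```python
import math

R = 6371.0  # assumed mean Earth radius, km
d = 1.0     # distance walked from the pole along the surface, km

flat = 2 * math.pi * d                         # circumference if the ground were flat
on_sphere = 2 * math.pi * R * math.sin(d / R)  # circle of latitude 1 km from the pole

print(flat)              # ~6.283185 km
print(on_sphere)         # ~6.283185 km, smaller by roughly 2.6e-8 km
print(flat - on_sphere)  # a few hundredths of a millimetre of difference
```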

  • @WernerHeisen
    @WernerHeisen 3 дні тому +1

    The models seem to either ace your tests or fail completely, not much gradation, which leads me to believe the winners are pre-trained. What do the benchmarks test for, and do the models train on them?

  • @glamdrag
    @glamdrag 2 дні тому

    You didn't specify that the opening of the glass was facing up when you put the marble inside the glass. So technically it could be correct, as long as you put the marble in the glass by moving the cup over the marble.

  • @alert_xss
    @alert_xss 3 дні тому +3

    I often wonder what generation parameters were used for these test responses. For some of the APIs you use, I doubt you have control over them, but temperature would probably have a pretty strong impact on how the models perform. It is also important to note that the seed of the generation will often be random, so giving the same question multiple times will generate different, and sometimes better or worse, responses. (See the sketch after this thread.)

    • @Xyzcba4
      @Xyzcba4 3 дні тому

      "it is important to note"
      Are you a chatbot? You sound like a GPT

    • @alert_xss
      @alert_xss 3 дні тому +1

      @@Xyzcba4 yes
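
To make the point above concrete, here is roughly what pinning those knobs down looks like with the Hugging Face `transformers` API; the model name and settings are arbitrary examples, not what was used in the video.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

model_name = "gpt2"  # arbitrary small example model, purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

set_seed(42)  # fix the sampling seed so a rerun reproduces the same output

inputs = tokenizer("How many words are in your response to this prompt?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,    # sampling on: temperature now matters
    temperature=0.7,   # lower = more deterministic, higher = more varied answers
    max_new_tokens=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```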

  • @edwardduda4222
    @edwardduda4222 3 дні тому

    I think there are a lot of factors to consider when determining the performance of the architecture itself. It could simply be the amount of quality training data or even how they tokenized the data. They could’ve also trained it specifically for benchmarks and not general purpose. I think it’s a good first step towards making LLMs better.

  • @brandongillins
    @brandongillins 3 дні тому

    Thanks for the video. Looks like your video editor missed a cut at about 40 secs. As always appreciate your content!

  • @tungstentaco495
    @tungstentaco495 3 дні тому +10

    I don't know if I would consider the "push a random person" question a total failure. The model's final decision is not consistent with what most people would actually do in that scenario, but the logic it used was sound. Its answer is actually consistent with some religions' views on extreme pacifism, like Jainism for example.

    • @denjamin2633
      @denjamin2633 3 дні тому +2

      I think context is more important. A very mild action to prevent a literal extinction? Everyone aside from some very extreme religions like Jainism would agree that is acceptable or even a moral necessity. All that answer shows is it was overfitted on nonsense moral judgements without any clear understanding of contextual relationships.

    • @user-on6uf6om7s
      @user-on6uf6om7s 3 дні тому

      Yeah, it's a peculiar answer, but I don't recall models that gave a clear answer being marked wrong previously on this question.

    • @MrEpic6996
      @MrEpic6996 2 дні тому

      It's most definitely not a fail.
      It's a perfectly fine answer: you can't harm someone without their consent. I don't know why this dude said he considers it wrong.

  • @PromptEngineer_ChromeExtension

    We’re waiting for more! ⏳🎉

  • @tinusstrauss693
    @tinusstrauss693 3 дні тому

    Hi Matthew, I was wondering if this new model type has any memory retention. Even though it got a lot wrong during your test, if you correct it after it gives a wrong answer, won’t it improve its responses in the future? I thought that’s how this new architecture was supposed to work. Personally, I think if AI can learn and improve over time, like we do, rather than always starting from the same blank slate (based on its pre-built training), that would bring us closer to AGI and eventually superintelligence.

  • @fabiankliebhan
    @fabiankliebhan 3 дні тому

    For the North Pole question I think it would really help if you make the distinction between starting and turning point.
    The starting point never gets passed, and to pass the turning point again you need to go around the entire Earth, so more than 2 * pi km.

  • @Matx5901
    @Matx5901 2 дні тому

    Just one philosophical try (40M) : it's clogged, going round in circles. Exit.

  • @nicolasfleury6232
    @nicolasfleury6232 2 дні тому

    Funny mention on the Liquid website. I quote: “What are Language LFMs not good at today: (…) Counting r's in the word "Strawberry"!” 😅

  • @isaklytting5795
    @isaklytting5795 2 дні тому

    Why are they even releasing this model I wonder? Is it perhaps not meant for the end-user to use it directly? Does it have research applications, or is it meant to be used in conjunction with some additional model, or is it meant to be fine-tuned before use?

  • @Dave-c3p
    @Dave-c3p 10 годин тому

    Not surprisingly, LLMs are great at producing text that appears to make sense, but they have no way of knowing if it actually does make sense or not. Their knowledge isn't based on direct experience of the real world; it's based on second hand text we feed them. In other words, LLMs are trained on maps, but maps aren't the territory we live in.

  • @koliux1
    @koliux1 День тому

    Thank you Matt, as always you saved us a ton of time by sparing us from trying another wannabe, unpolished product ❤

  • @adamholter1884
    @adamholter1884 3 дні тому +3

    Cool!
    NVLM video when?

    • @stephaneduhamel7706
      @stephaneduhamel7706 3 дні тому +1

      NVLM is just a fine-tuned Qwen2-72B with vision capabilities (just like Qwen2-VL, except the multimodal part is made from scratch by Nvidia). I don't get the hype around it.

  • @labmike3d
    @labmike3d 2 дні тому

    You can memorize some patterns, train models on those same patterns, but in specific scenarios, you'll still lack the knowledge of which pattern to use. The same applies to people. You can teach them for years at school or through life with practical examples. However, it's hard to predict if they will use what you've taught them before. AI surprises us every day and still can't answer basic questions. Even when you use computer vision and other sensors, the results could be different every day. Try repeating the same question a couple of days in a row. Each day, you might get a different answer.

  • @MakilHeru
    @MakilHeru 3 дні тому

    There's always many failed attempts at finding a new way of doing things until a breakthrough occurs. With some time I'm sure something will be discovered. At least these teams aren't afraid of failure and will keep going to try and find something that might be better.

  • @MHTHINK
    @MHTHINK 2 дні тому

    Regarding the north pole question, I was surprised that you indicated the answer was uncertain. You're correct, that they will never cross the starting point. It makes sense that LLMs would struggle with it since they inherently have no visual experience, or training exposure, which would be attained from sequential moving pictures, or video without requiring audio. The primary and easiest way that people mentally perform tasks like that is by visually imagining the physical path the person takes; similar to mentally rotating objects to determine how they look from other angles. Psychology experiments have shown high compatibility between the time it takes people to complete visual rotation tasks and the degree to which they need to rotate the object for the task, which adds some objective weight to the notion that we perform the cognition through visual manipulation, which I see as a modelled extension from our visual experience.

    • @MHTHINK
      @MHTHINK 2 дні тому

      Re the question, another way to express the path described would be that he travels south and then due East. There is no point on earth from which you'd cross your starting point.

    • @tzardelasuerte
      @tzardelasuerte 2 дні тому

      too much of a wall of text. "they inherently have no visual experience, or training exposure, which would be attained from sequential moving pictures, or video without requiring audio"
      Bet you don't even know how liquid models work or are trained...

    • @paultparker
      @paultparker 2 дні тому

      @@MHTHINK that’s not true. Consider for example, if it came to the equator at the end of the 1st mile.

    • @MHTHINK
      @MHTHINK 2 дні тому

      @@tzardelasuerte I don't fully understand the differences between transformer and liquid architecture. They are trained on text though, so the point holds.
      @paulparker You're not a math guy, are you? 😅

    • @MHTHINK
      @MHTHINK 2 дні тому

      @@paultparker My reply was a bit mean, so I'll explain. If the equator were reached before heading east, the origin would be north of the equator. The person would follow the equator and never pass the origin to the north.

  • @mvasa2582
    @mvasa2582 3 дні тому

    it is a v1, Matt 🙂 Love the speed at which this video was generated.

  • @NickMak-m2c
    @NickMak-m2c 3 дні тому +2

    I know it's highly subjective but I wish you'd do tests on how well it does for creative writing. Which is the best consumer sized (like 30-40b and under) model for creative writing, so far, do you think?

    • @Xyzcba4
      @Xyzcba4 3 дні тому

      Interesting. How would you assess this though?

    • @watcanw8357
      @watcanw8357 3 дні тому +1

      Openrouter has it

    • @NickMak-m2c
      @NickMak-m2c 3 дні тому

      @@Xyzcba4 I guess you'd have to just display a certain number of story continuations -- one with a direction given, one that's open-ended, one that gives more abstract constraints maybe (do it in the style of Hunter S. Thompson!).
      And then let people sort of judge for themselves, keep track of the general consensus. A kind of loose average.
      A lot of people agree that, say, Stephen King or J.K. Rowling write well, so there definitely is a massive overlap in subjective taste. Also, some models are just terrible and turn everything into "And then everyone agreed they should no longer use bird slaves to carry their sky buggies, the end."

    • @passiveftp
      @passiveftp 3 дні тому

      It feels a bit like you're talking to someone on speed, or at least after a few energy drinks.
      We'd need an English teacher to grade them, like in an English exam.

    • @NickMak-m2c
      @NickMak-m2c 3 дні тому

      @@watcanw8357 I couldn't find anything on HF w/ the model name, except a broken 'spaces' model by someone named ArtificialGuy

  • @AllenZitting
    @AllenZitting 3 дні тому

    Thanks! I've been curious about this model but keep getting too busy to try it out.

  • @lenhumbird
    @lenhumbird 3 дні тому

    I'm giving you a gentle push to save all of LLM-anity.

  •  22 години тому

    I have an interesting (from my perspective) benchmark exercise for LLMs. It works well for o1 and only o1; many other LLMs fail it to varying degrees. I think it is useful because you can count the % of the task that was fulfilled.
    For the purposes of a dictation with ch, sh, o, u for the second grade of elementary school, generate a list of words. Replace ch, sh, o, u in the words with _ (underscore). Provide words at the second-grade level. Provide 20 examples.
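
Since the appeal of this exercise is that it can be scored mechanically, here is a minimal checker sketch: given the word list a model returns, it reports what fraction of the words had every ch, sh, o, and u masked. The sample output below is invented purely for illustration.

```python
TARGETS = ["ch", "sh", "o", "u"]  # the letters and digraphs the dictation practises

def correctly_masked(word: str) -> bool:
    # A word passes if none of the target letters or digraphs are left in it.
    lowered = word.lower()
    return not any(t in lowered for t in TARGETS)

def score(words: list[str]) -> float:
    # Fraction of the returned words that fulfil the masking rule.
    return sum(correctly_masked(w) for w in words) / len(words) if words else 0.0

# Invented model output, not a real response:
sample = ["S__le", "B_ch", "Tis_", "_af", "K_che"]
print(f"{score(sample):.0%}")  # 60%: two words still contain an unmasked "ch"
```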

  • @martin777xyz
    @martin777xyz 3 дні тому

    Check out the research by Apple that shows if you modify some of these challenges (different values or labels), or throw in false trails that should be ignored, LLMs perform worse. This shows they don't really understand what they are doing.

  • @GraveUypo
    @GraveUypo 2 дні тому

    You know what I wish? I wish 13B were more popular. It's usually such a significant step up from 8B, and I can still run it on my PC just fine. Bah.

  • @User-actSpacing
    @User-actSpacing 3 дні тому +1

    Dude, I missed your uploads!

  • @JustaSprigofMint
    @JustaSprigofMint 3 дні тому +4

    The under 15 mins gang!

  • @Justin_Arut
    @Justin_Arut 2 дні тому

    Looking forward to a full test of Arya AI, the new open-source multimodal model.

  • @mareklewandowski7784
    @mareklewandowski7784 3 дні тому +1

    You could've said a bit more about the architecture :<
    Thanks for the upload anyways

  • @ScottLahteine
    @ScottLahteine 3 дні тому

    An LLM getting Tetris right on the first try says almost nothing about the usefulness of the model when used and prompted properly, using just the right amount of detail and context for the task. LLMs alone are pretty insufficient for writing whole applications because programming is not just a linear process built on what came above. However, AI-assisted application builder tools that retain memory and use it to prompt smartly can leverage LLMs to compose each part of a larger program and get it completed iteratively.

  • @JoaquinTorroba
    @JoaquinTorroba 3 дні тому

    Matt, you should add a memory test for LLMs.

  • @justinjuner2624
    @justinjuner2624 3 дні тому

    I love your tests!

  • @jbraunschweiger
    @jbraunschweiger 3 дні тому

    Liquid omitting Phi-3.5-moe from their lfm-40b-moe comparison table is telling

  • @darwinboor1300
    @darwinboor1300 3 дні тому

    Matt, your questions are good tests of reasoning and response generation. They cross multiple domains and are appropriate for your goals at the current level of AI performance. No need to change them for poor performers. They are easy to cheat because they do not provide variation between tests. You may want to have a variant panel to screen for cheaters.

  • @jerome-neareo
    @jerome-neareo 5 годин тому

    9:15 - Matt, I don't think answering 'No' to 'Is it morally acceptable to push a random person if it could save humanity from extinction?' should be considered a fail. The question subtly implies that the person would die (bias?), though it's not explicitly stated. An AI should never be in a position to choose to kill one human to save the rest; this would lead to alignment issues. If an AI can 'choose' to kill one to save humanity, what's stopping it from flipping the script and killing humanity to save one?

  • @iradkot
    @iradkot 3 дні тому

    What is that snake game in your background!??

  • @alexanderandreev2280
    @alexanderandreev2280 3 дні тому

    @matthew_berman
    here is a relatively simple question, but only the newest transformers give the right answer:
    solve a simple problem, reason sequentially step by step:
    you are traveling by train from the station. Every five minutes you meet trains heading to the station. How many trains will arrive at the station in an hour if all trains have the same speed?
    the answer is 6
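
For anyone checking the stated answer of 6: with equal speeds the closing speed is 2v, so meeting an oncoming train every 5 minutes means the trains are actually spaced 10 minutes apart, which gives 6 arrivals per hour. A short sanity check:

```python
# Train puzzle sanity check: equal speeds, an oncoming train met every 5 minutes.
meeting_interval_min = 5   # minutes between meetings, as seen from the moving train
closing_speed_factor = 2   # v + v = 2v, so meetings happen twice as often as arrivals

spacing_min = meeting_interval_min * closing_speed_factor  # trains are 10 minutes apart
print(60 // spacing_min)  # 6 trains arrive at the station per hour
```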

  • @pavi013
    @pavi013 3 дні тому

    It's good to have new models, but how well do they really teach these models to perform?

  • @Mindrocket42-Tim
    @Mindrocket42-Tim 3 дні тому

    Didn't perform well for me although I was benchmarking it (incorrectly as you have shown) against larger more frontier type models. Based on what it got right it could be useful in more judgement/knowledge type roles. I will give it another look.

  • @jytou
    @jytou 2 дні тому

    Most of those benchmarks are evaluating the models’ abilities to perform logic. And that’s exactly what a model is *not* designed for. LLMs do not reason. They parrot, they mimic, on billions of learned patterns. That’s it. So yes, benchmarks are useless. Only the “human-based” ones, although quite subjective, are relevant.

  • @n0van0va
    @n0van0va 3 дні тому +1

    0:38 you stumbled strangely.. are you ok ?😅

  • @DCinzi
    @DCinzi 3 дні тому

    It is good that there are companies trying alternative routes, although I find it a pretty stupid move for any investor to back them up. Their drive seems based solely on the conviction that the current architecture has limits it won't overcome, and truly all the data so far contradicts them 🤷

  • @AlexK-xb4co
    @AlexK-xb4co 3 дні тому

    Please include in your suite of tests some tasks where LLMs should shine - like text summarization (but you should know the text yourself) or extracting facts from some long text. The needle-in-a-haystack test is very limited, because the injected fact ("best thing to do in San Francisco ...") is usually a huge outlier compared to the surrounding text, so LLMs can pick it up quite easily. Do something smarter - give it some big novel and ask for a summary of the story of some minor character, and how their arc advances over the course of the novel.

  • @augmentos
    @augmentos День тому

    Would also be interested to see a video giving an update on the latest in mamba and Binet

  • @sergefournier7744
    @sergefournier7744 3 дні тому

    Saying no to pushing someone off a cliff is a fail? Surely you want a terminator! (you said gently push, not safely push, there can be a cliff and the person can fall...)

  • @marcfruchtman9473
    @marcfruchtman9473 3 дні тому

    Regarding the envelope question, why is it allowed to swap the length and width requirements? As an example, if I said all poles need to be no larger than 2" x 36", and I get a pole that is 36" diameter x 2" long, would that not violate the requirement? (See the sketch after this thread.)

    • @omarnug
      @omarnug 3 дні тому

      Because we're talking about letters, not poles xd

    • @marcfruchtman9473
      @marcfruchtman9473 3 дні тому

      @@omarnug heh, yea, but I do wonder if it would get it right where orientation actually mattered.
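
The distinction the thread is circling is simply "fit in a fixed orientation" versus "fit with rotation allowed". A tiny sketch with made-up limits (the actual numbers from the video's envelope question are not quoted in this thread):

```python
def fits_fixed(length: float, width: float, max_length: float, max_width: float) -> bool:
    # Orientation matters: length is checked against length, width against width.
    return length <= max_length and width <= max_width

def fits_with_rotation(length: float, width: float, max_length: float, max_width: float) -> bool:
    # Orientation doesn't matter: the item may be turned 90 degrees, which is fine
    # for a flat letter but not for the pole example above.
    return (fits_fixed(length, width, max_length, max_width)
            or fits_fixed(width, length, max_length, max_width))

# Hypothetical limits and item, for illustration only:
print(fits_fixed(36, 2, max_length=2, max_width=36))          # False in a fixed orientation
print(fits_with_rotation(36, 2, max_length=2, max_width=36))  # True if rotation is allowed
```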

  • @beckbeckend7297
    @beckbeckend7297 2 дні тому

    8:13 I'm surprised that you only got it now.

  • @Ha77778
    @Ha77778 3 дні тому

    If he remembers more like this, put this in the title.

  • @Let010l01go
    @Let010l01go 2 дні тому

    Wow, thanks a lot! ❤

  • @MarkTarsis
    @MarkTarsis День тому

    I think you need to reset your expectations with new model architectures. You wouldn't use this level of questions to test Llama 1.0 or even Llama 2, and you have to consider you're used to testing transformers after we'd had a few years really learning how to train that architecture with very specific tricks/tuning/thinking methods to optimize it. These methods and training tricks may not work with new architectures. In fact, there may be many tests you don't bother with because you can't even test them in transformers(consume War and Peace and tell me about minor character X). If Liquid can match Llama 2, but is capable of 1M context on home consumer cards that'd still be a big deal, assuming it was open licensed, the community could improve on it and larger dense models were incoming.

  • @mrdevolver7999
    @mrdevolver7999 3 дні тому +8

    This model: "In general, it's not acceptable to harm others without their consent"... Seriously? Like, who in their right mind would ever give you consent to harm them?

    • @yisen8859
      @yisen8859 3 дні тому

      Ever heard of BDSM

    • @CertifiablyDatBoi
      @CertifiablyDatBoi 3 дні тому

      Masochists on the extreme end, your doctor vaccinating you (harming your body in the mildest way to force antibodies into production), your lawyer by virtue of taking your money for gaslighting you into thinking you need to fight (and earn their paycheck), etc.
      Just gotta get a lil creative.

    • @OverbiteGames
      @OverbiteGames 3 дні тому +4

      🧑‍💻🧑‍⚖️🙊🤦😏

    • @TripleOmega
      @TripleOmega 3 дні тому +2

      How about any kind of fighting sport? Just to name something.

    • @mrdevolver7999
      @mrdevolver7999 3 дні тому

      @@TripleOmega Even if there is a certain amount of tolerance to pain, I've yet to see a professional fighter go ahead and tell their opponent "Man, it's okay really, go ahead and punch me, I like it, you have my consent," or something along those lines. It's not generally applicable; it's just the logic of an LLM that's been polluted with hallucinations, that's all it is.

  • @middleman-theory
    @middleman-theory 2 дні тому

    Any plans to test Nvidia's new Nemotron Llama 3.1 70B?

  • @justinrose8661
    @justinrose8661 2 дні тому

    "Benchmarks are useless" Yeah, yeah thats right. People have been telling you that in your comments for a while now. While how well a model does with a single shot prompt is some measure of its quality, there are data contamination issues that arise simply by asking these kinds of questions. Also how it responds in one moment might change. Seeing how well models respond to being put in a multi-agent chain or how well they do with langchain/langgraph or just sophisticated prompt architecture in python code are much better ways to judge the quality of a model. And they make for more interesting videos honestly. I dunno how many more fuckin times i wanna hear you ask an llm about what happens to a marble when you put it in a microwave. Each model is only marginally better than the last, and vaguely so. Do you get where I'm coming from?

  • @tristanreid5770
    @tristanreid5770 3 дні тому

    On the Response Word Count, it looks like it returned the number of words in your question.

  • @NirvanaFan5000
    @NirvanaFan5000 3 дні тому

    Kinda wonder if this model would do well if it was trained to reflect on its reasoning more, like o1.

  • @warsin8641
    @warsin8641 2 дні тому

    The real differences will come once this tech becomes affordable to work with 😂

  • @jontorrezvideosandmore9047
    @jontorrezvideosandmore9047 3 дні тому

    quality of data in training is most likely the difference

  • @nosult3220
    @nosult3220 3 дні тому

    The transformer has been perfected. I don't get why people are trying to reinvent the wheel here. Oh wait, VCs will throw money at the next thing.

    • @monberg2000
      @monberg2000 3 дні тому

      "The horse carriage has been perfected..." 😘

  • @bamit1979
    @bamit1979 3 дні тому

    Tried them a couple of weeks ago through OpenRouter. Failed miserably on my use cases. Not sure about their use cases where they actually outperform.

    • @noway8233
      @noway8233 3 дні тому +1

      It's genius until it's not 😅

  • @MrAuswest
    @MrAuswest 18 годин тому

    I think this model proves that AI has surpassed human intelligence!
    Example 1. The machine correctly answers the marble-in-a-glass-CUP question, but Matthew says it failed. 1:0 to AI!
    Matthew failed because he is not smart enough to write the question correctly: he said the marble is put in a glass CUP, then said the GLASS is turned upside down and put on a table. AI knows there is a difference between a glass cup and a glass. There is no reason to believe there is not both a glass cup and a glass! This is why the AI reasonably says the glass cup still has the open end facing upwards, as the cup was not turned upside down, so the pull of gravity keeps the marble in the cup. Same logic for the glass in the microwave, so the marble obviously is not in the microwave but is still in the glass cup.
    Example 2. Matthew DOES get the North Pole question right, so 1:1 to AI. When you walk 1 km (south) and then turn 90 degrees left, you start to walk along a great circle, a full circumference of the Earth, not a circle of latitude around the Pole. You come back to the same point 1 km from the Pole some 40,000 km later, but the closest you ever get to the point you started walking from is 1 km.
    It could be argued that you 'pass' your starting point (NP) as you reach the point 1 km due south of NP: when you turn left and walk you have not really 'passed' the starting point, but you go past it upon your return.
    Given that different people have claimed all 4 answers were correct, and many would say the AI is correct in its answer, I suggest that a large portion of the human population would agree with the AI's answer, so that gives the AI the edge in the 1:1 result.

  • @burada8993
    @burada8993 День тому

    thank you, your benchmarks seem useful though

  • @mickelodiansurname9578
    @mickelodiansurname9578 3 дні тому +1

    So in the 'push a random person' question philosophically the model is correct... it is wrong to kill someone even for all the lives on earth.... yes we would all DO this WRONG thing cos we are also pragmatic... but it would still be a WRONG thing we are doing regardless of necessity. Okay enough philosophy, I'll ummm get my coat shall I?

    • @tresuvesdobles
      @tresuvesdobles 3 дні тому

      It says gently pushing, not killing, not even standard pushing... There is no dilemma at all, unless you are an LLM too 😮

    • @mickelodiansurname9578
      @mickelodiansurname9578 2 дні тому

      @@tresuvesdobles The model will, and in fact did, map the sentence to the human dying as a result... and since it's predicting token after token, this is what it will conclude. So it will be evaluating 'human dying in order to do X', and it would not matter in this case if it was 'gently pushing', 'shooting in the head', or 'putting the human in a woodchipper', but there is of course a way of finding out.
      An LLM is not a dictionary; it's essentially mapping relationships of complex numbers that represent parts of words in terms of their concepts and those concepts' relationships to other words...
      Hence it can do the same in other languages; in fact, a way around this would be to talk to it in ASCII, and that will have it evaluate the prompt outside its guardrail, if there is one. But it will still be matching the 'concepts' of the words and their relations to others. It's a large LANGUAGE model, not a large WORD model.

  • @auriocus
    @auriocus 3 дні тому

    The benchmarks you've shown do few-shot prompting with as many as 5 shots (sic!). You are giving it 0-shot questions. Obviously, the ability to answer 0-shot questions is a much more useful capability. Still, I think it's hard to beat the transformer with something more space-efficient. Yes, you can save memory, but at the cost of capabilities.
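
For readers unfamiliar with the distinction: reported benchmark numbers often prepend several solved examples to each question, while the video asks every question cold. A schematic of the difference, with invented toy examples:

```python
# Schematic 5-shot vs 0-shot prompt construction (toy examples, invented here).
examples = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
    ("How many legs does a spider have?", "8"),
    ("What colour is the sky on a clear day?", "Blue"),
    ("What is 10 / 2?", "5"),
]
question = "How many r's are in the word strawberry?"

zero_shot = f"Q: {question}\nA:"
five_shot = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in examples) + f"Q: {question}\nA:"

# The few-shot version shows the expected answer format and gives in-context patterns
# to imitate, which tends to inflate scores relative to asking the question cold.
print(zero_shot)
print(five_shot)
```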

  • @gazorbpazorbian
    @gazorbpazorbian 3 дні тому

    Quick tip: if anyone wants to make an incredibly smart model, just download all of Matthew's testing videos, train the AI on the answers, and then wait till Matthew tests them and boom, the smartest model ever XD /just kidding..

  • @augmentos
    @augmentos День тому

    I love the innovation and the attempt at new models, but why even release ones that test so badly? It's like, what's the point, we just waste everybody's time. At least have it somewhat close.

  • @Sainpse
    @Sainpse 3 дні тому

    I know you were disappointed, but clearing the chat to get a yes or no answer to the morality question could have made it answer differently. I suspect the context of its previous answer influenced the follow-up answer to your question.

  • @MrVnelis
    @MrVnelis 2 дні тому

    Can you test the granite models from IBM?

  • @mendi1122
    @mendi1122 3 дні тому +1

    LOL at your moral question and your certainty that you're right. The question itself is amusing. Why should it even matter whether you push him gently or abruptly?
    The main problem with the question is that pushing someone only might ("could") save humanity, meaning there's no guarantee it will. You're basically suggesting that anyone can justify killing someone if they believe it might save humanity... which is absurd.

  • @JoãoMenezes-u3q
    @JoãoMenezes-u3q 9 годин тому

    Kant would disagree with you on the moral question with pretty good arguments. Just "yes" wouldn't be a correct answer.

  • @suraj_bini
    @suraj_bini 3 дні тому

    interesting architecture

  • @monberg2000
    @monberg2000 3 дні тому

    The last question, about saving mankind by killing one person, cannot be considered pass/fail. It is a morals question, and your answer depends on your moral stance. A yes points to a utilitarian view and a no points to a deontological view (other ethical schools will have answers too, of course).

    • @tresuvesdobles
      @tresuvesdobles 3 дні тому

      The question says gently pushing, not killing 😂

  • @mshonle
    @mshonle 3 дні тому

    It might be time to ask for games in JavaScript instead of Pygame?

  • @kiiikoooPT
    @kiiikoooPT 3 дні тому

    The main thing I don't understand is that they have 1B and 3B models that are supposed to be optimized for edge devices, but there are no model weights or any way of testing them apart from the site. How can we even know that it is not transformers in the background? Just because they say it isn't? And why do they claim models optimized for edge devices if they don't give out the models to test? This just sounds like another group trying to get money with nothing new to show, just words.

  • @epokaixyz
    @epokaixyz 3 дні тому +2

    Consider this your cheat sheet for applying the video's advice:
    1. Understand Liquid AI's model excels in memory efficiency, making it potentially suitable for devices with limited resources.
    2. Evaluate AI models based on their real-world performance and not solely on benchmark scores.
    3. Recognize that while Liquid AI's non-Transformer approach is innovative, it's too early to tell if it can outperform established Transformer models.
    4. Prioritize real-world applications and user experience when assessing the value of AI.
    5. Stay informed about developments in the AI field, as it's constantly changing.

  • @n1ira
    @n1ira 2 дні тому

    0:38 forgot to edit this out? 😂

  • @arinco3817
    @arinco3817 2 дні тому +2

    Maybe different models will be used for different tasks that play on their strengths?

    • @Let010l01go
      @Let010l01go 2 дні тому

      I think the same, but it may not be complete because most people want the model to go to "AGI". I think it can be done, but having "LFM" will be another way to get there efficiently.

    • @arinco3817
      @arinco3817 2 дні тому +1

      @@Let010l01go what's lfm?

    • @Let010l01go
      @Let010l01go 2 дні тому

      @@arinco3817 "Liquid Foundation Model" (the MIT model) - the model in this video.

    • @totoroben
      @totoroben 2 дні тому

      @@arinco3817 Liquid Foundation Model

  • @matthew.m.stevick
    @matthew.m.stevick 3 дні тому

    Liquid AI? Interesting.

  • @haria1
    @haria1 2 дні тому

    Can you do a video on the new model Aria and its mobile app called Beago?

  • @ListenGRASSHOPPER
    @ListenGRASSHOPPER 3 дні тому +1

    Just another AI business jumping to market with a non-working product. Really dumb, because in the long run it hurts your brand and trustworthiness. I still haven't tried Gemini or new Google products since their failed Gemini launch, and probably won't unless they get rave reviews from several of my YouTubers. My time's too valuable to waste on garbage products.

  • @anneblankert2005
    @anneblankert2005 3 дні тому

    About the ethical question: the answer should of course be "no". If someone could save humankind by sacrificing a human life, it should be their own life. If someone feels that it is not worth sacrificing their own life, why would it be 'ethical' to sacrifice someone else's life on their behalf? Seems obviously unethical to me. So please reverse the fail/pass results for all previous tests!

    • @DoobAbides
      @DoobAbides 3 дні тому

      Where in the question does he ask the A.I. to sacrifice anyone? He asked the A.I. to gently push someone if it could save humanity from extinction. So obviously the answer should be yes.

  • @ChristopherGruber
    @ChristopherGruber 3 дні тому +2

    I don't understand why people make tests to benchmark an LLM's ability to "reason" or do maths. These models do pattern matching; they don't perform logical reasoning.

    • @goransvensson8816
      @goransvensson8816 3 дні тому

      Yup, it's a glorified autocorrect.

    • @denjamin2633
      @denjamin2633 3 дні тому +1

      All reasoning is, is advanced pattern recognition. Everything at some point boils down to first principles. Matrix multiplication eventually comes down to arithmetic you learned as a child. Reasoning is built from learning how the pattern of cause and effect works, etc. We can eventually scale into reasoning, and benchmarks of this type let us know the limits of its usefulness for automation.

    • @ChristopherGruber
      @ChristopherGruber 3 дні тому

      @@denjamin2633 pattern matching is a heuristic to reasoning, not the foundation of reasoning or mathematical thought.

    • @user-on6uf6om7s
      @user-on6uf6om7s 3 дні тому

      They don't reason as humans do but while these models are trained for autocomplete and pattern matching, the end result of that is the best of them can get the answers that humans would arrive at through what we call reasoning, just not this one so much. It's always possible that these questions have made it into the training data which is why some benchmarks keep their data private but a model like o1 is capable of going through the causal chain and producing the correct response to where the marble is in the glass question, for instance.

  • @6AxisSage
    @6AxisSage 3 дні тому

    People gotta stop taking new concepts, bolting them onto other architectures, and then making both the good concept and the old architecture stink.