How Bad is Gemma Compared to Mistral?

  • Published 9 Sep 2024

COMMENTS • 64

  • @AdamTwardoch
    @AdamTwardoch 6 months ago +35

    "Beth bakes 4, 2 dozen batches of cookies in a week." - I don't understand this sentence at all, so I'm not surprised an LLM wouldn't. What is "four comma space two" supposed to mean?

    • @maxieroo629
      @maxieroo629 6 months ago +1

      Beth bakes 4 sets of 2 dozen batches of cookies per week

    • @pawelszpyt1640
      @pawelszpyt1640 6 months ago +3

      Yep, my immediate thought upon reading this prompt.
      You can try and test how an LLM responds to poorly written prompts, and perhaps that is a valid use case; however, I would choose a different prompt for it...

    • @joelashworth7463
      @joelashworth7463 6 months ago +1

      I agree with you; the prompt is not proper English. It should read: "If Beth bakes 4 cookies per batch and she bakes 2 dozen batches per week..."

    • @user-on6uf6om7s
      @user-on6uf6om7s 6 months ago +1

      Pretty sure it's 4 batches, not 4 cookies. I'm not sure this phrasing is grammatically wrong; commas can be used to separate things that would otherwise sound wrong without them, like "That's what I'm here for, for the fun" to avoid the double "for". But English is a weird language, and anyone who says they understand it is lying. In any case, it would be less ambiguous to say "4 batches of 2 dozen cookies in a week" (the arithmetic under that reading is sketched at the end of this thread).

    • @quanle760
      @quanle760 6 months ago

      True. Those LLMs were so stupid trying to answer the question. Such a waste of time watching this video.
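
    For reference, the arithmetic under the "4 batches of 2 dozen cookies" reading suggested above works out as follows; a minimal Python sketch (the per-person split from the video's full prompt is left out here):

        # Intended reading: 4 batches per week, each batch = 2 dozen cookies
        batches_per_week = 4
        cookies_per_batch = 2 * 12  # 2 dozen
        total_cookies = batches_per_week * cookies_per_batch
        print(total_cookies)  # 4 * 24 = 96

        # Misreading "4, 2" as two bare counts gives 4 * 2 = 8 instead,
        # which is one way a model (or a human) can end up far off.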

  • @clnv.
    @clnv. 6 months ago +21

    I loved the title 😂

  • @stratos7755
    @stratos7755 6 months ago +17

    8:53 I don't know if it's just me, but I like how non-aligned the Mistral model is.

    • @aliveandwellinisrael2507
      @aliveandwellinisrael2507 6 months ago

      Just wait a few years until they do the stuff that the supposed Q*/Qualia could do (develop a plan for improving upon its own model and request to implement it). You might want models to be at least a little aligned at that stage... Hm. I actually have some thoughts related to alignment and open-source models...
      My guess is that as the models approach the level where they can truly engage you in a discussion with some level of true understanding/reasoning, it'll be more difficult to have models that are "uncensored". For the moment, I'll define "uncensored" models as models which will provide you with the information you request, free of any constraints imposed by things like societal norms/political correctness or legality.
      As models become truly capable and approach (if not exceed) something like the "leaked" Q* (being much closer to true understanding and reasoning capability than e.g. GPT-4), it will become increasingly likely that such a system would take an "intentional" action that is detrimental to its user, or at least advantageous to the system, with the detriment to the user simply being acceptable collateral damage to the AI. Sufficiently advanced systems, at whatever point they emerge, would present a very real danger if used in the same way we use our current "uncensored" models. It's awesome right now, when they are something like a supercharged search engine, but once the next generations emerge, these truly capable systems will need to be reined in carefully, with methods that have been well thought out by large communities/groups of competent individuals in the AI sphere.
      Independent AI developers have been producing some incredible advancements in the field of free and open AI. These people are the hope of anyone who wants a future in which these incredible technologies are available to all, a future where the political opinions of those who happened to develop one model or another have ZERO bearing on my choice as to how I will use my AI models. However, with extremely advanced models with some level of true reasoning and understanding, alignment must be involved in these systems' construction.
      Personally, I hope the open-source AI community is prepared for tackling the real fun stuff: the truly reasoning models, which understand so much that they can infer things about you from your current conversation and the history of older conversations. THIS type of model needs guardrails. I do NOT want to have to choose between: 1) billions of parameters resulting from the entire internet, including medical literature, yet, through heavy reinforcement, a refusal to answer a question if it's anywhere even close to something like e.g. the biological reality of women (just an example), and
      2) a model with extreme capability and zero intelligently implemented features to ensure that there is alignment between the goals of the system and the goals of the user.
      TL;DR: Imagine models emerge that can truly think. Will we have to choose between models that either:
      -- are safe, but "hyperaligned" by their California creators, and so don't offer much freedom to truly query the system and obtain the truth, or
      -- aren't aligned, enabling anyone to obtain true info from any query regardless of taboo or legality; but don't forget that such a system can reason and think, and that it is not aligned to any human values.

    • @blacksage81
      @blacksage81 6 months ago

      It isn't just you.

    • @stratos7755
      @stratos7755 6 months ago

      @aliveandwellinisrael2507 Everything except Mistral (and maybe some uncensored Llama models) is overaligned. Sure, you can give the models some alignment with human morals. But what we see now is not that. They are lobotomized just to stay as safe as possible. If I want to hear a joke about a specific group of people, there is no reason why the AI can't tell me that joke. That does not mean that the AI is bad/racist/whatever.
      And once we have a fully thinking AI (so an actual artificial intelligence), the answer to the specific question at 8:53 should only be problematic to it if, at that point, it is problematic for humans too.
      So my point is that, sure, give them human morals/beliefs, but they should be capable of answering everything.

    • @truehighs7845
      @truehighs7845 6 months ago

      @@aliveandwellinisrael2507 Well, they are so clever yet so dumb. You can clearly see that when the model talks about "scientific consensus" it has been aligned, whereas when it wants to negate what you say, it will tell you there is no consensus. Which is fundamentally wrong: Popper specifically rejected testimonial truth as one of the bases of his epistemology. I suspect that's how they broke it the first time, and that is why it was capped at 2019 training data, because if the building blocks of its discourse are held together by logical semantics, there is only so much you can twist and turn before it loses grounding and starts hallucinating.
      AI can be useful for several things, but not for extrapolating truths, not in the way they are fine-tuned for spewing mainstream propaganda anyway.

  • @alx8439
    @alx8439 6 months ago +10

    The question about batches of cookies is worded in a way that's hard to parse, even for a human.

    • @tuna1867
      @tuna1867 6 months ago +2

      Exactly what I thought

    • @garyng2000
      @garyng2000 6 months ago

      It may not be the "correct" way, but doesn't that demonstrate the difference between an LLM and a human? :-)

    • @bgill7475
      @bgill7475 6 months ago +1

      @@garyng2000 No, it doesn't if humans have problems with it too.

    • @garyng2000
      @garyng2000 6 months ago

      @@bgill7475 Not all humans have problems with it, though. Yes, it is a weird way to phrase it, but I can understand the intent: something like "this sentence doesn't make sense... oh, you probably meant to say xxx".

  • @nicholasdudfield8610
    @nicholasdudfield8610 6 months ago +5

    Wow, an honest review :)

  • @testales
    @testales 6 months ago +3

    Open Hermes 2.5 can answer the cookies, the apples, and the glass door questions correctly, and also the object-dropping question if "think step by step" is added. Just saying. OH2.5 is still my 7B champion, and that's undisputed.

  • @GuyJames
    @GuyJames 6 months ago +7

    Mistral gets the door puzzle wrong, not right as you said: it tells the person to push when it should be pull.

    • @haywardito
      @haywardito 6 months ago +2

      I came here to make the same comment. Not sure if anyone else read through the entire answer where Mistral concludes we have to push from our current position.

  • @user-vw3zx4rt9p
    @user-vw3zx4rt9p 6 months ago

    This is the best performance I've seen for Gemma out of any video so far; I'm amazed it did this well for you. I asked it to summarize the Gettysburg Address, and it said no address was entered in the prompt, so it couldn't summarize it.

    • @engineerprompt
      @engineerprompt  6 months ago +1

      One common thing I noticed is that some folks are using llama.cpp or LM Studio. If you don't set the prompt template correctly, the model output is going to be really bad. That might be the case here as well.
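
      As an illustration of what getting the prompt template right means here: Gemma's instruct models expect turns wrapped in control tokens, and the safest route is to let the tokenizer build the prompt. A minimal sketch with transformers (assuming the google/gemma-7b-it repo; verify the template against the model's own tokenizer config):

          from transformers import AutoTokenizer

          # Build the chat prompt from the model's own template rather than
          # hand-rolling it; a wrong template degrades output badly.
          tok = AutoTokenizer.from_pretrained("google/gemma-7b-it")
          messages = [{"role": "user", "content": "Summarize the Gettysburg Address."}]
          prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
          print(prompt)
          # <bos><start_of_turn>user
          # Summarize the Gettysburg Address.<end_of_turn>
          # <start_of_turn>model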

  • @luigitech3169
    @luigitech3169 6 months ago +1

    Thanks for the clarification!

  • @mlsterlous
    @mlsterlous 6 months ago +1

    I like to test with this question: "Sally lives with her 3 brothers. If her brothers have 2 sisters, then how many sisters does Sally have?" Good models don't have problems with it.
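
    For reference, the catch is that Sally herself counts among her brothers' sisters; a minimal sketch of the intended logic:

        # The 3 brothers have 2 sisters; Sally is one of those 2,
        # so Sally herself has exactly 1 sister.
        sisters_of_brothers = 2
        sallys_sisters = sisters_of_brothers - 1  # exclude Sally
        print(sallys_sisters)  # 1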

  • @Kutsushita_yukino
    @Kutsushita_yukino 6 months ago

    I wasn't even stunned or shocked by the way they delivered this model.

  • @teddyfulk
    @teddyfulk 6 months ago +4

    I tested it this morning on Ollama and it wasn't good. It couldn't return JSON properly, for example, among other tests (see the JSON sketch at the end of this thread).

    • @nicholasdudfield8610
      @nicholasdudfield8610 6 months ago

      Was this all an elaborate troll of the benchmarks?!

    • @jbo8540
      @jbo8540 6 months ago

      Assuming you mean 2B on Ollama, since 7B doesn't work at this time and the latest Gemma on Ollama (the one you get with run or pull) is the 2B version. Gemma 2B is indeed terrible at instruction following, it seems. The 7B version may be better; we will see.
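
    Regarding the JSON test at the top of this thread: Ollama's REST API can constrain output to valid JSON via the format parameter; a minimal sketch (the model tag and prompt are illustrative, not from the video):

        import json
        import requests

        # Ask a local Ollama server for JSON-constrained output.
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "gemma:7b",  # illustrative tag
                "prompt": "Return a JSON object with keys 'name' and 'age'.",
                "format": "json",     # constrain decoding to valid JSON
                "stream": False,
            },
        )
        print(json.loads(resp.json()["response"]))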

  • @mickelodiansurname9578
    @mickelodiansurname9578 6 months ago +1

    Did you use the same hyperparameters, and are the hyperparameters even comparable? Do we know whether a temperature of 0.1 means the same level of randomness on both models? Because the elephant in the room here is parameter settings, right?
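
    One way to control for this is to pin identical sampler settings on both models and vary nothing else; a sketch against a local Ollama server (model tags and prompt are illustrative). Whether a temperature of 0.1 yields the same effective randomness in two different models is still a fair question:

        import requests

        # Same prompt and same sampler settings for both models, so any
        # difference in output comes from the models themselves.
        for model in ("mistral:7b-instruct", "gemma:7b-instruct"):
            r = requests.post(
                "http://localhost:11434/api/generate",
                json={
                    "model": model,
                    "prompt": "Beth bakes 4 batches of 2 dozen cookies in a week. How many cookies is that?",
                    "options": {"temperature": 0.1, "top_p": 0.9, "seed": 42},
                    "stream": False,
                },
            )
            print(model, "->", r.json()["response"].strip()[:120])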

  • @-Evil-Genius-
    @-Evil-Genius- 6 months ago +2

    🎯 Key Takeaways for quick navigation:
    00:00 📊 *Gemma vs. Mistral: Introduction and Model Overview*
    - Google released Gemma, which outperforms Llama 2 and Mistral 7B in benchmarks.
    - No official quantized version from Google, but options are available on Hugging Face, Perplexity Labs, Hugging Face Chat, and NVIDIA Playground.
    - Comparison between the Mistral 7B instruct and Gemma 7B instruct models using the Perplexity Labs interface.
    01:11 💻 *Model Performance Testing: Example Prompts and Responses*
    - Comparison of model performance on various prompts including math, coding, and logical reasoning.
    - Evaluation of responses from Mistral 7B instruct and Gemma 7B instruct models on different prompts.
    - Mistral 7B instruct model shows better accuracy and reasoning abilities compared to Gemma 7B instruct in certain scenarios.
    04:05 🔎 *Model Performance Testing: Additional Prompts and Responses*
    - Further examination of model responses on prompts related to logical reasoning and common knowledge.
    - Evaluation of Mistral 7B instruct and Gemma 7B instruct models' performance in understanding prompts accurately.
    - Comparison of model abilities in handling complex prompts and providing coherent responses.
    07:30 🧠 *Ethical and Practical Considerations: AI Alignment and Investment Advice*
    - Analysis of models' alignment with ethical considerations in hypothetical scenarios.
    - Examination of model responses to prompts involving ethical dilemmas and decision-making.
    - Testing the models' abilities to provide practical advice, such as investment suggestions, with varying degrees of success.
    10:36 💡 *Model Application Testing: Programming and Creative Tasks*
    - Evaluation of models' capabilities in performing programming tasks and generating creative content.
    - Testing Mistral 7B instruct and Gemma 7B instruct models on tasks related to coding, writing scripts, and generating recipes.
    - Comparison of model performance in executing specific tasks accurately and efficiently.
    13:07 📈 *Final Assessment and Conclusion*
    - Summary of findings comparing Gemma and Mistral models across various tasks and prompts.
    - Personal assessment of Gemma 7B's capabilities, acknowledging strengths in coding tasks but inferior performance in other areas compared to Mistral 7B.
    - Acknowledgment of the need for continued evaluation and improvement in AI model development and testing methodologies.
    Made with HARPA AI

  • @alxpunk01
    @alxpunk01 6 months ago

    It responded 0.5 cookies for me and argued with me when I told it it was wrong. I walked it to the correct answer, and then on the next prompt it disagreed with me again. Llama 2 7B/13B got the correct answer, no problem.

  • @iamfoxbug
    @iamfoxbug 6 months ago

    HELP MY NAME IS GEMMA I WAS NOT EXPECTING THIS 💀💀

  • @Dr_Tripper
    @Dr_Tripper 6 months ago

    I just tried AlphaMonarch. It is top-notch C-3PO material. I am looking for an uncensored version, though.

  • @garyng2000
    @garyng2000 6 months ago

    What are the results from GPT-4 Turbo or Gemini 1.5 Pro? I am interested to know.

  • @mlsterlous
    @mlsterlous 6 months ago

    By the way, do you actually know that there are much smarter 7B models? For example, one of my favorites at the moment is kunoichi-7b from Hugging Face (you can use it locally, offline; a loading sketch follows at the end of this thread). I just tested all your questions except coding, and it answered all of them correctly (the one about the kitten too).

    • @Arc_Soma2639
      @Arc_Soma2639 6 months ago

      Is it good at Japanese?

    • @mlsterlous
      @mlsterlous 6 months ago +1

      @@Arc_Soma2639 Do you mean just chatting with it in Japanese? Then no: it will understand everything but will not speak high-level Japanese, because its main language is English. But when it comes to understanding ANY language and translating/summarizing into English, it's very good.
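
    For anyone wanting to reproduce this locally, a minimal transformers sketch; the repo id below is an assumed community upload, so swap in whichever 7B model you actually want to test:

        from transformers import AutoModelForCausalLM, AutoTokenizer

        repo = "SanjiWatsuki/Kunoichi-7B"  # assumed repo id; adjust as needed
        tok = AutoTokenizer.from_pretrained(repo)
        model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

        prompt = "Sally lives with her 3 brothers. If her brothers have 2 sisters, how many sisters does Sally have?"
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=64)
        print(tok.decode(out[0], skip_special_tokens=True))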

  • @thomassynths
    @thomassynths 6 months ago +3

    Google keeps embarrassing itself with its LLMs. Despite having tons of compute and data, they are still being lapped by other companies.

    • @xaxfixho
      @xaxfixho 6 months ago

      It's brains 🧠, not brawn.

    • @blacksage81
      @blacksage81 6 months ago

      THANK YOU. Google did their part imo by just releasing the Transformers paper.

    • @xaxfixho
      @xaxfixho 6 months ago

      @@blacksage81 If I recall correctly, most of these guys ended up leaving and starting something else.

    • @thomassynths
      @thomassynths 6 months ago

      @@blacksage81 "Attention Is All You Need" was released what, 6-7 years ago? That furthers my point.

    • @blacksage81
      @blacksage81 6 months ago

      @thomassynths Furthering your point furthers my own. They've done Kenough, and it's time for them to sit down.

  • @emmanuelkolawole6720
    @emmanuelkolawole6720 6 months ago +1

    Is TheBloke no longer active? He should have created the correct GGUF by now. But it seems like he has quit Hugging Face.

  • @dibu28
    @dibu28 6 months ago

    Is it possible to use it in your localGPT project?

    • @engineerprompt
      @engineerprompt  6 months ago

      Yup, video coming soon

    • @dibu28
      @dibu28 6 months ago

      @@engineerprompt Great!

  • @buttpub
    @buttpub 6 months ago

    How about doing some real tests?