Benchmarks Say Claude 3 is Better than GPT-4, But is It?

  • Published 4 Mar 2024
  • Anthropic has released a new version of its Claude Large Language Model. The new LLM, called Claude 3, comes in 3 versions. According to the benchmarks, Claude 3 Opus is better than GPT-4. But do the real-world tests show the same thing? Let's find out.
    ---
    Let Me Explain T-shirt: teespring.com/gary-explains-l...
    Twitter: / garyexplains
    Instagram: / garyexplains
    #garyexplains
  • Science & Technology

COMMENTS • 27

  • @reficwitte5771
    @reficwitte5771 2 months ago +5

    Your questions might be good, but asking every question in the same chat window might help or confuse the model. Starting a new chat for each question would also be considered zero-shot. What you are doing currently might be referred to as "contaminated". Don't want to sound mad, but that's just the nature of chat; it doesn't transfer emotions very well. I hope this is useful to you! Thanks for the vid.

    • @GaryExplains
      @GaryExplains 2 months ago +2

      But you agree that the "contamination" is the same for every model, yes?

    • @sgartner
      @sgartner 2 months ago

      @GaryExplains That's a good point, since you did the same with the other models. It would be an interesting test to see if that significantly affected the responses in the different models...

  • @milan1p
    @milan1p 2 months ago +2

    Great video. You should look at doing some recall tests

  • @D3ND
    @D3ND 2 months ago

    This is an unrelated comment, but I can't find an older Gary Explains video (I think it is 3-4 years old). I'd be glad if someone could help me find it.
    In that video, Gary was presenting software that follows your actions and records them to create a manual or instruction set.
    Can someone point me to that video or the software? I've been trying to find it for half an hour with no success.

    • @GaryExplains
      @GaryExplains 2 months ago

      It is called Scribe - ua-cam.com/video/t1WkYkNcWMM/v-deo.html

    • @D3ND
      @D3ND 2 months ago

      @GaryExplains Thanks, you're amazing! Seems like my memory doesn't serve me that well, and it is much newer. Have a great day!

  • @justronny20
    @justronny20 2 months ago +1

    Claude Sonnet can write reports that cannot be detected by AI plagiarism detectors GPTZero and Turnitin. That is a win over GPT-4 in my book

  • @anb4351
    @anb4351 2 months ago +2

    You should now change your questions to make them a bit harder.

  • @KingFeraligator
    @KingFeraligator 2 months ago

    Can you do benchmarks of the free versions?

  • @mccannger
    @mccannger 2 months ago

    Thanks for the interesting comparison.
    Hopefully all the AI suppliers will start to compete on price and drive the costs down. I have no prob paying £20 (ish) for any of them, but as they're all so close on performance, availability, features etc, price is an important factor IMHO now.

  • @technolus5742
    @technolus5742 2 months ago +7

    I love this. But having tried Gemini Advanced, and from what I see here in this demo, they are all still a step behind GPT-4.
    And they all need custom instructions. With Professor Synapse instructions I get a noticeable increase in performance for complex coding prompts.
    Update: I have used Claude and it performs really well for coding.
    Update 2: Solves CAPTCHAs well too. Best performing model in my tests.

    • @tonysheerness2427
      @tonysheerness2427 2 months ago

      What is impressive is the speed at which it generates the answers. It has to interpret what you are asking, go to its database, search it, and then supply an answer. My mind boggles at the speed it does it.

    • @technolus5742
      @technolus5742 2 months ago +3

      @tonysheerness2427 Good point! I generally focus on the raw ability (because in my use cases, I don't mind waiting).
      That is a fair point indeed.

  • @jeffreyjoshuarollin9554
    @jeffreyjoshuarollin9554 2 months ago

    Inconceivable!

  • @ukaszLiniewicz
    @ukaszLiniewicz 2 months ago

    It definitely is. Especially the context. It can actually use information in most of its context window and apply it to, e.g., code. GPT-4 will either refuse, produce a generic and unhelpful response, or start hallucinating. Claude 3 is miles better for coding-related tasks. GPT-4 Turbo's "128k context" just doesn't work properly, and even the 32k version, which I tried via API, is not as capable within its context as Claude is within its 256k context. I had it print 900 lines of code with requested modifications, and it was actually correct. This won't be the case every time; it will make mistakes, but you will be able to have it correct them because the context can encompass both the code and the conversation.

  • @perschistence2651
    @perschistence2651 2 months ago +2

    What really puzzles me is that all these models show benchmarks where they beat GPT-4 everywhere, but when I try them, GPT-4 is definitely a step above. They are better than 3.5, but 4 is still ahead.

  • @tonysheerness2427
    @tonysheerness2427 2 months ago

    How will teachers know if AI did the students' homework?

    • @technolus5742
      @technolus5742 2 months ago +9

      By giving students a test. Those who did the homework will be prepared, those who didn't.... welp

    • @technolus5742
      @technolus5742 2 months ago +3

      Love this. Good breakdown of the capabilities.
      You could put the models through a LeetCode competition (easy, medium, hard) and see how they compare.
      A lot of people use these as coding assistants, and this is super relevant.
      I love that competition is gaining ground, but from my experience with Gemini Advanced and seeing your demo here, GPT-4 is still one step ahead.

    • @bakedbeings
      @bakedbeings 2 months ago

      The homework is for the students' own learning, and overseen by their parents. If those two aren't invested, it won't much matter what the teachers can detect.

  • @ThePowerLover
    @ThePowerLover 2 months ago

    Interesting.

  • @nhtna4706
    @nhtna4706 2 months ago

    Is it free, I mean does it have a basic, free version like ChatGPT?

    • @GaryExplains
      @GaryExplains 2 months ago

      Yes, as I said in the video. But when I tried it, the system was overloaded. But just try it and see.

    • @nhtna4706
      @nhtna4706 2 months ago

      @GaryExplains Smart guys. Whether it's really overloaded with real users or it's just a gimmick by the brand to make you pay for Opus, it will spoil the brand's reputation. It's like bombarding users with ads every 1-2 minutes to force them onto the paid plan to avoid ads ;)

    • @bakedbeings
      @bakedbeings 2 months ago

      @nhtna4706 Taking a position on what's marketing vs. capacity is a risk with no return, unless you have psychic powers. With time you might have answers, if you're still invested.