Qwen2.5 Coder 32B vs GPT4o vs Claude 3.5 Sonnet (new)

  • Published 19 Dec 2024

COMMENTS •

  • @DOCTOR-FLEX
    @DOCTOR-FLEX 28 days ago +4

    Thank you for this demonstration. In the future, please work on more complex apps. I’m happy you tried Tetris instead of only the snake game.

    • @volkovolko
      @volkovolko  27 days ago

      The issue is that we need to balance the complexity of the tasks.
      If it's too easy, all models get it right, so we can't compare them.
      If it's too difficult, all models fail, so we can't compare them.
      Tetris and Pac-Man currently seem like a good fit for SOTA models, and they aren't that overused, which is why I use them.

  • @tpadilha84
    @tpadilha84 1 month ago +4

    Funny thing: I tried the same Tetris example locally with the q8 and fp16 versions of Qwen Coder 2.5 32B, and it generated buggy code in both cases. When I tried the default quantization (q4_k_m, if I'm not mistaken), it got it perfect the first time (properly bounded, and you could lose the game too). I guess there's a luck factor involved.

    • @volkovolko
      @volkovolko  1 month ago

      Yeah, it might be the luck factor.
      Or maybe the Qwen architecture is optimised for high quantization levels 🤷‍♂️
      Or maybe your q8 version wasn't properly quantized; I think they updated their weights at some point.

    • @66_meme_99
      @66_meme_99 1 month ago +2

      Luck is called temperature nowadays :D

    • @volkovolko
      @volkovolko  1 month ago

      Yeah, I know.
      Top_k too, right? @@66_meme_99
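
      For anyone wondering what those two knobs actually change, here is a minimal sketch of the sampling step, assuming nothing beyond NumPy (the logits are toy values, not taken from any real model):

      import numpy as np

      def sample_token(logits, temperature=0.8, top_k=40, rng=None):
          # Lower temperature sharpens the distribution (less "luck");
          # near 0 it approaches greedy decoding, higher spreads it out.
          rng = rng or np.random.default_rng()
          scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
          # top_k keeps only the k most likely tokens before sampling.
          if top_k is not None and top_k < len(scaled):
              cutoff = np.sort(scaled)[-top_k]
              scaled = np.where(scaled >= cutoff, scaled, -np.inf)
          probs = np.exp(scaled - np.max(scaled))  # numerically stable softmax
          probs /= probs.sum()
          return int(rng.choice(len(probs), p=probs))

      # Ten draws from the same toy logits: the output varies run to run,
      # which is exactly the "luck factor" seen in these one-shot tests.
      logits = [2.0, 1.5, 0.3, -1.0, -2.0]
      print([sample_token(logits, temperature=0.8, top_k=3) for _ in range(10)])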

  • @Rolandfart
    @Rolandfart 1 month ago +3

    You should ask for physics demos like softbody, particles, fluid particles, cloth. Anything math-heavy, pretty much.

    • @volkovolko
      @volkovolko  1 month ago

      Okay, I will try in the next video

  • @IgnatShining
    @IgnatShining 1 month ago +2

    Sweet. I remember feeling very pessimistic, when ChatGPT first appeared, that this tech would be locked up in big companies' datacenters. Glad I was wrong.

    • @volkovolko
      @volkovolko  1 month ago

      Yes, it's so awesome that this technology is going toward open source 👍

  • @bigbrotherr
    @bigbrotherr 19 days ago +1

    Not a great test to me, because these models have been trained on these games before and the code is in there. Let's try something custom and see how they can reason, create, and solve problems. That would make for a good model. Also, Claude 3.5 Sonnet is the best coder, and it's very hard to make it produce mistakes when coding.

    • @volkovolko
      @volkovolko  13 days ago +1

      I would be happy to test any prompt you give me ^^

  • @Kusmoti
    @Kusmoti 1 month ago +3

    nice vid! what's your 3090 setup my guy

    • @volkovolko
      @volkovolko  1 month ago +1

      Asus ROG STRIX 3090
      32GB DDR4 3200MHz
      i9-11900KF

  • @electroheadfx
    @electroheadfx 1 month ago +1

    amazing, thanks for the test

  • @SpaceReii
    @SpaceReii 1 month ago +3

    This is pretty cool to see! It's nice to see how the models compare between each other. For me, even the 3B model was amazing at making a Python snake game. Thanks for the comparison, it really does show the difference.

    • @volkovolko
      @volkovolko  1 month ago +1

      Yeah, I totally agree.
      The Qwen series (especially the coder models, for me) is just so amazing.
      I don't know why they aren't as well known as the Llama ones.

    • @volkovolko
      @volkovolko  1 month ago +2

      Do you want me to make a video comparing the 3B to the 32B ?

    • @SpaceReii
      @SpaceReii 1 month ago +1

      @@volkovolko Yeah, that would be really cool to see! I'd love to see how the models perform.

    • @volkovolko
      @volkovolko  1 month ago +1

      Okay, I will try to do it tomorrow

  • @oguzhan.yilmaz
    @oguzhan.yilmaz 1 month ago +2

    Nice video, but I think Claude is still better. When I compare models, I always ask myself: if the models are reasonably close to each other in terms of technical specifications, it's okay to compare them, but if not, what's the point?
    Like, I understand comparing open-source models such as Qwen and Llama, or closed-source models such as GPT4o and Claude 3.5 Sonnet.

    • @volkovolko
      @volkovolko  1 month ago +2

      Yes, the results of the tests I ran in this video seem to show that:
      GPT4o < Qwen2.5 Coder 32B < Claude 3.5 Sonnet (new)

    • @sthobvious
      @sthobvious 1 month ago +1

      The point is to compare quality... simple as that. Once you know quality, you can consider other factors like speed, price, availability, and of course confidentiality. The fact that Qwen2.5-Coder-32B is even close to Claude while being a _small_ open-weight model is amazing.
      Of course other factors can matter more than just quality. Speed and price are just as important. But limiting it to "Only compare quality when technical specs are comparable" makes no sense.

    • @oguzhan.yilmaz
      @oguzhan.yilmaz 1 month ago

      @@sthobvious It actually makes sense, because if you think about comparing GPT-3.5 with GPT-4o or o1, do you really think that would be fair?
      GPT-3.5: 😭
      GPT-4o & o1: 🗿🗿

  • @owonobrandon8747
    @owonobrandon8747 1 month ago +1

    The error produced by GPT was minimal; a "hallucination".

  • @cerilza_kiyowo
    @cerilza_kiyowo 1 month ago +2

    I think you should ask Qwen 2.5 Coder 32B again to make the Tetris game better, so it would be fair.
    In my opinion, Qwen literally won the Tetris round. Even if Claude generated better code after the error, of course, it failed at first.

    • @volkovolko
      @volkovolko  1 month ago

      Yeah, for me the win went to Qwen.
      But okay, in the following videos I will always give every model a second chance.
      I will soon make a video comparing each size of Qwen2.5 Coder (so 0.5B vs 1.5B vs 3B vs 7B vs 14B vs 32B).
      So subscribe if you want to be notified ^^
      I also started quantizing each model to GGUF and EXL2 on HuggingFace for those who are interested: huggingface.co/Volko76 (rough workflow sketched below)
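
      For anyone who wants to reproduce that quantization step, here is a rough sketch using llama.cpp's tools. The script and binary names match recent llama.cpp checkouts but may differ between versions, and the paths are illustrative, not the ones actually used for these uploads:

      import subprocess

      # Hypothetical local path; download the checkpoint first, e.g. with
      # `huggingface-cli download Qwen/Qwen2.5-Coder-32B-Instruct`.
      model_dir = "models/Qwen2.5-Coder-32B-Instruct"

      # 1) Convert the HuggingFace checkpoint to a full-precision GGUF file
      #    (convert_hf_to_gguf.py ships in the llama.cpp repository).
      subprocess.run(["python", "convert_hf_to_gguf.py", model_dir,
                      "--outfile", "qwen2.5-coder-32b-f16.gguf",
                      "--outtype", "f16"], check=True)

      # 2) Re-quantize the f16 GGUF down to q4_k_m (the Ollama default quant).
      subprocess.run(["./llama-quantize", "qwen2.5-coder-32b-f16.gguf",
                      "qwen2.5-coder-32b-q4_k_m.gguf", "Q4_K_M"], check=True)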

    • @renerens
      @renerens 1 month ago +1

      Seems very interesting, I will try it tomorrow. For me, Nemotron 70B was the best, but even on my 4090 I can't run it locally.

    • @volkovolko
      @volkovolko  1 month ago

      I made the video comparing sizes: ua-cam.com/video/WPziCratbpc/v-deo.htmlsi=o3eKo-3pGY78wmMr

    • @volkovolko
      @volkovolko  1 month ago

      Yes, 70B is still a bit too much for consumer-grade GPUs (quick math below)
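
      To see why, a back-of-envelope estimate in Python (the bits-per-weight figures are rough averages for common llama.cpp quant types, and this counts the weights only, ignoring the KV cache and activations):

      # Approximate memory needed just to hold 70B weights at common precisions.
      params = 70e9
      for name, bits_per_weight in [("fp16", 16.0), ("q8_0", 8.5), ("q4_k_m", 4.85)]:
          gib = params * bits_per_weight / 8 / 2**30
          print(f"{name}: ~{gib:.0f} GiB")
      # fp16 ~130 GiB, q8_0 ~69 GiB, q4_k_m ~40 GiB: all well past a 4090's 24 GiB.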

  • @nashh600
    @nashh600 1 month ago +1

    Thanks for the comparison, but this was painful to watch. Please cut the parts that are not relevant to the subject, or at least add timestamps.

    • @volkovolko
      @volkovolko  1 month ago

      I'm trying to do my best.
      When I made this video, I didn't have any speakers, so I couldn't test the audio or make great cuts.

  • @kobi2187
    @kobi2187 1 month ago +1

    If you do a real software project, you'll find that Claude Sonnet (new) is the best, and GPT-4 is very good at organizing.

    • @volkovolko
      @volkovolko  1 month ago

      I do real software projects, as I'm a developer.
      While Claude and GPT4o are still better for big projects, Qwen is a good alternative for quick, simple questions, instead of going to Stack Overflow.

  • @mnageh-bo1mm
    @mnageh-bo1mm 1 month ago +1

    Try a Next.js app.

    • @volkovolko
      @volkovolko  1 month ago

      Okay, I will try in the next video

  • @mathiasmamsch3648
    @mathiasmamsch3648 1 month ago +5

    Why do people do these stupid tests where the code can be found 1000 times on the internet?

    • @volkovolko
      @volkovolko  1 month ago

      As explained in the video, I'm looking for more original tests.
      If you have one you want me to try, feel free to leave it in a comment so that I can use it in a following video.

    • @mathiasmamsch3648
      @mathiasmamsch3648 1 month ago +2

      @@volkovolko If you are testing how to write a Snake game, then you are basically testing knowledge retrieval, because that code exists in 1000 variants on the Internet. It gets interesting if you demand variations, like 'but the snake grows in both directions' or 'random obstacles appear and disappear after some time, not too close to the snake'. Think of whatever you want, but whether a model can do Tetris or Snake is hardly a test for LLMs these days.

    • @mathiasmamsch3648
      @mathiasmamsch3648 1 month ago +1

      @5m5tj5wg The 'better' model is not the one that can retrieve known solutions better, but the one that can piece together a solution to an unseen but related problem. If you can find both the question and the answer on the net, then comparing a 32B-parameter model to a multi-hundred-billion-parameter model like GPT4o or Sonnet makes even less sense, because of course the bigger one can store more knowledge. You need to ask for solutions to problems whose answers cannot be found on the Internet to evaluate how good a model will be in practical use.

    • @volkovolko
      @volkovolko  1 month ago +1

      Yes, there is some truth to that. However, I think you can all agree that you don't want a 50+ minute video.
      Also, most of the code you will ask a model to write in the real world is knowledge retrieval too. As developers, we very often have to remake what has already been made.
      And the Snake game isn't that easy for LLMs. The Tetris game is very difficult, and I have never seen a first try that fully worked.

    • @volkovolko
      @volkovolko  1 month ago +1

      And it is interesting to see that the Qwen model did better on these "retrieval" questions than the GPT and Anthropic models, despite being way smaller in terms of parameters.
      It indicates that knowledge can still be compressed a lot more than we thought.