Thank you for this demonstration. In the future, please work on more complex apps. I’m happy you tried Tetris instead of only the snake game.
Yeah, the issue is that we need to balance the complexity of the tasks.
If it's too easy, all models get it right, so we cannot compare them.
If it's too difficult, all models fail, so we cannot compare them.
Tetris and Pac-Man currently seem like a good fit for SOTA models and aren't tested that often, so that's why I use them.
Funny thing: I tried the same Tetris example locally with the q8 and fp16 versions of Qwen 2.5 Coder 32B and it generated buggy code in both cases. When I tried with the default quantization (q4_k_m if I'm not mistaken) it got it perfect the first time (properly bounded and you could lose the game too). I guess there's a luck factor involved.
Yeah, it might be because of the luck factor.
Or maybe the architecture of Qwen is optimised for high quantization levels 🤷♂️
Or maybe your q8 version wasn't properly quantized, I think they updated their weights at one point.
Luck? It's called temperature nowadays :D
Yeah, I know.
Top_k also, right? @@66_meme_99
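For anyone wondering what those knobs actually do, here's a minimal, purely illustrative sketch (plain NumPy, made-up logits) of how temperature and top_k reshape the next-token distribution before sampling. That randomness is the "luck factor": the same prompt can give a working Tetris one run and buggy code the next.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=40, rng=None):
    """Toy next-token sampler: temperature rescales the logits, top_k keeps
    only the k most likely tokens, then we draw from the renormalised distribution."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    # Temperature < 1.0 sharpens the distribution (more deterministic),
    # > 1.0 flattens it (more random); near 0 approaches greedy decoding.
    scaled = logits / max(temperature, 1e-8)

    # Top-k: mask out everything except the k highest-scoring tokens.
    if top_k is not None and top_k < len(scaled):
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)

    # Softmax over the remaining candidates, then sample.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# With temperature > 0 the second-best token still gets picked sometimes,
# which is why two generations of the same prompt can differ.
logits = [2.0, 1.8, 0.3, -1.0]
print([sample_next_token(logits) for _ in range(10)])
```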
You should ask for physics demos like softbody, particles, fluid particles, cloth. Anything math-heavy, pretty much.
Okay, I will try in the next video
Sweet. I remember, when ChatGPT just appeared, feeling very pessimistic that this tech would be locked in big companies' datacenters. Glad I was wrong.
Yes, it's so awesome that this technology is going toward open source 👍
Not a great test to me because these models have been trained on these games before and the code is in there. Let's try something custom and see how it can reason, create and solve problems. That's what makes a good model. Also, Claude 3.5 Sonnet is the best coder and it very rarely makes mistakes when coding.
I would be happy to test any prompt you give me ^^
nice vid! what's your 3090 setup my guy
Asus ROG STRIX 3090
32GB DDR4 3200MHz
i9-11900KF
amazing, thanks for the test
Glad you liked it!
This is pretty cool to see! It's nice to see how the models compare against each other. For me, even the 3B model was amazing at making a Python snake game. Thanks for the comparison, it really does show the difference.
Yeah, I totally agree.
The Qwen series (especially the coder one for me) is just so amazing.
I don't know why they aren't as well known as the Llama ones.
Do you want me to make a video comparing the 3B to the 32B?
@@volkovolko Yeah, that would be really cool to see! I'd love to see how the models perform.
Okay, I will try to do it tomorrow
Nice video but I think Claude is still better. When I compare these models, at first I always say to myself: "If these models are at least slightly close to each other (in terms of technical specifications) it is okay to compare them, but if not, what is the point?"
Like, I understand comparing open-source models like Qwen and Llama, or closed-source models like GPT-4o and Claude 3.5 Sonnet.
Yes, the results of the tests I made in this video seem to show that:
GPT-4o < Qwen2.5-Coder-32B < Claude 3.5 Sonnet (new)
The point is to compare quality... simple as that. Once you know quality, you can consider other factors like speed, price, availability, and of course confidentiality. The fact that Qwen2.5-Coder-32B is even close to Claude while being a _small_ open-weight model is amazing.
Of course other factors can matter more than just quality. Speed and price are just as important. But limiting it to "Only compare quality when technical specs are comparable" makes no sense.
@@sthobvious It actually makes sense, because if you think about comparing GPT-3.5 with o1 or GPT-4o, do you really think that's fair?
Gpt-3.5: 😭
Gpt-4o & gpt-o1: 🗿🗿
The error produced by GPT was minimal; a "hallucination".
I think you should ask Qwen 2.5 Coder 32B again to make the Tetris game better, so it will be fair.
In my opinion, in the Tetris game Qwen literally won. Even Claude only generated better code after the error, but of course it failed at first.
Yeah, for me the win was for Qwen.
But okay, for the following videos, I will always give all models a second chance.
I will soon make a video comparing each size of Qwen2.5 Coder (so 0.5B vs 1.5B vs 3B vs 7B vs 14B vs 32B).
So subscribe if you want to be notified ^^
I also started to quantize each model in GGUF and EXL2 on HuggingFace for those who are interested: huggingface.co/Volko76
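By the way, if anyone wants to try one of those GGUF quants locally, here's a rough sketch using the llama-cpp-python bindings. The repo id and filename pattern below are placeholders, not the exact names on the page, so check huggingface.co/Volko76 for the real files.

```python
from llama_cpp import Llama

# Placeholder repo/file names -- look up the actual quant on huggingface.co/Volko76.
llm = Llama.from_pretrained(
    repo_id="Volko76/Qwen2.5-Coder-32B-Instruct-GGUF",  # hypothetical repo id
    filename="*q4_k_m.gguf",   # glob pattern for the q4_k_m quant
    n_gpu_layers=-1,           # offload all layers to the GPU if it fits in VRAM
    n_ctx=8192,                # context window
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Tetris game in Python using pygame."}],
    temperature=0.8,
    top_k=40,
    max_tokens=2048,
)
print(out["choices"][0]["message"]["content"])
```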
Seems very interesting, I will try it tomorrow. For me, Nemotron 70B was the best, but even on my 4090 I can't run it locally.
I made the video comparing sizes: ua-cam.com/video/WPziCratbpc/v-deo.htmlsi=o3eKo-3pGY78wmMr
Yes, 70B is still a bit too much for consumer grade GPUs
Thanks for the comparison but this was painful to watch. Please cut the parts that are not relevant to the subject or at least add timestamps
I'm trying to do my best.
When I made this video, I didn't have any speakers, so I couldn't test the audio or make great cuts.
If you do a real software project, you'll find out Claude Sonnet (new) is the best, and GPT-4 is very good at organizing.
I do real software projects as I'm a developer.
While Claude and GPT-4o are still better for big projects, Qwen is a good alternative for quick little prompts, to avoid going to Stack Overflow for simple questions.
Try a Next.js app.
Okay, I will try in the next video
Why do people do these stupid tests where the code can be found 1000 times on the internet?
As explained in the video, I'm looking for other original tests.
If you have one that you want me to try, feel free to leave it in a comment so that I can include it in a following video.
@@volkovolko If you are testing how to write a snake game, then you are basically testing knowledge retrieval, because that code exists in 1000 variants on the Internet. It gets interesting if you demand variations, like 'but the snake grows in both directions' or 'random obstacles appear and disappear after some time, not in too close proximity to the snake'. Think of whatever you want, but whether it can do Tetris or Snake is hardly a test for LLMs these days.
@5m5tj5wg The 'better' model is not the one that can retrieve known solutions better, but the one that can piece together the solution to an unheard-of but related problem better. If you can find the question and the answer on the net, then comparing a model with 32B params to a multi-hundred-billion parameter model like GPT-4o or Sonnet makes even less sense, because of course they can store more knowledge. You need to ask for solutions to problems where you cannot find the answer on the Internet to evaluate how good a model will be in practical use.
Yes, there is some truth to that. However, I think you can all agree that you don't want a 50+ min video.
Also, most of the code you will ask it to write in the real world is knowledge retrieval too. As developers, we very often have to remake what has already been made.
And the Snake game isn't that easy for LLMs. The Tetris game is very difficult, and I have never seen a fully working first try.
And it is interesting to see that the Qwen model did better on these "retrieval" questions than GPT and Anthropic despite being way smaller in terms of parameters.
It indicates that knowledge can still be compressed a lot more than we thought.