First one to hit
TEST KING! I do love to see it all, I find picking the mind of these beings to be highly lovely.
This is exactly the comparison I was looking for. No one else has it, so it's extremely valuable content. Now I don't have to spend a ton of time doing all the comparison testing and can just go with 7B/14B. Also very interesting to see that the 72B didn't do better than the much smaller 7B.
I would have said C is playing table tennis, because it is not fun for E to play alone...
E might be Forrest Gump.
I concluded the same tbh. Table tennis is a 2+ player game after all.
OOOOh The King Test I would not dare to miss that!
Everyone is claiming they beat Claude. 😅
Thanks for testing all these versions. 7B does seem good.
Test king!!!
0.5B is still useful for one thing - using it as a draft model. If you load a small draft model alongside a larger model, exllamav2 and llama.cpp can use the tiny model's predictions (speculative decoding) to greatly speed up the large model; the two models just need to share a tokeniser.
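If you want to try it, llama.cpp ships a speculative-decoding example; something like this should work (binary and flag names can vary between builds, and the GGUF filenames are just placeholders):

./llama-speculative -m qwen2.5-72b-instruct-q4_k_m.gguf -md qwen2.5-0.5b-instruct-q8_0.gguf --draft 8 -p "Write a binary search in Python"

The 72B still produces the final output; the 0.5B just drafts candidate tokens that the big model verifies in batches, so quality stays the same while generation gets faster.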
TEST KINGGGGUUU
Testking, it's why I watch
Nice! I really like these tests. Can you share the list of questions so I can try to reproduce the results? I've noticed that if you run the tests several times, the results may differ.
Test King ma man
Test king
Testking ✌️
Hear ye, hear ye! By the grace of His Majesty, I, thy humble servant, do bring tidings most wondrous. In mine inquiries, I have discovered a curious artifice known as "artificial intelligence," possessing powers most arcane. When presented with a task of great import, this sorcerous contraption doth summon forth the ability to inscribe upon parchment, as if by magic, through a mystical invocation they name "function calling in zed." Verily, 'tis a marvel to behold!
Translated: qwen2.5 coder can do workflow file creation in zed.
Test king ! 👑
test king!
Test King
6:23 while the 0.5B gets the answer right, its reasoning is completely wrong. In fact, it misunderstood the text, so it should be labeled a FAIL.
7:22 now this is impressive for 0.5B parameters.
TESTKING~~!
test king
I don’t want to write testking, but if it helps. 🎉 … my theory: we just don’t know what a butterfly should look like. 🤣🤣
TestKing!
TestKing :)
Test King
test king 🙂
Testking
Test king
king test for me 🎉
Test King is the best
Are you doing all your tests with temperature = 0? If not, you'll have randomness impacting the outputs, giving you inconsistent results.
Yes, but setting temperature to 0 may worsen the performance.
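For anyone who wants to test repeatability themselves, the Ollama HTTP API accepts a temperature in the options object (the model tag and prompt here are just examples):

curl http://localhost:11434/api/generate -d '{"model": "qwen2.5:7b", "prompt": "Which person is playing table tennis?", "options": {"temperature": 0}}'

At temperature 0 decoding is effectively greedy, so the same prompt should give the same answer on repeated runs.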
Great vid! Is your go-to coding combo still 3.5 Sonnet for code gen and Qwen 1.5B for code completion?
Yes, but I have recently switched to Supermaven for code completion. It's better and free.
@@AICodeKing alright, I'll give it a try 👀 thank you 👊
TestKing
Testking
test king :-)
C is playing table tennis. One cannot play table tennis alone.
You can play table tennis alone.
@@AICodeKing yep. I love playing table tennis alone. I score with every shot!
These are not good models; they seem to be focused on beating benchmarks instead of on reasoning.
Test king so you know
I have 64GB RAM and 12GB VRAM, can I run DeepSeek v2.5 locally?
LM Studio didn't work; it can offload to RAM after the VRAM is full.
Is there something that can also offload to SSD…? Can ollama do that?
Normally for a 7B model in ollama you need a minimum of 6GB VRAM, so DeepSeek 2.5 may not fit.
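One option worth knowing: llama.cpp (which ollama builds on) memory-maps GGUF files by default, so weights that don't fit in RAM get paged in from disk, just very slowly. You can also split layers between GPU and CPU with -ngl (the filename is a placeholder; raise or lower the layer count until it fits your 12GB):

./llama-cli -m deepseek-v2.5-q4_k_m.gguf -ngl 10 -p "Hello"

Anything not on the GPU runs on the CPU from RAM, and anything beyond RAM hits the SSD through the memory map.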
testking
test king
Planning on publishing your test bench results as a link or website? Do it before somebody else copies your idea 😅
On a side note, it would also be helpful to see the results side by side with other similar models, to compare and choose.
It's too much work. I am already exhausted from testing them. But I can do it. Good suggestion.
Do the 1.5B/7B coder models support FIM / are they usable for autocompletion?
They don't support FIM but can be used for autocompletion.
Test king😂
Testing
Testing, although I'm not sure this is a good use of video time. You could just do the tests and report on them.
Most people want to see the responses as well. That's why I do it like this.
Hmmmmm… 🧐
Third❤❤❤❤
Test
999
Second 😊
test king skipping
Lol Qwen2.5 is a mess.
test king(?)
I wonder in what kind of situations small models like 0.5B and 1.5B can be useful. From what I've seen, they're more like an AI toy than an AI model.
I think translation could be one strong suit. But I think there are some issues with that as well.
Test King
Test king
test king
Test king
TestKing
Testking
Testking
Really love your content, bro! As well as your calm and soothing way of doing it. You bring it in your own unique way… plus you make very helpful material!
test king
Testing
Test King
Test king
test king
Testking
Test king
Testking
Testking
Testking
But why not create a landing page for the testking, and only maintain the dynamic question list generated by an LLM?