I'd like to see scores for quantized versions of the models as well, at least 4-bit and 8-bit. That is what most people are actually using.
I was also curious why this wasn't a filter option when they first released the dashboard.
Where is DeepSeek Coder V2?
Today's news:
Many developers have become overly obsessed with ranking positions, leading to excessive reliance on evaluation-set data during model training, and the old evaluation criteria had become too easy for current models. The updated evaluation therefore raises the difficulty to test how these models genuinely perform under harder challenges.
Notably, Alibaba's open-source Qwen-2 72B model stood out in fierce competition, not only outperforming tech giant Meta's Llama 3 but also surpassing Mixtral from Mistral AI, the renowned French large-model company, becoming the new industry leader. This achievement fully demonstrates China's leadership in the global open-source large model domain.
I'm usually hesitant to acknowledge this, given propaganda allegations, but I've used this model and it truly is something special. I'm happy to see real progress in the open weights space regardless of where it comes from.
Llama 3 70B is still king for text output. I've actually had 70B give me some output and use a word that I told it not to use, then say, "wait, you told me not to use that word", and start over! That blew my mind.
Do a video on self-play SPPO. I just tried their Llama 3 8B fine-tune and was blown away; it actually seems better than Llama 3 70B, which I know sounds crazy.
Thanks for the video on the leaderboard update too ❤
Sure thing!
I just tried the q8 version somebody had uploaded at the Ollama site and it didn't impress me much. Though it might be that something has gone wrong during conversion to gguf or quantization. So far my favorite 7b/8b model is still Open Hermes 2.5 FP16. :) It just answered a hard question I saw on YT that even ChatGPT couldn't get right in one shot.
"How many days will it take for a pond to be half-emptied with lilies if the number of lilies decreases by half every day, and it takes 9 days for the pond to be completely emptied?"
The biggest takeaway for me: where are the benchmark comparisons against the big models I'd actually want to measure these against, such as GPT-4o and Claude? I am not seeing them. Without those, the leaderboard floats on its own, with no easy way to make baseline comparisons.
Yeah, I tend to agree. It's hard to tell whether open or closed-source LLMs generally show less bias in their initial announcements / performance claims.
@@aifluxchannel Perhaps you could do a video comparing the top ranked performer with Claude?
Good vid, thanks for sharing!
Tracking on the Benchmark Leaderboard is interesting. Is there a way to quantify and display the scores of an average human and an expert human on the various benchmarks as a means of comparison? Perhaps it's already implicit in the Leaderboard. 🤷
I'm still surprised more dashboards aren't factoring in the *rate* of model improvement along with aggregate scores. Rate of advancement is much more interesting than static ranking to me at least.
Qwen2 is not better than Llama 3, so assuming that as a baseline seems wrong. If you take fine-tunes like the L3 gradient variants or uncensored ones into consideration, it gets even more complicated. The old HF benchmark is completely useless for several reasons. Aside from the Arena, the only other one I know of is EQ-Bench, which seems quite reasonable but is not updated very often.
I personally trust LMSys and Arena but that said, all of these can be gamed to a certain extent.
Llama 3 has 8k context, Qwen2 has 128k. Qwen2 does much better at multilingual tasks compared to llama 3. Qwen2 offers ultra-small variants that you can fine tune to perform tasks well. If you want a version of Qwen2 that is more English focused, just get the dolphin variant, which is a fine tune designed to uncensor LLMs, but has the side effect of potentially improving English performance for these Chinese LLMs.
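If anyone wants to poke at one of those ultra-small variants locally, something like this is enough. A minimal sketch, assuming `transformers` and `torch` are installed and that "Qwen/Qwen2-0.5B-Instruct" is the Hugging Face ID you want; swap in the 1.5B variant or a dolphin fine-tune the same way.

```python
# Quick local smoke test of a tiny Qwen2 chat model (model ID is an assumption, adjust as needed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "In one sentence, what is retrieval-augmented generation?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```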
@@blisphul8084 These are surely points to consider. But there are fine-tunes of Llama 3 with large context too. Aside from the fact that it takes so much memory that you need really expensive hardware to make full use of it, especially for the 70B models, I have limited trust in all these context-extension hacks, since even ChatGPT gets a lot dumber in longer chats. I see the same for local models once the original context length is reached. I'm also fine if the model speaks English and my native language, which is German, well. My main focus is on reasoning. If the model fails to answer simple logic questions that at times even a child can answer instantly, I cannot rely on it, and its coding skills are most likely affected by this too. Hence, defining a baseline that basically assumes Qwen2 is better than Llama 3 and then building on it seems wrong.
@@testales Indeed. I don't think that either Llama 3 or Qwen2 is entirely better; each one has strengths and weaknesses. I do think that Qwen2 choosing to release 0.5B and 1.5B models is huge for some use cases, even if they are pretty bad at logic and reasoning, and it is amazing what factual knowledge questions even those tiny models are able to answer coherently, making them great candidates for fine-tuning for lightweight use cases. The fact that those tiny models can make use of a 32k context window, even passing the needle-in-a-haystack test (based on their release paper), demonstrates strong abilities when combined with RAG for summarization and knowledge-lookup tasks, even if you are running old hardware without AI acceleration, or a phone. Keep in mind that Qwen2 is trained for Chinese language performance, so you're likely not seeing its full potential when chatting with it in English, which is crazy to think about considering that its English performance is quite good for the available sizes.
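To make the needle-in-a-haystack point concrete, here is a rough sketch of the kind of naive check I mean. Assumptions: the `ollama` Python package is installed, a local Ollama server is running with a small Qwen2 tag pulled, and the `qwen2:1.5b` tag, passphrase, and filler text are all illustrative, not the harness from the Qwen2 paper.

```python
# Naive needle-in-a-haystack check against a small local model served by Ollama.
import ollama

needle = "The secret passphrase is 'blue-falcon-42'."
filler = "This sentence is deliberately uninformative filler text about nothing. " * 500
prompt = (
    filler[: len(filler) // 2]
    + needle
    + filler[len(filler) // 2 :]
    + "\n\nWhat is the secret passphrase? Answer with the passphrase only."
)

response = ollama.chat(
    model="qwen2:1.5b",  # illustrative tag; use whichever local model you have pulled
    messages=[{"role": "user", "content": prompt}],
    options={"num_ctx": 16384},  # raise the context window past Ollama's small default
)
print(response["message"]["content"])  # a pass is any answer containing 'blue-falcon-42'
```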
I hoped they would add more coding benchmarks.
Hopefully this is more useful than all of the biased ones showing up on Twitter every three minutes 😂
We'll have to wait and see, what leaderboards do you currently use outside of the Open LLM Leaderboard?
@@aifluxchannel I like to use leaderboards that include closed source models
Wonder why Llama 3 got such a dramatically bad score on GPQA, similar to Gemma 7B (which is pretty bad) and other 7B models. Feels like something must be wrong with that score.
Edit: even Meta's official benchmark shows 39.2 0-shot.
This channel always has the wrong answer
Lost me at 0:31 for saying cloud 3.5 -.-
Thumb down
Thanks for the feedback.
bruh
Dude, it's just pronunciation. People who read a lot may know plenty, but pronounce things wrong.