I'd like to see scores for quantized versions of the models as well, at least 4-bit and 8-bit. That is what most people are actually using.
I was also curious why this wasn't a filter option when they first released the dashboard.
Where is DeepSeek Coder V2?
Today's news:
Many developers have become overly obsessed with ranking positions, leading to excessive reliance on evaluation-set data during model training, and the old evaluation criteria had become too easy for current models. The updated evaluation therefore raises the difficulty to test how these models genuinely perform under harder challenges.
Notably, Alibaba's open-source Qwen-2 72B model stood out in fierce competition, not only outperforming tech giant Meta's Llama 3 but also surpassing Mixtral from Mistral AI, the renowned French large-model company, becoming the new industry leader. This achievement fully demonstrates China's leadership in the global open-source large model domain.
I'm usually hesitant to acknowledge this, given propaganda allegations, but I've used this model and it truly is something special. I'm happy to see real progress in the open weights space regardless of where it comes from.
Llama 3 70B is still king for text output. I've actually had 70B give me some output and use a word that I told it not to use, then say, "wait, you told me not to use that word", and start over! That blew my mind.
Do a video on self-play SPPO. I just tried their Llama 3 8B fine-tune and was blown away; it actually seems better than Llama 3 70B, which I know sounds crazy.
Thanks for the video on the leaderboard update too ❤
Sure thing!
I just tried the q8 version somebody had uploaded at the Ollama site and it didn't impress me much. Though it might be that something has gone wrong during conversion to gguf or quantization. So far my favorite 7b/8b model is still Open Hermes 2.5 FP16. :) It just answered a hard question I saw on YT that even ChatGPT couldn't get right in one shot.
"How many days will it take for a pond to be half-emptied with lilies if the number of lilies decreases by half every day, and it takes 9 days for the pond to be completely emptied?"
The biggest takeaway for me: where are the benchmark comparisons against the big models I'd actually want to measure these against, such as GPT-4o and Claude? I am not seeing them. Without those, the leaderboard floats on its own, with no easy way to make baseline comparisons.
Yeah, I tend to agree. It's hard to tell whether open or closed-source LLMs generally show less bias in their initial announcements / performance claims.
@@aifluxchannel Perhaps you could do a video comparing the top ranked performer with Claude?
Good vid, thanks for sharing!
Tracking on the Benchmark Leaderboard is interesting. Is there a way to quantify and display the scores of an average human and an expert human on the various benchmarks as a means of comparison? Perhaps it's already implicit in the Leaderboard. 🤷
I'm still surprised more dashboards aren't factoring in the *rate* of model improvement along with aggregate scores. Rate of advancement is much more interesting than static ranking to me at least.
Qwen2 is not better than Llama 3, so assuming that as a baseline seems wrong. If you take fine-tunes like the L3 gradient variants or uncensored ones into consideration, it gets even more complicated. The old HF benchmark is completely useless for several reasons. Aside from the Arena, the only other one I know of is EQ-Bench, which seems quite reasonable but is not updated very often.
I personally trust LMSys and Arena but that said, all of these can be gamed to a certain extent.
Llama 3 has 8k context, Qwen2 has 128k. Qwen2 does much better at multilingual tasks compared to llama 3. Qwen2 offers ultra-small variants that you can fine tune to perform tasks well. If you want a version of Qwen2 that is more English focused, just get the dolphin variant, which is a fine tune designed to uncensor LLMs, but has the side effect of potentially improving English performance for these Chinese LLMs.
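If anyone wants to poke at one of those ultra-small variants locally, something like this is enough. A minimal sketch, assuming `transformers` and `torch` are installed and that "Qwen/Qwen2-0.5B-Instruct" is the Hugging Face ID you want; swap in the 1.5B variant or a dolphin fine-tune the same way.

```python
# Quick local smoke test of a tiny Qwen2 chat model (model ID is an assumption, adjust as needed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "In one sentence, what is retrieval-augmented generation?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```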
@@blisphul8084 These are surely points to consider. But there are fine-tunes of Llama 3 with large context too. Aside from the fact that it takes so much memory that you need really expensive hardware to make full use of it, especially for the 70B models, I have limited trust in all these context-extension hacks, since even ChatGPT gets a lot dumber in longer chats. I see the same for local models once the original context length is reached. I'm also fine if the model speaks English and my native language, which is German, well. My main focus is on reasoning. If the model fails to answer simple logic questions that at times even a child can answer instantly, I cannot rely on it, and its coding skills are most likely affected by this too. Hence, defining a baseline that basically assumes Qwen2 is better than Llama 3 and then building on it seems wrong.
@@testales Indeed. I don't think that either Llama 3 or Qwen2 is entirely better; each one has strengths and weaknesses. I do think that Qwen2 choosing to release 0.5B and 1.5B models is huge for some use cases, even if they are pretty bad at logic and reasoning, and it is amazing what factual knowledge questions even those tiny models are able to answer coherently, making them great candidates for fine-tuning for lightweight use cases. The fact that those tiny models can make use of a 32k context window, even passing the needle-in-a-haystack test (based on their release paper), demonstrates strong abilities when combined with RAG for summarization and knowledge-lookup tasks, even if you are running old hardware without AI acceleration, or a phone. Keep in mind that Qwen2 is trained for Chinese language performance, so you're likely not seeing its full potential when chatting with it in English, which is crazy to think about considering that its English performance is quite good for the available sizes.
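To make the needle-in-a-haystack point concrete, here is a rough sketch of the kind of naive check I mean. Assumptions: the `ollama` Python package is installed, a local Ollama server is running with a small Qwen2 tag pulled, and the `qwen2:1.5b` tag, passphrase, and filler text are all illustrative, not the harness from the Qwen2 paper.

```python
# Naive needle-in-a-haystack check against a small local model served by Ollama.
import ollama

needle = "The secret passphrase is 'blue-falcon-42'."
filler = "This sentence is deliberately uninformative filler text about nothing. " * 500
prompt = (
    filler[: len(filler) // 2]
    + needle
    + filler[len(filler) // 2 :]
    + "\n\nWhat is the secret passphrase? Answer with the passphrase only."
)

response = ollama.chat(
    model="qwen2:1.5b",  # illustrative tag; use whichever local model you have pulled
    messages=[{"role": "user", "content": prompt}],
    options={"num_ctx": 16384},  # raise the context window past Ollama's small default
)
print(response["message"]["content"])  # a pass is any answer containing 'blue-falcon-42'
```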
I hoped they would add more coding benchmarks.
Hopefully this is more useful than all of the biased ones showing up on Twitter every three minutes 😂
We'll have to wait and see, what leaderboards do you currently use outside of the Open LLM Leaderboard?
@@aifluxchannel I like to use leaderboards that include closed source models
Wonder why Llama 3 got such a dramatically bad score on GPQA, similar to Gemma 7B (which is pretty bad) and other 7B models. Feels like something must be wrong with that score.
Edit: even Meta's official benchmark shows 39.2 0-shot.
This channel always has the wrong answer
Lost me at 0:31 for saying cloud 3.5 -.-
Thumb down
Thanks for the feedback.
bruh
Dude, it's just pronunciation. People who read a lot may know plenty, but pronounce things wrong.