Your Favorite LLMs BATTLE In Street Fighter - New Benchmark!! (Tutorial)
- Published 31 Mar 2024
- LLM Colosseum is a new way to think about AI benchmarks: let them battle it out in Street Fighter! Which LLM will win?
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? ✅
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
Rent a GPU (MassedCompute) 🚀
bit.ly/matthew-berman-youtube
USE CODE "MatthewBerman" for 50% discount
Media/Sponsorship Inquiries 📈
bit.ly/44TC45V
Links:
docs.diambra.ai/#installation
github.com/OpenGenerativeAI/l... - Science & Technology
Is this a good LLM benchmark?
I believe it could be for benchmarking some aspects of reasoning, until models start to become trained to be good at this specifically (as with all benchmarks, this is when they all become invalid).
However, that holds only if the Street Fighter instance waits until both models have provided their output; otherwise we are also (and most likely predominantly) measuring their inference speed: tokens/s on a given machine plus latency. Then again, if your use case is latency-dependent, then fair enough, it is a decent benchmark as it is.
Testing usability and latency. I can't believe the prompts are just text descriptions, not even pictures.
As long as the key to winning matches in this game is LLM inference speed, it is good for measuring that but nothing else. The most important characteristic for achieving AGI is complex reasoning and planning capability, and that has nothing to do with this game. Maybe a complex turn-based strategy game (like chess or Go) would be much more interesting for measuring more useful characteristics of language models. This game is just a funny curiosity in my opinion.
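A minimal sketch of the "wait for both models" idea, assuming a game that can be paused between steps. The names `Agent` and `lockstep_match` and the stand-in agents are hypothetical and are not taken from the LLM Colosseum code; the point is only that every model gets exactly one move per step, however long it takes to answer.

```python
from typing import Callable, Dict, List

# Hypothetical agent type: takes the current game state, returns a move name.
Agent = Callable[[dict], str]


def lockstep_match(agents: Dict[str, Agent], steps: int) -> List[Dict[str, str]]:
    """Collect one move from every agent before the game advances a step."""
    state = {"step": 0}
    history: List[Dict[str, str]] = []
    for _ in range(steps):
        # The game is treated as paused while each agent answers, so a slow
        # model gets exactly as many moves as a fast one.
        moves = {name: agent(state) for name, agent in agents.items()}
        history.append(moves)
        # Stand-in for advancing the real game by one step.
        state = {"step": state["step"] + 1}
    return history


# Two stand-in agents; in a real run these would call an LLM.
agents = {
    "fast_model": lambda state: "Low Punch",
    "slow_model": lambda state: "Fireball",
}
history = lockstep_match(agents, steps=3)
```

Run this way, the result would reflect move quality rather than tokens/s; the real-time setup measures both at once.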
Playing Go might be. I suspect large models would have an advantage here, provided they manage to survive the game long enough at the beginning.
@@rootor1 True, would be interesting to watch LLMs play chess or Go.
That’s truly amazing, love how people are thinking about different ways to compare models. So much fun
Crazy stuff. But I love that there is a new benchmark. I am so happy to be living during these times.
Sameee
A blessing and a curse.
OpenAI GPT 3.5 is goated, as expected. Andrew Ng was talking about how GPT 3.5 multi-agent swarms are comparable with GPT 4, even though GPT 3.5 is significantly worse on its own.
Fuuudge. I thought this was the coolest then remembered the day
Very cool! Thank you.
How does each LLM know where its fighter is on screen? How does it know if the character has jumped? These LLMs can't actually see the screen. Do they use the characters' coordinates to locate their own character?
I think the emulator gives certain outputs, like an api, that the LLM is fed. Certain data points are given to it so it knows what’s happening. People do this with other games like trackmania to train AIs.
You're a madman. What are you putting out like 3 videos per day?
Someone should make this for Monopoly, imagine having human-like bots that can trade cards and stuff
I think that would be better than this. This really seems like it would be more for testing speed. There really aren't many options that the LLM has to think through with this, and thinking far in advance doesn't really help all that much either. Monopoly would work great for testing logic.
@@pin65371 Ikr, I'm surprised we don't have real AI Monopoly where the AI can chat and trade, I've seen one on the android store but the AI can't chat and LLM is only at max difficulty level annoyingly trying to trade every single turn (and btw what if I don't want to play at max difficulty?...)
Edit: I want literally ChatGPT to play monopoly with me 😅
Great find and reporting. Definitely will grab people's attention.
I'm eager to see which LLM will be Daigo level!
I'd watch the punk ai streams. Can ai teabag?
Really coool, I trust this more than traditional benchmarks for sure
Thanks!
Thank you!!
So, let's see a Quake3 wasm deathmatch between 6 LLM bots.
Far more interestingly, let's see some squad based co-op gaming
It would be interesting to compare the results of the various matches with equivalent matches run such that the emulator is paused until both LLMs have provided an answer.
It would be amazing if we could see the system prompts as well, just curious to understand how they work under the hood.
so wait, this is not an April Fools' joke? lol
Where would be the best place to purchase/download the SF3 ROM file that would work with this?
Already testing A.I. for the battle field.
I got the ROM and I'm on Linux on an RPi 5. What emulator shall I use to play it manually?
This is just a game but in a decade wars will be fought between good AI and a Bad AI. Scary times ahead of us and this entertaining presentation is just a beginning.
Can we also play with an LLM this way? I'd love to try that. But it definitely is interesting to watch. Just one year and now they even play Streetfighter against each other... I'd prefer Death & Alive too =)
This is pretty dope... I thought it was going to be an April Fools' joke, but it's real.
Wait, are you sure? Really?
I would guess that Claude 3 Haiku would be really good.
I can't wait to challenge some LLMs lol
This is very impressive. It's also alarming. LLMs are apparently already set to perform any job that requires a human to interpret the current situation, and to input "the right moves" into a system to achieve a win condition. Over and over.
This is applicable from accounting to air-traffic control. Anything a person at a terminal calls a job. Wow.
Nice.
Wow this is mindblowing xD
Teaching AI how to kick our asses in a street fight, great idea.
edit2: GAH, wrong name, I meant MUGEN, not MAME
Why don't they use an opensource fighting game engine like MAME that allows you to just use custom characters? (that's open source, right?)
edit: Or does it also need ROMs based on original games even when using custom characters?
I ran this and the prompt said Street Fighter 2: Turbo is the only game worth playing.
can you give us more updates on the LLM Street fighter league, it's a much more human way of benchmarking these models
I can relate to these larger, slow LLMs being smoked in blitz and bullet vs rapid chess. Damn you "quick on your feet" people. 😅
Great video - one item to add if you are installing by following the video: before running python3 -m pip install diambra-arena, run python3 -m pip install diambra first.
Thank You :)
Interesting.. thanks for sharing
Basically mashing buttons more quickly can lead to a victory over somebody who's trying to strategize in this kind of game. If you let the same models play a more turn-based strategy game that would show you the quality of their output over the quantity of their output.
You are sooo right. Remember project CICERO from Meta? I wish someone did this Colosseum thing but Cicero style
Not true at all. Tell me you don't know the game without telling me you don't know the game.
Street fighter is not a button masher!
@@kr4dh4x0r You missed the point
@@chrisbraeuer9476 He never made that statement
Sir, I am from India and love your content a lot. It helps me think about how to use tech out of the box ❤❤❤❤❤
We should use OpenDevin to make Open Street Fighter 3. :)
Awesome man! Keep featuring new open source projects please!
I can't wait until we have LLMs duking it out in Starcraft. I still can't believe they got AlphaStar to play it, and it'd be fascinating to watch AI pick up more strategies and get thoroughly better than human players.
Someone should do a Heroes 3 variation of that.
Same army: 3 Archangels, 10 Champions, 25 Monks, 30 Crusaders, 50 Griffins, 70 Marksmen, 100 Pikemen.
Will the OpenAI team be ready for the 2025 Esports World Cup?? 🤔
Oh man you're best. Thanks
Okay, but can they do a Zelda randomizer speedrun race?
three street fighters are in a room... lol
This should be the only benchmark
By the way, with this kind of framework, why not make an LLM that drives... GTA, and later real cars?
Anyone tried it on Groq?
❤
Awesome pick ❤
Imagine AI Vs Tekken 8 world champion.
Add Groq and you will win all the time
If they're this fast maybe we don't need groq
Groq cannot realistically be run locally. Doesn't it have huge VRAM requirements?
@@exzld I am talking about GroqCloud; they have very fast inference speed.
Damn it... Why today? This would've been so cool...
Imagine Ai pulling an Evo moment 37
At the same time as new models are becoming more intelligent and hardware is speeding up, old models are becoming faster and faster!
cool, works but almost forgot to use requirements.txt, sh..t happens 😀 Many thanks for sharing !
Who did this? It clearly had to be street fighter 2 special champions edition on 8 stars speed
Nice light-hearted fun.
AGI will be achieved when models can win at dwarf fortress.
Woah! Best benchmark so far! IMPOSSIBLE TO PRETRAIN ON THE DATA!!
Video game guides made this possible in the first place. There are countless movesets for Street Fighter on the internet.
It's an LLM, it's already trained :''D
@@MagusArtStudios But the fact still remains, the LLM can't just memorize the answers. There are no answers! It has to actually think in order to win.
How it works:
The Python code "translates" the graphics in the game by narrating them. For example, it feeds the LLM: "You are far from the opponent, you should move closer" or "You are close to the opponent, you should attack". Of course, this is all done algorithmically (like calculating the distance between players) and leads the LLM very restrictively toward what it should output. It goes on to narrate the actions of your opponent and adds leading hints to the prompt like "You have more than 30 energy, you could cast a super power".
Fun to watch, but not really a fair comparison of each LLM's agency and intelligence capabilities. Considering how restrictive it is, this being used as a benchmark is most likely an April Fools' joke. But seriously, though, using AIs as game agents to benchmark them is not a bad idea at all.
This (the pipeline) is only gonna be improved. I suspect a version of this plus an AlphaZero-esque pipeline would push it to better and better plays, like everything else we've seen
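The narration step described in this comment could look roughly like the sketch below. This is an illustration, not the project's actual code: the function name `describe_state`, its parameters, and the `close_range` distance threshold are assumptions; only the hint wording and the 30-energy threshold come from the comment above.

```python
def describe_state(own_x: int, opponent_x: int, energy: int,
                   close_range: int = 50) -> str:
    """Turn raw game values into the text hints that are fed to the LLM."""
    hints = []
    # The distance between the players is computed algorithmically;
    # close_range is a made-up threshold for this illustration.
    if abs(own_x - opponent_x) > close_range:
        hints.append("You are far from the opponent, you should move closer.")
    else:
        hints.append("You are close to the opponent, you should attack.")
    # Leading hint once the energy bar passes 30, as quoted in the comment.
    if energy > 30:
        hints.append("You have more than 30 energy, you could cast a super power.")
    return " ".join(hints)


far_prompt = describe_state(own_x=0, opponent_x=200, energy=10)
close_prompt = describe_state(own_x=100, opponent_x=120, energy=40)
```

Here `far_prompt` contains only the "move closer" hint, while `close_prompt` contains both the "attack" hint and the super-power hint, which shows how strongly the prompt steers the model's choice.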
@@fire17102 The AI that came with the game 30 years ago is already miles ahead of anything you are suggesting, and uses at least 100,000x less power to do it.
Haikuu must kick ass
If you know how 'AIs' in fighting games work, especially in things like MUGEN, this is a good laugh. It's essentially button mashing. The faster button masher would win every time.
Use Mario Kart to test them out all at once.
There is no learning if there is no Fine Tuning, right❓ How is learning accomplished❓
This seems stupid if there is no learning or Vision. ❗
How this works, limitations, and future improvements would be way more useful than Installation instructions.
Will it run Doom LOL?
Already did haha, saw a vid about it less than a week ago
this guy is sooo good
That AI is a free win, not impressed in the slightest with this.
While it is technically not illegal for you to dump the ROM off of the game you own yourself, that does not make it legal for you to download the ROM for a game that you own.
AI just took all the gaming jobs. lol
No place to dl safe ROM.
Waiting for GROK VS LLaMA Instead of Elon and Zuck canceled Smackdown 😂😂😂🎉
Croq would win.
I smell an april fools joke
cool but the llm models don't actually learn anything or improve. Interesting for 5 minutes
If I make a bot to spam heavy kick it will beat the LLMs so no I would say this is a terrible way to evaluate an LLM.
Bad YouTube... stop recommending me this video
Doggone it, it's April Fools'..
We are far from something meaningful. At this point the legacy CPU is still way more impressive.
April fool's
Meh .. most people are gullible or I am too jaded
this is not a valid test case. lol. wtf.
April fool's?
useless project
first to watch, first to comment
And still nobody gives a shit 😂
@@ArianeQube yeah cuz you are just dumb
Great, we're training them to be ninjas🥷