Your Favorite LLMs BATTLE In Street Fighter - New Benchmark!! (Tutorial)

Matthew Berman


Published 31 Mar 2024
LLM Colosseum is a new way to think about AI benchmarks: let them battle it out in Street Fighter! Which LLM will win?
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? ✅
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
Rent a GPU (MassedCompute) 🚀
bit.ly/matthew-berman-youtube
USE CODE "MatthewBerman" for 50% discount
Media/Sponsorship Inquiries 📈
bit.ly/44TC45V
Links:
docs.diambra.ai/#installation
github.com/OpenGenerativeAI/l...
Science & Technology

COMMENTS • 143

@matthew_berman 2 months ago ⁺¹⁴
Is this a good LLM benchmark?
@lio1234234 2 months ago ⁺⁶
I believe it could be for benchmarking some aspects of reasoning, until models start to become trained to be good at this specifically (as with all benchmarks, this is when they all become invalid).
However, only if we have this street fighter instance wait until both models have provided their output, else we're also (and most likely predominantly) measuring their inference speed. Tokens/s on a given machine + latency. Then again, if your use case is latency-dependent then fair enough, it is a decent benchmark as it is.
@yonatan09 2 months ago ⁺⁵
Testing usability and latency. I can't believe the prompts are just descriptions in text, not even pictures.
@rootor1 2 months ago ⁺⁶
As long as the key to winning matches in this game is LLM inference speed, it is good for measuring that but nothing else. The most important characteristic for achieving AGI is complex reasoning and planning capability, and that has nothing to do with this game. Maybe a complex turn-based strategy game (like chess or Go) would be much more interesting for measuring more useful characteristics of language models. This game is just a funny curiosity, in my opinion.
@dasistdiewahrheit9585 2 months ago
Playing Go might be. I suspect large models would have an advantage here, provided they manage to survive the game long enough at the beginning.
@MoosaMemon. 2 months ago
@rootor1 True, it would be interesting to watch LLMs play chess or Go.
@anthonyjobey8821 2 months ago ⁺⁴
That’s truly amazing, love how people are thinking about different ways to compare models. So much fun
@haroldpierre1726 2 months ago ⁺²⁸
Crazy stuff. But I love that there is a new benchmark. I am so happy to be living during these times.
@MoosaMemon. 2 months ago ⁺¹
Sameee
@wtcbd01 2 months ago
A blessing and a curse.
@greengoblin9567 2 months ago
Open AI GPT 3.5 is goated as expected. When Andrew Ng was talking about how GPT 3.5 multi agent swarms are comparable with GPT 4, even though GPT 3.5 is significantly worse on its own.
@AINEET 2 months ago ⁺¹²
Fuuudge. I thought this was the coolest then remembered the day
@dreamphoenix 2 months ago ⁺¹
Very cool! Thank you.
@ps3301 2 months ago ⁺⁶
How does each LLM know where its fighter is on screen? How does it know if the character has jumped? These LLMs can't actually see the screen. Does it use the character's coordinates to locate its own character?
@christianherrera4729 2 months ago ⁺¹
I think the emulator gives certain outputs, like an API, that the LLM is fed. Certain data points are given to it so it knows what's happening. People do this with other games like Trackmania to train AIs.
@jeffg4686 2 months ago ⁺³
You're a madman. What are you putting out like 3 videos per day?
@markmuller7962 2 months ago ⁺⁵
Someone should make this for Monopoly, imagine having human-like bots that can trade cards and stuff
@pin65371 2 months ago ⁺³
I think that would be better than this. This really seems like it would be more for testing speed. There really aren't many options that the LLM has to think through with this, and thinking far in advance doesn't really help all that much either. Monopoly would work great for testing logic.
@markmuller7962 2 months ago
@pin65371 Ikr, I'm surprised we don't have a real AI Monopoly where the AI can chat and trade. I've seen one on the Android store, but the AI can't chat, and the LLM is only available at the max difficulty level, annoyingly trying to trade every single turn (and btw what if I don't want to play at max difficulty?...)
Edit: I want literally ChatGPT to play monopoly with me 😅
@wtcbd01 2 months ago
Great find and reporting. Definitely will grab people's attention.
@mattgenaro 2 months ago ⁺²
I'm eager to see which LLM will be Daigo level!
@lazyautomation3481 2 months ago
I'd watch the Punk AI streams. Can AI teabag?
@TheFocusedCoder 2 months ago ⁺³
Really coool, I trust this more than traditional benchmarks for sure
@GBS.NOCODEAI 2 months ago ⁺⁷
Thanks!
@matthew_berman 2 months ago ⁺¹
Thank you!!
@jelliott3604 2 months ago ⁺²
So, let's see a Quake3 wasm deathmatch between 6 LLM bots.
Far more interestingly, let's see some squad based co-op gaming
@TiagoTiagoT 2 months ago
Would be interesting to compare the results of the various matches with equivalent matches run such that the emulator is paused until both LLMs have provided an answer.
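A rough sketch of why pausing the emulator would matter, assuming made-up latencies and a made-up decision interval (none of these numbers come from the project). It only counts how many actions each model would get in a match under the two modes; it does not model the game itself.

```python
def count_actions(latencies_s: dict[str, float], match_s: float,
                  step_s: float, paused: bool) -> dict[str, int]:
    """Count the actions each model gets during one match.

    paused=False: the game clock keeps running while a model thinks, so a
    model can act at most once per max(latency, step) seconds.
    paused=True: the emulator waits for every model, so each model acts
    once per decision step regardless of its latency.
    """
    steps = int(match_s // step_s)
    if paused:
        return {name: steps for name in latencies_s}
    return {name: int(match_s // max(latency, step_s))
            for name, latency in latencies_s.items()}


# Hypothetical response latencies in seconds (illustrative only).
latencies = {"fast_model": 0.5, "slow_model": 2.0}

real_time = count_actions(latencies, match_s=60.0, step_s=0.5, paused=False)
lockstep = count_actions(latencies, match_s=60.0, step_s=0.5, paused=True)
print("real time:", real_time)
print("paused:", lockstep)
```

Under these assumptions the slower model gets a quarter as many actions in real time, while the paused mode gives both the same number, so any remaining difference in results would come from the choice of moves rather than speed.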
@162arun 2 months ago
It would be amazing if we could see the system prompts as well, just curious to understand how they work under the hood.
@rodrimora 2 months ago ⁺⁶
so wait, this is not an April's fools? lol
@mpiciulo 2 months ago
Where would be the best place to purchase/download the SF3 ROM file that would work with this?
@ThomasJDavis 2 months ago
Already testing A.I. for the battlefield.
@JNET_Reloaded 2 months ago
I got the ROM and I'm on Linux on an RPi 5. What emulator shall I use to play it manually?
@StellarStoic 2 months ago
This is just a game but in a decade wars will be fought between good AI and a Bad AI. Scary times ahead of us and this entertaining presentation is just a beginning.
@gweneth5958 2 months ago
Can we also play with an LLM this way? I'd love to try that. But it definitely is interesting to watch. Just one year and now they even play Streetfighter against each other... I'd prefer Death & Alive too =)
@ohardest 2 months ago
This is pretty dope... I thought it was going to be an April Fools joke, but it's real.
@4.0.4 2 months ago
Wait, are you sure? Really?
@cacogenicist 2 months ago
I would guess that Claude 3 Haiku would be really good.
@NoCodeFilmmaker 2 months ago
I can't wait to challenge some LLMs lol
@matthewbond375 2 months ago ⁺¹
This is very impressive. It's also alarming. LLMs are apparently already set to perform any job that requires a human to interpret the current situation, and to input "the right moves" into a system to achieve a win condition. Over and over.
This is applicable from accounting to air-traffic control. Anything a person at a terminal calls a job. Wow.
@marcfruchtman9473 2 months ago
Nice.
@quebono100 2 months ago
Wow this is mindblowing xD
@SasquatchBioacoustic 2 months ago
Teaching AI how to kick our asses in a street fight, great idea.
@TiagoTiagoT 2 months ago
edit2: GAH, wrong name, I meant MUGEN, not MAME
Why don't they use an opensource fighting game engine like MAME that allows you to just use custom characters? (that's open source, right?)
edit: Or does it also need ROMs based on original games even when using custom characters?
@pally8868 2 months ago ⁺¹
I ran this and the prompt said Street Fighter 2: Turbo is the only game worth playing.
@neanda 2 months ago
can you give us more updates on the LLM Street fighter league, it's a much more human way of benchmarking these models
@executivelifehacks6747 2 months ago
I can relate to these larger, slow LLMs being smoked in blitz and bullet vs rapid chess. Damn you "quick on your feet" people. 😅
@nathanwilton3383 2 months ago
Great video. One item to add if you are installing by following the video: before running python3 -m pip install diambra-arena, run python3 -m pip install diambra first.
@ThePrash123 2 months ago
Thank You :)
@screamingiraffe 2 months ago
Interesting.. thanks for sharing
@CrudelyMade 2 months ago ⁺²⁸
Basically mashing buttons more quickly can lead to a victory over somebody who's trying to strategize in this kind of game. If you let the same models play a more turn-based strategy game that would show you the quality of their output over the quantity of their output.
@MultiNiktar 2 months ago ⁺²
You are sooo right. Remember project CICERO from Meta? I wish someone did this Colosseum thing but CICERO style
@kr4dh4x0r 2 months ago ⁺⁶
Not true at all. Tell me you don't know the game without telling me you don't know the game.
@chrisbraeuer9476 2 months ago ⁺⁶
Street Fighter is not a button masher!
@MultiNiktar 2 months ago
@kr4dh4x0r You missed the point
@MultiNiktar 2 months ago
@chrisbraeuer9476 He never made that statement
@hacker57 2 months ago
Sir, I am from India and love your content a lot. It helps me think about how to use tech out of the box ❤❤❤❤❤
@WhyteHorse2023 2 months ago
We should use OpenDevin to make Open Street Fighter 3. :)
@ShouryanNikam 2 months ago
Awesome man! Keep featuring new open source projects please!
@PunmasterSTP 2 months ago
I can't wait until we have LLMs duking it out in Starcraft. I still can't believe they got AlphaStar to play it, and it'd be fascinating to watch AI pick up more strategies and get thoroughly better than human players.
@elyakimlev 2 months ago
Someone should do a Heroes 3 variation of that.
Same army: 3 Archangels, 10 Champions, 25 Monks, 30 Crusaders, 50 Griffins, 70 Marksmen, 100 Pikemen.
@alchan230 2 months ago
Will the OpenAI team be ready for the 2025 Esports World Cup?? 🤔
@user-qr9qm3bt5d 2 months ago ⁺¹
Oh man you're the best. Thanks
@reifuTD 2 months ago
Okay, but can they do a Zelda randomizer speedrun race?
@avi7278 2 months ago ⁺²
three street fighters are in a room... lol
@jaysonp9426 2 months ago ⁺²
This should be the only benchmark
@Silberschweifer 2 months ago
By the way, with this kind of framework, why not make an LLM that drives... GTA and later real cars
@fire17102 2 months ago ⁺¹
Anyone tried it on Groq?
@angloland4539 2 months ago
❤
@fabrizio-6172 2 months ago
Awesome pick ❤
@danish3249 2 months ago ⁺¹
Imagine AI Vs Tekken 8 world champion.
@rohitdas490 2 months ago ⁺³
Add Groq and you will win all the time
@yonatan09 2 months ago
If they're this fast maybe we don't need Groq
@exzld 2 months ago
Groq cannot be run locally realistically. Doesn't it have huge VRAM requirements?
@rohitdas490 2 months ago
@exzld I am talking about GroqCloud; they have very fast inference speed.
@mohamaddelkhah 2 months ago
Damn it... Why today? This would've been so cool...
@hunterking4228 1 month ago
Imagine AI pulling an Evo Moment 37
2 months ago
At same time new models are becoming more intelligent and amplifying hardware speed, old models are becoming faster and faster!
@emil8367 2 months ago
cool, works but almost forgot to use requirements.txt, sh..t happens 😀 Many thanks for sharing !
@Solidfreeman01 2 months ago
Who did this? It clearly had to be street fighter 2 special champions edition on 8 stars speed
@NOTNOTJON 2 months ago ⁺²
Nice light-hearted fun.
AGI will be achieved when models can win at dwarf fortress.
@stickmanland 2 months ago
Woah! Best benchmark so far! IMPOSSIBLE TO PRETRAIN ON THE DATA!!
@MagusArtStudios 2 months ago
Video game guides made this possible in the first place. There are countless movesets for Street Fighter on the internet.
@AndrewARitz 2 months ago
It's an LLM, it's already trained :''D
@stickmanland 2 months ago
@MagusArtStudios But the fact still remains, the LLM can't just memorize the answers. There are no answers! It has to actually think in order to win.
@luisfonseca9045 2 months ago ⁺³
How it works:
The Python code "translates" the graphics in the game by narrating them. For example, it feeds the LLM: "You are far from the opponent, you should move closer" or "You are close to the opponent, you should attack". Of course, this is all done algorithmically (like calculating the distance between players) and leads the LLM very restrictively toward what it should output. It goes on to narrate the actions of your opponent and adds leading hints to the prompt like "You have more than 30 energy, you could cast a super power".
Fun to watch, but not really a fair comparison of each LLM's agency and intelligence. Considering how restrictive it is, its use as a benchmark is most likely an April Fools' joke. But seriously, though, using AIs as game agents to benchmark them is not a bad idea at all.
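A minimal sketch of the narration step described in this comment, assuming a hypothetical FighterState snapshot and made-up thresholds. It is not the project's actual code, only an illustration of how numeric game state could be turned into the quoted hints.

```python
from dataclasses import dataclass


@dataclass
class FighterState:
    """Hypothetical snapshot of one fighter; field names are illustrative."""
    x: int       # horizontal position on screen
    energy: int  # super meter value


def narrate(me: FighterState, opponent: FighterState,
            close_range: int = 60, super_cost: int = 30) -> str:
    """Turn numeric game state into a text prompt with leading hints.

    close_range and super_cost are assumed thresholds, not project values.
    """
    distance = abs(me.x - opponent.x)
    lines = []
    if distance > close_range:
        lines.append("You are far from the opponent, you should move closer.")
    else:
        lines.append("You are close to the opponent, you should attack.")
    if me.energy > super_cost:
        lines.append(
            f"You have more than {super_cost} energy, you could cast a super power."
        )
    return "\n".join(lines)


far_prompt = narrate(FighterState(x=100, energy=45), FighterState(x=300, energy=0))
near_prompt = narrate(FighterState(x=100, energy=0), FighterState(x=120, energy=0))
print(far_prompt)
print(near_prompt)
```

The LLM only ever sees strings like these, which is why the comment calls the setup restrictive: the hints already contain most of the decision.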
@fire17102 2 months ago
This (the pipeline) is only gonna be improved. I suspect a version of this plus an AlphaZero-esque pipeline would push it to better and better plays, like everything else we've seen
@AndrewARitz 2 months ago
@fire17102 The AI that came with the game 30 years ago is already miles ahead of anything you are suggesting, and uses at least 100,000x less power to do it.
@voidnull4282 2 months ago
Haikuu must kick ass
@DeepThinker193 2 months ago
If anyone knows how 'AIs' in fighting games work, especially in things like MUGEN, this is a good laugh. It's essentially button mashing. The faster button masher would win every time.
@lun321 2 months ago
Use Mario Kart to test them out all at once.
@ScottzPlaylists 2 months ago ⁺¹
There is no learning if there is no Fine Tuning, right❓ How is learning accomplished❓
This seems stupid if there is no learning or Vision. ❗
How this works, limitations, and future improvements would be way more useful than Installation instructions.
@kirkjaaa 2 months ago ⁺¹
Will it run Doom LOL?
@fire17102 2 months ago
Already did haha, saw a vid about it less than a week ago
@lukmonabdulsalam 2 months ago
this guy is sooo good
@kr4dh4x0r 2 months ago ⁺¹
That AI is a free win, not impressed in the slightest with this.
@TheYashakami 2 months ago
While it is technically not illegal for you to dump the ROM off of the game you own yourself, that does not make it legal for you to download the ROM for a game that you own.
@aaronfallis 2 months ago
AI just took all the gaming jobs. lol
@tonywhite4476 1 month ago
No place to dl safe ROM.
@shadyworld1 2 months ago
Waiting for GROK VS LLaMA Instead of Elon and Zuck canceled Smackdown 😂😂😂🎉
@mltiago 1 month ago
Croq would win.
@burnt1ce85 2 months ago ⁺¹
I smell an april fools joke
@screamingiraffe 2 months ago
cool but the llm models don't actually learn anything or improve. Interesting for 5 minutes
@Sean4B 2 months ago
If I make a bot to spam heavy kick it will beat the LLMs so no I would say this is a terrible way to evaluate an LLM.
@petrz5474 2 months ago
Bad YouTube... stop recommending me this video
@charlie11ng42 2 months ago
dogone it, its april fools..
@evalangley3985 2 months ago
We are far from something meaningful. At this point the legacy CPU is still way more impressive.
@petrz5474 2 months ago
April fool's
@petrz5474 2 months ago
Meh .. most people are gullible or I am too jaded
@oldleaf3755 2 months ago
this is not a valid test case. lol. wtf.
@bigglyguy8429 2 months ago
April fool's?
@hqcart1 2 months ago ⁺¹
useless project
@lukmonabdulsalam 2 months ago ⁺²
first to watch, first to comment
@ArianeQube 2 months ago ⁺⁵
And still nobody gives a shit 😂
@lukmonabdulsalam 2 months ago
@ArianeQube yeah cuz you are just dumb
@rocketPower047 2 months ago ⁺²
Great we're training them to be ninjas🥷
