I think the best way to test the coding models would be to give them existing code and ask them to add a function to it, since that is the main use case for coding models (for now at least), rather than writing entire programs from scratch.
@@batuhankaral7320 o1 is pretty good at that; the only downside is the recursion/use of private functions/redundancy in most cases. You still need to specifically state how you want it to structure your code, but other than that, I still catch it making syntax errors, like 4o.
The fun part starts when you go beyond 2,000 lines of code. Prototyping a short program is not what you need AI for.
@@punk3900 that’s why you write static tests and contextualize outputs. Token limitations won’t change in the near future, so making use of techniques to overcome those limitations is the way to go.
What about a test of whether the model can understand existing code?
It's so bad it will generate code completely different from what you give it.
@@JINIRPA exactly!
(7:02) *kyū* is a Japanese term for a rank used in martial arts; it indicates the number of steps a student is away from reaching "dan" (master). So 1 kyū outranks 2 kyū, etc.
Sad when a man has reached that age without ever strapping on a gi.
Summary:
• Qwen 2.5 Coder 7B was the overall winner, being the fastest and most effective, especially with tasks like the Snake game.
• Qwen stood out for its speed and practicality for locally run AI coding assistants.
• Matthew encourages viewers to suggest other practical tests for coding models.
Is it summarized by AI?
@ Transcribed by Whisper and summarized by AI, yes. Using a script I'm building into a website.
Here's a testing suggestion: have the target models complete one fairly simple task, such as a snake game. Then perform two iterations: 1) cosmetic changes, with a border around the play area, color changes, and different head and tail blocks for the snake; 2) functional changes: have the food appear preferentially in different areas; only the head eats, so if food randomly appears under the body or tail it is not eaten and you have to 'go around again'; food is initially yellow (shrinks the snake) but randomly turns red (elongates the snake); scoring; etc.
The idea is that models can have more difficulty modifying existing code without breaking it entirely. Guess how I know, or at least why I think this. #FeedTheAlgorithm
I know what you mean
I agree it's a more real world coding test
I agree when you start making changes - this is where they mess it up
Missed these kind of videos! They're the best!
Nemotron seems to be really good. Also, Mistral's Codestral. I am surprised you didn't test that. Also please don't test snake in python. I am pretty sure even non-code-oriented models can output it nowadays.
0:48 Wow, that PC is so cool, I'd like you to make a video talking about hardware, especially what's needed to run models, from the smallest to the largest.
Kyu is part of an Asian leveling system. It's pronounced 'que'. Kyus count down, so higher numbers are weaker. After 1 Kyu, you switch to dan (pronounced 'don') and start counting up. In karate, the colored belts are the Kyu levels and the levels of black belt are the dan levels. Dan is where you're considered to actually start to know what you're doing.
Not really. An 8k is a pretty strong club player in Go (the game the system, later adopted by Karate, was based on). A 5k is well above average, and 3k is respectably strong. 1D is like black belt. You've far surpassed "knowing what you're doing" at that stage. For the chess players out there, think 2k+ ELO.
@andrewcarnegie5805 "knowing what you're doing" is defined differently in the West than it is in the East. It is also different across the centuries and in different communities. This is splitting hairs. I am describing the abstract intent.
If you are going to mention a machine to do AI work, it would be nice if you linked to the exact spec you used in the testing.
Comparing these models to Claude, ChatGPT, and the Meta free tier would be a ***very*** valuable/helpful comparison!!
Any chance of seeing this in the near future?
Thanks for the great AI content, easily the best channel out there for this content!
I can't wait for local models to be on par with the current Claude Sonnet and ChatGPT o1-mini. I get MUCH better results from them, and even more with the API, paying pennies per file to get great results. These challenges are dead simple for Claude, using Cline (a plugin for VSCode) as the interface. Cline just works for me and I've created 10,000 lines of code with it.
@@JohnLewis-old That's never gonna happen.. at least in this lifetime
@@chandrasekhar01 What makes you say that? These models aren't miles behind. This version of sonnet is incredible but 6 months from now, the open source models will be on par with it.
@@chandrasekhar01 why do you think that? Honestly sounds clueless on your part. You can already run models on a desktop that are way better than GPT-3.5, just as an example.
and Google Gemini
Please test with tasks like writing html+css+js code. A lot of people seem to be using these models to create landing pages and micro apps.
12:26 "...and if you want a beast of a machine that can load up models up to 96 GB into VRAM, definitely check out the Dell Precision 5860..." How much in kidneys is it?
Only two, you can afford it. :-)
@@r9999t 🤣🤣🤣
@@r9999t I hate to be the party pooper, but you actually do need at least 1 kidney to live. I know, sucks, but I guess I'm gonna have to pass on that beast of a machine. Unless, you want to sponsor me? 🤣
@@mrdevolver7999 While I try to be kind, I do sorta need my kidneys. Besides, I'm old, you really don't want my kidneys.
@@r9999t Hmm, I'm sure the beast of a machine wouldn't feel offended by your old kidneys, but okay, maybe next time. 😉🤝
My benchmark, which has proven to be too hard for any model I've tested under 20B parameters, is to ask them to write a simple ray casting program like Wolfenstein 3D. Easy for models like GPT-4o or Claude 3.5 Sonnet, possible with models like Llama 3 70B or Codestral, but I've yet to test a model under 20B parameters that's able to do it.
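For anyone curious what that benchmark actually asks for, here is a very stripped-down sketch of Wolfenstein-style raycasting rendered as ASCII columns; the map, player position, field of view, and step size are arbitrary illustrative choices, not anything from the video.

```python
import math

# 1 = wall, 0 = empty; the player stands inside this tiny enclosed grid map.
MAP = ["11111111",
       "10000001",
       "10010001",
       "10000001",
       "11111111"]
WIDTH, HEIGHT, FOV = 60, 20, math.pi / 3
px, py, pangle = 2.5, 2.5, 0.0            # player position and view angle

def cast_column(col: int) -> int:
    """Cast one ray for screen column `col` and return the wall height to draw."""
    ray_angle = pangle - FOV / 2 + FOV * col / WIDTH
    dist = 0.0
    while dist < 16:
        dist += 0.02                       # march the ray in small steps
        x = px + math.cos(ray_angle) * dist
        y = py + math.sin(ray_angle) * dist
        if MAP[int(y)][int(x)] == "1":     # hit a wall cell
            return min(HEIGHT, int(HEIGHT / (dist + 1e-6)))  # nearer walls draw taller
    return 0

screen = [[" "] * WIDTH for _ in range(HEIGHT)]
for col in range(WIDTH):
    h = cast_column(col)
    for row in range((HEIGHT - h) // 2, (HEIGHT + h) // 2):
        screen[row][col] = "#"
print("\n".join("".join(row) for row in screen))
```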
My second-hand Apple Mac Studio M1 Ultra with 64GB RAM and a 60-core GPU runs at 60 tok/sec with Qwen2.5-Coder 7B 8-bit! And it costs only 15% of the Dell workstation.
So the M1 chip can use RAM as VRAM? That's a game changer (prepping to migrate from Windows lol)
@@shramscar the new MacBook Pro M4 with 128GB RAM can run all these models... it's getting super interesting. Eager to see what Llama 4 offers.
@@the42nd Yeah. This Apple unified memory architecture makes a whole difference, using the same RAM for CPU and GPU. My Nvidia A5000 GPU with 16GB VRAM sounds like a joke compared to what the M chips do.
@@the42nd Running it and getting decent speed in tokens is a different matter!
Thank you for this awesome video.
Which LLM would you choose if there were no more internet connection?
A basic AI assistant, or an AI assistant with more capabilities? Computer vision, voice chat mode?
Or would you most likely build your own cloud network?
Thanks again for your vids
Is there a model for LM Studio that can "think and reason" like the OpenAI o1-preview does? Would love to have a local model that could solve a more complex set of tasks and design requirements.
Currently I use LLMs to help me design by picking through different coding architectures and approaches, the actual structure and design choices for the program, not the coding by itself.
I love the Qwen2.5 models, they are some of the best and fastest models for their size. My only complaint is that the smaller 1.5B and 3B instruct models are VERY particular about the wording of the prompts. But this is common among smaller models, so it's definitely not a deal breaker.
Generally, microcircuits are getting so small that a few molecules knocked off by static charges, magnetic field fluxes, or radiation particles passing through and knocking off orbiting charges can affect timekeeping.
Also micro power fluctuations, which is why the Tesla gigacluster runs off batteries charged from the grid.
It would be more interesting to give them a more complex real world problem and then give them a certain number of iterations or tokens to produce the final result. Also, it would be very interesting to see how they handle codebases of different sizes.
4:46 - that Dell AI workstation came with Windoze and not Linux?
On top of being very informative, this is also so much fun to watch.
I love the Qwen2.5 models so far; they're really good and they don't use too much memory.
just ask it to code ever more complex games, more complex websites.
As good and useful as these larger model responses are, I've recently become spoiled by Cursor and their AI autocomplete. The fact that I don't even have to form the question makes for faster typing at minimum, and good bug fixing when needed.
Tangentially, I think testing these models with naked requests quickly becomes less useful and it will come down to which can properly input your existing codebase.
The "hard" problems are not actually coding challenges, but math challenges. The task with fast prime ouput means you need to apply e.g. Eratosthenes sieve instead of going through all numbers and testing if they are prime. It's just stupidly phrased in the task
I wonder if you tested quantized versions of the models, in which case they shouldn't be as good as their full BF16 versions.
I see you were running GGUF quantized models, what quantizations were they?
I think another eval would be to give it terrible code, like sorting a numerical array by using timers, and see how it corrects it. Or like ask it to rewrite a block of code to be smaller yet maintain the same functionality
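For reference, the "sorting with timers" idea is basically sleep sort; here's a minimal sketch of the kind of terrible input code you could hand the model (purely illustrative, and something it should ideally rewrite as `sorted(numbers)`):

```python
import threading
import time

def sleep_sort(numbers):
    """'Sort' by sleeping: each value waits proportionally to its size, then appends itself.
    Wildly inefficient and only probabilistically correct -- exactly the kind of code a
    model should recognize and replace with a normal sort."""
    result, lock = [], threading.Lock()

    def worker(n):
        time.sleep(n * 0.01)               # smaller numbers wake up first
        with lock:
            result.append(n)

    threads = [threading.Thread(target=worker, args=(n,)) for n in numbers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

print(sleep_sort([7, 1, 4, 2, 9]))          # usually [1, 2, 4, 7, 9]
```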
Now that you have so much firepower in those A6000 cards, I would have loved to see what can be done with much larger models. Like 70B or 200B.
With all your VRAM you should be running not quantized but full-precision FP16 models. Quantization hurts smaller models' performance much more than bigger models'.
Nice video. I would like to see different IDEs that allow for different local models to collaborate in an agentic manner.
So the RTX 6000 is more power efficient than the RTX 4090, but it has more CUDA cores and twice the VRAM (48 GB vs 24 GB). Apparently the RTX 4090 is like $1500 CAD and the RTX 6000 is like $11-12k CAD. Thanks for doing the tests, I will try to remember Qwen.
The electric circuits on most semiconductors become more prone to 'mistakes' and begin to break down around 80 degrees C. That is different from 80 F: 100 degrees C is water boiling at sea level and zero degrees C is rain water freezing at sea level (France, a long time ago). Water frozen at 32 and boiling at... Is the scale the same?
I LOVE the idea to use codewars!!! What a great way to avoid contamination in the datasets :) great job
Matthew man. I love your videos. But what is up with the stupid thumbnails? Come on man. Just do a normal video, be yourself and more people will like it.
😂😂
He is just doing what the algorithm (and people clicking behind the algorithm) rewards him for
@@DannerBanks Fuck the algorithm man. I think he will do great if he ditches the stupid thumbnails and the positive vibes. Just be yourself and talk to us. Share your passion and enjoy.
@@kokomal bro, it's a thumbnail. Take a deep breath. Touch grass
Agree
Nice machine. Curious: you mention it's pretty fast "even with other models loaded", but memory usage shouldn't matter, right? So long as it's not at 100% and spilling to disk. At least from a regular main-system CPU RAM perspective that's pretty much how it works. I'd be interested to know if it's different for VRAM and AI.
I've been using Qwen in gen-AI image creation and it pretty much smokes everything else.
Matt, could you use Nvidia's nemotron 70b for coding tasks just to see how it does?
I guess building some UI would be a neat test case
I tried all the models you mentioned; maybe they are OK for Python, but they performed badly for me on web coding (frontend/backend). Qwen has problems with understanding the prompt in general. Of all I tried, dolphin-2.7-mixtral-8x7b is the best so far for coding.
Good to see smaller models getting better, still a long way to go though.
Need a baseline to compare. Like Llama 3.2 3B or 11B.
Please do nemotron with some complicated stuff. You'll like it
codegemma, codellama, and codestral would be interesting
Evaluating coding models is tricky because their performance varies significantly across different programming languages. This makes it difficult to declare a single "best" model, as one might excel in Python but struggle with JavaScript, for example. I believe it would be more insightful to evaluate language models based on their performance in specific languages, rather than relying on a single general-purpose benchmark that may not accurately reflect their strengths and weaknesses in each language.
Artificial intelligence is like cruise control on a vehicle: it still requires supervision... Are there any old 'back doors' that haven't been used in a while?
I don't know if you noticed, but today everyone had access to o1 (not the preview). Jimmy Apples posted a link and it actually worked, but I don't think it works anymore. Did you have a chance to test it and make a video?
I know it is not "open source" (it has a research license), but maybe pit these models against Codestral?
From what I can tell, you were using Q4_K_M versions of the models and not Q8, which, as always, isn't that much of a difference, but it's still a drop in precision.
DeepSeek Coder V2 Lite is a MoE with just 2.4B active, it should be faster than Yi and Qwen. Must be something wrong with your setup.
Does Yi know Yii?
Those cards are beasts!!! Nvidia hear my slow processing and low-vram cries...
Also if you could try out some of the new VPTQ quantized models, these look interesting for local inference or at least running in a Google Colab runtime with an L4 or A100
Why not Deepseek 2.5?
Does anyone know of a tutorial for this to run on M1 macbook pro?
Make some test with bigger context window in coding
Dude, badass video.
I'm jumping into the AI arena on safety and info. I have posted the first "semester" of AI 101 on my channel: 20 classes (episodes) on the players and basic info on AI/AGI for beginners and where the industry currently is, made using NotebookLM, called the 'Listen Grasshopper Podcast'.
You could also try IBM's Granite 3.0, Matthew
It is impossible that two A6000s are so slow doing inference with a 7B model.
We did a benchmark test of these local coding LLMs at our company, in our real setup in which we write real code. We used a test assignment so as not to disclose our real confidential codebase (we were also assessing cloud-based models such as ChatGPT).
The result was that using these LLMs actually made us less efficient, even compared to writing the code ourselves using just the IDE and its features.
Yeah I’ve heard from numerous professional developers that LLM coding isn’t good for much besides basic coding and completely falls apart when you try to use it for anything large or complex
@HCG exactly. Our test wasn't even that complex.
Where it really shone and sped us up compared to manual work was writing documentation (Javadocs) for the methods. That was quite accurate and saved quite some time.
Did you use OpenAI's o1 model in the test, out of interest? That seems most likely to help.
I don’t doubt your results, but I will say that prompt construction as well as breaking the problem up into a sequence of steps designed to reach a goal makes a huge difference in the outcome.
@@vassovas it was one of the tested models. But this video is mainly about alternative LLMs that you could also run locally.
Maybe include a first-person 3D shooter for the really strong models, where zombies hunt the player, and even animate the zombies' movement. I let o1 do that inside a single HTML file and it got it done with 3 prompts; I'm sure with better prompting it could have been 1 or 2. o1-preview struggled much more. I think that is a good test for next-gen models.
Why not try some other languages such as Go or Rust (they probably wouldn't time out)? Thanks for the videos.
Hi Matthew, given that you have a beast of a machine (and also that the content is sponsored by Nvidia), I don't quite understand why you didn't include nemotron:70b-instruct-q8_0. It would for sure provide better results.
Would love to see you do the same but with models that fit into 24GB VRAM (RTX 4090 compatible).
Matt, you're awesome... can you ask these models to handle WebGPU / Three.js / GSAP tasks? This is the pinnacle of web dev ATM, combined with the MERN stack + React.
For tests like this where you use local models, I think you need to increase the strength of the prompt context. Anyone expecting to run local models with weak prompts is likely to be disappointed.
I was thinking the same. I work with LLMs at work, and we develop prompt templates, then test and evaluate them with inputs to improve successful outputs.
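A minimal sketch of what such a prompt template might look like; the wording and fields here are just an example of adding stronger context, not a template from the video.

```python
PROMPT_TEMPLATE = """You are a senior {language} developer.

Context:
{context}

Task:
{task}

Constraints:
- Return a single, complete, runnable file.
- Do not change existing function signatures.
- State any assumptions as code comments.
"""

def build_prompt(language: str, context: str, task: str) -> str:
    """Fill the template; structured prompts make small local models less likely to drift."""
    return PROMPT_TEMPLATE.format(language=language, context=context, task=task)

print(build_prompt(
    language="Python",
    context="Existing module: a terminal snake game built on curses.",
    task="Add a pause feature bound to the 'p' key.",
))
```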
Hello, great content as always, sir. I have a video idea: can you use these smaller or open-source models but with good prompting, and then compare the results with closed-source models given simple prompting?
Claude 3.5 Sonnet is the BEST at coding.
Why not see how they do with other languages
cool idea, maybe go a bit more in depth: code quality efficiency, maintenability, etc...
They all fail hard when you give them a simple (to humans) PowerShell request.
YEAH!!!! best video till now.
Multiple one board for each square with negative contrasts
Just make different top lists, and you can make all sorts of comparisons between the models. It's interesting to watch, especially since you have a cool PC now.
Wait, the assignment is "create a regular expression that..." and you told it "don't use the regular expression library". So is the assignment to create an untested regular expression that someone who did have access to a regular expression library could use to complete the task? Or to create a regular expression library from scratch, and then use it to complete the assignment? No wonder HAL won't open the pod bay doors; it appears to be GIGO.
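If the second reading is intended (building the matching machinery from scratch), it would look something like this tiny matcher for a subset of regex syntax (literals, '.', '*', and '^'/'$' anchors); this is purely an illustrative sketch, since the actual pattern from the video isn't shown here.

```python
def match(pattern: str, text: str) -> bool:
    """Match `pattern` anywhere in `text` without the `re` library."""
    if pattern.startswith("^"):
        return match_here(pattern[1:], text)
    return any(match_here(pattern, text[i:]) for i in range(len(text) + 1))

def match_here(pattern: str, text: str) -> bool:
    if not pattern:
        return True
    if len(pattern) >= 2 and pattern[1] == "*":
        return match_star(pattern[0], pattern[2:], text)
    if pattern == "$":
        return not text
    if text and pattern[0] in (".", text[0]):
        return match_here(pattern[1:], text[1:])
    return False

def match_star(c: str, pattern: str, text: str) -> bool:
    """Match zero or more of `c`, then the rest of the pattern."""
    i = 0
    while True:
        if match_here(pattern, text[i:]):
            return True
        if i < len(text) and c in (".", text[i]):
            i += 1
        else:
            return False

assert match("^ab*c$", "abbbc")
assert not match("^ab*c$", "abd")
```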
Can you conduct an interview as if it were a human, please? You can pick any model for this, but you should have a general set of questions and a context for the kickoff.
The coding challenge should be in parts, e.g. make an in-memory database, then make it an API, and if you are feeling crazy make it produce a docker-compose file for the application.
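As a sketch of stage one of that challenge (the class and method names are just illustrative), the in-memory database could start as small as this, with the later stages wrapping it in an HTTP API and a docker-compose file:

```python
class InMemoryDB:
    """Stage 1: a tiny in-memory key-value store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        return self._data.pop(key, None)

    def keys(self):
        return list(self._data)

db = InMemoryDB()
db.put("user:1", {"name": "Ada"})
print(db.get("user:1"))        # {'name': 'Ada'}
```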
I wish you had tried Granite Code by IBM.
If you are looking for a hard puzzle to test LLMs, consider this one: "An ardent swain said to his lady love, some years ago, “Once when a week ago last Tuesday was tomorrow, you said, ‘When a day just two fortnights hence will be yesterday, let us get married as it will be just this day next month.’ Now sweetheart, we have waited just a fortnight so as it is now the second of the month let us figure out our wedding day.”"
Seriously Matt, 2x A6000! ... we are not envious at all :)
I don't recommend Qwen 2.5 7B.
In my experience it can generate code but can't edit or fix it.
I gave it code and told it to fix it, but it generated different code, not related to the code I sent it.
You feed it a leetcode question. You know it’s been trained on that.
It could only get the code wrong if it hallucinates.
I need Qwen 2.5 coder 32b
Write a program to solve the Rubik's cube in Python.
For me the biggest deciding factor is the cut-off date for the models; although Qwen2.5 is great, its training data is nearly a year out of date, which is just too much 🙁
Qwen 2.5 Coder was released just a few weeks ago!
Please check your "facts"; confidently stated disinformation is annoying on the internet, don't be one of them.
Qwen 2.5 was released in September 2024.
@@andrepaes3908 That's true, but if you ask it what its training cut-off date was, it says 20241231; however, if I ask what version the Flutter SDK is, I get 3.10.5, which is from Jun-Jul 2023. So for me, its knowledge is around 12-18 months behind.
With chess there are a limited number of moves; now make the board 3D with the same rules.
I am currently testing AIDE ML with gpt-4o-mini and the results are better than with AutoML (AutoGluon, 1h). I haven't tested longer than 24h yet, but it looks like it should be better.
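For context, the AutoGluon 1h baseline mentioned above would look roughly like this; the CSV paths and the "target" column name are placeholders, not the commenter's actual setup.

```python
# Rough sketch of a 1-hour AutoGluon tabular baseline (placeholder data paths).
from autogluon.tabular import TabularDataset, TabularPredictor

train = TabularDataset("train.csv")
test = TabularDataset("test.csv")

predictor = TabularPredictor(label="target").fit(
    train,
    time_limit=3600,            # the "1h" budget
    presets="best_quality",
)
print(predictor.evaluate(test))
print(predictor.leaderboard(test))
```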
GPS timestamp accuracy and echoes
And none of these are able to handle German? Well, 70B is too tiny to handle German properly. Claude is really good in German; ChatGPT is close to Claude, but Claude is ahead. But what about the open-source models and their multilingual capabilities, for coding and/or other models? Meta claims that Llama 3.2 is able to understand German. No, it isn't, except if you talk to it in subject-verb-object structure without separable verbs (which remind me of English phrasal verbs).
It's a mess! What a disappointment! What an exclusion of at least 100 million people.
eXcLuSi0n LOL, dude, what do you expect, it's made in the USA, not Germany; it's not their fault that we are so behind in technology.
At least 100 million❓ Not only is that wrong, you and I both know that a big portion have never learned to speak/write German 😆
If you need translation, just use DeepL; don't make such a fuss, like a kid on the 💩
Have you tried Qwen2.5 72B? The German is decent. It makes occasional grammatical errors, though (wrong gender/declension on less frequently used words). Example (prompt and answer originally in German, translated):
Explain to me why it is not possible to withdraw the interest on a single euro cent from the year 0
The question of why it is not possible to withdraw the interest on a single euro cent from the year 0 rests on several reasons that are historical, economic, and mathematical. Let us look at the most important points:
1. Historical reasons:
Missing currency: The euro cent was first introduced in 2002, when the euro was introduced as a cash currency in the eurozone countries. Before that point there was neither the euro nor the euro cent.
Missing financial system: In the year 0 there was no modern financial system in Europe that could calculate or pay out interest in the form we know today. Interest-bearing business and banks did not exist in their present form.
2. Economic reasons:
Feuerungs- und Nominalwert [roughly, "burn" and nominal value]: The value of money changes over time. Inflation, economic crises, and economic growth influence the nominal value of money. A euro cent in 2002 has a different real value than a euro cent today, let alone in the year 0.
Interest rate structure and interest rate policy: Interest rates are set by central banks and depend heavily on the economic situation and the political framework. These factors in the year 0 are not comparable with those that exist today.
3. Mathematical and logistical reasons:
Compound interest: Compound interest grows exponentially, which means that a very small amount like a euro cent grows into an astronomically large sum over a long time span (from 2023 back to the year 0 is about 2023 years). This amount would be so high that it would exceed total world economic output many times over.
Practical challenges: Even if it were theoretically possible to calculate the interest on one euro cent over the entire time span, there would be practical challenges. Where should the amount be deposited? How should it be stored? Which institutions could hold it?
4. Legal and institutional reasons:
Missing contracts and rights: Without a formal contract or legal title regulating the payout of interest, there is no legal basis for claiming this interest.
Legal framework: Today's laws and contracts have a specific period of validity. Long-term obligations spanning millennia are not provided for and would not be enforceable in practice.
Summary:
The impossibility of withdrawing the interest on a euro cent from the year 0 results from the historical non-existence of the euro and the modern financial system, the enormous economic and mathematical challenges such an undertaking would involve, and the missing legal and institutional framework.
Code an unmaintainable game in Haskell with multiple redundant backdoors and a _random number_ of deliberate 0-day vulnerabilities
With the fourth dimension being time
Just FYI, saying "going head to head" is sufficient for 3 or more competitors. You don't need to add an extra "to head" for each competitor... Good thing there weren't 7 competitors 😂
Great video
4D with the fourth dimension being time
Please do one video on the Supernova model.
1. The RTX A6000 is NOT the RTX 6000; 2. I also have a Titan RTX in my rig, so I'm up around 115GB VRAM, so =P
There is the old Quadro RTX 6000, then the newer RTX A6000, and then the latest RTX 6000 Ada, which is what's used here.
With all that GPU horse power, why didn't you try a massive model just for fun?
Maybe you can try a better test. Most of us are not writing snake games. I used Anthropic to build an app that interfaces with the OpenAI realtime audio API using Golang, with goroutines.
I do not see your snake game as a valid test.
I want to see a model write Doom.