I'm not sure investing $6k in a machine for local models is worth it. DeepSeek V3 is great, but it will take at least 3-4 months for the next SLM to be genuinely useful. I'm pretty doubtful, but maybe by the end of 2025 we'll finally see some good local models. If you just use their API it's so cheap: even running DeepSeek V3 24/7 for a whole year comes to around $3,600, about half the price... Feels like a maxed-out M4 is for YouTubers who want to test things, not realistic for devs, but I could be wrong so feel free to comment please~
Man, it depends on your project. I have a maxed-out M3 Max and work on a large Rust monorepo built with Bazel. I would not trade in the Max for anything because incremental build times just fly. A full rebuild on my previous computer took about an hour to complete; the M3 Max does the same job in about ten minutes… if your time is valuable, the Max gets the job done.
unfortunately, you could be right...
Nah man, if you go the API route you spend $3,600 on nothing. Yes, nothing. If you sell your M4 Max next year it's still worth around $5k, plus $0 in API fees. If you run the numbers correctly you'll see it; there's a reason Elon Musk spent $100 million on GPUs, bruh.
I use the M4 Max 128GB for React Native and for LLM app development. It's very, very fast and very portable. Rapid development is a lot easier when everything flies, and that includes custom models for custom purposes.
@ Yeah, and some people still think it's better to pay for API services 😂😂😂 They forget the other benefits, plus the resale value of the laptop, etc. It's not comparable!
Been waiting for someone to post a video like this since M4 MAX launched. Thanks man!
Are you a psychic? Yesterday I was searching for a comprehensive M4 benchmark video and all I got was slop with slop thumbnails and stupid faces,
and now you upload this. Great work man, thank you so much
Very impressive piece of work Dan! Keep it up
Besides data privacy, the other key purpose of running a local model is not speed (even 10 TPS is faster than I am) or cost (it's already getting cheaper and cheaper), but upskilling it via RAG, fine-tuning, etc. I'd be interested in those metrics once these local models are trained on my own data/knowledge base.
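On the RAG half of that, here's a minimal sketch of the idea against a local Ollama server; the model names and the note contents are just placeholder assumptions, not anything from the video:

# Hypothetical local-RAG sketch: embed a few private notes with a local
# embedding model, retrieve the closest one, and stuff it into the prompt
# of a local chat model. Model names below are assumptions.
import requests
import numpy as np

OLLAMA = "http://localhost:11434"

def embed(text):
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return np.array(r.json()["embedding"])

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

notes = ["We deploy on Fridays only after the smoke tests pass.",
         "The billing service owns the invoices table."]
note_vecs = [embed(n) for n in notes]

question = "Which service owns the invoices table?"
q = embed(question)
best = max(range(len(notes)), key=lambda i: cosine(q, note_vecs[i]))

prompt = f"Context: {notes[best]}\n\nAnswer using only the context: {question}"
answer = requests.post(f"{OLLAMA}/api/generate",
                       json={"model": "qwen2.5:32b", "prompt": prompt,
                             "stream": False}).json()["response"]
print(answer)

Fine-tuning is a different, heavier lift; this only covers the retrieval side.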
Your next video should be something related to running local LLM models through Ollama vs MLX on Apple machines. You're going to love how much faster the MLX framework is. I have an M4 Pro Mac Mini with 64GB of unified RAM and I only run models through the MLX framework; it's night and day vs Ollama.
Second this... Would love to see some comparative benchmarks of MLX vs Ollama. Let's see how good the Apple GPU optimisation is.
I'd also be interested in some side-by-side comparisons of different quants. How much does the accuracy drop, and how much do the tokens per second go up, when going to say 8-bit, 5-bit, 4-bit, etc.?
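For the tokens-per-second half, you can already measure that yourself against Ollama: /api/generate returns eval_count and eval_duration (in nanoseconds) when streaming is off. A rough sketch, with the quant tags below being assumptions for whatever you have pulled locally:

# Hypothetical sketch: run one prompt against several quant tags of the same
# model and compute generation tokens/sec from Ollama's response fields.
import requests

TAGS = ["qwen2.5:32b-instruct-q4_K_M",
        "qwen2.5:32b-instruct-q5_K_M",
        "qwen2.5:32b-instruct-q8_0"]  # assumed tags, swap in your own
PROMPT = "Summarize the tradeoffs of 4-bit vs 8-bit quantization in 3 bullets."

for tag in TAGS:
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": tag, "prompt": PROMPT,
                            "stream": False}).json()
    tps = r["eval_count"] / (r["eval_duration"] / 1e9)  # duration is in ns
    print(f"{tag}: {tps:.1f} tok/s over {r['eval_count']} generated tokens")

The accuracy half still needs an eval set on top of this, which is the harder part.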
Exactly the same spec as mine! I absolutely love the machine. It’s not only a workhorse, it’s a piece of art. Just perfect build quality. Makes my razer blade 18 (one of the best build quality windoze machines) feel cheap. Don’t get me wrong, love my razer, but apple is next level.
Project DIGITS will become my personal AI model machine to run local models. I may even get 2 to connect them together.
IndyDevDan out here making AI ASMR unboxing videos in 2025 like he's living in the future.
What do you make of the NVIDIA DIGITS units coming out?
Looks incredible, could be the move for Q3. Right now the big question I'm looking to answer is: how can I get DeepSeek V3 hosted? Will 3 DIGITS do the job? We'll see. Will update on the channel ofc.
@@indydevdan 3 units will work with llama.cpp for sure imo (though inference speed may be pretty slow compared to the API). The real value of DIGITS over the M4 Ultra seems to be (at least for image and video gen) that at fp16 it's twice as fast as the M4 Ultra. If the memory bandwidth is the same as the M4 Ultra, at 1092 GB/s, then we'll really be cooking!
Thank you, I find your work very valuable!
If I may: a video I would personally be looking out for is a comparison of different quantizations of the same model. I understand and respect that you can't do individual requests, but maybe it's something you or others see value in as well.
This is a great idea - stay tuned.
Very nice perf comparison of both the hardware and the models! Cool benchmarking tool, and your rigorous process is excellent. One-shot performance is definitely useful, but it would also be nice to see how increased context length impacts performance, e.g. what is the max usable token count for each model?
All I have to say is WOW! Fantastic video, and great job with "benchy"! Would love to see how much speed would increase with MLX. I didn't know about the other one you mentioned, llamafile or whatever it was. Thanks!
Great timing. I'm going through your course and have also been contemplating what spec M4 to get, and whether local model usage is a use case worth buying for (vs getting a lower-spec MacBook Pro and putting the money toward a GPU rig instead).
Appreciate this man, this is huge for prospective buyers!!
One interesting benchmark would be to let the model correct its answer by allowing it to review its output 1-2 times before passing the output to the evaluator function. This should increase the success rate at a cost in execution time. Here local models have an edge over per-token API costs.
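A minimal sketch of that self-review loop, assuming an Ollama server and treating the evaluator and the model tag as placeholders for whatever the benchmark already uses:

# Hypothetical self-correction loop: answer once, then let the model review
# its own output up to MAX_REVIEWS times before the evaluator sees it.
import requests

OLLAMA = "http://localhost:11434/api/generate"
MODEL = "qwen2.5:32b"  # placeholder model tag
MAX_REVIEWS = 2

def ask(prompt):
    r = requests.post(OLLAMA, json={"model": MODEL, "prompt": prompt,
                                    "stream": False})
    return r.json()["response"]

def solve_with_review(task):
    answer = ask(task)
    for _ in range(MAX_REVIEWS):
        answer = ask(f"Task: {task}\n\nYour previous answer:\n{answer}\n\n"
                     "Check it carefully. If it is correct, repeat it verbatim. "
                     "If not, output only the corrected answer.")
    return answer

# score = evaluate(solve_with_review(task))  # evaluate() assumed to exist

Each review pass roughly multiplies the token count, which is exactly where local inference being effectively free per token starts to matter.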
Nice, great video and interesting approach!
love the cross-test of prompts
Many thanks for this - was looking for this exactly!
For agentic and batch tasks, I am fine with sub 10 TPS
I respect your patience
Always useful stuff, thanks man!😊
Fantastic video. I hope to run Llama 3.3 70B at 10 tokens/s soon, but on Linux-compatible hardware with an NPU.
Your videos started out good and keep improving, keep it going.
Nice! My two cents: you could try specifying PARAMETER seed in a custom Modelfile to generate predictable results and avoid different outputs from the model when benchmarking.
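For reference, a sketch of both routes to that (the model tag and values are just examples): a custom Modelfile with PARAMETER seed built via ollama create, or the same settings passed per request through the options field:

# Hypothetical sketch: pin seed and temperature so benchmark runs repeat.
import pathlib
import subprocess
import requests

# Route 1: custom Modelfile with PARAMETER seed, built with `ollama create`.
pathlib.Path("Modelfile").write_text(
    "FROM qwen2.5:32b\n"        # base model is an assumption
    "PARAMETER seed 42\n"
    "PARAMETER temperature 0\n"
)
subprocess.run(["ollama", "create", "qwen2.5-bench", "-f", "Modelfile"],
               check=True)

# Route 2: the same settings per request via the options field.
r = requests.post("http://localhost:11434/api/generate",
                  json={"model": "qwen2.5:32b",
                        "prompt": "List the first five prime numbers.",
                        "options": {"seed": 42, "temperature": 0},
                        "stream": False})
print(r.json()["response"])

In principle that makes repeat runs of the same prompt reproducible for benchmarking.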
That's so beautiful. As an AI nerd, it's like watching a beautiful art masterpiece.
Falcon 3, kaaawww, I'm rolling. Also, I enjoyed seeing just a static shot of the box and hearing the ambient noise while you opened it. Audio was clear. I understand this is an AI channel, but I thought it was produced well.
That caught me really off guard.
Really nice video as usual. With 126 GB of unified memory, you could have tested Llama 3.3 70B and Qwen2.5 72B on the last problem
True, I took them out because it takes ages to run 30 tests. They do run quite well though. I'm keeping a sharp eye out for performant 30-70B models; it seems like the sweet spot for the M4 Max.
I got the M4 Pro for Christmas and I am super interested to test these models locally. One of my favorite tech products ever as a developer coming from a windows laptop.
Big upgrade imo. Enjoy the experience. The M4 has been a dream to work with so far.
@@indydevdan Been working on a web app the last few days and it has been a treat. Did not realize how badly I needed a Unix-based OS for coding too.
Glad to see my $199 went to a good cause. Enjoy it! :) One of my clients bought me the top spec mac mini with 64gb for xmas :)
Your videos are some of the best I’ve seen, and I always look forward to your guidance. I was wondering if you’re still using Marimo notebooks? I’ve tried a few times but can’t seem to make it click. Are you mainly using them as a prompt library?
How much worse is the regular M4 compared to the M4 Max? I'm thinking of buying an Air or Pro model with the regular M4 and want to know if there's a big difference in AI performance there.
I couldn't easily find the memory configuration of your MacBook Pros, which I think is very important to know.
Given your logic here, a Project Digits build might be your next move in Q3-4. Two of them come in at a similar price, and Nvidia says you can run 405B models on that pair.
I've got a maxed-out M3 and was a little envious of the M4 here, but I'm 100% getting at least one DIGITS when they come out! Just saying, it's not just HIS logic!
100% - big announcement for local LLMs. That prediction is coming true only 7 days into 2025 🎯.
Is a Mac the best device for local AI, or should I buy an Nvidia device instead? Would love to see a video about that. PS: the index file in your project is blank.
It would also be great if they switched to, or at least offered the option of, a MacBook in magnesium alloy, so it's not too heavy!!
Nice video!! It would be super interesting to get some info on boosting performance by converting to MLX, or any other tips.
33:02 > I went through a loop of trial and error and ended up with a very reliable tool-use prompt that was very consistent across all my local Qwen2.5 Coder models, including 1.5B, 7B, and 14B. In the end I decided to only use the 1.5B because it nailed it perfectly every single time. That was surprising.
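None of this is the commenter's actual prompt, but for anyone curious, a made-up example of the strict-format style of tool-use prompt plus validation that tends to keep small coder models consistent:

# Hypothetical tool-use prompt and validation; the tool schema is invented
# purely for illustration, and the model tag is an assumption.
import json
import requests

TOOL_PROMPT = """You have one tool:
  read_file(path: string) -> returns the file contents
Respond with ONLY a JSON object, no prose, exactly in this shape:
  {"tool": "read_file", "args": {"path": "<path>"}}
Task: open the project README."""

r = requests.post("http://localhost:11434/api/generate",
                  json={"model": "qwen2.5-coder:1.5b",
                        "prompt": TOOL_PROMPT,
                        "options": {"temperature": 0},
                        "stream": False})
raw = r.json()["response"].strip()

try:
    call = json.loads(raw)
    assert call["tool"] == "read_file" and "path" in call["args"]
    print("valid tool call:", call)
except (json.JSONDecodeError, AssertionError, KeyError, TypeError):
    print("model broke the format, retry or fall back:", raw)

The validation step does a lot of the work: a format the model can't drift from plus a cheap retry path.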
All thanks to the people who paid for the course XD. Jokes aside - always supportive and thankful for you!
I'm actually thinking about getting an M2 Ultra. I wonder how it stacks up to the M4 Max.
How about benchmarking vision models, like Llama 3.2 Vision?
As far as minimum tokens per second goes, it depends on the type of task. For one where I'm waiting on the results, I'd agree 10 tokens per second is about my minimum. But for an automated task where I don't care how long it takes, there is no minimum.
It's a bit weird to hear you say that Qwen 32B is wrong because it just gave the result without the underlying Python code, when the prompt given was "Use Python to compute... print the numerical result only." I think the model performed exactly as expected given your prompt.
Love the Wes Anderson style intro.
At one time I wanted to buy a MacBook for local LLMs. But then I read on Reddit that MacBooks get very, very slow once the prompt approaches 30k tokens. It ends up feeling like some kind of marketing bullshit, like "AMD Ryzen AI Max+ 395, 2.2x faster than a 4090" - when Llama 3.1 70B Q4 won't even fit into the 4090's VRAM: 43GB model size vs 24GB of VRAM.
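For what it's worth, the VRAM part of that is just arithmetic; a rough back-of-envelope (numbers approximate):

# Why a 70B Q4 model overflows a 24 GB GPU; ~4.7 bits/weight is a rough
# figure for Q4_K_M, and the KV-cache estimate is coarse.
params = 70e9
bits_per_weight = 4.7
weights_gb = params * bits_per_weight / 8 / 1e9   # ≈ 41 GB of weights
kv_cache_gb = 4                                    # grows with context length
print(f"~{weights_gb + kv_cache_gb:.0f} GB needed vs 24 GB of VRAM")

That size class is the whole appeal of unified memory, prompt-processing speed complaints aside.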
Open models often turn out to be poor instruction followers and prone to verbosity - I got the same impression after testing plenty of them with a simple "move a piece on the chess board" prompt while simulating chess games (LLM vs a random player). DeepSeek V3 and Phi-4 (as well as Qwen 2.5 and Llama 3.1) are at the bottom of my "LLM Chess Leaderboard" when sorted by the number of mistakes the LLMs made by failing to comply with the prompt instructions.
Wow that's surprising. Instruction following should be the first benchmark for model builders to prioritize imo. Obviously it's more complicated than turning the 'instruction following' dial but a model that doesn't follow instructions well is pretty useless to us.
That's at 2k context length, which is completely useless in a lot of use cases.
This is great. In fairness, those "tough" questions are really tough, even for me.
Fantastic video, and great job with the benchmarking tool; I've been waiting for something like this for a while. Are you by any chance interested in testing the Qwen QwQ-32B model? It's a really good model when it comes to reasoning, mathematics, and problem solving.
Would also really like to see the benchmarks for that model. Although reasoning might be a bit more difficult to evaluate.
Pricing is outrageous if you're buying this in Europe. It should cost less given that the USD-to-EUR conversion favors the euro; instead it ends up costing way more. I just spent €5k on a MacBook Pro M4 Max with 64GB of RAM, a 1TB SSD, and the nano-texture display. It's getting to the point where it would ALMOST be more feasible to fly to the US and buy it there, and at least that way I'd also get a one-day vacation in the US. I am honestly very outraged by this.
The only reason I'm willing to pay these prices is that I really like having things running locally and wouldn't want to rely on APIs; I also like the Apple ecosystem.
I got my 14-inch, 128GB RAM, 2TB, space black M4 Max MacBook on Jan 4th, 2025. This is a magical device..... I wanted to review it but I don't have a YouTube channel...
Why aren't you using the highest-quality version of the model you can, or at least the q8 version? There is a big difference in quality between llama3.2's 3b-instruct-q4_K_M (which you are using) and the q6, for example. The 3b-instruct-fp16 full version may be overkill, but it's only 6.4GB. It still sucks, but it's better than the q4 version.
True but why not just bump to 10b, 14b, or 32b models?
Right on time - I have an M4 machine on the way.
this buddy knows what others want 😄
Cool but can it run cyberpunk?
Has anyone installed DeepSeek-V3 locally?
Someday I will do the same pack opening and I will be happy :)
For some tasks, accuracy and intelligence trump all. Even if it took a week to complete a prompt, as long as the output has the accuracy you need, it doesn't matter how fast a less capable model is: if it can't give you the depth and accuracy you need, then it's better to use the slow model.
What is more useful? The ability to ask a 300 IQ individual one question per day or the ability to ask 1000 80 IQ individuals 1000 questions per second?
The low intelligence route cannot help you build a rocket ship, the high intelligence route can.
How did you download the deepseekcoder v3 in Ollama?
Agreed.
I have a topped out M2 so I’m looking at this to learn.
3:40 - he just stated that DeepSeek was cloud-based.
@____2080_____ thanks
Yes!! I was like, dang, do I need an M4? But I have an M3 - think it's worth it then?
If you have a specced out M3 I would wait for m5 or mNext
@@indydevdan I have notifs on for your channel. Love the content. Agree, thanks!!
Sweet if you have cash to burn. Love the performance but that price is ouch ouch ouch.
Aesthetic AF :) Thank you for the video and the results.
ASMR vibes
He can't stand speeds slow enough to read. If you're generating code, you'll have to read what's being written eventually, so you may as well do so during inference. I don't mind speeds as slow as natural human handwriting.
You spelled Benchee wrong. ;)
oh great, another mac
Interesting that Falcon fails on the first problem.
Maybe a test issue, since startup time is slow.
Super
I got an M3 Pro last year lol, and there's already an M4. My wallet would be cooked, I'm not getting this.
Didn't like the silent intro.
It would be better to put the $6k into Bitcoin two years prior, if that's still an option. Time can move backwards if you're a photon 🤷‍♂️
I would never buy Apple, too authoritarian for me... so it would be good if you could point out the differences vs Windows.
M3 Max, llama3.2:1b: total duration: 2.413439167s
load duration: 22.982167ms
prompt eval count: 44 token(s)
prompt eval duration: 67ms
prompt eval rate: 656.72 tokens/s
eval count: 390 token(s)
eval duration: 2.321s
eval rate: 168.03 tokens/s
Nice! Try a larger, more useful model though like Qwen2.5:32b.
@@indydevdan qwen2.5:32b: total duration: 5.795758917s
load duration: 20.935334ms
prompt eval count: 48 token(s)
prompt eval duration: 577ms
prompt eval rate: 83.19 tokens/s
eval count: 87 token(s)
eval duration: 5.196s
eval rate: 16.74 tokens/s
Now that's a solid model to work with. Phi-4 just (officially) dropped too. That will be a little faster but will drop your intelligence a bit coming down from qwen2.5:32b.
@@indydevdan I uploaded a video testing llama3.3:70b-instruct-q8_0 earlier this morning. Yeah, I tested phi4:14b-fp16 with some of my apps and was not really impressed with its intelligence.