Ollama uses the llama.cpp code-base for inference, and llama.cpp doesn't use MLX. I admit I haven't benchmarked MLX against llama.cpp recently, so I have to look into it. As I understand it, MLX can read GGUF files, but only a limited set of quantization variants. Q4_0 seems a good compromise, but llama.cpp can do better.
I got the K9 with 96GB after your first video about it, and I'm happy with the output speed and power consumption for the price. I did get a silent fan for it though, cause it is loud. I only really do coding work, so I think the Mac would be great, but the RTX is just overkill. I'm very curious about the next gen of Intel processors, so please do that test. I honestly don't care about the RTX, cause if I needed something that good I could put that money towards claude or chat gippity online and not have to deal with the heat, power consumption, and maintenance, and could run a huge model. I personally don't see the benefit of running a 7B parameter model at lightning speed. I'd rather run a bigger model slower at home. Great video! So much to think about.
@@brulsmurf I use 123B models on my M1 Ultra 128GB. It can be as slow as 7 tokens per second, but I find that still usable for interactive chat. I'm more into quality than speed.
There is quite a lot of talk about this in r/LocalLLaMA. For LLM work a 4090 is faster than an M2 Ultra. You can get more memory on an M2 Ultra, but you can also get multiple 4090s (or, perhaps even better, 3090s) and put them in a computer for typically less money (and better performance). Once you're over ~90 GB of memory requirements (four xx90 cards), neither option is really fast enough to be useful. (When you get down to about 1 token per second, you're probably better off just renting a machine online or buying tokens from e.g. OpenAI.)
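As a sanity check on those memory numbers, a rough back-of-envelope rule is ~0.5 bytes per parameter at 4-bit quantization; the extra 1.2x factor for KV cache and runtime overhead below is an assumption, not a measured figure:

```shell
# Rough VRAM needed per model size at 4-bit quantization.
# 0.5 bytes/param at Q4; the 1.2x overhead for KV cache etc. is an assumption.
for p in 8 70 123; do
  awk -v p="$p" 'BEGIN { printf "%3dB params @ Q4 -> ~%3.0f GB\n", p, p * 0.5 * 1.2 }'
done
```

By that estimate a 70B model at Q4 already needs two 24 GB cards, which lines up with the multi-3090 builds people discuss.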
thanks for a great and interesting video! I have a Linux machine with a nvidia card and a macbook pro m3, and personally I really care about the noise and heat from the linux/nvidia machine - after a while it's bothering. the mac makes zero noise, and no heat and is always super responsive. in my opinion, it's incredible what Apple has built. thanks again for your fun and cool technical videos!
@@ALCE-h7b No, they are still doing what they were doing before, but I suppose they took a big leap now with the release of their 'Asahi game playing toolkit', which in short makes playing AAA games on Mac very probable. If you wanna learn more, why not read their blog.
That gigantic rig with the rubber fingers is comedy gold. Like you couldn't trigger them remotely, or just test them not simultaneously. Clearly it's for laughs. Testing weirdly tiny computers though, for local LLM performance and using a cable attached GPU. Who would seriously use machines like that for this purpose?
Really interesting to see the Mac mini and how well it performs. As someone who owns a Mac mini M1, its simple design, relatively small footprint, exceptionally quiet operation, and low power consumption are all pluses. Given my experience with the original Apple Silicon, I would definitely go with the M2, especially since for good chunks of the day this system would be idle.
Great video! Thanks for doing not only the comparison, but also the analysis in the second half of the video. I can't wait for you to test the new Intel Core CPU that just came out. Since you use the iGPU on the NUC, please mention the speed of the DRAM in your next comparison video, as it can have an impact on results (just like DRAM can impact the performance of a video game). I hope a desktop machine with a 4090 makes it into your next comparison video. Even a machine with a mini-ITX motherboard and a 16x PCIe slot would be much better.
It is not only the hardware, but the CUDA (Compute Unified Device Architecture) framework that allows developers to harness the massive parallel processing. The question is whether Apple will develop an MLX framework more suitable for AI development.
I'll be totally honest... I don't have the slightest clue what's happening in this video but the little bit that I could understand seems really cool lol.
I was actually interested in building a dedicated LLM server, but after a lot of looking around for language models, I realized most open-source "coding"-focused LLMs are either extremely outdated or plainly not good. Llama is working with data from 2021, DeepSeek Coder 2022, etc. Unfortunately the best models for coding purposes are still Claude and ChatGPT, and those are closed.
Can you compare some AMD GPUs in these tests? They're significantly cheaper, especially for the extra RAM. So I'm curious if the 'less performant' AMD GPUs can do the same as a 4090 or a lesser Nvidia GPU, or whether it's a linear RAM = performance relationship - like a 20-24GB AMD vs a 16GB Nvidia? What about using a real motherboard with full PCIe Gen 5 and DDR5 RAM? Where is the bottleneck? Is there a limit or benefit to adding tons of RAM, either to not bother with a GPU or to supplement one? How can you use both the RAM and the GPU? Lastly, VERY curious about the latest AMD and Intel CPUs coming that are supposed to be more power efficient and 'built for AI' 🍻
I ended up with a used studio M1 Ultra with 128GB RAM. I can run any model up to 96GB RAM, and often do. It could be faster, but the important thing is I have few limitations for large-ish models and context. What really competes with this?
Great test! This is exactly what I'm interested in, especially the idle power test is important, as you won't be running inference 24/7 usually. Amazing to see that "mini PC" with the eGPU takes more power in idle than my full blown desktop PC with a 4090 😅
Well, yes, of course we want to know how the 200H core ultra series will perform. Have you heard any news about when Asus is releasing the Nuc 14 AI? I haven't seen any public information yet.
I'm surprised the OCuLink doesn't create more of a performance hit than it did. My 4090 plugged directly into the mobo gets about 145 tokens/sec on llama3.1:8b versus the ~130 tokens/sec that you got. Kind of makes sense, since the model is first loaded into memory on the GPU.
Token generation during inference is largely memory-bandwidth bound (OK, with some minimal impact from quantization calculations) - the 4090 has ~1TB/s. And the LLM runs entirely on the 4090. The 4090 really shines during (batched) prompt processing, blowing the Intel/Apple machines away - probably >20x faster than the M2 Pro, and way, way more so than the Intel CPU/GPU.
Alex, thank you, and please comment on the NUC's shared video memory management (in general - because the smaller test probably fit in either/both), especially when switching back and forth between a CPU test and a GPU test. Would this be Windows-managed, or would you be changing parameters? Thank you again.
Great video, thanks! In the verdict you forgot to mention that the only way to run bigger LLMs is with Macs, if you have enough unified memory. Yes, it's gonna be slow, but still possible, and faster than CPU. You mentioned it in the beginning though :)
12:37 Shouldn't quality of responses also matter? I imagine the quality that the 4090 spits out is better than the mac in terms of accuracy and quality?
@@EugeneYunak Performance isn't just raw hardware power - it's how well your software is optimized for it. Think of it this way: Current LLMs are like engines tuned specifically for NVIDIA's H100s and Google's TPUs. Sure, Apple silicon is powerful, but running these models is like having a Porsche tuned to Prius specs - you've got the hardware muscle, but the software can't tap into it. We don't have LLMs yet that can fully saturate Apple silicon's potential. Does that make sense?
@@art-thou-gomeo first, no it does not, i say that as a developer. i don’t think we need to go into details here, i certainly won’t because your general point is correct but it does not matter in this specific context. second, the question here was on “accuracy and quality”, not performance so i don’t understand why you bring it up. unless you specifically limit the runtime “quality” is going to be the same between mac and rtx
@@EugeneYunak It seems like we are using different meanings. I'm using accuracy/quality to mean the quality of the language outputs, and it seems you're talking more about technical accuracy - is this right? I probably overhyped the hardware quality gap initially. While platforms handle floating point math and RNG differently, it's not some dramatic "4090 = big brain" situation. But from my tests I consistently get better outputs on NVIDIA. When I ask for creative stuff like poems about life's transient nature, the NVIDIA version just flows better and feels more polished. Not being a dev, I'm genuinely curious - what would cause this? I know it could be confirmation bias, but I've seen this pattern across many tests. Any thoughts?
@@art-thou-gomeo it could be that the software you are using to run the prompts places more aggressive execution limits on the inferer on non-rtx hardware, in terms of execution time, tokens, or memory availability (smaller models). if the exact same execution is performed on a 4090 and an m2 npu, they should produce similar output in terms of quality. in fact, if you set temperature to zero, you should get deterministically identical results. slower hardware will just arrive at it later - sometimes dramatically. e.g. if i ran a huge model that doesn't fit into vram on the gpu, but fits into the mac's unified memory, we would be talking orders of magnitude (which is why this particular test in the video was capped at a size that fits into ram/vram on all machines). in fact, mac machines are currently the only viable local development platform for huge ML models, which is why the claim that development optimizes for gpus is outdated if not wrong entirely - i haven't seen anything in the code anywhere that would produce different output based on hardware. there are certainly hardware-specific optimizations, but these only help it run faster, not produce different output. so yeah, either the software you are using limits the inferers to give you outputs in what the devs consider reasonable time (which is a reasonable thing for user-facing software to do), or you are running non-identical software, or it's confirmation bias :)
I'm curious to see the first Strix Halo APUs running this test with ROCm + Ollama. I'm dreaming of 96GB VRAM allocated and 128GB total for the system. What do you think?
Thanks for this content and comparison. I think as this becomes more mainstream, there will be fewer nerdy setups, and software to run this on a PC, Mac, or tablet in the near future. 👍🤖
Well, I have some questions about the test method, because you do not close and open the LLM every time you need it. Mostly you load it once and it runs for a long time, serving multiple queries, so loading time will be a non-issue in daily use. Second, most of the time you will put the LLM on a mini server (you can use a mini PC or any PC for that matter; I do not have a Mac so I can't talk about them), so you just reach it over LAN, and heat is little to no problem. A third point: how big a model can it work with, and do you actually need the biggest model? Like, do you need a storyteller for coding, or a coder for storytelling? And lastly, who needs what, and for how much? As long as you have an internet connection, even the cheapest solution here costs around 2+ years of a subscription plan. I use local LLMs, so I can say it's a good way to go when you do not have an internet connection, but it's not for everybody.
At 15:10 you show a document that says both that the M2 16GB costs $0.11 per day and $0.03 per day, and that the yearly costs are $40 and $12. I looked closer because 3 cents per day seemed suspiciously low.
I think the 4060 ti 16GB should have been included in the comparison. It seems like a most valuable solution. It combines a small price, good performance, low consumption and fairly compact size. I think for local llm it should be the best solution from nvidia for the average user.
Great comparison Alex, it just shows that if a model is tuned to run on specific hardware it will outperform in terms of efficiency. However, I saw an article today that says Microsoft open-sourced bitnet.cpp, a blazing-fast 1-bit LLM inference framework that runs directly on CPUs. It says you can now run 100B parameter models on local devices. Will be waiting for your video on how this changes everything.
LLM inference is largely determined by RAM bandwidth. The newest SoCs (Qualcomm/AMD/Intel/Apple) almost all have 100-130 GB/s, while the M2/M3 Max has 400 GB/s, the Ultra 800 GB/s, and the 4090 >1TB/s. And all the new CPUs have very fast matrix instructions, rivalling the GPUs for performance. 1.5-bit quantization might be some future thing (but not in any usable model yet); currently 4-bit is the sweet spot. Snapdragon X Elite CPU Q4_0_4_8 quantized inference is already similar in speed to M2 10-GPU Q4_0 inference, with the same accuracy.
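The bandwidth-bound claim can be turned into a quick ceiling estimate: each generated token has to stream the whole set of weights through memory once, so tokens/s is at most bandwidth divided by model size. A sketch, where the 4.7 GB figure for an 8B model at 4-bit and the bandwidth tiers are assumptions:

```shell
model_gb=4.7   # ~8B parameters at 4-bit quantization (assumption)
for bw in 120 400 800 1000; do   # SoC / M-Max / M-Ultra / 4090-class GB/s
  awk -v b="$bw" -v m="$model_gb" \
    'BEGIN { printf "%4d GB/s -> ~%3.0f tok/s ceiling\n", b, b / m }'
done
```

Real throughput lands below the ceiling, but the ordering matches the measured results in the video.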
Thank you for doing this. I wonder about this stuff, and wish that there was a bit more content on GPUs and LLMs. Like is it better for LLMs to get 2 AMD cards or 1 4090? or even 2 A770s?
You can actually measure the heat VERY EASILY, because every watt consumed is converted to heat with EFFECTIVELY 100% efficiency! It's just resistive heating with a few extra steps, and all those extra steps only produce heat as a loss anyway. Technically a very small amount is lost to things like vibrations, and maybe UV radiation leaving through an open window or whatever, but you can basically ignore that for all practical purposes. 1 W of sustained power consumption is around 3.4 BTU/h, and 1000 J is around 1 BTU. So in your case, a single run consumes (rounding for easiness) let's say 5 BTU for the Intel, 2.5 for the M2, and 4 for the RTX.
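The watt-to-BTU conversion can be checked directly (1 BTU ≈ 1055 J, so 1 W sustained ≈ 3600/1055 ≈ 3.41 BTU/h); the 300 W figure is just an example draw:

```shell
watts=300   # example sustained draw (assumption)
awk -v w="$watts" 'BEGIN { printf "%.0f BTU/h\n", w * 3600 / 1055 }'
```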
Excellent comparison! For me, an interesting future comparison would be between the RTX setup and an equally priced Mac Studio. And then re-compare after the new M4 Pro/Max is available (hopefully next week?).
I like this style of video and 100% want to see more, especially with continuously updated test as hardware is updated like an m4 device or if you get a mac studio or ultra, intel, etc.
Hey Alex, thank you for this kind of comparison. Often I only see speed metrics; this one, with average energy consumption, speed, initial cost, and heat generated, was exactly what I wanted to see. A full array with multiple tests. I will definitely watch more if you make more power/speed/cost comparisons! Thank you.
It is interesting to multiply the runtime of each setup by the cost, to get a "bang for the buck" measure. I run both the NUC with 96GB and a PC with an RTX 3080. The first is slowest, but it "can" run very large LLMs. The latter is fast but needs smaller, low-precision models. I wonder if the Apple is faster at Stable Diffusion?
Interesting analysis! Two things: 1) I believe heat is pretty much just the energy consumed. There's no chemistry involved, just physics, so aside from negligible light and sound and a small amount of work from spinning fans, a computer using 100 watts is essentially a 100W heater. 2) The Intel mini pc loses in speed and efficiency, but its real strength is the amount of RAM. You could run a 70b model with minimal quantization and get better output than either of the other machines is capable of (though quantifying that and putting it on a chart would be difficult).
12:00 Are you sure? Normally ollama will keep the model in memory for a few minutes. So if you do something like coding, the model should already be in VRAM for most queries.
I wonder how it performs on an M1 Mac mini; I'll have to give it a try. Rather than selling it, I might repurpose it for this use case. Also, for the RTX 4090, you mentioned there's a warmup where the model is copied to VRAM - does this happen for each request, or does it stay in VRAM for a while?
You should be able to keep it in VRAM. There is a 'keep alive' variable you can set when starting, which decides how long the model is kept. However, if you are developing stuff that needs the GPU itself, I don't know if both are allowed to stay there.
Hi Alex, can you make a video about running LLM code assistants, like the Zed editor or VS Code with an extension, in a local setup? I'm planning to buy a GPU, but I can't find any video demonstrating local code generation.
Thank you for making this. I think the Mac Studio could be comparable to the 4090? Having said that, there is a 192GB option for the Mac Studio, and that's not even mentioning the potential new M4 Mac Studio.
If the model can fit in 16GB, the Nvidia 4060 Ti 16GB is a solid performer; you should give it a try. That card can even run Codestral/Mistral Small at 4-bit quantization at decent speeds (my old 3060 12GB was just a couple gigs short on VRAM).
I hope it's possible to include a performance comparison for training and fine-tuning models at some point in the future. I'm guessing the 4090 wins hands down, but maybe higher-end Apple Silicon isn't too far behind? It's just that if I'm spending 4090 or Max/Ultra money, I'll want to do more than just inference on it. For inference tasks it would be cool to include a Raspberry Pi with a Hailo NPU setup for ultra-budget tier comparisons as well.
Would be interesting how those RTX small-form-factor cards would perform. Max. 75W, since they're powered by the PCIe bus only... Also: if a model was trained on Nvidia, does running it on other hardware give different results?
If you run the ollama command with the `--verbose` flag, it will give you the tokens/sec for each prompt, so you don't have to time each machine separately.
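For example (the model name and prompt are just placeholders, and the exact stats format may differ between ollama versions):

```shell
# One-shot prompt; --verbose appends timing stats after the answer
ollama run llama3.1:8b --verbose "Explain a mutex in one paragraph."
# prints, among other stats, lines like:
#   prompt eval rate:  ... tokens/s
#   eval rate:         ... tokens/s
```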
Man, this question was literally on my mind. Not everyone can afford an H100.
The H100 is not for inferencing, it is for training (but it's widely used for inferencing too, which is super wasteful). There is a massive market hole for inference hardware.
@@adamrak7560 Yup! And somehow the M4 Max with 128GB is a cheap option for it now..
Buy an Orin kit 64GB, or Thor
@@malloott Waiting for it, but my M1 Max just needs to be replaced first, then I will test the M2 Max 64G/2T.
For your 4090 machine, give ExLlamaV2 or TensorRT a try as the inference engine. They will give you a +50% performance boost compared to ollama / GGUF models. Also, this initial 2-second performance hit is an ollama-specific thing with default settings - it happens because by default ollama unloads the model from memory after an idle timeout. You can turn this behavior off.
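A sketch of how to turn the unloading off, assuming a standard ollama install (a `keep_alive` of -1 means "keep the model loaded indefinitely"; the model name is a placeholder):

```shell
# Server-wide: keep models resident instead of unloading after the idle timeout
OLLAMA_KEEP_ALIVE=-1 ollama serve

# Or per request, via the HTTP API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "warm-up",
  "keep_alive": -1
}'
```

With the model pinned, only the first request pays the load cost; later requests start generating almost immediately.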
Running the GPU at a reduced power level can also be quite viable. It's common that you can sacrifice a bit of speed for significantly better efficiency. Especially for tasks that bottleneck on memory bandwidth. I rarely run my 3080ti past 80% power. It varies a lot from card to card and from task to task, so it takes a bit of testing to find the right balance. Though, the 4090 especially is known to be tuned deep into diminishing returns from the factory, so you can cut a lot of power without losing much performance.
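On Linux, power-limiting is a couple of nvidia-smi commands. A hedged sketch - the 315 W value assumes a 450 W factory limit on a 4090, so check your card's allowed range first:

```shell
nvidia-smi -q -d POWER        # shows default and min/max allowed power limits
sudo nvidia-smi -pm 1         # enable persistence mode so the limit sticks
sudo nvidia-smi -pl 315       # cap at ~70% of an assumed 450 W default
```

The limit resets on reboot unless you reapply it (e.g. from a startup script).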
Energy efficiency and heat are the reasons I go for Mac at the moment. I upgraded a few years ago from a 1080 Ti to a 3080 Ti, then basically sold my PC a month later and bought a PS5 and a MacBook.
So I really appreciate the inclusion of efficiency in your testing.
Electricity prices are really expensive in some parts of the world. I'd be happy running a MacBook, Mac Mini or some small NUC for a private LLM.
@@fallinginthed33p AND it's good for the environment!
@@vadym8713 sooolaaaar
7:42 If you set the temperature of the model to zero, the output should be deterministic, and give the same results across all machines.
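With ollama's HTTP API that looks roughly like this (the model name, prompt, and seed are placeholders; note that even at temperature 0, outputs can still differ across machines due to floating-point differences):

```shell
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Name three sorting algorithms.",
  "stream": false,
  "options": { "temperature": 0, "seed": 7 }
}'
```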
I highly appreciate your focus to run LLMs locally, especially on affordable mini PCs. Very helpful. This kind of edge computing will grow massively soon. So please continue to create/run such tests.
Would be interested to test iGPU vs (new) NPU perf.
Yeah, TOPS per watt
Currently NPUs can't run the AI models from these tests well. llama.cpp (the inference code behind ollama) does not run on NPUs (yet). It's all just marketing. Qualcomm/QNN, Intel, AMD, and Apple all have different NPU architectures and frameworks, which makes it very hard for llama.cpp to support them. Apple does not even support their own NPU (called the ANE) with their own MLX framework.
llama.cpp does not even support the Qualcomm Snapdragon's GPU (Adreno), but ARM did some very clever CPU tricks, so that the Snapdragon X Elite's CPU is approx. as fast as an M2 10-GPU for llama.cpp via their Q4_0_4_8 (re-)quantization. You can also use this speed-up with ollama (but you need specially re-worked models).
@@andikunar7183 good info, thanks
@@andikunar7183 I heard Apple's MLX does use the NPU, but we don't have control to manually target it. Correct me if I am wrong.
@@flyingcheesecake3725 you mean Apple's CoreML and not MLX. CoreML supports the GPU and NPU (Apple calls it ANE), but with limited control. Apple's great open-source machine-learning framework MLX does NOT support the ANE (at least until now), only the CPU+GPU.
You don't have to load the model into VRAM every time; it is actually crazy to compare it like that. You load it once and then send requests - then the RTX's latency will be the lowest.
0.53 EUR/kWh in Germany? Where did you get these numbers from? About 0.28 EUR/kWh would be an average price. In fact prices dropped considerably during the last year or so.
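For reference, the running-cost arithmetic at the ~0.28 EUR/kWh average is simple; the 10 W figure below is an assumed average draw (roughly a Mac mini idling), not a number from the video:

```shell
watts=10; hours=24; price=0.28   # EUR/kWh; 10 W average draw is an assumption
awk -v w="$watts" -v h="$hours" -v p="$price" \
  'BEGIN { d = w / 1000 * h * p; printf "%.2f EUR/day, %.0f EUR/year\n", d, d * 365 }'
```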
Yeah definitely please keep making this style of video!
If you wanna make your tests deterministic with ollama, you can use the /set command to set the parameters: a top k of 0, a top p and temperature of 0, and the seed parameter set to the same value. Also, if instead of running a chat you put the prompt in the run command and use --verbose, you'll get the tps.
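Inside an interactive session the commands look like this; the model name and seed are placeholders, and the sampler values simply mirror the suggestion above:

```shell
ollama run llama3.1:8b
# then at the >>> prompt:
#   /set parameter temperature 0
#   /set parameter top_k 0
#   /set parameter top_p 0
#   /set parameter seed 42
```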
Thanks! My small participation in the electricity cost :)
Appreciate you!
Very instructive, I had never thought about power costs. It makes paying $20 a month for an LLM service seem reasonable.
The timing of this video could not have been more perfect. Literally working on figuring out how to get an LLM running locally because I want the freedom of choosing the LLM I want and for privacy reasons. I only have an RTX 4060 Ti w/ 16GB but it should be more than sufficient for my purposes.
Love this style of format! It makes sense to consider electric cost when running these sorts of setups. Awesome quality as always!
LM studio should meet your needs if you're looking for a one click install system
I'd love to see the same comparison with the Mac Studio with the Ultra chip since it has similar cost to a high spec PC with a RTX4090. This would be quite useful to know for workstation situations
I was thinking the same thing. And more model-size headroom due to unified RAM up to 128GB, whereas the 4090 is locked at 24GB.
Very cool content thanks Alex, I've always wanted to try one of these models locally but I couldn't find the time, at least I can follow your steps and do it a bit faster :) See you in your next video!
You can make LLMs nearly deterministic if not entirely with the “Temperature” setting. I haven’t yet had time to experiment with this myself, but I’ve been following this with great interest.
I'd be joyous to see a similar comparison with the (hopefully just days away) M4 Pro and M4 Max alongside your other unique, insightful, and well-designed testing.
Not to mention setting the seed value. I wonder if there would be any per-machine difference among them with the same seed/temp, or if it would generate the same on both.
But without them pinned to the same output, a 'speed test' that isn't about tokens/sec is pointless.
@@TheHardcard “Hallucinations” (i.e. incorrect answers) and temperature are two different things. Although increasing the temperature (variability in the probability of the next token) is likely to produce more incorrect responses, it also can create more creative responses (that may be ‘better’) in many situations. Incorrect responses are simply the result of both the training data and the architecture of a next word prediction model.
I'm with you! Apple wins by a mile. I have compared my 64GB Mac Studio M2 Ultra to my Windows workstation that has dual Nvidia A4500s using NVLink (20GB for each card), and at half the price the Mac Studio easily competes with the Nvidia cards. I can't wait to get an M4 Ultra Mac Studio with 192GB RAM - maybe more RAM 😂. I use AI for local RAG and research, and as a local code assistant for VS Code.
BTW, the Mac Studio was half the price of the Windows workstation
This was fantastic, thanks. I have done a video or two about Ollama, but haven't been able to do something like this because I haven't bought any of the nvidia cards. It is crazy to think that we have finally hit a use case where the cheapest way to go is with a brand new Mac.
As long as you don't upgrade anything, that is. Apple's 2TB upgrade costs $800, while a Gen4 2TB Samsung 990 costs $160
How about on the Snapdragon CPUs? Do they have Ollama running natively on those yet? I'm guessing not, but it would be interesting to see how they match up against the Mac hardware.
Asking the real question!
There's plenty of frontends on Android (Layla, Pocket Pal, chatterAI, MLCChat, etc) or you can run ollama through Termux if you want. Snapdragons do pretty well on phones, considering they're also a phone.
About 2.5-3 tokens/sec on Snapdragon 695 on Llama 3.2 8B parameter model (~$200USD phone, ala Motorola g84. Slow memory, slow CPU, 12GB RAM. About 5 watts usage).
About 12-22 tokens/sec+ on Snapdragon 8 gen 3 (~$600-700USD phone, ala OnePlus Ace 3 Pro. Good memory, good CPU, 24GB RAM, so can do bigger models. About 17.5 Watts usage).
So, they're not bad, but they're nowhere near a decent desktop CPU, GPU, or Mac integrated-RAM thingy. But they can run them at vaguely "usable" speeds (smaller models go quite a bit quicker, and are getting pretty good too. There's Llama 4B Magnum and Qwen 2.5 3B that will double-and-a-bit those speeds, especially using ARM-optimized versions. They're not "super smart/knowledgeable", but they're good enough for entertainment purposes).
Great work Alex! Really timely information about running LLMs locally. If you want to protect your IP and security, this is the way to go. It's a pioneering time. I am sure there will be more dedicated setups to do the inference efficiently, as right now it seems still very clunky at the hardware level. It would also be great to compare an EC2 instance spun up on demand to do the inference too. This comparison would give an option to protect your IP/security but perhaps get a cost-effective solution that doesn't act as a room heater.
Thanks for making this video! I needed to watch something like this!
I wonder if the benchmark uses the latest MLX Apple library; when I switched to it (on LM Studio), it was an incredible difference. I can't wait until Ollama adds it!
Ollama uses the llama.cpp code-base for inference. And llama.cpp doesn't use MLX.
I admit that I did not benchmark MLX against llama.cpp recently; I have to look into it. MLX, in my understanding, can use GGUF files, but only limited quantization variants. Q4_0 seems a good compromise, but llama.cpp can do better.
I got the K9 and 96GB after your first video about it, and am happy with the output speed and power consumption for the price. I did get a silent fan for it though, cause it is loud. I only really do coding work so I think the Mac would be great, but the RTX is just overkill. I'm very curious about the next gen of Intel processors so please do that test. I don't care about the RTX honestly, cause if I needed something that good I could put that money towards Claude or chat gippity online and not have to deal with the heat, power consumption, and maintenance, and run a huge model. I personally don't see the benefits of running a 7B parameter model at lightning speed. I'd rather run a bigger model slower at home. Great video! So much to think about.
I'm curious how the Mac Studio with M2 Ultra would compare with the RTX 4090
Depending on the task, I think it would do pretty well. It won't be as fast on smaller models, but it will destroy the RTX on larger models.
@@AZisk It can run pretty large models, but the low speed makes it close to unusable for real time interaction.
@@brulsmurf I use 123B models on my M1 Ultra 128GB. It can be as slow as 7 tokens per second, but I find that still usable for interactive chat. I’m more into quality than speed.
There is quite a lot of talk about this in r/localllama. For LLM work 4090 is faster than an M2 Ultra. You can get more memory on a M2 Ultra, but you can also get multiple 4090 (or perhaps even better, 3090) and put in a computer for typically less money (and better performance).
If you get over 90 GB requirements (4 xx90 cards) then neither setup is really fast enough to be useful. (When you get to about 1 token per second you're probably better off just renting a machine online or buying tokens from e.g. OpenAI.)
This video is exactly the answer for me. Thank you!
thanks for a great and interesting video! I have a Linux machine with a nvidia card and a macbook pro m3, and personally I really care about the noise and heat from the linux/nvidia machine - after a while it's bothering. the mac makes zero noise, and no heat and is always super responsive. in my opinion, it's incredible what Apple has built. thanks again for your fun and cool technical videos!
I am pretty sure you heard about Asahi Linux, so will you make a video on it?
it seems like he already did 2 years ago.. Did something change?
@@ALCE-h7b
No, they are still doing what they were doing before, but I suppose they took a big leap now with the release of their 'Asahi game playing toolkit', which in short makes playing AAA games on a Mac very feasible
Maybe if you wanna learn more, why not read their blog.
Maybe I need to revisit it. Cheers
That gigantic rig with the rubber fingers is comedy gold. Like you couldn't trigger them remotely, or just test them not simultaneously. Clearly it's for laughs.
Testing weirdly tiny computers for local LLM performance with a cable-attached GPU, though. Who would seriously use machines like that for this purpose?
Really interesting to see the Mac mini and how well it performs. As someone who owns a Mac mini M1, its simple design, relatively small footprint, exceptionally quiet operation, and low power consumption are all pluses. I think with my experience on the original Apple Silicon, I would definitely go with the M2, especially since for good chunks of the day this system would be idle.
Great video! Thanks for doing not only the comparison, but also the analysis for the second half of the video.
I can’t wait for you to test the new Intel Core CPU that just came out. Since you use the iGPU on the NUC, please mention the speed of the DRAM in your next comparison video, as it can have an impact on results (just like DRAM can have an impact on the performance of a video game). I hope a desktop machine with a 4090 makes it into your next comparison video. Even a machine with a mini-ITX MB with a 16x PCIe slot would be much better
It is not only the hardware, but the CUDA (Compute Unified Device Architecture) framework that allows developers to harness the massive parallel processing. The question is whether Apple will develop an MLX framework more suitable for AI development.
Really nice video tx
What about performing the test on the Qualcomm dev kit machine? Interesting to see how the snapdragon performs
I'm not sure how I feel about you reading all our minds but I'm glad you made this
I'll be totally honest... I don't have the slightest clue what's happening in this video but the little bit that I could understand seems really cool lol.
Best comment 😂
I was actually interested in building a dedicated LLM server, but after a lot of looking around for language models, I realized most open source "coding" focused LLMs are either extremely outdated or plainly not good. Llama is working with data from 2021, DeepSeek Coder 2022, etc. Unfortunately the best models for coding purposes are still Claude and ChatGPT, and those are closed.
love the comparison. can't wait for a next, maybe try the 4090 in a pc so it can actually stretch its legs and copy the data much quicker to vram
I was waiting for that video, thanks for the comparison
Can you compare some AMD GPUs in these tests? They're significantly cheaper, especially for the extra RAM. So I'm curious if the 'less performant' AMD GPUs can match a 4090 or a lesser Nvidia GPU, or whether it's a linear RAM = performance result, like a 20-24GB AMD vs a 16GB Nvidia?
What about using a real motherboard with full PCI-E GEN 5 and DDR5 RAM? Where is the bottleneck?
Is there a limit or benefit to doing tons of RAM to either not bother with a GPU or to supplement a GPU? How can you use both the RAM & GPU?
Lastly, VERY curious about the latest AMD and Intel CPUs coming that are supposed to be more power efficient and 'built for AI' 🍻
thank you. i do have some amds i was planning to check out for this. not sure it will do much better than intel igpus
I ended up with a used studio M1 Ultra with 128GB RAM. I can run any model up to 96GB RAM, and often do. It could be faster, but the important thing is I have few limitations for large-ish models and context. What really competes with this?
Great test! This is exactly what I'm interested in, especially the idle power test is important, as you won't be running inference 24/7 usually. Amazing to see that "mini PC" with the eGPU takes more power in idle than my full blown desktop PC with a 4090 😅
Alex, a quick question about running machine learning on these machines. Is there any input from the AI chips of these PCs?
what do you think about the new mini pro? could be pretty good, no?
thanks yet again !
@Alex for me it is even more interesting how much longer the Mac Mini will take on a finetuning session than a 4090 - only inference is a bit boring
Well, yes, of course we want to know how the 200H core ultra series will perform. Have you heard any news about when Asus is releasing the Nuc 14 AI? I haven't seen any public information yet.
I'm surprised the OCuLink doesn't create more of a performance hit than it did. My 4090 plugged directly into the mobo gets about 145 tokens/sec on llama3.1:8b versus the ~130 tokens/sec that you got. Kind of makes sense since the model is first loaded into memory on the GPU.
Token generation during inference is largely memory-bandwidth bound (OK, with some minimal impact from quantization calculations), and the 4090 has >1TB/s. And the LLM runs entirely on the 4090. The 4090 really shines during (batched) prompt processing, blowing the Intel/Apple machines away - probably >20x faster than the M2 Pro, and way, way more so than the Intel CPU/GPU.
Alex, thank you, and please comment on the NUC's shared video memory management (in general, because the smaller test probably fit in either/both), especially when switching back and forth between a CPU test and a GPU test. Would this be Windows-managed, or would you be changing parameters? Thank you again.
Great video! Thanks
In the verdict you forgot to mention that the only way to run bigger LLMs is with Macs, if you have enough unified memory. Yes, it's going to be slow, but still possible, and faster than CPU.
You mentioned it in the beginning though :)
Hi Alex, will a M4 pro mac mini with 24/64GB RAM beat the 4090? thanks
nice video thank you, now that you have some M4 minis did you run the benchmark on them to compare with the M2 from that video?
This was a great video and I'm interested in the Intel chips too. Do you know if Intel has something equivalent to Apple's MLX?
Have you thought about running such tests on the new AMD 8700 on tiny PCs? I heard they have a pretty good iGPU
How about a used RTX 3060 12gb VRAM or RTX 4060 Ti 16Gb VRAM in place of RTX 4090? Will it beat the mac mini setup in terms of performance?
This is what I want to watch about tech. Amazing man! ❤
Glad you enjoyed it.
@@AZisk Great work sir. No one else does these things except you. 👌🏻
Did you consider also the Nvidia dev kits ? Like the Nvidia Jetson AGX orin 64gb ?
12:37 Shouldn't quality of responses also matter? I imagine the quality that the 4090 spits out is better than the mac in terms of accuracy and quality?
@@Gome.o why would it be? it’s the same model.
@@EugeneYunak Performance isn't just raw hardware power - it's how well your software is optimized for it. Think of it this way: Current LLMs are like engines tuned specifically for NVIDIA's H100s and Google's TPUs. Sure, Apple silicon is powerful, but running these models is like having a Porsche tuned to Prius specs - you've got the hardware muscle, but the software can't tap into it. We don't have LLMs yet that can fully saturate Apple silicon's potential. Does that make sense?
@@art-thou-gomeo first, no it does not, i say that as a developer. i don’t think we need to go into details here, i certainly won’t because your general point is correct but it does not matter in this specific context. second, the question here was on “accuracy and quality”, not performance so i don’t understand why you bring it up. unless you specifically limit the runtime “quality” is going to be the same between mac and rtx
@@EugeneYunak It seems like we are using different meanings. I'm using accuracy/quality to mean the quality of the language outputs, and it seems you're talking more about technical accuracy - is this right?
I probably overhyped the hardware quality gap initially. While platforms handle floating point math and RNG differently, it's not some dramatic "4090 = big brain" situation. But from my tests I consistently get better outputs on NVIDIA. When I ask for creative stuff like poems about life's transient nature, the NVIDIA version just flows better and feels more polished.
Not being a dev, I'm genuinely curious - what would cause this? I know it could be confirmation bias, but I've seen this pattern across many tests. Any thoughts?
@@art-thou-gomeo it could be that the software you are using to run the prompts places more aggressive execution limits on the inferer on non-Nvidia hardware, in terms of execution time, tokens, memory availability (smaller models). if the exact same execution is performed on a 4090 and an m2, they should produce similar output in terms of quality. in fact, if you set temperature to zero, you should get deterministically identical results. slower hardware will just arrive at it later - sometimes dramatically. e.g. if i ran a huge model that doesn’t fit into vram on the gpu, but fits into mac’s unified memory, we would be talking orders of magnitude (which is why this particular test in the video was capped at a size that fits into ram/vram on all machines). in fact, mac machines are currently the only viable local development platform for huge ML models, which is why the claim that development optimizes for gpus is outdated if not wrong entirely - i haven’t seen anything in the code anywhere that would produce different output based on hardware. there certainly are hardware-specific optimizations, but these only help it run faster, not produce different output.
so yeah, either the software you are using limits the inferers to give you outputs in what the devs consider reasonable time (which is a reasonable thing for user-facing software to do), or you are running non-identical software, or it’s confirmation bias :)
I'm curious to see the first Strix Halo APUs running this test with ROCm + Ollama. I'm dreaming of 96GB VRAM allocated and 128GB total for the system. What do you think?
Didn’t you add a lot of memory to the Intel Box? Was that in your cost of ownership?
I did, but in this case, the model I was showing wasn't using all that memory and could easily run on the 32GB that the machine came with.
@@AZisk But if you had that box, you probably wouldn't be able to resist adding 64GBs of memory 🤣
Thanks for this content and comparison. I think as this becomes more mainstream, there will be less nerdy setup’s and software to use this on a pc, Mac or tablet in the near future. 👍🤖
well, I have some questions about the test method. you don't close and reopen the LLM every time you need it; mostly you load it once and it runs for a long time serving multiple queries, so load time has little effect on daily use. second, most of the time you'll put the LLM on a mini server (you can use a mini PC or any PC for that matter; I don't have a Mac so I can't speak to them) and just reach it over LAN, so heat can be little to no problem.
but for the last point, how big a model can it run, and do you need the biggest model? like, do you need a storyteller for coding, or a coder for storytelling?
and lastly, who needs what and for how much? as long as you have an internet connection, even the cheapest solution here costs around 2+ years of a subscription plan.
and I use local LLMs, so I can say it's a good option when you don't have an internet connection, but it's not for everybody.
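On the subscription comparison, a quick break-even sketch (all dollar figures here are placeholder assumptions, not numbers from the video):

```python
def breakeven_months(hardware_cost, monthly_sub, monthly_power_cost=0.0):
    """Months until local hardware pays for itself vs. a subscription.
    Ignores resale value, performance and model-quality differences."""
    monthly_saving = monthly_sub - monthly_power_cost
    if monthly_saving <= 0:
        return float("inf")  # never breaks even
    return hardware_cost / monthly_saving

# e.g. a $599 mini PC vs. a $20/month plan, with ~$1/month of electricity
months = breakeven_months(599, 20, 1)   # about 31.5 months, or 2.6 years
```

Which roughly matches the "2+ years" figure, before counting the bigger models a subscription gives you access to.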
Hello Alex, could you try Mixtral 8x22b on the 96GB Ram mini pc? I am really curious to see the speed and results with that setup.
I'm considering an M4 Max because of the 128GB unified RAM. I have a 4090 but it's locked to ~11B models due to the low VRAM. Thoughts?
at 15:10 you show a document that says both that the M2 16GB costs $0.11 per day and $0.03 per day, and that the yearly costs are $40 and $12 per year. I looked closer because 3 cents per day seems suspiciously low.
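For what it's worth, a few cents per day is plausible for a mostly-idle machine. A quick sanity check (the 10 W average draw and $0.13/kWh rate are assumptions, not the video's exact inputs):

```python
def cost_per_day(avg_watts, price_per_kwh):
    """Daily electricity cost from average draw: watts -> kWh/day -> dollars."""
    kwh_per_day = avg_watts * 24 / 1000
    return kwh_per_day * price_per_kwh

daily = cost_per_day(10, 0.13)   # about $0.03/day, or roughly $11/year
```

So $0.03/day and ~$12/year are consistent with each other; the $0.11/day figure would correspond to a much higher average draw or electricity rate.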
I love the test; I would love to see the speed of a beefed-up M4 laptop, and then maybe using 1.58-bit inference
What about running the LLM on M2's NPU?
Very interesting, just a question. Can the mac npu improve the performances ?
I’d like to see which cooling system would work best per price/kWh level- how much it’d cost to keep your machine cool and operative
I think the 4060 ti 16GB should have been included in the comparison.
It seems like the best-value solution.
It combines a low price, good performance, low power consumption and a fairly compact size. I think for local LLMs it would be the best option from Nvidia for the average user.
Totally agree M2 Pro would do it for now. I gave thumbs up up up
Great comparison Alex, it just shows that if a model is tuned to run on specific hardware it will outperform in terms of efficiency. However, I saw an article today that shows Microsoft open-sourced bitnet.cpp, a blazing-fast 1-bit LLM inference framework that runs directly on CPUs. It says that now you can run 100B parameter models on local devices.
Will be waiting for your video on how this changes everything.
LLM inference is largely determined by RAM bandwidth. The newest SoCs (Qualcomm/AMD/Intel/Apple) almost all have 100-130 GB/s. While the M2/M3 Max has 400GB/s, the Ultra 800 GB/s, the 4090 has >1TB/s. And all the new CPUs have very fast matrix-instructions, rivalling the GPUs for performance. 1.5-Bit quantization might be some future thing (but not in any useable model), but currently 4-Bit is the sweet-spot. Snapdragon X Elite CPU Q4_0_4_8 quantized inference is already similar in speed to M2 10-GPU Q4_0 inference with the same accuracy.
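Those bandwidth numbers give a handy back-of-the-envelope ceiling: each generated token has to stream essentially all the weights through memory once, so tokens/sec is bounded by bandwidth divided by model size. A sketch (the 4.5 GB figure is a rough estimate for an 8B model at 4-bit):

```python
def peak_tokens_per_sec(bandwidth_gb_s, model_size_gb):
    """Upper bound on generation speed for a memory-bandwidth-bound LLM:
    every token reads all weights once, so rate <= bandwidth / model size."""
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 4.5   # ~8B parameters at 4-bit quantization (rough)
soc   = peak_tokens_per_sec(120, MODEL_GB)    # typical new SoC, ~27 tok/s
m_max = peak_tokens_per_sec(400, MODEL_GB)    # M2/M3 Max, ~89 tok/s
rtx   = peak_tokens_per_sec(1000, MODEL_GB)   # RTX 4090, ~222 tok/s
```

Real numbers land below these ceilings (KV cache reads, kernel overheads), but the ordering tracks the measured results pretty well.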
Thank you for doing this. I wonder about this stuff, and wish that there was a bit more content on GPUs and LLMs. Like is it better for LLMs to get 2 AMD cards or 1 4090? or even 2 A770s?
You can actually measure the heat VERY EASILY because every watt consumed is converted to heat with EFFECTIVELY 100% efficiency! It's just resistive heating with a few extra steps, but all those extra steps only produce heat as a loss anyway. Technically a very small amount is lost to things like vibrations and maybe UV radiation leaving through an open window or whatever, but you can basically ignore that for all practical purposes.
1 W of power consumption is about 3.4 BTU/h, and 1000 J is about 0.95 BTU. So in your case, a single run consumes (rounding for easiness) let's say 5 BTU for the Intel, 2.5 for the M2 and 4 for the RTX.
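The conversion is just units, since essentially all electrical draw ends up as heat. A tiny sketch:

```python
JOULES_PER_BTU = 1055.06   # definition of the BTU

def watts_to_btu_per_hour(watts):
    """Sustained draw -> heat output: 1 W is 3600 J/h, so divide by J per BTU."""
    return watts * 3600 / JOULES_PER_BTU

btu = watts_to_btu_per_hour(100)   # a 100 W machine puts out ~341 BTU/h of heat
```

Handy for sizing cooling: a gaming PC at full tilt is a small space heater.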
Does the RTX 4090 not keep the model in memory between short requests? A 2-second time to first token would only impact the first request.
Why did you choose the Intel vs the AMD 8945HS?
cool, but where is the AMD Ryzen 7?
I'm curious about the performance of AMD Strix Halo when it launches
Excellent comparison! For me, an interesting future comparison would be between the RTX setup and an equally priced Mac Studio. And then re-compare after the new M4 Pro/Max is available (hopefully next week?).
I’m curious to see how it would run on Mac Studio. At $2000, it’s closer to the higher end one you shared.
Is the drop in performance for rtx in egpu large when compared to using it inside a pc
Thanks
I like this style of video and 100% want to see more, especially with continuously updated test as hardware is updated like an m4 device or if you get a mac studio or ultra, intel, etc.
Hey Alex. Thank your for this kind of comparison. Often I only see speed matrixes, this with average energy consumption, speed, initial cost and heat generated was exactly what I wanted to see. A full array with multiple tests. Thank you, I will definitely see more if you make more power/speeds/cost/Wcost/.... Comparison. ! Thank you.
It is interesting to multiply the runtime of each setup by the cost, to get a ”bang for the buck” measure. I run both the NUC with 96GB and a PC with an RTX 3080. The first is the slowest but it ”can” run very large LLMs. The latter is fast but needs smaller, lower-precision models. I wonder if the Apple is faster on Stable Diffusion?
Interesting analysis! Two things:
1) I believe heat is pretty much just the energy consumed. There's no chemistry involved, just physics, so aside from negligible light and sound and a small amount of work from spinning fans, a computer using 100 watts is essentially a 100W heater.
2) The Intel mini pc loses in speed and efficiency, but its real strength is the amount of RAM. You could run a 70b model with minimal quantization and get better output than either of the other machines is capable of (though quantifying that and putting it on a chart would be difficult).
can you please compare the performance of the Ryzen AI onboard chipset.
Any thoughts on minisforum AMD APU
Intel mac trashcan with 12 core xeon and D300 AMD graphics card with 64gb Ram - is it any good for LLM? Better on Mac OS? Or Windows 11? Windows 10?
12:00 Are you sure? Normally ollama will keep the model in memory for a few minutes. So if you do something like coding, the model should already be in VRAM for most queries.
The default is just 5 minutes, but if you increase that it might affect your cost substantially. Trade off.
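For reference, ollama exposes this as the `keep_alive` field on its REST API (and as the `OLLAMA_KEEP_ALIVE` environment variable server-wide). A sketch of a request body; the model name is just an example:

```python
import json

# keep_alive controls how long the model stays loaded after the response:
# a duration string like "30m", 0 to unload immediately, or -1 to keep it
# resident indefinitely. The server default is 5 minutes.
payload = {
    "model": "llama3.1:8b",
    "prompt": "Why is the sky blue?",
    "keep_alive": "30m",
}
body = json.dumps(payload)   # POST this to http://localhost:11434/api/generate
```

With a long `keep_alive`, only the first request after a reload pays the model-copy-to-VRAM cost.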
Great! I test a lot myself. Could you make a comparison of cloud-based vs local LLM energy consumption and the effect on our planet?
I wonder how it performs on an M1 Mac mini; I'll have to give it a try. Rather than selling it I might repurpose it for this use case. Also for that RTX 4090, you mentioned there's a warmup where the model is copied to the VRAM. Does this happen for each request or does it stay in VRAM for a while?
You should be able to keep it in the vram. There is a 'keep alive' variable when starting which decides how long the model is kept.
However, if you are developing stuff that needs the GPU itself I don't know if both are allowed to stay there.
Just what I needed, have been carrying the idea of a multi mac mini setup for a while now, especially with the use of exo labs.
I wonder how the new Mac mini M4 chip will perform. If the M2 is already fast, the M4? 🤔
Hi Alex, can you make a video about running LLM code editors like Zed or VS Code with an extension in a local setup? I'm planning to buy a GPU but I can't find any video demonstrating local code generation.
Thank you for making this. I think the Mac Studio could be comparable to the 4090? Having said that, there is a 192GB option for the Mac Studio, and that's not even mentioning the potential new M4 Mac Studio.
If the model can fit in 16GB, the Nvidia 4060 Ti 16GB is a solid performer, you should give it a try. That card can even run Codestral/Mistral Small at 4-bit quantization at decent speeds (my old 3060 12GB was just a couple gigs short on VRAM)
I hope its possible to include performance comparison for training and fine tuning models at some point in the future.
I'm guessing 4090 wins hands down, but maybe a higher end apple silicon isn't too far behind?
It's just that if I'm spending 4090 or Max/Ultra money I will want to do more than just inference on it.
For inference tasks it would be cool to include a Raspberry Pi with a Hailo NPU setup for ultra-budget-tier comparisons as well.
Can you do a same test for stable diffusion or flux models?
You assumed that the wattage was constant, which it likely isn't. Your outlet meter can measure the total energy usage.
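Right, the meter's cumulative kWh reading is the easy way; the DIY equivalent is to log (time, watts) samples and integrate, instead of assuming constant draw. A sketch with invented sample values:

```python
def energy_wh(samples):
    """Trapezoidal integration of (time_seconds, watts) samples -> watt-hours.
    Captures the spike during generation and the drop back to idle."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += (p0 + p1) / 2 * (t1 - t0)
    return joules / 3600

# idle -> ramp to 250 W during a 30 s generation -> back to idle
samples = [(0, 10), (5, 250), (35, 250), (40, 10)]
wh = energy_wh(samples)   # about 2.44 Wh for this run
```

Multiplying one wattage reading by the run time would over- or under-count depending on when you glanced at the meter.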
Would be interesting how those RTX small-form-factor cards would perform, max 75W, powered by the PCIe bus only... Also: if a model was trained on Nvidia, does running it on other hardware give different results?
Anything on Jetson Orin nano??