Please test the M4 Max with Flux Dev and video generation.
Hope you are working on a WaveSpeed tutorial for Flux and LTX for us Mac users who could really use the speed boost.
I’m getting about 5 minutes using the full (not quantized) Flux dev model on an M3 Max 64GB. 16GB isn’t enough to hold the model in memory alongside all the other system stuff that’s already using that 16GB. You should note the swap utilization when running a generation to get an idea of how much data is being moved between memory and disk.
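In case it helps, here's a rough way to watch swap while a generation runs (my own sketch, assuming the third-party psutil package is installed; on macOS it reads the same numbers as sysctl vm.swapusage, and Activity Monitor or Stats will show the same thing):

import time
import psutil  # pip install psutil

def log_swap(interval_s: float = 5.0, samples: int = 12) -> None:
    # Print swap usage every few seconds while the image generation is running.
    for _ in range(samples):
        swap = psutil.swap_memory()
        print(f"swap used: {swap.used / 2**30:.2f} GiB of {swap.total / 2**30:.2f} GiB ({swap.percent:.1f}%)")
        time.sleep(interval_s)

if __name__ == "__main__":
    log_swap()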
Sounds like you've been ripped off
@greendsnow What would you recommend instead at this system's price?
I wouldn’t recommend a Mac if you are only looking for fast, high-res image generation with full-sized Flux dev models. (Not why I have this particular system, btw.) But having 64GB or more of memory that the GPU can access directly is good for large models of any kind.
@paultparker Second-hand NVIDIA GPUs with 24GB. My old 3090 is way, way faster.
@noahleaman The RTX 5090 will be presented soon. GDDR7 memory. :)
Great video! My Mini M4 Pro is due in a few days. Can't wait. I didn't purchase it with local LLMs in mind, but being able to do something with them is a plus. The extra memory bandwidth should help. For reference, my close-to-retirement RTX 3090 (power limited to 65%) has an eval rate of 52 tokens/s with the 14B model and 28 tokens/s with the 32B one. Nonetheless, the M4/M4 Pro power efficiency is exceptional.
Thanks bro, I was going to buy it to replace my M2 Pro. But now I feel that I should just stick with what I have and use online solutions instead of running offline.
Ollama does not support MLX yet, unlike LM Studio. MLX has been reported to provide up to 40% faster inference compared to llama.cpp, which is currently the backend used by Ollama.
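If anyone wants to try the MLX path themselves, here is a minimal sketch using the mlx-lm package (the model repo is just an example from mlx-community, and the exact API may shift between versions):

from mlx_lm import load, generate  # pip install mlx-lm (Apple Silicon only)

# Example 4-bit community conversion; swap in whatever model you are comparing.
model, tokenizer = load("mlx-community/Qwen2.5-14B-Instruct-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Write a Python function that checks whether a string is a palindrome.",
    max_tokens=256,
    verbose=True,  # prints prompt/generation tokens per second, handy for comparing with Ollama
)

With verbose=True it reports tokens per second, so you can line it up against the eval rate Ollama prints.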
Helpful for my research into the Mac mini M4 Pro... :) Thanks!!!
This video is really useful thank you!
Qwen2.5-Coder 14B model, same question on a GTX 1080 Ti:
total duration: 14.016461817s
load duration: 35.829686ms
prompt eval count: 35 token(s)
prompt eval duration: 137.915ms
prompt eval rate: 253.78 tokens/s
eval count: 314 token(s)
eval duration: 13.697011s
eval rate: 22.92 tokens/s
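If anyone wants to reproduce these numbers, running ollama run <model> --verbose prints the same stats. You can also pull them from the API with the Ollama Python client (rough sketch; the model tag is just the one I used, and the durations come back in nanoseconds):

import ollama  # pip install ollama; assumes a local Ollama server with the model pulled

resp = ollama.generate(
    model="qwen2.5-coder:14b",
    prompt="Write a Python function that checks whether a string is a palindrome.",
)

# Durations are reported in nanoseconds, so divide by 1e9 to get seconds.
prompt_rate = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
eval_rate = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")
print(f"eval rate:        {eval_rate:.2f} tokens/s")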
So I’d be better off buying an old 1080 Ti instead of a Mac mini? Damn. Saved me $1500.
@LumiLumi1300 Purely for LLMs, yes.
Really useful information 🎉
Thanks for doing this and showing the usage stats. Seems like buying the extra RAM is worth it for any kind of generative AI or LLMs.
Honestly, I just tested Ollama the same way you did and got almost the same kind of output from the same prompt. But with my old Nvidia 3090 it was more than 100 tokens/s. So even though this M4 is fast, it's nowhere near as fast as a decent Nvidia GPU.
I'd be worried about my SSD long term with that much swap use. Good video though. I will check out your tutorial on running Flux and installing ComfyUI.
Glad it was helpful!
For use as a code assistant with Continue.dev, I heard these Macs are not that great, since the prompt has to be reprocessed repeatedly as the user iterates to tweak the code output. For a normal development task, do you find the performance OK with the Qwen Coder models on this specific Mac?
Good test!
Just wondering whether to get this base Mac mini (never used a Mac b4 😜) or stick with my Windows AMD 5600 + 7800 XT desktop with 16GB RAM?
What would you recommend for local LLM use? I suppose upgrading the Mac mini to 32GB of RAM or more would give access to larger models at the same tokens/s?
I think a discrete GPU is still faster, but its VRAM is limited. So the Mac's unified RAM is an advantage, since it lets you fit larger models.
This is a great test. Would you test the Flux PixelWave model? The q4 Dev is 6GB and works on my Mac with an AMD 16GB GPU.
Thanks! It should have a similar speed to the Schnell q4.
Thanks Ollama
What is the name of the RAM monitor app?
It's called 'Stats'; see my previously uploaded video for details: ua-cam.com/video/USpvp5Uk1e4/v-deo.html
Thank you for sharing this video. 👍 I now understand that the base model probably won't allow me to run open LLMs beyond 14 billion parameters locally. The only remaining question I have is whether 24GB of unified memory can handle 32 billion parameters.
Based on my research, I would not expect it to, unless the model is quantized down to around q5. Are you OK with that quality trade-off? Also, instead of just bumping the RAM, I would step up to the Pro chip, which comes with the RAM bump plus more GPU cores and better memory bandwidth, I believe.
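A rough back-of-the-envelope for the memory question (my own estimate, not from the video; real usage is higher once you add the KV cache, context, and macOS itself):

def approx_model_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    # Weights only, with ~20% headroom; KV cache and OS are extra.
    return params_billion * 1e9 * (bits_per_weight / 8) * overhead / 2**30

for bits in (4, 5, 8):
    print(f"32B at ~q{bits}: ~{approx_model_gb(32, bits):.0f} GB")

That works out to roughly 18 GB at q4, 22 GB at q5, and 36 GB at q8, which is why 24GB of unified memory is borderline for a 32B model once the OS and context are counted.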
Thank you very much for your video and your work. I was looking for something like this to decide which one to buy for running LLMs locally. Thanks.
Glad it helped!
Nice one! Can you please let us know what app you use to measure the CPU, GPU, and RAM percentages shown in the video?
Thank you! The tool is called 'Stats'. I uploaded a video about it at ua-cam.com/video/USpvp5Uk1e4/v-deo.html
@tech-practice9805 Thanks a lot. I actually installed iStat Menus, but it doesn't seem to be open source, so I will try this Stats one. Thanks for that.
I just got the M4 Pro Mac Mini with 64 Gigs of RAM for the best AI inference
Is there a 64GB version?
Versions with the M4 Pro chip are available with 24, 48, or 64GB.
@cbuchner1 How is it working for you? I've been considering the same.
Why don't you buy a PC with an RTX card? Btw I like Macs, but for AI stuff this is expensive.
@devmely I am coming from that side. I'm fed up with the power consumption, fan noise, inflated GPU pricing, and lack of VRAM in the consumer space.
How did the SDXL base model perform?
I didn't test SDXL, but it should be faster than Flux.
That's where/when we Mac users have CUDA envy.
For the price, I think it's worth it compared to an RTX card.
Did you try Flux dev on this? Can it generate images, and if so, how long does it take?
The GGUF Flux dev works. Times are about 5x those for Flux schnell.
Oh my goodness.
Schnell took more than 2 minutes for 4 steps!!! Those are horrible times, and unfeasible for using dev. Apple needs to improve their tech a LOT or they are going to be behind all the time. (And I am an Apple lover.) But these times... are really bad.
There should be room for optimization. LLMs work quite well on ARM CPUs.
And how much is the GPU alone that you are comparing to the Mini?
No they don't, they should just start selling upgrades for $500 each. 24GB of RAM? +$500!
@kornerson23 Can you substantiate this with a comparably priced machine with better performance? What about one with comparable performance?
This tiny machine is half the price of an RTX 4090.
What a lousy M4 chip... Flux took 3-4 minutes to generate?!? You can throw that into the garbage bin for any AI inference work.
they can run smaller models fast