Has anybody tried this setup for LLMs? Which would do better at LLM processing (training, inference, RAG, etc.) and at running llama3.1 (70B): 2x M4 base minis with 32GB RAM and a 256GB SSD each, linked over Thunderbolt 4 with the load distributed, vs. 1x M4 Pro with 64GB RAM and 512GB? This I wanna see if you can pull off. Very curious about the effectiveness of a small cluster vs. an all-in-one system.
Alex, other than testing stuff for videos, what do you need all this for? Wouldn't it be far easier and far cheaper to just use the pro version of Claude, Gemini, or ChatGPT instead rather than running all these models locally? Seems like you are spending a lot of time and money on a problem that has already been solved.
@@HadesTimer There are a lot of things you want to run locally, not in the cloud. There are restrictions on these online services, but locally you can do whatever you want.
I have a question regarding the installation of SQL Server Management Studio (SSMS) on a Mac. Specifically, I would like to know if it is feasible to install SSMS within a Windows 11 virtual machine using Parallels Desktop, and whether I would be able to connect this installation to an SQL Server that is running on the host macOS. Are there any specific configurations or steps I should be aware of to ensure a successful connection between SSMS and the SQL Server on macOS? Thank you!
Didn't understand the point of this video. If you already have a 64GB laptop, why use others to run LLMs? Even on a shared NAS, it will run on that 64GB one only. Why would anyone have multiple laptops lying around?
Oh my god my plan will be to get 4 mac mini's (base config) and build a 64GB 480GB/sec cluster hahaha those 3 thunderbolt ports make a perfect fully connected 4 node cluster :D
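For anyone checking the math on that plan, here is a rough Python sketch of a fully connected 4-node cluster. The per-node figures (16 GB of RAM and 120 GB/s of memory bandwidth for a base M4 mini) are assumptions inferred from the 64 GB / 480 GB/s totals above, the host names are made up, and adding up bandwidth across nodes is an optimistic upper bound rather than anything measured.

```python
from itertools import combinations

def full_mesh_links(nodes):
    """Return every point-to-point cable needed to connect all nodes directly."""
    return list(combinations(nodes, 2))

nodes = ["mini-1", "mini-2", "mini-3", "mini-4"]  # hypothetical host names
links = full_mesh_links(nodes)

ports_per_node = len(nodes) - 1        # each node connects to every other node
total_memory_gb = 16 * len(nodes)      # assumed 16 GB per base mini
aggregate_bw_gbs = 120 * len(nodes)    # assumed 120 GB/s memory bandwidth per mini

print(len(links), ports_per_node, total_memory_gb, aggregate_bw_gbs)
```

Four nodes need six cables and three ports each, which is why three Thunderbolt ports per mini is exactly enough. A fifth node would need four ports per machine, so the full mesh stops at four.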
Come on, Alex!!! Three of the new Mac Minis together!!!
I'm now waiting for three mac minis with m4 pro. Will share the results here as well!
4x 3090 for price of single m4, mac - no way thank you
@@akierum how did you even come up with that number
@@Peterotica he made it up
3? What about 5!
Imagine this with the thunderbolt 5 120Gb/s bandwidth, so much potential
But the M4 Chip has a WiFI 6E wireless card
I thought that, but apparently that's just for screens - actual DATA bandwidth is more like 80Gb/sec. And honestly I feel like 40Gb/s is just fine for compute with good locality :D
@@isbestlizard That's a bummer, but it makes sense, 120 is 30 shy of the M3 Pro internal memory bandwidth (I know they are not the same), imagine buying 2 m4 base models and chaining them, double of everything at $1200, maybe in like 3 - 5 years that would be possible
@@isnakolah nah, 120 Gbps is only 15 GB/s, so 10%
@@isbestlizard Thunderbolt 5 transfer speeds have 2 options:
1. 80/80 (simply twice that of TB4, 2 wires in each direction)
2. 120/40 (3x/1x that of TB4, 3 wires in one direction, 1 in the other)
they use the same 4 wires, just allocating them differently
displays are obviously going to use 120/40, since you need a ton of outgoing bandwidth vs very little incoming (assuming you aren't running thunderbolt devices chained downstream)
external ssds are probably best on 80/80 which runs pcie gen4 x4, unless we get gen5 chips and gen5 ssds so you can choose to either read or write at up to 12 GB/s with just 4 GB/s available in the other direction, as needed
I really have no idea about GPU bidirectional bandwidth, but TB5 is at least twice as good
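The unit mix-up in this thread (Gb/s for links, GB/s for memory) is easy to check with a few lines of Python. This is only a sketch: the 150 GB/s memory figure is taken from the comment above rather than from a spec sheet, and the conversion ignores protocol overhead, so real transfers land lower (the 12 GB/s mentioned above versus the raw 15 GB/s).

```python
def gbit_to_gbyte(gbit_per_s):
    """Convert a link rate in gigabits/s to gigabytes/s (8 bits per byte, no overhead)."""
    return gbit_per_s / 8

MEMORY_BW_GBYTE_S = 150  # assumed M3 Pro memory bandwidth, as quoted in the thread

# Thunderbolt 5 modes described above, as (outgoing, incoming) in Gb/s
modes = {"symmetric": (80, 80), "asymmetric": (120, 40)}

for name, (out_gbit, in_gbit) in modes.items():
    out_gbyte = gbit_to_gbyte(out_gbit)
    share = out_gbyte / MEMORY_BW_GBYTE_S
    print(f"{name}: {out_gbyte:.0f} GB/s out, {gbit_to_gbyte(in_gbit):.0f} GB/s in, "
          f"{share:.0%} of memory bandwidth")
```

For the 120 Gb/s direction this gives 15 GB/s, or 10% of the quoted memory bandwidth, matching the correction above.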
Thanks! I bought the NAS. When are you going to hook up the 4090 also?
thanks! 🤔 i don’t know if it can do both cuda and non-cuda in a cluster, but I’d be curious to find out
@@AZisk I saw a person on Reddit who had hooked up a Mac Studio with a 4090 pc through a thunderbolt bridge. It should be possible.
@@rafaeldomenikos5978 That's just networking though... the real question is whether the EXO software supports the heterogeneous cluster.
You should make more videos about this to compare how wifi vs ethernet vs thunderbolt connections impact performance, how larger models run, and other stuff that you are in the unique position to experiment with.
I am sure this was a lot to figure out. But it seems unreasonably easy to setup. I am so impressed. Hope you refine it and keep telling us about it.
I've been waiting for someone to do this test forever! I had a feeling you'd be the first. :D
This is one of the main reasons I felt comfortable pulling the trigger on the base-model mini. I expect to use it in some fashion even a decade (or two) from now.
You are killing it with the AI vids dude... 10/10!
You do not need to restart your terminal to have environment variables take effect. You just edited your zshrc; You can run export VARIABLE=VALUE and it takes effect in that session only, or after editing your zshrc you can run "source ~/.zshrc" and it will reload the config immediately. You can also source other files for that matter and have multiple shell configs active so to speak. It basically just runs the file
I am super excited for the Mac Mini Cluster we are going to see ;-)
HAH yes everyone has this idea XD
Considering how you can't spec out the Mac Mini with that much RAM or anything better than an M4 Pro, buying these for clusters would be highly expensive for the performance you'd be getting. A base model 32 gb m1 max Mac studio costs around $1000 used and still outperforms the M4 Pro Mac mini. I even bought a 64 GB m1 max macbook pro with a broken screen for $1230 on ebay.
@@meh2285 Well... You are not thinking big enough, I'd say.
Yes you can get single machines, that might be outperforming 2 or 3 mac minis. Think 10. Think 20. Think 100.
Now it gets interesting, because we are speaking about a more commercial use. No company will buy used and broken macbooks. And the value proposition of 100 Mac Minis (base M4 Pro model) vs 75 Mac Studios (base M2 Max model) is intriguing. I am not saying that I calculated all of that to the end, but I kind of see this sort of thing happening. Not only for LLMs but for all kinds of clusters. Mac Minis have been used for that in the past and will be used for that in the future ;-)
@@meh2285 the point of clustering is so that you don’t have to spec up the base model and buy multiple of them instead. I imagine two m4 in a cluster will smoke a m4 pro
@@beaumac Yeah but it's still a terrible price to performance tradeoff to get Mac Minis for clustering. The base model is a great value on its own, but it's not ideal for clustering at this price, it just doesn't have that powerful of a gpu. The higher end models are an even worse value for clustering, considering you can get a new M1 Studio with 64 gb of ram on Ebay for $500 less than a spec'd out M4 mini ($2200 at base storage) that will have worse GPU performance and 2x more storage. Also, the second you buy a second base model Mac Mini, you could have gotten a 32 gb Mac Studio for that price, which would perform better due to still having more GPU compute and less latency from clustering. Unless you really need two computers for some other reason, clustering with Mac Minis is a bad idea.
2:16 environmental variable. Loved it
@Alex your Qwen 2.5 14B Instruct Q4 should run on your MacbookPro 64GB without needing the exo cluster. Are you seeing the same performance then ?
@@ErikBussink This is what I was wondering too. Would be more interested in seeing a larger model running on a cluster of smaller machines that can't possibly run them on their own.
I was thinking the same, I don't see the exo cluster advantage or llama3.2 405B running
I think it might even be faster because of the lack of external communication with the other computers.
I'm interested to see the token per second difference between running llama 3.1 70b on the 64gb MacBook compared to the tok/s on the cluster with the thunderbolt configuration. Also why not try llama 405b so we can see how fast it is?
Would you mind posting a run of that final test that works? Only difference being multiple calls across the cluster to the same model. I'd love to see how it parallelizes that type of workload and what the resulting tokens/sec ends up being.
Please compare 4x Mac mini base model with 1x 4090 :D
this is the one we are all waiting for
You should compare two 32gb Mac minis running on exo vs. one 64gb Mac mini. Because two 32gb minis cost the same as one 64gb.
Which is better? Two m4 gpu connected over thunderbolt 5 or one m4 pro gpu running the model on a singular system!
how does this actually work? you're not actually sharing the compute power right? basically it determines to which computer to send the query to, and then that computer shares the result with the one you're working on? would combining 3 of the same computer be beneficial or just repetitive?
This is a nice POC. Another great video from Alex.
I would definitely prefer having 10Gb switch and having everything connected to it (there are some 8 ports for 300USD). More stable to actually work, and probably, less messy.
Maybe getting a mini PC with a 10Gb port and some decent amount of memory? It's a shame Apple has such a big tax on memory and storage upgrades.
There is also Asustor 10Gb SSD only NAS device with 12x SSD slots (Flashstor 12 Pro).
Great! Instead of buying 16 GPUs and really large expensive boards, power supplies, cases, I can just buy 3 MBP M4 Max 128 gigglebytes and cluster them together for only $15,000!
This is awesome, can't wait for the M4 Mac Mini LLM review! Could you consider a video about the elephant in the room, multiple 8gb gpus clustered together to run a large model? There are millions of 8gb gpus that are stuck running quantized version of 7B models or just underutilized.
Alex, this is an interesting setup. I would like to see more of your results when clustering these machines together to consume various LLM workloads, especially the larger models.
if you are buying a MacBook, make sure it has the larger storage
Oh my, you’re killing it 😮 Great job!
10 base m4 Mac mini cluster here I come
@@op87867 cooooool
you want at least the base m4 pro mini for tb5
@@Kaalkian but then the value drops already massively, i think those 600 usd base models are great, just use more of them, cheaper than upgrading RAM or CPU
Wow. You literally pointed out everything I have going on. Now I know to connect my two trash can Mac Pros via thunderbolt and run things on my NAS instead of my DAS. Currently running llama3.2 just fine. Llama3.1 doesn’t even budge. Maybe two or three together can bring my access to an 8B llm. Or maybe I can just fine tune llama3.2 and run it more efficiently
I am wondering if this could help to run a 70B or bigger model, given that I have 2-3 Apple Silicon Macs with 64GB of RAM?
I was looking for the location variable! Absolutely thanks
As a follow-on to your presentation today: what if I wanna run Llama 3.1 70B or even 405B on a distributed computing setup?
What do you need the NAS for? The models run on GPU memory, not RAM.
finally a good exo explanation. thanks alex!
2 questions: 1. does this bridge support jumbo frames (the default 1500 bytes seems too small)? 2. Why CIFS and not NFS? NFS seems to be about 10-20% faster on MacOS
Can you please show how you moved it to the SSD. Like transferring llama to external storage
If I have this kind of cluster set up, how do I access the cluster from my main machine that is not part of the cluster?
Thanks Alex!! Can you do an experiment with Exo running across PC and Mac? I have a PC with a 4070 and 3 MacBooks (m1, m2, m3 pro). Also a setup where the PC is the NAS to save us some money :D
Could you please verify the functionality of connecting three computers to the Terramaster F8 SSD Plus by utilizing each of its USB ports?
Alex. While I appreciate the use of the SSD-only file server, couldn't you have direct-attached the storage to the MacBook Pro and done file sharing over the Thunderbolt Bridge, pointing your cache "to the same place"? A Mac share instead of SMB on the NAS would seem to be efficient and eliminate the WiFi-to-storage bottleneck. Just wondering if there would be a measurable impact.
Can we also fine tune locally with multiple devices using exo!?
Nice! Glad you gave it a try! Maybe next round Thunderbolt 5 on the new minis! ;)
Question: Can nodes be different OS? 😅
Hi Alex, I am following your instructions and almost got it. I have a question regarding the Thunderbolt bridge.. is it just file sharing? I am already screensharing so I can see from one Mac, but now I want to have Exo run on the 4, but not use the network because it never works for me. Please let me know Thanks! Javy
Hi this is very helpful. i am curious if you could run LLM benchmarks of the various M4 models you have and see if an increase in GPU core counts make a difference, if so how much of a difference.
Can you use the cluster model with multiple Macs for any program? LightWave? Final Cut Pro? Basically, do I have a supercomputer for everything? Or does EXO only help you run LLMs?
where is llama3.2 405 running ? 🤔
Please test with various context windows 8k/ 32k/128k and especially with longer prompts > 1000 tokens.
They’re very short on basic documentation. Any ideas how I can manually add an LLM to exo, so that it appears in tinychat? Maybe you can do a video about it?
I am having the same problem, how do you set ollama to save to SSD?
You should try same but with SAN
Rust compile time comparison with M4 _vs_ older M Series please.
Can you make a video with 4x the 16GB new mac mini in a cluster, or even better 4x 64GB to make 256GB of VRAM 🙂
Dear Alex, I follow your channel for the language models, specifically for the MacBook Pro with Apple silicon. I congratulate you for your very precise and detailed content.
I have a question.
Can a Llama3.1 70b Q5_0 model with a weight of 49GB damage a MacBook Pro 16 M2 Max with 64GB ram?
I ran, on the MacBook, 2 models. (Mixtral8x7b Q4_0 26GB and Llama3.1 70B Q5_0 49GB).
When the 26GB one was running, the response was more fluid and quiet and the memory graph in the monitor looked "good", with a certain amount free and no pressure. When I ran the 49GB one (Llama3.1 70B Q5_0) it was not so fluid, and the Mac also made an internal noise that was synchronized with the rhythm of each word the model answered. In addition, the memory monitor showed that there was memory pressure.
So far so good. Just that detail. The problem came when I decided to reset the MacBook with a clean installation of the operating system and erased the installation from Disk Utility (as instructed by Apple), then I exited Disk Utility and clicked on install macOS Sonoma. The installation began, it showed 3 hours of waiting, and everything started well. After about 6 minutes of installation, the screen image turned into a poor quality image while fading out in areas (from bottom to top) until it disappeared. In that screen image you could see green lines and dots as well. All this happened in a second. It never gave me an image again, only a black screen. You could only tell that the MacBook was on by the keyboard lighting, and if I turned off the office lights you could see a very faint white flash in the center of the screen. I connected a screen by HDMI but you couldn't see anything either, just a black screen.
I can see it's the video card. Do you think the memory pressure from the heavier model could have overloaded the MacBook Pro? Or do you think it was a matter of luck and it has nothing to do with language models?
I ran the models with Ollama and downloaded them from the same page.
Thank you very much for reading me,
Greetings
Humm so I guess one more question this brings : is it better to go for one m4 pro with 48GB of ram or 2 m4 with 24GB each to run local LLMs since it would be the same price
So the endgame is to get a couple of base mac minis m4?
Thanks for the testing! I am very interested in this but don't have the extra hardware to try.
I bought two cheap m1 Max Macs with 64 gb of ram for this use case
@@meh2285 and what are your results ?
Hey Alex! Your videos are great!
I’m considering getting a MacBook Pro but not sure which model would be best for my needs. I’m a data science and machine learning student, so I’ll mostly use it for coding, data analysis, and some AI projects. Since I’m still in learning mode and not yet working professionally, I’m unsure if I need the latest high-end model. Any recommendations on which model or specs would best fit my use case? Thanks in advance!
Great video! 🔥
Actually NAS is not required. First Networking via Thunderbolt cables, and then assigning internal or external drives or TB DAS as LLM sources should be faster.
Why is the inference performance different per machine? Are they sharing the GPU cores too or just the VRAM? Because based on the output you are getting the VRAM bandwidth is around 300 - 400GB/s
Hey, which new M4 configuration should I get if I want to play around with local LLMs?
Love to see this in a M4 Mac mini cluster
what about SD X Elite chips?
I'd like to see you experiment with the models that won't fit inside a desktop GPU (RTX 4090 maxes out at 24GB of VRAM). With the Mac Minis going up to 64GB of unified memory, a couple of them should be able to run most 70B models without any quantization.
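A quick back-of-the-envelope check on whether a 70B model fits, sketched in Python. It counts weights only (no KV cache or runtime overhead), uses decimal GB, and assumes "without any quantization" means 16-bit weights. The 5.5 bits per weight figure for a Q5_0-style quantization is likewise an assumption used for illustration.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights in GB, ignoring KV cache and overhead."""
    return params_billion * bits_per_weight / 8

print(weights_gb(70, 16))   # unquantized 16-bit weights
print(weights_gb(70, 5.5))  # assumed ~5.5 bits/weight for a Q5_0-style quantization
```

Under these assumptions the 16-bit weights alone come to 140 GB, more than the 128 GB that two 64GB machines offer, while the quantized estimate of roughly 48 GB is close to the 49GB Q5_0 download mentioned elsewhere in the comments.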
This setup can use models with more parameters, right? For example, how much better can a 14B be than using chatgpt 4o in normal chat? In which cases: stories, code, etc.? If someone can help me understand this, I'd really appreciate it!
Nice one Alex, what's the effective t/sec across them from your testing - say it's x for 128 gigs on a single device versus 64+32+16+8+8 nodes where the model needs two or more machines to run it. Think it won't hit more than 0.5x even with a thunderbolt bridge to pass around stuff.
So, what about having a lot of Raspberry Pis connected with exo and the NAS, running a very large LLM?
Thunderbolt 5 m4 64gb ram x3 - is it going to be a 192gb gpu memory cluster ?
You can use Spark ? To load model on the RAM
I remember rendering times in Final Cut & Compressor, on many machines - the same problems :)
So rather than upgrade a Mac mini, you just buy more of them?
Couldn’t you create a smb share on one Mac and then point the other MacBooks to it? Over thunderbolt loading the model should be even faster.
Can we run the biggest 405B Llama 3.2 model on this Apple Silicon Cluster?
I didn't try the 405 because even the 70 was a pretty lengthy download. However, with 2 128GB MacBooks, you'd be able to. That's what Alex Cheema did
Is it possible to do with windows laptops?
Can you test the new M4 Pro base variant (16/512)? I was looking to buy this variant.
So does this mean that if you have enough hardware/laptops, you could download and run the Llama 405B (230GB) model, and the running of it would be spread across all the nodes (albeit likely slowly)?
This is impressive. If this was for a real-world use case, I’d implement these optimizations:
- Don't use the NAS, since it introduces a single point of failure and is much slower than directly attached storage. For best performance, the internal SSDs are your best choice. Storing the model on each computer is OK. This is called "shared nothing".
- Use identical computers, both to get comparable results and because, as a hypothesis, slower computers slow down the whole cluster. You would need to measure it with Activity Monitor.
- Measure the network traffic. Use a network switch (or better, two together with Ethernet bonding for redundancy and extra speed) so that you can add an arbitrary number of computers to your setup.
- Measure how well your model scales out. If you have three computers and add a fourth, you would expect to get one third more tokens per second. The increase that you actually get, relative to the computing power you added, defines your scale-out efficiency.
- Now you have a perfect cluster where you can remove any component without breaking the complete cluster. Whichever component you remove, the rest would still function.
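The scale-out efficiency idea from the list above can be sketched in a few lines. This is only an illustration: the function name is mine, the tokens-per-second numbers are made up, and it assumes identical nodes so that the ideal throughput grows linearly with the node count.

```python
def scale_out_efficiency(tps_before: float, nodes_before: int,
                         tps_after: float, nodes_after: int) -> float:
    """Fraction of the ideal speedup actually achieved when adding nodes.

    Assumes identical nodes, so ideal throughput grows linearly with node count.
    """
    actual_gain = tps_after / tps_before - 1.0
    ideal_gain = nodes_after / nodes_before - 1.0
    return actual_gain / ideal_gain


# Made-up measurements: 3 nodes at 12 tok/s, then 4 nodes at 14 tok/s.
eff = scale_out_efficiency(12.0, 3, 14.0, 4)
print(f"scale-out efficiency: {eff:.0%}")
```

Going from three to four nodes ideally adds one third more tokens per second, so a result of 1.0 means perfect scaling and anything lower shows how much of the added hardware is lost to overhead.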
Awesome, does this exo tool work for clustering x86 mini PCs (Fedora) + MacBooks?
Is it sharing only RAM but not the compute resources?
Why is the ping latency an entire millisecond? Shouldn’t thunderbolt be faster than that? Isn’t it external PCIE?
Between the M3 Max (30-core GPU, 36GB RAM) and the M4 Pro (48GB RAM), which one should I choose?
Off-topic question for Apple users: is it possible to have the terminal with a black background?
yes, it's possible!
@alexexoxoxo thanks!! I'm saving up to get a Mac and seeing that white terminal scared the hell outta me 😂
Has anybody tried this setup for LLMs (which would do better at LLM processing, training, inference, RAG, etc.)? Would this run Llama 3.1 (70B)?
2x base M4 mini (32GB RAM and 256GB SSD each, linked over Thunderbolt 4 with the load distributed) vs. 1x M4 Pro with 64GB RAM and 512GB. This I want to see, if you can pull it off. Very curious about the effectiveness of a small cluster vs. an all-in-one system.
Are you able to run the cluster across macOS and Windows?
Alex, how did you change the default ports? Mine keeps coming up on 52415 no matter what flags I give it on launch.
That's the new hard-coded port. It used to be 8000; now it's this.
Thanks for adding MacBook Air m2 base model...😊
Get Mac mini pros, you will have 120Gb thunderbolt connection for the cluster.
CAN’T WAIT FOR YOUR M4 video
Could you cluster a bunch of Mac mini base models? At $600 they have to be the best bang for the buck.
Waiting⏳ for m4 max
I saw someone connect 4 minis and run an LLM.
Alex wth, u r crazy!!!
11:30 -
Alex, other than testing stuff for videos, what do you need all this for? Wouldn't it be far easier and far cheaper to just use the pro version of Claude, Gemini, or ChatGPT rather than running all these models locally? Seems like you are spending a lot of time and money on a problem that has already been solved.
@@HadesTimer There are a lot of things you want to run locally, not in the cloud. There are restrictions on these online services, but locally you can do whatever you want.
@@passionatebeast24 very true, but that's hardly worth the expense. Especially if you are using image generation.
Can you try a Mac mini M4 Pro cluster?😂
With Thunderbolt 5
What is the use case of running your own LLM?
The maintainers of the project definitely groaned very loudly when you tried to run two different models at once😂
sure did :D
we'll fix it tho
oops. Nah, I asked Alex Cheema about this and he said they are planning to work that out as a feature down the line.
I have a question regarding the installation of SQL Server Management Studio (SSMS) on a Mac. Specifically, I would like to know if it is feasible to install SSMS within a Windows 11 virtual machine using Parallels Desktop, and whether I would be able to connect this installation to an SQL Server that is running on the host macOS. Are there any specific configurations or steps I should be aware of to ensure a successful connection between SSMS and the SQL Server on macOS? Thank you!
Yes it is possible
Didn't understand the point of this video. If you already have a 64GB laptop, why use others to run LLMs? Even on a shared NAS, it will run on that 64GB one only. Why would anyone have multiple laptops lying around?
Who is still waiting for the M4 machine review?
Cool Stuff
Oh my god, my plan will be to get 4 Mac minis (base config) and build a 64GB 480GB/sec cluster hahaha, those 3 Thunderbolt ports make a perfect fully connected 4 node cluster :D
Would this be cheaper than an m4 ultra setup? I haven't done the math, but it feels like it's not.
How did you get to the 480GB/s? Aren't you limited to the Thunderbolt transfer speed?
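My reading of the 480GB/sec figure, as a small sketch: it looks like the sum of each machine's local memory bandwidth, not the speed of the links between them. The constants below are assumptions on my part (120 GB/s unified memory bandwidth per base M4 and 40 Gb/s per Thunderbolt 4 link), and the variable names are only for illustration.

```python
NODES = 4
MEM_BW_GB_S_PER_NODE = 120   # assumed: base M4 unified memory bandwidth, GB/s
TB4_LINK_GBIT_S = 40         # assumed: Thunderbolt 4 link speed, Gb/s

# Aggregate local memory bandwidth: each node reads its own share of the weights.
aggregate_mem_bw_gb_s = NODES * MEM_BW_GB_S_PER_NODE

# Each link, converted from gigabits to gigabytes per second (theoretical peak).
link_gb_s = TB4_LINK_GBIT_S / 8

# A fully connected mesh needs one port per peer and one cable per pair of nodes.
ports_per_node = NODES - 1
total_links = NODES * (NODES - 1) // 2

print(f"aggregate memory bandwidth: {aggregate_mem_bw_gb_s} GB/s")
print(f"per-link bandwidth: {link_gb_s} GB/s")
print(f"ports per node: {ports_per_node}, cables: {total_links}")
```

So the links are far slower than local memory, and how much that hurts depends on how much data actually has to cross them during inference.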
Where's m4 Mr Alex