Oh my, it's great that someone is making content of this depth ❤️
Even though I could afford to get a 4080 or 4090, I refuse to pay extortion prices. Nvidia has gotten too greedy. So glad I have my M2 Max with 96GB to do fun ML research projects.
Will you finally be upgrading to the M4 this year?
x86 will be joining the party once Strix Halo launches.
I'm riding this out on my 4090 until we get some clarity on where local models are going. The general trend seems to be fitting the same performance into smaller models over time.
@@djayjp Certain tasks - LLM token generation being a major one right now - are memory-bandwidth limited. Assuming the next Max Macs keep a 512-bit bus with faster memory, they will have the highest bandwidth. Strix Halo will be for the admittedly sizeable market of people who hate Macs, hate Apple, or both.
Outside of that the upcoming Max will have the technical advantage. Can AMD undercut on price? Maybe, but not guaranteed.
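To put rough numbers on the bandwidth argument, here is a back-of-envelope sketch in Python - the bus widths and memory speeds below are assumptions for illustration, not confirmed specs:

```python
# Rough peak-bandwidth arithmetic (assumed bus widths and memory speeds).
def peak_bandwidth_gb_s(bus_width_bits: int, transfer_rate_mt_s: int) -> float:
    """Peak memory bandwidth in GB/s = bus width in bytes * transfers per second."""
    return bus_width_bits / 8 * transfer_rate_mt_s * 1e6 / 1e9

# Hypothetical configurations, purely for comparison:
print(f"{peak_bandwidth_gb_s(512, 8533):.0f} GB/s")  # ~546 GB/s, a 512-bit LPDDR5X-8533 "Max"-class chip
print(f"{peak_bandwidth_gb_s(256, 8000):.0f} GB/s")  # ~256 GB/s, a 256-bit LPDDR5X-8000 APU
```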
With AMD exiting the high-end market, the cost may increase. The new 5090s will require a 16-pin connector and 600W from what I have read.
Would Thunderbolt networking speed up the cluster at all? Are they just communicating over Wi-Fi?
I only tried Wi-Fi. He might be using TB.
@@AZisk you can use TB too!
WiFi would definitely be a latency and throughput bottleneck. Thunderbolt may take some extra CPU cycles, but the throughput increase certainly won’t hurt. Not sure how well TB does on latency, but I’m sure it is better than WiFi unless there is heavy protocol-inherited latency.
@@Zaf9670 really? If each node is only running certain layers, then the only communication you're getting is on the order of the context size times the hidden dimension. So if the context is 1024, that's 1024 x 768 numbers.
I think a much bigger factor is the immense number of matrix multiplications. That's what is slowest.
Unfortunately distributing the model this way, you're only as fast as your slowest node.
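A quick sketch of that communication estimate (the 768-wide hidden state and fp16 activations are assumptions; with a KV cache, each decode step only ships the newest token's activations):

```python
# Back-of-envelope: data crossing a layer-split boundary, assuming fp16
# activations and a hypothetical 768-wide hidden state.
hidden_dim = 768          # assumed model width
context_len = 1024        # prompt length
bytes_per_value = 2       # fp16

prefill_bytes = context_len * hidden_dim * bytes_per_value   # whole prompt at once
decode_bytes = 1 * hidden_dim * bytes_per_value              # one new token per step (KV cache)

print(f"prefill transfer per boundary: {prefill_bytes / 1e6:.2f} MB")  # ~1.57 MB
print(f"decode transfer per token:     {decode_bytes / 1e3:.2f} KB")   # ~1.5 KB
```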
@@acasualviewer5861 as fast as your slowest node, that's bad
Just starting to mess around with the idea of using clusters for my home AI server, and hoping that the 48GB M4 Mac mini I've got coming will play nicely with my existing 64GB Ryzen 9-based mini-PC system with the 12GB RTX 3060 (hey, I'm on a budget) on an Oculink-dock. If I can get 70b models running okay with those two under Exo, it will be useful for my particular application (writing and research assistant).
Cool! Everyone that makes AI models more convenient and accessible is a hero in my book (that includes you, Alex). Currently I'm running the smaller Mistral model on my base-model M2 MacBook Air. I am considering buying a Mac mini or Mac Studio when the new ones come out, and this might be what I need to run the larger models. Mistral is great, but I want to use it in combination with fabric and for that it just does not cut it. Keep it up Alex, you make me look smarter at work with every video ;)
I've tested exo before and ran into a lot of the same issues that you were experiencing, and this was on a 10GbE network. I haven't tried it again after the failed attempts, but I do think that this kind of clustering could be very powerful with even smaller models. If it supports handling multiple concurrent requests and exo acts as a "load balancer" for them, then you could have one entry point into a much larger-capacity network of machines running inference. This is as opposed to finding your own load-balancing mechanism (maybe HAProxy) to balance the load, where you would still have the issue of orchestrating each machine to download and run the requested model.
You can cluster Mac minis using Thunderbolt 5, which gives you 80Gb/s. That's supposed to give ~30 tokens per second on a 4-bit quantized 70B param model.
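A sanity-check sketch of whether the Thunderbolt 5 link itself would be the bottleneck during decode (the hidden size and activation precision are assumptions):

```python
# Is an 80 Gb/s Thunderbolt 5 link the bottleneck for token generation?
link_gbit_s = 80
link_bytes_s = link_gbit_s / 8 * 1e9          # 10 GB/s

hidden_dim = 8192                              # assumed hidden width of a 70B-class model
activation_bytes = hidden_dim * 2              # fp16 activations for one token

transfer_s = activation_bytes / link_bytes_s
print(f"per-token hop over TB5: ~{transfer_s * 1e6:.1f} µs")  # ~1.6 µs, tiny next to compute time
```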
That's why I'm waiting for a Mac Studio with M4 Max/Ultra. 256GB for big models with a good SoC will soon be essential... or already is...
Anyway... As an iOS dev I'm using 20-40B models; they are heavy but not too much, they can respond in reasonable time, and they don't use 50GB+.
Try 10 mini PCs, all with 96GB.
Anyone with a Raspberry Pi cluster could have some fun, although the mini PCs with the extra RAM would be more cost-effective.
I tried it with an Nvidia Jetson Nano cluster and the results are amazing.
I tried other similar options, i.e. the Raspberry Pi AI Kit and Google Coral; in comparison to the Nvidia Jetson Nano they do not even stand a chance.
@@aatef.tasneem Very interesting. Do you happen to have the exact specs of your setup to share with us?
Hi Alex,
Very nice video, but I had to smile a bit because of the test setup.
I have a cluster running at a customer, but for a different application and this technology can really bring a lot in the area of performance and failover. I am enthusiastic when cluster computing becomes generally more available and usable.
It is very important to build a high-performance, dedicated network for cluster communication. With Macs, this is quite easily possible via a Thunderbolt bridge. I recommend assigning the network addresses manually and separating the subnet from the normal network (sketch below).
With 40 Gbit/s you have something at hand that would otherwise cause a lot of work and cost (apart from the expensive cables).
Of course, it is better if all cluster nodes work with comparable hardware, which simplifies the load distribution, but in general different machines are possible.
In your case, unfortunately, a base Air, which on its own can hardly handle the application, is more of a brake pad than an accelerator, as you impressively showed.
A test with two powerful Macs would be interesting.
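A minimal sketch of the manual Thunderbolt-bridge addressing suggested above, assuming the macOS service is literally named "Thunderbolt Bridge" and that 10.10.0.0/24 doesn't clash with the normal LAN:

```python
# Pin the Thunderbolt bridge to its own subnet on macOS (service name and
# addresses are assumptions - check -listallnetworkservices first).
import subprocess

def run(*args: str) -> None:
    print("$", " ".join(args))
    subprocess.run(args, check=True)

run("networksetup", "-listallnetworkservices")
# Manual IP on a subnet that does not overlap the normal LAN.
# Syntax: -setmanual <service> <ip> <subnet mask> <router>; on an isolated
# point-to-point bridge people often just point the router at the peer.
run("networksetup", "-setmanual", "Thunderbolt Bridge",
    "10.10.0.1", "255.255.255.0", "10.10.0.2")
```

Doing the same in System Settings works just as well; the point is keeping cluster traffic off the Wi-Fi subnet.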
The idea is super cool, I’d love to be able to use multiple computers to accomplish more than what is possible with just one. It seems to me that it addresses the issue of expensive graphics cards… which is probably the next best alternative… a ‘modeling host’ with powerful graphics cards being available over the network to smaller ‘terminals’.
You could try to run the tool in Docker containers with one shared storage volume over the network for the model. That would help with the disk space issues.
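A rough sketch of that idea using the Docker SDK for Python - the image, mount path and ports are assumptions, and sharing one model store between concurrently writing containers may need care:

```python
# Two runtime containers on one Docker network, both reading models from a NAS mount.
import docker

client = docker.from_env()
shared_models = "/mnt/nas/models"   # one NFS/SMB share holding the downloaded weights

for i in range(2):
    client.containers.run(
        "ollama/ollama",                      # any OpenAI-compatible runtime would do
        name=f"llm-node-{i}",
        detach=True,
        network="llm-net",                    # create beforehand: docker network create llm-net
        volumes={shared_models: {"bind": "/root/.ollama", "mode": "rw"}},
        ports={"11434/tcp": 11434 + i},       # expose each node on its own host port
    )
```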
Good day to you :) thanks for the content.
lol when he was looking for the safetensors I was thinking “please be in the HF cache, please be in the HF cache” and of course, this Alex fellow is wonderful. Means this will be simple to drop into current workflows. 405B should fit well across 4 Mac Studios with 192GB 👌 next question will be whether it can distribute fine-tuning.
Use a NAS ... wouldn't have to download multiple times ... and point to a shared directory.
🤔
Good idea - and also RAID on SSDs to boost performance, then compute as a cluster.
@@AZisk create a shared directory for your huggingface hub cache that points to the shared NAS directory.
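A small sketch of the shared-cache idea, assuming the NAS is mounted at /Volumes/NAS and that the tool in question honours the standard Hugging Face cache location:

```python
# Point the Hugging Face cache at a NAS share so every node reuses one download.
import os
os.environ["HF_HOME"] = "/Volumes/NAS/huggingface"   # set before importing huggingface_hub

from huggingface_hub import snapshot_download

# Example repo id; substitute whatever model the cluster actually runs.
path = snapshot_download("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
print(path)   # resolves inside /Volumes/NAS/huggingface/hub/...
```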
@@AZisk can you try a comparison of hardwired vs WiFi? Like the above comment, what about involving a NAS?
What about all the exact same computers vs 1 PC of the same power vs multiple PCs with varied power?.. so we can see where the benefits actually come from?
I suspect a significant portion of your delays are from networking and waiting on the slower, maxed-out PCs to catch up and assist the powerful one. BUT the benefit here is offloading a giant model that wouldn't fit on a single machine... there's no way networking is going to be faster in tokens per second vs a CPU/GPU/RAM all in the same system (rough numbers on this below).
So where do you gain performance, and where are the diminishing returns? Can you use 2-3 low-power mini PCs, like the new Intel and AMD mobile chips about to hit, and actually do better at scale than 1 bigger PC that can just barely handle a big request on its own? Because each of the small PCs can also still do smaller things on their own, running tasks in parallel, but pair up for big tasks, whereas a single PC will only be able to do one task at a time regardless?
Lots of questions that can be tested here and going cheap but many vs expensive single devices.
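On the "networking won't beat a single box" point, a rough upper-bound sketch - model size, quantization and per-node bandwidth are all assumed figures for illustration:

```python
# Rough upper bound on decode speed when a model is pipeline-split across nodes,
# assuming each node is memory-bandwidth bound.
model_bytes = 70e9 * 0.5            # 70B params at 4-bit ≈ 35 GB of weights
node_bandwidth = [273e9, 273e9]     # two hypothetical nodes, 273 GB/s each
shards = [model_bytes / len(node_bandwidth)] * len(node_bandwidth)

# Decode is sequential through the pipeline: each node streams its shard per token.
time_per_token = sum(shard / bw for shard, bw in zip(shards, node_bandwidth))
print(f"upper bound: ~{1 / time_per_token:.1f} tokens/s")   # ~7.8 tokens/s

# With equal nodes this matches a single node that could hold the whole model,
# so splitting buys capacity (fitting the model), not speed.
```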
It's for people who have two MacBook Pros with 256GB of RAM on a plane.
I wonder if you could share the models and network via thunderbolt
A callback to the good old days of Beowulf clusters for Unix. I picked up 5 old HP mini PCs with an Intel 6-core CPU, 1TB NVMe and 64GB of RAM in each. These are all on my 10Gb in-house Ethernet, so I'll give it a go and let you know. Great video, thanks.
Is it possible to run it with a Mac plus a Windows or Linux machine in one cluster?
You should investigate and do more videos on these clustered LLMs.
I know a use case. College kids gather their macbooks together and forge essays.
I realize this may be a major leap in complexity, but would you consider a couple of videos on customizing LLM models to introduce local content?
this is a good start but the problem is still that it's trying to load the entire model on each machine. A better solution would be to share the model across machines and access it in a bittorrent type of style. Not sure how that will work though.
might have to try
a great project and idea, maybe the next step could be the addition of shared memory from the cloud
Were all the laptops connected via WiFi or hardwired?
Firewalls blocking comms between the systems?
Naively, I’m surprised this would work without an even bigger hit on net performance than you found: I’d think that partitioning the model across machines would be tricky: You somehow split the weights, then calculate the two halves (or multiple shards) of the matrix math separately?
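That intuition is roughly right for a layer-wise split: each machine owns a contiguous stack of layers and only activations cross the boundary. A toy sketch with made-up sizes:

```python
# Toy sketch of a layer-wise split: each "node" owns some layers and only
# activations cross the boundary (weights and sizes here are made up).
import numpy as np

rng = np.random.default_rng(0)
hidden = 768
layers = [rng.standard_normal((hidden, hidden)) * 0.02 for _ in range(8)]

node_a, node_b = layers[:4], layers[4:]          # shard the stack of layers, not each matrix

def forward(shard, x):
    for w in shard:
        x = np.maximum(x @ w, 0.0)               # stand-in for a transformer block
    return x

x = rng.standard_normal((1, hidden))             # one token's activation
x = forward(node_a, x)                           # runs on machine A
# --- only this (1, hidden) activation travels over the network ---
y = forward(node_b, x)                           # runs on machine B
print(y.shape)                                   # (1, 768)
```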
If you could simplify and extract some of the connections, you might be able to make a grid. But you'd wind up powering a lot of hardware.
Interesting project. It would be interesting to see a video where you explain these models fairly simply. Right now you mention numbers like memory, tokens/s, and X number of parameters. Can you please explain them for those of us not so into LLMs?
This is so cool, but I think it would be even cooler if it were possible to get multiple VPSs, connect them, and run the model on them.
Supposedly you can add Nvidia or Linux etc., with something like tinygrad as a backend for exo.
I design air-gapped AI inference systems; I do my initial tests on 30x Raspberry Pis to focus on efficiency. Obviously dedicated GPU memory is not possible. Maybe this teamed with the about-to-be-announced M4 Mac mini will be the next evolution. It also derisks accidentally running up a bill of thousands of pounds on a cloud-based test lab.
Hi man, can you compare AMD and Nvidia cards when running Ollama - something like the AMD 7800 XT vs the 4060 Ti? Thanks.
Try using Llama 3.2 90B on a Mac Studio M2 Ultra.
It's a great solution for a small business getting into the ML/AI realm while keeping their research in-house. Scrub the Macs and go for some lower-cost gaming PCs. Install a base Linux setup and kick off 3-4 nodes. Under $5K, and an amazing solution.
This would be a great idea to try with all my Raspberry Pis which are collecting dust on my shelf. I wonder how old a Pi could be and still be usable.
So I think, like with RAM, it may be running at the lowest common denominator. Just because you put two sticks of RAM together doesn't mean one runs at 5600 and one at 4800; even though they're the same 16GB, they will opt to work together at the slowest speed available to both. Kind of like your motherboard communicating with the CPU and RAM: to operate simultaneously, everything slows itself down to the lowest common denominator.
Would be very interesting if you could try a 70B model on a new laptop/mini PC with 128GB RAM and the new Intel Core Ultra 7 (2nd gen) Processor 256V/266V running Linux and llama.cpp (compiled with AVX512 and SYCL). I don't know if there are any 128GB laptops out in the wild with the Core Ultra 7 (2nd gen) yet.
I think this would scale out better with equally sized computers and a fast network connection (10GbE).
12:40 - I can tell you easily: we have three MacBooks at home, each with 128GB, three iPad Pros with M4, each with 16GB, and two beefy Windows machines with 4090s.
That's roughly 480GB of "VRAM" altogether that's going to power my AGI for free.
I think you need thunderbolt bridge between the different machines to ensure low latency and speed.
Anything similar for Windows that doesn't rely on GPU VRAM??
What about 10 maxed out Raspberry Pis with NPU card and ssd in a cluster?
Thank you. I was struggling to find where the models were located as well. Really annoying that it is not documented and they make it so hard to find. Yeah, don't mind me, just dumping 100+ GB, don't worry about it, you don't need to know where it's at... lol
Does this work on Linux with cpu only? People with beefy home labs might REALLY enjoy it. :)
Kinda reminds me of swarm intelligence. A bunch of devs sitting together, all sharing some of the compute power of their PCs, forming a clustered AI that serves them all and, as a whole, has more performance and is smarter than simply the sum of each individual PC.
Maybe try the Meta Llama 3.2 light model.
Imagine what a cluster of new Mac Studio M4 Ultra 512GB can do. They would beat Blackwell compute cards.
you can find out where the model files are saved using dtrace
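If dtrace feels heavy, a quick psutil sketch can do something similar while the model is loading (the process-name filter is an assumption; adjust it to whatever is actually running):

```python
# List model files currently held open by any process whose name mentions "exo" or "python".
import psutil

for proc in psutil.process_iter(["name"]):
    try:
        if any(s in (proc.info["name"] or "").lower() for s in ("exo", "python")):
            for f in proc.open_files():
                if f.path.endswith((".safetensors", ".gguf")):
                    print(proc.pid, f.path)
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        pass   # skip processes we can't inspect
```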
Reverse proxy caching, proxy caching, and rsync will easily solve the downloading issues; download once and distribute locally at high speed
Your construct is a classic example of a bottleneck! The request enters a pool of resources, where its parts are divided across three instances, each waiting for the others to complete. Imagine three people are meeting up: one takes a rocket, another a speedboat, and the last one rides a bicycle. Sure, two of them will arrive quickly, but for the meeting to happen, all three need to be there. So, everyone ends up waiting for the one on the bicycle.
I am an old follower, since your sub numbers were in 4 digits.
A comparison of the Nvidia Jetson Nano and the likes of it would open up a lot more possibilities.
Alex, I like your T-shirt where did you get it?
Very nice video. However, I think that connecting the computers in a wired local network using thunderbolt cables should provide some improvement
Maybe it's limited by the networking switch and ports - maybe go with 10 gig.
I actually have an m3 max 64GB and an m2 air 8gb. I am so intrigued by this! If it works I can set it up with my studio in the office with an m2 ultra and 192gb! Now that’ll be a lot of Ram. Maybe 405b quantized ?😂
The best use case would be a small-to-medium corporation, a retail chain, or a learning institution that is looking to have its data trained on. Heck, if I had our farm's data, I would gladly run that model.
Except that EXO as currently setup is for inference only, not for training. For training you'd need a big server (on prem or cloud)
Just the thought of comparing the bandwidth of your RAM with the network overhead of fitting a model across 2 machines is depressing.
What internet do you have to download nearly 20 MB a second?? (Fiber?)
yes
@@AZisk 1, 5 or 10 Gbit down ?
@Garrus-w2h bro please stop flexing internet speed I can't even get more than 2 megabits(yes not even megabyte) per second 😭😭
It’s also right to wonder “who is distcc for?” and more importantly, can we get a generalized cluster architecture for modern computers so that every large application can take advantage of spare hardware? This could lead to some very large clusters organized by any group that needs it. Of course, it would undercut AWS and no one wants that! Meanwhile, watch those huggingface cache folders. They do get very large and should be cleaned and purged frequently.
Can you review mini PCs? I want to know if I can run LLMs on my SER 5 Max 😂
this setup helps if you want your own little farm without selling your soul
Pretty cool to see this
Use case is simple.
1. 2+ docker containers with models installed; eg mistral
2. Put containers on same docker network.
3.????
4. Profit.
Did you try to set up a proxy server for the LLM download?
It's high time that models are stored in network storage so they are shared by all the machines.
yep: example: Running LLM Clusters on ALL THIS 🚀
ua-cam.com/video/uuRkRmM9XMc/v-deo.html
Interesting, I guess faster networking cables/ports and faster hard drives could help.
Have you seen the Qualcomm Snapdragon Dev Kit for Windows Teardown (2024) from Jeff Geerling? Hopefully LLMs are going to work with NPUs in LM Studio soon.
still waiting for mine
Is that the fx3 or fx30?
fx30
@@AZisk I have one also, fantastic little camera
@@nyambe Yeah I like the instant access to ISO and Aperture controls. But the battery drains so fast!
Hmmmm, wonder if inference would be stable enough if getting 10 of my "Gamer pals" on a VPN and running Exo across zee Interweb?
I'm adding support for invite links where you can invite friends to join your exo network
thanks for this video... was about to go spin some shi up myself lol
Love this experiment. Been following your channel for the past 2 years since I got into ML, and I own an M2 Max Mac Studio with 32GB Unified Memory that I've used so far for ML. Happy with it, but also waiting for the M4 Max MacBook Pro so I can finally get a portable powerhouse. Saved up for a whole year for the upgrade and I'm planning on getting the 8TB SSD and 128GB Unified Memory version of the 16" model, maxed out - or maybe more unified memory if they add it on the M4 Max. Benchmarks so far for the leaked models from the Russian YouTubers seem like a good estimate for performance, but I can't wait to see the new ones coming out soon.
So many AI hype people talk about or show models running with llama.cpp etc., but only with a few prompt questions or simple toy code. Nobody shows a real implementation of these LLMs integrated into a real project. Alex, I would like to see videos of examples in your projects where you integrated AI models and the added value that brings to your software.
OMG you got patience. Cool? Yes cool.
I was literally just researching to see if anyone had done this yet!
This will be powerful when running a cluster on Intel mini PCs with 96GB of RAM.
I did a project like this back in 2004 using a Beowulf cluster with 9 Apple IIs, an AMD PC, an Intel/Nvidia PC and an Acer/Intel laptop. The 2 biggest bottlenecks were the Macs and the 10Mbit networking, but it was a good proof of concept. In my experience, any time you cluster, you're bottlenecked by your slowest component. Yeah, you can do it, but it's better for things like VMs and lots of small individual programs. Not to mention whatever software you're running has to be written or modified to take advantage of the distributed hardware. Just because you could, doesn't mean you should. The R.O.I. just isn't there.
They should upgrade the cluster-join automation to peer-to-peer transfer the model to new nodes if they don't have it. No reason to go to the WAN over and over.
AMD Strix Halo APUs with 256GB RAM to the rescue in 2025. You won't have to pay the Apple tax, and you can upgrade SSDs for a fraction of the price without having to resort to de-soldering NAND chips like on MacBook Pros or spending $4000 for an 8TB SSD.
Very cool 👏
Oh man, I only have one of those MacBook Pros with the highest M3 Max processor and RAM configuration 😅
Can you test whether nested virtualization is supported on M3 Macs in macOS 15?
this project could be run on the thunderbolt bridge, I think that should be more reliable
I believe the one who has fast and reliable cable networking)))
I don't know if they did, but it would be logical not to download models from the internet every time; once one machine has downloaded the model, it can serve the model to others. Even better, you don't need multiple copies of the model in the same network; it can be a single fast network drive.
Use case: individualised, personalised, aligned assistants.
you should've connected the macs with Thunderbolt 4 cables instead of the wireless network...
A pair of 3090s (or three 16GB 4060 Tis) can run 70B models - a reasonable compromise IMO.
😅😋 this sounds fun 😊
Hey, wanna give UV a try?
A group of M1/M2/M3 owners could build a cluster for LLM training and study it together as a group.
If only there were any Strix Point laptops with 4 RAM sticks... That would be 192GB memory with 4x48GB sticks. Then running things on a budget would be achievable
Petals did P2P LLM inference 3 years ago and it led nowhere. Memory bandwidth constraints make this inefficient. You trade off too much speed for this.
If you take the effort to put together 4 machines, you get way, way more VRAM per dollar and more speed by buying 2x A16s with 64GB of VRAM each for 3.3k, so you get 128GB of VRAM for 6.6k. Or you could make do with some RTX cards.
Also, I was laughing hearing "production ready". Such projects are barely ever production ready; even Ollama isn't :) We got tons of problems when trying to use inference solutions like that with our clients.
Wow
Nice vid. I would recommend not mounting your camera to the table; when you touch the table, the camera moves.
yeah, space is limited, otherwise I would love to have a nice tripod
Let’s think about clustering the machines. What are the solutions out there - Kubernetes? Any ideas?
As LLMs develop, sooner or later they will be equal to a human if we use 1000 MacBook Pros. :)
Use case is a dev team
this was instructional
Hi there. First of all, great video as always; many thanks for the effort, I really appreciate it 😊. Now to exo: I can imagine a huge on-prem data centre of, for instance, a certain automaker's R&D department running this across, let's say, 5 servers with 256GB RAM + 2 high-end GPUs each, interconnected with a high-throughput LAN and connected to another internal vector-DB cluster to enrich generated answers. This way you can easily utilize all the advantages that modern LLM models provide without sharing even a tiny bit of data with vendors like OpenAI. Another case would be classified environments, where you aren't connected to the internet at all. And don't forget, it is not only about chatting. You can, right out of the box, also integrate them into LangChain-powered applications. Such cluster projects should, in my opinion, also be very good at distributing multiple requests across themselves. I'm keeping my fingers crossed that this project makes it through to being stable.
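For the LangChain angle, a minimal sketch of pointing a LangChain chat model at a local OpenAI-compatible endpoint - the URL, port and model name are assumptions about whatever the cluster actually exposes:

```python
# Talk to a local, OpenAI-compatible endpoint instead of a hosted vendor.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:52415/v1",     # assumed address of the local cluster API
    api_key="not-needed",                     # local endpoints usually ignore this
    model="llama-3.1-70b",                    # whatever model name the cluster exposes
)

print(llm.invoke("Summarise our internal test plan in three bullet points.").content)
```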
So cool
Something tells me that 128GB of RAM will be the minimum in my next build.
the idea is that:
- make it P2P and block-based.
- to use it, you need to allocate resources during the idle time of Android, iOS, Mac, PC and server devices.
- the exo AI cluster is a starting line, I think.
To rescue AI from the big corps, the people in the anarchist community must train AI and run inference on people's consumer-grade devices.
This is the only way to salvation.
And I'm badly wrong: AI needs more computation than storage, and current BitTorrent and blockchain technology is about space, not speed.