Using Clusters to Boost LLMs 🚀

  • Published 22 Dec 2024

COMMENTS • 196

  • @AlmorTech
    @AlmorTech 2 months ago +8

    Oh my, it's great that someone is making content of this depth ❤️

  • @woolfel
    @woolfel 2 months ago +139

    Even though I could afford to get a 4080 or 4090, I refuse to pay extortion prices. Nvidia has gotten too greedy. So glad I have my M2 Max with 96GB to do fun ML research projects.

    • @AZisk
      @AZisk  2 months ago +22

      Will you finally be upgrading to the M4 this year?

    • @djayjp
      @djayjp 2 months ago +7

      x86 will be joining the party once Strix Halo launches.

    • @seeibe
      @seeibe 2 months ago +4

      I'm riding this out on my 4090 until we get some clarity where local models are going. The general trend seems to be fitting the same performance into smaller models over time.

    • @TheHardcard
      @TheHardcard 2 months ago +3

      @@djayjp Certain tasks - a major current one being LLM token generation - are memory bandwidth limited. Assuming the next Max Macs keep a 512-bit bus with faster memory, they will have the highest bandwidth. Strix Halo will be for the admittedly sizeable market of people who hate Macs, hate Apple, or both.
      Outside of that, the upcoming Max will have the technical advantage. Can AMD undercut on price? Maybe, but not guaranteed.

    • @annraoi
      @annraoi 2 months ago +2

      With AMD exiting the high-end market, the cost may increase. The new 5090s will require a 16-pin connector and 600W from what I have read.

  • @Rushil69420
    @Rushil69420 2 months ago +10

    Would Thunderbolt networking speed up the cluster at all? Are they just communicating over Wi-Fi?

    • @AZisk
      @AZisk  2 months ago +4

      I only tried Wi-Fi. He might be using TB.

    • @alexcheema6270
      @alexcheema6270 2 months ago +4

      @@AZisk you can use TB too!

    • @Zaf9670
      @Zaf9670 2 months ago +3

      WiFi would definitely be a latency and throughput bottleneck. Thunderbolt may take some extra CPU cycles, but the throughput increase certainly won't hurt. Not sure how well TB does on latency, but I'm sure it's better than WiFi unless there is heavy protocol-inherited latency.

    • @acasualviewer5861
      @acasualviewer5861 2 months ago +1

      @@Zaf9670 Really? If the model is only running certain layers, then the only communication you're getting is on the order of the context size. So if the context is 1024, that's 1024 x 768 numbers.
      I think a much bigger factor is the immense number of matrix multiplications. That's what is slowest.
      Unfortunately, distributing the model this way, you're only as fast as your slowest node.
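
      A rough back-of-envelope sketch of the activation traffic described above; the hidden size, context length, and link speeds here are illustrative assumptions, not figures from the video:

      ```python
      # Estimate the data that crosses the network when a model is split by layers
      # across two machines. All sizes below are assumptions for illustration.
      hidden_size = 8192      # e.g. a Llama-70B-class model (the comment assumes 768)
      context_len = 1024      # prompt length
      bytes_per_value = 2     # fp16 activations

      # Prefill: the whole prompt's activations cross the split point once.
      prefill_bytes = context_len * hidden_size * bytes_per_value
      # Decode: with a KV cache, roughly one token's activation crosses per new token.
      per_token_bytes = hidden_size * bytes_per_value

      for name, gbit in [("Wi-Fi (~0.5 Gbit/s)", 0.5), ("Thunderbolt bridge (~10 Gbit/s)", 10)]:
          link = gbit * 1e9 / 8  # link speed in bytes per second
          print(f"{name}: prefill ~{prefill_bytes / link * 1e3:.1f} ms, "
                f"per token ~{per_token_bytes / link * 1e6:.0f} us (plus round-trip latency)")
      ```

      The raw payload per generated token is tiny; in practice the per-hop round-trip latency and the slowest node dominate, which matches the point above.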

    • @zhanglance557
      @zhanglance557 29 days ago

      @@acasualviewer5861 as fast as your slowest node, that's bad

  • @GVDub
    @GVDub 11 days ago +2

    Just starting to mess around with the idea of using clusters for my home AI server, and hoping that the 48GB M4 Mac mini I've got coming will play nicely with my existing 64GB Ryzen 9-based mini-PC system with the 12GB RTX 3060 (hey, I'm on a budget) on an OCuLink dock. If I can get 70B models running okay with those two under Exo, it will be useful for my particular application (writing and research assistant).

  • @Manuel-o7g
    @Manuel-o7g 2 months ago +17

    Cool! Everyone that makes AI models more convenient and accessible is a hero in my book (that includes you, Alex). Currently I'm running the smaller Mistral model on my base-model M2 MacBook Air. I am considering buying a Mac mini or Mac Studio when the new ones come out, and this might be what I need to run the larger models. Mistral is great, but I want to use it in combination with fabric, and for that it just does not cut it. Keep it up Alex, you make me look smarter at work with every video ;)

  • @dave_kimura
    @dave_kimura 2 months ago +9

    I've tested exo before and ran into a lot of the same issues you were experiencing, and this was on a 10GbE network. I haven't tried it again after the failed attempts, but I do think that this kind of clustering could be very powerful with even smaller models. If it supports handling multiple concurrent requests and exo acts as a "load balancer" for the requests, then you could have one entry point into a much larger-capacity network of machines running inference. This is opposed to finding your own load balancing mechanism (maybe HAProxy) to balance the load, where you would still have the issue of orchestrating each machine to download and run the requested model.

    • @kevin.malone
      @kevin.malone 12 days ago

      You can cluster Mac minis using Thunderbolt 5, which gives you 80Gb/s. That's supposed to give ~30 tokens per second on a 4-bit quantized 70B-parameter model.
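
      Whether a figure like ~30 tokens/s is reachable depends heavily on which machines are in the cluster; a common first-order sanity check (a sketch with assumed numbers, not a benchmark) is the memory-bandwidth bound, since every generated token streams the weights through memory once:

      ```python
      # First-order estimate of single-stream decode speed for a layer-split cluster.
      # All numbers below are assumptions for illustration.
      params = 70e9                    # 70B parameters
      weight_bytes = params * 4 / 8    # 4-bit quantization -> ~35 GB of weights

      # With a layer split, each token still passes through every shard in sequence,
      # so the ideal per-token time is the sum of shard_bytes / node_memory_bandwidth.
      node_bandwidth_gbs = [273, 273]  # e.g. two hypothetical M4 Pro minis (GB/s)
      shard = weight_bytes / len(node_bandwidth_gbs)
      token_time = sum(shard / (bw * 1e9) for bw in node_bandwidth_gbs)
      print(f"upper bound: {1 / token_time:.1f} tokens/s before any network overhead")
      ```

      Plug in the bandwidth of whatever chips you actually have; the interconnect then only subtracts from that ceiling.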

  • @WujoFefer
    @WujoFefer 2 months ago +6

    That's why I'm waiting for a Mac Studio with M4 Max/Ultra. 256GB for big models with a good SoC will soon be essential... or already is...
    Anyway, as an iOS dev I'm using 20-40B models; they are heavy but not too much, they can respond in reasonable time, and they don't use 50GB+.

  • @ItsBullyMaguire
    @ItsBullyMaguire 2 months ago +32

    Try 10 mini PCs, all with 96GB

    • @digital321
      @digital321 2 months ago +4

      Anyone with a Raspberry Pi cluster could have some fun, although mini PCs with the extra RAM would be more cost-effective.

    • @aatef.tasneem
      @aatef.tasneem 2 months ago +3

      I tried it with an Nvidia Jetson Nano cluster and the results are amazing.
      I tried other similar options, i.e. the Raspberry Pi AI Kit and Google Coral; in comparison to the Nvidia Jetson Nano they don't even stand a chance.

    • @christianweyer74
      @christianweyer74 2 months ago

      @@aatef.tasneem Very interesting. Do you happen to have the exact specs of your setup to share with us?

  • @thesecristan5905
    @thesecristan5905 2 months ago +1

    Hi Alex,
    Very nice video, but I had to smile a bit because of the test setup.
    I have a cluster running at a customer's, but for a different application, and this technology can really bring a lot in the area of performance and failover. I am enthusiastic about cluster computing becoming generally more available and usable.
    It is very important to build a high-performance, dedicated network for cluster communication. With Macs, this is quite easily possible via a Thunderbolt bridge. I recommend assigning the network addresses manually and separating the subnet from the normal network.
    With 40 Gbit/s you have something at hand that otherwise requires a lot of work and cost (apart from the expensive cables).
    Of course, it is better if all cluster nodes use comparable hardware, which simplifies the load distribution, but in general different machines are possible.
    In your case, unfortunately, a base Air, which on its own can hardly handle the application, is more of a brake than an accelerator, as you impressively showed.
    A test with two powerful Macs would be interesting.

  • @danwaterloo3549
    @danwaterloo3549 2 months ago +1

    The idea is super cool; I'd love to be able to use multiple computers to accomplish more than what is possible with just one. It seems to me the idea addresses the issue of expensive graphics cards... which is probably the next best alternative... a 'modeling host' with powerful graphics cards being available over the network to smaller 'terminals'.

  • @danielserranotorres4230
    @danielserranotorres4230 2 months ago

    You could try running the tool in Docker containers with shared network storage for the model. That would help with the disk space issues.

  • @Adriatic
    @Adriatic 2 months ago +13

    Good day to you :) thanks for the content.

  • @thomasmitchell2514
    @thomasmitchell2514 2 months ago +1

    lol when he was looking for the safetensors I was thinking "please be in HF cache, please be in HF cache" and of course, this Alex fellow is wonderful. Means this will be simple to drop into current workflows. 405B should fit well across 4 Mac Studios with 192GB 👌 next question will be whether it can distribute fine-tuning

  • @RunForPeace-hk1cu
    @RunForPeace-hk1cu 2 months ago +11

    Use a NAS ... wouldn't have to download multiple times ... and point to a shared directory.

    • @AZisk
      @AZisk  2 months ago +5

      🤔

    • @AtlasBit
      @AtlasBit 2 months ago +3

      Good idea, and also RAID on SSDs to boost performance, then compute as a cluster.

    • @RunForPeace-hk1cu
      @RunForPeace-hk1cu 2 months ago

      @@AZisk point your Hugging Face hub directory at a shared directory on the NAS.
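
      One hedged way to do that, assuming the NAS is mounted at the same path on every machine and the tool resolves models through the standard Hugging Face cache (the mount point and command below are placeholders):

      ```python
      # Point the standard Hugging Face cache at a shared NAS mount before launching
      # the tool on each node, so the weights are downloaded once and reused everywhere.
      import os
      import subprocess

      env = dict(os.environ, HF_HOME="/Volumes/nas/huggingface")  # assumed NAS mount point
      subprocess.run(["exo"], env=env)  # hypothetical invocation; adjust to your own setup
      ```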

    • @Artificial.Unintelligence
      @Artificial.Unintelligence 2 months ago

      @@AZisk can you try a comparison of hardwired vs WiFi? Like the above comment, what about involving a NAS?
      What about all the exact same computers vs 1 PC of the same power vs multiple PCs with varied power?.. so we can see where the benefits actually come from?
      I suspect a significant portion of your delays are from networking and waiting on the slower and maxed out PCs to catch up and assist the powerful one. BUT the benefit here is offloading a giant model that wouldn't fit on a single machine.. there's no way networking is going to be faster with tokens per sec vs a CPU/GPU/RAM all in the same system.
      SO where do you gain performance, where are the diminishing returns? Can you use like 2-3 low power mini PCs like the new Intel and AMD mobile chips about to hit and actually do better at-scale with 1 bigger PC that can just barely handle a big request on its own? Because each of the small PCs can also still do smaller things on their own running tasks in parallel but pair up for big tasks? A single PC will only be able to do one task at a time regardless?
      Lots of questions that can be tested here and going cheap but many vs expensive single devices.

  • @litengut
    @litengut 2 months ago +31

    It's for people who have two MacBook Pros with 256GB of RAM on a plane

  • @psychurch
    @psychurch 1 month ago +1

    I wonder if you could share the models and network via thunderbolt

  • @WireHedd
    @WireHedd 1 month ago

    A callback to the good old days of Beowulf clusters for Unix. I picked up 5 old HP mini PCs with an Intel 6-core CPU, 1TB NVMe and 64GB of RAM in each. These are all on my 10Gb in-house Ethernet, so I'll give it a go and let you know. Great video, thanks.

  • @Rinat-p7f
    @Rinat-p7f 2 months ago +1

    Is it possible to run it with a Mac plus a Windows or Linux machine in one cluster?

  • @Z-add
    @Z-add 2 months ago +2

    You should investigate and do more videos on these clustered LLMs.

  • @5pm_Hazyblue
    @5pm_Hazyblue 2 days ago +1

    I know a use case. College kids gather their MacBooks together and forge essays.

  • @dougall1687
    @dougall1687 2 months ago +1

    I realize this may be a major leap in complexity, but would you consider a couple of videos on customizing LLM models to introduce local content?

  • @cyberdeth8427
    @cyberdeth8427 2 months ago +3

    This is a good start, but the problem is still that it's trying to load the entire model on each machine. A better solution would be to share the model across machines and access it in a BitTorrent type of style. Not sure how that would work though.

    • @AZisk
      @AZisk  2 months ago +2

      might have to try

  • @stefanodicecco3948
    @stefanodicecco3948 2 months ago +1

    A great project and idea; maybe the next step could be the addition of shared memory from the cloud.

  • @quadcom
    @quadcom 2 months ago

    Were all the laptops connected via WiFi or hardwired?
    Firewalls blocking comms between the systems?

  • @DaveEtchells
    @DaveEtchells 2 months ago

    Naively, I'm surprised this would work without an even bigger hit on net performance than you found. I'd think that partitioning the model across machines would be tricky: you somehow split the weights, then calculate the two halves (or multiple shards) of the matrix math separately?
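
    The usual trick with this kind of tool (sketched below as a toy, not exo's actual code) is to split by whole layers rather than inside a matrix: each machine keeps complete layers, and only the activation vector hops across the network between shards.

    ```python
    # Toy pipeline-parallel sketch with NumPy: node A holds the first half of the
    # layers, node B the second half; only the activation vector crosses the network.
    import numpy as np

    hidden = 64
    layers = [np.random.randn(hidden, hidden) * 0.1 for _ in range(8)]
    node_a, node_b = layers[:4], layers[4:]

    def run_shard(shard, x):
        for w in shard:
            x = np.tanh(x @ w)   # stand-in for a transformer block
        return x

    x = np.random.randn(hidden)
    x = run_shard(node_a, x)     # runs on machine A
    # here the `hidden`-sized activation vector would be sent over the network
    x = run_shard(node_b, x)     # runs on machine B
    print(x.shape)
    ```

    So no single matrix multiplication is divided across machines; the cost is that the shards run one after another for each token.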

  • @LewisCowles
    @LewisCowles 2 months ago

    If you could simplify and extract some of the connections, you might be able to make a grid. But you'd wind up powering a lot of hardware.

  • @gunnarfernqvist4896
    @gunnarfernqvist4896 2 months ago

    Interesting project. It would be interesting to see a video where you explain these models fairly simply. You mention numbers like memory, tokens/s, and X number of parameters. Can you please explain them for those of us not so into LLMs?

  • @houssemouerghie6036
    @houssemouerghie6036 2 months ago +1

    This is so cool, but I think if it were possible to get multiple VPSs, connect them, and then run the model on them, it would be even cooler.

  • @dessatel
    @dessatel 2 months ago +1

    Supposedly you can add Nvidia or Linux machines etc., with something like tinygrad as a backend for exo.

  • @allanmaclean
    @allanmaclean 2 months ago +3

    I design air-gapped AI inference systems; I do my initial tests on 30x Raspberry Pis to focus on efficiency. Obviously dedicated GPU memory is not possible. Maybe this teamed with the about-to-be-announced M4 Mac mini will be the next evolution. It also de-risks accidentally running up a bill of thousands of pounds on a cloud-based test lab.

  • @tutran-b4i
    @tutran-b4i 2 months ago

    Hi man, can you compare AMD and Nvidia cards when running Ollama, something like the AMD 7800 XT vs the 4060 Ti? Thanks

  • @MrKim-pt2vm
    @MrKim-pt2vm 2 months ago +1

    Try using Llama 3.2 90B on a Mac Studio M2 Ultra.

  • @RocktCityTim
    @RocktCityTim 2 months ago +6

    It's a great solution for a small business getting into the ML/AI realm but keeping their research in-house. Scrub the Macs and go for some lower-cost gaming PCs. Install a base Linux distro and kick off 3-4 nodes. Under $5K and an amazing solution.

  • @the_other_ones1904
    @the_other_ones1904 2 months ago

    This would be a great idea to try with all my Raspberry Pis which are collecting dust on my shelf. I wonder whether old Pis could be used.

  • @seneschal6526
    @seneschal6526 2 months ago

    So I think, like with RAM, it may be running at the lowest common denominator. Just because you put two sticks of RAM together, one rated at 5600 and one at 4800, doesn't mean you get both speeds; even though they're the same 16 gig, they will opt to work together at the slowest speed available to both. It's kind of like your motherboard communicating with the CPU and RAM: everything slows itself down to the lowest common denominator to run simultaneous functions.

  • @burgerbee5169
    @burgerbee5169 2 months ago

    Would be very interesting if you could try a 70B model on a new laptop/mini PC with 128GB RAM and the new Intel Core Ultra 7 (2nd gen) Processor 256V / 266V running Linux and llama.cpp (compiled with AVX512 and SYCL). I don't know if there are any 128GB laptops out in the wild with Core Ultra 7 (2nd gen) yet.

  • @_hmh
    @_hmh 1 month ago

    I think this would scale out better with equally sized computers and a fast network connection (10GbE).

  • @The_Collective_I
    @The_Collective_I 1 month ago

    12:40 - I can tell you easily: we have three MacBooks at home, each with 128GB, and three iPad Pros with M4, each with 16GB, and two beefy Windows machines with 4090s.
    That's altogether 500GB of VRAM that's going to power my AGI for free.

  • @Peter-rm7io
    @Peter-rm7io 1 month ago

    I think you need a Thunderbolt bridge between the different machines to ensure low latency and speed.

  • @vipuljain5683
    @vipuljain5683 2 months ago

    Anything similar for Windows that doesn't rely on GPU VRAM??

  • @TheBadFred
    @TheBadFred 2 months ago

    What about 10 maxed-out Raspberry Pis with an NPU card and SSD in a cluster?

  • @brennan123
    @brennan123 1 month ago

    Thank you. I was struggling to find where the models were located as well. Really annoying that it is not documented and they make it so hard to find. Yeah, don't mind me, just dumping 100+ GB, don't worry about it, you don't need to know where it's at... lol

  • @HaydonRyan
    @HaydonRyan 2 months ago

    Does this work on Linux with CPU only? People with beefy home labs might REALLY enjoy it. :)

  • @SnowDrift-bh7wb
    @SnowDrift-bh7wb 2 months ago

    Kinda reminds me of swarm intelligence. A bunch of devs sitting together, all sharing some of the compute power of their PCs, forming a clustered AI that serves all and, as a whole, has more performance and is smarter than simply the sum of each individual PC.

  • @gool54
    @gool54 2 months ago +2

    Maybe try the Meta Llama 3.2 light model

  • @allansh828
    @allansh828 1 month ago +1

    Imagine what a cluster of new Mac Studio M4 Ultra 512GB machines could do. They would beat Blackwell compute cards.

  • @piratestreasure2009
    @piratestreasure2009 1 month ago

    You can find out where the model files are saved using dtrace.

  • @billlodhia5640
    @billlodhia5640 27 days ago

    Reverse proxy caching, proxy caching, and rsync will easily solve the downloading issues; download once and distribute locally at high speed

  • @yoSunshineyo
    @yoSunshineyo 2 months ago

    Your construct is a classic example of a bottleneck! The request enters a pool of resources, where its parts are divided across three instances, each waiting for the others to complete. Imagine three people are meeting up: one takes a rocket, another a speedboat, and the last one rides a bicycle. Sure, two of them will arrive quickly, but for the meeting to happen, all three need to be there. So, everyone ends up waiting for the one on the bicycle.

  • @aatef.tasneem
    @aatef.tasneem 2 months ago +1

    I am an old follower, since your sub numbers were in 4 digits.
    A comparison of the Nvidia Jetson Nano and the likes of it would open up a lot more possibilities.

  • @fsfaysalcse
    @fsfaysalcse 2 months ago

    Alex, I like your T-shirt. Where did you get it?

  • @matteolulli2654
    @matteolulli2654 2 months ago

    Very nice video. However, I think that connecting the computers in a wired local network using Thunderbolt cables should provide some improvement.

  • @MihiroParmar
    @MihiroParmar 1 month ago

    Maybe it's limited by the networking switch and ports; maybe go with 10 gig.

  • @rafaeldomenikos5978
    @rafaeldomenikos5978 2 months ago +1

    I actually have an M3 Max 64GB and an M2 Air 8GB. I am so intrigued by this! If it works I can set it up with my Studio in the office with an M2 Ultra and 192GB! Now that'll be a lot of RAM. Maybe 405B quantized? 😂

  • @ClementOngera
    @ClementOngera 2 months ago

    The best use case would be a small-to-medium corporation, a retail chain, or a learning institution that is looking to have its data trained on. Heck, if I had our farm's data, I would gladly run that model.

    • @nmstoker
      @nmstoker 2 months ago

      Except that EXO as currently set up is for inference only, not for training. For training you'd need a big server (on-prem or cloud).

  • @StraussBR
    @StraussBR 1 month ago

    Just the thought of comparing the bandwidth of your RAM with the network overhead of fitting a model across 2 machines is depressing.

  • @nickvangeel
    @nickvangeel 2 months ago

    What internet do you have to download nearly 20 MB a second?? (Fiber?)

    • @AZisk
      @AZisk  2 months ago +1

      yes

    • @nickvangeel
      @nickvangeel 2 months ago

      @@AZisk 1, 5 or 10 Gbit down?

    • @fevad1246
      @fevad1246 2 months ago

      @Garrus-w2h bro please stop flexing internet speed, I can't even get more than 2 megabits (yes, not even megabytes) per second 😭😭

  • @ScottLahteine
    @ScottLahteine 2 months ago

    It’s also right to wonder “who is distcc for?” and more importantly, can we get a generalized cluster architecture for modern computers so that every large application can take advantage of spare hardware? This could lead to some very large clusters organized by any group that needs it. Of course, it would undercut AWS and no one wants that! Meanwhile, watch those huggingface cache folders. They do get very large and should be cleaned and purged frequently.
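
    For keeping an eye on those folders, huggingface_hub ships a cache scanner; a small sketch (assuming huggingface_hub is installed and the default cache location is in use):

    ```python
    # List what the Hugging Face cache is holding, largest repos first,
    # before deciding what to purge.
    from huggingface_hub import scan_cache_dir

    info = scan_cache_dir()
    print(f"total cache size: {info.size_on_disk / 1e9:.1f} GB")
    for repo in sorted(info.repos, key=lambda r: r.size_on_disk, reverse=True):
        print(f"{repo.size_on_disk / 1e9:7.1f} GB  {repo.repo_id}")
    ```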

  • @Kitsune_Dev
    @Kitsune_Dev 2 months ago

    Can you review mini PCs? I want to know if I can run LLMs on my SER 5 Max 😂

  • @StevenAkinyemi
    @StevenAkinyemi 2 months ago

    This setup helps if you want your own little farm without selling your soul.

  • @garynagle3093
    @garynagle3093 2 months ago

    Pretty cool to see this

  • @RandyAugustus
    @RandyAugustus 2 months ago

    Use case is simple.
    1. 2+ Docker containers with models installed; e.g. Mistral
    2. Put containers on same Docker network.
    3. ????
    4. Profit.

  • @HighTecker75
    @HighTecker75 2 months ago

    Did you try to set up a proxy server for the LLM download?

  • @samre3006
    @samre3006 18 days ago

    It's high time models are stored on network storage so they can be shared by all the machines.

    • @AZisk
      @AZisk  17 days ago

      yep: example: Running LLM Clusters on ALL THIS 🚀
      ua-cam.com/video/uuRkRmM9XMc/v-deo.html

  • @abrahamsimonramirez2933
    @abrahamsimonramirez2933 2 months ago +1

    Interesting, I guess faster networking cables/ports and faster hard drives could help.

  • @ottoneff
    @ottoneff 2 months ago

    Have you seen the Qualcomm Snapdragon Dev Kit for Windows teardown (2024) from Jeff Geerling... Hopefully LLMs are going to work with NPUs in LM Studio soon.

    • @AZisk
      @AZisk  2 months ago

      still waiting for mine

  • @nyambe
    @nyambe 2 months ago

    Is that the FX3 or FX30?

    • @AZisk
      @AZisk  2 months ago +1

      FX30

    • @nyambe
      @nyambe 2 months ago

      @@AZisk I have one also, fantastic little camera

    • @AZisk
      @AZisk  2 months ago +1

      @@nyambe Yeah I like the instant access to ISO and Aperture controls. But the battery drains so fast!

  • @KCM25NJL
    @KCM25NJL 2 months ago +1

    Hmmmm, wonder if inference would be stable enough if getting 10 of my "Gamer pals" on a VPN and running Exo across zee Interweb?

    • @alexcheema6270
      @alexcheema6270 2 months ago +1

      I'm adding support for invite links where you can invite friends to join your exo network

  • @calmsimon
    @calmsimon 1 month ago

    thanks for this video... was about to go spin some shi up myself lol

  • @tudoriustin22
    @tudoriustin22 2 months ago

    Love this experiment. Been following your channel for the past 2 years since I got into ML, and I own an M2 Max Mac Studio with 32GB unified memory that I've used so far for ML. Happy with it, but also waiting for the M4 Max MacBook Pro so I can finally get a portable powerhouse. Saved up for a whole year for the upgrade and I'm planning on getting the 8TB SSD and 128GB unified memory version maxed out for the 16" model, or maybe more unified memory if they add it on the M4 Max. Benchmarks so far for the leaked models from the Russian YouTubers seem like a good estimate for performance, but I can't wait to see the new ones coming out soon.

  • @MrSparc
    @MrSparc 2 months ago +1

    So many AI hype people talk about or show off models with llama.cpp, etc., but it's just a few prompt questions or simple toy code. Nobody shows a real implementation of these LLM models integrated into a real project. Alex, I would like to see videos with examples from your projects where you integrated AI models, and the added value that brings to your software.

  • @monkeyfish227
    @monkeyfish227 2 months ago

    OMG you got patience. Cool? Yes cool.

  • @EcomGraduates
    @EcomGraduates 2 months ago

    I was literally just researching to see if anyone had done this yet!

  • @hermanthotan
    @hermanthotan 2 months ago

    This will be powerful when you run a cluster of Intel mini PCs with 96GB RAM.

  • @fatherfoxstrongpaw8968
    @fatherfoxstrongpaw8968 2 months ago +1

    I did a project like this back in 2004 using a Beowulf cluster with 9 Apple 2s, an AMD PC, an Intel/Nvidia PC and an Acer/Intel laptop. The 2 biggest bottlenecks were the Macs and the 10Mbit networking, but it was a good proof of concept. In my experience, any time you cluster, you're bottlenecked by your slowest component. Yeah, you can do it, but it's better for things like VMs and lots of small individual programs. Not to mention whatever software you're running has to be written or modified to take advantage of the distributed hardware. Just because you could doesn't mean you should. The R.O.I. just isn't there.

  • @geofftsjy
    @geofftsjy 2 months ago

    They should upgrade the cluster join automation to peer-to-peer transfer the model to the new nodes if they don't have it. No reason to go to the WAN over and over.

  • @geforce5591
    @geforce5591 2 months ago +5

    AMD Strix Halo APUs with 256GB RAM to the rescue in 2025. Won't have to pay the Apple tax and can upgrade SSDs for a fraction of the price without having to resort to de-soldering NAND chips like on MacBook Pros or spending $4000 for an 8TB SSD.

  • @ranjitmandal1612
    @ranjitmandal1612 2 months ago

    Very cool 👏

  • @Nathan15038
    @Nathan15038 13 days ago

    Oh man, I only have one of those MacBook Pros with the highest M3 Max processor and RAM configuration 😅

  • @adilhussain6301
    @adilhussain6301 2 months ago

    Can you test whether nested virtualization is supported in macOS 15 on M3 Macs?

  • @kakaaika3302
    @kakaaika3302 2 months ago

    This project could be run over a Thunderbolt bridge; I think that would be more reliable.

  • @Ukuraina-cs6su
    @Ukuraina-cs6su 2 months ago

    I believe the one who has fast and reliable cable networking)))
    I don't know if they did, but it would be logical not to download models from the internet every time; once one machine has downloaded the model, it can serve the model to others. Even better, you don't need multiple copies of the model in the same network; it can be a single fast network drive.

  • @DJWESG1
    @DJWESG1 2 months ago

    Use case: individualised, personalised, aligned assistants.

  • @anshulpathak01
    @anshulpathak01 6 days ago

    You should've connected the Macs with Thunderbolt 4 cables instead of the wireless network...

  • @TazzSmk
    @TazzSmk 2 months ago

    A pair of 3090s (or three 16GB 4060 Tis) can run 70B models; a reasonable compromise IMO.

  • @JonCaraveo
    @JonCaraveo 2 months ago

    😅😋 this sounds fun 😊

  • @swastikgorai2332
    @swastikgorai2332 2 months ago

    Hey, wanna give UV a try?

  • @zhouyangbo4498
    @zhouyangbo4498 2 months ago

    A group of M1, M2, and M3 owners could build a cluster for LLM training as a group study project.

  • @one_step_sideways
    @one_step_sideways 1 month ago

    If only there were any Strix Point laptops with 4 RAM sticks... That would be 192GB memory with 4x48GB sticks. Then running things on a budget would be achievable

  • @JanBadertscher
    @JanBadertscher 2 months ago +1

    Petals did P2P LLM inference 3 years ago and it led nowhere. Memory bandwidth constraints make this inefficient. You trade off too much speed for this.
    If you take the effort to put together 4 machines, you get way more VRAM per dollar and more speed buying 2x A16 with 64 GB VRAM each for 3.3k, so you get 128GB of VRAM for 6.6k. Or you could do it with some RTX cards.
    Also, I was laughing hearing "production ready". Such projects are barely ever production ready; even Ollama isn't :) We got tons of problems when trying to use inference solutions like that with our clients.

  • @weeee733
    @weeee733 2 months ago +3

    Wow

  • @matej_hajek
    @matej_hajek 1 month ago

    Nice vid; would recommend not mounting your camera to the table, since when you touch the table the camera moves.

    • @AZisk
      @AZisk  1 month ago

      yeah, space is limited; otherwise I would love to have a nice tripod

  • @kamurashev
    @kamurashev 1 month ago

    Let's think about clustering the machines. What are the solutions out there, Kubernetes? Any ideas?

  • @belatorok5630
    @belatorok5630 2 months ago

    As LLMs develop, sooner or later they will be equal to a human if we use 1000 MacBook Pros. :)

  • @user-cw7jy9zr3z
    @user-cw7jy9zr3z 2 months ago +1

    Use case is a dev team

  • @giridharpavan1592
    @giridharpavan1592 1 month ago

    this was instructional

  • @AureliusRosetti
    @AureliusRosetti 2 months ago

    Hi there. First of all, great video as always, many thanks for the effort, appreciate it, really 😊. Now to EXO: I can imagine a huge on-prem data centre of … for instance … a certain automaker's R&D department running this across, let's say, 5 servers with 256GB RAM + 2 high-end GPUs each, interconnected with a high-throughput LAN and connected to another internal vector DB cluster to enrich generated answers. This way you can easily utilize all the advantages that modern LLM models provide without sharing even a tiny bit of data with vendors like OpenAI. Another case would be classified environments, where you aren't connected to the internet at all. And don't forget, it is not only about chatting: you can integrate them right out of the box into LangChain-powered applications. Such cluster projects should, in my opinion, also be very good at distributing multiple requests across themselves. I'm keeping my fingers crossed that this project makes it through to being stable.

  • @jetman-x4e
    @jetman-x4e 2 months ago

    So cool

  • @spacedavid
    @spacedavid 2 months ago

    Something tells me that 128gb of RAM will be in my next build as minimum.

  • @keepasskeep5322
    @keepasskeep5322 1 month ago

    The idea is that:
    - make it P2P and block-based.
    - to use it you allocate resources during idle time on Android, iOS, Mac, PC, and servers.
    - the exo AI cluster is a starting line, I think.
    To rescue AI from big corps, the people in the anarchist community must train AI and run inference on people's consumer-grade devices.
    This is the only way to salvation.

    • @keepasskeep5322
      @keepasskeep5322 1 month ago

      And I'm badly wrong. AI needs more computation than storage, and current BitTorrent and blockchain technology is about space, not speed.