@@DigitalSpaceport thank you very much for your reply, I'm just getting into building a PC for LLMs and gathering information on which GPU I should use and how multiple GPUs can be beneficial
Sure. You could easily run the setup I did a year back that has worked very well here. It is just Proxmox and an LXC, but there appears to be a way to get an LXC Windows instance running. I need to look more into that. ua-cam.com/video/IylJNfLi36E/v-deo.html
No, it's not future proof at all, but I wanted to wait until we see the next Nvidia GPUs before I decide on something bigger. I don't think we will see more than 24GB VRAM in the 5090 currently, and while model splitting is a thing and does work... it's pretty slow.
They had been used by a friend for Ethereum mining prior, in a harsh environment. The amount of dirt I had to clean off these was really something. The pads had also been destroyed. All replaced now, but a lot of work.
For 3090s I'm not sure it does anything for inference tasks? Does it? I have a dual A5000 setup with NVLink and it does enable a larger non-sharded memory size, but I only know of that in the context of GIS. Also, just to be clear, I'm pretty new to running local AI and not trying to larp as an expert. I'm here learning myself also.
I work and record in a harder audio environment than any other homelab YouTuber, I hope you consider that as well. I spent over an hour already on the audio for this and it's impossible to get clean audio without shutting down the rack machines. If I was in a studio like they are I would for sure be embarrassed at the audio quality, but I'm 8 ft away from a mini datacenter. I do want to set your expectations ahead of time that this may be the audio quality I can achieve.
I wonder what the "really cool AI and other things" are? Outside of maybe home AI and some prompting, I can't really wrap my mind around hosting an LLM. Can anyone tell me the other applications?
Check the most recent video here for some examples of vision routing and realtime web search engine hosting. I didn't want to drag that video on longer, and I am building and learning in realtime also (sharing along the way), and there are more functional use-case-based videos coming. I agree that part is lacking in this video, but it was only intended to showcase how to build the thing.
What car do you drive? I decided to put my money into homelab stuff vs new cars, and somehow I must be rich? That is not accurate. I don't waste money on things I don't care about, like new cars. Everything you see in my entire lab cost less than a mid-range car.
Why didn't you go for a tower cooler? There are some decent 3U/4U options that are not loud and the performance is more than adequate. Please note that server motherboards rely on airflow over the VRM for optimal operation. You run the risk of hitting thermal limits and causing throttling/shutdown of the system.
Yes, I have an HDX Vornado-ripoff mini fan that I have pointed at the mobo. It will be in the testing video. I do have tower coolers, but they are all utilized in other systems currently. I had this Corsair 420 free, and I very well might be putting the 7995WX into this rack at some point for testing on the fastest platform available.
There is absolutely no way you are going to see any kind of condensation unless the room is at below-freezing temperatures or you are using liquid nitrogen. Why even mention condensation???
The window AC spits out sub-32F air, and one of many plans I considered was to have the heatsinks right next to it. I opted not to, and everything is cooled great from a distance as well. I do see condensation at times on the AC directional fins and need to wipe it off and pay attention to it so I don't get mold growth.
It's not that bad. When we bought the house we made sure to go outside the city-owned utility to a much cheaper co-op. Under $300/mo for the whole house in central Texas is very decent.
Unlike gaming, AI and machine learning really do not benefit from x16 vs x8 lanes. That is because models are loaded once. Once the model or models are loaded into VRAM, the CPU has a minimal effect. Now, if you are pooling VRAM with NVLink, it is so much faster than PCIe 3.0, 4.0, or even 5.0 by a long shot. Also, though I have U.2 access with both the Z590 Dark and Z690 Dark Kingpin, they pale in comparison to the speeds of native PCIe 3.0 and 4.0 NVMe. I, too, have that same chassis from mining but have always wondered how it would perform as an AI frame--just haven't gotten around to tinkering. At 3:20 I've stopped, because experience gained from the last year of the Ethereum GPU mining boom to now is sufficient, and for me, I doubt there is any real new value.
Oh, I'm doing training also, but yeah, full lanes are not needed for inference. I did mention that. Can I NVLink the 3090s? I've read recently that the return is minimal. I guess the channel isn't for you, no harm at all there lol.
Do you have any recommendations for an air cooler for a CPU?
What is the specific CPU part number?
@@DigitalSpaceport I’m going to purchase the 7702p CPU, just like your specifications.
I use the Noctua SP3 on another EPYC Rome system, a 7B12. It is a very tall cooler but should just barely clear on the GPU rack, from a measurement I just did. Like 10mm close. geni.us/NoctuaSP3_CPUCooler
Alternatively the D9 also fits SP3 and is lower profile at 92mm, and it is featured in this video on my 7995WX. I like it, but it's noticeably louder under load; then again, the 7995WX is the hottest chip to my knowledge. If it ramps the fans to full blast, it's not quiet at all. ua-cam.com/video/YQZ2HWonnGA/v-deo.htmlsi=d5CtIYrhjHF0D5AB
@@DigitalSpaceport Thank you for clarifying. My expectation was for the air cooler to be quiet.
@@TuanAnhLe-ef9yk To be honest I think it's crazy. I used to run quad GPUs from 2 cards, but that was difficult enough with power and cooling.
I'm not even sure that 4 RTX 3090s will scale well for machine learning.
Wow, loving this build! I built my Llama 3 setup on my Proxmox with a single RTX 3090 passthrough. It gets pretty hot, I can only imagine what kind of heat load you're pumping into that room...
I have active cooling, but yes, that has limits. I have some ideas on placement to hopefully help that I'll be testing in the next video. Rearranging the area a bit now.
It depends on your workload and whether you're running a model or training a model.
If you're just running a model, unless you're constantly peppering it with requests, it might not get all that hot.
(I'm running a dual 3090 setup with the open-webui, which uses the Ollama backend, running the Codestral 22b model, and it only spins up when I ask it something or type a response back, but then in between that, the GPU sits at idle.)
If you're TRAINING a model, then that's a different story.
I have 2 RTX 3090s in an AMD 5900X system and they thermal throttle because of the spacing. Once I get the PCIe extension cables and install them in a new case, that should solve the problem.
@@jksoftware1 yeah, the 3090 is a really fat card. I toyed with the idea of water cooling, but it was much cheaper to just use the risers. The risers are still too expensive IMO also.
@@jksoftware1
I have two 3090s in an Asus Z170-E motherboard with a 6700K and 64 GB of RAM.
The two slots drop down to a x8/x8 configuration, so it makes it more difficult to push a hard enough load onto the GPU to get it to thermal throttle.
(And that's with running both InvokeAI and open-webui with the codestral:22b LLM models simultaneously.)
If I need to space them out, I can use some of the GPU mining hardware that I still have. The PCIe x1 link back to the motherboard won't be great for bandwidth, but it will provide ample spacing for cooling.
Really cool build, I also have a 4 GPU rig.
The only thing that I would recommend is trying to give as much space between the GPUs as possible, because having them close to each other will generate a lot of heat; the difference is enormous.
I would also add extra fans for the GPUs. I personally like a maximum of 1 GPU per 120mm fan, with the fans blowing air directly at the GPU.
I'm not sure if a water cooler is a good idea here. I'm saying that because no server uses water cooling, and neither do CPU miners (people running the CPU at 100% 24/7), because water coolers tend to stop cooling effectively at some point, and they don't have the best efficiency. I'm also not sure if a 120mm fan will fit on your build, I'm just giving food for thought.
Yeah, the water cooling is working fantastic at keeping the CPU cool under workloads for really large models, but I'm fast becoming badly annoyed by those. Too dang slow. The 3090 non-Ti also generates a lot of heat on the back. I'm working on placement changes to help with the heat load also. I'm likely to have a video on that at some point. I've got a Vornado knockoff right now and it hits the board and keeps the NVMe drives cool, but redoing the whole layout of the mini datacenter is very likely.
LOL when you moved the camera from your 200W "reasonable power draw" on the rig to that insane server rack probably drawing several kilowatts. Nice video!
You do have a point there 😅
The budget of 5K for that build is awesome; at first I thought we were looking at around 10K.
On a side note, I am having trouble sourcing used 24GB 3090s in my area.
Thanks, I also think it's very cheap for what it is. I couldn't get as much bang for the buck any other way.
I'm in a fairly large metro and my local sellers want more than eBay. I pointed this out to one who had a similar 3090 Ventus for about $75 more, thinking they would be like "okay, I'll go to $600," and they refused to match eBay. Totally their choice, but local sellers being way high on prices and unreasonable in negotiating is a recurring theme recently.
What about the electricity bill?
About $275/mo. We bought this house with electric rates in mind, as everyone doing any form of HPC should in my opinion. It can go as high as $800/mo if I'm really cranking flops, but that's a fraction of what cloud costs would be. A not-small part of this is production for my business.
Nice build! Have you run many training workloads on it?
The single-core perf of the 7702, even with boost, is pretty mediocre. I fear it would bottleneck training unless you spend a bunch of time optimizing data loading code. I went with a Threadripper Pro for my 4x 3090 for this reason, but always wondered how a 7702 would perform.
Good stuff, man. Looking forward to what the performance will be like.
Cool build, love the spreadsheets. You should project the cost of ownership. At what point does it become too expensive to own? Is there perhaps undervolting potential to bring power draw down without adversely affecting performance too much?
How can you make money out of it? Can you bridge the gap from your website to this awesome beast and share time against your AI? Is there a particular branch of AI that is more cost-effective than another that might not be landing on the sweet spot? Could you add an ASUS RAID card, perhaps another network card, say 10GbE? Can processes be re-routed to avoid CPU and RAM bottlenecks?
Good job, want more.
Nothing wrong with zip ties. Sweet rig!
Sweet.
Gonna build this, my old school dual C2070 is now a dinosaur.
I had to look that GPU up. Does Fermi still work for ollama?
Nice one! I built an 8x A4000 EPYC server which was... epic! 128GB VRAM
The A4000 has such a nice single-slot format and 16GB VRAM, it's a great card. What are the biggest models you like to run on that 8x rig? I have 1 A4000 and 2 A5000s but am thinking of selling those for more 4090s.
This is exactly what I want to watch!!!!!!
Sweet! Glad this build floated your boat. The next build video is the smaller guy, filming that now.
That odd fan is making me go crazy LOL
I know, but I spent all my money on GPUs and pads.
Your video is great!
There is a used ASUS Prime TRX40-Pro for a pretty decent price. Do you think it would do the job as well? The board comes with 3x PCIe 4.0 x16 and may need a splitter to accommodate 4 GPUs.
In addition, I already have 6x RTX 3090s. Do you think it would be beneficial to utilize all 6 GPUs? I plan to go for Llama 3.1 70b.
What is it priced at and does it include CPU?
@@DigitalSpaceport The TRX40 is 120 USD without CPU or RAM.
Quite hard to find AMD server motherboards in my area. What would be a comparable setup on the Intel side?
Powering each 3090 with a single 8-pin via a piggyback connector? I thought the 8-pin standard had a max wattage of 150W, and even though you're going to use Afterburner (or equivalent) to reduce it, you stated an anticipated 275W. I just finished a 4x 4090 Chia mining setup and I'm going to have to use Afterburner to reduce the power, as three 4090s on a single ASUS ROG Thor 1600 trips the internal breaker and shuts off the PSU. Will be interesting to see how your four 3090s and the Corsair 1500 handle something similar with the additional draw from an EPYC processor/motherboard combo. You may need to add a second PSU to perform as desired. Thumbs up! - I see the Phison drives also, nice touch.
Okay, I'm going to research that more here. I'm not aware of it, and none of the cables are warm; I just touched them after about an hour under load. Still a good thing to know.
@@DigitalSpaceport, @bishop838 is correct. 150 W per cable for the PCIe connector is the official spec. You also get 75 W from the PCIe MB connector. Some of my power supplies list 200 W per cable max (even with two connectors), so if you can limit your GPU power to ~225 - 275 W, you'll be under the limit. If you're just running LLMs doing inference or even Stable Diffusion/Flux image generation, though, you should be fine even with the current setup. Unless you're doing training or fine-tuning that runs the GPUs at 100% continuously, you're unlikely to trip any breakers or brownout your power supply.
When it comes to tabs, I am your wife to a T lolol, thanks for the shoutout. Loved the video!
Sorry if I'm repeating questions from elsewhere in the comments. How did you "manually limit the GPUs"? What approach did you take to lower their power consumption to fit within the profile necessary for your PSU? What was the impact to performance of doing that? Also, why not add a second PSU so they can run unconstrained? Is there a way to safely do that?
nvidia-smi -pm 1 and -pl XXX, whatever wattage you want. It turns out however your GPU processing utilization is 1/n, so each 3090 here only runs at about 1/4 power. This is due to how ollama/llama.cpp currently handles model splitting: no tensor parallelism and such. There is work that may address this in the future, and I would consider it more then.
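For anyone wanting the exact commands, this is roughly it (the 275 figure is just an example; check your card's supported range first):
sudo nvidia-smi -pm 1      # keep the NVIDIA driver state loaded between runs (needs root)
sudo nvidia-smi -pl 275    # cap every GPU at 275 W; add -i 0,1,2,3 to target specific cards
nvidia-smi -q -d POWER     # confirm the current, min and max power limits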
With the power supply cables, the graphics card manufacturers do not recommend using the two connections on one cable; they recommend using separate cables 😵💫
Yeah, for inference workloads, the way the model gets split, each added card only does its 1/n of the workload, so they are never hitting any one card hard. Each card here gets ~25% of the GPU processing workload. However, llama.cpp is implementing tensor parallelism, so that should correct that, and yes, I need a second PSU on this rig as well. Might use a larger rack also.
Does it work to mix different generations of GPUs, like RTX 30 and RTX 40? Nice job!
Yes
Yes, for inference, like running premade models, you can mix GPUs and PCIe bandwidth is not that important. I'm going to test mixing in various other GPUs to see how this impacts performance. For training you want the cards to be as closely matched as possible.
Why limit the GPUs' TDP?!? Just add another PSU! 4x 3090 is 1400W already! 512GB RAM and the 7702 CPU are another 500W, so one more PSU, 750W minimum; it costs nothing compared to the price of the system. And with 1500W you don't want to run it at the max limits, keep a 20% reserve. If you want a stable, reliable system, your GPUs have to be limited to 150W instead of the default 350W, and that's a huge hit!
I'll likely move it into the server racks at some point, for power draw reasons. I'm also likely to get a DC power board for it then, but I've got a lot of gear rearranging to tackle before it gets to that point. The CPU/RAM doesn't cross 250W from what I've seen on my 7B12/H12 combo unless I'm running the CPU hard, which this type of workload doesn't seem to do so far. I also don't trust the dual-PSU adapter kits; I've known a guy who mined ETH hard and had several burn up.
@@DigitalSpaceport Llama uses the CPU and RAM at first to compile/load, then the GPUs, from my observations, and I asked Llama how it (he/she) works and got a reply confirming it :)
Very good point, btw
@DigitalSpaceport Also, for LLMs, the peak performance is less important. The VRAM is what is golden about that setup, right? I limit the power of my 3090 on my rig, just so it doesn't reach the highest temps in my fairly small case (I am at my financial limits; $800 of used components is what I could afford).
AMD EPYC™ 7002 Series Processors
AMD EPYC™ 7001 Series Processors
Single processor, 7nm technology
Up to 64 cores, 128 threads
*cTDP up to 280W
No 7702 CPU supported, furthermore AMD no WAY
Thank you. Looking forward to the thermal paste report.
Coming soon!
Nice video! Do you have a demo of how well it runs, or some benchmarks?
Oh, lots of them. Check the channel history and there are many benchmarks. The current best is Nemotron on this rig. ua-cam.com/video/QXVSIR2z1q4/v-deo.html
Ollama on multiple GPUs actually needs very little bandwidth. Check it yourself with tools like nvtop: make an Ollama query on a multi-GPU setup/model and see that PCIe bandwidth is rarely even in megabytes territory. I think what counts more is the latency.
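Worth adding the commands for anyone who wants to check this on their own rig (the dmon column layout can vary a little between driver versions):
nvidia-smi dmon -s t   # per-GPU PCIe Rx/Tx throughput samples while a query runs
nvtop                  # TUI that graphs utilization, VRAM and PCIe together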
Thank you for amazing content.
I bought the same setup using your links. I am having a hard time understanding where to plug in the power switch. How are you turning yours on/off? Is there any spot on the motherboard I can plug the switch into?
Look up the motherboard manual online by model number. Then search that document for PWR and it will show you the header pin positions. The black cable (-) is ground and the red is +.
Perhaps I missed it, but why such a powerful CPU and so much RAM for an LLM server? The VRAM is the important part. Or did I forget something?
No Ollama demo?
Yeah, I'm separating the hardware videos, software install/config videos, and benchmarking videos. Those will all be out very soon.
I have an EPYC 7551P with 256GB and a Tesla P4. Going to potentially put my 3090 in it. The 7551P also does 2GHz on 32 cores. Do you find 64 cores at 2GHz is working well, or is the CPU clock speed a bottleneck regardless of core count?
Hey, what are your thoughts on mixing and matching GPUs (i.e., dual 3090s and an RTX 4000/4500 Ada)? Are there any benefits or disadvantages to mixing versus all the same GPUs?
Good question! I'm not sure; I guess I will test that out here actually. I would guess that core speed/VRAM speed will dictate a lowest-common-denominator outcome, as all work pieces need to be completed before a response. I am also curious about scaling up VRAM via non-homogeneous routes and the impact that has. I think layers are distributed in ollama/open-webui intelligently to each GPU based on VRAM capacity. I'm going to check; this is an important question. Thanks for asking!
Would love to see a Geekbench result for this machine.
Geekbench? If that can run on Ubuntu 22 I'll toss it into the benchmarking video.
A good alternative is buying a refurbished Mac Studio with M1 Ultra + 128GB RAM on eBay for around $3k. The M1 Ultra with 128GB RAM will run 70b models with q8 precision at ~7.5 tokens/second and draws less than 100W when running such models. Additionally, you can configure it to allow up to 120GB RAM for the GPU, which should be enough to run 70b models at 64k token context.
I have one big issue with the Mac Studio route, and that is that the tokens/second fall into what I deem an unusable range for middle-size models. Under 10 is painful and discouraging to use, IMO.
@@DigitalSpaceport 7 tokens/second is slightly above the speed at which I can read and get good comprehension of what I'm reading, so anything above that speed doesn't make much difference for me when using the model on a chat UI. However, for using the model as an agent for automating tasks, then yes this speed is very low.
One thing I'm curious about is what kind of speeds you get when using larger contexts with the quad RTX 3090 setup. On the M1 Ultra it gets very slow for 70b models at close to 30k tokens in context, about 2-3 tokens/second.
Any video of it running the big Llama 3 model?
I missed where Ollama training has been shown and how you tell it to divide itself among 4 GPUs. Can you fit a 70GB model in these 4 GPUs, for example something in FP16 with ~30 billion params?
The user does not tell it how to split into layers; the underlying parallelism method is applied automagically by llama.cpp, which powers ollama. You cannot fit llama3.1-70b-instruct FP16 into just 96GB of VRAM. That takes 140GB (ollama.com/library/llama3.1/tags), but you can fit the q8.
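If it helps, a quick way to sanity-check what fits on your own box (exact tag names are on that tags page; the ones here are examples):
ollama pull llama3.1:70b-instruct-q8_0   # grab a q8 quant instead of fp16
ollama list                              # compare on-disk sizes
ollama ps                                # after a prompt, shows how much of the model landed on GPU vs CPU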
Wow, that's a super impressive build. I'm looking at doing the same GPUs with the Lenovo P520 or Lenovo P920.
The price point on used Lenovo systems is pretty attractive!
Looks cool. I prefer Founders Editions, so pretty and such high build quality. Oh, by the way, your mobo & EATX connectors didn't look pushed in properly?
I love the FE editions also, they are works of art. I have an FE 3070 and 3080 Ti but noticed on used markets the 3090 FE is more than just slightly more expensive. Good eye! I just went and seated it fully.
The AMD EPYC 7763 is probably a better choice for the CPU; it was released about a year after the 7702 and is a full 7nm process. About the same price as the 7702 on eBay right now.
I like this comment, but let me ask you: if I wanted to bias a bit more toward mid-range EPYC frequency, what would be your recommendation? I'd be okay sacrificing core count, but selling to switch would be ideal.
I'm curious whether it's better to have identical 3090s or if using different brands wouldn't make a difference. Also, is it possible to mix 3090s with 4090s?
Yes, you can mix any GPUs, even extremes like a 1070 and a 4090, and benefit from the added VRAM size. However, if you have slower CUDA cores (usually older gen is slower) then you will baseline performance at the lowest card's level. Mixing a 3090 and 4090 at quant 8 or lower makes next to no discernible difference. Mixing 30/40 series at FP16, you will see a several t/s slowdown. I like keeping it to one model of 3090 if you can, as you likely need to clean/repad 3090s especially. It's easier to know what screws and pads go in which spots that way.
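One practical tip if you do mix cards: you can A/B test with and without the slower one using the standard CUDA env var (indices below are examples; check yours with nvidia-smi -L):
CUDA_VISIBLE_DEVICES=0,1 ollama serve       # expose only two cards for a comparison run
CUDA_VISIBLE_DEVICES=0,1,2,3 ollama serve   # then rerun with everything visible and compare tok/s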
How do you plan to cool your mainboard? The board is made to run in a server enclosure. In your current setup only the top of the rack with the GPUs is actively cooled. How do you plan to cool the DDR and other chips on the motherboard?
I run an H12SSL-i motherboard with an EPYC 7573X and 2x 3090 and went with an Arctic Freezer 4U air cooler instead of water cooling to get that extra needed airflow inside a 4U server case. I was considering the Gigabyte motherboard, but since I don't use a riser setup the top PCIe slots wouldn't be usable, since the GPU would be directly over the CPU socket.
I think the H12 is an excellent choice also! Have one myself going into a 4U case soon. Four fat 3090s wouldn't fit in the case however, and this board is very cheap. I have a small Vornado-knockoff fan that moves a lot of air over the mobo. That's shown in the most recent video now also.
Used 3090s are less than a THIRD of a new 4090. Now I need an EPYC board to support all of my PCIe stuff.
EPYC is a very good option and yeah the price difference in 3090 to 4090 is notable.
A used 3090 is 850 EUR, how come that's cheap?
How much is a 4090?
200W idle in Europe/Germany would kill your electricity bill.
Agreed, and I have more focus on power-efficient systems now also. The video I launched today is a PC that idles around 25W, ua-cam.com/video/iflTQFn0jx4/v-deo.html, and that machine should hit maybe 75% of most average people's use cases. I am working on one that hits 8W idle also; it likely covers around 50% of use cases but may also be of interest. I did get that rig's idle down lower after working on it some more, and I need to mention those steps in a video sometime also. Powersave and some kernel flags really can cut the wattage for an EPYC with minor usability impact.
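For reference, the powersave piece is basically this (available governors and drivers depend on your distro and BIOS settings, so treat it as a sketch):
sudo cpupower frequency-set -g powersave   # switch all cores to the powersave governor
cpupower frequency-info                    # verify which driver and governor are actually active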
Which Llama 3.1 model have you deployed/tested?
All of them. Anything specific you are looking for an answer to?
What token speed do you get in ollama for 70b q8 on GPU, and on CPU only if you use AirLLM with a DDR4 RAM virtual disk for AirLLM layer offloading? AirLLM can run even larger LLMs on a single GPU with any amount of RAM, but very slowly, and nobody has made a RAM-disk test.
Do you think it is possible to make a server like this, but add a file server also?
I mean, I want to make a server that I can use as a NAS, Emby/Plex server and AI...
I want to use maybe Proxmox and share the GPU for all these servers...
Is this possible please ?
Thanks a lot
You could have mounted the CPU radiator on the shelf below, level with the GPUs; maybe that would help take the strain off that one hose. Dude, 12:34, what are you into lol, that's some setup.
Yeah, I want to fabricate an entire new case. This is just not optimal. In a nutshell, I have a backup of the USGS GeoTIFFs and I do a geospatial-rendering-based workload for my business. It can now be done with GPUs faster at nearly the same quality as CPUs, so those R930s are not really needed as much.
@@DigitalSpaceport Cool stuff, I just recently watched a tutorial, VAPOR for WRF-Fire. I'm starting to learn a bit about visualization with matplotlib, mostly on dataset embedding and query returns.
I want to build something similar but with Tesla P40 GPUs. Can you help me figure out how to cool them with liquid cooling? 🙏
If there is not a premade water block for the P40, I don't think you will be able to. Maybe some sort of immersion cooling could work, but I am still too small of a channel to toss expensive parts into something like that. Those are expensive.
Great video 👍 Can you make a video of your system running Llama 3.1 70b?
While this is mainly a tutorial to get Open WebUI, Ollama and Meta Llama 3.1 set up in Ubuntu, it does feature me running the 70b, and while the stats I shared for a story generation may not be the same as hard logic, it's pretty good. I'll have full in-depth testing on 8b and 70b soon. 405b is now giving me issues; it was running a few days ago... The stats part is closer to the end. ua-cam.com/video/q_cDvCq1pww/v-deo.html
Great job keeping that under 5k. I made so many mistakes, like dual Xeon Gold 6148s, which didn't cost me money but time. I got it to about 5700 and it is not as good as yours.
6148s are pretty nice chips also though! Are you running it in an open air frame or a rack case?
Please test mixing different VRAM-size cards like a 3090 (24 GB) and a 4070 (12 GB). Can it balance the work in a way that doesn't crash when it hits the 12 GB mark?
I am planning on this and a few other tests. Here are some cards I have on hand that I may run mixed workload testing against: 3060 Ti, 3070, 4090, A4000, A5000. I think the A4000 + 3070 + 2x 3090 would be a good test.
Which ollama model did you use? What is the token count? There isn't any info...
I'm using Meta's Llama 3.1 70b and it hits between 17 and 22 tok/s. 8b hits around 95 and 405b hits around 1. I have a full video on each model coming up, but in this video I think I have a chapter on Llama 3.1 70b you could check. ua-cam.com/video/q_cDvCq1pww/v-deo.html
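If you want to pull those numbers on your own hardware, the per-response stats come from the verbose flag (rates will shift with quant and context size):
ollama run llama3.1:70b --verbose   # prints prompt eval and eval rates in tokens/s after each reply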
Can you test a dual ASUS AI accelerator card setup vs that quad 3090 for comparison, to see which is faster for running AI and training new models?
I don't have those cards and I don't know if they would be compatible either.
No NVLink? The problem going to 4090s is that Nvidia dropped NVLink for the 40-series GPUs, sad.
For inference workloads NVLink doesn't help. But for training workloads I will need to get NVLink. I'm a bit hesitant to do anything until we see what the 5XXX prices/features are though.
Man, where did you find a 3090 for that price?
Have you tried renting out your hardware with Vast.ai or Salad?
I need better upload speeds. The cable modem has me capped at 40 Mbit upload, but high-split is on the way and should make that a viable route for idle time. I need to think about reservations and utilization more before I put this rig on it, but competition is pretty high on DePIN, and it's impossible at 40 Mbit.
Great video - subbed!
Welcome to the channel!
These 4 GPUs should only draw 250-300W altogether?
Due to the way the model workload splits across the GPUs when you are using their VRAM, they are often around 25% utilization on the processors. There are other ways to split workloads, but llama.cpp is under the ollama hood, so that would need to be addressed there. Tensor parallelism is the term.
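You can watch that roughly-25%-per-card pattern yourself with something like:
nvidia-smi --query-gpu=index,utilization.gpu,power.draw,memory.used --format=csv -l 1   # 1-second samples per GPU while a prompt runs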
Can they run a 70b model?
Oh yeah, they run any 70b pretty well, and my new favorite is Nemotron 70b, featured here ua-cam.com/video/QXVSIR2z1q4/v-deo.html
Explain it to me like I'm new to this.
Why would you want to run an AI server? What applications would this enable, and is it actually any better than building a server with more consumer-type parts, i.e. a 7950X or 7900X + ONE 3090?
Love this question. I'm going to quote it in the single-GPU video, which will fully answer the why part. Of note is speed for inference (processing) requests to the system, and models landing inside VRAM is of course ideal, to the tune of roughly 10x speedups. That's a major reason. Several other big ones exist as well.
I'd love to do something like this, and I have some reasonable hardware to make it happen, but I straight up don't have the power. What do you use as a power source? A giant solar array? My power in CT just went up to $0.35/kWh.
I'm on grid power unfortunately still, but that likely changes this year. Our rate is $0.10/kWh and I'm on a co-op that does a great job controlling costs. We do have land for a ground-based array onsite, but trenching in limestone is expensive. Austin gets a lot of sun, so it likely makes good sense for us. At $0.35 I'm not sure what I would do!
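Rough math for anyone pricing a rig against their own rate (the 200W idle figure is just an example pulled from elsewhere in the comments):
watts=200; rate=0.35
echo "scale=2; $watts/1000*24*30*$rate" | bc   # ~50.40 per month at 24/7 idle; the same 200W at $0.10/kWh is ~14.40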
Hi, totally awesome. How can I build a server to get performance equal to an AMD Ryzen Threadripper Pro 7995WX, with an RTX 4090, and 128 GB 6400MHz RAM with PCIe 5.0 NVMe? I am doing research on building a server for training my AI & ML models. I considered AWS but it's very costly, so I am considering my own server.
Yes, but how do you get any LLM to run on all that? For example, Llama 3 requires a high VRAM count. Does this get around the per-card VRAM limit by being able to aggregate the VRAM, or is that not a thing?
Yes, it spans the VRAM of all the cards needed to fit the model.
You should get Hailo to sponsor a video with their 8 or 10H M.2 module.
Also, how many TOPS is this setup?
I'm open to free storage gear. Like VERY open lol.
They aren't storage. They are AI compute modules. 26 and 40 TOPS of compute at less than 5W.
I'm technically very open to all gear lol. I'll have more in the benchmark video on all the stats, but the 70b so far is looking good on tok/s at 17.7, and 98 for 8b Llama 3.1.
Can you do a video on the software setup?
Here you go ua-cam.com/video/TmNSDkjDTOs/v-deo.html
@@DigitalSpaceport Nice! Thank you.
Can you make a follow-up video on use cases?
Yes, I forgot to mention in this video that I'm splitting up the hardware, software setup/config, and benchmark videos. Use-case definition will be covered in the software videos.
What about running diffusion models? Can one use NVLink to increase unified VRAM to fit big models? Would it be possible to switch to 4090s for extra speed?
I'm not sure NVLink is needed now. I think with LLMs at least you can count on the layers being distributed automagically with something like ollama. Not sure about diffusers, but I will keep an eye on nvtop when I do that video.
What cheap motherboard do you recommend for 2 RTX 3090s? Regards
Inference only, or do you need the ability to run them at full PCIe x16 Gen 4 speeds simultaneously, like with training?
@@DigitalSpaceport Both options please.
Can you make a home server with the 120-core EPYC?
Sure. Why not?
Can you run the llama3.1:405b model on this?
Okay, I did get 405b to run on this. It was EXTREMELY slow however; I would class it as unusable. That was not unexpected, but only 44 of the 145 layers can load into VRAM on the GPUs, so I guess I would need roughly 12 GPUs of 24GB to run it at respectable speeds. Hit 0.75 tok/s at 2048 context, which ended up being around 6 minutes of generation time on easy logic.
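The back-of-the-envelope math behind that GPU count, as a rough sketch (weights only, so the real requirement lands higher once KV cache and per-GPU overhead are added):

```python
import math

# Rough VRAM estimate for a 405b model at q4. Weights only; KV cache and
# runtime overhead add more, so treat this as a lower bound.
params_b = 405                                 # billions of parameters
bytes_per_weight = 0.5                         # ~4 bits per weight at q4
weights_gb = params_b * bytes_per_weight       # ~202.5 GB of weights
gpus_for_weights = math.ceil(weights_gb / 24)  # 24 GB per RTX 3090

print(f"~{weights_gb:.1f} GB of weights -> at least {gpus_for_weights} x 24 GB GPUs")
# ~202.5 GB -> 9 GPUs for the weights alone; with KV cache and overhead,
# ~12 cards is the more realistic target mentioned above.
```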
@@DigitalSpaceport Thank you very much for testing. If you are just limited by VRAM, would it be feasible to run M40s instead? I have seen them go on eBay for €170.
I'd be surprised, as the Maxwell generation is pretty old now: CUDA compute capability 5.2 and PCIe 3.0 only. I'd not go with those cards, but there may be more recent ones I should check into.
Loving this content
Any specific reason for going with the XianXian GPU rack instead of the "AAAwave The Sluice V.2"?
Yes, the price is lower on the one I have included, and from what I can tell they all look like the exact same rack. So going cheap FTW.
When you will be paying roughly $2400 for a 32 GB 5090 and most likely $1200+ for a 16 GB 5080, I would expect the 4090s to be selling for at least $1400+. The 3090 will probably continue to be the best bet in town!
I'm likely selling my 4090s in anticipation of the 5090 launch. Going to camp out or whatever it takes to get one when they launch. The 3090s just do a great job, so they get to stay. A 3060 12GB is on the way currently lol. I *may* have a GPU problem.
Thanks for the review. I have an Asus Z10PE-D16 WS motherboard, 2x Xeon E5-2683 v3, 8x 16GB DDR4-2133P, 5x 3090, and many Corsair 1500i PSUs. Tried 70b q8 and q4 and 405b q2. They are extremely slow. What am I missing? And what is 4i SFF-8654? Ty
You checked with nvtop while running that they are hitting the GPU VRAM during operation? If it's running slow, that's the place to start.
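If nvtop isn't on the box, a quick way to script the same check is with the NVML Python bindings. This is just a hedged sketch that prints per-GPU memory use so you can confirm the model actually landed in VRAM:

```python
# Quick check that a model has actually landed in GPU VRAM (rough stand-in for nvtop).
# Assumes the NVML Python bindings are installed: pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB VRAM used")
pynvml.nvmlShutdown()
```

If the used numbers stay near zero while the model is generating, the runner is falling back to CPU/system RAM, which would explain the slowness.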
Perhaps the old CPUs do not have AVX-512, or even AVX2.
I don't understand something: with this setup, aren't you limited to just small LLMs? Mainly because only 2 RTX 3090s can sync together via NVLink, so you essentially have 2 pairs of cards with your four RTX cards.
Also, I'm wondering about a PCIe bottleneck.
Lastly, I would advise getting enough RAM to load an entire 300-billion-parameter LLM, which works out to about 1.2 TB.
Could you please discuss the limitations of this setup?
No, that's not a correct starting assumption, but it is one I started with as well. It's poorly discussed, but I'm working on talking about and sharing much more of what I'm learning about all this. You do not use NVLink for inference. The llama.cpp runner code automatically layers the model across the GPUs, so there is no need for NVLink outside of high-end training. To be clear, I'm using no NVLink. It can also layer the model into system RAM, but there is no need to run any large-parameter model off system RAM, as performance is abysmal. Even on the world's fastest CPU/RAM combo it is unacceptably slow. Think 1 tok/s at q4 for Llama 3.1 405b.
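For anyone curious what that layer offloading looks like outside of Ollama, here is a minimal sketch with the llama-cpp-python bindings (the GGUF path is just a placeholder): n_gpu_layers=-1 asks it to push every layer it can onto the GPUs, and whatever doesn't fit spills into system RAM, which is exactly where the speed falls off a cliff.

```python
# Minimal sketch of llama.cpp layer offloading via llama-cpp-python.
# The model path is a placeholder; point it at whatever GGUF you actually have.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.1-70b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,   # offload as many layers as possible to the GPUs
    n_ctx=4096,        # context window; bigger contexts need more VRAM for KV cache
)
out = llm("Q: Why is spilling layers to system RAM slow?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```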
@@DigitalSpaceport thanks
I made this video that shows this pretty well also ua-cam.com/video/-heFPHKy3jY/v-deo.html
Does it matter if I purchase a 7702 over a 7702P AMD EPYC CPU?
@@Grapheneolic The 7702P is fine for single-socket boards. The only difference is that it won't allow for a second processor.
@@DigitalSpaceport Thanks for the quick reply. So given I purchased a 7702, I could technically add a second processor if I wanted to?
@Grapheneolic If you have a motherboard with a second socket, yes.
Why don’t you use NVLink?
You won't get an improvement for inference workloads. I have tested on dual A5000s with NVLink. I may however use NVLink for doing training on these 3090s, but I'm not there yet.
Do you have to use an even number of GPUs (4)? Will it work with 3 GPUs?
Yeah, 3 will work. Just remember that the VRAM is additive, so you want the whole model to fit into the cumulative VRAM of the cards.
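A rough "will it fit" rule of thumb, counting weights only (KV cache and CUDA overhead eat a few extra GB per card, so leave headroom), sketched out:

```python
# Rough "does the model fit" check: estimated weight size vs. cumulative VRAM.
# Weights only; leave headroom for KV cache and runtime overhead on each card.
def fits(params_billions: float, bits_per_weight: int, cards: int, vram_gb: int = 24) -> bool:
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb <= cards * vram_gb

print(fits(70, 4, 3))    # 70b at q4: ~35 GB vs 72 GB  -> True
print(fits(70, 8, 3))    # 70b at q8: ~70 GB vs 72 GB  -> True, but very tight
print(fits(405, 4, 4))   # 405b at q4: ~203 GB vs 96 GB -> False
```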
@@DigitalSpaceport Thank you very much for your reply. I'm just getting into building a PC for LLMs and gathering information on which GPU I should use and how multiple GPUs can be beneficial.
Can this build be used for everyday computing, as well?
Sure. You could easily run the setup I did a year back here; it has worked very well. It is just Proxmox and an LXC, but there appears to be a way to get a Windows instance running in an LXC. I need to look more into that. ua-cam.com/video/IylJNfLi36E/v-deo.html
Would also be nice for rendering Blender scenes.
I'll add this to the benchmarks 👍
This is dope and extremely cost effective, but it's not future-proof. What happens if 2x 5090s make it possible to run a 1T-parameter Llama 4?
No, it's not future-proof at all, but I wanted to wait until we see the next NVIDIA GPUs before I decide on something bigger. I don't think we will see more than 24GB of VRAM in the 5090 currently, and while model splitting is a thing and does work... it's pretty slow.
How did you get 3090s so cheap?
In Europe they are 1500 a pop.
They had been used by a friend for Ethereum mining prior, in a harsh environment. The amount of dirt I had to clean off these was really a lot. The thermal pads had also been destroyed. All replaced now, but it was a lot of work.
It reminds me of those miners: deja vu, I have been in this place before 😂
Did I miss any pricing comparisons and info in the video?
I didn't do direct price comparisons. I would suggest you consider the H12SSL for a mobo, however. It's worth the extra, imo.
@@DigitalSpaceport What is the meaning of the word mobo?
Motherboard
Sweet
Why was no NVIDIA NVLink used?
For 3090s I'm not sure it does anything for inference tasks? Does it? I have a dual A5000 setup with NVLink and it does enable a larger non-sharded memory size, but I only know of that in the context of GIS. Also, just to be clear, I'm pretty new to running local AI and not trying to LARP as an expert. I'm here learning myself also.
Nice video. Work on that audio though. The voice-overs sound off.
I work and record in a harder audio environment than any other homelab YouTuber, and I hope you consider that as well. I already spent over an hour on the audio for this, and it's impossible to get clean audio without shutting down the rack machines. If I were in a studio like they are, I would for sure be embarrassed by this audio quality, but I'm 8 ft away from a mini datacenter. I do want to set your expectations ahead of time that this may be the audio quality I can achieve.
I wonder what the "really cool AI and other things" are? Outside of maybe home AI and some prompting, I can't really wrap my mind around hosting an LLM. Can anyone tell me the other applications?
Check the most recent video here for some examples of vision routing and real-time web search engine hosting. I didn't want to drag that video on longer, and I am building and learning in real time also (sharing along the way); there are more functional, use-case-based videos coming. I agree that part is lacking in this video, but it was only intended to showcase how to build the thing.
I used to build systems like these to mine Ethereum
Same rack, yup, with some modification to fit full-bandwidth risers. I'm going to work on a larger one next lol, need more GPUs ha
Is AI training the new Bitcoin mining?
*Maybe, if you have 1gb upload speeds
Rolled you to 1.8 k likes
I didn't know this video had that many likes 😳 Thanks!
It must be a great feeling if money is not an issue and you can just make whatever you want with it…
What car do you drive? I chose to put my money into homelab stuff instead of new cars, and somehow that means I must be rich? That is not accurate. I don't waste money on things I don't care about, like new cars. Everything you see in my entire lab cost less than a mid-range car.
@@DigitalSpaceport It was not meant to be an attack… we struggle to get food on the table… a long illness basically destroyed my life…
Why didn't you go for a tower cooler? There are some decent 3U/4U options that are not loud, and the performance is more than adequate. Please note that server motherboards rely on airflow over the VRMs for optimal operation; you could run the risk of hitting thermal limits and causing throttling or shutdown of the system.
Yes, I have an HDX Vornado-knockoff mini fan pointed at the mobo. It will be in the testing video. I do have tower coolers, but they are all in use in other systems currently. This Corsair 420 I had free, and I very well might be putting the 7995WX into this rack at some point for testing on the fastest platform available.
@@DigitalSpaceport if it's free, then all good. great video
You tapped 3 screws through the mobo?? 😮
LOL, oh god no. I mounted the board up and used a pencil to mark the 3 spots, then removed the board and tapped the 3 spots. I'm not that crazy!
@@DigitalSpaceport phewww lol 😂
But, will it mine (Bitcoin)?
That it cannot
There is absolutely no way you are going to see any kind of condensation unless the room is below freezing or you are using liquid nitrogen. Why even mention condensation???
The window AC spits out sub-32°F air, and one of many considered plans was to have the heatsinks right next to it. I opted not to, and everything is cooled well from a distance anyway. I do see condensation at times on the AC directional fins and need to wipe it off and keep an eye on it so I don't get mold growth.
$500 challenge
RIP your power bill. 😢
It's not that bad. When we bought the house we made sure to go outside the city-owned utility to a much cheaper co-op. Under $300/mo for the whole house in central Texas is very decent.
Will I be good at Fortnite finally?
That game is impossible. There is always a tween on a cell phone who is faster!
2x 4090 is better than 4x 3090 by all means.
Except for total VRAM amount, but I do agree also, as an owner of 2 4090s.
How to build an AI girlfriend?
Okay, just for you, I'm gonna try to make one. Wife might end me though 😆
Unlike gaming, AI and machine learning really do not benefit from x16 vs x8 lanes. That is because models are loaded once. Once the model or models are loaded into VRAM, the CPU has a minimal effect. Now, if you are pooling VRAM with NVLink, it is much faster than PCIe 3.0, 4.0, or even 5.0 by a long shot. Also, though I have U.2 access with both the Z590 Dark and Z690 Dark Kingpin, they pale in comparison to the speeds of native PCIe 3.0 and 4.0 NVMe.
I, too, have that same chassis from mining, but have always wondered how it would perform as an AI frame; I just haven't gotten around to tinkering. At 3:20 I stopped, because the experience gained from the last year of the Ethereum GPU mining boom up to now is sufficient, and for me, I doubt there is any real new value.
Oh, I'm doing training also, but yeah, full lanes are not needed for inference; I did mention that. Can I NVLink the 3090s? I've read recently that the return is minimal. I guess the channel isn't for you, no harm at all there lol.
Watching this video with $100 in my account :|