@@DigitalSpaceport thank you very much for your reply, I'm just getting into building a PC for LLMs and gathering information on which GPU I should use and how multiple GPUs can be beneficial
Sure. You could easily run the setup I did a year back that has worked very well here. It is just Proxmox and an LXC, but there appears to be a way to get an LXC Windows instance running. I need to look more into that. ua-cam.com/video/IylJNfLi36E/v-deo.html
No, it's not future proof at all, but I wanted to wait until we see the next Nvidia GPUs before I decide on something bigger. I don't think we will see more than 24GB VRAM in the 5090 currently, and while model splitting is a thing and does work... it's pretty slow.
They had been used by a friend for Ethereum mining prior, in a harsh environment. The amount of dirt I had to clean off these was really something. The pads had also been destroyed. All replaced now, but a lot of work.
For 3090s I'm not sure it does anything for inference tasks? Does it? I have a dual A5000 setup with NVLink and it does enable a larger non-sharded memory size, but I only know of that in the context of GIS. Also, just to be clear, I'm pretty new to running local AI and not trying to larp as an expert. I'm here learning myself also.
I work and record in a harder audio environment than any other homelab YouTuber, I hope you consider that as well. I spent over an hour already on the audio for this and it's impossible to get clean audio without shutting down the rack machines. If I was in a studio like they are I would for sure be embarrassed at the audio quality, but I'm 8 ft away from a mini datacenter. I do want to set your expectations ahead of time that this may be the audio quality I can achieve.
I wonder what the "really cool AI and other things" are? Outside of maybe home AI and some prompting, I can't really wrap my mind around hosting an LLM. Can anyone tell me the other applications?
Check the most recent video here for some examples of vision routing and realtime web search engine hosting. I didn't want to drag that video on longer, and I am building and learning in realtime also (sharing along the way), and there are more functional use-case-based videos coming. I agree that part is lacking in this video, but it was only intended to showcase how to build the thing.
What car do you drive? I decided to put my money into homelab stuff vs new cars, and somehow I must be rich? That is not accurate. I don't waste money on things I don't care about, like new cars. Everything you see in my entire lab cost less than a mid-range car.
Why didn't you go for a tower cooler? There are some decent 3U/4U options that are not loud and the performance is more than adequate. Please note that server motherboards rely on airflow over the VRM for optimal operation. You run the risk of hitting thermal limits and causing throttling/shutdown of the system.
Yes, I have an HDX Vornado-ripoff mini fan that I have pointed at the mobo. It will be in the testing video. I do have tower coolers, but they are all utilized in other systems currently. I had this Corsair 420 free, and I very well might be putting the 7995WX into this rack at some point for testing on the fastest platform available.
There is absolutely no way you are going to see any kind of condensation unless the room is at below-freezing temperatures or you are using liquid nitrogen. Why even mention condensation???
The window AC spits out sub-32F air, and one of many plans I considered was to have the heatsinks right next to it. I opted not to, and everything is cooled great from a distance as well. I do see condensation at times on the AC directional fins and need to wipe it off and pay attention to it so I don't get mold growth.
It's not that bad. When we bought the house we made sure to go outside the city-owned utility to a much cheaper co-op. Under $300/mo for the whole house in central Texas is very decent.
Unlike gaming, AI and machine learning really do not benefit from x16 vs x8 lanes. That is because models are loaded once. Once the model or models are loaded into VRAM, the CPU has a minimal effect. Now, if you are pooling VRAM with NVLink, it is so much faster than PCIe 3.0, 4.0, or even 5.0 by a long shot. Also, though I have U.2 access with both the Z590 Dark and Z690 Dark Kingpin, they pale in comparison to the speeds of native PCIe 3.0 and 4.0 NVMe. I, too, have that same chassis from mining but have always wondered how it would perform as an AI frame--just haven't gotten around to tinkering. At 3:20 I've stopped, because experience gained from the last year of the Ethereum GPU mining boom to now is sufficient, and for me, I doubt there is any real new value.
Oh, I'm doing training also, but yeah, full lanes are not needed for inference. I did mention that. Can I NVLink the 3090s? I've read recently that the return is minimal. I guess the channel isn't for you, no harm at all there lol.
Do you have any recommendations for an air cooler for a CPU?
What is the specific CPU part number?
@@DigitalSpaceport I’m going to purchase the 7702p CPU, just like your specifications.
I use the Noctua SP3 on another EPYC Rome system, a 7B12. It is a very tall cooler but should just barely clear on the GPU rack, from a measurement I just did. Like 10mm close. geni.us/NoctuaSP3_CPUCooler
Alternatively the D9 also fits SP3 and is lower profile at 92mm, and it is featured in this video on my 7995WX. I like it, but it's noticeably louder under load; then again, the 7995WX is the hottest chip to my knowledge. If it ramps the fans to full blast, it's not quiet at all. ua-cam.com/video/YQZ2HWonnGA/v-deo.htmlsi=d5CtIYrhjHF0D5AB
@@DigitalSpaceport Thank you for clarifying. My expectation was for the air cooler to be quiet.
@@TuanAnhLe-ef9yk To be honest I think it's crazy. I used to run quad GPUs from 2 cards, but that was difficult enough with power and cooling.
I'm not even sure that 4 RTX 3090s will scale well for machine learning.
Wow, loving this build! I built my Llama 3 setup on my Proxmox with a single RTX 3090 passthrough. It gets pretty hot, I can only imagine what kind of heat load you're pumping into that room...
I have active cooling, but yes, that has limits. I have some ideas on placement to hopefully help that I'll be testing in the next video. Rearranging the area a bit now.
It depends on your workload and whether you're running a model or training a model.
If you're just running a model, unless you're constantly peppering it with requests, it might not get all that hot.
(I'm running a dual 3090 setup with the open-webui, which uses the Ollama backend, running the Codestral 22b model, and it only spins up when I ask it something or type a response back, but then in between that, the GPU sits at idle.)
If you're TRAINING a model, then that's a different story.
I have 2 RTX 3090s in an AMD 5900X system and they thermal throttle because of the spacing. Once I get the PCIe extension cables and install them in a new case, that should solve the problem.
@@jksoftware1 yeah, the 3090 is a really fat card. I toyed with the idea of water cooling, but it was much cheaper to just use the risers. The risers are still too expensive IMO also.
@@jksoftware1
I have two 3090s in an Asus Z170-E motherboard with a 6700K and 64 GB of RAM.
The two slots drop down to a x8/x8 configuration, so it makes it more difficult to push a hard enough load onto the GPU to get it to thermal throttle.
(And that's with running both InvokeAI and open-webui with the codestral:22b LLM models simultaneously.)
If I need to space them out, I can use some of the GPU mining hardware that I still have. The PCIe x1 link back to the motherboard won't be great for bandwidth, but it will provide ample spacing for cooling.
Really cool build, I also have a 4 GPU rig.
The only thing that I would recommend is trying to give as much space between the GPUs as possible, because having them close to each other will generate a lot of heat; the difference is enormous.
I would also add extra fans for the GPUs. I personally like a maximum of 1 GPU per 120mm fan, with the fans blowing air directly at the GPU.
I'm not sure if a water cooler is a good idea here. I'm saying that because no server uses water cooling, and neither do CPU miners (people running the CPU at 100% 24/7), because water coolers tend to stop cooling effectively at some point, and they don't have the best efficiency. I'm also not sure if a 120mm fan will fit on your build, I'm just giving food for thought.
Yeah, the water cooling is working fantastic at keeping the CPU cool under workloads for really large models, but I'm fast becoming badly annoyed by those. Too dang slow. The 3090 non-Ti also generates a lot of heat on the back. I'm working on placement changes to help with the heat load also. I'm likely to have a video on that at some point. I've got a Vornado knockoff right now and it hits the board and keeps the NVMe drives cool, but redoing the whole layout of the mini datacenter is very likely.
LOL when you moved the camera from your 200W "reasonable power draw" on the rig to that insane server rack probably drawing several kilowatts. Nice video!
You do have a point there 😅
The budget of 5K for that build is awesome; at first I thought we were looking at around 10K.
On a side note, I am having trouble sourcing used 24GB 3090s in my area.
Thanks, I also think it's very cheap for what it is. I couldn't get as much bang for the buck any other way.
I'm in a fairly large metro and my local sellers want more than eBay. I pointed this out to one who had a similar 3090 Ventus for about $75 more, thinking they would be like "okay, I'll go to $600," and they refused to match eBay. Totally their choice, but local sellers being way high on prices and unreasonable in negotiating is a recurring theme recently.
What about the electricity bill?
About $275/mo. We bought this house with electric rates in mind, as everyone doing any form of HPC should in my opinion. It can go as high as $800/mo if I'm really cranking flops, but that's a fraction of what cloud costs would be. A not-small part of this is production for my business.
Nice build! Have you run many training workloads on it?
The single-core perf of the 7702, even with boost, is pretty mediocre. I fear it would bottleneck training unless you spend a bunch of time optimizing data loading code. I went with a Threadripper Pro for my 4x 3090 for this reason, but always wondered how a 7702 would perform.
Good stuff, man. Looking forward to what the performance will be like.
Cool build, love the spreadsheets. You should project the cost of ownership. At what point does it become too expensive to own? Is there perhaps undervolting potential to bring power draw down without adversely affecting performance too much?
How can you make money out of it? Can you bridge the gap from your website to this awesome beast and share time against your AI? Is there a particular branch of AI that is more cost-effective than another that might not be landing on the sweet spot? Could you add an ASUS RAID card, perhaps another network card, say 10GbE? Can processes be re-routed to avoid CPU and RAM bottlenecks?
Good job, want more.
Nothing wrong with zip ties. Sweet rig!
Sweet.
Gonna build this, my old school dual C2070 is now a dinosaur.
I had to look that GPU up. Does Fermi still work for ollama?
Nice one! I built an 8x A4000 EPYC server which was... epic! 128GB VRAM
The A4000 has such a nice single-slot format and 16GB VRAM, it's a great card. What are the biggest models you like to run on that 8x rig? I have 1 A4000 and 2 A5000s but am thinking of selling those for more 4090s.
This is exactly what I want to watch!!!!!!
Sweet! Glad this build floated your boat. The next build video is the smaller guy, filming that now.
That odd fan is making me go crazy LOL
I know, but I spent all my money on GPUs and pads.
Your video is great!
There is a used ASUS Prime TRX40-Pro for a pretty decent price. Do you think it would do the job as well? The board comes with 3x PCIe 4.0 x16 and may need a splitter to accommodate 4 GPUs.
In addition, I already have 6x RTX 3090s. Do you think it would be beneficial to utilize all 6 GPUs? I plan to go for Llama 3.1 70b.
What is it priced at and does it include CPU?
@@DigitalSpaceport The TRX40 is 120 USD without CPU or RAM.
Quite hard to find AMD server motherboards in my area. What would be a comparable setup on the Intel side?
Powering each 3090 with a single 8-pin via a piggyback connector? I thought the 8-pin standard had a max wattage of 150W, and even though you're going to use Afterburner (or equivalent) to reduce it, you stated an anticipated 275W. I just finished a 4x 4090 Chia mining setup and I'm going to have to use Afterburner to reduce the power, as three 4090s on a single ASUS ROG Thor 1600 trips the internal breaker and shuts off the PSU. Will be interesting to see how your four 3090s and the Corsair 1500 handle something similar with the additional draw from an EPYC processor/motherboard combo. You may need to add a second PSU to perform as desired. Thumbs up! - I see the Phison drives also, nice touch.
Okay, I'm going to research that more here. I'm not aware of it, and none of the cables are warm; I just touched them after about an hour under load. Still a good thing to know.
@@DigitalSpaceport, @bishop838 is correct. 150 W per cable for the PCIe connector is the official spec. You also get 75 W from the PCIe MB connector. Some of my power supplies list 200 W per cable max (even with two connectors), so if you can limit your GPU power to ~225 - 275 W, you'll be under the limit. If you're just running LLMs doing inference or even Stable Diffusion/Flux image generation, though, you should be fine even with the current setup. Unless you're doing training or fine-tuning that runs the GPUs at 100% continuously, you're unlikely to trip any breakers or brownout your power supply.
When it comes to tabs, I am your wife to a T lolol, thanks for the shoutout. Loved the video!
Sorry if I'm repeating questions from elsewhere in the comments. How did you "manually limit the GPUs"? What approach did you take to lower their power consumption to fit within the profile necessary for your PSU? What was the impact to performance of doing that? Also, why not add a second PSU so they can run unconstrained? Is there a way to safely do that?
nvidia-smi -pm 1 and -pl XXX, whatever wattage you want. It turns out however your GPU processing utilization is 1/n, so each 3090 here only runs at about 1/4 power. This is due to how ollama/llama.cpp currently handles model splitting: no tensor parallelism and such. There is work that may address this in the future, and I would consider it more then.
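For anyone wanting the exact commands, this is roughly it (the 275 figure is just an example; check your card's supported range first):
sudo nvidia-smi -pm 1      # keep the NVIDIA driver state loaded between runs (needs root)
sudo nvidia-smi -pl 275    # cap every GPU at 275 W; add -i 0,1,2,3 to target specific cards
nvidia-smi -q -d POWER     # confirm the current, min and max power limits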
With the power supply cables, the graphics card manufacturers do not recommend using the two connections on one cable; they recommend using separate cables 😵💫
Yeah, for inference workloads, the way the model gets split, each added card only does its 1/n of the workload, so they are never hitting any one card hard. Each card here gets ~25% of the GPU processing workload. However, llama.cpp is implementing tensor parallelism, so that should correct that, and yes, I need a second PSU on this rig as well. Might use a larger rack also.
Does it work to mix different generations of GPUs, like RTX 30 and RTX 40? Nice job!
Yes
Yes, for inference, like running premade models, you can mix GPUs and PCIe bandwidth is not that important. I'm going to test mixing in various other GPUs to see how this impacts performance. For training you want the cards to be as closely matched as possible.
Why limit the GPUs' TDP?!? Just add another PSU! 4x 3090 is 1400W already! 512GB RAM and the 7702 CPU are another 500W, so one more PSU, 750W minimum; it costs nothing compared to the price of the system. And with 1500W you don't want to run it at the max limits, keep a 20% reserve. If you want a stable, reliable system, your GPUs have to be limited to 150W instead of the default 350W, and that's a huge hit!
I'll likely move it into the server racks at some point, for power draw reasons. I'm also likely to get a DC power board for it then, but I've got a lot of gear rearranging to tackle before it gets to that point. The CPU/RAM doesn't cross 250W from what I've seen on my 7B12/H12 combo unless I'm running the CPU hard, which this type of workload doesn't seem to do so far. I also don't trust the dual-PSU adapter kits; I've known a guy who mined ETH hard and had several burn up.
@@DigitalSpaceport Llama uses the CPU and RAM at first to compile/load, then the GPUs, from my observations, and I asked Llama how it (he/she) works and got a reply confirming it :)
Very good point, btw
@DigitalSpaceport Also, for LLMs, the peak performance is less important. The VRAM is what is golden about that setup, right? I limit the power of my 3090 on my rig, just so it doesn't reach the highest temps in my fairly small case (I am at my financial limits; $800 of used components is what I could afford).
AMD EPYC™ 7002 Series Processors
AMD EPYC™ 7001 Series Processors
Single processor, 7nm technology
Up to 64 cores, 128 threads
*cTDP up to 280W
No 7702 CPU supported, furthermore AMD no WAY
Thank you. Looking forward to the thermal paste report.
Coming soon!
Nice video! Do you have a demo of how well it runs, or some benchmarks?
Oh, lots of them. Check the channel history and there are many benchmarks. The current best is Nemotron on this rig. ua-cam.com/video/QXVSIR2z1q4/v-deo.html
Ollama on multiple GPUs actually needs very little bandwidth. Check it yourself with tools like nvtop: make an Ollama query on a multi-GPU setup/model and see that PCIe bandwidth is rarely even in megabytes territory. I think what counts more is the latency.
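Worth adding the commands for anyone who wants to check this on their own rig (the dmon column layout can vary a little between driver versions):
nvidia-smi dmon -s t   # per-GPU PCIe Rx/Tx throughput samples while a query runs
nvtop                  # TUI that graphs utilization, VRAM and PCIe together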
Thank you for amazing content.
I bought the same setup using your links. I am having a hard time understanding where to plug in the power switch. How are you turning yours on/off? Is there any spot on the motherboard I can plug the switch into?
Look up the motherboard manual online by model number. Then search that document for PWR and it will show you the header pin positions. The black cable (-) is ground and the red is +.
Perhaps I missed it, but why such a powerful CPU and so much RAM for an LLM server? The VRAM is the important part. Or did I forget something?
No Ollama demo?
Yeah, I'm separating the hardware videos, software install/config videos, and benchmarking videos. Those will all be out very soon.
I have an EPYC 7551P with 256GB and a Tesla P4. Going to potentially put my 3090 in it. The 7551P also does 2GHz on 32 cores. Do you find 64 cores at 2GHz is working well, or is the CPU clock speed a bottleneck regardless of core count?
Hey, what are your thoughts on mixing and matching GPUs (i.e., dual 3090s and an RTX 4000/4500 Ada)? Are there any benefits or disadvantages to mixing versus all the same GPUs?
Good question! I'm not sure; I guess I will test that out here actually. I would guess that core speed/VRAM speed will dictate a lowest-common-denominator outcome, as all work pieces need to be completed before a response. I am also curious about scaling up VRAM via non-homogeneous routes and the impact that has. I think layers are distributed in ollama/open-webui intelligently to each GPU based on VRAM capacity. I'm going to check; this is an important question. Thanks for asking!
Would love to see a Geekbench result for this machine.
Geekbench? If that can run on Ubuntu 22 I'll toss it into the benchmarking video.
A good alternative is buying a refurbished Mac Studio with M1 Ultra + 128GB RAM on eBay for around $3k. The M1 Ultra with 128GB RAM will run 70b models with q8 precision at ~7.5 tokens/second and draws less than 100W when running such models. Additionally, you can configure it to allow up to 120GB RAM for the GPU, which should be enough to run 70b models at 64k token context.
I have one big issue with the Mac Studio route, and that is that the tokens/second fall into what I deem an unusable range for middle-size models. Under 10 is painful and discouraging to use, IMO.
@@DigitalSpaceport 7 tokens/second is slightly above the speed at which I can read and get good comprehension of what I'm reading, so anything above that speed doesn't make much difference for me when using the model on a chat UI. However, for using the model as an agent for automating tasks, then yes this speed is very low.
One thing I'm curious about is what kind of speeds you get when using larger contexts with the quad RTX 3090 setup. On the M1 Ultra it gets very slow for 70b models at close to 30k tokens in context, about 2-3 tokens/second.
Any video of it running the big Llama 3 model?
I missed where Ollama training has been shown and how you tell it to divide itself among 4 GPUs. Can you fit a 70GB model in these 4 GPUs, for example something in FP16 with ~30 billion params?
The user does not tell it how to split into layers; the underlying parallelism method is applied automagically by llama.cpp, which powers ollama. You cannot fit llama3.1-70b-instruct FP16 into just 96GB of VRAM. That takes 140GB (ollama.com/library/llama3.1/tags), but you can fit the q8.
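If it helps, a quick way to sanity-check what fits on your own box (exact tag names are on that tags page; the ones here are examples):
ollama pull llama3.1:70b-instruct-q8_0   # grab a q8 quant instead of fp16
ollama list                              # compare on-disk sizes
ollama ps                                # after a prompt, shows how much of the model landed on GPU vs CPU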
Wow, that's a super impressive build. I'm looking at doing the same GPUs with the Lenovo P520 or Lenovo P920.
The price point on used Lenovo systems is pretty attractive!
Looks cool. I prefer Founders Editions, so pretty and such high build quality. Oh, by the way, your mobo & EATX connectors didn't look pushed in properly?
I love the FE editions also, they are works of art. I have an FE 3070 and 3080 Ti but noticed on used markets the 3090 FE is more than just slightly more expensive. Good eye! I just went and seated it fully.
The AMD EPYC 7763 is probably a better choice for the CPU; it was released about a year after the 7702 and is a full 7nm process. About the same price as the 7702 on eBay right now.
I like this comment, but let me ask you: if I wanted to bias a bit more toward mid-range EPYC frequency, what would be your recommendation? I'd be okay sacrificing core count, but selling to switch would be ideal.
I'm curious whether it's better to have identical 3090s or if using different brands wouldn't make a difference. Also, is it possible to mix 3090s with 4090s?
Yes, you can mix any GPUs, even extremes like a 1070 and a 4090, and benefit from the added VRAM size. However, if you have slower CUDA cores (usually older gen is slower) then you will baseline performance at the lowest card's level. Mixing a 3090 and 4090 at quant 8 or lower makes next to no discernible difference. Mixing 30/40 series at FP16, you will see a several t/s slowdown. I like keeping it to one model of 3090 if you can, as you likely need to clean/repad 3090s especially. It's easier to know what screws and pads go in which spots that way.
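One practical tip if you do mix cards: you can A/B test with and without the slower one using the standard CUDA env var (indices below are examples; check yours with nvidia-smi -L):
CUDA_VISIBLE_DEVICES=0,1 ollama serve       # expose only two cards for a comparison run
CUDA_VISIBLE_DEVICES=0,1,2,3 ollama serve   # then rerun with everything visible and compare tok/s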
How do you plan to cool your mainboard? The board is made to run in a server enclosure. In your current setup only the top of the rack with the GPUs is actively cooled. How do you plan to cool the DDR and other chips on the motherboard?
I run an H12SSL-i motherboard with an EPYC 7573X and 2x 3090 and went with an Arctic Freezer 4U air cooler instead of water cooling to get that extra needed airflow inside a 4U server case. I was considering the Gigabyte motherboard, but since I don't use a riser setup the top PCIe slots wouldn't be usable, since the GPU would be directly over the CPU socket.
I think the H12 is an excellent choice also! Have one myself going into a 4U case soon. Four fat 3090s wouldn't fit in the case however, and this board is very cheap. I have a small Vornado-knockoff fan that moves a lot of air over the mobo. That's shown in the most recent video now also.
Used 3090s are less than a THIRD of a new 4090. Now I need an EPYC board to support all of my PCIe stuff.
EPYC is a very good option and yeah the price difference in 3090 to 4090 is notable.
A used 3090 is 850 EUR, how come that's cheap?
How much is a 4090?
200W idle in Europe/Germany would kill your electricity bill.
Agreed, and I have more focus on power-efficient systems now also. The video I launched today is a PC that idles around 25W, ua-cam.com/video/iflTQFn0jx4/v-deo.html, and that machine should hit maybe 75% of most average people's use cases. I am working on one that hits 8W idle also; it likely covers around 50% of use cases but may also be of interest. I did get that rig's idle down lower after working on it some more, and I need to mention those steps in a video sometime also. Powersave and some kernel flags really can cut the wattage for an EPYC with minor usability impact.
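For reference, the powersave piece is basically this (available governors and drivers depend on your distro and BIOS settings, so treat it as a sketch):
sudo cpupower frequency-set -g powersave   # switch all cores to the powersave governor
cpupower frequency-info                    # verify which driver and governor are actually active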
Which Llama 3.1 model have you deployed/tested?
All of them. Anything specific you are looking for an answer to?
What token speed do you get in ollama for 70b q8 on GPU, and on CPU only if you use AirLLM with a DDR4 RAM virtual disk for AirLLM layer offloading? AirLLM can run even larger LLMs on a single GPU with any amount of RAM, but very slowly, and nobody has made a RAM-disk test.
Do you think it is possible to make a server like this, but add a file server also?
I mean, I want to make a server that I can use as a NAS, Emby/Plex server and AI...
I want to use maybe Proxmox and share the GPU for all these servers...
Is this possible please ?
Thanks a lot
You could have mounted the CPU radiator on the shelf below, level with the GPUs; maybe that would help take the strain off that one hose. Dude, 12:34, what are you into lol, that's some setup.
Yeah, I want to fabricate an entire new case. This is just not optimal. In a nutshell, I have a backup of the USGS GeoTIFFs and I do a geospatial-rendering-based workload for my business. It can now be done with GPUs faster at nearly the same quality as CPUs, so those R930s are not really needed as much.
@@DigitalSpaceport Cool stuff, I just recently watched a tutorial, VAPOR for WRF-Fire. I'm starting to learn a bit about visualization with matplotlib, mostly on dataset embedding and query returns.
I want to build something similar but with Tesla P40 GPUs. Can you help me figure out how to cool them with liquid cooling? 🙏
If there is not a premade water block for the P40, I don't think you will be able to. Maybe some sort of immersion cooling could work, but I am still too small of a channel to toss expensive parts into something like that. Those are expensive.
Great video 👍 Can you make a video of your system running Llama 3.1 70b?
While this is mainly a tutorial to get Open WebUI, Ollama and Meta Llama 3.1 set up in Ubuntu, it does feature me running the 70b, and while the stats I shared for a story generation may not be the same as hard logic, it's pretty good. I'll have full in-depth testing on 8b and 70b soon. 405b is now giving me issues; it was running a few days ago... The stats part is closer to the end. ua-cam.com/video/q_cDvCq1pww/v-deo.html
Great job keeping that under 5k. I made so many mistakes, like dual Xeon Gold 6148s, which didn't cost me money but time. I got it to about 5700 and it is not as good as yours.
6148s are pretty nice chips also though! Are you running it in an open air frame or a rack case?
Please test mixing different VRAM-size cards like a 3090 (24 GB) and a 4070 (12 GB). Can it balance the work in a way that doesn't crash when it hits the 12 GB mark?
I am planning on this and a few other tests. Here are some cards I have on hand that I may run mixed workload testing against: 3060 Ti, 3070, 4090, A4000, A5000. I think the A4000 + 3070 + 2x 3090 would be a good test.
Which ollama model did you use? What is the token count? There isn't any info...
I'm using Meta's Llama 3.1 70b and it hits between 17 and 22 tok/s. 8b hits around 95 and 405b hits around 1. I have a full video on each model coming up, but in this video I think I have a chapter on Llama 3.1 70b you could check. ua-cam.com/video/q_cDvCq1pww/v-deo.html
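If you want to pull those numbers on your own hardware, the per-response stats come from the verbose flag (rates will shift with quant and context size):
ollama run llama3.1:70b --verbose   # prints prompt eval and eval rates in tokens/s after each reply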
Can you test a dual ASUS AI accelerator card setup vs that quad 3090 for comparison, to see which is faster for running AI and training new models?
I don't have those cards and I don't know if they would be compatible either.
No NVLink? The problem going to 4090s is that Nvidia dropped NVLink for the 40-series GPUs, sad.
For inference workloads NVLink doesn't help. But for training workloads I will need to get NVLink. I'm a bit hesitant to do anything until we see what the 5XXX prices/features are though.
Man, where did you find a 3090 for that price?
Have you tried renting out your hardware with Vast.ai or Salad?
I need better upload speeds. The cable modem has me capped at 40 Mbit upload, but high-split is on the way and should make that a viable route for idle time. I need to think about reservations and utilization more before I put this rig on it, but competition is pretty high on DePIN, and it's impossible at 40 Mbit.
Great video - subbed!
Welcome to the channel!
These 4 GPUs should only draw 250-300W altogether?
Due to the way the model workload splits across the GPUs when you are using their VRAM, they are often around 25% utilization on the processors. There are other ways to split workloads, but llama.cpp is under the ollama hood, so that would need to be addressed there. Tensor parallelism is the term.
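You can watch that roughly-25%-per-card pattern yourself with something like:
nvidia-smi --query-gpu=index,utilization.gpu,power.draw,memory.used --format=csv -l 1   # 1-second samples per GPU while a prompt runs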
Can they run a 70b model?
Oh yeah, they run any 70b pretty well, and my new favorite is Nemotron 70b, featured here ua-cam.com/video/QXVSIR2z1q4/v-deo.html
Explain it to me like I'm new to this.
Why would you want to run an AI server? What applications would this enable, and is it actually any better than building a server with more consumer-type parts, i.e. a 7950X or 7900X + ONE 3090?
Love this question. I'm going to quote it in the single-GPU video, which will fully answer the why part. Of note is speed for inference (processing) requests to the system, and models landing inside VRAM is of course ideal, to the tune of roughly 10x speedups. That's a major reason. Several other big ones exist as well.
I'd love to do something like this, and I have some reasonable hardware to make it happen, but I straight up don't have the power. What do you use as a power source? A giant solar array? My power in CT just went up to $0.35/kWh.
I'm on grid power unfortunately still, but that likely changes this year. Our rate is $0.10/kWh and I'm on a co-op that does a great job controlling costs. We do have land for a ground-based array onsite, but trenching in limestone is expensive. Austin gets a lot of sun, so it likely makes good sense for us. At $0.35 I'm not sure what I would do!
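Rough math for anyone pricing a rig against their own rate (the 200W idle figure is just an example pulled from elsewhere in the comments):
watts=200; rate=0.35
echo "scale=2; $watts/1000*24*30*$rate" | bc   # ~50.40 per month at 24/7 idle; the same 200W at $0.10/kWh is ~14.40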
Hi, totally awesome. How can I build a server to get performance equal to an AMD Ryzen Threadripper Pro 7995WX, with an RTX 4090, and 128 GB 6400MHz RAM with PCIe 5.0 NVMe? I am doing research on building a server for training my AI & ML models. I considered AWS but it's very costly, so I am considering my own server.
Yes, but how do you get any LLM to run on all that? For example, Llama 3 requires a high VRAM count. Does this get around the per-card VRAM limit by being able to aggregate the VRAM, or is that not a thing?
Yes, it spans the VRAM of all the cards needed to fit the model.
You should get Hailo to sponsor a video with their 8 or 10H M.2 module.
Also, how many TOPS is this setup?
I'm open to free storage gear. Like VERY open lol.
They aren't storage. They are AI compute modules. 26 and 40 TOPS of compute at less than 5W.
I'm technically very open to all gear lol. I'll have more in the benchmark video on all the stats, but the 70b so far is looking good on tok/s at 17.7, and 98 for 8b Llama 3.1.
Can you do a video on the software setup?
Here you go ua-cam.com/video/TmNSDkjDTOs/v-deo.html
@@DigitalSpaceport Nice! Thank you.
Can you make a follow-up video on use cases?
Yes, I forgot to mention in this video that I'm splitting up the hardware, software setup/config, and benchmark videos. Use-case definition will be covered in the software videos.
What about running diffusion models? Can one use NVLink to increase unified VRAM to fit big models? Would it be possible to switch to 4090s for extra speed?
I'm not sure NVLink is needed now. I think with LLMs at least you can count on the layers being distributed automagically with something like ollama. Not sure about diffusers, but I will keep an eye on nvtop when I do that video.
What cheap motherboard do you recommend for 2 RTX 3090s? Regards
Inference only, or do you need the ability to run them at full PCIe x16 Gen 4 speeds simultaneously, like with training?
@@DigitalSpaceport Both options please.
Can you make a home server with the 120-core EPYC?
Sure. Why not?
Can you run the llama3.1:405b model on this?
Okay, I did get 405b to run on this. It was EXTREMELY slow however; I would class it as unusable. That was not unexpected, but only 44 of the 145 layers can load into VRAM on the GPUs, so I guess I would need roughly 12 GPUs of 24GB to run it at respectable speeds. Hit 0.75 tok/s at 2048 context, which ended up being around 6 minutes of generation time on easy logic.
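The back-of-the-envelope math behind that GPU count, as a rough sketch (weights only, so the real requirement lands higher once KV cache and per-GPU overhead are added):

```python
import math

# Rough VRAM estimate for a 405b model at q4. Weights only; KV cache and
# runtime overhead add more, so treat this as a lower bound.
params_b = 405                                 # billions of parameters
bytes_per_weight = 0.5                         # ~4 bits per weight at q4
weights_gb = params_b * bytes_per_weight       # ~202.5 GB of weights
gpus_for_weights = math.ceil(weights_gb / 24)  # 24 GB per RTX 3090

print(f"~{weights_gb:.1f} GB of weights -> at least {gpus_for_weights} x 24 GB GPUs")
# ~202.5 GB -> 9 GPUs for the weights alone; with KV cache and overhead,
# ~12 cards is the more realistic target mentioned above.
```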
@@DigitalSpaceport Thank you very much for testing. If you are just limited by VRAM, would it be feasible to run M40s instead? I have seen them go on eBay for €170.
I'd be surprised, as the Maxwell generation is pretty old now: CUDA compute capability 5.2 and PCIe 3.0 only. I'd not go with those cards, but there may be more recent ones I should check into.
Loving this content
Any specific reason for going with the XianXian GPU rack instead of the "AAAwave The Sluice V.2"?
Yes, the price is lower on the one I have included, and from what I can tell they all look like the exact same rack. So going cheap FTW.
When you will be paying roughly $2400 for a 32 GB 5090 and most likely $1200+ for a 16 GB 5080, I would expect the 4090s to be selling for at least $1400+. The 3090 will probably continue to be the best bet in town!
I'm likely selling my 4090s in anticipation of the 5090 launch. Going to camp out or whatever it takes to get one when they launch. The 3090s just do a great job, so they get to stay. A 3060 12GB is on the way currently lol. I *may* have a GPU problem.
Thanks for the review. I have an Asus Z10PE-D16 WS motherboard, 2x Xeon E5-2683 v3, 8x 16GB DDR4-2133P, 5x 3090, and many Corsair 1500i PSUs. Tried 70b q8 and q4 and 405b q2. They are extremely slow. What am I missing? And what is 4i SFF-8654? Ty
You checked with nvtop while running that they are hitting the GPU VRAM during operation? If it's running slow, that's the place to start.
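If nvtop isn't on the box, a quick way to script the same check is with the NVML Python bindings. This is just a hedged sketch that prints per-GPU memory use so you can confirm the model actually landed in VRAM:

```python
# Quick check that a model has actually landed in GPU VRAM (rough stand-in for nvtop).
# Assumes the NVML Python bindings are installed: pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB VRAM used")
pynvml.nvmlShutdown()
```

If the used numbers stay near zero while the model is generating, the runner is falling back to CPU/system RAM, which would explain the slowness.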
Perhaps the old CPUs do not have AVX-512, or even AVX2.
I don't understand something: with this setup, aren't you limited to just small LLMs? Mainly because only 2 RTX 3090s can sync together via NVLink, so you essentially have 2 pairs of cards with your four RTX cards.
Also, I'm wondering about a PCIe bottleneck.
Lastly, I would advise getting enough RAM to load an entire 300-billion-parameter LLM, which works out to about 1.2 TB.
Could you please discuss the limitations of this setup?
No, that's not a correct starting assumption, but it is one I started with as well. It's poorly discussed, but I'm working on talking about and sharing much more of what I'm learning about all this. You do not use NVLink for inference. The llama.cpp runner code automatically layers the model across the GPUs, so there is no need for NVLink outside of high-end training. To be clear, I'm using no NVLink. It can also layer the model into system RAM, but there is no need to run any large-parameter model off system RAM, as performance is abysmal. Even on the world's fastest CPU/RAM combo it is unacceptably slow. Think 1 tok/s at q4 for Llama 3.1 405b.
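For anyone curious what that layer offloading looks like outside of Ollama, here is a minimal sketch with the llama-cpp-python bindings (the GGUF path is just a placeholder): n_gpu_layers=-1 asks it to push every layer it can onto the GPUs, and whatever doesn't fit spills into system RAM, which is exactly where the speed falls off a cliff.

```python
# Minimal sketch of llama.cpp layer offloading via llama-cpp-python.
# The model path is a placeholder; point it at whatever GGUF you actually have.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.1-70b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,   # offload as many layers as possible to the GPUs
    n_ctx=4096,        # context window; bigger contexts need more VRAM for KV cache
)
out = llm("Q: Why is spilling layers to system RAM slow?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```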
@@DigitalSpaceport thanks
I made this video that shows this pretty well also ua-cam.com/video/-heFPHKy3jY/v-deo.html
Does it matter if I purchase a 7702 over a 7702P AMD EPYC CPU?
@@Grapheneolic The 7702P is fine for single-socket boards. The only difference is that it won't allow for a second processor.
@@DigitalSpaceport Thanks for the quick reply. So given I purchased a 7702, I could technically add a second processor if I wanted to?
@Grapheneolic If you have a motherboard with a second socket, yes.
Why don’t you use NVLink?
You won't get an improvement for inference workloads. I have tested on dual A5000s with NVLink. I may however use NVLink for doing training on these 3090s, but I'm not there yet.
Do you have to use an even number of GPUs (4)? Will it work with 3 GPUs?
Yeah, 3 will work. Just remember that the VRAM is additive, so you want the whole model to fit into the cumulative VRAM of the cards.
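A rough "will it fit" rule of thumb, counting weights only (KV cache and CUDA overhead eat a few extra GB per card, so leave headroom), sketched out:

```python
# Rough "does the model fit" check: estimated weight size vs. cumulative VRAM.
# Weights only; leave headroom for KV cache and runtime overhead on each card.
def fits(params_billions: float, bits_per_weight: int, cards: int, vram_gb: int = 24) -> bool:
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb <= cards * vram_gb

print(fits(70, 4, 3))    # 70b at q4: ~35 GB vs 72 GB  -> True
print(fits(70, 8, 3))    # 70b at q8: ~70 GB vs 72 GB  -> True, but very tight
print(fits(405, 4, 4))   # 405b at q4: ~203 GB vs 96 GB -> False
```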
@@DigitalSpaceport Thank you very much for your reply. I'm just getting into building a PC for LLMs and gathering information on which GPU I should use and how multiple GPUs can be beneficial.
Can this build be used for everyday computing, as well?
Sure. You could easily run the setup I did a year back here; it has worked very well. It is just Proxmox and an LXC, but there appears to be a way to get a Windows instance running in an LXC. I need to look more into that. ua-cam.com/video/IylJNfLi36E/v-deo.html
Would also be nice for rendering Blender scenes.
I'll add this to the benchmarks 👍
This is dope and extremely cost effective, but it's not future-proof. What happens if 2x 5090s make it possible to run a 1T-parameter Llama 4?
No, it's not future-proof at all, but I wanted to wait until we see the next NVIDIA GPUs before I decide on something bigger. I don't think we will see more than 24GB of VRAM in the 5090 currently, and while model splitting is a thing and does work... it's pretty slow.
How did you get 3090s so cheap?
In Europe they are 1500 a pop.
They had been used by a friend for Ethereum mining prior, in a harsh environment. The amount of dirt I had to clean off these was really a lot. The thermal pads had also been destroyed. All replaced now, but it was a lot of work.
It reminds me of those miners: deja vu, I have been in this place before 😂
Did I miss any pricing comparisons and info in the video?
I didn't do direct price comparisons. I would suggest you consider the H12SSL for a mobo, however. It's worth the extra, imo.
@@DigitalSpaceport What is the meaning of the word mobo?
Motherboard
Sweet
Why was no NVIDIA NVLink used?
For 3090s I'm not sure it does anything for inference tasks? Does it? I have a dual A5000 setup with NVLink and it does enable a larger non-sharded memory size, but I only know of that in the context of GIS. Also, just to be clear, I'm pretty new to running local AI and not trying to LARP as an expert. I'm here learning myself also.
Nice video. Work on that audio though. The voice-overs sound off.
I work and record in a harder audio environment than any other homelab YouTuber, and I hope you consider that as well. I already spent over an hour on the audio for this, and it's impossible to get clean audio without shutting down the rack machines. If I were in a studio like they are, I would for sure be embarrassed by this audio quality, but I'm 8 ft away from a mini datacenter. I do want to set your expectations ahead of time that this may be the audio quality I can achieve.
I wonder what the "really cool AI and other things" are? Outside of maybe home AI and some prompting, I can't really wrap my mind around hosting an LLM. Can anyone tell me the other applications?
Check the most recent video here for some examples of vision routing and real-time web search engine hosting. I didn't want to drag that video on longer, and I am building and learning in real time also (sharing along the way); there are more functional, use-case-based videos coming. I agree that part is lacking in this video, but it was only intended to showcase how to build the thing.
I used to build systems like these to mine Ethereum
Same rack, yup, with some modification to fit full-bandwidth risers. I'm going to work on a larger one next lol, need more GPUs ha
Is AI training the new Bitcoin mining?
*Maybe, if you have 1gb upload speeds
Rolled you to 1.8 k likes
I didn't know this video had that many likes 😳 Thanks!
It must be a great feeling if money is not an issue and you can just make whatever you want with it…
What car do you drive? I chose to put my money into homelab stuff instead of new cars, and somehow that means I must be rich? That is not accurate. I don't waste money on things I don't care about, like new cars. Everything you see in my entire lab cost less than a mid-range car.
@@DigitalSpaceport It was not meant to be an attack… we struggle to get food on the table… a long illness basically destroyed my life…
Why didn't you go for a tower cooler? There are some decent 3U/4U options that are not loud, and the performance is more than adequate. Please note that server motherboards rely on airflow over the VRMs for optimal operation; you could run the risk of hitting thermal limits and causing throttling or shutdown of the system.
Yes, I have an HDX Vornado-knockoff mini fan pointed at the mobo. It will be in the testing video. I do have tower coolers, but they are all in use in other systems currently. This Corsair 420 I had free, and I very well might be putting the 7995WX into this rack at some point for testing on the fastest platform available.
@@DigitalSpaceport if it's free, then all good. great video
You tapped 3 screws through the mobo?? 😮
LOL, oh god no. I mounted the board up and used a pencil to mark the 3 spots, then removed the board and tapped the 3 spots. I'm not that crazy!
@@DigitalSpaceport phewww lol 😂
But, will it mine (Bitcoin)?
That it cannot
There is absolutely no way you are going to see any kind of condensation unless the room is below freezing or you are using liquid nitrogen. Why even mention condensation???
The window AC spits out sub-32°F air, and one of many considered plans was to have the heatsinks right next to it. I opted not to, and everything is cooled well from a distance anyway. I do see condensation at times on the AC directional fins and need to wipe it off and keep an eye on it so I don't get mold growth.
$500 challenge
RIP your power bill. 😢
It's not that bad. When we bought the house we made sure to go outside the city-owned utility to a much cheaper co-op. Under $300/mo for the whole house in central Texas is very decent.
Will I be good at Fortnite finally?
That game is impossible. There is always a tween on a cell phone who is faster!
2x 4090 is better than 4x 3090 by all means.
Except for total VRAM amount, but I do agree also, as an owner of 2 4090s.
How to build an AI girlfriend?
Okay, just for you, I'm gonna try to make one. Wife might end me though 😆
Unlike gaming, AI and machine learning really do not benefit from x16 vs x8 lanes. That is because models are loaded once. Once the model or models are loaded into VRAM, the CPU has a minimal effect. Now, if you are pooling VRAM with NVLink, it is much faster than PCIe 3.0, 4.0, or even 5.0 by a long shot. Also, though I have U.2 access with both the Z590 Dark and Z690 Dark Kingpin, they pale in comparison to the speeds of native PCIe 3.0 and 4.0 NVMe.
I, too, have that same chassis from mining, but have always wondered how it would perform as an AI frame; I just haven't gotten around to tinkering. At 3:20 I stopped, because the experience gained from the last year of the Ethereum GPU mining boom up to now is sufficient, and for me, I doubt there is any real new value.
Oh, I'm doing training also, but yeah, full lanes are not needed for inference; I did mention that. Can I NVLink the 3090s? I've read recently that the return is minimal. I guess the channel isn't for you, no harm at all there lol.
Watching this video with $100 in my account :|