Writeup - digitalspaceport.com/homelab-ai-server-rig-tips-tricks-gotchas-and-takeaways/
This is very helpful! I buy most of my hardware from Facebook Marketplace and I often have to wait long spans between getting components, so knowing what to watch out for is very important.
Thanks a lot for this!
Great video! The most eye-opening takeaway: having two GPUs doesn’t mean double the speed.
Hands down the #1 question in the videos. Not with llama.cpp yet, but hopefully soon. Bigger models, and running models on separate GPUs at the same time, are the current reasons; running bigger models like Nemotron is a big quality step. Or use vLLM, which isn't as end-user friendly as Ollama/OWUI.
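If anyone wants to see what the vLLM route roughly looks like, here is a minimal sketch of splitting one model across two GPUs with tensor parallelism. The model name, GPU count, and memory fraction below are placeholder assumptions, not a recommendation for any specific build:

```python
# Minimal vLLM sketch: shard one model across two GPUs with tensor parallelism.
# Model name, tensor_parallel_size, and gpu_memory_utilization are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model, swap for yours
    tensor_parallel_size=2,                     # split weights across 2 GPUs
    gpu_memory_utilization=0.90,                # leave a little VRAM headroom
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Why does splitting a model across GPUs help?"], params)
print(outputs[0].outputs[0].text)
```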
dude, what is up with your camera, feels like I am drunk or on a boat :) another great video :)
You inspired me to experiment with my own AI server based on a 3090/4090. I made slightly different choices: ASRock WRX80D8-2T + Threadripper Pro 3945WX. As you mentioned, CPU clock speed matters, and I got a brand new motherboard + CPU for around 900 USD. I also want to try OCuLink ports (ASRock has 2 of them) instead of risers. There are 2 advantages: OCuLink offers flexible cabling and works with a separate power supply, so you are no longer dependent on a single expensive PSU. So far I see 2 problems: the Intel X710 10GbE ports cause some errors under Ubuntu 24.04, and the Noctua NH-U14S is too big to close a Lian Li O11 XL, so I have to turn to an open-air case. Can't wait to see your future projects.
On the Intel, if that's the fiber X710, do you have approved optics?
@@danielstrzelczyk4177 I've been wondering if OCuLink would find its way into these types of builds. Wasn't aware the ASRock mobo had 2 ports like that. Have to check that out.
Been waiting for that one and happy to write the first comment!
Legend!
I am doing a build that is about 60% aligned with yours. Total investment to date is $7200. My suggestion if you have a commercial use goal is to invest in the server grade parts.
Any advice on a potential local server for a small startup looking to support 50-100 concurrent users doing basic inference/embeddings with small-to-medium sized models, 13B for example?
Would a single RTX 3090 suffice for this?
This is my guess, so don't hold me to it. I would start with figuring out exactly which model or models you want to run concurrently. You would want to set the timeout on those to be pretty long, greater than 1 hour, to avoid something like people coming back and all warming it up at the same time. I think you would be better off with 3x 3060 12GB if that would support the models you intend to use. If you are looking for any flexibility, then starting with a good base system and adding 3090s as needed is the safest advice. If there is a big impact from undersizing, just go 3090s. Make sure to get a CPU with good fast single-thread speed. Adjust your batch size as needed, but the frequency of your users' interactions needs to be observed in NVTOP or other more LLM-specific performance monitoring tools.
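To illustrate the long-timeout idea, here is a rough sketch of keeping a model resident between requests via Ollama's keep_alive option. The host, port, and model tag below are assumed defaults, not the poster's actual setup:

```python
# Rough sketch: keep a model loaded in VRAM between requests by setting
# keep_alive on the Ollama generate API. Host, port, and model tag are assumptions.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",   # example model tag
        "prompt": "warm up",
        "keep_alive": "2h",       # stay loaded well past idle gaps between users
        "stream": False,
    },
    timeout=300,
)
print(resp.json().get("response", ""))
```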
I set up the motherboard and EPYC CPU just like you.
May I ask: if you could do it all over again, would you change any of the setup?
I'm wanting to get a 7F72 but they are expensive and I would need a pair. If I was scratch building I would likely have used an air cooler for the CPU also. Maybe the H12SSL-i would be the board I'd go with, since the MZ32-AR0 has gone up in price a good bit.
Nice video! What do you think of the ASRock ROMED8-2T motherboard?
I'd go with the H12SSL-i.
I didn't understand why you didn't mention any of the Radeon 7xxx cards, or ROCm.
You want CUDA for this.
It's preferable. AFAIK, Ollama isn't yet optimized to work with ROCm. It would've been interesting though, like "how far do you get with AMD?" AMD is so much more affordable per GB, especially when you look at used stuff. Maybe that's something for a future video, @DigitalSpaceport?
My comment vanished. Could you make a video on AMD GPUs? Some people say they aren't that bad for AI.
I see two comments here and do plan to test AMD and Intel soon.
@@DigitalSpaceport I have a bunch of A770 16GB cards along with ASRock H510 BTC Pro+ motherboards sitting around. I was thinking of trying to make a 12-card cluster connected by 10Gb network cards, with a 10900K for the CPU and the 3 systems linked to each other. Any problems you can think of that I am missing? 4 GPUs per motherboard with two 10Gb cards. The biggest problem I can think of would be the single 32GB RAM stick that the CPU is using.
Hello there. Regarding RAM speed, were you partially offloading the models in GGUF format? I am currently loading the EXL2 model completely into VRAM.
No, the model was fully loaded to VRAM. This video tested multiple facets of CPU impact fairly decently: ua-cam.com/video/qfqHAAjdTzk/v-deo.html
This is relevant to my interests 🤔
Any tips/experience using NVLink with dual 3090s?
It's not needed unless you are training, but I need to test on my A5000s that have NVLink so I'm not just being a parrot on that. I did try it out but messed something up IIRC and got frustrated. Will give it another shot soonish.
@DigitalSpaceport cool thanks! I'm putting together my new 2x3090 desktop/workstation and I grabbed the bridge so I'll be trying it out soon as well
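If you want a quick sanity check that the two cards can actually talk to each other once the bridge is in, here is a small sketch assuming PyTorch is installed. Note that peer access can also succeed over plain PCIe, so treat it as a hint, not proof the NVLink bridge is in use:

```python
# Quick check (assumes PyTorch and 2 visible CUDA GPUs): can device 0 and 1
# access each other's memory peer-to-peer? P2P can also work over PCIe,
# so this is a hint rather than proof of NVLink.
import torch

if torch.cuda.device_count() >= 2:
    p2p = torch.cuda.can_device_access_peer(0, 1)
    print(f"GPU0 <-> GPU1 peer access possible: {p2p}")
else:
    print("Fewer than two CUDA devices visible.")
```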
More than DDR4 vs DDR5 and MT/s, probably the interesting takeaway would be single vs dual vs quad vs 8-channel performance
...maybe even more so cache speed and quantity...
What are your thoughts?
For sure you want to watch this video! It's the most in-depth test on CPU impacts around, and I've got a pretty crazy 7995WX in it with all 8 channels filled. ua-cam.com/video/qfqHAAjdTzk/v-deo.html
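For a rough feel of why channel count matters, here is a back-of-the-envelope sketch using nominal DDR4-3200 figures; these are theoretical peaks, not measured results:

```python
# Back-of-the-envelope memory bandwidth: transfers/sec * 8 bytes per 64-bit channel.
# Nominal DDR4-3200 numbers, theoretical peaks rather than measurements.
mt_per_s = 3200e6        # DDR4-3200: 3200 mega-transfers per second
bytes_per_transfer = 8   # 64-bit channel width

for channels in (1, 2, 4, 8):
    gb_s = mt_per_s * bytes_per_transfer * channels / 1e9
    print(f"{channels} channel(s): ~{gb_s:.1f} GB/s theoretical peak")
# 1 -> ~25.6, 2 -> ~51.2, 4 -> ~102.4, 8 -> ~204.8 GB/s
```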
@@DigitalSpaceport I missed that. Thanks, watching rn.
Same thoughts... faster cache and higher amounts would be my bet, both on CPU and GPU.
If I'm not getting something wrong, the fastest GPUs running LLMs (both older and newer models) seem to be those with more cache, higher memory bandwidth, and bigger memory bus widths.
Of course TFLOPS do count, but to a lesser extent.
What are your thoughts on getting 4 to 8 4060 Ti cards with 16GB VRAM?
64GB VRAM is a very solid amount that will run vision models and Nemotron easily at Q4, and it's not a bad card at all for inference.
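For a rough sense of why 64GB is comfortable for a 70B-class model at Q4, here is a quick sizing sketch. The parameter count, bits per weight, and overhead figure are approximate assumptions, not measurements:

```python
# Approximate VRAM sizing for a 70B-parameter model at ~4-bit quantization.
# Parameter count, bits per weight, and overhead are rough assumptions.
params = 70e9
bits_per_weight = 4.5          # Q4_K-style quants average a bit over 4 bits
weights_gb = params * bits_per_weight / 8 / 1e9
overhead_gb = 8                # KV cache + activations, context dependent
print(f"~{weights_gb:.0f} GB weights + ~{overhead_gb} GB overhead "
      f"= ~{weights_gb + overhead_gb:.0f} GB total")  # ~47 GB, fits in 64 GB
```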
The most difficult decision is how much money to spend for a first buy. I’m kinda reluctant to get a 3090 config not knowing if I’ll be totally into local AI.
A 3060 12GB is a good starter then. If you want to go heavy on image/video generation, 24GB is desirable. Local AI is best left running 24/7 in a setup, however, to really get the benefits, with integrations abounding in so many homeserver apps now.
Maybe rent a VM with your target config for a little while before you start building?
My RTX A6000s idle at 23W so yeah, always on is expensive depending on your GPU config. I have 3x in each system, 2 systems in my lab.
Mmmmmm 48GB vram each. So nice!!!
@@DigitalSpaceport Yes, they're nice. I was looking for a trio of A100s over a year ago and couldn't find them, so instead, I bought 6 A6000s because at least I could find them.
If you think about it... I average 10-12W per 3090 24GB, so the 23W per A6000 48GB seems to scale. Maybe idle is tied to VRAM amount also?
@@DigitalSpaceport That could be, but usually power scales with the number of modules, not size. But then again, maybe you're right, because I looked at an 8x A100-SXM rig a while back; it idled each GPU between 48-50W and had 80GB per GPU.
@canoozie My 3060 12GB idles at 5-6W, hmm. Interesting. Also now I'm browsing eBay for A100s. SXM over PCIe, right? I'm prolly not this crazy.
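For anyone comparing idle draw across cards like this, here is a small sketch using NVML. It assumes the pynvml (nvidia-ml-py) package and NVIDIA drivers are installed:

```python
# Print current power draw and VRAM size per GPU via NVML.
# Assumes the pynvml (nvidia-ml-py) package and NVIDIA drivers are installed.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(h)
        watts = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0   # NVML reports milliwatts
        mem_gb = pynvml.nvmlDeviceGetMemoryInfo(h).total / 1e9
        print(f"GPU {i} {name}: {watts:.1f} W, {mem_gb:.0f} GB VRAM")
finally:
    pynvml.nvmlShutdown()
```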
Can you mix AMD and Nvidia GPUs together for inference?
Great question. Will test when I get an AMD card.
I want to suggest a slightly lower tier: 2080 Tis that have been modified with 22GB memory, running a 2x system.
What about CPU cache?
Doesn't seem to impact inference speed, interestingly, but it would need an engineering flamegraph to really profile it. Not a top factor for sure.