Looking at the price: 2x P2000 will cost you around $200 (+shipping) while a new RTX 3060 12GB will cost you $284 from Amazon (+shipping), so for around $84 more, why should someone buy the two P2000 cards? I'm pretty sure the RTX 3060 will beat out the dual P2000 setup.
I guess bc it’s 4 GB more of VRAM so you should be able to use slightly larger models. That being said, I think I’d go with the 4060 as well.
I think I mentioned that in the video, but yes, the 3060 12GB is an all-around better card vs 2x P2000. The script was already out the door when I pivoted to testing other cards, so that point likely got muddled. That's what happens when I write a script before testing.... but the M2000 will be the stand-in for the cheapest current rig I could figure out. It's for sure worth it to go with the $350 rig and 3060 12GB if someone can.
@@DigitalSpaceport damn, did you see Intel's new Battlemage GPU? It drops in stores in a couple of weeks. The Arc B580 has 12GB of VRAM at $250! It improves efficiency on that front, using tricks like transforming the vector engines from two slices into a single structure, supporting native SIMD16 instructions, and beefing up the capabilities of the Xe core's ray tracing and XMX AI instructions. I don't know where the previous A770 16GB graphics card stands, but it may get a price drop soon as a result. It's already $259 at Micro Center.
Yes, I wish it was a 16GB card, but I will prolly snag one to test. I hope they have fixed their idle wattage issues also; my A750 is a power eater!
@@DigitalSpaceport REALLY WANT to see you test 2x Arc B580 for 24GB of vram
Now that Intel Battlemage is out, I bet they will be more price competitive with dedicated AI cores.
Just watched the GN breakdown and looks like an interesting option and a good price point
10gb and 12gb VRAM is not worth it. Get a $100 M40 24gb gpu if you want cheap AI. They're slow but work fine. VRAM is king imho. Lots of stuff is made for 24gb VRAM.
I have been playing with Ollama on an AMD Ryzen 5900hx with 32gb of DDR4-3200 RAM and I ran the same models (with my RAM already over 65% taken using other stuff)
And got 8-9 tokens/s with minicpm-v:8b and have been happy with the 17-19 token/s I can get with llama3.2:3b
I was excited that you went from K2200 to M2000 to P2000. If you had stopped at the K2200 I would have been really disappointed.
The K2200 is disappointing, but I was surprised at the M2000. I meandered a bit in some explanations since this all popped up and evolved outside of my bullet points, but yeah, if I get curious I will track it down if I can. I feel like I chased performance pretty well on this one, but I still want to know the why behind the K2200 to M2000 differences. I need to learn more.
@@DigitalSpaceport The M2000 does have better FP32 performance and about 25% faster memory performance. There is also the CUDA compute capability 5.0 versus 5.2 difference. I haven't seen anything explaining what instruction-level differences there are between the two. It would be cool to really pin down all the causes of the performance difference though.
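For anyone chasing the same question, a minimal PyTorch sketch (assuming a CUDA-enabled PyTorch install) that prints what each card actually reports, which is handy when comparing a K2200 against an M2000:

```python
# Print the name, compute capability, and memory of every visible NVIDIA GPU.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, compute capability {props.major}.{props.minor}, "
          f"{props.total_memory / 2**30:.1f} GiB")
```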
There is a tool for benching models if you have their shape that I've yet to dig into, but it looks good for comparisons like this. It may be a real rabbit hole, but I'm interested in the raw perf numbers. Maybe of interest: github.com/stas00/ml-engineering/tree/master/compute/accelerator/benchmarks
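Not the linked benchmark suite, but a rough sketch of the same idea: time a large FP16 matmul and estimate throughput so two cards can be compared on raw GEMM performance (assumes PyTorch with CUDA; the numbers are only directional):

```python
# Rough FP16 matmul throughput estimate; larger n stresses the card harder.
import time
import torch

def fp16_tflops(n=4096, iters=50):
    a = torch.randn(n, n, dtype=torch.float16, device="cuda")
    b = torch.randn(n, n, dtype=torch.float16, device="cuda")
    for _ in range(5):                # warm up so clocks ramp
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    flops = 2 * n**3 * iters          # multiply-adds per matmul
    return flops / elapsed / 1e12     # TFLOPS

print(f"~{fp16_tflops():.1f} TFLOPS FP16 matmul")
```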
Great video! That was fun
It was a hard pivot mid video mentally for me to buy into, but rolling the dice worked. It came out decent. Thanks!
I tested minicpm-v:8b on a GTX 1070 at ~37 t/s, and on an RTX 3090 at ~92 t/s, using this prompt: "Help me study vocabulary: write a sentence for me to fill in the blank, and I'll try to pick the correct option." It used ~5.5GB of VRAM at default values. Tested with an image and the prompt "explain the meme" I got ~34 t/s (GTX 1070) and ~97 t/s (RTX 3090); the image was resized to 1344x1344.
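If anyone wants to reproduce numbers like these, here is a minimal sketch against a local Ollama instance (assumes the default port 11434 and the requests package; the prompt is just the one from the comment above) that derives t/s from the API's own timing fields:

```python
# Ask Ollama for a completion and compute tokens/s from eval_count / eval_duration.
import requests

def tokens_per_second(model, prompt, host="http://localhost:11434"):
    r = requests.post(f"{host}/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False},
                      timeout=600)
    r.raise_for_status()
    data = r.json()
    return data["eval_count"] / (data["eval_duration"] / 1e9)  # duration is in ns

print(f'{tokens_per_second("minicpm-v:8b", "Help me study vocabulary: write a sentence for me to fill in the blank."):.1f} t/s')
```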
Thanks for the videos. I am looking to build a home AI server for ~$1000 or less. Would love to see a video on what you could build for around that price range.
Good news, I'm working on that video already and I think it's a price that gets a very capable setup. Out in days.
@@DigitalSpaceport Looking forward to it. I am a Software developer by trade and have been working to learn more about the hardware side of things. Thanks again for the Videos. You have gained a subscriber.
I have two titan xp's languishing. They may have a new purpose now.
What about AMD gpus? Haven't they made progress for AI and cuda alternatives?
Yes, I have read they are doing better on the software front, but they still have some stability issues. I do plan to snag some AMD cards for testing when I can, I just don't have the money to buy one of everything really. It will happen.
20-80 watts? This means live 24/7 classification of persons on your Ring is not only technically feasible but also financially acceptable.
My 16GB 4060 TI clocks in around 31 tps on this model (single card used). I've seen these for around $400 USD, so price/performance ratio is on par, but overall system price is higher. And you get 16GB of VRAM, which is going to be the limiting factor with the cheaper cards even if the performance is OK for you.
Hey can you see if your 4060ti's can fit the new llama 3.3 in and at what context? It is a great model, excited for you to try it.
@@DigitalSpaceport Just started playing with it - default settings I'm getting about 6 tps. I'll try and up the context, but for some reason I'm getting flaky malfunctions with multiple models lately when playing with the settings. I hope that settles down with some updates. Also my models never unload, which is minor-level annoying. (Yes, I think I have the flags set correctly...)
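A minimal sketch for the context and unloading bits, assuming stock Ollama behavior: num_ctx raises the context window per request, and keep_alive controls how long the model stays loaded (0 unloads it right after the reply). The model name is just an example:

```python
# One request with a larger context window that unloads the model afterwards.
import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.3",
    "prompt": "Summarize the plot of Hamlet in two sentences.",
    "stream": False,
    "options": {"num_ctx": 16384},  # raise the context window for this call
    "keep_alive": 0,                # unload the model immediately after responding
})
print(resp.json()["response"])
```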
this guy is built different
I would be interested to see how a Tesla P4 or two does, especially as they are around $100, when compared to a 3060.
There is the 3DFX card?
Hi! I really enjoyed your video. I'm trying to do some experimental work (research) with local AI models (I'm a teacher). What is your opinion on using Xeon processors (like the ones sold on AliExpress) plus a graphics card like the ones you presented? Is the Xeon processor necessary, or can I choose any other processor (like a Ryzen plus an NVIDIA card)? Greetings from Mexico.
My dual P40 + A2000 use 550w at idle lol Keeps me warm
Its "Free" heat if a workloads running 😉
19:53 - so would you recommend 3060's over 1080Ti's, or what kind of price would make 11GB Pascals an interesting value?
Stay away from Pascal; most models use FP16, and 90% of Pascal's compute power is in FP32 rather than FP16.
I do like the 3060's 12GB of VRAM. That extra 1GB really does matter. I'd sell the 1080 Ti while you can and move on up.
@@DigitalSpaceport Those that can (i.e. have another GPU for the desktop) get a modest benefit by setting the NVIDIA GPU to TCC mode instead of WDDM mode. You get to use 95+% of the VRAM for compute instead of 80+%, because of OS-reserved memory. It can be the difference between 16k context and 32k, or a Q4 and a Q5 quant.
Hey now, that's news to me 😀 I'm looking into this ASAP, thx for sharing!
@@DigitalSpaceport Once you set that GPU to TCC mode, it can't display an image until you set it back to WDDM (a reboot resets it to WDDM unless you make the change persistent).
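For reference, a minimal sketch of the switch itself (Windows only, needs an admin shell; GPU index 1 is an assumption, check nvidia-smi -L first, and double-check the -dm flag against nvidia-smi -h on your driver version):

```python
# Flip a secondary NVIDIA GPU between the TCC and WDDM driver models.
import subprocess

def set_driver_model(gpu_index: int, mode: str) -> None:
    # mode should be "TCC" or "WDDM"; the change may require a reboot to apply.
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-dm", mode], check=True)

set_driver_model(1, "TCC")
```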
Great stuff. Tnx
Stoked I found your channel! I'm considering using Exo to distribute an LLM across my family's fleet of gaming PCs, however I'm not sure about the overall power draw. Thoughts?
I am having a good time with Ryzen mini PCs. The 5th and 6th gen ones are CHEAP, you can add an M.2-to-PCIe adapter for an eGPU, and you can max out the RAM allocated to the iGPU in the BIOS.
Does the m.2 to pcie adapter need an external power supply? I might buy one here for my unraid nas. It could use a proper cuda card.
Have a ryzen 7 (5800H) apu with 64GB of RAM (48 dedicated to the GPU) and it works surprisingly well.
Recently bought a HP Victus E16 motherboard (only) with the same APU plus a 3060 on the board (really it's 1/2 a 3060 - has 6GB of VRAM) that I have just gotten powered up and am hoping will be interesting - or at least cost effective for a £140 outlay (as i already have the RAM, SSD etc)
Could I use a 4x x4 bifurcated pcie slot adapter and squeeze 5 gpus in the pc?
Sorry, a really basic question from me; puns unintended. What are you using to collect reliable stats on power consumption (watts)? We have Threadrippers and we're considering a couple of 4090s, but one question relates to having good metrics on power usage at idle and at peak. Then we can begin to track and compare power costs. What have you found that works? Thanks in advance. Sunny
In the videos I'm peeking at a Kill A Watt. If you're gathering metrics you can use NVIDIA tooling to push that out to InfluxDB. I forget the name of it but it's fairly searchable. That would be worth checking around GitHub for.
For most people, I believe if you're just trying to track GPU wattage, you can create a script or job to poll nvidia-smi, and set the power limits of your RTX 3090s down to an acceptable wattage with some performance loss until you hit an efficient rate. Something like nvidia-smi -pm 1 and nvidia-smi -pl 250.
I set my RTX 3060s to a 100W max for all 4 cards. It's a decrease from their usual spikes of around 145W during inference, with around 10% speed loss, but 45W of peak inference savings, and they never got past 70C.
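If you want something more than spot-checking nvidia-smi by hand, here is a minimal sketch that polls it and appends to a CSV you can graph later (the query fields are standard nvidia-smi ones; the output path and interval are just examples):

```python
# Log per-GPU power draw, utilization, and temperature every few seconds.
import csv
import subprocess
import time

def sample():
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,power.draw,utilization.gpu,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return [line.split(", ") for line in out.strip().splitlines()]

with open("gpu_power_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        now = time.time()
        for idx, watts, util, temp in sample():
            writer.writerow([now, idx, watts, util, temp])
        f.flush()
        time.sleep(5)
```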
Does the Dell 7050 have power connectors to support a 3060? Also, what would the difference be in power consumption? Just curious, thanks!
No, unfortunately the 7050 doesn't. The idle wattages are nearly identical, however; the peak during use is higher on a 3060, but the work is done faster. I've seen the 3060 in the 3620 peak at 130 watts while the 7050 only hit near 100 watts.
I'm running the Ollama UI on Proxmox with a 1070 - it's not bad. The 1070s are in the low $100 USD range. But you will probably do much better with a 3060 12GB / 4060 Ti 16GB.
If anybody is wondering - the 1070 runs at 36 tokens per second. The wattage pulled while idle = 36W (Intel 13500).
Oh yeah, I did test a 1070 Ti out in an older video, which unfortunately had bad audio. It's a card a lot of people have sitting around that can still perform really decently in a power-pin-capable setup. ua-cam.com/video/Aocrvfo5N_s/v-deo.htmlsi=YhmtIDi5C0JGyRL9&t=569
How hard is it to get Invoke AI to use dual GPUs? Could you use an RTX 4060 8GB and an RTX 3060 12GB to get 20GB of VRAM, or would it be better to use two 4060s?
It'd be cool to look at a K5200 8GB card. I'm seeing those used at like $70
I feel like Kepler, especially after this video, is a bridge too far on the performance side at this point. It's also at the bottom of the supported list for llama.cpp/Ollama, so I can't see it hanging on much longer on the software support side.
P102-100 10GB mining cards, I think you can still get for sub-$45? Two of these together can probably push an IQ3 QwQ 32B with a decent amount of context in llama.cpp, and might be around $90-140 total for the GPUs.
Plus basically any host, since I believe they run PCIe 3.0 at 4 lanes each. They hit a decent inference level, being Pascal, around GTX 1080 inference speeds.
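A hedged sketch of what that two-card split could look like with llama-cpp-python (the GGUF filename is a placeholder, and the even tensor split and 8k context are guesses to tune against actual VRAM headroom):

```python
# Split a quantized 32B model across two 10GB cards and run one prompt.
from llama_cpp import Llama

llm = Llama(
    model_path="QwQ-32B-IQ3_XS.gguf",  # placeholder path to a local GGUF file
    n_gpu_layers=-1,                   # offload all layers to the GPUs
    tensor_split=[0.5, 0.5],           # share layers evenly across both cards
    n_ctx=8192,                        # "a decent amount of context"
)
out = llm("Explain PCIe lane bifurcation in one paragraph.", max_tokens=256)
print(out["choices"][0]["text"])
```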
A really basic question - can I mix and match an Intel CPU with NVIDIA GPUs, or an AMD CPU with the new Intel GPUs?
If I only need a language model when I'm using my gaming/main PC, is there a point in having a dedicated LLM server? Is VRAM the end all be all? I have a 10GB 3080.
What about AMD GPUs and APUs?
Can I make a video request?
Try the Tesla P4 GPUs.
Okay I have one of those here. Gotta toss a fan on it but good call.
@DigitalSpaceport I 3d printed the fan housing for mine
I had some printed but failed to find a good and cheap fan option for them. Did you happen to get fans that are not coil whine prone?
@@DigitalSpaceport My fan is very loud and noisy but it does not bother me as it's in a room that is not occupied.
Love this video script
What cards are most efficient in terms of tokens per watt in your experience?
I think base idle has to be considered also, so Intel's cards are out on that alone. The 3000 series and 4000 series all have great idles that scale, oddly, with the amount of VRAM, it's looking like in my analysis. I strongly recommend a 24GB card if a person can afford it, as the experience is unmatched, and specifically a 3090 unless you want image generation at max speeds. Inference is close to the same as the 4090. That said, the 3060 12GB is very fast, and I recommend avoiding all 8GB cards unless you already have them. The 16GB 4060 is likely to be a strong contender as well.
IMO Apple Silicon Macs are the best for power efficiency. Not the best for capital cost or outright speed though.
Guys, what can I use the AI for if I run it locally? I don't see any use case?
Can you mix and match different cards?
You can, to gain VRAM for model storage, but your performance is always that of the slowest single card. So if you mixed a K2200 and a P2000, the t/s would be that of the K2200.
@DigitalSpaceport Thx. I have a 1080 Ti and two 1030s, so it would be better to ignore the two small ones and just use the 1080 Ti?
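One hedged way to do exactly that without pulling the cards: hide the small ones with CUDA_VISIBLE_DEVICES before starting the runtime (device index 0 for the 1080 Ti is an assumption, check nvidia-smi -L):

```python
# Launch Ollama with only the 1080 Ti visible to CUDA.
import os
import subprocess

env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")  # expose only the first GPU
subprocess.run(["ollama", "serve"], env=env)
```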
What's a proxmox server?
A hypervisor server - for running virtual machines, as opposed to a desktop OS running directly on the physical machine. Proxmox is good for sharing one machine among many tasks.
Could you share that cat meme? 😅
Breaking software or fiber drops?
@@DigitalSpaceport Breaking software please 😂
ua-cam.com/users/postUgkx_9IiU9QQk6J0EHlQ9yOmz4FO0da1Zv1-?si=eu8llgtJqqiIMyTg 🙀
@@DigitalSpaceport thanks for that. I really appreciate it ❤️
Now try TPUs on a 4x4 carrier card
Thank you for this. I've been looking for ideas for a viable $200 - $400 ultra budget rig to get my feet wet. This is right in that range. lol
Thanks again for the "Poof" of concept.....
I'm running mine with an RTX 4060 8GB
I need to get a 16GB one of those in the mix for testing!
This video is pretty pointless because 8GB of VRAM is nothing at all when it comes to running AI. Sure, if you build your PC from outdated and nearly unusable parts, you can make it cheap.
What I'd like to see is a video showing how to cheaply make a PC using 2x M10 or 2x M40 Tesla GPUs.
Small models are pretty good now; however, P40s would be a safer longevity bet as they are supported on CUDA 12.
There are some uses for an 8GB Pascal GPU if you've got one: smaller 7-8B models that can still hit 20+ t/s, small fine-tunes, roleplay, vision support models, SDXL generators.
So cheap and low power but fucking slow :( I have 4060 and it is fucking fast
4 x Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz (1 socket)
RAM usage 71.93% (11.22 GiB of 15.60 GiB) DDR3
proxmox-ve: 8.3.0 (running kernel: 6.8.12-4-pve)
NVIDIA GeForce GTX 960, PCIe Gen 1 @ 16x, 4 GiB
Prompt: "write python code to access this LLM" - response tokens/s: 24.43
Prompt: "create the snake game to run in python" - response tokens/s: 21.38
This is way faster than the P2000, with just one GTX 960 card