In this video, I failed to mention that all the models shown are quantized Q4 models, not full-size models. Q4 models are smaller and easier to load on computers with limited resources, which is why I used them: to show what most people can run on their own machines. However, I should have mentioned that they are not full-size models. If you have enough hardware resources, you can download larger Q8 and FP16 models from Ollama's website. Also, I didn't cover running local LLMs in RAM instead of VRAM in detail, because this video focuses mainly on GPUs and VRAM. I might make another video explaining running them in RAM in more detail.
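For anyone curious how quantization maps to the download sizes you see on Ollama, here is a minimal back-of-envelope sketch (my own assumption: size ≈ parameter count × bits per weight, with a little extra for quantization scales; real GGUF files will differ by a gigabyte or two):

```python
# Rough size estimate for a dense model at different quantization levels.
# Assumed effective bits per weight (the +0.5 approximates Q4/Q8 scale overhead).
BITS_PER_WEIGHT = {"q4": 4.5, "q8": 8.5, "fp16": 16.0}

def model_size_gb(params_billions: float, quant: str) -> float:
    """Approximate on-disk / in-memory size in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in ("q4", "q8", "fp16"):
    sizes = ", ".join(f"{p}B≈{model_size_gb(p, quant):.0f}GB" for p in (7, 14, 32, 70))
    print(f"{quant}: {sizes}")
```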
14b can fit fine on a 2080 Ti that's only got 11 GB of vram. 1.5B is a 2GB model - you don't need 8 gigs of ram for it.
Your specs all seem way higher than actually needed.
love how ppl went from having "no chance of owning intelligent robot" to "4 words per second is too slow"
It's not an intelligent robot.
@@MrViki60 It knows more than you and can do more stuff than you... who are you then?
more like 20 years but yeah
@ go to church lil bro and stop bothering me.
If you’re doing work with an AI at 4 words per second, you’re going to get fired soon. Just download the app & run it in the cloud!
You do not need VRAM to run those models. It's all about memory bandwidth. VRAM is usually around 1000 GB/s, but you can get about 500 GB/s from system RAM on better motherboards supporting 8-channel memory, or even 12 or 16 channels. You can run the 671B model on such a machine at 5 T/s, and it will be much cheaper than using GPUs.
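As a rough illustration of that point, decode speed is close to memory-bandwidth bound: every generated token has to stream the active weights through memory once. A sketch under that assumption (the ~37B active-parameter figure is DeepSeek's published number for R1/V3; real throughput lands below these ceilings because of KV-cache reads and compute):

```python
# Upper-bound tokens/s if generation were purely limited by streaming the
# active weights once per token: tokens/s ≈ bandwidth / active_weight_bytes.

def tps_ceiling(bandwidth_gb_s: float, active_params_b: float, bytes_per_weight: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token

print(tps_ceiling(500, 37, 0.5))   # R1 671B (MoE, ~37B active) @ Q4 on ~500 GB/s 8-channel RAM -> ~27
print(tps_ceiling(100, 37, 0.5))   # same model on ~100 GB/s dual-channel desktop RAM           -> ~5
print(tps_ceiling(1000, 14, 0.5))  # dense 14B @ Q4 on a ~1000 GB/s GPU                          -> ~143
```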
Some systems have CPU limitations for RAM, only 64 GB max.
Can you recommend workstation CPUs with 8-channel memory? Even if I use an old one with 96 GB of DDR3.
Or do I need to get a newer-generation workstation? I can set aside a budget for a test. For the CPU, Intel Xeon or AMD? Xeons are cheap for not having the I/O and integrated graphics unit, but if that isn't needed we can get an 8-core CPU for just 20 dollars.
Or maybe I need DDR5 at 5000 MT/s or above, along with 8 channels.
Make a video about this and explain it to us please. And show us everything!
@@Mehtab20mehtab It still has to have the model in memory. You can use SSDs, so I have heard, but it wouldn't be very fast at all.
Not sure but I think MacBook Pro M4 Max's will be enough for the larger ones. Not entry-level hardware though.
Concise, easy to understand. Thanks mate.
I run the 70B model in 128GB of RAM with a 3rd gen Ryzen (R9 5950). I get around 1 token/s, which is slow, but the model is very good at reasoning and providing detailed answers.
VRAM is the secret sauce, not DRAM.
@examplerkey Correct, but I can never afford more than 16GB of VRAM
@@xhobv02 So you’re not going to get a 5090 with 32GB of GDDR7 for the $5k street price? Where’s your commitment to AI? 😂😂😂
It's pretty fast on a pair of 3090's, but the 32b model is pretty much just as smart anyway for most stuff. The 671b model is much smarter.
Nice! Short and crisp 👍
Best Nvidia could do was give us 5 WHOLE gigabytes in 8 years (GTX 1080Ti - RTX 5080). Blessed be thy leather jacket!
One day, when the 5090 comes back in stock, you'll be able to get 32GB without paying obscene amounts of money for 80GB of VRAM. At least ollama is pretty good at splitting a model across multiple GPU's--I ran DeepSeek-r1:70b on a pair of 3090's I rented and it was pretty fast.
Yeah... about 500GB of RAM for the 671b Q4... but the full model is 1.6+ TB
Thank you for pointing that out. I did not mention that these are Q4 models
@@BlueSpork Ahh, no wonder it didn't make sense to me.
The ROG Ally Z1E (white) works really well on 14b as long as I set the VRAM to auto and set 8 cores (via LM Studio) to allocate solely to RAM. So it should be possible to go to 20b on an Ally X or any 32GB-based handheld PC.
Nice to have the specs. I'm tempted to try the 671b model on a server with 8x A6000 that I can rent for a few bucks an hour. That would be 384GB of VRAM, which is almost enough to run efficiently with 4-bit quantization. I can run DeepSeek-r1:14b at 11.93 tokens/s on a laptop with a Quadro P5000 video card, so it's nice to know a 3060 is 2.2x as fast. The 32b model was running at 1.73 tokens/s, but this is largely a CPU measurement. I'm tempted to upgrade to an AMD or 3090 or 5060 Ti or something.
I rented a server with 2x 3090's from Vast AI when DeepSeek-R1 first came out and tried the 70b model. It ran quite well with ollama, utilizing both GPU's at 250-300 watts. I didn't see a large difference in intelligence between the 70b and 32b models...though I wish there were a DeepSeek coder model with the R1-style thinking/fine-tuning.
A nice way to use DeepSeek R1 is Deep Infra, who are offering the 671B and 70B models for dirt cheap. The 70B distilled model actually works better for me and is 23 cents/69 cents per Mtoken input/output.
But the news media keeps harping on DeepSeek’s $0.14/Mtoken input! Don’t tell me the news media is embellishing.
The 7B model runs great on my M1 Pro chip. Perhaps it's harnessing the ML cores on top of the CPU? The new 50 series cards boast a LOT of ML cores, so they should be able to outperform the 3090s significantly?
1. Will it run on (and use the AI accelerator of) an AMD HX 370?
2. What software will it run on in this setup?
3. Are distilled models "reasoning models"?
4. Can you continue training of the distilled models?
The question with the AMD npu is very interesting
I ran the 70B on my Mac Studio with 192GB of RAM, and it provided answers very quickly.
2:45 - On MI25 at 220W I get:
total duration: 40.378089366s
load duration: 30.090414ms
prompt eval count: 11 token(s)
prompt eval duration: 69ms
prompt eval rate: 159.42 tokens/s
eval count: 765 token(s)
eval duration: 40.277s
eval rate: 18.99 tokens/s
I wonder if it's core speeds that affect this; the MI25 has HBM2 memory, which doesn't seem to play a dominant role in this case.
Note: your input was slightly inconsistent, at least once you had "larger" instead of "largest"
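(For reference, the rates ollama prints are just token count divided by duration; a quick check of the figures above:)

```python
# Verifying the reported rates from the --verbose output above.
print(765 / 40.277)  # eval rate        -> ~18.99 tokens/s
print(11 / 0.069)    # prompt eval rate -> ~159.4 tokens/s
```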
I ran the 14b model on an RTX 4050 Laptop GPU with 6GB VRAM and a Ryzen 5 8645HS with a single 16GB RAM stick running at 5600MT/s. Getting an answer from it took 5-10 minutes. 😂😂😂
Same model run on a GTX 1660 Ti; a response takes approximately 1.5 minutes. I see this:
The model uses almost no RAM, barely 0.5GB.
The model uses all of the GPU's VRAM (6GB).
The model only uses 100% of the GPU at the start; then usage drops to 20%.
The model uses all the CPU cores at 50%, except one that saturates at 100% (i5-11400).
That model is supposed to use more VRAM, but it seems to run fine with this configuration. I recommend checking the NVIDIA drivers to see whether it is really using GPU acceleration, or if there is an overheating problem.
The media hype around DeepSeek running locally is misplaced! Running DeepSeek at 5 tokens per second is ridiculous, but it helps NVDA get more business!
Thanks this is very helpful
Glad it helped!
deepseek-r1 14b runs smoothly in 16GB VRAM + 32GB 3200MT/s RAM (it fits in VRAM alone), but 32b is molasses slow, not worth it.
Could you also explore the quantization?
All of these models are Q4 (quantization Q4_K_M)
@@BlueSpork😂
I've been running all models up to the 32b model on a 12-year-old machine, just on the CPU: an AMD FX-8350 (8 cores) and 24GB of DDR3 RAM. I'm getting 6-8 tokens per second on the 1.5b, 2.5 tokens on the 7b, and 1 token per second on the 14b, and the 32b is very slow. However, the answers on anything smaller than the 14b are poor quality, so for this to be effective in a real-world setting I will need a better machine.
You are the best, and thank you!
You are getting better performance than I am.
14b is giving me an eval rate of 25.57 tokens/s on a 4060 Ti 16GB. Could it be related to me running a Docker container with ollama through WSL? Wondering how to speed up my setup.
why don't you run ollama directly through terminal to compare?
Running Ollama in a Docker container through WSL can introduce some performance overhead compared to running it natively on Windows
Thank you!
Can you benchmark how each model performs? There must be a sweet spot for performance compared to requirements.
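Not a full benchmark, but a minimal sketch of how you could collect those numbers yourself against a local ollama instance (it assumes ollama is serving on its default port 11434 and that the listed model tags are already pulled; adjust both to your setup):

```python
# Time each model's generation rate via ollama's REST API; the non-streaming
# response includes eval_count and eval_duration (nanoseconds), the same
# numbers `ollama run --verbose` prints.
import requests

MODELS = ["deepseek-r1:1.5b", "deepseek-r1:7b", "deepseek-r1:14b"]  # example tags
PROMPT = "Explain the difference between RAM and VRAM in two sentences."

for model in MODELS:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=600,
    ).json()
    tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"{model}: {tps:.2f} tokens/s")
```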
You are not talking about running full-size FP8 models, by the way; you are talking about running ollama's 4-bit quantized models, which are around half the size of the true models.
For 671B parameters, you would need around 700 GB of RAM.
You are right. Thank you for pointing that out. I failed to mention that in the video, so now I've made a comment about it and pinned it to the top. Thanks!
Even fp8 isn't "full size". And during training, it's likely even more than fp16.
Can I run 10 GPUs with 8GB VRAM in parallel and run the 70b model?
Technically, but it would be rough. LLM performance is mostly limited by memory speed and in a multi-GPU setup you get more capacity, but speed will be limited by the speed of a single GPU.
That is to say that two RTX 3090's will perform about the same as an RTX A6000, which is the same chip with twice as much VRAM. The RTX 3090's are still the cheaper option, but the power draw will be twice that of a single GPU.
GPU's with smaller amounts of memory typically have slower memory, so three 8 GB 3060's will deliver much worse performance than a single 24 GB 3090.
I wish we could go back to the days when board partners could release models with twice the memory of the OEM version of a GPU.
It would be interesting to test this. I ran the 70b model on a pair of 3090's and it was reasonably fast, both GPU's were taking 250-300w of power, but I don't know if this is better or worse than a single A6000. For the 671b model, it's using a mixture of experts system which should be much more efficient than a large model like the Llama 405b because the GPU's don't need to communicate as much. Presumably this is because DeepSeek was using H800 GPU's instead of H100's...the Chinese variants have less inter-GPU communication and less 64-bit floating point arithmetic, but they both have 80GB of VRAM and for FP4 & FP8 calculations they're both fine. I've used Mixtral a few months back, and it was faster than other models with the same number of parameters, but I'm not sure if this was caused by inter-GPU communication. I think the computer I rented had 4x 4090's when I tested Mixtral 8x22b.
@nathanbanks2354 I'm pretty sure MoE models are faster even when run on one GPU. Because only a subset of parameters are active at any given time, the models will run like a smaller model despite needing more VRAM than an actually smaller model.
As for the inter GPU connectivity, I don't think that's nearly as important for inferencing versus training. I saw a video a while back where someone distributed inferencing across multiple machines, including a custom build and a Mac and I don't recall it showing significant impact to the performance.
As I understand it, and please correct me if I'm wrong, the high memory bandwidth required for LLM inferencing only applies within processing a layer of the model. So as long as you distribute whole layers to each available GPU the traffic between GPUs is quite minimal.
Of course, distributing layers means that smaller gpus are even more wasteful.
For example, let's say we have a 40 GB model made up of eight 5 GB layers.
You would need eight 8 GB GPU's for good inferencing performance and likely a 9th GPU if you want decent context. That's a total of 64 to 72 GB of VRAM.
Compare that to a 48 GB GPU, where you can load all layers onto one GPU and still have 8 GB leftover for context.
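A tiny sketch of that packing arithmetic, using the hypothetical 40 GB / eight-layer model from this comment (illustrative only; it ignores the possible extra card for context and that real layers vary in size):

```python
import math

def packing_summary(layer_gb: float, n_layers: int, gpu_gb: float):
    """Whole-layer packing: how many cards you need and how much VRAM sits idle."""
    layers_per_gpu = int(gpu_gb // layer_gb)       # only whole layers fit per card
    gpus = math.ceil(n_layers / layers_per_gpu)
    bought = gpus * gpu_gb
    used = n_layers * layer_gb
    return gpus, bought, bought - used

# Hypothetical 40 GB model split into eight 5 GB layers (from the comment above):
print(packing_summary(5, 8, 8))    # (8, 64, 24): eight 8 GB cards, 24 GB stranded in 3 GB slivers
print(packing_summary(5, 8, 48))   # (1, 48, 8):  one 48 GB card, a usable 8 GB left for context
```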
What an excellent video!
Are the Intel Arc B570 10GB and B580 12GB any good for this?
If I have a laptop with 64 GB RAM, what model of DeepSeekR1 can run on my machine. Also, is the 8 GB VRAM at 1:32 min purchased separately and connected to a laptop via USB cord?
VRAM is in the graphics card; it can be added only if you can add a graphics card.
Clear and to the point
So… it would be possible to run R1 with 4x128 GB of RAM. I wonder how slow…
Someone tried it and it was extremely slow.
Look up Digital Spaceport for his test.
For an idea of the speed at 8x80GB, you can see ua-cam.com/video/bOsvI3HYHgI/v-deo.html
These servers cost over $20/hour unless you're youtube famous.
This is wrong; the 671B model is a mixture-of-experts model, so VRAM is only needed for the active experts. The inactive experts can be offloaded into RAM. Usually you only need 4 to 8 active experts.
So for 8 active experts and the 4-bit quantized version, that's around 64 GB of VRAM and 322 GB of RAM.
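For a sense of where such a split comes from, here is a rough sketch using DeepSeek's published figure of ~37B active parameters out of 671B; the exact VRAM/RAM numbers depend on how many experts you keep resident and on KV-cache size, which is why the figures above are larger than this bare minimum:

```python
# Minimum hot/cold split for an MoE model: only the weights active for the
# current token need to be in fast memory; the rest can sit in system RAM.
TOTAL_PARAMS_B = 671    # DeepSeek R1 total parameters (billions)
ACTIVE_PARAMS_B = 37    # parameters active per token (billions)
BYTES_PER_WEIGHT = 0.5  # ~4-bit quantization

active_gb = ACTIVE_PARAMS_B * BYTES_PER_WEIGHT                    # ~18.5 GB that must be fast
cold_gb = (TOTAL_PARAMS_B - ACTIVE_PARAMS_B) * BYTES_PER_WEIGHT   # ~317 GB that can be offloaded
print(f"~{active_gb:.0f} GB hot + ~{cold_gb:.0f} GB offloadable")
```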
I haven't heard of anyone doing that yet. There are some discussions and papers about methods, but I am not sure this is an actual thing being done right now, tbh. It would be a huuuuuuge achievement if you could run DeepSeek V3/R1 on one or two consumer GPUs at home.
@@danielhenderson7050 And there are also people running it on EXO with stacked computers and high network bandwidth. With EXO you can run R1 671B with 4 different high-end computers (each computer being an AMD Threadripper + an RTX 3090 24GB + 128GB of RAM).
I'm not sure how much this would speed things up because loading/unloading the correct expert for any given question is pretty hard. It's designed to avoid GPU to GPU communication, where an MoE model will use only 2 of the 8 GPU's. However maybe if you ask the same type of questions over and over it could keep the most commonly used weights cached on the GPU and the rest in RAM...I'm not familiar enough with how the weights are divided. I remember Mixtral 8x22b would typically activate two "experts" for one answer.
Can this run on AMD gpus?
14b runs fast with my 4070. Not very accurate compared to online DeepSeek. Disappointing so far. Maybe better prompts will help although I used same prompts online and got very good results.
DeepSeek-R1-Distill models are fine-tuned from open-source models, using samples generated by DeepSeek-R1. They will never perform as well as the online version.
so the models must run 100% on gpu for faster results?
If a local LLM runs on a computer without dedicated VRAM, it relies entirely on the CPU for computations instead of a GPU. The model loads into system RAM, but since RAM is slower than VRAM and CPUs are less optimized for the parallel processing required by LLMs, performance is significantly reduced.
@@BlueSpork thanks for the speedy reply. Assuming that speed is not taken into account, is the precision/accuracy affected if i were to run larger models like 32b on below required hardware?
@@justannpc1866 If a model runs on hardware below its recommended specs, precision/accuracy remains the same
I'm using a Ryzen 5600, 16GB DDR4 3000, and a 3GB GTX 1060, and it's decent running 8b (at 10 tk/s). But it went BSOD (black) and rebooted due to VRAM use after 30 mins lol.
Can I run it on a Radeon? A 6900 XT, for example.
Yes
@@rajeebbhoumik4093 How? For me it's saying 100% CPU for the processor.
PyTorch supports ROCm, as does ollama. However older NVIDIA GPU's work better than older AMD GPU's. I've done AI stuff on an MI25 card. If it's not working for you, it could be a driver issue or an old card...I've never tried to run ollama in Windows or MacOS.
what about distilled models ?
They are all distilled models, except 671B
My 4070 runs 14b smoothly though?
14b runs great with my 4070. Not very accurate results though compared to online DS. Disappointing so far. I used the same prompts online and offline.
Yeah, it runs smoothly on my 3060 12GB too, but if I run something else that requires VRAM at the same time … something will slow down. This is why I recommended 16GB to leave some room for other programs
@@BlueSpork I don't plan to run anything else that requires VRAM while running DeepSeek. Thanks for reminder. Not going to buy a $1000+ GPU anytime soon. Maybe next year.
Great video!
Do you have an estimate of how many tokens per second a 24gb VRAM gpu will generate?
Thanks! Do you mean for the 32b model? I’m not sure. Maybe someone with a 24GB GPU will see your comment and answer
With 32b model, while running at RTX 4090 24GB, I get around 34 tokens per second
I get 7-10 tokens/s or so for the eval rate on a 4090
Thanks, I'm not going to try to run it locally. Instead I would prefer the online version.
I understand, we all have our preferences
Yeah, running this locally would cost a lot of computing power; my PC's gonna cry with its 4 tokens/second 😂
14B can fit into 12GB of ram
You can easily run the entire DeepSeek R1 model with 128 GB RAM + 48 GB VRAM btw, but you will only get 1.6 t/s (you have to use the Unsloth 1.58-bit version).
Cool! I found going below 3-4bit quantization really starts to affect intelligence, but it's probably still smarter than the 32b & 70b version.
why is it so slow compared to llama 3.2 and gpt 4o ? those 2 are instant with the answer on my PC
You can't run gpt 4o on your machine--it's a proprietary model. You can only use the API to connect to Microsoft Azure servers. ollama runs everything locally.
Can a GTX 1060 6GB and a Ryzen 5 5600G processor, with two 8GB 3200MHz RAM modules, run 7B and 8B models?
Yes, but the speed will not be great. Watch the example in the video where I ran both the 7b and 8b models on a computer with 16GB RAM and no dedicated VRAM.
Can the 14B model be run on an 8GB VRAM GPU? Will it divide the workload between the CPU and GPU, or shift entirely to the CPU?
It will divide it between the GPU and CPU/RAM, since it needs around 10GB to run.
@@BlueSpork I have an i5-13400, 32GB RAM, and a GTX 1080 8GB GPU. I am running the 8b model and it runs quite well on the GPU, but I want to run the 14b model. What kind of speeds can I expect if it divides it between CPU and GPU? Like at least 4-5 tokens per second, or even lower?
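A rough way to guess at that, treating generation as bandwidth-bound, with the CPU-resident part as the bottleneck (the bandwidth figures are my assumptions, roughly a GTX 1080 and dual-channel DDR4; real ollama numbers will come in lower):

```python
# Per token you stream the GPU-resident layers over VRAM bandwidth and the
# spilled layers over system RAM bandwidth; total time is the sum of both.

def split_tps(model_gb, vram_gb, gpu_bw_gbs, ram_bw_gbs, reserve_gb=1.5):
    on_gpu = max(min(model_gb, vram_gb - reserve_gb), 0)  # keep some VRAM for KV cache/display
    on_cpu = model_gb - on_gpu
    return 1 / (on_gpu / gpu_bw_gbs + on_cpu / ram_bw_gbs)

# 14b Q4 (~10 GB) on an 8 GB GTX 1080 (~320 GB/s) + dual-channel DDR4 (~40 GB/s):
print(f"~{split_tps(10, 8, 320, 40):.0f} tokens/s ceiling")  # ~9; expect real numbers below that
```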
My test: 2070 8GB, 32GB RAM
8b: 40~46 tokens/s
14b: 3.59~4.5 tokens/s (GPU VRAM fully loaded)
I hope NVIDIA will not stop making video cards with more VRAM just because they want to make more money...
What would be better/cheaper: RTX GPUs or an Apple Mac?
At this stage, with inflated RTX prices and supply-vs-demand issues, a Mac might be cheaper, while RTX is going to be better in performance, especially with the 5090.
Apple charges tons for RAM, but not as much as NVIDIA. The new NVIDIA Project DIGITS will be most flexible. It's slower & more expensive than a 5090, but has 128GB of unified RAM.
I'm thinking of grabbing a 3090 or AMD card with 24GB of VRAM once the stock for the 5090's are available. But I may get a 5070 Ti or just rent GPU's from Vast AI or RunPod whenever I need them.
@@nathanbanks2354 I just tried one model today, a 16B model with a size of 16GB, and loaded it into my 3080 10GB with the offload-to-system-RAM option in LM Studio. I'm definitely looking for an upgrade to a 5080 or 5090, and in the meantime my current setup plus RunPod is a good solution.
These points don't seem accurate to me. For one, if running a 12B model, 4GB of headroom is kinda low for a reasoning model. APUs are king.
Okay, let's try the 7B model out with an Nvidia GT 1030 (2GB VRAM) and an Intel N100 with 32 GB RAM.
Is this really your voice bro?
It is. Cloned. Why?
@BlueSpork I'd like to use your voice man
As a laptop user I can only afford CPU. At least with my Ryzen 6600H, I could load the 8b model (q4_k_m) in 5 seconds and get 8 tokens/s, with 5.8GB of RAM taken.
Such a good Video !
Thanks!
Nvidia H100.
Oh Nvidia H800 is fine.
😅
Yeah, it could be why it's an MoE model since the H100 has better NVLink communication. But they both have 80GB of RAM and only 64-bit numbers are penalized, so the Chinese variants of the H100 are still super fast.
NVIDIA is such a dog corp for outfitting all of their customers with Napoleonic amounts of RAM
thanks
5090 should be enough then 😂😂😂😂
I ran it on a pair of 3090's and the 70b and 32b were both quite fast. The computer cost a couple bucks an hour, and since everyone else wants 4090's, I could get the preemptible computers without getting kicked off.
What's the point of running smaller models locally? The answers will always be lower quality. Say in coding you want the best possible answer to your question; with a small or even a medium model you won't get the best possible solution.
Maybe it’s good for people who are curious about how local models work but don’t have hardware good enough to run larger models
uncensored roleplaying
Your mileage may vary with smaller models. 14b does pretty well on programming-related topics IMHO.
You can train them for a specific problem, and then they get better results than the full model.
I asked DeepSeek to estimate the TPS for DeepSeek R1 671B on a Dell R930 server with 1TB of RAM. It says 0.1 to 1 TPS! 😧🤯🫢 Are you surprised? So I asked the estimated TPS rates for the other models. The following is the answer it gave me.
Model | Size | Estimated TPS (Dell PowerEdge R930)
DeepSeek V3 | Unknown | Likely similar to 32B or 70B models (see below)
DeepSeek-R1-Distill-Qwen-32B | 32B | ~2-5 tokens per second
DeepSeek-R1-Distill-Qwen-14B | 14B | ~5-10 tokens per second
DeepSeek-R1-Distill-Qwen-7B | 7B | ~10-20 tokens per second
DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | ~20-50 tokens per second
DeepSeek-R1-Zero | Unknown | Likely similar to 7B or 14B models
DeepSeek-R1-Distill-Llama-70B | 70B | ~1-2 tokens per second
DeepSeek-R1-Distill-Llama-8B | 8B | ~10-20 tokens per second
It then went on to recommend the NVIDIA A100 or H100 GPUs (graphics cards) saying GPUs are 10-15x faster than CPUs. When I said I could afford neither 😭, it then suggested the following GPUs with estimated TPS.
GPU Model | VRAM (GB) | Estimated TPS (7B Model) | Price Range
NVIDIA RTX 4090 | 24 | ~300-600 | 1,600-2,000
NVIDIA RTX 3090 | 24 | ~200-400 | 1,000-1,500
NVIDIA A4000 | 16 | ~100-200 | 800-1,200
NVIDIA RTX 3060 | 12 | ~50-100 | 300-400
NVIDIA RTX 2080 Ti | 11 | ~50-100 | 400-600
(They are cheaper if you look around.)
I said I couldn't afford any of them, too, and will continue to use the web chat box 😂. It says:
Using the web chat box is a fantastic choice because:
1. Cost-Effective: You don’t have to worry about hardware costs, electricity bills, or maintenance.
2. Convenient: It’s always available, and you don’t need to set up or manage anything.
3. Powerful: The models behind web chat boxes are often state-of-the-art and run on massive GPU clusters, so you get top-tier performance without any effort.
Note: I tried 1.5B on Lenovo T430 i3 8GB RAM and it ran quite well, but took a bit to "think".
I don't actually trust an LLM to answer this type of question well, but if it's finding the answer in a web search, it could be accurate. I've heard Groq is running DeepSeek-r1:70b at ~250 tokens per second (not to be confused with Grok).