DeepSeek R1 Hardware Requirements Explained

  • Published Feb 1, 2025

COMMENTS • 138

  • @BlueSpork
    @BlueSpork  3 hours ago +8

    In this video, I failed to mention that all the models shown are quantized Q4 models, not full-size models. Q4 models are smaller and easier to load on computers with limited resources, which is why I used them: to show what most people can run on their own machines. However, I should have made clear that these are not full-size models. If you have enough hardware resources, you can download the larger Q8 and FP16 versions from Ollama's website. Also, I didn't cover running local LLMs from system RAM instead of VRAM in detail, because this video focuses mainly on GPUs and VRAM. I might make another video covering that in more depth. (A rough size-per-quantization sketch is appended at the end of this thread.)

    • @jeffcarey3045
      @jeffcarey3045 3 minutes ago

      14b fits fine on a 2080 Ti that only has 11 GB of VRAM. 1.5B is a 2 GB model; you don't need 8 gigs of RAM for it.
      Your specs all seem way higher than actually needed.
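
    A rough way to sanity-check these sizes (editor's sketch, not numbers from the video; the ~4.5 bits/weight figure for Q4_K_M and the 20% runtime overhead are assumptions):

    # Back-of-the-envelope memory estimate for a quantized model
    # (a sketch under the stated assumptions, not official Ollama/DeepSeek numbers).

    def approx_size_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
        """Memory to load a model: params * bits / 8, plus ~20% for KV cache and runtime."""
        return params_billion * bits_per_weight / 8 * overhead  # params in billions -> result in GB

    for name, params in [("1.5b", 1.5), ("7b", 7), ("8b", 8), ("14b", 14), ("32b", 32), ("70b", 70), ("671b", 671)]:
        q4 = approx_size_gb(params, 4.5)    # Q4_K_M averages roughly 4.5 bits/weight (assumption)
        q8 = approx_size_gb(params, 8.5)    # Q8_0 is a bit over 8 bits/weight
        fp16 = approx_size_gb(params, 16)
        print(f"{name:>5}: Q4 ~{q4:6.1f} GB   Q8 ~{q8:6.1f} GB   FP16 ~{fp16:7.1f} GB")

    With these assumptions, 14b lands near the ~10 GB figure mentioned later in the comments, and 671b lands near ~450-500 GB at Q4 and ~1.6 TB at FP16.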

  • @mrpicky1868
    @mrpicky1868 10 hours ago +63

    love how ppl went from having "no chance of owning intelligent robot" to "4 words per second is too slow"

    • @MrViki60
      @MrViki60 6 hours ago +7

      It's not an intelligent robot.

    • @mrpicky1868
      @mrpicky1868 6 hours ago +1

      @@MrViki60 It knows more than you and can do more stuff than you... who are you then?

    • @Yilmaz4
      @Yilmaz4 5 hours ago

      more like 20 years but yeah

    • @MrViki60
      @MrViki60 4 hours ago

      @ go to church lil bro and stop bothering me.

    • @tringuyen7519
      @tringuyen7519 1 hour ago

      If you’re doing work with an AI at 4 words per second, you’re going to get fired soon. Just download the app & run it in the cloud!

  • @KrzysztofDerecki
    @KrzysztofDerecki 10 hours ago +39

    You do not need VRAM to run those models. It's all about memory bandwidth. VRAM is usually around 1000 GB/s, but you can get about 500 GB/s from system RAM on better motherboards supporting 8-channel memory, or even 12 or 16 channels. You can run the 671B model on such a machine at about 5 T/s, and it will be much cheaper than using GPUs. (A rough throughput sketch is appended at the end of this thread.)

    • @Mehtab20mehtab
      @Mehtab20mehtab 5 hours ago

      Some systems have CPU limitations for RAM, only 64 GB max.
      Can you recommend workstation CPUs with 8-channel memory? Even an old one with 96 GB of DDR3, or do I need a newer-generation workstation? I can set aside a budget to test. Intel Xeon or AMD? Xeons are cheap for not having consumer I/O and integrated graphics, and if those aren't needed you can get an 8-core CPU for just 20 dollars.

    • @Mehtab20mehtab
      @Mehtab20mehtab 5 hours ago

      Or maybe I need DDR5 at 5000 MT/s or above, along with 8 channels.

    • @19Ronin95
      @19Ronin95 5 hours ago +1

      Make a video about this and explain it to us please. And show us everything!

    • @francistaylor1822
      @francistaylor1822 5 hours ago +1

      @@Mehtab20mehtab You still have to have the model in memory. You can use SSDs, so I have heard, but it wouldn't be very fast at all.

    • @Kazekoge101
      @Kazekoge101 5 hours ago

      Not sure, but I think a MacBook Pro M4 Max will be enough for the larger ones. Not entry-level hardware though.
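
    A quick sanity check of the bandwidth argument (editor's sketch; the bandwidth and model-size figures are assumptions, and real throughput falls well below these ceilings):

    # Optimistic decode-speed ceiling for a memory-bound model.

    def tokens_per_s_ceiling(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
        """Each generated token reads roughly the active weights once,
        so bandwidth / bytes-read-per-token gives an upper bound on tokens/s."""
        return bandwidth_gb_s / gb_read_per_token

    # Dense 32B at Q4 (~20 GB) on a ~936 GB/s RTX 3090: ceiling around 47 tok/s.
    print(tokens_per_s_ceiling(936, 20))
    # Dense 70B at Q4 (~40 GB) on ~500 GB/s 8-channel server RAM: ceiling around 12 tok/s.
    print(tokens_per_s_ceiling(500, 40))
    # DeepSeek-R1 671B is MoE with ~37B active parameters (~21 GB at Q4): ceiling around
    # 24 tok/s, so the ~5 T/s reported above is plausible once real-world overhead is included.
    print(tokens_per_s_ceiling(500, 21))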

  • @semeunacte
    @semeunacte 8 hours ago +6

    Concise, easy to understand. Thanks mate.

  • @xhobv02
    @xhobv02 5 hours ago +8

    I run the 70B model on 128GB of RAM with a 3rd-gen Ryzen (R9 5950). I get around 1 token/s, which is slow, but the model is very good at reasoning and providing detailed answers.

    • @examplerkey
      @examplerkey 4 hours ago +1

      VRAM is the secret sauce, not DRAM.

    • @xhobv02
      @xhobv02 4 hours ago

      @examplerkey Correct, but I can never afford more than 16GB of VRAM.

    • @tringuyen7519
      @tringuyen7519 1 hour ago

      @@xhobv02 So you’re not going to get a 5090 with 32GB of GDDR7 for the $5k street price? Where’s your commitment to AI?😂😂😂

    • @nathanbanks2354
      @nathanbanks2354 1 hour ago

      It's pretty fast on a pair of 3090's, but the 32b model is pretty much just as smart anyway for most stuff. The 671b model is much smarter.

  • @JustFacts81
    @JustFacts81 1 day ago +29

    Nice! Short and crisp 👍

  • @zdspider6778
    @zdspider6778 11 hours ago +17

    Best Nvidia could do was give us 5 WHOLE gigabytes in 8 years (GTX 1080Ti - RTX 5080). Blessed be thy leather jacket!

    • @nathanbanks2354
      @nathanbanks2354 1 hour ago

      One day, when the 5090 comes back in stock, you'll be able to get 32GB without paying obscene amounts of money for 80GB of VRAM. At least ollama is pretty good at splitting a model across multiple GPU's--I ran DeepSeek-r1:70b on a pair of 3090's I rented and it was pretty fast.

  • @estapeluo
    @estapeluo 20 hours ago +12

    Yeah... about 500GB of RAM for the 671b Q4... but the full model is 1.6+ TB.

    • @BlueSpork
      @BlueSpork  17 hours ago +10

      Thank you for pointing that out. I did not mention that these are Q4 models.

    • @juliopeguero5665
      @juliopeguero5665 16 hours ago +1

      @@BlueSpork Ahh, no wonder it didn't make sense to me.

  • @akonako364
    @akonako364 5 hours ago +2

    The ROG Ally Z1E (white) works really well with 14b as long as I set the VRAM to auto and set 8 cores (via LM Studio) to offload solely to RAM. So it should be possible to go to 20b on an Ally X or any 32GB-based handheld PC.

  • @nathanbanks2354
    @nathanbanks2354 1 hour ago

    Nice to have the specs. I'm tempted to try the 671b model on a server with 8 A6000s that I can rent for a few bucks an hour. That would be 384GB of VRAM, which is almost enough to run efficiently with 4-bit quantization. I can run DeepSeek-r1:14b at 11.93 tokens/s on a laptop with a Quadro P5000 video card; it's nice to know a 3060 is 2.2x as fast. The 32b model was running at 1.73 tokens/s, but this is largely a CPU measurement. I'm tempted to upgrade to an AMD card or a 3090 or 5060 Ti or something.
    I rented a server with 2x 3090's from Vast AI when DeepSeek-R1 first came out and tried the 70b model. It ran quite well with ollama, utilizing both GPUs at 250-300 watts. I didn't see a large difference in intelligence between the 70b and 32b models... though I wish there were a DeepSeek coder model with the R1-style thinking/fine-tuning.

  • @joefreeman3772
    @joefreeman3772 7 hours ago +3

    A nice way to use DeepSeek R1 is Deep Infra, who are offering the 671B and 70B models for dirt cheap. The 70B distilled model actually works better for me and is 23 cents / 69 cents per Mtoken input/output.

    • @tringuyen7519
      @tringuyen7519 1 hour ago

      But the news media keeps harping on DeepSeek’s $0.14/Mtoken input! Don’t tell me the news media is embellishing.

  • @pathfinder.george
    @pathfinder.george 5 hours ago +2

    The 7B model runs great on my M1 Pro chip. Perhaps it's harnessing the ML cores on top of the CPU? The new 50-series cards boast a LOT of ML cores, so they should be able to outperform the 3090s significantly?

  • @mrpicky1868
    @mrpicky1868 10 hours ago +3

    1. Will it run on (and use the AI accelerator of) the AMD HX 370?
    2. What software will it run on in this setup?
    3. Are distilled models "reasoning models"?
    4. Can you continue training the distilled models?

    • @matsekodo
      @matsekodo 1 hour ago

      The question about the AMD NPU is very interesting.

  • @Seyfettin.a
    @Seyfettin.a 13 hours ago +4

    I ran the 70B on my Mac Studio with 192GB of RAM, and it provided answers very quickly.

  • @Gastell0
    @Gastell0 7 hours ago +1

    2:45 - On MI25 at 220W I get:
    total duration: 40.378089366s
    load duration: 30.090414ms
    prompt eval count: 11 token(s)
    prompt eval duration: 69ms
    prompt eval rate: 159.42 tokens/s
    eval count: 765 token(s)
    eval duration: 40.277s
    eval rate: 18.99 tokens/s
    I wonder if it's core speed that affects this; the MI25 has HBM2 memory, which doesn't seem to play the dominant role in this case.
    Note: your input was slightly inconsistent, at least once you had "larger" instead of "largest"

  • @isharadhanushan2002
    @isharadhanushan2002 14 hours ago +12

    I ran the 14b model on an RTX 4050 Laptop GPU with 6GB VRAM and a Ryzen 5 8645HS with a single 16GB RAM stick running at 5600MT/s. Getting an answer from it took 5-10 minutes. 😂😂😂

    • @IA07C
      @IA07C 4 hours ago

      Same model running on a GTX 1660 Ti; an answer takes roughly 1.5 minutes. Here's what I see:
      The model uses almost no system RAM, barely 0.5GB.
      The model uses all of the GPU's VRAM (6GB).
      The model only hits 100% GPU usage at the start, then usage drops to 20%.
      The model uses all the CPU cores at 50%, except one that saturates at 100% (i5-11400).
      That model is supposed to use more VRAM, but it seems to run fine with this configuration. I recommend checking your NVIDIA drivers to see whether GPU acceleration is actually being used, or whether there is an overheating problem.

    • @tringuyen7519
      @tringuyen7519 1 hour ago

      The media hype around DeepSeek running locally is misplaced! Running DeepSeek at 5 tokens per second is ridiculous, but it helps NVDA get more business!

  • @Anup_x
    @Anup_x 1 day ago +12

    Thanks this is very helpful

  • @ZeroUm_
    @ZeroUm_ 7 hours ago +2

    deepseek-r1 14b runs smoothly in 16GB VRAM + 32GB 3200MT/s RAM (it fits in VRAM alone), but 32b is molasses slow, not worth it.

  • @sylvainflipot8007
    @sylvainflipot8007 23 hours ago +2

    Could you also explore the quantization?

    • @BlueSpork
      @BlueSpork  20 hours ago +1

      All of these models are Q4 (quantization Q4_K_M)

    • @derduebel
      @derduebel 13 hours ago

      @@BlueSpork 😂

  • @ryanglass4004
    @ryanglass4004 11 hours ago +2

    I've been running all models up to the 32b model on a 12-year-old machine, just on CPU: an AMD FX-8350 CPU (8 cores) and 24GB of DDR3 RAM. I'm getting 6-8 tokens per second on the 1.5b, 2.5 tokens on the 7b, 1 token per second on the 14b, and the 32b is very slow. However, the answers on anything smaller than the 14b are poor quality, so for this to be effective in a real-world setting I will need a better machine.

  • @imrokwasiba9027
    @imrokwasiba9027 12 hours ago +1

    You are the best, thank you!

  • @agush22
    @agush22 20 hours ago +3

    You are getting better performance than I am.
    14b is giving me an eval rate of 25.57 tokens/s on a 4060 Ti 16GB. Could it be related to me running a Docker container with Ollama through WSL? Wondering how to speed up my setup.

    • @jilherme
      @jilherme 19 hours ago +3

      why don't you run ollama directly through terminal to compare?

    • @BlueSpork
      @BlueSpork  17 hours ago +5

      Running Ollama in a Docker container through WSL can introduce some performance overhead compared to running it natively on Windows.

  • @ArtemAleksashkin
    @ArtemAleksashkin 19 hours ago +3

    Thank you!

  • @anon1999-h5j
    @anon1999-h5j 1 day ago +6

    Can you benchmark how each model performs? There must be a sweet spot for performance compared to requirements.

  • @thirien59
    @thirien59 3 hours ago +2

    You are not talking about running full-size FP8 models, by the way; you are talking about running Ollama's 4-bit quantized models, which are around half the size of the true models.
    For 671B parameters, you would need around 700 GB of RAM.

    • @BlueSpork
      @BlueSpork  3 hours ago +1

      You are right. Thank you for pointing that out. I failed to mention that in the video, so I made a comment about it and pinned it to the top. Thanks!

    • @crackwitz
      @crackwitz 9 minutes ago

      Even fp8 isn't "full size". And during training, it's likely even more than fp16.

  • @birdost4872
    @birdost4872 16 hours ago +2

    Can I run 10 GPUs with 8GB VRAM in parallel and run the 70b model?

    • @RubberDuckDebugger
      @RubberDuckDebugger 15 hours ago +3

      Technically, but it would be rough. LLM performance is mostly limited by memory speed, and in a multi-GPU setup you get more capacity, but speed will be limited by the speed of a single GPU.
      That is to say, two RTX 3090's will perform about the same as an RTX A6000, which is the same chip with twice as much VRAM. The RTX 3090's are still the cheaper option, but the power draw will be twice that of a single GPU.
      GPUs with smaller amounts of memory typically have slower memory, so three 8 GB 3060's will deliver much worse performance than a single 24 GB 3090.
      I wish we could go back to the days when board partners could release models with twice the memory of the OEM version of a GPU.

    • @nathanbanks2354
      @nathanbanks2354 1 hour ago

      It would be interesting to test this. I ran the 70b model on a pair of 3090's and it was reasonably fast, with both GPUs taking 250-300 W of power, but I don't know if this is better or worse than a single A6000. The 671b model uses a mixture-of-experts system, which should be much more efficient than a large dense model like Llama 405b because the GPUs don't need to communicate as much. Presumably this is because DeepSeek was using H800 GPUs instead of H100's... the Chinese variants have less inter-GPU communication and less 64-bit floating point arithmetic, but they both have 80GB of VRAM, and for FP4 & FP8 calculations they're both fine. I used Mixtral a few months back, and it was faster than other models with the same number of parameters, but I'm not sure if this was caused by inter-GPU communication. I think the computer I rented had 4x 4090's when I tested Mixtral 8x22b.

    • @RubberDuckDebugger
      @RubberDuckDebugger 31 minutes ago

      @nathanbanks2354 I'm pretty sure MoE models are faster even when run on one GPU. Because only a subset of parameters is active at any given time, the models will run like a smaller model despite needing more VRAM than an actually smaller model.
      As for the inter-GPU connectivity, I don't think that's nearly as important for inferencing versus training. I saw a video a while back where someone distributed inferencing across multiple machines, including a custom build and a Mac, and I don't recall it showing a significant impact on performance.
      As I understand it, and please correct me if I'm wrong, the high memory bandwidth required for LLM inferencing only applies within processing a layer of the model. So as long as you distribute whole layers to each available GPU, the traffic between GPUs is quite minimal.
      Of course, distributing layers means that smaller GPUs are even more wasteful.
      For example, let's say we have a 40 GB model made up of eight 5 GB layers.
      You would need eight 8 GB GPUs for good inferencing performance, and likely a 9th GPU if you want decent context. That's a total of 64 to 72 GB of VRAM.
      Compare that to a 48 GB GPU, where you can load all layers onto one GPU and still have 8 GB left over for context.
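
    The layer-packing arithmetic above can be written out as a small sketch (hypothetical numbers; it assumes whole layers must fit on a single GPU and reserves a little VRAM for the KV cache):

    # How many GPUs are needed if a model is split by whole layers (editor's sketch).
    import math

    def gpus_needed(n_layers: int, layer_gb: float, gpu_vram_gb: float, context_reserve_gb: float = 2.0) -> int:
        layers_per_gpu = int((gpu_vram_gb - context_reserve_gb) // layer_gb)
        if layers_per_gpu < 1:
            raise ValueError("a single layer does not fit on this GPU")
        return math.ceil(n_layers / layers_per_gpu)

    # The 40 GB model from the example above: eight 5 GB layers.
    print(gpus_needed(8, 5.0, 8))    # 8 GB cards fit one layer each -> 8 GPUs (64 GB total, before context)
    print(gpus_needed(8, 5.0, 24))   # 24 GB cards fit four layers each -> 2 GPUs
    print(gpus_needed(8, 5.0, 48))   # a 48 GB card holds everything with room left over for context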

  • @stevensexton5801
    @stevensexton5801 6 hours ago

    What an excellent video!

  • @SEO-010
    @SEO-010 11 hours ago +1

    Are the Intel Arc B570 10GB and B580 12GB good for this?

  • @ngana8755
    @ngana8755 5 hours ago +2

    If I have a laptop with 64 GB RAM, what model of DeepSeek R1 can run on my machine? Also, is the 8 GB of VRAM at 1:32 purchased separately and connected to a laptop via a USB cord?

    • @ChessPuzzles2
      @ChessPuzzles2 5 hours ago +1

      VRAM is in the graphics card; it can only be added if you can add a graphics card.

  • @victoryeung2844
    @victoryeung2844 19 hours ago +7

    Clear and to the point

  • @zgolkar
    @zgolkar 13 hours ago +2

    So… it would be possible to run R1 with 4x128 GB of RAM. I wonder how slow…

    • @jameswilliam7992
      @jameswilliam7992 12 hours ago

      Someone tried it and it was extremely slow.
      Look up Digital Spaceport for his test.

    • @nathanbanks2354
      @nathanbanks2354 1 hour ago

      For an idea of the speed at 8x80GB, you can see ua-cam.com/video/bOsvI3HYHgI/v-deo.html
      These servers cost over $20/hour unless you're YouTube famous.

  • @alfredomoreira6761
    @alfredomoreira6761 12 hours ago +3

    This is wrong: the 671B model is a mixture-of-experts model, so VRAM is only needed for the active experts. The inactive experts can be offloaded into RAM, and usually only 4 to 8 experts are active.
    So for 8 active experts and the 4-bit quantized version, you need around 64 GB of VRAM and 322 GB of RAM. (A rough sketch of this split is appended at the end of this thread.)

    • @danielhenderson7050
      @danielhenderson7050 9 hours ago

      I haven't heard of anyone doing that yet. Although there are some discussions and papers about methods, I am not sure this is an actual thing being done right now, tbh. It would be a huuuuuuge achievement if you could run DeepSeek V3/R1 on one or two consumer GPUs at home.

    • @alfredomoreira6761
      @alfredomoreira6761 8 hours ago

      @@danielhenderson7050 There are also people running it on EXO with stacked computers and high network bandwidth. With EXO you can run R1 671B with 4 different high-end computers (each computer being an AMD Threadripper + RTX 3090 24GB + 128 GB RAM).

    • @nathanbanks2354
      @nathanbanks2354 55 minutes ago

      I'm not sure how much this would speed things up, because loading/unloading the correct expert for any given question is pretty hard. It's designed to avoid GPU-to-GPU communication, where an MoE model will use only 2 of the 8 GPUs. However, maybe if you ask the same type of questions over and over it could keep the most commonly used weights cached on the GPU and the rest in RAM... I'm not familiar enough with how the weights are divided. I remember Mixtral 8x22b would typically activate two "experts" for one answer.
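
    A rough sketch of the VRAM/RAM split described at the top of this thread (editor's sketch; it assumes ~37B active parameters out of 671B and ~0.5 bytes per parameter at 4-bit, and ignores the KV cache and which experts actually stay resident):

    # Split of quantized MoE weights between VRAM (active path) and system RAM (offloaded experts).

    def moe_memory_split_gb(total_params_b: float, active_params_b: float, bytes_per_param: float = 0.5):
        vram_gb = active_params_b * bytes_per_param                     # hot path: active experts + shared layers
        ram_gb = (total_params_b - active_params_b) * bytes_per_param   # cold experts offloaded to system RAM
        return vram_gb, ram_gb

    vram, ram = moe_memory_split_gb(671, 37)
    print(f"~{vram:.0f} GB VRAM for active weights, ~{ram:.0f} GB RAM for the rest")
    # ~18 GB / ~317 GB with these assumptions; the ~64 GB VRAM figure quoted above presumably
    # budgets extra headroom for shared layers, KV cache, and keeping several expert sets resident.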

  • @matiasbrunaga3269
    @matiasbrunaga3269 10 hours ago +1

    Can this run on AMD GPUs?

  • @rangerkayla8824
    @rangerkayla8824 18 hours ago +2

    14b runs fast with my 4070. Not very accurate compared to online DeepSeek. Disappointing so far. Maybe better prompts will help, although I used the same prompts online and got very good results.

    • @BlueSpork
      @BlueSpork  17 hours ago +4

      DeepSeek-R1-Distill models are fine-tuned from open-source models using samples generated by DeepSeek-R1. They will never perform as well as the online version.

  • @justannpc1866
    @justannpc1866 21 hours ago +1

    So the models must run 100% on the GPU for faster results?

    • @BlueSpork
      @BlueSpork  20 hours ago +4

      If a local LLM runs on a computer without dedicated VRAM, it relies entirely on the CPU for computations instead of a GPU. The model loads into system RAM, but since RAM is slower than VRAM and CPUs are less optimized for the parallel processing required by LLMs, performance is significantly reduced.

    • @justannpc1866
      @justannpc1866 19 hours ago +2

      @@BlueSpork Thanks for the speedy reply. Assuming that speed is not taken into account, is the precision/accuracy affected if I were to run larger models like 32b on below-spec hardware?

    • @BlueSpork
      @BlueSpork  17 hours ago +2

      @@justannpc1866 If a model runs on hardware below its recommended specs, precision/accuracy remains the same.

  • @chroma1004
    @chroma1004 14 hours ago +1

    I'm using a Ryzen 5600, 16GB DDR4-3000, and a 3GB GTX 1060, and it's decent running 8b (at 10 tk/s). But it hit a black-screen BSOD and rebooted due to VRAM use after 30 mins lol.

  • @Nikita11035
    @Nikita11035 12 hours ago +1

    Can I run it on a Radeon? A 6900 XT, for example.

    • @rajeebbhoumik4093
      @rajeebbhoumik4093 10 hours ago +1

      Yes

    • @VShopov
      @VShopov 2 hours ago

      @@rajeebbhoumik4093 How? For me it's showing 100% CPU usage on the processor.

    • @nathanbanks2354
      @nathanbanks2354 47 minutes ago

      PyTorch supports ROCm, as does ollama. However, older NVIDIA GPUs work better than older AMD GPUs. I've done AI stuff on an MI25 card. If it's not working for you, it could be a driver issue or an old card... I've never tried to run ollama on Windows or macOS.

  • @Investchan
    @Investchan 22 hours ago +1

    What about distilled models?

    • @BlueSpork
      @BlueSpork  21 hours ago +7

      They are all distilled models, except the 671B.

  • @EienKurisukiCA
    @EienKurisukiCA 19 hours ago +1

    My 4070 runs 14b smoothly though?

    • @rangerkayla8824
      @rangerkayla8824 18 hours ago

      14b runs great with my 4070. Not very accurate results though compared to online DS. Disappointing so far. I used the same prompts online and offline.

    • @BlueSpork
      @BlueSpork  17 hours ago +2

      Yeah, it runs smoothly on my 3060 12GB too, but if I run something else that requires VRAM at the same time … something will slow down. This is why I recommended 16GB, to leave some room for other programs.

    • @rangerkayla8824
      @rangerkayla8824 17 hours ago +3

      @@BlueSpork I don't plan to run anything else that requires VRAM while running DeepSeek. Thanks for the reminder. Not going to buy a $1000+ GPU anytime soon. Maybe next year.

  • @diogobarreto4407
    @diogobarreto4407 19 hours ago

    Great video!
    Do you have an estimate of how many tokens per second a 24GB VRAM GPU will generate?

    • @BlueSpork
      @BlueSpork  17 hours ago +2

      Thanks! Do you mean for the 32b model? I’m not sure. Maybe someone with a 24GB GPU will see your comment and answer

    • @SCPH79
      @SCPH79 14 hours ago +4

      With the 32b model, running on an RTX 4090 24GB, I get around 34 tokens per second.

    • @keltonkuntz5952
      @keltonkuntz5952 8 hours ago +1

      I get 7-10 tokens/s or so for the eval rate on a 4090

  • @saidjonrko
    @saidjonrko 17 hours ago +2

    Thanks, I won't try to run it locally; instead, I would prefer the online version.

    • @BlueSpork
      @BlueSpork  16 hours ago +2

      I understand, we all have our preferences

    • @rickyleonardi7605
      @rickyleonardi7605 14 hours ago

      Yeah, running this locally would cost a lot of computing power; my PC's gonna cry with its 4 tokens/second 😂

  • @dibu28
    @dibu28 10 hours ago +1

    14B can fit into 12GB of RAM.

  • @batuhanbayraktar337
    @batuhanbayraktar337 13 hours ago +1

    You can easily run the entire DeepSeek R1 model with 128 GB RAM + 48 GB VRAM, btw, but you will only get 1.6 t/s (you have to use the Unsloth 1.58-bit version).

    • @nathanbanks2354
      @nathanbanks2354 1 hour ago

      Cool! I found going below 3-4bit quantization really starts to affect intelligence, but it's probably still smarter than the 32b & 70b version.

  • @doctor_who1
    @doctor_who1 11 hours ago

    Why is it so slow compared to Llama 3.2 and GPT-4o? Those two are instant with the answer on my PC.

    • @nathanbanks2354
      @nathanbanks2354 1 hour ago

      You can't run GPT-4o on your machine--it's a proprietary model. You can only use the API to connect to Microsoft Azure servers. Ollama runs everything locally.

  • @mralbinoman6231
    @mralbinoman6231 20 hours ago

    Can a GTX 1060 6GB and a Ryzen 5 5600G processor, with two 8GB 3200MHz RAM modules, run 7B and 8B models?

    • @BlueSpork
      @BlueSpork  20 hours ago +1

      Yes, but the speed will not be great. Watch the example in the video where I ran both the 7b and 8b models on a computer without any dedicated VRAM and with 16GB of RAM.

  • @AXLtheOG
    @AXLtheOG 16 hours ago

    Can the 14B model be run on an 8GB VRAM GPU? Will it divide the workload between CPU and GPU or shift entirely to the CPU?

    • @BlueSpork
      @BlueSpork  16 hours ago +2

      It will divide it between the GPU and CPU/RAM, since it needs around 10GB to run.

    • @AXLtheOG
      @AXLtheOG 16 hours ago

      @@BlueSpork I have an i5-13400, 32GB RAM, and a GTX 1080 8GB GPU. I am running the 8b model and it runs quite well on the GPU, but I want to run the 14b model. What kind of speeds can I expect if it divides the work between CPU and GPU? Like at least 4-5 tokens per second, or even lower?

    • @f91hk67
      @f91hk67 11 hours ago +1

      My test on a 2070 8GB with 32GB RAM:
      8b: 40~46 tokens/s
      14b: 3.59~4.5 tokens/s (GPU and RAM at full load)

  • @antoniocapraro89
    @antoniocapraro89 8 hours ago +1

    I hope Nvidia will not stop making video cards with more VRAM just because they want to make more money...

  • @Harsh-zn2od
    @Harsh-zn2od 8 hours ago

    What would be better/cheaper: RTX GPUs or an Apple Mac?

    • @BoyanOrion
      @BoyanOrion 7 hours ago +1

      In the current state, with inflated RTX prices and supply-vs-demand issues, a Mac might be cheaper, while RTX is going to be better in performance, especially with the 5090.

    • @nathanbanks2354
      @nathanbanks2354 49 minutes ago

      Apple charges tons for RAM, but not as much as NVIDIA. The new NVIDIA Project DIGITS will be the most flexible. It's slower & more expensive than a 5090, but has 128GB of unified RAM.
      I'm thinking of grabbing a 3090 or an AMD card with 24GB of VRAM once stock for the 5090s is available. But I may get a 5070 Ti or just rent GPUs from Vast AI or RunPod whenever I need them.

    • @BoyanOrion
      @BoyanOrion 40 minutes ago

      @@nathanbanks2354 I just tried one model today, a 16B model with a size of 16GB, and loaded it into my 3080 10GB with the offload-to-system-RAM option in LM Studio. I'm definitely looking to upgrade to a 5080 or 5090, and in the meantime my current setup plus RunPod is a good solution.

  • @ickorling7328
    @ickorling7328 2 hours ago

    These points don't seem accurate to me. For one, if running a 12B model, 4GB of headroom is kinda low for a reasoning model. APUs are king.

  • @ArcardyArcardus
    @ArcardyArcardus 12 hours ago

    Okay, let's try the 7B model out with an Nvidia GT 1030 (2GB VRAM) and an Intel N100 with 32 GB of RAM.

  • @Redzuan-rg2ev
    @Redzuan-rg2ev 18 hours ago +2

    Is this really your voice bro?

    • @BlueSpork
      @BlueSpork  17 hours ago +1

      It is. Cloned. Why?

    • @Redzuan-rg2ev
      @Redzuan-rg2ev 12 hours ago +2

      @BlueSpork I'd like to use your voice man

  • @kirshi8492
    @kirshi8492 8 hours ago

    As a laptop user I can only afford running on the CPU. At least with my Ryzen 6600H I could load the 8b model (q4_k_m) in 5 seconds and get 8 tokens/s, with 5.8GB of RAM taken.

  • @KAtergorie
    @KAtergorie 23 hours ago +2

    Such a good video!

  • @pravardhanus
    @pravardhanus 7 hours ago +1

    Nvidia H100.
    Oh Nvidia H800 is fine.
    😅

    • @nathanbanks2354
      @nathanbanks2354 45 minutes ago

      Yeah, it could be why it's an MoE model since the H100 has better NVLink communication. But they both have 80GB of RAM and only 64-bit numbers are penalized, so the Chinese variants of the H100 are still super fast.

  • @timmeeyh6523
    @timmeeyh6523 7 hours ago

    NVIDIA is such a dog corp for outfitting all of their customers with Napoleonic amounts of RAM

  • @kirillriman3611
    @kirillriman3611 12 hours ago

    thanks

  • @infinityimpurity4032
    @infinityimpurity4032 5 hours ago +1

    5090 should be enough then 😂😂😂😂

    • @nathanbanks2354
      @nathanbanks2354 43 minutes ago

      I ran it on a pair of 3090's and the 70b and 32b were both quite fast. The computer cost a couple bucks an hour, and since everyone else wants 4090's, I could get the preemptible computers without getting kicked off.

  • @Admin...
    @Admin... 19 hours ago

    What's the point of running smaller models locally? The answers will always be lower quality. Let's say in coding you want the best possible answer to your question; with a small or even a medium model you won't get the best possible solution.

    • @BlueSpork
      @BlueSpork  17 hours ago +1

      Maybe it’s good for people who are curious about how local models work but don’t have hardware good enough to run larger models

    • @aldi_nh
      @aldi_nh 13 hours ago +2

      uncensored roleplaying

    • @ZeroUm_
      @ZeroUm_ 7 hours ago

      Your mileage may vary with smaller models. 14b does pretty well in programming-related topics IMHO.

    • @KrzysztofDerecki
      @KrzysztofDerecki 7 hours ago

      You can train them for a specific problem, and they can get better results than the full model.

  • @examplerkey
      @examplerkey 4 hours ago +1

      I asked DeepSeek to estimate the TPS for DeepSeek R1 671B on a Dell R930 server with 1TB of RAM. It says 0.1 to 1 TPS! 😧🤯🫢 Are you surprised? So I asked for the estimated TPS rates of the other models. The following is the answer it gave me.
      Model                           Size     Estimated TPS (Dell PowerEdge R930)
      DeepSeek V3                     Unknown  Likely similar to 32B or 70B models (see below)
      DeepSeek-R1-Distill-Qwen-32B    32B      ~2-5 tokens per second
      DeepSeek-R1-Distill-Qwen-14B    14B      ~5-10 tokens per second
      DeepSeek-R1-Distill-Qwen-7B     7B       ~10-20 tokens per second
      DeepSeek-R1-Distill-Qwen-1.5B   1.5B     ~20-50 tokens per second
      DeepSeek-R1-Zero                Unknown  Likely similar to 7B or 14B models
      DeepSeek-R1-Distill-Llama-70B   70B      ~1-2 tokens per second
      DeepSeek-R1-Distill-Llama-8B    8B       ~10-20 tokens per second
      It then went on to recommend the NVIDIA A100 or H100 GPUs (graphics cards), saying GPUs are 10-15x faster than CPUs. When I said I could afford neither 😭, it then suggested the following GPUs with estimated TPS.
      GPU Model            VRAM (GB)  Estimated TPS (7B Model)  Price Range
      NVIDIA RTX 4090      24         ~300-600                  1,600-2,000
      NVIDIA RTX 3090      24         ~200-400                  1,000-1,500
      NVIDIA A4000         16         ~100-200                  800-1,200
      NVIDIA RTX 3060      12         ~50-100                   300-400
      NVIDIA RTX 2080 Ti   11         ~50-100                   400-600
      (they are cheaper if you look around)
    I said I couldn't afford any of them, too, and will continue to use the web chat box 😂. It says:
    Using the web chat box is a fantastic choice because:
    1. Cost-Effective: You don’t have to worry about hardware costs, electricity bills, or maintenance.
    2. Convenient: It’s always available, and you don’t need to set up or manage anything.
    3. Powerful: The models behind web chat boxes are often state-of-the-art and run on massive GPU clusters, so you get top-tier performance without any effort.
    Note: I tried 1.5B on Lenovo T430 i3 8GB RAM and it ran quite well, but took a bit to "think".

    • @nathanbanks2354
      @nathanbanks2354 41 minutes ago

      I don't actually trust an LLM to answer this type of question well, but if it's finding the answer in a web search, it could be accurate. I've heard Groq is running DeepSeek-r1:70b at ~250 tokens per second (not to be confused with Grok).