Running FULL DeepSeek R1 671B Locally (Test and Install!)

COMMENTS • 216

  • @bittertruthnavin
    @bittertruthnavin 3 days ago +174

    When I posted on Reddit showcasing that I was able to run this exact model on my 4090 system, they downvoted my post like crazy. Some even mocked me, saying it's impossible to run it on consumer hardware. The sanctimonious localllama moderators later removed my Reddit post. I hope they see this and learn something.

    • @Aiworld2025
      @Aiworld2025 3 days ago +5

      Yeah, I have two 4090s, but Ollama crashed on the 671B model. I wish I could read your post. I have 128GB of RAM.

    • @Aiworld2025
      @Aiworld2025 3 days ago +1

      @@axlodl Exactly, it's time for me to upgrade, I know lol

    • @bittertruthnavin
      @bittertruthnavin 3 days ago +4

      @@Aiworld2025 I was running the exact version that Mr. Bowen here is running. It's not exactly the 671b per se; as he metaphorically explained, it's a 1.58-bit shrunk version of the 671b.

    • @TPCDAZ
      @TPCDAZ 3 days ago +35

      The mistake you made was assuming Reddit users had a brain cell between them

    • @j562gee0hdeewestsdegethemuLa
      @j562gee0hdeewestsdegethemuLa 3 days ago

      Some people will comment anything just for attention, even if they clearly have no knowledge, rather than offering informed local opinions

  • @dacherx
    @dacherx 3 days ago +44

    The magic of open-source sharing moves everyone forward!

    • @Bijanbowen
      @Bijanbowen 3 days ago +5

      I'm keeping it on a flash drive for the future LOL

    • @infinixplay5777
      @infinixplay5777 2 days ago +1

      Haha, that's the first thing that popped up in my mind, but A BOOTABLE USB 😂 @@Bijanbowen

    • @freivonaußen
      @freivonaußen 2 days ago

      The code isn't open source

  • @jvangeene
    @jvangeene 1 hour ago

    Really insightful, well done! So impressed with the 80% shrunk model, that's insane!

  • @nahlene1973
    @nahlene1973 3 days ago +37

    For some reason, the slower token output actually made it feel even more like an actual person, and it makes you write the prompts a lot more mindfully, as if you are sending an email to a colleague who would literally take 30 minutes to respond (and that speed is already a gold-standard co-worker by human standards) 😂

    • @jtjames79
      @jtjames79 3 days ago +1

      Pay real close attention to the reasoning steps.
      Put the reasoning steps in your prompt.
      Rerun the prompt.
      Be surprised that it's twice as fast.
      Repeat until grokked.
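
      A minimal sketch of that loop in Python (the ask() helper is hypothetical, standing in for whatever local chat call you use; the point is feeding the model's extracted reasoning back into the next prompt):

      ```python
      # ask(prompt) is a hypothetical stand-in for your local model call,
      # returning (reasoning_steps, answer) parsed from the model output.
      def refine(ask, task: str, rounds: int = 3) -> str:
          prompt, answer = task, ""
          for _ in range(rounds):
              reasoning, answer = ask(prompt)
              # Seed the next run with the previous run's reasoning steps.
              prompt = f"Use these reasoning steps:\n{reasoning}\n\nTask: {task}"
          return answer
      ```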

    • @Bijanbowen
      @Bijanbowen 3 days ago

      It really is interesting how a slower response time can make a conversation feel more natural.

  • @JeromeDemers
    @JeromeDemers 3 days ago +10

    This is the third video of yours I've watched. There is something I like about your videos. I also noticed you look at the lens, so it looks like you are making eye contact while talking, which is great. Keep it up!

    • @Bijanbowen
      @Bijanbowen 3 days ago

      I really appreciate the kind words, I am glad you have enjoyed the videos so far!

  • @Aiworld2025
    @Aiworld2025 3 days ago +26

    You're the first to recommend Unsloth, and it was interesting to see it work on 128GB of RAM, awesome :D

    • @ngana8755
      @ngana8755 3 days ago +3

      What? It only runs on 128GB of RAM? I didn't know they had laptops for sale with that much RAM.

    • @TheMcSebi
      @TheMcSebi 3 days ago +2

      Damn, I only got 96GB of RAM

    • @Bijanbowen
      @Bijanbowen 3 days ago +1

      I loved the way they presented the documentation, especially with the examples of the different quantization outputs, etc.

  • @jayhu6075
    @jayhu6075 3 days ago +4

    First, what big news for everyone in the world, because in the future we can reduce the cost of running different models.
    What a great explanation of the quantization used to shrink this R1 model. Hopefully a follow-up tutorial about the

    • @Bijanbowen
      @Bijanbowen 3 days ago

      Thanks for the kind words. It is very exciting to see this progress in terms of lower cost.

  • @wahoobear6588
    @wahoobear6588 3 days ago +11

    I had always thought that A.I. technology belonged only to big companies or rich funds. This is a revolutionary event in technology history!

    • @gappergob6169
      @gappergob6169 3 days ago +2

      You're the rich guy for most people, if you can play around with this locally.

    • @wahoobear6588
      @wahoobear6588 3 days ago +5

      @@gappergob6169 No, right now I can't run it, but I mean that a large A.I. model can now run on a local computer (not a supercomputer in a data center). I think this may someday happen on a "really" local computer.

  • @MrDvaz
    @MrDvaz 3 days ago +1

    Great job as usual!!!! Keep them coming, professor!!!!!

    • @Bijanbowen
      @Bijanbowen 20 hours ago

      Thanks for the kind words!

  • @danial_amini
    @danial_amini 3 days ago +14

    that's crazy!!!! honestly 1-2 t/s is very very good for the 671b model, good job!!!!!

    • @Bijanbowen
      @Bijanbowen 3 days ago +1

      I was also very pleased with the output speed, way better than I had thought I'd see - thanks!

    • @007Rincewind
      @007Rincewind 3 days ago +1

      How does the quantized version of the 671b compare to the 70b version?

  • @hicamajig
    @hicamajig 2 days ago

    Excellent introduction to dynamic quants, thanks!

    • @Bijanbowen
      @Bijanbowen 19 hours ago +1

      Thanks very much! FWIW, some folks seemed to take issue with it, so make sure to double-check everything hahaa

  • @marcelo9170
    @marcelo9170 2 days ago

    Thank you for putting this video together! Great job!

    • @Bijanbowen
      @Bijanbowen 19 hours ago +1

      Thanks for the kind words!

  • @bearinch
    @bearinch 1 day ago

    Nice, TY for sharing 🙂! JFYI: CTRL+L = clear, CTRL+A = beginning of line in terminal; I've seen you type in so much I had to tell you 😀! Greetz from Slovakia

    • @Bijanbowen
      @Bijanbowen 19 hours ago +1

      I shall never give up my clear!!! hahah, thanks for the tip!

  • @FelipePeletti
    @FelipePeletti 3 days ago +1

    Awesome video, man! Thanks for sharing this alternative for running these LLMs on more average machines.

  • @nagasaikatakamsetty3969
    @nagasaikatakamsetty3969 3 days ago +7

    Good to see unsloth running on your machine.

  • @ewenchan1239
    @ewenchan1239 3 days ago +1

    Great video!!!
    I'm actually about to watch your vLLM video so that I can deploy the full DeepSeek R1:671b model on my home server (four nodes, each with 128 GB of RAM).
    To be fair though, this isn't TECHNICALLY the full 671b parameter model, because the 1.58-bit dynamic quantization, as you explained it, "reduced" some of that.
    Windows Task Manager was the phrase you were looking for.
    Great to see that this is able to run on your 3090 Tis, which means I should be able to run it on my 3090s (non-Ti).
    It is also interesting that their blog post changed between when you recorded and uploaded this video to YouTube and now: the post has since removed any and all references to running their model on vLLM.

    • @Bijanbowen
      @Bijanbowen 20 hours ago +1

      YES! Task Manager was it, hahaha. Interesting about the blog changes, thanks for pointing that out!

    • @ewenchan1239
      @ewenchan1239 12 hours ago

      @@Bijanbowen
      No problem.
      I only noticed it because your vLLM video was in my queue, and I ended up watching this video before that one; when I saw that in your video, I popped over to their blog to see how I would be able to deploy this via vLLM, and that's when and how I found that they had removed the references to it.
      They must have had a problem getting it to run with vLLM.

  • @comfyden
    @comfyden 3 days ago +7

    The 130GB quantized version: 128GB RAM + 2x RTX 3090 with 24GB VRAM each.
    You can check the section at 8:33.
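
    A rough back-of-the-envelope check of those numbers (a sketch only: the real GGUF mixes several bit-widths, so treating ~1.58 bits/weight as the average is an assumption):

    ```python
    # Approximate model size from parameter count and average bits per weight.
    def model_size_gb(params: float, avg_bits_per_weight: float) -> float:
        return params * avg_bits_per_weight / 8 / 1e9

    params = 671e9
    print(f"{model_size_gb(params, 1.58):.0f} GB")  # ~133 GB, near the 130GB file
    print(f"{model_size_gb(params, 16):.0f} GB")    # ~1342 GB at full FP16

    # Combined memory on the machine in the video:
    print(128 + 2 * 24, "GB")  # 176 GB of RAM + VRAM, enough headroom
    ```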

    • @sigma_z
      @sigma_z 3 days ago

      Are the 2 RTX 3090 linked using NVLINK?

    • @Bijanbowen
      @Bijanbowen 3 days ago +1

      They are not.

  • @saltygravy6928
    @saltygravy6928 3 days ago

    Thanks for the video. Very nice demo. Thanks again.

    • @Bijanbowen
      @Bijanbowen 20 hours ago

      Thanks for the kind words!

  • @michealkinney6205
    @michealkinney6205 3 days ago +3

    Great video, I was curious when I read the paper. Thanks!
    I was going to run it on an older T5500 PC with two M40s with 24GB of VRAM each and 192GB of system memory. I see now that it would likely be painfully slow at around

    • @autohmae
      @autohmae 3 days ago

      You can just try it; Ollama makes it easy to just try things.

    • @Bijanbowen
      @Bijanbowen 3 days ago +2

      Glad it was helpful. Your system can definitely run it; even while slow, it is fun to play with from an educational perspective, I suppose hahah. But yeah, the dynamic quant is very interesting, and I think it will be more widely used in future models.

    • @ewenchan1239
      @ewenchan1239 3 days ago

      @@Bijanbowen
      I'm running the DeepSeek R1:70b distilled model at home already.

    • @ewenchan1239
      @ewenchan1239 3 days ago +1

      @michealkinney6205
      You don't *need* to run the full model.
      You can always run the distilled models, which have lower computational resource requirements.
      I'm running the DeepSeek R1:70b distilled model at home already (and I've also been playing around with the 8b model as well, because that runs SIGNIFICANTLY faster).

    • @michealkinney6205
      @michealkinney6205 2 days ago +1

      @@ewenchan1239 Yeah, I know... it was just going to be for fun. Thanks though.
      But I don't understand why everyone is recommending "DeepSeek-R1-Distill-Llama-70B-GGUF", because on DeepSeek's own benchmark, the "DeepSeek-R1-Distill-Qwen-32B" model outperforms it. Also, on the HuggingFace Open LLM Benchmarks, the much smaller 14B model does even better (and much better than many larger models).
      So personally I will just run the 32B model locally. Also, if that model were dynamically quantized like the 671B R1 model, based on my numbers it would be < 10GB of VRAM with little quality loss... so I might try to do just that.

  • @MyKaosLife
    @MyKaosLife 3 days ago +4

    What I'd like to see is a comparison between the bigger models that can be run locally: this nerfed "full" model, the distilled Llama 70B, and the distilled Qwen 32B, comparing speed (t/s) and quality of results. If nothing else, so I don't have to do it myself 😄

  • @InsistoSonocazzimiei
    @InsistoSonocazzimiei 2 days ago +4

    No clickbait, please! Full = FP8, no quantized versions! Reason? We have no benchmarks for how good the quantized versions are... only Flappy Bird, and that's not enough for me 😢

  • @romayojr
    @romayojr 3 days ago +5

    that’s really cool to see. can you post the exact hardware used in the description? i know you mentioned some of it in the video but would be helpful if you posted it too

    • @hd3adpool
      @hd3adpool 22 hours ago

      He used 128 gigs of RAM and 2x 3090 Ti, if I'm not wrong.

    • @Bijanbowen
      @Bijanbowen 20 hours ago

      Thanks very much. It is 2x 3090 Ti, 128GB DDR4-3200, an MSI Z690 mobo, and an i7-12700 CPU.

  • @e8root
    @e8root 3 days ago +8

    If you are comparing quantization to an image, you should not talk about reducing resolution (that would be more akin to reducing the number of parameters!) but about dropping the number of colors. To show it, you could take an image with text and graphics on it and drop its number of colors to 3, and show that most of the text is still legible and the graphics somewhat visible. You could also drop the colors to e.g. 16 to show why quantizing to 4-bit generally has much less impact on model performance.
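
    In that spirit, a tiny numeric sketch (assuming naive uniform quantization of random Gaussian "weights"; real schemes use calibrated per-block scales) showing how error grows as the bit-depth drops:

    ```python
    import numpy as np

    def quantize(w: np.ndarray, bits: int) -> np.ndarray:
        """Uniformly snap weights to 2**bits levels, then map back to floats."""
        levels = 2 ** bits - 1
        lo, hi = w.min(), w.max()
        scale = (hi - lo) / levels
        return np.round((w - lo) / scale) * scale + lo

    rng = np.random.default_rng(0)
    w = rng.normal(size=100_000).astype(np.float32)  # stand-in layer weights

    for bits in (8, 4, 2):
        err = np.abs(quantize(w, bits) - w).mean()
        print(f"{bits}-bit: mean abs error {err:.4f}")
    # Error grows sharply below 4 bits, which is why dynamic quants keep
    # the sensitive layers at higher precision.
    ```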

    • @fr3zer677
      @fr3zer677 3 days ago

      I was thinking the same thing when he brought up that analogy. Perhaps instead he could have said "Dynamic quantization is like keeping more colors for the important parts, like the image in the center, while reducing the color to a minimum (only black and white) for other parts like the background and text."

    • @Bijanbowen
      @Bijanbowen 3 days ago

      Thanks for the alternative example. Like I said in the video, I don't mean to be very scientific, just to give a generally acceptable explanation of how it works.

  • @tristanvaillancourt5889
    @tristanvaillancourt5889 3 days ago +1

    Nice work dude. Dynamic quants eh. Very interesting.

    • @Bijanbowen
      @Bijanbowen 3 days ago +1

      Thanks very much. I agree; they make me excited about the future of running larger models on more constrained hardware.

  • @richardrombouts1883
    @richardrombouts1883 3 days ago +2

    Saying you use the full model, while it has been trimmed down by 80%, is misleading.

  • @JohnBasically
    @JohnBasically 3 days ago +9

    Clickbait title. He is running a shrunk-down version, if you watch the first minute. Not the full original. Some fancy qbit-shrunk version that just claims to perform really well, you got baited like e

    • @TheDriveToSuccessToday
      @TheDriveToSuccessToday 3 days ago +1

      Nonono, you're wrong

    • @BRIGS21
      @BRIGS21 3 days ago

      You try running 671b on your computer then.

    • @Bijanbowen
      @Bijanbowen 3 days ago +2

      It's the full model. Yes, it's a quant, but so are 99% of the local AI tests you see running on YouTube, regardless of which model.

    • @tringuyen7519
      @tringuyen7519 3 days ago +3

      @@Bijanbowen No, you're just running R1 1.5b. Stop being so delusional. R1 671b needs rack-mounted servers with multiple A100 GPUs & terabytes of DRAM to run locally!

  • @technopremium91
    @technopremium91 3 days ago +14

    Great video! But I feel like there's a bit too much hype around this. Here's the reality: if you take a large model that typically requires two high-end servers, each with 8x H100 GPUs, to run at full precision (FP16, as it should be), and then apply aggressive quantization techniques to shrink it, you end up with a much weaker model. Think of it like compressing a high-resolution image too much: you save space, but you lose a lot of detail. The lower the precision, the more accuracy the model sacrifices.
    At that point, the model might not even outperform GPT-4o, which is already blazing fast. In fact, you might be better off using something like LLaMA 70B quantized to 4-bit, which can run comfortably on a single 48GB VRAM GPU.
    Another major issue is speed. The tokens-per-second rate makes the model impractical because it takes too long to "think" before generating responses. And to even run it, you'd need at least three A6000 Ada GPUs, each costing around $7K, plus the rest of the hardware to build the system. That's a massive investment for something that, in most cases, won't provide much real-world benefit.
    The only scenario where this makes sense is if an organization needs a private LLM with RAG (Retrieval-Augmented Generation) to operate locally without internet access. In that case, the priority should be finding a model that balances performance, speed (high tokens per second), and low memory usage. That way, you can support multiple users simultaneously by increasing the batch size while keeping everything efficient.

    • @e8root
      @e8root 3 days ago +4

      The 671B 1.58-bit version should outperform 70B 4-bit because of how neural networks store and process information internally. That is why there is a move toward lower and lower bit-depth for real-time machine learning tasks.
      The main issue here is that we take an fp16 model and selectively drop precision, then retrain at the reduced precision to fix any issues the quantization introduced. Or did we? That would require stupendous processing power, but it would improve the quality of such a quantized model: not back to where the original is, but definitely better than the naive approach.

    • @ah89971
      @ah89971 3 days ago +1

      As he mentioned, it is not about quantization only.
      I think you need to think about future use cases, like agents producing multiple responses in a shorter time. The API will cost too much for that.

    • @Bijanbowen
      @Bijanbowen 3 days ago

      You bring up some very valid points indeed.

  • @germancruzram
    @germancruzram 3 days ago +2

    What computing specs are needed to run the large DeepSeek model?

    • @Bijanbowen
      @Bijanbowen 20 hours ago

      For the Unsloth dynamic quant I showed, greater than 80GB of combined system memory (RAM + VRAM).

  • @jameschern2013
    @jameschern2013 2 days ago

    The latest FP4 precision training has been released, showing great potential.

    • @Bijanbowen
      @Bijanbowen 19 hours ago

      That is awesome, thanks for mentioning this!

  • @juanjesusligero391
    @juanjesusligero391 3 days ago

    Bruh! This is insane! :D

    • @Bijanbowen
      @Bijanbowen 3 days ago +1

      I share your thoughts haha

  • @l1k2j3h400
    @l1k2j3h400 3 days ago +6

    Bro, can you make a video on setting up your PC to run all this? Thanks

    • @Bijanbowen
      @Bijanbowen 3 days ago

      I am very open to doing a video like this; I just don't know what specifically would be the best approach, meaning what would folks want to start from, and what would they want to be able to do by the end of the video?

    • @l1k2j3h400
      @l1k2j3h400 2 days ago

      @@Bijanbowen I think from building the PC to installing the software needed to run all the stuff. The goal is to build one rig from scratch, end to end, and run the same things you did.

  • @AndrewNorman-x7h
    @AndrewNorman-x7h 2 days ago +1

    Sorry if I missed this, but has anyone done a side-by-side performance comparison with the unquantized model?

    • @Bijanbowen
      @Bijanbowen 19 hours ago

      Not that I am aware of, but it is a good idea

  • @OddlyTugs
    @OddlyTugs 5 hours ago

    I am impressed with the 1.5b version. Running it on your phone is cooler than running the big baddie on a 4090, maybe.

  • @marcomerola4271
    @marcomerola4271 3 days ago +1

    I am a bit surprised though, as Llama 3.3 could make the game work well on the first try, also with the correct pipe logic. Let's wait for new updates on the model :)
    Thanks for the great content

    • @Bijanbowen
      @Bijanbowen 3 days ago +1

      Thanks very much! Yes, it is more about how it was able to do it at such a large size reduction from the original and still output working code. I am excited for more efficient ways to run bigger models.

  • @TheZimberto
    @TheZimberto 2 days ago

    Great content, Bijan! Can you please increase the treble in your audio?

    • @Bijanbowen
      @Bijanbowen 19 hours ago

      Thanks very much! I will see what options there are in Final Cut to help improve the audio.

  • @brahimwalid9926
    @brahimwalid9926 2 days ago +1

    Thank you for the video. How do you think this model will run on the upcoming NVIDIA Project DIGITS with 128GB of unified memory?

    • @Bijanbowen
      @Bijanbowen 19 hours ago

      Thanks very much! I am hoping to try this as soon as it comes out. I believe it will be able to, but I can't say for sure, as we will need to see how much of the 128GB is usable!

  • @atahansensatar9295
    @atahansensatar9295 3 days ago

    Great vid !!

  • @TheGannoK
    @TheGannoK 1 day ago

    now we just need an uncensored version

  • @amortalbeing
    @amortalbeing 3 days ago +5

    @8:54 For a moment I thought that with 64GB of RAM and an RTX 3080 I'd be able to run the full DeepSeek R1 locally! 🥺
    But 128GB of RAM and two 3090s? 😭

  • @golvellius6855
    @golvellius6855 3 days ago +1

    What were the hardware specs you used for the test?

    • @Bijanbowen
      @Bijanbowen 20 hours ago +1

      It is 2x 3090 Ti, 128GB DDR4-3200, an MSI Z690 mobo, and an i7-12700 CPU.

  • @osas1684
    @osas1684 3 days ago

    Cool video

  • @RobFreeman83
    @RobFreeman83 2 days ago

    Could this same principle be applied to allow more capability of AI models running on hardware like the Orin Nano S?

    • @Bijanbowen
      @Bijanbowen 19 hours ago

      Yes, it is likely we will see more models like this in the future!

  • @milieu427
    @milieu427 13 hours ago

    How much RAM does a computer need to run the full 671B comfortably?

    • @Bijanbowen
      @Bijanbowen 5 hours ago

      For this dynamic quant, they say over 80GB of total system memory. For the full, unquantized model, I am not 100% sure, but a very large amount lol.

  • @ericb7937
    @ericb7937 3 days ago

    Could someone explain the relative performance of the various DeepSeek models vs 4o and o1? Isn't DeepSeek V3 like 4o in relative capabilities?

  • @etheolee4430
    @etheolee4430 3 days ago

    impressive

  • @mrlaszlo
    @mrlaszlo 2 days ago

    Hi! Is there any correlation between the model size and the graphics card memory? What happens when the GPU has no more memory to work with? The CPU can handle that with swapping, but I don't know about the graphics card. BTW, I like your videos; I have subscribed.

    • @Bijanbowen
      @Bijanbowen 19 hours ago

      Thanks for the kind words. The model size does generally correspond to the amount of GPU memory used, but there are many variable factors like the quant, model architecture, etc. Depending on how things are configured, the excess can be handled through the CPU/RAM when the GPUs tap out, but some setups will just give an OOM error instead.
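
      One common way this split is handled is llama.cpp-style layer offloading. A minimal sketch using the llama-cpp-python bindings (the GGUF filename is a placeholder; n_gpu_layers is the knob, and any layers that don't fit in VRAM run from system RAM on the CPU):

      ```python
      from llama_cpp import Llama

      llm = Llama(
          model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder path to your quant
          n_gpu_layers=20,  # layers offloaded to VRAM; the rest stay in RAM
          n_ctx=2048,       # context window; larger contexts cost more memory
      )

      out = llm("Why is the sky blue?", max_tokens=64)
      print(out["choices"][0]["text"])
      ```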

  • @itubeutubewealltube1
    @itubeutubewealltube1 3 days ago +1

    Hey, curious... what's the average file size moving from the drive to RAM and then back again? And how many files total are moved in any given amount of time while an operation is taking place? Thx

    • @Bijanbowen
      @Bijanbowen 3 days ago +1

      I don't actually have an answer for this right now, and I did not check while running. I will keep this query in mind for future videos as this is a great question you raise.

    • @itubeutubewealltube1
      @itubeutubewealltube1 3 days ago +1

      @@Bijanbowen Thanks... that would be super helpful. Have you seen the new Wes Roth video? Thirty-dollar setup... I'm going to watch it now.

  • @ritooo6969
    @ritooo6969 3 days ago

    W man, nice vid fr. Is there any way to run the R1-Zero version? I'm just curious what it actually outputs, since there's no distillation for it. Either way, hope you have a nice day. You're a good man :)

    • @Bijanbowen
      @Bijanbowen 20 hours ago

      Thanks very much. I don't actually have an answer to this!

  • @ngana8755
    @ngana8755 3 days ago +1

    0:01 Is that a vintage (1980s) Apple Mac behind you?

    • @Bijanbowen
      @Bijanbowen 3 days ago

      Yes! There are three of them there: a Macintosh Classic, a PowerBook 140, and a Macintosh Portable.

  • @hyeung1
    @hyeung1 3 days ago

    Can you try asking this question on your setup:
    How many R's are in "strawberrry"?
    The 7b model gives a wrong answer.
    The 14b/32b models insist on correcting the spelling and always return 3 as the answer. Ugh....
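
    For what it's worth, the literal count is easy to verify in code; "strawberrry" as typed contains four R's, so the models' "corrected" answer of 3 is answering a different question:

    ```python
    word = "strawberrry"  # the misspelling from the comment above
    print(word.count("r"))  # prints 4
    ```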

    • @Bijanbowen
      @Bijanbowen 3 days ago

      If I do a comparison test between them all I will do a strawberry test!

  • @AspenVonFluffer
    @AspenVonFluffer 3 days ago +1

    And what GPU is in that machine?

    • @jayco10125
      @jayco10125 3 days ago +2

      There were two 3090 Tis in the performance monitoring.

  • @anewworldishappening
    @anewworldishappening 1 day ago

    Could it do games like a goblin dot-eater around a maze?

    • @Bijanbowen
      @Bijanbowen 19 hours ago

      I am not sure tbh, I want to say yes, but not certain!

  • @CDanieleLudus
    @CDanieleLudus 3 days ago

    I wonder, I recently purchased an Alder Lake 12700K with 32GB of RAM. The motherboard has only two memory slots, but I could potentially install 2 DDR5 modules for a total of 96GB, I think. Then in theory, I need 500GB to install the AI model, I am not sure. The PC runs Windows 11 Pro, so I'd also need to create a dual boot and install Ubuntu on one of the partitions. Would it run though?

    • @alcoholrelated4529
      @alcoholrelated4529 3 days ago +1

      Instead of dual boot, you can use Hyper-V and WSL.

    • @Bijanbowen
      @Bijanbowen 3 days ago +1

      I think you may be able to pull it off, and as the other person wrote, WSL is an option for your system as well.

  • @Adam-fl9uc
    @Adam-fl9uc 3 days ago

    A little bit of regressive news. I mean, yeah, we could also make a video about switching the computer on.

  • @AspenVonFluffer
    @AspenVonFluffer 3 days ago

    How much RAM do you have on that machine?

  • @nahum8240
    @nahum8240 3 days ago

    holy shit man, on local! this 4090 needs to be well cooled😅🤣

    • @Bijanbowen
      @Bijanbowen 3 days ago

      I am never liquid cooling again hahaha

  • @brian95240
    @brian95240 3 days ago

    10:45 is where he shows how to install it.

  • @avetissian
    @avetissian 3 days ago +1

    Yo Bijan, I see you're out here playin' Flappy Bird with the big boys, huh? Running a 671 billion parameter AI on a rig that's more enthusiast than enterprise, that's some straight-up wizardry! 🧙‍♂️✨ But let's cut to the chase - you've squeezed this colossal AI into a 131 GB package like it's a pair of jeans two sizes too small. 👖 So here's my question: If you had to explain to your grandma, who still thinks a gigabyte is a new brand of chocolate, how you managed to get this AI beast to fit without losing its smarts, what would you say? And hey, keep it snappy, no one's got time for a lecture that'll put us to sleep faster than her turkey dinner! 😴🦃

    • @bittertruthnavin
      @bittertruthnavin 3 days ago +1

      This obviously sounds like an AI-generated comment! Be better next time.

    • @e8root
      @e8root 3 days ago

      His explanation of how the network was resized is nonsense. He said it's like shrinking an image, which it isn't.
      A correct explanation, and something you could show to anyone and they would get the idea: take a book with images. The text is mostly black on white, and if you reduce the number of colors from 65K to between 2 and 3, it doesn't change much about its readability. For images this looks bad, and some images can become unrecognizable, so you use higher precision, like 16 colors. Take any photo and make it 16 colors, especially with dithering (which is like running a few retraining passes to realign weights to the lower precision), and the processed photo will look very much like the original, just with less nuance in the colors.
      While at it, you could show how quantizing everything to e.g. 6-bit preserves the model without being too selective, by dropping the number of colors to 64; compared to 4-bit, it will look much less changed. Do 8-bit, so 256 colors, and you might not be able to notice any difference.
      And this is exactly what quantization is: the same process as dropping the number of colors.
      This 1.58-bit version selectively changes the number of possible weight values between neurons where that change doesn't disturb how the network operates too much, while leaving the parts of the network that require higher precision alone and pushing everything else to as low a precision as possible.
      You could explain dynamic quantization to your grandma just fine with these examples, and especially to someone who has any experience with image manipulation on computers.

    • @Bijanbowen
      @Bijanbowen 3 days ago

      Uhhh lol

    • @Bijanbowen
      @Bijanbowen 3 days ago

      @e8root This example may be more technically correct, but for a general audience, let alone a typical grandma, you would have lost them after mentioning 65K colors, dithering, neurons, etc. It is obvious, based on your chosen username and remarks, that you are talented in this area; however, that can also lead to a disconnect when explaining something in a way that resonates with those who lack said understanding. Meaning, your command of the subject is very good, which makes it harder to relay the info to someone who may not have the same grasp.

  • @anatalelectronics4096
    @anatalelectronics4096 3 days ago

    ok you got my attention, which is all you need ;)

  • @BeekuBird
    @BeekuBird 3 days ago +1

    I was so disappointed after I thought you were reaching for puppets to explain quantization, and then you brought out a poster... so disappointed 🥺

    • @Bijanbowen
      @Bijanbowen 3 days ago

      LOL I'll keep that in mind for next time..

  • @amihartz
    @amihartz 3 days ago

    What if you run the unquantized version on a Raspberry Pi 1B using a 2TB SD card with half of it partitioned as swap space? 😏

  • @gynthos6368
    @gynthos6368 3 days ago

    I was expecting FP16 when reading the title :(

  • @mrpro7737
    @mrpro7737 3 days ago

    Can they do 0.001-bit quantization? 😂 I have a 12GB VRAM GPU.

  • @PaulRichardson_Canada
    @PaulRichardson_Canada 3 days ago

    Does ur head hurt????

  • @wyphonema4024
    @wyphonema4024 3 days ago +2

    In the ocean of knowledge, we are like drifting ships, trying to steer by the compass of reason, yet constantly blown off course by storms of emotion. Human cognition is like a maze built of countless mirrors, each reflecting a different truth, yet contradicting the others and defying comprehension. We pursue truth, only to find that truth itself is a many-faceted prism, each face gleaming with a different color, and what we see is merely an insignificant corner of it. Science, philosophy, art: these crystallizations of human wisdom all seem to approach the ultimate answer from different angles, yet the closer we get, the more the answer's boundaries stretch toward infinity, as if forever out of reach. We try to string the world's fragments together with chains of logic, only to discover that the chains of logic are themselves made of fragments. Perhaps true wisdom lies not in finding the ultimate answer, but in learning to sail through uncertainty, to find balance amid contradiction, and to piece together from the fragments our own incomplete picture.

  • @RobertLeeMonterroso
    @RobertLeeMonterroso 3 days ago

    goddamn what machine do you have

    • @Bijanbowen
      @Bijanbowen 3 days ago

      It is nothing special aside from the 2x 3090 Ti. It has a 12th-gen i7 and 128GB of DDR4.

  • @andrecarasas6427
    @andrecarasas6427 3 days ago

    Can I run this model on my 8GB Mac mini?

  • @trader548
    @trader548 3 days ago

    Ask the LLM how to build a custom LLM, then use actions to build it. Hey presto, LLMs are building more LLMs.

    • @Bijanbowen
      @Bijanbowen 20 hours ago

      lol this is probably happening in AI labs

  • @jeffwads
    @jeffwads 3 days ago +1

    1.58-bit quant. So the village idiot version of the model. At least we can compare the output to the full 8-bit version online. Thanks.

    • @Bijanbowen
      @Bijanbowen 3 days ago

      Yes, seemingly the lowest usable output version. I loved the comparison they gave of the different results depending on the quant and whether it was dynamic or not.

  • @cyborgmetropolis7652
    @cyborgmetropolis7652 1 day ago

    Invest in a server farm and charge clients for API access to DeepSeek at 10 cents on the dollar versus the cost of OpenAI.

    • @Bijanbowen
      @Bijanbowen 19 hours ago

      Solid business plan, I'm in hahaha

  • @PITERPENN
    @PITERPENN 3 days ago

    The shown token speed is absurd, and accuracy with that quant is questionable at best.

  • @RajAgrawal1
    @RajAgrawal1 3 days ago

    Misleading. You started off the video hyping it up, giving the impression that an average gaming PC would do. But later in the video you reveal a beefy system is still needed.

    • @Bijanbowen
      @Bijanbowen 3 days ago

      I said "enthusiast grade desktop PC" with air quotes at 11 seconds in; sorry if it sounded like I was saying an average PC could run it.

    • @RajAgrawal1
      @RajAgrawal1 3 days ago

      @@Bijanbowen "Enthusiast grade desktop PC" is super vague. A 3060 Ti with 8GB of VRAM and 32GB of system RAM is also enthusiast grade. What I'm pointing at is that your video could provide much more value if you structured it better and added key information like system specs with each test.

  • @okolenmi7511
    @okolenmi7511 2 days ago

    It's better to test full 80B instead...

  • @laurdys5421
    @laurdys5421 3 days ago

    DeepSeek API error 503

    • @Bijanbowen
      @Bijanbowen 3 days ago

      Haven't used the API; can't say, unfortunately.

  • @inout3394
    @inout3394 3 days ago

    When explaining quantization, it would be better to give an example about file compression, like WinRAR or 7-Zip; that would be more understandable for people.

    • @alcoholrelated4529
      @alcoholrelated4529 3 days ago

      those are lossless though

    • @Bijanbowen
      @Bijanbowen 3 days ago +1

      I feel like more people will understand pointing to different parts of a picture than bringing up compression.

  • @cristallo33
    @cristallo33 3 days ago

    Interesting. Looking at the GPU utilization (nvidia-smi), one sees that GPU memory is used, but almost none of the compute capability. I wonder how much the whole thing benefited from using the GPUs; it might not be much slower (if at all) when just going without them. Unless more or less the whole model fits into GPU memory, the GPUs cannot really unfold their performance. And when more than one GPU is used (which is required for such large models), the GPUs should support fast memory-to-memory communication like NVLink, so as not to get throttled by going over the PCI bus.

    • @Bijanbowen
      @Bijanbowen 3 days ago +1

      That is an excellent point; it would be interesting to try it without any GPU involvement.

  • @4vead21
    @4vead21 3 days ago

    Try asking DeepSeek about Xi, then ask about the president of any other country. Also ask about who owns WPS. Compare the answers to ChatGPT.

    • @Bijanbowen
      @Bijanbowen 20 hours ago

      Every big model has something like this.

  • @sebagsm2
    @sebagsm2 3 days ago

    Use Tab completion for filenames :D

  • @electricsushi
    @electricsushi 3 days ago

    4:37 LOL

  • @AspenVonFluffer
    @AspenVonFluffer 3 days ago

    where is my lady?

    • @Bijanbowen
      @Bijanbowen 3 days ago

      She doesn't like it up here lol

    • @AspenVonFluffer
      @AspenVonFluffer 3 days ago

      @@Bijanbowen Tell her I said Hi! Tell her to say hi to her brother for me also ;)

  • @AspenVonFluffer
    @AspenVonFluffer 3 days ago

    First!!!!!!!!!!!!!!!!!!!!!!!!!!!!

  • @adventureglobaloffical
    @adventureglobaloffical 3 days ago

    no moat gg

  • @ralffig3297
    @ralffig3297 3 days ago

    You need RAM as big as this guy's.

  • @adg2302
    @adg2302 3 days ago

    Why do you sound like *Trump* lmfao

    • @Bijanbowen
      @Bijanbowen 3 days ago

      I have never heard that comparison haha

  • @palfers1
    @palfers1 3 days ago +1

    I think "run" is a rather optimistic way to describe an output rate of one token per second.

    • @Bijanbowen
      @Bijanbowen 3 days ago +1

      I would have to disagree; I think any output could be considered running.

    • @palfers1
      @palfers1 1 day ago +1

      @@Bijanbowen I would call it "walking" or "crawling" - not "running".

  • @marcol869
    @marcol869 3 days ago

    Any chance you can do a video on running this on Windows?

    • @Bijanbowen
      @Bijanbowen 3 days ago

      Unfortunately not something I can do

  • @Jibs-HappyDesigns-990
    @Jibs-HappyDesigns-990 3 days ago

    🍿🍿👍👍🦾 Thanks 4 sharing your expertise! Congratulations on these many steps! 👍💪 Good luck! 🍌
    yup! 🍒🍌🍒🥳🌼 yup! yea! Jetson content!! woohoo!

      @Bijanbowen 3 days ago
      @Bijanbowen  3 дні тому

      Thanks, appreciate the support and hope you enjoy it!