Run Local LLMs on Hardware from $50 to $50,000 - We Test and Compare!

  • Published Nov 18, 2024

COMMENTS • 633

  • @DataIsBeautifulOfficial
    @DataIsBeautifulOfficial 1 month ago +440

    I'm here for the moment when the Pi says: "I can't do that, Dave"

    • @richard_d_bird
      @richard_d_bird 1 month ago

      it has to wait for dave to forget his space helmet

    • @nathanielmoore87
      @nathanielmoore87 1 month ago +11

      Open the pod bay doors!!

    • @NigelBassman
      @NigelBassman 1 month ago +10

      The irony being that the Pi could do that

    • @eugrus
      @eugrus 1 month ago +5

      1:17 on this part it would actually be I CAN DO THAT, Dave

    • @markusmcgee
      @markusmcgee 1 month ago +3

      😆😆😆🤣

  • @LilaHikes
    @LilaHikes 1 month ago +84

    Dave, I appreciate your mindfulness of how valuable our time is and editing this vid down to a reasonable time frame.

  • @wozaiwodejia
    @wozaiwodejia 1 month ago +82

    The main (and almost only) factor for speed is memory bandwidth. Every token is generated by pulling the entire model from RAM and doing a bit of math on it. An 8 GB model on a 12 GB RTX 3060 Ti with 6 channels (of 2 GB each) gets 448 GB/s, for about 50 tokens/s (accounting for some overhead). That's why GPUs are so fast. If you have 2 channels of DDR4-3200 memory, you have 51.2 GB/s, so you'll get about 6 tokens/s, or around 1 token/s on a ~48 GB Llama 3 70B model with 4-bit quantization. DDR5 helps a lot, and so does having more than 2 channels. The CPU doesn't really matter. (Unless you're limited to 2933 MHz by a shoddy memory controller in a Ryzen 2600, upgrade to a 5600X, and get a 22% boost by pushing your DDR4 to 3600 MHz.)
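
A back-of-the-envelope sketch of the bandwidth argument in this comment, assuming decoding is purely memory-bound and every generated token streams the whole model once. The function names are made up for illustration, and real throughput will land below these ceilings because of overhead.

```python
def ddr_bandwidth_gb_s(mega_transfers_per_s: float, channels: int = 2) -> float:
    """Peak bandwidth of DDR memory: each 64-bit channel moves 8 bytes per transfer."""
    return mega_transfers_per_s * 8 * channels / 1000


def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Ceiling on decode speed if each token requires reading the full model once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    ddr4_3200 = ddr_bandwidth_gb_s(3200, channels=2)  # dual-channel DDR4-3200
    print(f"DDR4-3200 x2 bandwidth: {ddr4_3200:.1f} GB/s")
    print(f"8 GB model at 448 GB/s: {max_tokens_per_second(448, 8):.1f} tokens/s ceiling")
    print(f"8 GB model on DDR4:     {max_tokens_per_second(ddr4_3200, 8):.1f} tokens/s ceiling")
    print(f"48 GB model on DDR4:    {max_tokens_per_second(ddr4_3200, 48):.1f} tokens/s ceiling")
```

The ceilings line up with the figures quoted in the comment: roughly 50 tokens/s on the GPU after overhead, about 6 tokens/s for the same model on dual-channel DDR4, and about 1 token/s for a ~48 GB model.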

    • @wozaiwodejia
      @wozaiwodejia 1 month ago +3

      Ok, to be fair: if you're running Llama on an old ThinkPad X260, you actually do get twice the performance by running the model on *both* cores. Having true 256-bit AVX or better and more than two cores really helps with doing the math.

    • @andersjjensen
      @andersjjensen 1 month ago +4

      "A bit of math" is.... an interesting way of putting it. I'm aware that training is several orders of magnitude more compute intensive than inferencing, but whether I run in CPU or GPU mode, both are taxed pretty heavily. Never to 100%, which does indeed confirm that memory bandwidth/latency is the bottleneck, but still, taxing an 8-core CPU to 45% on LPDDR5-6400 is hardly "a bit of math".

    • @SquintyGears
      @SquintyGears 1 month ago

      @andersjjensen It really isn't that much math. The only reason it even registers as 45% is because we're talking about models that use all the input tokens and the output tokens as active bi-lstm nodes.
      So it's more like it's constantly rechecking its work.
      Just consider how fast the Mac Pro pumps the tokens out when any other benchmark doesn't make the GPU look all that impressive. The Mac Pro is more similar to an RTX 2060 with loads of fast RAM strapped onto it.
      This is a case where the way usage data is monitored isn't representative of how the hardware is really taxed. Usage monitoring is more an indicator of how full the wait queue is.
      Ah, I just realized you specifically mentioned CPU for the 45% figure. But either way, my point is that you can't actually extrapolate down from that number what the ideal hardware configuration would be. The same amount and bandwidth of RAM with half the raw compute is still much faster than it really takes, even if the usage seems to say it's the spot.

    • @JonVB-t8l
      @JonVB-t8l 1 month ago +2

      Use a Vega 20 GPU (excluding the Radeon VII) and you can pool VRAM with RAM to run whatever models you want. You can even add swap space on NVMe drives. I got Llama 405B running on a system with a Vega 56, which supports HBCC (although it's worse), and I used 4 NVMe drives in RAID 0 for swap. PCIe Gen 3 is part of the problem, but the system prioritized VRAM, then RAM, then swap, as I expected, so about 192GB of real RAM was used and only 600GB of swap.
      Vega 20 (MI60, for example) has PCIe 4.0, and Optane DIMMs or Optane U.2s would work better though.

    • @SquintyGears
      @SquintyGears 1 month ago

      @JonVB-t8l You can basically always do this. It's not Vega-specific. Computers just work that way.
      What you're doing is changing how it's reported to the system, so the basic flag checking that the software does before sending the model clears without complaining.
      But you could also just remove the flags or use wrappers that don't check.
      The reason they do try to prevent it is because you lose 90% of the speed when you do this. And it can be unstable on some systems.

  • @martyb3783
    @martyb3783 1 month ago +29

    I found this video both informative and entertaining! I chuckled when you mentioned that it made you sad to see that big boss PC struggling. Great video as always Dave!

    • @20chocsaday
      @20chocsaday 1 month ago +4

      I smiled too, but got the impression that Dave cares for his viewers.
      He is quite precise when he talks which rather suits me.

    • @swanstudios2018
      @swanstudios2018 1 month ago

      Definitely learned something there. 😀

  • @chrisdulledge6452
    @chrisdulledge6452 1 month ago +10

    Having failed to get the web server running on your previous WSL demo, I removed everything in frustration. Great to see it works from the command line equally well under Windows. I now have AI on my laptop (8G RAM, no GPU), something I never thought possible! Thanks for showing something for everyone.

  • @Ultimatebubs
    @Ultimatebubs 1 month ago +101

    Hey Dave, in your next LLM tutorial, can you give us a demo on how to connect external data sources to it? I'm struggling to wrap my brain around it.

    • @Fybre
      @Fybre 1 month ago

      Do you mean using your own reference documents? If so, take a look at AnythingLLM, it might meet your requirements

    • @justtiredthings
      @justtiredthings 1 month ago

      Check out N8N or Dify

    • @ИванИванов-б8у4и
      @ИванИванов-б8у4и 1 month ago

      LM Studio, AnythingLLM, or similar.

  • @08nittany
    @08nittany 1 month ago +7

    As someone who gave you "heat" in the last video, thank you for the follow-up!

  • @XTC3D
    @XTC3D 1 month ago +3

    Thanks for updating and including budget friendly options.

  • @seanwright4976
    @seanwright4976 1 month ago +21

    I rather liked your having demonstrated with WSL, as I was able to follow along on my Ubuntu server

  • @EhdrianEh
    @EhdrianEh 1 month ago +17

    I very much believe that local LLMs are an answer to privacy in the future. As long as a large group of open testers materializes, we can also try to remove bias as best we can.

  • @Madgod711
    @Madgod711 1 month ago +3

    Superb content. Not many channels with this amount of quality in terms of delivery.

  • @LanningRon
    @LanningRon 1 month ago +15

    The Llama 3.2 1B and 3B models run surprisingly well using Ollama on my OrangePi 5+ 8-core RK3588 processor with 8G RAM. Both models generate tokens at speeds that match or exceed normal human speech. I believe additional cores make a big difference. I also want to test these models on the Radxa X4 8G, N100 processor.

  • @drelephanttube
    @drelephanttube 1 month ago +3

    Thanks Dave, I really appreciate the time you spend making these videos for us. Really enjoy these geeky rabbit holes.

  • @speed0002
    @speed0002 1 month ago +2

    Thanks Dave! Really appreciate your time, and energy on this topic. I was playing with the former video yesterday and thought, "man I hope he does a little more on this".... and BAM, you did. THANK YOU!

  • @matt_b...
    @matt_b... 1 month ago +87

    11:00 I believe you've been running the 8B model if you're pulling 3.1 latest. I could be wrong, but I believe latest defaults to 8B flavor.

    • @reverse_meta9264
      @reverse_meta9264 1 month ago +19

      Correct, llama3.1:latest = llama3.1:8B

    • @Steamrick
      @Steamrick 1 month ago +16

      With a 5GB download there's no amount of quantization that could possibly fit 70B parameters. It's 100% the 8B model and probably at Q4_0 quantization, which is pretty aggressive and kinda lossy.

    • @joostwestra
      @joostwestra 1 month ago +7

      Came here to say the same. The 70B might be a great fit for the faster machines.

    • @sharpenednoodles
      @sharpenednoodles 1 month ago +2

      I haven't played with llama yet, mostly mistral, so I was also surprised when the 70b param model was only 5gb 🥲

    • @reverse_meta9264
      @reverse_meta9264 1 month ago +6

      @sharpenednoodles 70B llama3.1 is more like 40 GB 😅
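
The size argument in this thread can be checked with a little arithmetic. This is a minimal sketch, assuming a file size of roughly parameters times bits per weight, with round parameter counts of 8 billion and 70 billion, and assuming Q4_0 costs about 4.5 bits per weight (4-bit weights plus one 16-bit scale per block of 32). The function name is hypothetical, and real downloads also carry metadata and a few higher-precision tensors.

```python
def quantized_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate model size in GB (10^9 bytes) for a given bits-per-weight."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    Q4_0_BITS = 4.5  # assumption: 4-bit weights + fp16 scale per 32-weight block
    print(f"8B  at ~4.5 bpw: {quantized_size_gb(8e9, Q4_0_BITS):.1f} GB")
    print(f"70B at ~4.5 bpw: {quantized_size_gb(70e9, Q4_0_BITS):.1f} GB")
    # Even an unrealistic 1 bit per weight cannot squeeze 70B under 5 GB.
    print(f"70B at 1 bpw:    {quantized_size_gb(70e9, 1):.2f} GB")
```

Under these assumptions an 8B model lands around 4.5 GB and a 70B model around 40 GB, which matches the download sizes people mention here.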

  • @Billwzw
    @Billwzw 1 month ago +2

    I loved seeing how AI can bring super hardware to its knees. It instantly demonstrates why the AI cutting edge is moving to Blackwell and Rubin. Many thanks for this demo.

  • @eugene3d875
    @eugene3d875 1 month ago +1

    That Windows method is even more straightforward than the WSL one from the last video. Thanks for sharing!

  • @alastorclark3492
    @alastorclark3492 1 month ago +2

    I'm so glad you're doing a hardware comparison. I watched your previous video and wanted this immediately.

    • @alastorclark3492
      @alastorclark3492 1 month ago

      I'd prefer it directly on Linux, but ofc I'm sure I can figure that out myself. I'm just here to watch 😂

  • @DJCatmom
    @DJCatmom 7 days ago

    Dave, thank you for running those tests for us. While I am currently working with GPT through a web browser and looking forward to switching to the API, it is becoming more and more clear that the frameworks involved might hit hard limitations sooner rather than later, and running a local model will be my only option in the future. Seeing that it is feasible, even today, is very reassuring!

  • @Steamrick
    @Steamrick 1 month ago +11

    Hey Dave - 11:00 With a sub 5GB download there's no amount of quantization that could possibly fit 70B parameters. It's 100% the 8B model and probably at Q4_0 quantization, which is pretty aggressive and kinda lossy. You were running pretty much the smallest version possible.

    • @JonVB-t8l
      @JonVB-t8l 1 month ago +4

      I'm running 405B on an 8-year-old server with a Vega 56. Abusing the F outta HBCC to add RAM and swap into the pool of "VRAM". Yes, I have 600GB of the 810GB model running from swap spread across 4 NVMe drives.

    • @Steamrick
      @Steamrick 1 month ago +2

      @JonVB-t8l That's quite the setup. I'd be very curious how that performs.

    • @thecompanioncube4211
      @thecompanioncube4211 4 days ago

      @Steamrick I am pretty sure not well enough to be acceptable. Even with NVMe, I think the read/write speeds are about a quarter of those of a DDR4 RAM stick.

    • @thecompanioncube4211
      @thecompanioncube4211 4 days ago

      Came here to say this. I think 70B is like a 40GB model.

    • @Steamrick
      @Steamrick 4 days ago

      @thecompanioncube4211 Oh, even the fastest NVMe SSD is far less performant than a quarter of DRAM. It's not just the speed, it's also the latency that's much worse.

  • @BigiyePhilipo
    @BigiyePhilipo 16 hours ago

    Thanks so much for this wonderful opportunity. We're really loving your online classes.

  • @ArndBrugman
    @ArndBrugman 1 month ago +1

    I am freaking amazed to run this locally on my laptop (13900HX plus 4070 mobile), and it is only 2GB and performs amazingly. Thanks for sharing this Dave, great content piece! Thx!

    • @ADB-zf5zr
      @ADB-zf5zr 1 month ago +2

      Good luck with the longevity of your laptop!!! If you have any random problems, crashes, or things just not working, make notes of what and when (time, date), contact the laptop company, and have them officially note this as a warranty issue (if you have a warranty). Otherwise, make preparations for a replacement laptop. Good luck and best wishes.

    • @LittleBoobsLover
      @LittleBoobsLover 1 month ago

      and how do you use this 2gb (8B?) model in daily use?

  • @leoxiao2751
    @leoxiao2751 7 days ago

    Thanks, Dave. You've given me a lot more confidence in my beat-up 2015 MacBook Pro. Off to Ollama now!

  • @theritchie2173
    @theritchie2173 1 month ago +95

    Since some people (predictably) like to complain in your videos because you're not catering to their exact needs, here's my demand for a followup with you running it on your PDP-11.

    • @NeonfOxa
      @NeonfOxa 1 month ago +11

      Video to come out in 200 years

    • @20chocsaday
      @20chocsaday 1 month ago +1

      Do you want it done in real time?

    • @theritchie2173
      @theritchie2173 1 month ago

      @20chocsaday What's the max allowed length for a YouTube video, 10 hours?

    • @robertthomas5906
      @robertthomas5906 1 month ago +1

      Watch it turn out to be faster than the 50K Dell.
      I know, no chance of that. Yet a PDP-11 used to power a Xerox 9700 printer. It could read from network or tape, merge data with a form at 300 DPI, print at 2 pages a second duplex and do that hour after hour.

  • @justtiredthings
    @justtiredthings 1 month ago +1

    This testing is right up the alley of the sort of video that I've been looking for and I really appreciate it. Going through a wide range of machines is much more useful than just testing like a 20k machine. That being said, there's something I am super confused about. Before you start the Threadripper test, you said up till now we've been using the 70 billion parameter model. The download sizes were showing around 5GB and the 70 billion parameter model would be much larger than that on the order of over 10 times, even for a quantized version. And there's just absolutely no way a 70 billion parameter model would run on anything remotely close to as wimpy as a Raspberry Pi. I assume you misspoke, which does lead me into a request. I would actually really, really appreciate seeing this sort of range testing across a variety of machines, specifically for larger models around ~30 billion or ~70 billion parameters, because I assume that most of the early tests were for some quant of the 8 billion parameter model. Most of the results available online are for the 8 billion parameter models, which is really a shame because higher end consumer machines like a gaming PC or an M2 Ultra really should be able to handle larger models around 30-70 billion parameters.

  • @ryanlemere4212
    @ryanlemere4212 1 month ago +1

    Pretty awesome the pi even ran. Super cool Dave thanks as always man!

  • @OceanusHelios
    @OceanusHelios 1 month ago

    I saw your previous video. It made me want to make my system dual boot. Your first video I followed and was able to execute the LLM you suggested within VirtualBox. It worked just fine and I was grateful.
    And so I installed Linux Mint in a dual boot, and your FIRST video was inspiring enough for me to figure out how to get Ollama on Linux and then pick out any LLM I wanted and install it from there.
    I am grateful for this video, but to be fair, your first video shouldn't have garnered any hate. Because, if people are even your viewers they should be savvy enough to figure things out on their own, and use your videos as a guide. Otherwise, those viewers wouldn't be your subscribers if they were that afraid of their own computers.

  • @vulcan4d
    @vulcan4d 1 month ago +1

    I built a system with 4x P102-100's which total 40GB of GPU ram. Now I can use the 70b quantized models and it is awesome! Best bang for your $$$.

  • @justadirtblock681
    @justadirtblock681 25 days ago

    Wonderful!! Actually very useful. I plan on upgrading my own PC to do AI stuff, and now I can see roughly how well it'll do it! Thank you so much!

  • @aquinamedia4508
    @aquinamedia4508 1 month ago +7

    I've run Windows on my RPi4; tutorial videos are out there. Not too complicated.

  • @tedkrapf1302
    @tedkrapf1302 1 month ago +1

    You needed to run Minesweeper on the $50k Dell to really push it ;) Another great video Dave, thanks.

  • @MandrakeDCR
    @MandrakeDCR 1 month ago

    This is amazing. I just installed it on my home PC. ZorinOS / Ryzen 5 3600 / AMD 5700XT / 16GB ... It runs great (running the 3.2:latest). I have been trying to learn how to make my first game in Unity and I've been struggling with some basic ideas on the interface to code a basic shader to apply to a material and get it into the scene. The format this thing uses is perfect! ChatGPT couldn't tell me in a way I understand, couldn't find a tutorial that was what I wanted... this thing spit it out in 3 questions. I can actually understand exactly what it means, not just some vague concept I'm going to have to stumble through! I don't understand how this is even possible with such a small data set, but I will take it. THANK YOU!!!!

  • @martinsykes1257
    @martinsykes1257 1 month ago

    Nice content. I like that you seem completely agnostic between Mac, Linux, and Windows, and even the different hardware.

  • @peterxyz3541
    @peterxyz3541 1 month ago

    I appreciate this vid of using “affordable” or affordable hardware.
    I’m already on a Mac; I’m researching Ubuntu and Windows as options for some old vid cards.

  • @Aleksei-p9g
    @Aleksei-p9g 1 month ago

    Turns out, 3.1 runs reasonably well on 4080. Thanks for the tip! Until this video I didn't know I could run an LLM on my PC.

  • @aperson7624
    @aperson7624 1 month ago

    Thanks for making this video. I'm building a new PC and wanted to play with running local LLMs. To see just how fast a 4080 is...holy crap!

  • @TheGrizz485
    @TheGrizz485 1 month ago +16

    The 7940HS CPU on your mini PC has dedicated AI hardware acceleration dubbed "Ryzen AI". Hopefully the project enables and starts optimizing for it (in addition to the iGPU) in the future. Looks promising for cheap devices.

    • @artim96
      @artim96 1 month ago +4

      Only at 10 TOPS according to their website. For comparison, the Copilot+-PCs need at least 40 TOPS. So questionable if it's accelerating anything.

    • @Zaf9670
      @Zaf9670 1 month ago +2

      There are projects working on incorporating ROCm which I believe can leverage the TOPS AI processor. Similar to MLX based Apple Silicon models.

  • @LarryStrawson
    @LarryStrawson 1 month ago

    You are always entertaining, Dave! And considering your niche topic, this is true talent! I'm not even that much of a nerd, nor am I interested in programming or computer hardware, but I really enjoy your channel. Keep up the great work!

  • @schedarr
    @schedarr 6 days ago

    Llama 3.2 3B is the clear winner for general chat tasks on local machines. I just love it! Thanks for testing the 405B. I was wondering how fast it would go and how much RAM it needs. Now I know it's not worth it. I'm looking forward to Llama 3.2 7B, which I think will be the sweet spot.

  • @dingolovethrob
    @dingolovethrob 1 month ago +1

    Yet another fab video, Dave. (It's amazing how many people who have never produced anything in their lives feel compelled to criticize the heck out of other people's work)...

  • @WolfsKonig
    @WolfsKonig 1 month ago

    Nice pivot and delivery, sir. Respect. I can't wait to follow along.

  • @Dattobayo
    @Dattobayo 1 month ago

    These vids are exactly what I need right now. Good to know that the pi can actually run it in some capacity.

    • @wtmayhew
      @wtmayhew 1 month ago

      Even an 8G RAM Pi 5B is still under 100 dollars US, so it would be a reasonable entry-level platform. Beyond the learning experience of setting up AI and LLMs, there might be utility in having a Pi as an offline server which could e-mail answers to questions that don’t need to be answered within a few seconds in real time.

  • @lhargil
    @lhargil 1 month ago

    So kewl. Was just about to look for resources regarding this topic and this video got recommended. Amazing, thank you!

  • @tadmarshall2739
    @tadmarshall2739 1 month ago

    Wow, educational, interesting and inspiring! Thanks for showing us what is possible, in detail. I'd not even heard of ollama!

  • @AnonYmous-yz9zq
    @AnonYmous-yz9zq 1 month ago

    This video should save me a lot of time when I get around to running an LLM, many thanks.

  • @OhRonaldo
    @OhRonaldo 1 month ago +15

    That was best of the internet right there. Thanks, Dave.
    Best I can do is like and say "thank you" since I've already subscribed. How about a heart? ❤

  • @StarOfDavidKush
    @StarOfDavidKush 1 month ago

    @Dave's Garage: Thanks for the video! That LLM on Raspberry Pi looks painful, ouch.
    I am testing some new beta releases of Windows Server and other Windows OSes, and I've got my rig over here running on a Corsair Origin Neuron with an AMD 7950X3D and an NVIDIA 4090 GPU. I was not impressed with the last LLM software I used, but I am going to check out your recommendations in the video. Thanks! I usually go to ChatGPT for my subscription plan, but there are many use cases where I prefer working offline. Thanks again for all the awesome videos!

  • @TomasRamoska
    @TomasRamoska 1 month ago

    Awesome video Dave. I was playing with Stable Diffusion. Will try to explore Llama in WSL

  • @peteradshead2383
    @peteradshead2383 1 month ago +1

    I'm surprised how smart an offline LLM is. I asked the question "I have a Ryzen X670E motherboard with a Ryzen 9700X CPU which idles at 45W from the wall; how much is from the chipset?", and the answer was correct and relevant, with pages of it.
    I tried words with multiple meanings, spelling mistakes, etc., and the answers were correct.
    Do LTO drives need drivers, what is the difference between LTO 5 and 6: all the world's knowledge in a few gigabytes.

  • @HaydonRyan
    @HaydonRyan 1 month ago

    Love it. Would also like to see a chart showing tokens per second on the same model across the hardware. Good Ollama benchmarks are hard to come by.

  • @ricardoandresriquelmerios5995
    @ricardoandresriquelmerios5995 1 month ago +10

    I used this on my machine, an i5-14500 with 16GB DDR5 and an Nvidia RTX 4060 GPU running Linux Mint, and the speed is good enough for me.

    • @ArthurFlimbimlinson-x1r
      @ArthurFlimbimlinson-x1r 1 month ago +2

      What LLM?

    • @firecat6666
      @firecat6666 1 month ago +3

      @ArthurFlimbimlinson-x1r Likely one with half a dozen to a dozen billion parameters. I get around 20-30 tokens/s on my RTX 3060 12 GB when using LLMs with those sizes. Intel i5-12400F, 32GB DDR4 and Windows 11 if you want the other details, but I'm pretty sure the rest of your PC can be a potato as long as the entire model plus context window cache fits in the GPU.
      I can also load a 70 billion parameter model that's been cut down to a smaller size (quantized to 2 bits), but it uses all my RAM+VRAM and runs at a glorious 1 token/s.

    • @ricardoandresriquelmerios5995
      @ricardoandresriquelmerios5995 1 month ago +1

      @ArthurFlimbimlinson-x1r Dolphin

  • @Bp1033
    @Bp1033 1 month ago +3

    The fact that you got llama-3.1:405B running at all at home is just impressive, even if it's mostly running on CPU.
    My Ryzen 7 is hardware capped at 128GB of system RAM; I really should have waited for the AM5 socket.

    • @darksushi9000
      @darksushi9000 1 month ago

      I have a 7950x with 32GB RAM and a 3090. No probs running 405B if I can wait for the result. Also have a 64 core Threadripper, 256GB RAM and a 3090. Both machines are level pegging. The more GPU VRAM you have, the bigger your model can be

    • @firecat6666
      @firecat6666 1 month ago

      @darksushi9000 Which quant of the 405B model are you using on your 32GB RAM machine? I can barely fit a 2-bit quant of the 70B model in 32GB RAM plus 12GB VRAM.

    • @darksushi9000
      @darksushi9000 1 month ago

      @firecat6666 I am running the Q4

    • @firecat6666
      @firecat6666 1 month ago

      @darksushi9000 Hmm, that doesn't fit in 32GB of RAM unless you have 10 RTX 3090s. Didn't you mean to say you're running the 70B on your 32GB RAM machine and the 405B on your 256GB RAM machine?

    • @JonVB-t8l
      @JonVB-t8l 1 month ago

      I'm running full-fat 405B on a 7-year-old Xeon Gold system with 192GB of RAM and a Vega 56 GPU.
      I mean, I'm cheating because I'm using 4 NVMe drives in RAID 0 as swap space and HBCC to pull it off, but hey... It works, sorta.
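
A rough way to see where a model ends up in the setups discussed in this thread. This is a hypothetical helper, not anything Ollama exposes, and it deliberately ignores the context cache and OS overhead, so real headroom is smaller. The 228 GB figure assumes 405 billion parameters at about 4.5 bits per weight, and the 8 GB of VRAM for the Vega 56 is also an assumption.

```python
def placement(model_gb: float, vram_gb: float, ram_gb: float) -> str:
    """Classify where the weights can live, fastest tier first."""
    if model_gb <= vram_gb:
        return "vram"        # fully on the GPU: the fast case
    if model_gb <= vram_gb + ram_gb:
        return "vram+ram"    # spills into system RAM: much slower
    return "disk"            # needs swap or memory mapping from storage


if __name__ == "__main__":
    q4_405b_gb = 405e9 * 4.5 / 8 / 1e9  # roughly 228 GB under the stated assumption
    print("405B Q4, one 3090 + 32 GB RAM:", placement(q4_405b_gb, 24, 32))
    print("405B Q4, ten 3090s:           ", placement(q4_405b_gb, 10 * 24, 32))
    print("405B fp16 (810 GB), Vega 56 + 192 GB RAM:", placement(810, 8, 192))
```

The first and last cases fall through to disk, which is why the swap-backed setup works only "sorta", while ten 24 GB cards would just about hold a 4-bit 405B.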

  • @DeanHorak
    @DeanHorak 1 month ago

    Good info… answers many questions I had without me having to do the experiments myself, so thanks.

  • @doozowings4672
    @doozowings4672 1 month ago

    I don’t know why anyone would give you heat; that video was OUTSTANDING!! I was up and running on my HP Gen 9 with an old Nvidia P2000 in no time at all! The thing ran GREAT! The replies were smooth and fast… The thing I don’t understand is the three variants or size options in 3.1. I want the most powerful model available. My GPU seems to be doing just fine and I have a ton of CPU and memory.

    • @firecat6666
      @firecat6666 1 month ago

      Bigger models are (usually) smarter. But to run them fast enough, you need to fit the entire thing in VRAM or else your GPU has to pull data from the RAM, which is slow as fuck. Try loading a model that's bigger than your 5GB of VRAM and see how it goes for you, I bet you'll be disappointed.

  • @requiem9586
    @requiem9586 1 month ago

    I think it's worth mentioning that the quality of a word is also important not just the speed of an idea. something well thought out has more value and I personally could see the value of your expensive machine as a host-body for the language model in the quality of the sentence that it came up with. Maybe it's nice to think of something for a bit, but I didn't see the word 'delightful' in the other examples. Thanks for making this video

  • @msromike123
    @msromike123 1 month ago +7

    Ok, thanks Dave. Got it running. Any interest in setting it up to web scrape and analyze results based on a local query?

  • @randaldavis8976
    @randaldavis8976 1 month ago

    Nice episode. I have been playing with a local AI in Win 11 (using LM Studio) on a 7950X / RTX 3070 Ti. I also have an RPi 4, an Orange Pi 5+, and an old 4790K that I am loading Linux on. This video helps me decide what's fast enough.

  • @txkflier
    @txkflier 1 month ago

    And..., the $50,000 Dell said, "I'm sorry, Dave. I can't do that". Excellent video. Much better than the previous one on LLM. I actually have it working now. Thanks!

  • @warezit
    @warezit 1 month ago +2

    🎯 Key points for quick navigation:
    00:00:00 *💡 Introduction & Overview*
    - Introduction to testing LLMs on different hardware setups, ranging from $50 to $50,000.
    - Motivation for addressing viewers' requests for more budget-friendly hardware and direct Windows installation.
    00:00:43 *🐢 Running on Raspberry Pi 4*
    - Attempt to run LLaMA on a Raspberry Pi 4 with 8 GB of RAM.
    - Installed on Raspbian, demonstrated extremely slow performance, impractical for real-time use.
    00:03:27 *🔄 Testing on Consumer Mini PC (Orion Herk)*
    - Upgraded to a $676 Mini PC with a Ryzen 9 7940HS and Radeon 780M iGPU.
    - Faster performance compared to Raspberry Pi, but model could not fit in GPU memory, relying on CPU instead.
    00:07:50 *🎮 Desktop Gaming PC with Nvidia 4080*
    - Running the LLM on a 3970X Threadripper with Nvidia 4080 using WSL 2.
    - GPU offloading enabled faster performance, similar to ChatGPT, demonstrating good use of available hardware.
    00:09:42 *🍎 Mac Pro M2 Ultra Testing*
    - Tested on Mac Pro with M2 Ultra and 128 GB unified memory.
    - Model ran efficiently with GPU usage around 50%, producing rapid responses, demonstrating M2 Ultra’s suitability for LLMs.
    00:10:51 *🚀 High-End 96-Core Threadripper & Nvidia 6000 Ada*
    - Attempt to run a 405-billion-parameter model on an overclocked Threadripper with Nvidia 6000 Ada.
    - Performance lagged significantly, highlighting that larger models can struggle even on high-end consumer hardware.
    00:13:12 *⚡ Efficient Model on High-End Hardware*
    - Switching to a smaller, more efficient LLaMA 3.2 model on the high-end setup.
    - Demonstrated much better performance, producing rapid answers in real-time, highlighting the importance of model size optimization.
    00:14:33 *📢 Conclusion & Call to Action*
    - Summary of testing LLMs on various hardware from low-end to high-end.
    - Encouraged viewers to subscribe and check out more content, highlighting the educational and entertainment aspects of the video.
    Made with HARPA AI

  • @Ranchhand323
    @Ranchhand323 1 month ago +1

    Even though you brought the 50K machine to its knees, and we're somewhat saddened, I'm guessing there was a well-hidden smirk as well.. 😅

  • @dazecm
    @dazecm 1 month ago

    WSL2 Linux on Windows is a perfectly cromulent decision. That WSL2 tech is magical.

  • @kristenwaite5955
    @kristenwaite5955 1 month ago

    I also came here for the dog playing the piano. You're the best, Dave!!

  • @macbaryum
    @macbaryum 1 month ago

    The salute gives me goosebumps. Makes me think I am a war hero that served in a war zone when I didn't.

  • @thomaspripley
    @thomaspripley 1 month ago

    Perfect! Just in time for me to install Ollama on my new Lenovo Yoga Slim 7x Copilot+ PC with the Snapdragon X Elite processor and NPU!

  • @zandanshah
    @zandanshah 22 days ago

    Well-made, full of information for the public.

  • @LouwPretorius
    @LouwPretorius 1 month ago

    Thanks for listening to the comments. Great video!

  • @svenvandevelde1
    @svenvandevelde1 1 month ago

    It was nice to see the canals of the city of Brugge in the background of the Windows machine.

  • @agritech802
    @agritech802 1 month ago

    Great video, thanks for sharing 👍

  • @thbadmin7751
    @thbadmin7751 1 month ago

    Top notch work Dave!!! Thank you!

  • @kids123123123
    @kids123123123 1 month ago +9

    win10 i7-13700k with no video card pegs at 100%, and llama3.2 generates about 80% as fast as normal reading speed.

    • @docrx1857
      @docrx1857 1 month ago +2

      With a 10600K it's at least 2-3 times faster than normal reading speed. But I am on Linux.

  • @dave_kimura
    @dave_kimura 1 month ago +5

    You should use the --verbose flag when running the examples as it will give the tokens/sec
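
Running `ollama run <model> --verbose` prints timing statistics after each answer, including an eval rate in tokens per second. The same figure can be worked out from the final JSON object returned by Ollama's `/api/generate` endpoint, which reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal sketch, using a made-up sample response rather than a real measurement:

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response: tokens divided by seconds."""
    eval_count = response["eval_count"]          # number of tokens generated
    eval_duration_ns = response["eval_duration"]  # generation time in nanoseconds
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


if __name__ == "__main__":
    # Hypothetical values for illustration only, not a benchmark result.
    sample = {"eval_count": 120, "eval_duration": 4_000_000_000}
    print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Collecting this number for the same model and prompt on each machine would give a like-for-like comparison across hardware.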

  • @PaulGrayUK
    @PaulGrayUK 1 month ago +1

    Nice one Dave, bravo.

  • @malcolmgibson6288
    @malcolmgibson6288 1 month ago

    My next-door neighbour has an autistic son aged 10. I am reading as much as I can find to understand the condition. Your book is my latest purchase. I'm not sure if it will help the lad as he has very complex needs, but the knowledge will be useful.

    • @DavesGarage
      @DavesGarage 1 month ago

      There's a lot of overlap even between mild and severe cases, so hopefully the info is still useful!

    • @malcolmgibson6288
      @malcolmgibson6288 1 month ago

      @DavesGarage Thanks, I'm sure it will help. I love your work on the channel. Keep it up.

  • @airjuri
    @airjuri 1 month ago +1

    Yeah, I installed Ollama after your video. Had to comment some stuff out of the install script because it didn't notice that on my Fedora machine the CUDA drivers were installed from RPM Fusion. But yeah, after the install script went through, it works crazy fast on my office machine (i7-7800X/RTX 4070 Ti). And even on my old living room machine it works faster than I can read, so it is enough ;) (i5-4670K/Quadro P2000)

  • @GlennHamblin
    @GlennHamblin 1 month ago

    Thanks for tickling my fancy with the "Do it Len" animations! 😂

  • @BlackFlux22
    @BlackFlux22 1 month ago

    I love your channel! The OGs of Tech Samurai!

  • @orion10x10
    @orion10x10 26 days ago

    You're the developer who created Task Manager! Awesome

  • @John-zz6fz
    @John-zz6fz Місяць тому

    Great episode! I loved this one.

  • @musicwombat74
    @musicwombat74 Місяць тому +1

    Correct me if I am wrong, but the reason the Herk box is using a CPU is because its GPU is an AMD. Pretty much every ML framework today expects to use the CUDA library for GPU acceleration. CUDA is a proprietary library developed by Nvidia. AMD has been fighting tooth and nail to gain wider adoption for their own alternatives, but they are simply not there yet.

    • @Deeptesh97
      @Deeptesh97 28 днів тому

      It does support some AMD dedicated video cards as you saw in the video. Not sure how effective it will be vs CUDA.

  • @seikojin
    @seikojin Місяць тому

    When you did the intro into the last video, I knew this would be a followup kind of video. It made no sense to just leave the demo out of youtube watcher reach :D

  • @JustinEmlay
    @JustinEmlay 26 днів тому

    There's a 3.2 11B that will be out soon. That's probably the sweet spot for most people, especially for GPUs with 12GB and up. It also adds image support.

  • @PropagatorNET
    @PropagatorNET Місяць тому

    The real problem I find is context tends to eat lots of memory, beyond just loading the model itself. Sure, I can maybe load a 70b model with the memory I have, but I'm gonna hit the ceiling pretty fast with 128k context. I don't have the budget for 512gb of video memory, or a high end mac, so unless I load it into system memory, which is just insane, even with some smaller models I'm going to struggle once the context is full up. Of course, I can manually reduce the context length, but it's a shame because I'd like it to be able to handle large amounts of text or long discussions. Great video as always!
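To put rough numbers on the context cost, here is a back-of-the-envelope sketch of KV cache size. It assumes a Llama-3-70B-style layout (80 layers, 8 key/value heads, head dimension 128) and 16-bit cache entries; those figures and the function name `kv_cache_bytes` are assumptions for illustration, and real runtimes may quantize the cache or add overhead.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: one key and one value vector per layer,
    per KV head, per token in the context."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Assumed 70B-style configuration (hypothetical, see lead-in).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for ctx in (8 * 1024, 32 * 1024, 128 * 1024):
    size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx)
    print(f"{ctx:>7} tokens of context -> {size / GIB:.1f} GiB of KV cache")
```

Under these assumptions a full 128k context costs about as much memory as the quantized weights themselves, which is why trimming the context length frees so much room.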

  • @PracticalPcGuide
    @PracticalPcGuide Місяць тому

    Tested the 70B Q4 (42GB) on a 5950X with 128GB RAM, RAG and 40K context. RAM usage was about 80GB and inference ran at around 0.56/s (I usually get 30-50 on GPU using 11B). Then I tried the IQ1_S, which was 15GB, on the 4060 Ti 16GB with 30K context and got the same speed (obviously offloading to RAM).
    The good thing is that the 70B generates long and detailed answers, unlike the 3.2 1-3B models, which sometimes say they did not find the query in the attached document (a 2-hour, 30K-word YT interview).

  • @Greenie2450
    @Greenie2450 Місяць тому +4

    "Nothing but the 2nd best, for dave.... " Classic hahahaha

  • @foodflare9870
    @foodflare9870 Місяць тому

    I think the GUI of Jan makes installing models and trying things out more convenient. It also lets you set instructions per what it calls threads, which are basically what ChatGPT calls a new chat. It also has a nifty thing where you can tweak settings on the models and have different models per thread. For example, I have one model that's been trained a lot on code/documentation, which is useful for searching when I remember the concept of some language feature I need but don't remember the specific keywords in the language I'm using, most relevant when I'm working in a language that I either haven't touched in a while or don't use often. I also have a separate model that's been trained on a lot of fictional writing that I use to help proofread things that I wrote. Even if it doesn't give me the fix that I want, it at least shows where certain errors are that need looking at.
    Another nice thing about Jan is that you can hook it up to online services as well, if you want. You can keep all your LLM stuff in one place with it. I'm predominantly doing things on it locally only, but I know at least one person that does ChatGPT stuff through it.

  • @bertblankenstein3738
    @bertblankenstein3738 Місяць тому

    I haven't read the story of Little Red Robin Hood yet. :) I'm glad you did this video on a variety of hardware that includes today's computer enthusiasts.

  • @vexy1987
    @vexy1987 Місяць тому

    You should be using llama3.2 on the Pi, which is designed specifically for edge devices like SBCs or smartphones.

  • @krfloll
    @krfloll Місяць тому

    Great content. As succinct and complete as one could hope.

  • @gigiosos1044
    @gigiosos1044 Місяць тому

    the final story about jeff bezos generated by llama 3.2 2B model was actually funny ngl

  • @ADB-zf5zr
    @ADB-zf5zr Місяць тому +1

    @DavesGarage @6:00 you are talking about the "fixed" RAM allocated to the GPU. The BIOS/UEFI "should" have an option to set the memory as "shared" (or something with a similar meaning), where the amount of RAM is dynamically allocated between the CPU and the GPU. This is one of the reasons why people are interested in the upcoming "Strix Halo", which has a beefy GPU (and CPU) but also quad-channel RAM and can be fitted with 256GB, which can be dynamically adjusted and then eaten up by the GPU! Please find this setting in the BIOS, change it to "dynamic" and post a video about your findings; I am sure many would be interested in such a thing. Thanks.

  • @iamthemoss
    @iamthemoss Місяць тому

    As always, great video Dave.

  • @Dan-hw9iu
    @Dan-hw9iu 16 днів тому

    Run quantized Llama 405B on a 192GB Mac Studio. The $6.5k Mac will run circles around that $50k beast.

  • @paulbrooks4395
    @paulbrooks4395 Місяць тому +1

    I just watched a video on the limits of LLM error rate as it relates to parameters, performance, etc. Basically the relationship is asymptotic: more is better, but the gains diminish logarithmically. I think most people won't understand how AI models are being designed for levels of complexity and ambiguity that are difficult to grasp.
    They do this by having a massive number of parameters and ability to discriminate finer and finer details. These are use cases for AI to interact with humans in a visual and audio world that is absurdly complex, all while hoping to have the ability to interact with millions or billions of humans.

  • @DensDigitalDen
    @DensDigitalDen Місяць тому

    I have an M2 Mac and run LLMs locally using Msty with very good results.

  • @blitzthis77
    @blitzthis77 Місяць тому

    I appreciate this video, thanks. I don’t know how in the world WSL is considered shenanigans though.

  • @tarelethridge8937
    @tarelethridge8937 Місяць тому +8

    I seriously think your show is great. It's interesting and it's entertaining. I wasn't born in the age of the computers you grew up using, but you explain it in a very good and interesting way. I think there are YouTubers that could benefit from presenting the material as well as you do. It's not just staring at a screen and watching you do stuff.

  • @terryhdbailey
    @terryhdbailey 11 годин тому

    Dave... First off, thanks for this and the many other videos you have done. I am thinking my pushing the like button is going to wear out the button soon :). I am trying to wrap my brain around many things in this and have had the locally running ChatGPT that you showed us try to teach me about each of the parts. I am working on understanding each piece. The one question I have is: what is the difference between the 8 billion, 70 billion, and 405 billion parameter models as far as reliable answers go? I understand the larger ones take more horsepower, but I'm not sure "exactly" what the benefit of more parameters is. Maybe a future video could explain the intricacies of more parameters, or maybe one of the other co-patrons here would help out and try to clue me in. Either way, thanks for now, as I am not only jealous of your information quality but also that you are retired and I am not. :)
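On the "horsepower" half of that question only (this says nothing about answer quality), a rough sketch of how much memory the weights alone need at each size. It assumes exactly 4 bits per weight with no overhead, which is a simplification; real quantized files tend to come out somewhat larger. The function name `weight_gb` is made up for illustration.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Parameter counts mentioned in the comment; 4 bits per weight is an assumption.
for name, params in (("8B", 8e9), ("70B", 70e9), ("405B", 405e9)):
    print(f"{name:>4}: about {weight_gb(params, 4):.1f} GB of weights at 4 bits")
```

Context (the KV cache) and runtime overhead come on top of these figures.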

  • @Joshua-s8o
    @Joshua-s8o Місяць тому

    Thank you for this, Dave!

  • @SK-bl1lp
    @SK-bl1lp Місяць тому +1

    Hey, as for the RPi 4 and RPi 5, there are tons of models of 1B-3B size which are pretty fast even on a Raspberry Pi.

  • @DytliefMoller
    @DytliefMoller Місяць тому +12

    I think the next good video should be on how to train it on your own data. Let's say a simple local MS Access DB?

    • @wtmayhew
      @wtmayhew Місяць тому +1

      I’ll second that. It would be interesting to see what it takes to turn a database of help desk ticket problems and resolutions into an LLM which could try to answer technical questions.

    • @kiddailey
      @kiddailey Місяць тому +1

      Definitely! Or a collection of things, such as a bunch of emails or source code files.

    • @jaz093
      @jaz093 Місяць тому +1

      This