Local LLM Challenge | Speed vs Efficiency

  • Published 21 Oct 2024
  • I put three systems to the local LLM test.
    🛒 Gear Links 🛒
    💻🔄 K9 Mini with 32GB RAM: amzn.to/3ZiKjcp
    🛠️🚀 96GB RAM kit: amzn.to/3ZhQ4qR
    🍏💥 Mac Mini M2 Pro: amzn.to/4fbgmzY
    🎧⚡ RTX 4090: amzn.to/3YvvHpg
    🛠️🚀 Mini PC with Oculink: amzn.to/3UgLNAK
    📦🎮 My gear: www.amazon.com...
    🎥 Related Videos 🎥
    🤯 Cheap mini runs a 70B LLM - • Cheap mini runs a 70B ...
    🤖 It’s over…my new LLM Rig - • It’s over…my new LLM Rig
    🌗 RAM torture test on Mac - • TRUTH about RAM vs SSD...
    🛠️ FREE Local LLMs on Apple Silicon | FAST! - • FREE Local LLMs on App...
    🛠️ Set up Conda - • python environment set...
    🤖 INSANE Machine Learning on Neural Engine - • INSANE Machine Learnin...
    🛠️ Developer productivity Playlist - • Developer Productivity
    🔗 AI for Coding Playlist: 📚 - • AI
    - - - - - - - - -
    ❤️ SUBSCRIBE TO MY YOUTUBE CHANNEL 📺
    Click here to subscribe: / @azisk
    - - - - - - - - -
    Join this channel to get access to perks:
    / @azisk
    - - - - - - - - -
    📱 ALEX ON X: / digitalix
    #machinelearning #llm #softwaredevelopment

COMMENTS • 179

  • @chinesesparrows
    @chinesesparrows 14 годин тому +56

    Man, this question was literally on my mind. Not everyone can afford an H100.

  • @stokeseta7107
    @stokeseta7107 13 годин тому +17

    Energy efficiency and heat are the reasons I go for a Mac at the moment. I upgraded a few years ago from a 1080 Ti to a 3080 Ti, then basically sold my PC a month later and bought a PS5 and a MacBook.
    So I really appreciate the inclusion of efficiency in your testing.

    • @fallinginthed33p
      @fallinginthed33p 4 години тому +1

      Electricity prices are really expensive in some parts of the world. I'd be happy running a MacBook, Mac Mini or some small NUC for a private LLM.

  • @AmanPatelPlus
    @AmanPatelPlus 11 годин тому +11

    If you run the ollama command with the `--verbose` flag, it will give you the tokens/sec for each prompt, so you don't have to time each machine separately.
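    For example, a minimal sketch (the model tag is whatever you pulled, and the numbers below are only illustrative of the trimmed summary that --verbose prints):

      ollama run llama3.1:8b "Write a quicksort in Python" --verbose
      ...
      total duration:       1.63s
      load duration:        2.1ms
      prompt eval rate:     800.00 tokens/s
      eval count:           210 token(s)
      eval rate:            130.00 tokens/s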

  • @TheHardcard
    @TheHardcard 14 годин тому +16

    You can make LLMs nearly deterministic, if not entirely, with the “Temperature” setting. I haven’t yet had time to experiment with this myself, but I’ve been following it with great interest.
    I’d be joyous to see a similar comparison with the (hopefully just days away) M4 Pro and M4 Max alongside your other unique, insightful, and well designed testing.

    • @TobyDeshane
      @TobyDeshane 10 годин тому +7

      Not to mention setting the seed value. I wonder if there would be any per-machine difference among them with the same seed/temp, or if it would generate the same on both.
      But without them pinned to the same output, a 'speed test' that isn't about tokens/sec is pointless.

  • @showbizjosh40
    @showbizjosh40 5 годин тому +1

    The timing of this video could not have been more perfect. Literally working on figuring out how to get an LLM running locally because I want the freedom of choosing the LLM I want and for privacy reasons. I only have an RTX 4060 Ti w/ 16GB but it should be more than sufficient for my purposes.
    Love this style of format! It makes sense to consider electric cost when running these sorts of setups. Awesome quality as always!

    • @oppy124
      @oppy124 3 години тому

      LM Studio should meet your needs if you're looking for a one-click install system.

  • @danielkemmet2594
    @danielkemmet2594 12 годин тому +6

    Yeah definitely please keep making this style of video!

  • @ferdinand.keller
    @ferdinand.keller 14 годин тому +5

    7:42 If you set the temperature of the model to zero, the output should be deterministic, and give the same results across all machines.
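    A minimal sketch of that via Ollama's REST API (the model tag, prompt, and seed are just examples; in practice greedy decoding is repeatable on one machine, but different backends/quantizations may still diverge slightly):

      curl http://localhost:11434/api/generate -d '{
        "model": "llama3.1:8b",
        "prompt": "Write a bubble sort in Python",
        "stream": false,
        "options": { "temperature": 0, "seed": 42 }
      }'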

  • @robertotomas
    @robertotomas 6 годин тому +2

    If you wanna make your tests deterministic with ollama, you can use the /set command to pin top_k, top_p, and temperature (temperature 0), and set the seed parameter to the same value on each machine. Also, if instead of running a chat you put the prompt in the run command and use --verbose, you'll get the tps.
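    Something like this in the interactive session (the parameter values are just an example, not the video's setup):

      ollama run llama3.1:8b --verbose
      >>> /set parameter temperature 0
      >>> /set parameter seed 42
      >>> /set parameter top_k 1
      >>> /set parameter top_p 0.1

    Or non-interactively, ollama run llama3.1:8b "your prompt" --verbose prints the eval rate (tokens/s) after the response.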

  • @RichWithTech
    @RichWithTech 14 годин тому +6

    I'm not sure how I feel about you reading all our minds but I'm glad you made this

  • @djayjp
    @djayjp 14 годин тому +26

    Would be interested to test iGPU vs (new) NPU perf.

    • @chinesesparrows
      @chinesesparrows 14 годин тому +5

      Yeah tops per watt

    • @andikunar7183
      @andikunar7183 12 годин тому +4

      Currently NPUs can't run the AI models of these tests well. llama.cpp (the inference code behind ollama) does not run on the NPUs (yet). It's all marketing only. Qualcomm/QNN, Intel, AMD, and Apple all have different NPU architectures and frameworks, which makes it very hard for llama.cpp to support them. Apple does not even support their own NPU (called the ANE) with their own MLX framework.
      llama.cpp does not even support the Qualcomm Snapdragon's GPU (Adreno), but ARM did some very clever CPU tricks, so that the Snapdragon X Elite's CPU is approx. as fast as an M2 10-GPU for llama.cpp via their Q4_0_4_8 (re-)quantization. You can also use this speed-up with ollama (but you need specially re-worked models).
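      If you want to try that re-quantization yourself, a hypothetical invocation (llama-quantize ships with llama.cpp, but whether the Q4_0_4_8 type is available depends on your build/version):

        ./llama-quantize model-f16.gguf model-Q4_0_4_8.gguf Q4_0_4_8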

    • @GetJesse
      @GetJesse 12 годин тому +1

      @@andikunar7183 good info, thanks

    • @flyingcheesecake3725
      @flyingcheesecake3725 4 години тому

      @@andikunar7183 I heard Apple MLX does use the NPU, but we don't have control to manually target it. Correct me if I am wrong.

  • @aliuslavadius
    @aliuslavadius 14 годин тому +4

    Your videos are pure joy.

  • @Himhalpert8
    @Himhalpert8 10 годин тому +1

    I'll be totally honest... I don't have the slightest clue what's happening in this video but the little bit that I could understand seems really cool lol.

  • @deucebigs9860
    @deucebigs9860 12 годин тому +3

    I got the K9 and 96GB after your first video about it, and am happy with the output speed and power consumption for the price. I did get a silent fan for it though, because it is loud. I only really do coding work, so I think the Mac would be great, but the RTX is just overkill. I'm very curious about the next gen of Intel processors, so please do that test. I don't care about the RTX honestly, because if I needed something that good I could put that money towards Claude or chat gippity online and not have to deal with the heat, power consumption, and maintenance, and still run a huge model. I personally don't see the benefits of running a 7B parameter model at lightning speed; I'd rather run a bigger model slower at home. Great video! So much to think about.

  • @comrade_rahul_1
    @comrade_rahul_1 10 годин тому +2

    This is what I want to watch about tech. Amazing man! ❤

    • @AZisk
      @AZisk  9 годин тому +1

      Glad you enjoyed it.

    • @comrade_rahul_1
      @comrade_rahul_1 2 години тому +1

      @@AZisk Great work sir. No one else does these things except you. 👌🏻

  • @ChrisGVE
    @ChrisGVE 14 годин тому +1

    Very cool content thanks Alex, I've always wanted to try one of these models locally but I couldn't find the time, at least I can follow your steps and do it a bit faster :) See you in your next video!

  • @davidtindell950
    @davidtindell950 14 годин тому +3

    thanks yet again !

  • @FuturePulse_nl
    @FuturePulse_nl 11 годин тому +1

    Look, someone in a dog costume passing by at 00:11

  • @technovangelist
    @technovangelist 7 годин тому

    This was fantastic, thanks. I have done a video or two about Ollama, but haven't been able to do something like this because I haven't bought any of the nvidia cards. It is crazy to think that we have finally hit a use case where the cheapest way to go is with a brand new Mac.

  • @ShinyTechThings
    @ShinyTechThings 9 годин тому +2

    Agreed

  • @seeibe
    @seeibe 14 годин тому

    Great test! This is exactly what I'm interested in, especially the idle power test is important, as you won't be running inference 24/7 usually. Amazing to see that "mini PC" with the eGPU takes more power in idle than my full blown desktop PC with a 4090 😅

  • @newman429
    @newman429 15 годин тому +11

    I am pretty sure you've heard about Asahi Linux, so will you make a video on it?

    • @ALCE-h7b
      @ALCE-h7b 14 годин тому +2

      It seems like he already did, 2 years ago... Did something change?

    • @newman429
      @newman429 13 годин тому +2

      ​@@ALCE-h7b
      No, they are still doing what they were doing before, but I suppose they took a big leap now with the release of their 'Asahi game playing toolkit', which in short makes playing AAA games on a Mac very probable.
      Maybe if you want to learn more, why not read their blog.

    • @AZisk
      @AZisk  13 годин тому +5

      Maybe I need to revisit it. Cheers

  • @cobratom666
    @cobratom666 13 годин тому +7

    I think you calculated the costs for the RTX 4090 wrongly. You assume that the average wattage for the RTX is 300W, but you should use your electricity usage monitor - there is an option to show total usage. Let me explain: for the first 1-2 seconds (loading the model to memory) the RTX consumes much less than 320W, let's say 90W, so the total consumption will be 2x90W + 10x320W = 3380J or even less - not 3840J. For the M2 Pro it doesn't matter that much, because 2x15W + 45x48W = 2190J instead of 2256J. So the result is off by at least 15% for the RTX.
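    If your meter can log readings, a rough way to get the real total (assuming a hypothetical power_log.txt with one wattage sample per second):

      awk '{ joules += $1 } END { printf "%.0f J  (%.6f kWh)\n", joules, joules/3600000 }' power_log.txt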

    • @laszlo6501
      @laszlo6501 11 годин тому

      The 2 seconds would probably be there only for the first time the model is loaded.

    • @GraveUypo
      @GraveUypo 10 годин тому +2

      @@laszlo6501 yup, the 2 second delay only happens the first time you load the model and he put such emphasis on it.

  • @haha666413
    @haha666413 12 годин тому

    Love the comparison. Can't wait for the next one - maybe try the 4090 in a PC so it can actually stretch its legs and copy the data to VRAM much quicker.

  • @EladBarness
    @EladBarness 7 годин тому

    Great video! Thanks
    In the verdict you forgot to mention
    that the only way to run bigger LLMs is with Macs, if you have enough unified memory.
    Yes, it's going to be slow, but still possible, and faster than on a CPU.
    You mentioned it in the beginning though :)

  • @DevsonButani
    @DevsonButani 12 годин тому +1

    I'd love to see the same comparison with the Mac Studio with the Ultra chip, since it has a similar cost to a high-spec PC with an RTX 4090. This would be quite useful to know for workstation situations.

  • @Solstice42
    @Solstice42 12 годин тому

    Great assessment- so glad you're including power usage. Power and CO2 cost of AI is critical for people to consider going forward. (for the Earth and our grandchildren)

  • @mshark2205
    @mshark2205 12 годин тому

    This is top quality video I was looking for. Surprised that Apple silicon runs LLMs quite well…

  • @siddu494
    @siddu494 13 годин тому +1

    Great comparison Alex, it just shows that if a model is tuned to run on specific hardware it will outperform in terms of efficiency. However, I saw an article today that Microsoft open-sourced bitnet.cpp, a blazing-fast 1-bit LLM inference framework that runs directly on CPUs. It says that you can now run 100B parameter models on local devices.
    Will be waiting for your video on how this changes everything.

    • @andikunar7183
      @andikunar7183 12 годин тому

      LLM inference is largely determined by RAM bandwidth. The newest SoCs (Qualcomm/AMD/Intel/Apple) almost all have 100-130 GB/s, while the M2/M3 Max has 400 GB/s, the Ultra 800 GB/s, and the 4090 >1 TB/s. And all the new CPUs have very fast matrix instructions, rivalling the GPUs for performance. 1.5-bit quantization might be some future thing (but not in any usable model); currently 4-bit is the sweet spot. Snapdragon X Elite CPU Q4_0_4_8 quantized inference is already similar in speed to M2 10-GPU Q4_0 inference with the same accuracy.
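      A rough back-of-envelope for that bandwidth bound (sizes approximate; a Q4_0 8B model is about 4.7 GB, and each generated token has to read the whole model once):

        echo "scale=1; 400 / 4.7" | bc     # ~85 tokens/s ceiling at 400 GB/s (M2/M3 Max)
        echo "scale=1; 1008 / 4.7" | bc    # ~214 tokens/s ceiling at ~1 TB/s (4090)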

  • @samsquamsh78
    @samsquamsh78 10 годин тому

    Thanks for a great and interesting video! I have a Linux machine with an Nvidia card and a MacBook Pro M3, and personally I really care about the noise and heat from the Linux/Nvidia machine - after a while it's bothersome. The Mac makes zero noise and no heat, and is always super responsive. In my opinion, it's incredible what Apple has built. Thanks again for your fun and cool technical videos!

  • @donjaime_ett
    @donjaime_ett 13 годин тому

    For an AI server, once the model is loaded into vram, you probably want it to remain there for repeated inferences. So it depends on how you set things up.
    Also if you want apples to apples determinism, reduce the temperature at inference time.

  • @dave_kimura
    @dave_kimura 13 годин тому +2

    I'm surprised that the Oculink doesn't create more of a performance hit than it did. My 4090 plugged directly into the mobo gets about 145 tokens/sec on llama3.1:8b versus the ~130 tokens/sec that you got. Kind of makes sense, since the model is first loaded into memory on the GPU.

    • @andikunar7183
      @andikunar7183 12 годин тому

      Token generation during inference is largely memory-bandwidth bound (OK, with some minimal impact from the quantization calculations), and the LLM runs entirely on the 4090 with its ~1 TB/s. The 4090 really shines during (batched) prompt processing, blowing the Intel/Apple machines away - probably >20x faster than the M2 Pro, and way, way more so than the Intel CPU/GPU.

  • @massimodileo7169
    @massimodileo7169 13 годин тому +1

    --verbose
    please use this option, it's easier than Schwarzenegger 2.0

    • @AZisk
      @AZisk  13 годин тому +2

      but way less fun

  • @Johnassu
    @Johnassu 10 годин тому

    Great test!

  • @gaiustacitus4242
    @gaiustacitus4242 12 годин тому

    Let's be honest, once the 24GB of RAM on the RTX 4090 is exhausted, the performance is dismal for LLMs that get pushed out to the 128GB of RAM on your motherboard. That's why I'm looking forward to the new Mac Studio M4 Ultra with 256GB (or greater) integrated memory, 24+ CPU cores, 60+ GPU cores, and 32+ NPU cores.
    Many of the LLMs are developed on Mac hardware because it is presently the best option.

  • @wnicora
    @wnicora 11 годин тому

    Really nice video, thx.
    What about performing the test on the Qualcomm dev kit machine? It would be interesting to see how the Snapdragon performs.

  • @modoulaminceesay9211
    @modoulaminceesay9211 14 годин тому +1

    Thanks

  • @vinz3301
    @vinz3301 11 годин тому

    Can we talk about this colorful keyboard on the right? Gave me goosebumps!

  • @mrmerm
    @mrmerm 12 годин тому +1

    Would be great to see AMD in the benchmarks both with CPU and external GPU.

  • @arkangel7330
    @arkangel7330 14 годин тому +4

    I'm curious how the Mac Studio with M2 Ultra would compare with the RTX 4090.

    • @AZisk
      @AZisk  14 годин тому +5

      Depending on the task, I think it would do pretty well. It won't be as fast on smaller models, but it will destroy the RTX on larger models.

    • @brulsmurf
      @brulsmurf 13 годин тому +1

      @@AZisk It can run pretty large models, but the low speed makes it close to unusable for real time interaction.

    • @mk500
      @mk500 9 годин тому

      @@brulsmurf I use 123B models on my M1 Ultra 128GB. It can be as slow as 7 tokens per second, but I find that still usable for interactive chat. I'm more into quality than speed.

  • @timsubscriptions3806
    @timsubscriptions3806 13 годин тому +2

    I'm with you! Apple wins by a mile. I have compared my 64GB Mac Studio M2 Ultra to my Windows workstation that has dual Nvidia A4500s using NVLink (20GB for each card), and at half the price the Mac Studio easily competes with the Nvidia cards. I can't wait to get an M4 Ultra Mac Studio with 192GB RAM - maybe more RAM 😂. I use AI for local RAG and research, and as a local code assistant for VS Code.

    • @timsubscriptions3806
      @timsubscriptions3806 13 годин тому

      BTW, the Mac Studio was half the price of the Windows workstation

  • @Techonsapevole
    @Techonsapevole 11 годин тому +1

    Cool, but where is the AMD Ryzen 7?

  • @nasko235679
    @nasko235679 13 годин тому

    I was actually interested in building a dedicated LLM server, but after a lot of looking around for language models, I realized most open-source "coding"-focused LLMs are either extremely outdated or plainly not good. Llama is working with data from 2021, DeepSeek Coder 2022, etc. Unfortunately the best models for coding purposes are still Claude and ChatGPT, and those are closed.

  • @kevinwestmor
    @kevinwestmor 10 годин тому

    Alex, thank you, and please comment on the NUC's shared video memory management (in general - because the smaller test probably fit in either/both), especially when switching back and forth between a CPU test and a GPU test - would this be Windows-managed, or would you be changing parameters? Thank you again.

  • @mrfaifai
    @mrfaifai 3 години тому

    Thank you for making this. I think the Mac Studio could be comparable to the 4090? Having said that, there is a 192GB option for the Mac Studio, and that's not even mentioning the potential new M4 Mac Studio.

  • @atom6_
    @atom6_ 13 годин тому +2

    a Mac with MLX backend can go even faster.

  • @pe6649
    @pe6649 4 години тому

    @Alex for me it would be even more interesting to see how much longer the Mac Mini takes on a fine-tuning session than a 4090 - inference only is a bit boring.

  • @ALCE-h7b
    @ALCE-h7b 14 годин тому +1

    Have you thought about running such tests on the new AMD 8700 in tiny PCs? I heard they have a pretty good iGPU.

  • @sarmadmohsin8444
    @sarmadmohsin8444 14 годин тому

    Man, things look good for the future.

  • @delphiguy23
    @delphiguy23 8 годин тому

    I wonder how it performs on an M1 Mac Mini - I'll have to give it a try. Rather than selling it, I might repurpose it for this use case. Also, for that RTX 4090, you mentioned that there's a warmup where the model is copied to VRAM -- does this happen for each request, or does it stay in VRAM for a while?

  • @Vili69420
    @Vili69420 9 годин тому +1

    The RTX 4090 is probably heavily gimped by that x4 PCIe link when it comes to copying data from RAM to VRAM. Also, I'm not sure whether the LLM takes advantage of NVIDIA GPUDirect Storage to skip the RAM step and load directly into VRAM.

  • @tovisalvador5373
    @tovisalvador5373 13 годин тому +1

    0:11 cute doggie

    • @AZisk
      @AZisk  13 годин тому

      Dog photobomb spotted!

    • @raguel259
      @raguel259 13 годин тому

      The highlight of this video

    • @AZisk
      @AZisk  13 годин тому

      @@raguel259 I'll tell him

  • @boltez6507
    @boltez6507 14 годин тому +2

    Man, waiting for Strix Halo.

  • @Coolgamer322
    @Coolgamer322 14 годин тому

    If you want the experiment to be deterministic you can actually do so by setting a seed.

  • @camerascanfly
    @camerascanfly 4 години тому

    0.53 EUR/kWh in Germany? Where did you get these numbers from? About 0.28 EUR/kWh would be an average price. In fact prices dropped considerably during the last year or so.

  • @chrisdavis6264
    @chrisdavis6264 11 годин тому

    I'd like to see which cooling system would work best per price/kWh level - how much it'd cost to keep your machine cool and operational.

  • @yudtpb
    @yudtpb 14 годин тому +1

    Would like to see how lunar lake performs

  • @skyak4493
    @skyak4493 11 годин тому +1

    I would love to see this on a new M4 pro Mac mini.

    • @flexairz
      @flexairz 11 годин тому

      if that one exists

  • @seeibe
    @seeibe 14 годин тому

    I'm actually surprised the performance/W for the M2 isn't better. So basically if you want to get the most performance for the least money, the 4090 still wins. Now we need to see an M2 ultra for comparison 😂

  • @toadlguy
    @toadlguy 13 годин тому +1

    Didn’t you add a lot of memory to the Intel Box? Was that in your cost of ownership?

    • @AZisk
      @AZisk  13 годин тому +2

      I did, but in this case, the model I was showing wasn't using all that memory and could easily run on the 32GB that the machine came with.

    • @toadlguy
      @toadlguy 12 годин тому +1

      @@AZisk But if you had that box, you probably wouldn't be able to resist adding 64GBs of memory 🤣

  • @eldino
    @eldino 12 годин тому

    Would it be possible to also include the prices of the devices you test in the next videos?
    In order to understand what the cheapest option is to run a 7B LLM locally at decent speed?

    • @AZisk
      @AZisk  11 годин тому

      why not in this video?

  • @Ruby_Witch
    @Ruby_Witch 11 годин тому

    How about on the Snapdragon CPUs? Do they have Ollama running natively on those yet? I'm guessing not, but it would be interesting to see how they match up against the Mac hardware.

  • @neotokyovid
    @neotokyovid 14 годин тому

    I'd like to know the logic behind "being able to run models locally has become more important"

  • @smurththepocket2839
    @smurththepocket2839 11 годин тому

    Did you also consider the Nvidia dev kits? Like the Nvidia Jetson AGX Orin 64GB?

  • @abhiranjan0001
    @abhiranjan0001 13 годин тому

    Nice Keyboard collection 👀

  • @PaulPetterson
    @PaulPetterson 13 годин тому

    At 15:10 you show a document that says both that the M2 16GB costs $0.11 per day and $.03 per day, and that the yearly costs are $40 and $12. I looked closer because 3 cents per day seems suspiciously low.

  • @xsiviso4835
    @xsiviso4835 8 годин тому

    0,53$ per kWh is really high for Germany. At the moment it is at 0,30$ per kWh.

  • @gregorydcollins512
    @gregorydcollins512 13 годин тому

    My electricity cost is 10.5 cents per kilowatt hour on off peak rate in Sacramento, California, because we get an EV discount.

  • @DaveEtchells
    @DaveEtchells 5 годин тому

    I bought a used M1 Pro 64GB MBP to tide me over from my old Intel MBP, until whatever the max-config M4 will end up being. Will there be a version with 256 GB of integrated RAM with fast “neuro” processing units? That could be a game-changer for running decently large models locally.
    (Although the trend seems to be towards more limited-focus smaller models in agentic frameworks for best efficiency. Still though, 256GB of integrated memory with a fast NPU would be awesome for local models 😁)

  • @ppasieka
    @ppasieka 14 годин тому

    I guess you can set a seed when running these tests with Ollama. That would give more deterministic results

  • @mathiashove02
    @mathiashove02 9 годин тому

    From Denmark.. The electricity bill is crazy here :)

  • @Z-add
    @Z-add 14 годин тому +2

    You should have used an energy meter instead of just multiplying time by average power.

  • @mk500
    @mk500 9 годин тому

    I ended up with a used Mac Studio M1 Ultra with 128GB RAM. I can run any model up to 96GB of RAM, and often do. It could be faster, but the important thing is I have few limitations for large-ish models and context. What really competes with this?

  • @meh2285
    @meh2285 12 годин тому

    You can set the seed so they all output the same response.

  • @prezlamen
    @prezlamen 8 годин тому

    Try it with a full-blown PC on Linux.
    Maybe it's also a good idea to try different distributions; use comparisons from Phoronix to see what's best. I remember that Clear Linux by Intel has good performance.

  • @Mohammed_school_s
    @Mohammed_school_s 3 години тому

    We must see the M4 soon.

  • @donsfromnz6515
    @donsfromnz6515 Годину тому

    I would love to see the RTX A2000 and A4000 SFF Ada cards @ 70W over oculink

  • @claytonramstedt3191
    @claytonramstedt3191 5 годин тому

    Last I checked on my Linux desktop, Ollama will keep a model loaded into memory for like 5 mins after its first use before unloading it. Did you account for that behavior?

  • @Ukuraina-cs6su
    @Ukuraina-cs6su 14 годин тому

    I am even more confused. Is there some tutorial to understand what LLM configurations exist and what all the words mean, so I can have a general idea of what I need?
    Because running on 16 GB of RAM surprised me; I have seen open-source models that require 100+ gigs of RAM or even terabytes... 😮
    I believe there should be a reason for them to require so much memory if you can do the same with 16 gigs 😮

  • @riccardomarchesini836
    @riccardomarchesini836 10 годин тому +1

    great video

    • @AZisk
      @AZisk  9 годин тому

      Thanks!

  • @luisrortega
    @luisrortega 13 годин тому

    I wonder if the benchmark uses the latest Apple MLX library; when I switched to it (in LM Studio), it was an incredible difference. I can't wait until Ollama adds it!

    • @andikunar7183
      @andikunar7183 12 годин тому

      Ollama uses the llama.cpp code base for inference, and llama.cpp doesn't use MLX.
      I admit that I did not benchmark MLX against llama.cpp recently; I have to look into it. MLX, in my understanding, can use GGUF files, but only limited quantization variants. Q4_0 seems a good compromise, but llama.cpp can do better.
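      If you want to try MLX directly outside LM Studio, a minimal sketch with the mlx-lm package (pip install mlx-lm; the model repo is only an example):

        python -m mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --prompt "Explain quicksort briefly"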

  • @pythonlibrarian224
    @pythonlibrarian224 9 годин тому

    Mac silicon is still winning for LLM text generation.

  • @navroopsingh8902
    @navroopsingh8902 12 годин тому

    6:39 I think the RAM to VRAM copy will be much faster if the GPU is connected to the motherboard using PCIe instead of this workaround

    • @andikunar7183
      @andikunar7183 12 годин тому

      The RAM-to-VRAM copying is just for model startup. It does not impact LLM run performance if the LLM fits into VRAM. Ollama has a configurable keep-alive time for the models, so if you do repeated LLM interactions you don't have to copy again.
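      For reference, keep-alive can be set server-wide or per request (the values here are just examples):

        OLLAMA_KEEP_ALIVE=30m ollama serve
        curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "prompt": "hi", "keep_alive": "30m"}'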

  • @kaptnwelpe5322
    @kaptnwelpe5322 14 годин тому

    It would be interesting to see how those RTX Small Form Factor cards would perform - max 75W, PCIe bus-powered only... Also: if a model was trained on Nvidia, does running it on other hardware give different results?

  • @laszlo6501
    @laszlo6501 11 годин тому

    Does the RTX 4090 not keep the model in memory between short requests? A 2-second time to first token would only impact the first request.

  • @myte1why
    @myte1why 4 години тому

    Well, I have some questions about the test method, because you don't close and reopen the LLM every time you need it; mostly you load it once and it runs for a long time, serving multiple queries, so loading time is a non-issue in daily use. Second, most of the time you will put the LLM on a mini server (you can use a mini PC or any PC for that matter; I don't have a Mac so I can't talk about them) and just reach it over LAN, so heat can be little to no problem.
    But the last point is how big a model it can work with - do you even need the biggest model? Like, do you need a storyteller for coding, or a coder for storytelling?
    And lastly, who needs what and for how much? As long as you have an internet connection, even the cheapest solution here costs about as much as 2+ years of a subscription plan.
    I use local LLMs, so I can say it's a good option when you don't have an internet connection, but it's not for everybody.

  • @pucavaz
    @pucavaz 14 годин тому

    I love your video, but for me it doesn't make much sense to invest in a local LLM right now. I really love the idea, but in real life pay-as-you-go is better in 99% of cases.

  • @laherikeval2524
    @laherikeval2524 Годину тому

    That's why I bought an M3 Pro MacBook Pro - it works great and is very power efficient. I use llama3.1 7b for code generation and the speed is great.

  • @CaimAstraea
    @CaimAstraea 4 години тому

    It's not feasible IMO yet... we would need specialized hardware; general-purpose GPUs that are used for gaming won't cut it, and Apple will fleece you on price/memory. My thinking is that in the future we will see consumer hardware specialized for local inference in the $2000-4000 price range... maybe ~

  • @chany0033
    @chany0033 13 годин тому

    Hi Alex, can you make a video about running LLM code editors like the Zed editor or VS Code with some extension in a local setup? I'm planning to buy a GPU, but I can't find any video demonstrating local code generation.

  • @seeibe
    @seeibe 14 годин тому

    12:00 Are you sure? Normally ollama will keep the model in memory for a few minutes. So if you do something like coding, the model should already be in VRAM for most queries.

    • @toadlguy
      @toadlguy 13 годин тому +1

      The default is just 5 minutes, but if you increase that it might affect your cost substantially. Trade-off.

  • @adeelsiddiqui4131
    @adeelsiddiqui4131 11 годин тому

    Can you do the same test for Stable Diffusion or Flux models?

  • @Gome.o
    @Gome.o Годину тому

    12:37 Shouldn't quality of responses also matter? I imagine the quality that the 4090 spits out is better than the mac in terms of accuracy and quality?

  • @KaranSinghSikoria
    @KaranSinghSikoria 13 годин тому +1

    Freaked out at 0:12, then realized it's the doggo buddy. What's his name?

  • @12wsaqw
    @12wsaqw 13 годин тому

    I like your video, but you seem to be skipping over a (IMHO) significant use case: image generation with ComfyUI :^). Is it just me, or is my very expensive Mac Studio M2 Max w/ 64GB just plain SLOWWWWWWW when running ComfyUI?

  • @tomxu4254
    @tomxu4254 14 годин тому

    I wonder what the point of these benchmarks is. I would like to learn from you about practical LLMs at home - efficient recommendations for the highest-parameter model for a Mac Studio M1 Max with 10-core CPU, 24-core GPU and 64 GB RAM.

  • @woolfel
    @woolfel 13 годин тому

    If the RTX didn't get more than double, I would be concerned. It's crazy how much the 4090 costs in 2024. Nvidia is toooo greedy now.

  • @ChristophBackhaus
    @ChristophBackhaus 13 годин тому

    I hope Apple goes ALL IN on AI in the next generation.

  • @ertemeren
    @ertemeren 14 годин тому

    Alex, if it's possible, please use another transition sound instead of the glass-like one.

  • @thiagotk987
    @thiagotk987 7 годин тому

    I'm curious about the performance of the AMD Strix Halo when it launches.

  • @blackhorseteck8381
    @blackhorseteck8381 14 годин тому

    Did you hook up the 4090's power supply to the Kill A Watt meter? Because those numbers don't make sense.

    • @brulsmurf
      @brulsmurf 13 годин тому

      These kinds of models hit the GPU differently than gaming workloads.

    • @AZisk
      @AZisk  13 годин тому

      The meter was set up to measure the "system". So for the 4090 setup it was the Minisforum Mini PC (which is the exact same internal specs as the GMKTek mini pc), plus the 4090/Power supply/dock contraption.

    • @blackhorseteck8381
      @blackhorseteck8381 13 годин тому

      @@brulsmurf I am an LLM engineer running local servers and I closely monitor the power usage. I know what I'm talking about.

    • @brulsmurf
      @brulsmurf 11 годин тому

      @@blackhorseteck8381 ok, if you say so. The numbers in the video are in the range of what I get on my 4090 running inference.