Llama 3.1 70B GPU Requirements (FP32, FP16, INT8 and INT4)

  • Published Sep 15, 2024
  • This tool allows you to choose an LLM and see which GPUs could run it... : aifusion.compa...
    Welcome to this deep dive into the world of Llama 3.1, the latest and most advanced large language model from Meta. If you've been amazed by Llama 3, you're going to love what Llama 3.1 70B brings to the table. With 70 billion parameters, this model has set new benchmarks in performance, outshining its predecessor and raising the bar for large language models.
    In this video, we'll break down the GPU requirements needed to run Llama 3.1 70B efficiently, focusing on different quantization methods such as FP32, FP16, INT8, and INT4. Each method offers a unique balance between performance and memory usage, and we'll guide you through which GPUs are best suited for each scenario, whether you're running inference, full Adam training, or low-rank fine-tuning.
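    As a rough guide, weight memory alone is parameter count times bytes per parameter; a minimal weights-only sketch (KV cache, activations, and framework overhead are ignored, so real requirements are higher):
    # Weights-only VRAM estimate for Llama 3.1 70B at each precision level.
    # Real usage is higher: KV cache, activations, and runtime overhead are not counted.
    PARAMS = 70e9  # approximate parameter count

    BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

    for dtype, nbytes in BYTES_PER_PARAM.items():
        print(f"{dtype}: ~{PARAMS * nbytes / 1e9:.0f} GB of weights")
    # FP32 ~280 GB, FP16 ~140 GB, INT8 ~70 GB, INT4 ~35 GB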
    To make your life easier, I’ve developed a free tool that allows you to select any large language model and instantly see which GPUs can run it at different quantization levels. You’ll find the link to this tool in the description below.
    If you’re serious about optimizing your AI workloads and want to stay ahead of the curve, make sure to watch until the end. Don’t forget to like, subscribe, and hit the notification bell to stay updated with all things AI!
    Disclaimer: Some of the links in this video/description are affiliate links, which means if you click on one of the product links, I may receive a small commission at no additional cost to you. This helps support the channel and allows me to continue making content like this. Thank you for your support!
    Tags:
    #Llama3
    #MetaAI
    #GPUrequirements
    #QuantizationMethods
    #AIModels
    #LargeLanguageModels
    #FP32
    #FP16
    #INT8
    #INT4
    #AITraining
    #AIInference
    #AITools
    #llmgpu
    #llm
    #gpu
    #AIOptimization
    #ArtificialIntelligence

COMMENTS • 27

  • @einstien2409
    @einstien2409 25 days ago +30

    Let's all take a moment to appreciate how much Nvidia kneecaps its GPUs with low VRAM to push customers into buying more $50K GPUs. For $25K, the H100 should not come with less than 256 GB of VRAM. At all.

    • @alyia9618
      @alyia9618 1 day ago +4

      Shame on the competitors that can't or won't offer better solutions! And there are many startups out there with innovative solutions (no GPUs, but real "neural" processors)... These startups need money, but the likes of AMD, Intel, etc. instead continue with their bollocks, putting out "CPUs with NPUs" that are clearly not enough to run "real" LLMs. That's because they are playing the same game as Nvidia, trying to squeeze as much money as possible from the gullible. Sooner or later we will have machines with Graphcore IPUs or Groq LPUs, but not before the usual culprits get rich squeezing everyone.

  • @lrrr
    @lrrr 10 days ago +7

    Thanks man, I was trying to find a video like this for a long time. You saved my day!

  • @serikazero128
    @serikazero128 4 days ago +6

    I think your video is pretty solid, but it's also missing something.
    I can currently run Llama 3.1 with 0 video RAM. Yes, you heard that right, 0 GB of VRAM.
    How is this possible? With low quantization types, similar to INT4 and INT8; in my case, more exactly: llama3.1:70b-instruct-q3_K_L
    I can run it with around 50-64 GB of RAM. And run it on my CPU.
    It takes, however, roughly 2 minutes to answer: "Hey, my name is Jack, what's yours?"
    What's the deal?
    AI needs RAM, not specifically VRAM. VRAM is much faster of course, but I'm using a laptop CPU (weaker than a desktop one), and one that is 3-4 years old.
    After I load the model, my RAM usage jumps to around 48 GB, while normally, without the model loaded, it sits at around 10 GB.
    My point is: you don't need insane resources to run AI. As long as speed isn't the issue, you can even run it on the CPU; it just takes longer. The GPU isn't what makes AI go, the GPU only makes it go much faster.
    I have no doubt that with 40 GB of VRAM, my Llama 3.1 would answer in 20-30 seconds instead of 2 to 2.5 minutes.
    However, you can still run it on an outdated LAPTOP CPU, as long as you have enough memory. And that's the key thing here: memory.
    And it doesn't have to be VRAM!!
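
    A quick way to sanity-check those numbers is to estimate system RAM from bits per weight; a rough sketch (the ~4.3 bits/weight figure for Q3_K_L and the overhead allowance are assumptions):
    # Rough system-RAM estimate for CPU inference of a quantized 70B model.
    # K-quant bits/weight are approximate; overhead covers KV cache and runtime buffers.
    def ram_estimate_gb(params, bits_per_weight, overhead_gb=8.0):
        return params * bits_per_weight / 8 / 1e9 + overhead_gb

    print(f"70B @ Q3_K_L (~4.3 bits): ~{ram_estimate_gb(70e9, 4.3):.0f} GB")  # ~46 GB, close to the ~48 GB seen above
    print(f"70B @ Q4_0   (~4.5 bits): ~{ram_estimate_gb(70e9, 4.5):.0f} GB")  # ~47 GB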

    • @AIFusion-official
      @AIFusion-official  3 days ago +1

      Thank you for sharing your experience! You're absolutely right that running LLaMA 3.1 70B on a CPU with low quantization like Q3_K_L is possible with enough RAM, but it comes with trade-offs. While CPUs can handle the load, they tend to overheat more than GPUs when running large language models, which can slow down the generation even further due to throttling. So, while it's feasible, the long response times (e.g., 2 minutes for a simple query) and the potential for overheating make it impractical for real-life usage. For faster and more stable performance, GPUs with sufficient VRAM are much better suited for these tasks. Thanks again for bringing up this important discussion!

    • @alyia9618
      @alyia9618 1 day ago +1

      Yeah, OK, but the loss of precision from using 3-bit quantization is colossal! There is a reason why FP16 (or BF16) is the sweet spot for quantization, with INT8 as a "good enough" stopgap.

    • @serikazero128
      @serikazero128 1 day ago +1

      @@alyia9618 I could run FP16 if I add more RAM, that's my point.
      And if a laptop processor can do this, a LAPTOP CPU, you can even run FP16 on a machine with 256 GB of RAM. And getting 256 GB of RAM is a lot cheaper than getting 256 GB of VRAM.

    • @alyia9618
      @alyia9618 1 day ago +1

      @@serikazero128 Yes, you can run FP16 no problem, especially with AVX-512 equipped CPUs! The problem is that as the number of parameters goes up, memory bandwidth becomes a huge bottleneck. That's the real problem, because CPUs can cope with the load, especially the latest ones with integrated NPUs, and it's a no-brainer if we run the computation on the iGPU too. Feeding all those compute units is the problem, because two memory channels and a theoretical max of 100 GB/s of bandwidth aren't enough. The solution the likes of Nvidia and AMD have found for now is to add HBM memory to their chips, and it is an empirically verified solution too, because Apple M3 chips go strong exactly because they have high-bandwidth memory on the SoCs.
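
      The bandwidth point is easy to quantify: in single-stream decoding, each new token has to stream roughly the whole quantized weight set from memory, so bandwidth divided by model size gives an upper bound on speed. A rough sketch (bandwidth figures are ballpark values):
      # Upper bound on decode speed when memory bandwidth is the bottleneck:
      # generating one token reads (approximately) every weight once.
      def max_tokens_per_s(model_gb, bandwidth_gb_s):
          return bandwidth_gb_s / model_gb

      MODEL_GB = 40  # Llama 3.1 70B around 4-bit, weights only (rough)
      for name, bw in {"dual-channel DDR5 (~100 GB/s)": 100,
                       "Apple M3 Max (~400 GB/s)": 400,
                       "H100 HBM3 (~3350 GB/s)": 3350}.items():
          print(f"{name}: <= ~{max_tokens_per_s(MODEL_GB, bw):.1f} tokens/s")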

  • @nithinbhandari3075
    @nithinbhandari3075 5 days ago +2

    Nice video.
    Thanks for the info.
    We are sooo gpu poor.

  • @maxxflyer
    @maxxflyer 7 hours ago +1

    great tool

  • @gazzalifahim
    @gazzalifahim 18 hours ago +2

    Man, this is the tool I have been wishing for over the last 3 months! Thanks, thanks, thanks!
    Just one question. I was planning to buy an RTX 4060 Ti for my new build to run some thesis work. My work is mostly on open-source small LLMs like Llama 3.1 8B, Phi-3-Medium 128K, etc. Will I be able to run those with great inference speed?

    • @AIFusion-official
      @AIFusion-official  17 hours ago

      Thank you, I'm glad it's useful! As for the RTX 4060 Ti, if you're looking at the 8 GB version, I'd actually recommend considering the RTX 3060 with 12 GB instead. It's usually cheaper and gives you more room to run models at higher quantization levels. For example, with Llama 3.1 8B, the RTX 3060 can run it in INT8, whereas the 4060 Ti with 8 GB would only handle INT4. Just to give you some perspective, I personally use an RTX 4060 with 8 GB of VRAM, and I can run Llama 3.1 8B in INT4 at around 41 tokens per second at the start of a conversation. So while the 4060 Ti will work, the 3060 might give you more flexibility for your thesis work with LLMs.
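
      A weights-only sketch of why the 12 GB card has more headroom for an 8B model (the overhead allowance is an assumption and grows with context length):
      # 8B-model VRAM: weights plus an assumed ~1.5 GB for KV cache and runtime.
      PARAMS_8B = 8e9
      OVERHEAD_GB = 1.5  # assumption; grows with context length

      for dtype, nbytes in {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}.items():
          need = PARAMS_8B * nbytes / 1e9 + OVERHEAD_GB
          fits = [card for card, vram in {"4060 Ti 8GB": 8, "3060 12GB": 12}.items() if need <= vram]
          print(f"{dtype}: ~{need:.1f} GB -> fits on {fits or 'neither card'}")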

  • @io9021
    @io9021 5 days ago +2

    When running Llama 3.1 70B with Ollama, by default it selects a version using 40 GB of memory. That's 70b-instruct-q4_0 (c0df3564cfe8), so that has to be INT4. I guess in this case all parameters (key/value/query and feedforward weights) are INT4?
    Then there are intermediate sizes where different parameters are probably quantized differently?
    70b-instruct-q8_0 (5dd991fa92a4) needs 75 GB; presumably that's all INT8?

    • @AIFusion-official
      @AIFusion-official  5 days ago +1

      Thank you for your insightful comment! Yes, when running Llama 3.1 70B with Ollama, the 70b-instruct-q4_0 version likely uses INT4 quantization, which would apply to all parameters, including key, value, query, and feedforward weights. As for intermediate quantization levels, you're correct: different parameters may be quantized to varying degrees, depending on the model version. The 70b-instruct-q8_0, needing 75 GB, would indeed suggest that it's fully quantized to INT8. Each quantization level strikes a balance between memory usage and model performance.
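
      The reported Ollama sizes also line up with a weights-only estimate once per-block scales, higher-precision tensors, and the KV cache are added on top (a rough check, not exact accounting):
      # Weights-only estimate vs. the in-memory sizes reported above.
      PARAMS = 70.6e9  # approximate parameter count of Llama 3.1 70B

      q4 = PARAMS * 4 / 8 / 1e9  # ~35 GB of pure 4-bit weights vs ~40 GB observed
      q8 = PARAMS * 8 / 8 / 1e9  # ~71 GB of pure 8-bit weights vs ~75 GB observed
      print(f"q4_0: ~{q4:.0f} GB raw weights, q8_0: ~{q8:.0f} GB raw weights")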

  • @px43
    @px43 10 days ago +2

    This app you made is awesome, but I've also heard people got 405B running on a MacBook, which your app says should only be possible with $100k of GPUs, even at the lowest quantization. I'd love for your site to be my go-to for ML builds, but it seems to be overestimating the requirements.
    Maybe there should be a field for speed benchmarks, and you could give people tokens per second when using various swap and external RAM options?

    • @AIFusion-official
      @AIFusion-official  9 days ago +2

      Thank you for your feedback! Running a 405 billion parameter model on a MacBook is highly unrealistic due to hardware constraints, even with extreme quantization, which can severely degrade performance. In practice, very low quantization levels like Q2 would significantly reduce precision, making the model's output much poorer compared to a smaller model running at full or half precision. Additionally, tokens per second can vary based on the length of the input and output, as well as the context window size, so providing a fixed benchmark isn't feasible. We’re considering ways to better address performance metrics and appreciate your suggestions to help improve the app!
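
      For context, a weights-only look at what 405B parameters need at different quantization levels versus typical MacBook unified-memory sizes (overhead ignored, so real needs are higher):
      # Weights-only memory for a 405B-parameter model vs. laptop unified memory.
      PARAMS_405B = 405e9
      MAC_RAM_GB = (64, 96, 128)  # common high-end MacBook configurations

      for level, bits in {"Q8": 8, "Q4": 4, "Q2": 2}.items():
          gb = PARAMS_405B * bits / 8 / 1e9
          fits = [ram for ram in MAC_RAM_GB if gb <= ram]
          print(f"{level}: ~{gb:.0f} GB of weights -> fits in configs: {fits or 'none'}")
      # Even Q2 (~101 GB) only squeezes into 128 GB, leaving little for KV cache or the OS.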

  • @Felix-st2ue
    @Felix-st2ue 2 days ago +1

    How does the 70B Q4 version compare to, let's say, the 8B version at FP32? Basically, what's more important, the number of parameters or the quantization?

    • @AIFusion-official
      @AIFusion-official  2 days ago +3

      Thank you for your question! The 70B model at Q4 has many more parameters, allowing it to capture more complex patterns, but the lower precision from quantization can reduce its accuracy. On the other hand, the 8B model at FP32 has fewer parameters but higher precision, making it more accurate in certain tasks. Essentially, it’s a trade-off: the 70B Q4 model is better for tasks requiring more knowledge, while the 8B FP32 model may perform better in tasks needing precision.
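
      The two options also end up in a similar memory footprint, which is why the choice is about quality rather than hardware; a weights-only comparison (cache and overhead excluded):
      # Weights-only footprint: a 4-bit 70B model vs. a full/half-precision 8B model.
      cases = {
          "70B @ Q4 (~4 bits/weight)": 70e9 * 0.5 / 1e9,
          "8B @ FP32 (4 bytes/weight)": 8e9 * 4.0 / 1e9,
          "8B @ FP16 (2 bytes/weight)": 8e9 * 2.0 / 1e9,
      }
      for name, gb in cases.items():
          print(f"{name}: ~{gb:.0f} GB")
      # At a similar size (~35 GB vs ~32 GB), the bigger quantized model usually
      # knows more, while the smaller full-precision model is more exact per weight.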

    • @alyia9618
      @alyia9618 1 day ago +1

      If you must do "serious" things, always prefer a bigger number of parameters (with 33B and 70B being the sweet spots), but try not to go below INT8 if you don't want your LLM to spit out "bullshit"... Loss of precision can drive accuracy down very fast and make the network hallucinate a lot; it loses cognitive power (a big problem if you are reasoning about math problems, logic problems, etc.), becomes incapable of understanding and producing nuanced text, and spells disaster for non-Latin languages (yes, the effects are magnified for non-Latin scripts). Dequantization (during inference you must go back to floating point and back again to the desired quant level) also increases the overhead.
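
      To make the precision-loss point concrete, a toy sketch of symmetric round-to-nearest quantization error at 8, 4, and 3 bits (a per-tensor scheme for illustration only, not the block-wise K-quant math real runtimes use):
      import numpy as np

      # Quantize-dequantize a fake weight tensor and measure the round-trip error.
      rng = np.random.default_rng(0)
      w = rng.normal(0, 0.02, size=100_000).astype(np.float32)

      for bits in (8, 4, 3):
          qmax = 2 ** (bits - 1) - 1
          scale = np.abs(w).max() / qmax
          w_hat = np.round(w / scale).clip(-qmax, qmax) * scale
          rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
          print(f"{bits}-bit: mean relative error ~{rel_err:.1%}")
      # Error roughly doubles with every bit removed, which is why very low
      # bit-widths hurt nuanced text and multi-step reasoning the most.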

  • @dipereira0123
    @dipereira0123 27 days ago +1

    Nice =D

  • @mohamadbazmara601
    @mohamadbazmara601 18 days ago +1

    Great, but what if we don't want to run it for just one single request? What if we have 1,000 requests per second?

    • @AIFusion-official
      @AIFusion-official  18 days ago +1

      Handling 1,000 requests per second is a massive task that would require much more than just a few GPUs. You'd be looking at a full-scale data center with racks of GPUs working together, along with the necessary infrastructure for cooling, power, and security. It’s a significant investment, and you’d need to carefully optimize the setup to ensure everything runs smoothly at that scale. In most cases, relying on cloud services or specialized AI infrastructure providers might be more practical for such heavy workloads.
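
      A hedged back-of-envelope for that scale (the per-reply and per-node numbers below are assumptions chosen only to show the arithmetic):
      # Rough sizing for 1,000 requests/second of LLM traffic.
      requests_per_s = 1_000
      tokens_per_reply = 200          # assumption: average generated tokens per reply
      tokens_per_s_per_node = 2_000   # assumption: one batched multi-GPU node's throughput

      demand = requests_per_s * tokens_per_reply
      print(f"~{demand:,} tokens/s -> ~{demand / tokens_per_s_per_node:.0f} GPU nodes")
      # ~100 nodes before redundancy and peak headroom: data-center territory, as noted above.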

  • @sinayagubi8805
    @sinayagubi8805 8 days ago +2

    Wow. Can you add tokens per second to that tool?

    • @AIFusion-official
      @AIFusion-official  8 days ago

      Thank you for your comment! Regarding the tokens per second metric, it’s tricky because the speed varies greatly based on the input length, the number of tokens in the context window, and how far along you are in a conversation (since more tokens slow things down). Giving a fixed tokens-per-second value would be unrealistic, as it depends on these factors. I’ll consider ways to offer more detailed performance metrics in the future to make the tool even more helpful. Your feedback is greatly appreciated!
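
      For anyone who wants to measure this locally instead of relying on a fixed figure, a small sketch against a local Ollama server (the eval_count/eval_duration fields are as documented in Ollama's generate API at the time of writing; verify against your version):
      import requests  # assumes Ollama is running on its default port

      def tokens_per_second(model, prompt):
          """Request one completion from a local Ollama server and derive decode speed."""
          r = requests.post(
              "http://localhost:11434/api/generate",
              json={"model": model, "prompt": prompt, "stream": False},
              timeout=600,
          )
          data = r.json()
          # eval_count = generated tokens, eval_duration = decode time in nanoseconds
          return data["eval_count"] / (data["eval_duration"] / 1e9)

      print(f"{tokens_per_second('llama3.1:8b', 'Say hello in one sentence.'):.1f} tokens/s")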

  • @Xavileiro
    @Xavileiro 15 hours ago

    And let's be honest, Llama 8B sucks really bad.

    • @AIFusion-official
      @AIFusion-official  14 hours ago

      @Xavileiro I respect your opinion, but I don't agree. Maybe you've been using a heavily quantized version. Some quantization levels reduce the model's accuracy and the quality of the output significantly. You should try the FP16 version. It is really good for a lot of use cases.