LLAMA 3.1 70b GPU Requirements (FP32, FP16, INT8 and INT4)
- Published Sep 15, 2024
- This tool allows you to choose an LLM and see which GPUs could run it... : aifusion.compa...
Welcome to this deep dive into the world of Llama 3.1, the latest and most advanced large language model from Meta. If you've been amazed by Llama 3, you're going to love what Llama 3.1 70B brings to the table. With 70 billion parameters, this model has set new benchmarks in performance, outshining its predecessor and raising the bar for large language models.
In this video, we'll break down the GPU requirements needed to run Llama 3.1 70B efficiently, focusing on different quantization methods such as FP32, FP16, INT8, and INT4. Each method offers a unique balance between performance and memory usage, and we'll guide you through which GPUs are best suited for each scenario, whether you're running inference, full Adam training, or low-rank fine-tuning.
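As a rough back-of-the-envelope version of what the video covers, here is a small sketch that estimates memory at each precision. It assumes the usual rule of thumb that weight memory is parameters times bytes per parameter, and the ~20% overhead factor for activations and KV cache is my own illustrative assumption, not a figure from the video:

```python
# Rough VRAM estimate for Llama 3.1 70B at different precisions.
# Assumption: memory ≈ parameters × bytes per parameter, plus ~20%
# overhead for activations and KV cache (illustrative rule of thumb).

PARAMS = 70e9
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * nbytes / 1e9
    total_gb = weights_gb * 1.2  # +20% overhead (assumption)
    print(f"{precision}: ~{weights_gb:.0f} GB weights, ~{total_gb:.0f} GB total")
```

This gives roughly 280 GB for FP32 weights down to about 35 GB for INT4, which is why quantization level is the deciding factor in which GPUs can run the model at all.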
To make your life easier, I’ve developed a free tool that allows you to select any large language model and instantly see which GPUs can run it at different quantization levels. You’ll find the link to this tool in the description below.
If you’re serious about optimizing your AI workloads and want to stay ahead of the curve, make sure to watch until the end. Don’t forget to like, subscribe, and hit the notification bell to stay updated with all things AI!
Disclaimer: Some of the links in this video/description are affiliate links, which means if you click on one of the product links, I may receive a small commission at no additional cost to you. This helps support the channel and allows me to continue making content like this. Thank you for your support!
Tags:
#Llama3
#MetaAI
#GPUrequirements
#QuantizationMethods
#AIModels
#LargeLanguageModels
#FP32
#FP16
#INT8
#INT4
#AITraining
#AIInference
#AITools
#llmgpu
#llm
#gpu
#AIOptimization
#ArtificialIntelligence
Let's all take a moment to appreciate how much NVIDIA kneecaps their GPUs with low VRAM to screw customers into buying more $50K GPUs. For $25K, the H100 should not come with less than 256GB of VRAM. At all.
Shame on the competitors that can't, or don't want to, offer better solutions! And there are many startups out there with innovative solutions (not GPUs but real "neural" processors)... These startups need money, but the likes of AMD, Intel, etc. instead continue with their bollocks and put out "CPUs with NPUs" that are clearly not enough to run "real" LLMs... And this is because they are playing the same game as Nvidia, trying to squeeze as much money as possible from the gullible... Sooner or later we will have machines with Graphcore IPUs or Groq LPUs, but not before the usual culprits get rich squeezing everyone.
Thanks man, I was trying to find a video like this for a long time. You saved my day!
Glad I could help
I think your video is pretty solid, but it's also missing something.
I can currently run Llama 3.1 with 0 GB of video RAM. Yes, you heard that right, 0 GB of VRAM.
How is this possible? Well, with low quantization types, similar to INT4 and INT8; in my case, more exactly: llama3.1:70b-instruct-q3_K_L.
I can run it with around 50-64 GB of RAM, and run it on my CPU.
It takes, however, roughly 2 minutes to answer: "Hey, my name is Jack, what's yours?"
What's the deal?
AI needs RAM, not specifically VRAM. VRAM is much faster, of course, but I'm using a laptop CPU (weaker than a desktop one), and one that's 3-4 years old.
After I load the model, my RAM usage jumps to around 48 GB, while normally, without the model, it sits at around 10 GB.
My point is: you don't need insane resources to run AI. As long as speed isn't the issue, you can even run it on the CPU; it just takes longer. The GPU isn't what makes AI go, the GPU only makes the AI go much faster.
I have no doubt that with 40 GB of VRAM, my Llama 3.1 would answer in 20-30 seconds instead of 2 or 2.5 minutes.
However, you can still run it on an outdated LAPTOP CPU, as long as you have enough memory. And that's the key thing here: memory.
And it doesn't have to be VRAM!!
Thank you for sharing your experience! You're absolutely right that running LLaMA 3.1 70B on a CPU with low quantization like Q3_K_L is possible with enough RAM, but it comes with trade-offs. While CPUs can handle the load, they tend to overheat more than GPUs when running large language models, which can slow down the generation even further due to throttling. So, while it's feasible, the long response times (e.g., 2 minutes for a simple query) and the potential for overheating make it impractical for real-life usage. For faster and more stable performance, GPUs with sufficient VRAM are much better suited for these tasks. Thanks again for bringing up this important discussion!
Yeah, ok, but the loss of precision from using 3-bit quantization is colossal!!! There is a reason why FP16 (or BF16) is the sweet spot for quantization, with INT8 as a "good enough" stopgap...
@@alyia9618 I could run FP16 if I add more RAM, that's my point.
And if a laptop processor can do this, A LAPTOP CPU, you can run even FP16 on a computer with 256 GB of RAM. And getting 256 GB of RAM is a looooot cheaper than getting 256 GB of VRAM.
@@serikazero128 Yes, you can run FP16 no problem, especially with AVX-512-equipped CPUs! The problem is that as you go up in parameter count, memory bandwidth becomes a huge bottleneck... This is the real problem, because the CPUs can cope with the load, especially the latest ones with integrated NPUs, and it's a no-brainer if we run the computation on the iGPUs too! Feeding all those computational units is the problem, because 2 memory channels and a theoretical max of 100 GB/s of bandwidth aren't enough... The solution the likes of Nvidia and AMD have found for now is to add HBM memory to their chips. And it's an empirically verified solution too, because we have Apple M3 chips going strong exactly because they have high-bandwidth memory on the SoCs.
Nice video.
Thanks for the info.
We are sooo gpu poor.
great tool
Man, this is the tool I was wishing for the last 3 months! Thanks Thanks Thanks!
Just got a question. I was planning to buy an RTX 4060 Ti for my new build to run some thesis work. My work is mostly on open-source small LLMs like Llama 3.1 8B, Phi-3-Medium 128K, etc. Will I be able to run those with great inference speed?
Thank you, I’m glad it’s useful! As for the RTX 4060 Ti, if you’re looking at the 8GB version, I’d actually recommend considering the RTX 3060 with 12GB instead. It’s usually cheaper and gives you more room to run models at higher quantization levels. For example, with LLAMA 3.1 8B, the RTX 3060 can run it in INT8, whereas the 4060 Ti with 8GB would only handle INT4. Just to give you some perspective, I personally use an RTX 4060 with 8GB of VRAM, and I can run LLaMA 3.1 8B in INT4 with around 41 tokens per second at the start of a conversation. So while the 4060 Ti will work, the 3060 might give you more flexibility for your thesis work with LLMs
When running Llama3.1 70b with ollama, by default it selects a version using 40GB memory. That's 70b-instruct-q4_0 (c0df3564cfe8). So that has to be int4. I guess in this case all parameters (key / value / query and feedforward weights) are int4?
Then there are intermediate sizes where probably different parameters are quantized differently?
70b-instruct-q8_0 (5dd991fa92a4) needs 75GB, presumably that's all int8?
Thank you for your insightful comment! Yes, when running LLaMA 3.1 70B with Ollama, the 70b-instruct-q4_0 version likely uses INT4 quantization, which would apply to all parameters, including key, value, query, and feedforward weights. As for intermediate quantization levels, you're correct: different parameters may be quantized to varying degrees, depending on the model version. The 70b-instruct-q8_0, needing 75GB, would indeed suggest that it's fully quantized to INT8. Each quantization level strikes a balance between memory usage and model performance.
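The file sizes quoted above can be sanity-checked against the GGUF block-quantization layout, where each group of weights carries a small scale factor on top of the quantized values. The ~4.5 and ~8.5 bits/weight figures below are approximations for the q4_0 and q8_0 formats (4 or 8 bits per weight plus a per-block scale), used here as assumptions for a rough check:

```python
# Sanity check on the quoted Ollama model sizes, assuming GGUF block
# quantization: q4_0 ≈ 4.5 bits/weight (4-bit values + per-block
# scale), q8_0 ≈ 8.5 bits/weight.

PARAMS = 70e9

def gguf_size_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

q4_0 = gguf_size_gb(4.5)  # close to the ~40 GB Ollama reports
q8_0 = gguf_size_gb(8.5)  # close to the ~75 GB quoted above
print(f"q4_0: {q4_0:.1f} GB, q8_0: {q8_0:.1f} GB")
```

Both estimates land within a couple of GB of the figures in the comment, which supports the reading that q4_0 is essentially INT4 everywhere and q8_0 essentially INT8.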
This app you made is awesome, but I've also heard people got 405b running on a MacBook, which your app says should only be possible with $100k of GPUs, even at the lowest quantization. I'd love to use your site to be my go-to for ML builds but it seems to be overestimating the requirements.
Maybe there should be a field for speed benchmarks, and you could give people tokens per second when using various swap and external ram options?
Thank you for your feedback! Running a 405 billion parameter model on a MacBook is highly unrealistic due to hardware constraints, even with extreme quantization, which can severely degrade performance. In practice, very low quantization levels like Q2 would significantly reduce precision, making the model's output much poorer compared to a smaller model running at full or half precision. Additionally, tokens per second can vary based on the length of the input and output, as well as the context window size, so providing a fixed benchmark isn't feasible. We’re considering ways to better address performance metrics and appreciate your suggestions to help improve the app!
How does the 70B Q4 version compare to, let's say, the 8B version at FP32? Basically, what's more important, the number of parameters or the quantization?
Thank you for your question! The 70B model at Q4 has many more parameters, allowing it to capture more complex patterns, but the lower precision from quantization can reduce its accuracy. On the other hand, the 8B model at FP32 has fewer parameters but higher precision, making it more accurate in certain tasks. Essentially, it’s a trade-off: the 70B Q4 model is better for tasks requiring more knowledge, while the 8B FP32 model may perform better in tasks needing precision.
If you must do "serious" things, always prefer a bigger number of parameters (with 33B and 70B being the sweet spots), but try not to go under INT8 if you want your LLM to not spit out "bullshit"... Loss of precision can drive accuracy down very fast and make the network hallucinate a lot; it loses cognitive power (a big problem if you are reasoning about math problems, logic problems, etc.), becomes incapable of understanding and producing nuanced text, and spells disaster for non-Latin languages (yes, the effects are magnified for non-Latin scripts). Dequantization (during inference you must go back to FP and back again to the desired quant level) also increases the overhead.
Nice =D
Great, but what if we want to run it not just for a single request? What if we have 1,000 requests per second?
Handling 1,000 requests per second is a massive task that would require much more than just a few GPUs. You'd be looking at a full-scale data center with racks of GPUs working together, along with the necessary infrastructure for cooling, power, and security. It’s a significant investment, and you’d need to carefully optimize the setup to ensure everything runs smoothly at that scale. In most cases, relying on cloud services or specialized AI infrastructure providers might be more practical for such heavy workloads.
wow. can you add tokens per second on that tool?
Thank you for your comment! Regarding the tokens per second metric, it’s tricky because the speed varies greatly based on the input length, the number of tokens in the context window, and how far along you are in a conversation (since more tokens slow things down). Giving a fixed tokens-per-second value would be unrealistic, as it depends on these factors. I’ll consider ways to offer more detailed performance metrics in the future to make the tool even more helpful. Your feedback is greatly appreciated!
And let's be honest, Llama 8B sucks really bad.
@Xavileiro I respect your opinion, but I don't agree. Maybe you've been using a heavily quantized version. Some quantization levels reduce the model accuracy and the quality of the output significantly. You should try the FP16 version. It is really good for a lot of use cases.