Run 70Bn Llama 3 Inference on a Single 4GB GPU
- Published 15 Sep 2024
- Code : github.com/roh...
🐦 Connect with me on Twitter : / rohanpaul_ai
Airllm Github - github.com/lyo...
Check out the MASSIVELY UPGRADED 2nd Edition of my Book (with 1300+ pages of Dense Python Knowledge) 🐍🔥
Covering 350+ Python 🐍 Core concepts ( 1300+ pages ) 🚀
🟠 Book Link - rohanpaul.gumr...
-----------------
Hi, I am a Machine Learning Engineer | Kaggle Master. Connect with me on 🐦 TWITTER: / rohanpaul_ai - for daily in-depth coverage of Large Language Model bits
----------------
You can find me here:
**********************************************
🐦 TWITTER: / rohanpaul_ai
👨🏻💼 LINKEDIN: / rohan-paul-ai
👨🔧 Kaggle: www.kaggle.com...
👨💻 GITHUB: github.com/roh...
🧑🦰 Facebook : / rohan.paul.562
📸 Instagram: / rohan_paul_2020
**********************************************
Other Playlist you might like 👇
🟠 MachineLearning & DeepLearning Concepts & interview Question Playlist - bit.ly/380eYDj
🟠 ComputerVision / DeepLearning Algorithms Implementation Playlist - bit.ly/36jEvpI
🟠 DataScience | MachineLearning Projects Implementation Playlist - bit.ly/39MEigt
🟠 Natural Language Processing Playlist : bit.ly/3P6r2CL
----------------------
#LLM #Largelanguagemodels #Llama2 #LLMfinetuning #opensource #NLP #ArtificialIntelligence #datascience #textprocessing #deeplearning #deeplearningai #100daysofmlcode #neuralnetworks #datascience #generativeai #generativemodels #OpenAI #GPT #GPT3 #GPT4 #chatgpt #genai
Good writeup - covered when it's applicable, and pros and cons. I would recommend using it on a machine w a lot of RAM, setting up a RAM disk, and using that for your cache - that would knock the latency down somewhat.
Is it possible to configure that library to use RAM instead of SSD? It would be useful if you have a computer with a lot of RAM (e.g. 64 GB), because all the layers would fit in memory in 4-bit quantization.
I was thinking the same about offloading to RAM, as it has become so cheap. However, on a quick search I could not find that option in the lib yet. Will need to investigate more. If you find it, please let me know as well.
... isn't this question ironic? Doesn't an LLM naturally load into RAM/VRAM? The whole point of this project is to switch to an actual storage drive, so you can run the 70B from the drive instead of having issues with overloading VRAM/RAM.
@@i6od Indeed, this project brings a completely new way to deal with LLMs beyond RAM/VRAM.
Really simple to do: just create a RAM drive (a simple application you can download), then put your model into the RAM drive and load it from there, and you're all set.
I would then just try to use a RAM-disk..
Using this method then, is it possible to run, say, a 350b model on a GPU with 20/24 GB VRAM?
Say, could Grok-1, which is 314b, run on a 3090/4090 using this method?
I know it would be slow af, but it could work, right?
Theoretically possible. The layered inference approach will just do sequential loading and unloading of model layers. Of course, the latency will accumulate and result in extremely slow inference.
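The layer-by-layer flow described in the reply above can be sketched with plain NumPy; the weights, shapes, and `load_layer` helper below are all stand-ins for illustration, not AirLLM's actual code:

```python
import numpy as np

def load_layer(i):
    # Stand-in for reading one transformer layer's weights from SSD.
    rng = np.random.default_rng(i)
    return rng.standard_normal((8, 8)) * 0.1

def layered_forward(x, n_layers):
    # Only one layer's weights are resident at a time: load, apply, discard.
    for i in range(n_layers):
        w = load_layer(i)      # read this layer from disk
        x = np.tanh(x @ w)     # run this layer's compute
        del w                  # free the weights before loading the next layer
    return x

out = layered_forward(np.ones(8), n_layers=4)
print(out.shape)  # (8,)
```

Peak weight memory is one layer instead of the whole model, which is exactly why latency accumulates: every layer costs a disk read on every forward pass.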
I was thinking about this for a while. I'm glad someone did it. I think if done properly you can have similar performance to having all weights in VRAM.
Indeed.
There are many comments about loading layers from RAM instead of SSD. Basically, it doesn't make sense: you will get better performance doing all the computation on the CPU. Why? Very simple: when you run an LLM on a CPU, the main bottleneck is not CPU speed but RAM bandwidth, and that is also why it is much faster to run an LLM on a GPU, which has much higher memory bandwidth. With this lib you would have to copy each layer from RAM to VRAM and then compute the layer's output on the GPU. That doesn't make sense, since your CPU can do the computation faster than the layer can be copied out of RAM. So no magic here: if you want to run a very big model and it fits in RAM, just run it on the CPU.
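The bandwidth argument in the comment above can be put into rough numbers. All figures below are ballpark assumptions (typical DDR4, PCIe 4.0 x16, and an 80-layer 70B fp16 model), not measurements:

```python
# Ballpark bandwidths in GB/s (assumed typical figures, not measured):
ddr4_ram  = 50    # dual-channel DDR4: roughly what the CPU streams weights at
pcie4_x16 = 32    # host-to-GPU copy over PCIe 4.0 x16

layer_gb = 140 / 80  # one layer of a 70B fp16 model (~140 GB over ~80 layers)

cpu_read = layer_gb / ddr4_ram   # time for the CPU to stream the layer from RAM
gpu_copy = layer_gb / pcie4_x16  # time just to copy the layer into VRAM

print(f"CPU streams the layer in {cpu_read * 1000:.0f} ms")
print(f"PCIe copy alone takes {gpu_copy * 1000:.0f} ms")
# The RAM->VRAM copy alone costs more than the CPU's read of the same data,
# so paging layers to the GPU can't beat CPU inference when the model fits in RAM.
```

The GPU's compute advantage never gets a chance to matter here, because each layer's weights are used only once per token before being evicted.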
Hi, would four 3060 12 GB GPUs run the 405b easily?
Why SSD? Why not RAM?
If I have enough RAM to hold the entire LLM, can the layers be read from RAM (RAM to VRAM)?
Google "RAM disk", still a thing in Win 11...
Yes, offloading to RAM will always be better given how cheap it is.
But this library wanted a new way to deal with LLMs, bypassing RAM/VRAM as much as possible.
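On Linux, the RAM-disk workaround several comments suggest can be set up with tmpfs. The mount point and size below are illustrative; pointing `HF_HOME` at it makes Hugging Face tooling read model shards from RAM instead of the SSD:

```shell
# Create a RAM-backed filesystem (needs root; size it to fit your model).
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=48G tmpfs /mnt/ramdisk

# Point the Hugging Face cache at it so model files are served from RAM.
export HF_HOME=/mnt/ramdisk/hf
```

Note the caveat raised elsewhere in this thread: this only speeds up the layer *reads*; the per-layer RAM-to-VRAM copies remain.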
Damn, imagine the same optimization for an 8B model, the speeds would rival Groq.
Yes, but actual speed may not improve much as you still have to do Disk IO. So you will always be bottlenecked by your SSD read speed.
Exactly. The smaller the model, the higher the proportional IO overhead compared to compute: at 8B, paging memory in and out like this makes it far slower than it is right now. Compute time grows much faster with model size than the IO does, so large models are so slow in compute that IO overheads like these become negligible. But there are interesting developments like vLLM, which uses something like virtual memory management to pack a very large model into small GPU memory, skipping the need for disk IO speed entirely, since everything stays in memory on the graphics card.
@@dinoscheidt very well explained. Thanks.
To my understanding, Groq uses a similar trick, as their LPU has only 250 MB (yes, MB) of memory.
@@dinoscheidt can't wait for vLLM Llama 3 400B. For a few years I've been hoping for something like that, then really top level AI can be run locally by anyone with a reasonable computer and ok graphics card... Will be amazing for productivity, gaming, etc.
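The SSD bottleneck in this thread can be sanity-checked with quick arithmetic: each generated token touches every layer once, so the disk must stream the whole (quantized) model per token. The figures below are assumptions for a rough lower bound, not measurements:

```python
# Assumed figures: 70B parameters at 4-bit precision, a mid-range NVMe drive.
model_gb  = 70e9 * 0.5 / 1e9  # ~35 GB on disk at 4 bits per parameter
nvme_gbps = 3.5               # assumed NVMe sequential read speed, GB/s

# Every token requires reading all layers once from disk.
sec_per_token = model_gb / nvme_gbps
print(f"~{sec_per_token:.0f} s/token lower bound from disk bandwidth alone")
```

This order-of-magnitude estimate is why the reported speeds are in seconds per token, and why the bound only moves with faster storage or by caching layers in VRAM, as discussed below.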
The question is: if we have more VRAM, like 16 or 24 GB, can that be used to mitigate the SSD bottleneck? Maybe that way it can read not just one layer but multiple, and be even faster.
Yes, I think it's possible, i.e. you can manage the number of layers allocated to the GPU, reducing the frequency of SSD reads.
Here's the long answer.
- In the current implementation of their code (check the GitHub repo), the `AirLLMBaseModel` class in `airllm_base.py` loads and processes one layer at a time during the forward pass. However, you can modify the `forward` method to load and cache a certain number of layers based on the available GPU memory.
For example, you can introduce a configuration parameter to specify the number of layers to cache in GPU memory. Then, in the `forward` method, you can load and store the layers in a cache until the specified number of layers is reached. When processing the next layer, you can check if it is already in the cache before loading it from SSD.
Here's a simplified example of how you could modify the `forward` method to cache multiple layers:
```python
def forward(self, ...):
    ...
    # Assumes self.cached_layers = {} was initialized in __init__
    cached_layers = []       # names of cached layers, oldest first
    max_cached_layers = 4    # maximum number of layers to cache
    for i, (layer_name, layer) in enumerate(zip(self.layer_names, self.layers)):
        if layer_name in cached_layers:
            # Layer is already cached, use it directly
            layer = self.cached_layers[layer_name]
        else:
            # Load the layer from SSD and add it to the cache
            state_dict = self.load_layer_to_cpu(layer_name)
            self.move_layer_to_device(state_dict)
            cached_layers.append(layer_name)
            self.cached_layers[layer_name] = layer
            # Remove the oldest cached layer if the cache size exceeds the maximum
            if len(cached_layers) > max_cached_layers:
                oldest_layer = cached_layers.pop(0)
                del self.cached_layers[oldest_layer]
        # Process the layer
        ...
```
In this example, the `max_cached_layers` variable determines the maximum number of layers to cache in GPU memory. The `cached_layers` list keeps track of the currently cached layers. When processing a layer, it first checks if it is already cached. If not, it loads the layer from SSD, adds it to the cache, and removes the oldest cached layer if the cache size exceeds the maximum.
- By caching multiple layers in GPU memory, you can reduce the number of SSD reads required during inference.
Additionally, you may need to handle the case where a single layer itself exceeds the available GPU memory. In such scenarios, you might need to explore other techniques like tensor parallelism or model sharding to distribute the layer across multiple GPUs or devices.
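The eviction bookkeeping described above can also be written as a small standalone cache with least-recently-used eviction instead of oldest-first. This is a generic, self-contained sketch (the `loader` callable and layer names are hypothetical), not AirLLM's actual code:

```python
from collections import OrderedDict

class LayerCache:
    """Keep at most max_layers loaded; evict the least recently used."""
    def __init__(self, loader, max_layers=4):
        self.loader = loader        # callable: layer_name -> layer weights
        self.max_layers = max_layers
        self._cache = OrderedDict()

    def get(self, name):
        if name in self._cache:
            self._cache.move_to_end(name)    # mark as most recently used
            return self._cache[name]
        layer = self.loader(name)            # e.g. read weights from SSD
        self._cache[name] = layer
        if len(self._cache) > self.max_layers:
            self._cache.popitem(last=False)  # drop the least recently used
        return layer

# Demo with a loader that records which layers actually hit "disk".
loads = []
cache = LayerCache(loader=lambda n: loads.append(n) or n.upper(), max_layers=2)
cache.get("layer0")
cache.get("layer1")
cache.get("layer0")  # cache hit: no disk read
cache.get("layer2")  # evicts layer1 (least recently used), keeps layer0
print(loads)  # ['layer0', 'layer1', 'layer2']
```

LRU is usually a better fit than oldest-first when the same layers are revisited every decoding step in the same order.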
Isn't there a paper from Apple in 2023 doing this? The difference is that Apple targets efficient reading of chunks specifically on its own hardware. YouTube censored my previous comment where I pasted the link and title of the paper :/
Yes I think you are talking about "LLM in a Flash: Efficient Large Language Model Inference with Limited Memory"
twitter.com/rohanpaul_ai/status/1737425137451073573
Have you tried it on a RAM disk? If so, could you make another video comparing performance?
No haven't tried on that yet, but will try.
I'm completely new to this. Is this something I can use with oobabooga?
Seconding this question, as I'm interested in the answer too.
So my only question is can this be integrated with ollama?
Don't think Ollama supports this.
How much disk space would this all need?
You just need to be able to accommodate the entire model on your SSD.
Can it run on an iGPU?
If it can do compute, it will work. iGPUs are GPUs too. All it needs is the compute functions (not all GPUs have them).
How many tokens per second would you get though?
Depends on SSD read speed. It may vary, but on Mac hardware I was getting 1 token per 2 seconds.
@@RohanPaul-AI is this a regular SSD or an NVMe?
@@Gatrehs it's NVMe
Is it possible to run it offline?
Yes, you can use the locally downloaded model's local path like below
model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")
@@RohanPaul-AI Thank you! ❤