What's the point of this? Is AirLLM meant to be used when you also don't have enough regular RAM for a 70B model and inference speed isn't important? I'm wondering because just about every LLM tool nowadays can run models in RAM, and most support offloading layers so the GPU VRAM is used as much as possible, though performance rapidly declines the more layers are offloaded (stored in normal RAM).
Thanks.
"Offloading" a LLM layer means moving it from RAM to VRAM.
From what I see in this video, AirLLM uses the PagedAttention technique and model compression. PagedAttention is a technique for training a LLM so that during inference, any of its layers can be paged in and out of VRAM. Only models trained using PagedAttention can have each of their layers paged separately during inference.
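Roughly, the layer-by-layer idea looks like the minimal sketch below. It's an illustration of the concept only, not AirLLM's actual code; `layer_files` and `build_layer` are hypothetical placeholders for per-layer weight files and a constructor for an empty layer module.

```python
# Minimal sketch of layer-by-layer ("paged") inference: only one layer lives
# in VRAM at any moment. Not AirLLM's real implementation; layer_files and
# build_layer are placeholders.
import torch

def layered_forward(hidden_states, layer_files, build_layer, device="cuda"):
    """Run hidden_states through each layer while keeping one layer in VRAM at a time."""
    for path in layer_files:
        layer = build_layer()                                  # construct an empty layer module
        layer.load_state_dict(torch.load(path, map_location="cpu"))
        layer.to(device)                                       # page this layer into VRAM
        with torch.no_grad():
            hidden_states = layer(hidden_states)
        layer.to("cpu")                                        # page it back out
        del layer
        torch.cuda.empty_cache()                               # free VRAM before the next layer
    return hidden_states
```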
Nice find! Right now, running the model layer by layer like this is a trade-off, but for large models the trade-off is worth it.
With a fully structured prompt you would only need a single execution of the model to produce a full output; as you'd also notice, this wouldn't be useful for random chat with the model.
Hence it makes sense to practice the prompt on a smaller model before sending it to the super-large model.
Some prompts can be highly complex. Building an app, for example, isn't a simple "build a game called Snake" request:
there's a whole brief required. The initial prompt to build the game would be highly specific, covering the workforce needed to design and build the application professionally using a scrum/waterfall process, with development and testing,
as well as full documentation and streamlining. The model may need agents checking the job, searching for information, etc., hence the one-shot prompt.
The output produced would then be not only the built and tested app, but the documentation, installers, and so on.
So you'd want the system fully connected to your chains, agents, RAG, Open Interpreter, etc. I would suggest some smaller models for those tasks and letting this large model use them through an API, since deploying agents behind API surfaces and letting them collaborate on a project gets a lot of work done: a small home network with agents deployed on different stations. The full AI surface could be created using VMware or multiple Docker instances plus a multi-machine network setup, and then your main AI pings the "fat controller". (Often the super-large language models are just various configurations of LLMs, i.e. MoE/MoA, not the real parameter count!) This layer-by-layer technique is pretty fast, but the problem was that you can see the verbose output, which mentally slows down the response.
(Obviously it does not keep the object in memory, so it has to reload each layer every time!)
Yes
*Great video! AIs are advancing so fast day by day that it's impossible to test every new thing! hahaha This project is fantastic, because we're talking about giant models; if we scale down, a 13B model could run almost perfectly with this technique on machines that otherwise couldn't run it. As someone who wants to see AIs run better on CPU alone, without a GPU, I can see the full power of this project!*
Bro, the TL;DR??
Thanks for the insights
1050 Ti here. Makes things interesting, to say the least. Seeing some of the comments, I'll have to compare this against LM Studio's layer offloading, with and without paged attention, to see if there's a difference on my "entry-level" card.
sure thanks
Awesome find! Thank you
sure, thanks
Can the same idea be run with integrated graphics?
Thank you for this!
Glad it was helpful!
Fahd, could you make a tutorial about fine-tuning Command-R, since it's the only viable LLM for non-European languages? I couldn't find YouTube tutorials about fine-tuning Command-R from Cohere!
Would have to check.
So do you think a GPU with 8GB of VRAM, like an RTX 4060, is enough if we want to start trying to run AI locally?
Would have to check.
Plenty. It just depends on what you want to run. I use a 1050 Ti, the first GPU with CUDA support AFAIK. It absolutely has its limits, but the answer to your question is yes. If you need more VRAM you can always look into remote GPU usage, but I'm inferring that you're not at that project point.
I have Llama 3 70B running on an i9 CPU with 32 threads using ollama, but this is an interesting alternative.
I do have 128GB of RAM and a mobile RTX 4090 with 16GB of VRAM, and I'm thinking that if I copy the Llama 3 70B files onto a RAM drive, the copy I/O to VRAM might be faster than the SSD file I/O, so I might get better performance with AirLLM than with the 32 CPU threads.
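If you want to test that, a minimal sketch (Linux only, paths are placeholders) is to stage the downloaded shards on /dev/shm, which is RAM-backed, and load from there. One caveat: a full fp16 70B checkpoint is roughly 140GB, which won't fit in 128GB of RAM, so this is more realistic with quantized or compressed shards or a smaller model.

```python
# Sketch: copy model files onto a RAM-backed tmpfs (/dev/shm) so layer loads
# read from RAM instead of the SSD. Paths below are placeholders.
# Caveat: fp16 70B weights (~140GB) exceed 128GB of RAM; use quantized shards.
import shutil
from pathlib import Path

src = Path.home() / ".cache/huggingface/hub/models--meta-llama--Meta-Llama-3-70B-Instruct"
dst = Path("/dev/shm/llama3-70b")

if not dst.exists():
    shutil.copytree(src, dst)  # one-time copy from SSD into RAM

print(f"Point the loader at: {dst}")
```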
Sure thanks.
Roughly how many tokens per second can you get with that setup? Is it around 0.5-1 tokens per second?
@@autoboto Did you manage to make it work by offloading to RAM?
ValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
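For what it's worth, this error usually means the installed transformers version predates the `llama3` rope_type introduced with Llama 3.1, so its config validator only accepts the old two-field `rope_scaling` schema. The clean fix is upgrading transformers (`pip install -U transformers`). The snippet below is a stopgap some people use: it rewrites the downloaded config.json to the legacy schema. The path is a placeholder, and the change alters the scaling behaviour, so it can hurt long-context quality.

```python
# Stopgap for the rope_scaling ValueError: rewrite the model's config.json to
# the legacy two-field {type, factor} schema that older transformers accept.
# Upgrading transformers is the proper fix; this workaround may degrade
# long-context quality.
import json
from pathlib import Path

cfg_path = Path("/path/to/downloaded/model/config.json")  # placeholder path

cfg = json.loads(cfg_path.read_text())
cfg["rope_scaling"] = {"type": "linear", "factor": 8.0}   # legacy two-field format
cfg_path.write_text(json.dumps(cfg, indent=2))
```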
How much RAM and hard disk space does AirLLM need for a 70B model?
There are similar videos on the channel; please search there, thanks.
Enabling layer batching, e.g. keeping around 20 layers in memory at a time, gives better results.
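As a rough illustration of that idea (not AirLLM's actual API; `layer_files` and `build_layer` are hypothetical placeholders for per-layer weight files and a layer constructor), paging groups of layers instead of single layers trades extra VRAM for fewer load/unload round trips:

```python
# Sketch of the "batch of layers" variant: page groups of layers into VRAM
# instead of one at a time. layer_files and build_layer are placeholders.
import torch

def grouped_forward(hidden_states, layer_files, build_layer, group_size=20, device="cuda"):
    for start in range(0, len(layer_files), group_size):
        group = []
        for path in layer_files[start:start + group_size]:
            layer = build_layer()
            layer.load_state_dict(torch.load(path, map_location="cpu"))
            group.append(layer.to(device))        # page a whole group into VRAM
        with torch.no_grad():
            for layer in group:
                hidden_states = layer(hidden_states)
        for layer in group:
            layer.to("cpu")                       # page the group back out
        del group
        torch.cuda.empty_cache()                  # free VRAM before the next group
    return hidden_states
```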
cheers
Great, but it doesn't have a GUI and can't be trained locally.
ok thanks for feedback
Old era 😂
Run this AirLLM entirely in RAM; it should be very fast.
sure
I have 4000MB of VRAM. How do I run a model on that?
I just did a video yesterday
@@fahdmirza The Vega has one gigabyte, and I allocated 3GB from RAM, but it did not work.
Without this method you can run a 13B LLM with 8GB of VRAM. If that's not enough, you can run it on CPU and GPU (VRAM) together.
ok
@@fahdmirza In your opinion, is it possible to run a 13B language model on CPU together with GPU? I want to buy a Vega with 8GB. Thank you.
Too slow to be useful. What is the use case?
In principle, just making a giant model work at all with so little GPU, when it was previously impossible, and since we know the power of that, we hope that people capable of pushing the project forward can at least speed up the layer loading time, because that would give us a much faster response...
Would have to check.
Nothing. ChatGPT is free and fast.
So, in short: don't. Just use Groq.
cheers
Not really faster than CPU and RAM.
Thanks for the feedback.