What's the point of this? Is AirLLM meant to be used when you also don't have enough regular RAM for a 70B model and inference speed isn't important? I'm wondering because just about every LLM tool nowadays can run models in RAM, and most support offloading layers so the GPU VRAM is used as much as possible, though performance rapidly declines the more layers are offloaded (stored in normal RAM).
Thanks.
"Offloading" a LLM layer means moving it from RAM to VRAM.
From what I see in this video, AirLLM uses the PagedAttention technique and model compression. PagedAttention is a technique for training a LLM so that during inference, any of its layers can be paged in and out of VRAM. Only models trained using PagedAttention can have each of their layers paged separately during inference.
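Roughly, the layer-by-layer idea looks like the minimal sketch below. It's an illustration of the concept only, not AirLLM's actual code; `layer_files` and `build_layer` are hypothetical placeholders for per-layer weight files and a constructor for an empty layer module.

```python
# Minimal sketch of layer-by-layer ("paged") inference: only one layer lives
# in VRAM at any moment. Not AirLLM's real implementation; layer_files and
# build_layer are placeholders.
import torch

def layered_forward(hidden_states, layer_files, build_layer, device="cuda"):
    """Run hidden_states through each layer while keeping one layer in VRAM at a time."""
    for path in layer_files:
        layer = build_layer()                                  # construct an empty layer module
        layer.load_state_dict(torch.load(path, map_location="cpu"))
        layer.to(device)                                       # page this layer into VRAM
        with torch.no_grad():
            hidden_states = layer(hidden_states)
        layer.to("cpu")                                        # page it back out
        del layer
        torch.cuda.empty_cache()                               # free VRAM before the next layer
    return hidden_states
```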
Nice find! Right now, running the model layer by layer like this is a trade-off, but for large models the trade-off is worth it.
With a fully structured prompt you would only need a single execution of the model to produce a full output; as you'd also notice, this wouldn't be useful for random chat with the model.
Hence it makes sense to practice the prompt on a smaller model before sending it to the super-large model.
Some prompts can be highly complex. Building an app, for example, isn't a simple "build a game called Snake" request:
there's a whole brief required. The initial prompt to build the game would be highly specific, covering the workforce needed to design and build the application professionally using a scrum/waterfall process, with development and testing,
as well as full documentation and streamlining. The model may need agents checking the job, searching for information, etc., hence the one-shot prompt.
The output produced would then be not only the built and tested app, but the documentation, installers, and so on.
So you'd want the system fully connected to your chains, agents, RAG, Open Interpreter, etc. I would suggest some smaller models for those tasks and letting this large model use them through an API, since deploying agents behind API surfaces and letting them collaborate on a project gets a lot of work done: a small home network with agents deployed on different stations. The full AI surface could be created using VMware or multiple Docker instances plus a multi-machine network setup, and then your main AI pings the "fat controller". (Often the super-large language models are just various configurations of LLMs, i.e. MoE/MoA, not the real parameter count!) This layer-by-layer technique is pretty fast, but the problem was that you can see the verbose output, which mentally slows down the response.
(Obviously it does not keep the object in memory, so it has to reload each layer every time!)
Yes
*Great video! AIs are advancing so fast day by day that it's impossible to test every new thing! hahaha This project is fantastic, because we're talking about giant models; if we scale down, a 13B model could run almost perfectly with this technique on machines that otherwise couldn't run it. As someone who wants to see AIs run better on CPU alone, without a GPU, I can see the full power of this project!*
Bro, the TL;DR??
Thanks for the insights
1050 Ti here. Makes things interesting, to say the least. Seeing some of the comments, I'll have to compare this against LM Studio's layer offloading, with and without paged attention, to see if there's a difference on my "entry-level" card.
sure thanks
Awesome find! Thank you
sure, thanks
Can the same idea be run with integrated graphics?
Thank you for this!
Glad it was helpful!
Fahd, could you make a tutorial about fine-tuning Command-R, since it's the only viable LLM for non-European languages? I couldn't find YouTube tutorials about fine-tuning Command-R from Cohere!
Would have to check.
So do you think a GPU with 8GB of VRAM, like an RTX 4060, is enough if we want to start trying to run AI locally?
Would have to check.
Plenty. It just depends on what you want to run. I use a 1050 Ti, the first GPU with CUDA support AFAIK. It absolutely has its limits, but the answer to your question is yes. If you need more VRAM you can always look into remote GPU usage, but I'm inferring that you're not at that project point.
I have Llama 3 70B running on an i9 CPU with 32 threads using ollama, but this is an interesting alternative.
I do have 128GB of RAM and a mobile RTX 4090 with 16GB of VRAM, and I'm thinking that if I copy the Llama 3 70B files onto a RAM drive, the copy I/O to VRAM might be faster than the SSD file I/O, so I might get better performance with AirLLM than with the 32 CPU threads.
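If you want to test that, a minimal sketch (Linux only, paths are placeholders) is to stage the downloaded shards on /dev/shm, which is RAM-backed, and load from there. One caveat: a full fp16 70B checkpoint is roughly 140GB, which won't fit in 128GB of RAM, so this is more realistic with quantized or compressed shards or a smaller model.

```python
# Sketch: copy model files onto a RAM-backed tmpfs (/dev/shm) so layer loads
# read from RAM instead of the SSD. Paths below are placeholders.
# Caveat: fp16 70B weights (~140GB) exceed 128GB of RAM; use quantized shards.
import shutil
from pathlib import Path

src = Path.home() / ".cache/huggingface/hub/models--meta-llama--Meta-Llama-3-70B-Instruct"
dst = Path("/dev/shm/llama3-70b")

if not dst.exists():
    shutil.copytree(src, dst)  # one-time copy from SSD into RAM

print(f"Point the loader at: {dst}")
```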
Sure thanks.
Roughly how many tokens per second can you get with that setup? Is it around 0.5-1 tokens per second?
@@autoboto Did you manage to make it work by offloading to RAM?
ValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
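For what it's worth, this error usually means the installed transformers version predates the `llama3` rope_type introduced with Llama 3.1, so its config validator only accepts the old two-field `rope_scaling` schema. The clean fix is upgrading transformers (`pip install -U transformers`). The snippet below is a stopgap some people use: it rewrites the downloaded config.json to the legacy schema. The path is a placeholder, and the change alters the scaling behaviour, so it can hurt long-context quality.

```python
# Stopgap for the rope_scaling ValueError: rewrite the model's config.json to
# the legacy two-field {type, factor} schema that older transformers accept.
# Upgrading transformers is the proper fix; this workaround may degrade
# long-context quality.
import json
from pathlib import Path

cfg_path = Path("/path/to/downloaded/model/config.json")  # placeholder path

cfg = json.loads(cfg_path.read_text())
cfg["rope_scaling"] = {"type": "linear", "factor": 8.0}   # legacy two-field format
cfg_path.write_text(json.dumps(cfg, indent=2))
```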
How much RAM and hard disk space does AirLLM need for a 70B model?
There are similar videos on the channel; please search there, thanks.
Enabling layer batching, e.g. keeping around 20 layers in memory at a time, gives better results.
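As a rough illustration of that idea (not AirLLM's actual API; `layer_files` and `build_layer` are hypothetical placeholders for per-layer weight files and a layer constructor), paging groups of layers instead of single layers trades extra VRAM for fewer load/unload round trips:

```python
# Sketch of the "batch of layers" variant: page groups of layers into VRAM
# instead of one at a time. layer_files and build_layer are placeholders.
import torch

def grouped_forward(hidden_states, layer_files, build_layer, group_size=20, device="cuda"):
    for start in range(0, len(layer_files), group_size):
        group = []
        for path in layer_files[start:start + group_size]:
            layer = build_layer()
            layer.load_state_dict(torch.load(path, map_location="cpu"))
            group.append(layer.to(device))        # page a whole group into VRAM
        with torch.no_grad():
            for layer in group:
                hidden_states = layer(hidden_states)
        for layer in group:
            layer.to("cpu")                       # page the group back out
        del group
        torch.cuda.empty_cache()                  # free VRAM before the next group
    return hidden_states
```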
cheers
Great, but it doesn't have a GUI and can't be trained locally.
ok thanks for feedback
Old era 😂
Run this AirLLM entirely in RAM; it should be very fast.
sure
I have 4000MB of VRAM. How do I run a model on that?
I just did a video yesterday
@@fahdmirza The Vega has one gigabyte, and I allocated 3GB from RAM, but it did not work.
Without this method you can run a 13B LLM with 8GB of VRAM. If that's not enough, you can run it on CPU and GPU (VRAM) together.
ok
@@fahdmirza In your opinion, is it possible to run a 13B language model on CPU together with GPU? I want to buy a Vega with 8GB. Thank you.
Too slow to be useful. What is the use case?
In principle, just making a giant model work at all with so little GPU, when it was previously impossible, and since we know the power of that, we hope that people capable of pushing the project forward can at least speed up the layer loading time, because that would give us a much faster response...
Would have to check.
Nothing. ChatGPT is free and fast.
So, in short: don't. Just use Groq.
cheers
Not really faster than CPU and RAM.
Thanks for the feedback.