There's a 7B model, which takes up only about 4GB of memory. I wasn't sure whether the 7B model would work at the time, though, because there had been a breaking change in this project. So the authors not only run the models in C++ but also make them smaller (via quantization).
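To see why the ~4GB figure is plausible, here's a rough back-of-envelope sketch. It assumes roughly 4.5 bits per weight (4-bit quantization plus per-block scale factors); the exact overhead depends on the quantization format, so treat the numbers as an estimate, not a spec.

```cpp
// Back-of-envelope memory estimate for a 7B-parameter model.
// Assumption: ~4.5 bits per weight (4-bit quant + scales), not an exact spec.
#include <cstdio>

int main() {
    const double params          = 7e9;   // 7 billion parameters
    const double bits_per_weight = 4.5;   // assumed quantized size per weight
    const double quant_bytes     = params * bits_per_weight / 8.0;
    const double fp16_bytes      = params * 2.0;  // 2 bytes per weight at fp16

    std::printf("quantized weights: ~%.2f GB\n", quant_bytes / 1e9);  // ~3.9 GB
    std::printf("fp16 weights:      ~%.2f GB\n", fp16_bytes  / 1e9);  // ~14 GB
    return 0;
}
```

The fp16 line shows why the unquantized model is out of reach for most consumer GPUs, while the quantized version fits comfortably in ordinary system RAM.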
Because otherwise you would have to use a GPU, which has its own memory (VRAM) separate from system RAM. You can also get more system RAM for less money than multiple GPUs. Most consumer GPUs have very little VRAM, typically 4-8GB, which usually isn't enough. That said, GPU inference is much, much faster than CPU inference, since you get massively parallel compute for the matrix multiplications behind each next-token prediction.
Everyone is saying it's because this way you can load the model into regular RAM, but if I'm not mistaken PyTorch can already do that, so you don't need to reimplement everything in C++ just to control where the model is loaded. I think the real difference is that you do need to reimplement things if you want to use custom formats (like the ggml format here) and control how they are handled at a low level for better efficiency, so I guess that's the main reason. See the sketch below.
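For illustration, here's a generic sketch of what "load the weights straight into system RAM" can look like in C++ using mmap. This is not the actual ggml loader, just the general idea of handling a custom binary format directly, with no Python runtime or GPU involved.

```cpp
// Generic sketch: memory-map a weights file into system RAM (POSIX).
// NOT the actual ggml loader; it only illustrates low-level control
// over a custom binary format.
#include <cstdio>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <weights-file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { std::perror("fstat"); close(fd); return 1; }

    // Map the whole file read-only; pages are pulled into RAM on demand.
    void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

    // A real loader would now parse the header (magic number, tensor
    // shapes, quantization type) and set up pointers into the mapping.
    const uint32_t first_word = *static_cast<const uint32_t*>(data);
    std::printf("mapped %lld bytes, first 4 bytes: 0x%08x\n",
                static_cast<long long>(st.st_size), first_word);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```

The point is that once you own the file format and the loading path, you decide exactly how and where the weights live in memory, which is hard to do through a high-level framework.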
Is there a particular reason why they ported the model to C++ (newbie question), other than to make the model smaller?
C++ allows the entire model to be loaded into regular RAM. This is helpful for those of us without beefy GPUs.
How do I quit the chat? I asked the AI and it said Ctrl+T, but that doesn't work. In the end I just closed the prompt window, but I think there must be a proper way to quit?
Just press Ctrl+C two or three times (in case the prompt doesn't catch the first one); that sends the interrupt signal (SIGINT) on Linux.
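For context, here's a minimal sketch of how a chat CLI might treat Ctrl+C: the first press asks the generation loop to stop, a second press exits immediately. This is generic POSIX-style signal handling for illustration only, not the project's actual code.

```cpp
// Minimal sketch: first Ctrl+C (SIGINT) requests a clean stop,
// a second Ctrl+C exits immediately. Illustration only.
#include <csignal>
#include <cstdio>
#include <cstdlib>

static volatile sig_atomic_t g_interrupts = 0;

static void on_sigint(int) {
    g_interrupts = g_interrupts + 1;
    if (g_interrupts > 1) {
        std::_Exit(130);  // conventional exit status for SIGINT
    }
}

int main() {
    std::signal(SIGINT, on_sigint);
    std::printf("generating... press Ctrl+C to stop (twice to force quit)\n");
    while (g_interrupts == 0) {
        // the token generation loop would run here; we just spin for the demo
    }
    std::printf("\ninterrupted, shutting down cleanly\n");
    return 0;
}
```

This is why pressing Ctrl+C more than once is a reasonable habit: if the program's own handler doesn't react, the repeated signal (or the shell) usually terminates it anyway.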
Did you slow down this video?
Nope. It's the original speed.
How did you convert the model? (.tmp?) I get a "too old, regenerate your model files or convert..." error when trying to use it.
I followed the comment at github.com/ggerganov/llama.cpp/issues/382#issuecomment-1479091459 to convert the model.
But I've noticed there are some newer Alpaca-LoRA projects with a more user-friendly setup, like github.com/nomic-ai/gpt4all. Maybe you can try one of those.
Your computer must be slow.