I need to say I’ve learned more from your videos than from any other channel. Thank you! Fantastic job. 🎉
Hi Matt, this was a very useful explanation. It may have just scratched the surface of parameter tuning, but you clarified the difference between the maximum, default, and explicitly set context window sizes (via the Modelfile), which is great and answers some of the questions I had asked in other video comments. However, I would still propose a dedicated session explaining in detail how to balance context window size and other parameters, considering RAM usage. It's important to be clear that context window length impacts the amount of RAM required (likely in gigabytes), and since RAM is a shared resource dependent on model parameter size, a specific session in an Ollama course explaining all these trade-offs would be valuable.
Another related suggestion is to dedicate a session in the Ollama course to models designed for conversations (chat). The conversation capabilities of each model depend on specific fine-tuning for conversational tasks (as opposed to one-shot question-answer tasks), and conversations involve context window RAM.
The best overview of Ollama parameters.
Matt, could you share your thoughts on the concept of 'context caching' as employed by Gemini and Anthropic? How might it impact Ollama?
Shoutout to Matt for another fantastic video! 👍 I've been exploring Ollama recently, pushing it with larger documents, and was blown away by what my updated setup can handle. Using Mistral-Nemo's 128k context on dual 16GB RTX 4060 Ti cards, here's what I found:
- Single GPU: a 7168 context size gave a reliable 20 tokens/second, but with a 32k context it plummeted to just 5 tokens/second.
- Dual GPUs: with the same 32k context, I'm now cruising at around 19 tokens per second and there's still headroom to increase it further!
If the model needs more memory, then the two GPUs help. But if it fits in one, then the second GPU will slow it down.
Thanks to the professional and interesting way you tell your stories, you'll keep gaining subscribers. I wish you an early 1,000,000!
hands down the best video on the subject matter
This is excellent, very informative and useful - qualities that output from other channels frequently lacks.
(I feel inspired to RTFM!)
Great to hear! Thanks
@technovangelist it is refreshing to hear someone who actually has knowledge of the subject talking about it without having to resort to hype and hyperbole
A case of "garbage in, garbage out", or is the model simply trained better? :)
Thank you for the great explanation about the parameters. Hope you can share some practical examples from your experience. Best for coding, chat, research and so on.
Wow. I didn't even know some of that stuff existed. Thanks Matt!
Great video, Matt!
This was very helpful, sir. More like this, please. Thank you.
Very good explanation and very good video presentation. I congratulate you
Thank you for making this video. It was educational and I subbed.
Best video on the topic ever, on point and understandable. Thanks!
Excellent explanation! Thank you.
Thanks very much ❤🌹
Great tutorial 🙏🙏🙏
Another great video! Thank you.
THANK YOU MY MAN, I needed this.
This is gold. Thank you for explaining this topic.
Great video as usual
Glad you enjoyed it
I use the Python lib with num_ctx set to 128000, and I get 1012 tokens back with the context cut off. I have 192GB of RAM on macOS. Could there be a bug? I use options=dict(num_ctx=128000)
Perhaps a better question for either of the discords. Links in the description
@technovangelist I'm not on Discord. I'll ask this on the Python Ollama GitHub site.
Pretty easy to join. But what's the code you are using? Just for that call, not the whole program.
Great work explaining all this thanks!❤🎉
this is an outstandingly useful video, really well made. thank you Matt!
I have not been able to find a way to set these parameters in the Ollama Python module. did I miss it, or is that simply not possible?
Absolutely it is possible. I’ll come up with an example.
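For reference, here is a minimal sketch of what that can look like with the ollama Python package. This is an assumption-laden example rather than the official one: it assumes pip install ollama and that llama3.1 has been pulled, and the option names mirror the Modelfile PARAMETER names.

```python
# Minimal sketch: passing Ollama parameters per request from Python.
# Assumes `pip install ollama` and that llama3.1 has already been pulled.
import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Summarize the rules of chess in one sentence."}],
    options={
        "num_ctx": 8192,      # context window size for this request
        "temperature": 0.7,   # randomness of token selection
        "num_predict": 256,   # cap on the number of generated tokens
    },
)
print(response["message"]["content"])
```

The same options dict can be passed to ollama.generate() as well.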
Lots of info done quick, clear with examples! very cool cheers. 🦙
Great video, Matt and thanks for doing this! Question for you. Setting the temperature close to 0 or 0 altogether is a way to get more 'deterministic' behaviour regarding token generation by changing the probability distribution of the next token. You also mentioned that you can set the seed for the random number generator making the LLM essentially deterministic as the token generation will be the same given the start token (if I understand correctly). How does temperature and seed relate to each other?
Seed is the important value for getting more deterministic output, but at that point your better option is just pulling from a DB. It will always return the same value and be many times faster.
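For example, a sketch of the idea with the Python client (assumes pip install ollama and a pulled llama3.1): with temperature 0 and a fixed seed, repeated calls should come back identical.

```python
# Sketch: temperature controls how sharp the sampling distribution is,
# while seed fixes the random draws, so repeated runs become repeatable.
# Assumes `pip install ollama` and that llama3.1 has been pulled.
import ollama

opts = {"temperature": 0, "seed": 42}

outputs = [
    ollama.generate(
        model="llama3.1",
        prompt="Name three primary colors.",
        options=opts,
    )["response"]
    for _ in range(2)
]
print(outputs[0] == outputs[1])  # expected: True, identical generations
```

With a nonzero temperature the fixed seed still pins down which random path the sampler takes, so the output stays reproducible while remaining "creative".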
I run llama3.1 via Ollama. My GPU VRAM is 48GB, but Ollama is only using 7562MB (I checked with the nvidia-smi command on Ubuntu).
How can I fully use GPU memory and increase the output speed?
Hi, thanks for the video. Can we perform Monte Carlo dropout on a model using Ollama?
I have absolutely NO idea
Could you explain where I type in the "FROM llama3.1
PARAMETER num_ctx 131072"?
You look to be in some sort of terminal.
Thanks
In the Modelfile you use with ollama create to make a new model.
Thank you.
For anyone that doesn't know how to do this on Windows:
1. Open your IDE, or even Notepad.
2. Type
FROM llama3.1
PARAMETER num_ctx 131072
3. Save it as Modelfile, without an extension, into the folder where your Ollama files are (I suggest searching for Ollama on your Windows machine to be sure).
4. Go to the folder where you saved the file (assuming you do this in File Explorer) and type CMD into the address bar of that window to open a Command Prompt in the right location.
5. Type ollama create mybiggerllama3.1 -f ./Modelfile
This creates the new model.
6. Type ollama run mybiggerllama3.1
This runs the newly created model.
Optional:
Type ollama show mybiggerllama3.1
Under Parameters you should see num_ctx 131072 if everything was successful!
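If you prefer to check from the Python client, here is a small sketch (assumes pip install ollama and that the create step above succeeded; mybiggerllama3.1 is the model name used in the steps):

```python
# Sketch: confirm the custom model picked up the parameter, then try it out.
# Assumes `pip install ollama` and that `ollama create mybiggerllama3.1 -f ./Modelfile` succeeded.
import ollama

info = ollama.show("mybiggerllama3.1")
print(info["parameters"])  # should list: num_ctx 131072

reply = ollama.chat(
    model="mybiggerllama3.1",
    messages=[{"role": "user", "content": "Give me a one-line summary of this Modelfile trick."}],
)
print(reply["message"]["content"])
```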
It's such a shame llama.cpp doesn't support paged attention (aka a paged KV cache) as used in vLLM. There was a feature request, but one of the devs was basically "nah, I'm good." It would allow huge context lengths without a slowdown by spilling into CPU memory.
Also, PS: your videos are such high quality.
How do I run Ollama on an Intel iGPU?
When you discussed how to create a new version of a model with a larger context, I was asking myself: why would you do that if you can set the context size parameter when you use the API, and also in the GUI? And as always: thanks for your great videos.
Sure. But there is no GUI that is part of Ollama, or even an official GUI, and not all GUIs expose it.
At last, someone posting an awesome Ollama/llama.cpp parameters video.
A question more about the ollama serve parameter settings:
1: On Windows, how do I select which GPU UUID I want Ollama to use? I saw that the parameter or environment variable is, for example, CUDA_VISIBLE_DEVICES.
Thanks
I don't think you can. Some work has been done recently on multiple GPUs, but it's not really finished yet.
What is the max context length supported by an 8 GB card?
I'm not sure. I think with an 8B model at q4 you might get 4k or 8k tokens. But that depends on what else the machine is doing.
Thanks for the reply. This is a really helpful channel!
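For a rough sense of the numbers, here is a back-of-envelope sketch of the KV cache size. The architecture values are assumptions modeled on llama3.1 8B (32 layers, 8 KV heads of dimension 128) with an fp16 cache; a quantized KV cache or a different model changes the result.

```python
# Back-of-envelope KV cache size for a llama3.1-8B-style model.
# Assumptions: 32 layers, 8 KV heads (GQA), head dim 128, fp16 (2-byte) cache values.
layers, kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2

per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
for num_ctx in (4096, 8192, 32768, 131072):
    gib = per_token * num_ctx / 2**30
    print(f"num_ctx {num_ctx:>6}: ~{gib:.1f} GiB of KV cache")

# On an 8 GB card the q4 weights already take roughly 5 GB, and llama.cpp's
# compute buffers grow with context too, so the practical ceiling ends up
# well below what the raw cache arithmetic alone would suggest.
```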
I found it to be a fire hose of info 😮 I loved it ❤ I need to go back and go over the ctx part for better use of llama3.1:70b on a 64GB RAM Mac Studio.
Oops! 128k was way too much. Swap usage went over 40GB. lol
Great video, Matt. Less rambling in this one.
👍👍
I wish there was a tool that would let us put in our system specs and have it tell us what parameters and quants can run on that system for the model we choose.
What is the difference between perplexity vs temperature?
Temperature is a parameter you can set to adjust the randomness of generated text. Perplexity is an evaluation metric that measures how well the model predicts a piece of text, i.e. how "surprised" it is by each token. One you can set, the other measures the output.
@technovangelist thanks a lot!
@technovangelist how can the LLM measure the value of the output (perplexity) if it doesn't know the truth?
The truth isn't relevant. What is most likely correct is what matters.
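To make the distinction concrete, here is a tiny sketch with made-up numbers: temperature rescales the next-token distribution before sampling, while perplexity is computed afterwards from the probabilities the model assigned to the tokens.

```python
import math

# Temperature: rescale logits before sampling. Lower T sharpens the
# distribution, higher T flattens it (more randomness).
def softmax_with_temperature(logits, t):
    scaled = [x / t for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up next-token logits
print(softmax_with_temperature(logits, 0.5))  # sharper: top token dominates
print(softmax_with_temperature(logits, 1.5))  # flatter: more spread out

# Perplexity: exp of the average negative log-probability of the tokens
# (no notion of "truth", just how likely the model found each token).
token_probs = [0.40, 0.25, 0.10, 0.55]  # made-up per-token probabilities
ppl = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
print(f"perplexity ~ {ppl:.2f}")  # lower means the text was less surprising to the model
```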
Wouldn't you say that temperature, window size, and top_k are the only ones that make a difference?
Most of the time there is little reason to use any of them but they all have value.
Hi! For the first time I've found the video too complex... maybe after the first three or four parameters you could create a dedicated video explaining the concepts in more detail and showing a few examples? I found it very hard to understand otherwise.
Pausing and slowing down playback are useful.
Yayyyyyyyyyy
It's RAM it needs, not necessarily GPU RAM, but RAM in general!
Too fast, Matt. But great content; if you could take the time to explain each one with an example, it would be even better.
A separate video for the most used ones, maybe...
Or a separate video for each of them; that would be a nice digital investment. One day someone might search for something like "seed" in an ML context (the SEO for "seed" might be confusing, you'd need an SEO GPT to generate it).
Why are there so many Matts becoming AI video creators?? 😂
I used to work at a company called OpenText. My manager was Matt Adney, whose manager was Matt Brine, whose manager was Matt something else, and the top guy was a Matt as well.
And when I was at Microsoft, all the Matt W.'s in Redmond had an email group because we would all get each other's packages and faxes.
@technovangelist
ua-cam.com/video/ZymUMAu_fB0/v-deo.htmlsi=EJh5jbfsrWw6PFDs
Thank you for the explanation, I really only knew about temperature.