Thank you so much for your video!
Really informative.
Didn't expect it to work that well.
What's the memory usage while using the 32B model?
Great video. I loved the way you covered all these technically challenging areas for me so quickly and so comprehensively! Best wishes!
thank you
Most important info missing! What is the memory usage when 32B Qwen 2.5 is running? Please provide the info.
Thx for the conclusion at the end
I use the 14b model and Continue on my 4090, and it is fast and works great!
@@renerens what quantization are you using? And context length?
Which model/llm are you using?
@@andrepaes3908 I used the default model not a quantized one, 32768 context length.
@@asifudayan qwen2.5-coder:14b
Hi, is your computer a Mac? If yes, which one?
Mac M3 Max with 128GB of unified memory
Hey, Chris! That was a great video. So easy to understand, and I set up everything and followed along. You are very easy to understand and do a great job of explaining the concepts you are discussing. I just subscribed and I hope you continue with this. I agree, this is all about what's coming and not necessarily "is this the end all, be all LLM for coding." If you take the time and follow along with the evolving AI, it will be much easier to adapt to the next thing coming. Thanks for keeping me informed.
Glad it’s useful, honestly I don’t think there is a more exciting time to be a developer
For speed, Cerebras AI is nuts: over 2000 tokens per second using Meta Llama 7B and over 1800 using 70B.
Great video. With Open WebUI you need to raise num_ctx, as it defaults to 2048; perhaps 32768 might help get the full response.
@@TheCopernicus1 how do you increase a model's context length within Open WebUI?
@@andrepaes3908 in the top right-hand corner, click the controls icon (it looks like sliders), then go down to "Context Length" and for starters try 16384 if your device supports it, as 32768 might be really slow. Good luck!
@@andrepaes3908 in Open WebUI, click the controls icon in the top right-hand corner, then set the context length to 16384 to start with and go up depending on your system resources! Good luck
@@andrepaes3908 you can change it easily in the Open WebUI configuration, or just create a new Modelfile and use Ollama to create a new version of the model with a longer context.
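If you'd rather script it than click through the UI, here's a rough sketch of the same idea straight against Ollama's REST API; the model tag and the 16384 figure are just example values, so adjust them to whatever your hardware can handle.

# Rough sketch: requesting a larger context window per call instead of
# changing it in the Open WebUI controls panel. Assumes Ollama is running
# locally on its default port and the model has already been pulled.
import requests

payload = {
    "model": "qwen2.5-coder:32b",  # example tag, use whatever you pulled
    "prompt": "Write a Python function that reverses a linked list.",
    "stream": False,
    # num_ctx overrides the default 2048-token context for this request only.
    # Start at 16384 and raise it gradually if your RAM/VRAM allows.
    "options": {"num_ctx": 16384},
}

response = requests.post("http://localhost:11434/api/generate", json=payload, timeout=600)
response.raise_for_status()
print(response.json()["response"])

The Modelfile route mentioned above also works: put FROM qwen2.5-coder:32b and PARAMETER num_ctx 32768 into a Modelfile, then run ollama create qwen2.5-coder-32k -f Modelfile to bake a longer-context variant you can select like any other model.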
What a great video, thanks man
My pleasure!
Great video, amazing content! I see you used the 4-bit quantized version to run all tests. Since you've got 128GB RAM, could you run the same tests for the 32b model with 8-bit and FP16 quants to check if it improves responses? If so, please make another video to share the great news!
that’s a really good shout, I’ll do that
Yeah, I never use 4-bit quantization anymore because it often gives very poor output. Q8 is okay and almost indistinguishable from FP16. Also, Q5_K_M should be the minimum, since it still gives very good results; in fact, I don't notice any quality loss with Q5_K_M models. I've tested it on the Gemma 2 27B model and the Llama 3.1 8B and 70B models. However, if you have extra RAM, I highly recommend always using Q8 for the best results.
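For anyone who wants to check this on their own machine, here's a rough sketch that sends the same prompt to two quantization levels through the Ollama API so you can eyeball the difference; the tag names follow the usual Ollama library convention but are assumptions, so confirm them with ollama list or the model page first.

# Sketch: compare output quality across quantization levels for the same prompt.
import requests

TAGS = [
    "qwen2.5-coder:32b-instruct-q4_K_M",  # assumed tag names, check the library page
    "qwen2.5-coder:32b-instruct-q8_0",
]
PROMPT = "Implement a thread-safe LRU cache in Python."

for tag in TAGS:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": tag, "prompt": PROMPT, "stream": False},
        timeout=1200,
    )
    r.raise_for_status()
    print(f"===== {tag} =====")
    print(r.json()["response"])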
why use that if you could use ottodev?
What is the minimum spec required to run this?
All PCs will be able to run it. The question is, how big is your RAM? The bigger your RAM, the larger the model you can run. Usually, an 8B model requires 8GB of RAM, a 27B model requires 32GB of RAM, and so on (this is if you are only using the Q4 quantization). Your CPU speed doesn't determine whether you can run it, only how fast it generates; with a slow CPU it will still run, it just takes longer.
32b-instruct-q4_K_M - 23GB VRAM
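For anyone wondering where figures like "8B needs about 8GB" or the 23GB VRAM above come from, here's a back-of-the-envelope sketch; the bits-per-weight values are rough assumptions, and real usage adds KV-cache and runtime overhead on top of the weights.

# Rough estimate: weight memory in GB ≈ parameters (billions) × bits-per-weight / 8.
# Treat the results as lower bounds, not guarantees.

def approx_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB."""
    return params_billions * bits_per_weight / 8

for params, label in [(7, "7B"), (14, "14B"), (32, "32B")]:
    for bits, quant in [(4.5, "Q4_K_M (approx.)"), (8.5, "Q8_0 (approx.)"), (16.0, "FP16")]:
        print(f"{label} @ {quant}: ~{approx_weight_gb(params, bits):.1f} GB of weights")

The 32B weights at roughly 4.5 bits work out to about 18GB, which lines up with the ~23GB VRAM reported above once the KV cache and runtime are included.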
ChatGPT's donkey is to a donkey what the spherical cow in the physics meme is to a cow. :)
hahaha nice
Good demo. You just went way, way too fast past the install/setup/config. I'm still trying to work my way past all the errors and figure out why I'm not getting any models showing up in my WebUI.
Apologies, I did that because I have a video where I walk through Ollama (much slower and in detail); this is the link:
ua-cam.com/video/uAxvr-DrILY/v-deo.html hope it helps
@@chrishayuk Okay thanks. I'll try that.
Claude sucks big time with Vue, Vuetify, Bootstrap, bootstrap-vue, Laravel, etc. Qwen is absolutely amazing! It makes such good Vue components, it knows Vite, it knows Laravel, it doesn't confuse Vue 2 with Vue 3, and it differentiates versions. GPT-4-turbo doesn't understand different versions and just produces garbage.
I love you my guy
No models. Why?
This should help: Getting Started with OLLAMA - the docker of ai!!!
ua-cam.com/video/uAxvr-DrILY/v-deo.html
You should be connected to Wi-Fi for the models to appear. It’s strange since it doesn’t use the internet but requires Wi-Fi. The truth is, you can access Open WebUI on any device as long as it is connected to the same network as the server.
@@LoveMapV4 Well I figured it out. You just gotta take your time getting the configuration right. I installed ollama as a local service. But I had to install open-webui using Docker because the Python PIP install didn't work. PIP didn't work because you need exactly version 3.11. It has to be exactly that version. 3.10 won't work. 3.12 won't work. It must be 3.11. Well, I had 3.10. I didn't spend enough time figuring out how to get exactly 3.11 installed. If I just blindly upgraded, that gave me 3.12, not 3.11. Arghhh!!! So I gave up on that path because I was impatient and just used Docker. But then I had to do something to configure open-webui in Docker to talk to ollama running locally, i.e., not in Docker. I followed the instructions on the web site and just took my time and finally got it to work. The installation and the documentation could both be better, but what the heck, that's what we get paid the big bucks for, right?
I watched a few more of Chris's videos and they are really good. It's a good resource for doing this kind of work. Thank you.
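For anyone still hitting the "no models showing up" problem above, a quick sanity check that takes Open WebUI out of the equation is to ask Ollama directly for its model list. This is only a sketch; if Open WebUI runs in Docker while Ollama runs on the host, Open WebUI usually needs its OLLAMA_BASE_URL pointed at http://host.docker.internal:11434 rather than localhost.

# Sanity check: can we reach Ollama, and does it actually have models pulled?
import requests

OLLAMA_BASE = "http://localhost:11434"  # adjust if Ollama lives somewhere else

try:
    r = requests.get(f"{OLLAMA_BASE}/api/tags", timeout=5)
    r.raise_for_status()
except requests.RequestException as exc:
    raise SystemExit(f"Could not reach Ollama at {OLLAMA_BASE}: {exc}")

models = r.json().get("models", [])
if not models:
    print("Ollama is up but has no models. Try: ollama pull qwen2.5-coder:14b")
else:
    print("Models Ollama can serve:")
    for m in models:
        print(" -", m["name"])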