LESS VRAM, 8K+ Tokens & HUGE SPEED INCREASE | ExLlama for Oobabooga

  • Published 8 Aug 2024
  • Oobabooga WebUI had a HUGE update adding ExLlama and ExLlama_HF model loaders that use LESS VRAM and have HUGE speed increases, and even 8K tokens to play around with compared to the previous limit of 2K! This is insanely powerful and will be a huge timesaver for creators, and may even help users with less powerful graphics cards use LLMs!
    OpenAI Tokenizer: platform.openai.com/tokenizer
    Timestamps:
    0:00 - What's new (It's CRAZY!)
    0:44 - Open Oobabooga install directory
    1:02 - Update Oobabooga WebUI
    1:18 - VRAM usage & speed before update (4.3 tokens/s)
    1:56 - Fix missing option or update errors
    2:33 - Choosing new ExLlama model loader
    2:52 - Downloading new model types (8k models)
    4:25 - New VRAM & Speed (20 tokens/s! INSANE!)
    5:25 - Raise token limit from 2,000 to 8,000+!
    7:17 - How many tokens is your text?
    7:50 - How long is 8k tokens?
    8:45 - EVEN LESS VRAM with ExLlama_HF
    #Oobabooga #AI #LLM
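For a rough sense of the token counts discussed at 7:17 and 7:50, here is a minimal sketch (my own approximation, not the video's method) using the common rule of thumb of roughly 4 characters per token for English text; for exact counts use the OpenAI tokenizer page linked above, or a library such as tiktoken:

```python
def approx_token_count(text: str) -> int:
    # Rule of thumb: ~4 characters per token for English text.
    # Exact counts depend on the specific model's tokenizer.
    return max(1, round(len(text) / 4))

# By this estimate, 8k tokens is roughly 32,000 characters of English text.
print(approx_token_count("The quick brown fox jumps over the lazy dog."))  # 11
print(approx_token_count("a" * 32000))  # 8000
```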
    -----------------------------
    💸 Found this useful? Help me make more! Support me by becoming a member: / @troublechute
    -----------------------------
    💸 Support me on Patreon: / troublechute
    💸 Direct donations via Ko-Fi: ko-fi.com/TCNOco
    💬 Discuss the video & Suggest (Discord): s.tcno.co/Discord
    👉 Game guides & Simple tips: / troublechutebasics
    🌐 Website: tcno.co
    📧 Need voiceovers done? Business query? Contact my business email: TroubleChute (at) tcno.co
    -----------------------------
    🎨 My Themes & Windows Skins: hub.tcno.co/faq/my-windows/
    👨💻 Software I use: hub.tcno.co/faq/my-software/
    ➡️ My Setup: hub.tcno.co/faq/my-hardware/
    🖥️ My Current Hardware (Links here are affiliate links. If you click one, I'll receive a small commission at no extra cost to you):
    Intel i9-13900k - amzn.to/42xQuI1
    GIGABYTE Z790 AORUS Master - amzn.to/3nHuBHx
    G.Skill RipJaws 2x(2x32G) [128GB] - amzn.to/42cilxN
    Corsair H150i 360mm AIO - amzn.to/42cznvP
    MSI 3080Ti Gaming X Trio - amzn.to/3pdnLdb
    Corsair 1000W RM1000i - amzn.to/42gOTGY
    Corsair MP600 PRO XT 2TB - amzn.to/3NSvwzx
    🎙️ My Current Mic/Recording Gear:
    Shure SM7B - amzn.to/3nDGYo1
    Audient iD14 - amzn.to/3pgf2XK
    dbx 286s - amzn.to/3VNaq7O
    Triton Audio FetHead - amzn.to/3pdjIgZ
    Everything in this video is my personal opinion and experience and should not be considered professional advice. Always do your own research and ensure what you're doing is safe.

COMMENTS • 44

  • @SouthbayCreations
    @SouthbayCreations A year ago +1

    Fantastic video! Some really good info and great news! Thank you very much!

  • @Robertinosro
    @Robertinosro 4 months ago +1

    This guide is a life saver! Thank you!

  • @Playboipete
    @Playboipete A year ago +3

    Hey, is there somewhere I can go to find your one-click installers for a fresh start? I was never able to use them because of a weaker card, but I think it may be a go now. Thanks btw, long-time fan

  • @msampson3d
    @msampson3d A year ago +3

    I could be mistaken, as innovations happen so rapidly in this space, but remember that if the model wasn't trained for the expanded token range, setting it higher will either have it ignore the additional tokens or you'll start getting very bad output.
    So you can't just suddenly get additional context for all the old models you may be using.

    • @MINIMAN10000
      @MINIMAN10000 A year ago +2

      Somewhat incorrect. With RoPE (rotary positional embedding), from my understanding, instead of using 2048 whole-number points for context, it can now use fractional positions in between the whole numbers. This has been shown to allow 2x context size without LoRA training, with minimal loss in perplexity (there was even an interesting measurement of ~2560 context resulting in better perplexity). SuperHOT is the name of the LoRA that kaioken happened to be working on at the time, so it became what is used to LoRA-train models to get 4x the context instead of just 2x.
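The fractional-position idea described above can be sketched in a few lines of NumPy (my own illustration of linear RoPE position interpolation, not ExLlama's actual code; the `compress` factor here plays the role of the `compress_pos_emb` setting exposed in the WebUI, as far as I can tell):

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, compress=1.0):
    # Rotary positional embedding angles for the given token positions.
    # compress > 1 squeezes positions into fractional values (linear
    # position interpolation), so a model trained on positions 0..2047
    # can address e.g. 0..4095 without leaving its trained range.
    pos = np.asarray(positions, dtype=np.float64) / compress
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(pos, inv_freq)  # shape: (len(positions), dim // 2)

# With compress=2, position 4094 produces the same angles the model
# saw for position 2047 during training.
angles_plain = rope_angles([2047], dim=64)
angles_scaled = rope_angles([4094], dim=64, compress=2.0)
assert np.allclose(angles_plain, angles_scaled)
```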

  • @stevebruno7572
    @stevebruno7572 8 months ago +1

    Any chance you are going to do one for the AWQ models? Also, any tips on making AWQ run faster?

  • @xelensi6870
    @xelensi6870 A year ago

    Thank you so much for the info in this video. I managed to run pygmalion 7b on a 3060 16gb laptop and it's so fast

  • @theresalwaysanotherway3996
    @theresalwaysanotherway3996 A year ago +9

    This isn't entirely accurate. ExLlama will run any GPTQ model; the SuperHOT part is just a LoRA trained on 8k context lengths that has been merged into these models, so that the model will actually use the increased context length instead of ignoring it.

    • @VioFax
      @VioFax A year ago

      What do I need to change after the update to get my model to be as smart as it was before? Do I just have to upgrade EVERYTHING now to get the same performance I had just a week ago? This makes no sense... What did they do? Sacrifice a feature I was using to install something else? I'm so confused, and so is my bot.

  • @JonelKingas
    @JonelKingas A year ago +3

    It keeps saying "ERROR: No model is loaded! Select one in the Model tab."
    even though I select it in the Model tab

  • @theresalwaysanotherway3996
    @theresalwaysanotherway3996 A year ago +5

    Also, with the latest NVIDIA drivers, your GPU will automatically dip into shared system RAM instead of running into an OOM error, which slows down models that barely fit, but also massively extends the amount of memory that ExLlama can use. Therefore, with this update you could run at 8k context length very easily; it would just be a *_lot_* slower.

    • @mythaimusic39
      @mythaimusic39 A year ago +1

      If it uses just a few GB of shared memory, it doesn't really slow down the process; I find it as fast as if everything were in VRAM

    • @infini_ryu9461
      @infini_ryu9461 A year ago +1

      You're right, it dips into my shared memory, but only with 30B-33B models; it gets really slow then.

  • @MaximilianPs
    @MaximilianPs A year ago +2

    Can you make a tutorial or a video explaining how to use chat-instruct, or instruct itself?
    Having Ooba installed without any idea how to use or configure it is very frustrating!

  • @Dante02d12
    @Dante02d12 A year ago +1

    I did as you said: deleted most folders, then reran the bat file. It says it can't find Miniconda. Now I have to download everything again, so thanks, lol.
    I was pretty far behind in terms of updates though, so I don't mind. It was time to start fresh. Which SuperHOT models do you suggest?
    Edit: Aaaaaand the reinstall doesn't work, of course. I love Oobabooga, but they really need to sort things out; there's not a single time where I didn't have an issue installing it...
    Edit 2: Okay, got through the install process, lol. I don't have a SuperHOT model yet, but I tried the ExLlama "loader", and... man, that's super fast! I get 30 tokens per second on my RTX 3060 mobile (6GB VRAM), whereas I only got a handful with the default loader. No errors, no crashes. Although the results are... meh. But that could be because of the model (WizardLM 7B). At least it works.

  • @armedgunman2816
    @armedgunman2816 A year ago

    Can you do a video on your PC setup with respect to security etc., the best settings for a PC overall, and the best programs to make it fast and reliable?

  • @envoy9b9
    @envoy9b9 A year ago

    I can't get it working. I have an M2 MacBook Pro, and every time I try to generate it tells me to load a model, but I already have one selected... help pls

  • @theresalwaysanotherway3996
    @theresalwaysanotherway3996 A year ago +2

    Finally, ExLlama_HF isn't actually 2x slower; you just left the context length at 2x for the ExLlama_HF test, which made it a lot slower. If you set them both to the same context length, you should get much more similar speeds (HF will still be slower, but not by that much)

  • @musicandhappinessbyjo795
    @musicandhappinessbyjo795 A year ago

    Hello sir, I am running models on CPU and I'd like some tips about models and the WebUI as well. Would love to know.

  • @Lakosta826
    @Lakosta826 4 months ago

    How can I uninstall this when I want to?
    Thanks

  • @mesterm8059
    @mesterm8059 A year ago

    What is this? Does it help with FPS in games?

  • @matthallett4126
    @matthallett4126 A year ago

    Hi TC. Can you build a script to install DragGAN? The latest one that was just released. I cannot figure it out for Windows. Thanks!

  • @Asia_Bangladesh
    @Asia_Bangladesh A year ago

    Eid Mubarak

  • @infini_ryu9461
    @infini_ryu9461 A year ago +2

    I don't know if everything is working, but I'm getting responses in 1 second or even less with Pygmalion 13B 8k by TheBloke. It just feels weird having this kind of speed; I used to be at 3-7 seconds on the normal 13B model. It gives some weirder responses, too. Probably my new settings, though...

    • @VioFax
      @VioFax A year ago

      Mine's lobotomized by this update, plz help. lol

    • @infini_ryu9461
      @infini_ryu9461 A year ago +1

      @@VioFax A re-install typically helps. Use the installer and make sure it's not run as admin.

  • @DJPON369
    @DJPON369 7 months ago

    it works

  • @briananeuraysem3321
    @briananeuraysem3321 A year ago +1

    My 4GB GTX 1050 is saved!
    Edit: it turns out 0 x 0 = 0
    In other words, it's still slow as heck, probably because it's still using shared system memory due to VRAM limitations... oh well

  • @VioFax
    @VioFax A year ago +1

    The new Oobabooga update messed up my hack for making my LLM smarter. They took away the ability to extend context reach, which is STUPID! And seemingly on purpose. It's not half as smart as it was, and I'm pissed. It hallucinates and goes off the rails all the time now. WTF did they do to my Pokemon! I'll take the slower responses, that's fine... I'd rather have a coherent bot.

  • @alsoeris
    @alsoeris 5 months ago

    It's funny to hear people call it 'Oobabooga' when that's just the username of the person who made 'text-generation-webui'

  • @MakerGrigio
    @MakerGrigio A year ago +4

    Hey, I think I'm finding a bug with the new models. Longer chat sessions start throwing errors, and the chat session starts getting corrupted; models stop being able to produce results, even after reloading the model or rebooting all of Oobabooga. Thank you so much for this video. I was a bit ahead of the curve for once, and have been using ExLlama_HF and the SuperHOT 8k models from TheBloke for about a day now, and am getting really consistent failures on various models. Clearing chat history seems to clear the issue, but if you are working on more involved tasks, the AIs will just stop being able to generate responses. The models definitely perform SO MUCH BETTER. BUT BEWARE.

    • @Prizzim
      @Prizzim A year ago

      Crypto = L

    • @VioFax
      @VioFax A year ago

      They broke it... I had the same problem. My model used to be pretty sharp; now it's struggling to keep a coherent conversation going... It's kind of depressing. I had made so much progress with that bot. I'd rather just have the slower responses and a more coherent bot... Seems to me like they actually neutered something and made it seem like an upgrade...

    • @MakerGrigio
      @MakerGrigio A year ago

      @@VioFax I feel you. You work with a chatbot for a couple hours, the model crashes or the UI glitches, you accidentally tap F12... and half or all of your chat history is gone... It's like a new friend you were having a good chat with in a coffee shop has a stroke and dies in your arms...

  • @VioFax
    @VioFax A year ago

    If yours is working, don't install this garbage. You will lose all your previous models.

  • @Freizeitschranzer
    @Freizeitschranzer A year ago

    The update did not work; deleting and starting the bat file ended up with "\\text-generation-webui\\server.py': [Errno 2] No such file or directory"... not a good plan

  • @mygamecomputer1691
    @mygamecomputer1691 A year ago +2

    This is good news. I only mess around with this for the fun of spicy role-play, I don’t need ChatGPT to construct coherent sentences for me. I’m looking for it to entertain me.

    • @VioFax
      @VioFax A year ago

      Beware of this update then. It's broken all my fun.

    • @mygamecomputer1691
      @mygamecomputer1691 A year ago

      Always make a full folder backup before updating anything. Save the folder on an external drive. You can also try to roll back to a prior version. There are explanations on how to do it; otherwise I'd tell you myself.

  • @adamstewarton
    @adamstewarton A year ago

    AutoGPTQ is slow in general; I use GPTQ-for-LLaMa

  • @weirdscix
    @weirdscix A year ago +1

    I get this when running the update: fatal: detected dubious ownership in repository at 'C:/TCHT/oobabooga_windows/text-generation-webui'

    • @VioFax
      @VioFax A year ago

      I've come to believe, after my model's new problems, that... this update is poisoned...

    • @TroubleChute
      @TroubleChute A year ago +2

      Simple fix, run: git config --global --add safe.directory '*'
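      For reference, the warning comes from Git's ownership safety check; a slightly narrower sketch (trusting only the one repository from the error message, rather than every directory) would be:

```shell
# Git refuses to operate on a repository owned by another user
# ("detected dubious ownership"). Mark just that directory as safe:
git config --global --add safe.directory "C:/TCHT/oobabooga_windows/text-generation-webui"

# Trusting all directories (as in the reply above) also works, but is broader:
# git config --global --add safe.directory '*'

# Confirm the entry was recorded:
git config --global --get-all safe.directory
```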

    • @weirdscix
      @weirdscix A year ago

      @@TroubleChute thank you :)