vLLM - Turbo Charge your LLM Inference

  • Published 4 Oct 2024

COMMENTS • 58

  • @rajivmehtapy
    @rajivmehtapy 1 year ago +15

    As always, you are one of the few people who cover this topic on YouTube.

  • @sp4yke
    @sp4yke 1 year ago +5

    Thanks Sam for this video. It would be interesting to dedicate a video to comparing OpenAI API emulators such as LocalAI, Oobabooga, and vLLM.

  • @g-program-it
    @g-program-it 1 year ago +4

    Finally AI models that don't take a year to give a response.
    Cheers for sharing this Sam.

    • @clray123
      @clray123 1 year ago +1

      Uhh... you already get an instant response from GGML/llama.cpp (apart from the model weights loading time, which is not something PagedAttention improves on).
      The deal with PagedAttention is that it prevents the KV cache from wasting memory: instead of over-allocating the entire context length at once, it allocates in chunks as the sequence grows (and can share chunks among different inference beams or users).
      This lets the same model serve more users (higher throughput), provided they generate sequences shorter than the context length. It should not affect the response time for any individual user (if anything, it makes it slightly worse because of the overhead of mapping virtual to physical memory blocks).
      So if it improves on HF in that respect, it just demonstrates that either HF's implementation of the KV cache sucks, or Sam is comparing non-KV-cached generation with KV-cached generation.
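
To make the chunked KV-cache allocation described in the comment above concrete, here is a toy Python sketch of the idea. It is not vLLM's actual implementation, just an illustration of allocating fixed-size blocks on demand instead of reserving the full context length per sequence (the block size and pool size are arbitrary):

```python
# Toy sketch of block-wise KV-cache allocation (illustrative only, not vLLM code).
BLOCK_SIZE = 16  # tokens per KV-cache block; real systems use similarly small blocks

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, pos: int) -> int:
        """Return the physical block holding token `pos` of sequence `seq_id`,
        allocating a new block only when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE >= len(table):         # sequence grew into a new block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted: preempt or evict a sequence")
            table.append(self.free_blocks.pop())
        return table[pos // BLOCK_SIZE]

    def free(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=1024)
for t in range(40):                  # a 40-token generation...
    cache.append_token(seq_id=0, pos=t)
print(len(cache.block_tables[0]))    # ...occupies only 3 blocks, not a full context-length reservation
```

Because memory is only committed as tokens are actually produced, many more concurrent sequences fit in the same GPU memory, which is where the throughput gain comes from.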

  • @MultiSunix
    @MultiSunix 1 year ago +3

    I talked to its core developers; they don't have plans to support quantized models yet, so you really need powerful GPU(s) to run it.

  • @wilfredomartel7781
    @wilfredomartel7781 1 year ago +1

    Finally we can achieve fast responses.

  • @mayorc
    @mayorc 1 year ago +3

    A very good test for this (you could make a video on it) would be to use the OpenAI-compatible server functionality with a well-performing, code-tuned local model, and test it with great new tools like GPT-Engineer or Aider, to see how it performs compared to GPT-4 in real-world scenarios of writing applications.
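
For anyone who wants to try the OpenAI-compatible-server route suggested above, a minimal sketch follows; the model id, port, and prompt are placeholders, and the client snippet assumes the `openai` Python package (v1+). Coding tools such as Aider can usually be pointed at the same base URL.

```python
# 1) Serve a model with vLLM's OpenAI-compatible server (run in a terminal):
#      python -m vllm.entrypoints.openai.api_server --model lmsys/vicuna-7b-v1.3 --port 8000
#    (model id and port are placeholders)

# 2) Query it with any OpenAI-style client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is ignored locally
resp = client.completions.create(
    model="lmsys/vicuna-7b-v1.3",   # must match the model the server is running
    prompt="Write a Python function that reverses a string.",
    max_tokens=128,
    temperature=0.2,
)
print(resp.choices[0].text)
```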

  • @henkhbit5748
    @henkhbit5748 1 year ago

    Hmm, I didn't know that Red Bull and Verstappen were in the race for turbocharging LLMs 😉 Thanks for demonstrating vLLM in combination with an open-source model 👍

  • @TailorJohnson-l5y
    @TailorJohnson-l5y 1 year ago

    Sam I love your videos, but this one takes the cake. Thank you!!!

  • @Rems766
    @Rems766 1 year ago

    Thanks mate, I am going to try to add that to LangChain so it can integrate seamlessly into my product.

  • @guanjwcn
    @guanjwcn 1 year ago +1

    This is very interesting. Thanks for sharing this. It would be nicer, I guess, if LangChain could do the same.

  • @jasonwong8934
    @jasonwong8934 1 year ago +1

    I'm surprised the bottleneck was due to memory inefficiency in the attention mechanism and not the volume of matrix multiplications.

  • @mayorc
    @mayorc 1 year ago

    This looks very useful.

  • @NickAubert
    @NickAubert 1 year ago +1

    It looks like vLLM itself is CUDA-based, but I wonder if these techniques could apply to CPU-based models like llama.cpp? Presumably any improvements wouldn't be as dramatic if the bottleneck is processing cycles rather than memory.

    • @harlycorner
      @harlycorner 1 year ago

      Thanks for this video. Although I should mention that, at least on my RTX 3090 Ti, the GPTQ 13B models with the ExLlama loader are absolutely flying. Faster than GPT-3.5 Turbo.
      But I'll definitely take a look.

  • @MariuszWoloszyn
    @MariuszWoloszyn 1 year ago +1

    vLLM is great but lacks support for some models (and some are still buggy, like mpt-30b with streaming, but MPT was added like 2 days ago, so expect that to be fixed soon). For example, there's less chance it will support Falcon-40b soon. In that case, use huggingface/text-generation-inference, which can load Falcon-40b in 8-bit flawlessly!

    • @samwitteveenai
      @samwitteveenai 1 year ago +7

      Yes, none of these are flawless. I might make a video about hosting with HF text-generation-inference as well.
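
As a rough sketch of the text-generation-inference route mentioned in this thread (the image tag, port, model id, and the 8-bit flag are illustrative; check TGI's docs for the current options):

```python
# Start TGI in Docker (terminal), e.g.:
#   docker run --gpus all --shm-size 1g -p 8080:80 \
#       ghcr.io/huggingface/text-generation-inference:latest \
#       --model-id tiiuae/falcon-40b-instruct --quantize bitsandbytes

# Then call its /generate endpoint:
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "Explain what a KV cache is in one sentence.",
          "parameters": {"max_new_tokens": 100}},
    timeout=300,
)
print(resp.json()["generated_text"])
```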

  • @rakeshramesh9248
    @rakeshramesh9248 22 days ago

    My question is: does it increase throughput by freeing up memory to hold more batches? And how does it achieve the speed-up in latency?

  • @clray123
    @clray123 1 year ago +1

    It should be noted that for whatever reason it does not work with CUDA 12.x (yet).

    • @samwitteveenai
      @samwitteveenai 1 year ago +1

      My guess is it's just because their setup isn't using that yet, and it will come. I actually just checked my Colab and that seems to be running CUDA 12.0, but maybe that is not optimal.
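
A quick way to check which CUDA build a Colab (or any other) environment is actually running, assuming PyTorch is installed:

```python
import torch

print(torch.version.cuda)            # CUDA version PyTorch was built against
print(torch.cuda.is_available())     # True if a GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```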

  • @frazuppi4897
    @frazuppi4897 1 year ago +1

    Not sure, since they compared with HF Transformers, and HF doesn't use Flash Attention to my knowledge, so it's quite slow by default.

    • @samwitteveenai
      @samwitteveenai 1 year ago +2

      They also compared to TGI, which does have Flash Attention (huggingface.co/text-generation-inference), and vLLM is still quite a bit faster.

  • @akiempaul1117
    @akiempaul1117 11 months ago

    Great Great Great

  • @MeanGeneHacks
    @MeanGeneHacks 1 year ago +1

    Thank you for this nugget. Very useful information for speeding up inference. Does it support the bitsandbytes library for loading in 8-bit or 4-bit?
    Edit: Noticed no Falcon support.

    • @samwitteveenai
      @samwitteveenai 1 year ago +1

      AFAIK they aren't supporting bitsandbytes etc., which doesn't surprise me, as what they are mainly using it for is comparing models, for which low-resolution quantization is not ideal.

  • @shishirsinha6344
    @shishirsinha6344 1 year ago

    Where is the model comparison made in terms of execution time w.r.t. Hugging Face?

  • @Gerald-xg3rq
    @Gerald-xg3rq 6 months ago

    Can you run this on AWS SageMaker too? Does it also work with the Llama 2 models with 7 and 13 billion parameters?

  • @TheNaive
    @TheNaive 10 months ago

    Could you show how to add any Hugging Face model to vLLM? Also, the above Colab isn't working.

  • @asmac001nolastname6
    @asmac001nolastname6 1 year ago +1

    Can this package be used with quantized 4-bit models? I don't see any support for them in the docs.

  • @РыгорБородулин-ц1е

    Now I wonder if it's possible to launch this on a CPU.
    Some models would run tolerably.

  • @io9021
    @io9021 1 year ago +1

    I'm wondering how vLLM compares against conversion to ONNX (e.g. with Optimum) in terms of speed and ease of use. I'm struggling a bit with ONNX 😅

    • @s0ckpupp3t
      @s0ckpupp3t 1 year ago

      Does ONNX have streaming ability? I can't see any mention of WebSocket or HTTP/2.

    • @io9021
      @io9021 1 year ago +1

      @@s0ckpupp3t Not that I know of. I converted bloom-560m to ONNX and got similar latency to vLLM. I guess with ONNX one could optimise it a bit further, but I'm impressed by vLLM because it's much easier to use.
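
A sketch of the Optimum/ONNX path being compared here, in case it helps; the model id and output directory are examples, and the export command assumes `optimum-cli` is installed:

```python
# Export to ONNX (terminal):
#   optimum-cli export onnx --model bigscience/bloom-560m bloom_onnx/

# Run the exported model with ONNX Runtime via Optimum:
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = ORTModelForCausalLM.from_pretrained("bloom_onnx/")

inputs = tok("The capital of France is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```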

  • @MrRadziu86
    @MrRadziu86 1 year ago

    @Sam Witteveen do you know by any chance how it compares to other recent techniques for speeding up models? I don't remember exactly, but sometimes it is just a setting, a parameter nobody used until somebody shared it, as well as other techniques. Also, if you happen to know, which are better suited for Falcon, Llama, etc.?

    • @samwitteveenai
      @samwitteveenai 1 year ago

      For many of the options I have looked at, this compares well for the models it works with, etc.

  • @chenqu773
    @chenqu773 1 year ago +1

    I am wondering if it works with Hugging Face 8-bit and 4-bit quantization.

    • @samwitteveenai
      @samwitteveenai 1 year ago +2

      If you are talking about bitsandbytes, I don't think it does just yet.

  • @navneetkrc
    @navneetkrc 1 year ago +1

    So can I use this with models downloaded from Hugging Face directly?
    Context: in my office setup I can only use model weights downloaded separately.

    • @samwitteveenai
      @samwitteveenai 1 year ago +2

      Yes, totally. The Colab I show downloads a model from Hugging Face. Not all LLMs are compatible, but most of the popular ones are.

    • @navneetkrc
      @navneetkrc 1 year ago +2

      @@samwitteveenai In my office setup these models cannot be downloaded (blocked), so I download them separately and use their weights via Hugging Face pipelines as the LLM for LangChain and other use cases.
      I will try a similar approach for vLLM and hope it works.

    • @samwitteveenai
      @samwitteveenai 1 year ago +2

      @@navneetkrc Yes, totally. You will just need to load it locally, etc. (see the sketch below).

    • @navneetkrc
      @navneetkrc 1 year ago +1

      @@samwitteveenai thanks a lot for the quick replies. You are the best 🤗
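
A minimal sketch of the local-weights approach discussed in this thread; the path is a placeholder and just needs to contain the usual Hugging Face-format config and weight files:

```python
from vllm import LLM, SamplingParams

# Point vLLM at a local directory of HF-format weights downloaded separately
# (the path below is a placeholder).
llm = LLM(model="/data/models/llama-2-7b-hf")

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Summarise what PagedAttention does."], params)
print(outputs[0].outputs[0].text)
```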

  • @andrewdang3401
    @andrewdang3401 1 year ago

    Is this possible with LangChain and a GUI?

  • @ColinKealty
    @ColinKealty 1 year ago +1

    Is this usable as a model in LangChain for tool use?

    • @samwitteveenai
      @samwitteveenai 1 year ago +2

      You can use it as an LLM in LangChain (see the sketch below). Whether it will work with tools will depend on which model you serve, etc.

    • @ColinKealty
      @ColinKealty 1 year ago +1

      @@samwitteveenai I assume it doesn't support quants? I don't see any mention.
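
A rough sketch of using a vLLM-backed model as a LangChain LLM, assuming the community `VLLM` wrapper (the import path has moved between LangChain versions, and the model id is a placeholder):

```python
from langchain_community.llms import VLLM  # older versions exposed this as langchain.llms.VLLM

llm = VLLM(
    model="lmsys/vicuna-7b-v1.3",  # placeholder model id
    max_new_tokens=256,
    temperature=0.2,
)
print(llm.invoke("Write one sentence explaining PagedAttention."))
```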

  • @keemixvico975
    @keemixvico975 1 year ago

    It doesn't work, damn it. I don't want to use Docker to make this work, so I'm stuck.

    • @samwitteveenai
      @samwitteveenai 1 year ago +1

      What model are you trying to get to work? It also doesn't support quantized models, if that's what you are trying.

    • @saraili3971
      @saraili3971 1 year ago

      @@samwitteveenai Hi Sam, thanks for sharing (a life-saver for newbies). I wonder what your recommendation is for quantized models?

  • @napent
    @napent 1 year ago +1

    What about data privacy?

    • @samwitteveenai
      @samwitteveenai 1 year ago +2

      You are running it on a machine you control. What are the privacy issues?

    • @napent
      @napent 1 year ago +1

      @@samwitteveenai I thought that it was cloud-based 🎩

  • @stabilitylabs
    @stabilitylabs 1 year ago +1

    Can it be used with GGML models?

    • @samwitteveenai
      @samwitteveenai 1 year ago

      No, so far these are for full-resolution models only.

  • @sherryhp10
    @sherryhp10 1 year ago

    still very slow

  • @eyemazed
    @eyemazed 11 months ago

    It doesn't work on Windows, folks. Trash.

    • @eljefea2802
      @eljefea2802 11 months ago

      They have a Docker image. That's what I'm using right now.