Interesting sharing. I'm curious about the underlying implementation of the KV block sharing part: you have a copy-on-write mechanism, but how does it avoid a dirty-read condition where both requests read that the ref count is 2 and both copy the block simultaneously?
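For what it's worth, here is a minimal sketch of one way to make copy-on-write on ref-counted blocks race-free: serialize every ref-count read/update behind a single lock (in practice vLLM's scheduler makes these decisions in one place, so there is no concurrent writer to race with). The `BlockAllocator` below is hypothetical, not vLLM's actual API:

```python
import threading
from dataclasses import dataclass

@dataclass
class Block:
    block_id: int
    ref_count: int = 1

class BlockAllocator:
    """Hypothetical allocator: all ref-count checks and updates happen under one
    lock, so two requests can never both observe ref_count == 2 and both copy."""

    def __init__(self, num_blocks: int):
        self._lock = threading.Lock()
        self._free = list(range(num_blocks))

    def allocate(self) -> Block:
        with self._lock:
            return Block(self._free.pop())

    def fork(self, block: Block) -> Block:
        # Sharing an existing block with a new sequence just bumps the ref count.
        with self._lock:
            block.ref_count += 1
            return block

    def copy_on_write(self, block: Block) -> Block:
        # Called before a sequence appends a token into a possibly shared block.
        with self._lock:
            if block.ref_count == 1:
                return block              # sole owner: write in place
            block.ref_count -= 1          # drop our reference...
            return Block(self._free.pop())  # ...and write into a private copy
```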
7:25 How is it possible to compute attention separately, block by block? The softmax (attention weights) is calculated over all of the previous tokens, and those softmax scores are then multiplied with all of the previous tokens' value vectors to produce the attention output for the new token. So it should still need all of the previous tokens in the other blocks, twice (once for the keys, once for the values). What am I missing here?
I read the paper. It turns out the illustration is not 100% accurate (probably for the sake of making it intuitive). It does indeed use every previous block (when sliding-window attention is not used) while computing the attention for the next layer.
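To add a bit of detail: the kernel really does read every previous KV block; what makes block-by-block computation possible is the standard online-softmax trick of carrying a running max and running normalizer across blocks, so all the scores never need to be materialized at once. A self-contained numpy sketch of that general technique (not vLLM's actual CUDA kernel):

```python
import numpy as np

def full_attention(q, K, V):
    # Reference: softmax over all previous tokens at once.
    scores = K @ q / np.sqrt(q.shape[0])               # (n,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                                       # (d,)

def blockwise_attention(q, K, V, block_size):
    # Same result, computed one KV block at a time while carrying a
    # running max (m), running normalizer (s) and running output (acc).
    d = q.shape[0]
    m, s, acc = -np.inf, 0.0, np.zeros(d)
    for start in range(0, K.shape[0], block_size):
        Kb, Vb = K[start:start + block_size], V[start:start + block_size]
        scores = Kb @ q / np.sqrt(d)
        m_new = max(m, scores.max())
        rescale = np.exp(m - m_new)                    # re-weight earlier blocks
        w = np.exp(scores - m_new)
        s = s * rescale + w.sum()
        acc = acc * rescale + w @ Vb
        m = m_new
    return acc / s

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=50), rng.normal(size=(11, 50)), rng.normal(size=(11, 50))
assert np.allclose(full_attention(q, K, V), blockwise_attention(q, K, V, block_size=4))
```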
It seems like there should be a performance increase for beam search as well (that is, in addition to the memory savings it gets). It would be great to see some benchmarks for that!
If sequences of different sizes can be processed in parallel (say request 1 is generating its 11th token and request 2 its 3rd token), how can those two operations, the query vector of request 1 (say 1x50) dotted with its previous tokens' key matrix (11x50), and a 1x50 dotted with a 3x50, be batched together?
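One way to picture it: at a decode step every request contributes exactly one query token, so the queries stack into a (batch, d) matrix, and each request only attends over its own KV length. Below is a padded-and-masked numpy sketch of that idea; the actual PagedAttention kernel avoids the padding by walking each sequence's block table, but the math is the same:

```python
import numpy as np

def batched_decode_attention(qs, kv_lens, K_pad, V_pad):
    # qs:      (batch, d)          one query token per request (decode step)
    # kv_lens: (batch,)            true KV length of each request
    # K_pad:   (batch, max_len, d) keys padded up to the longest request
    # V_pad:   (batch, max_len, d) values, padded the same way
    d = qs.shape[1]
    scores = np.einsum('bd,bld->bl', qs, K_pad) / np.sqrt(d)
    valid = np.arange(K_pad.shape[1])[None, :] < np.asarray(kv_lens)[:, None]
    scores = np.where(valid, scores, -np.inf)            # padding never gets weight
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum('bl,bld->bd', w, V_pad)

# The example from the question: request 1 has 11 cached tokens, request 2 has 3.
rng = np.random.default_rng(0)
d, lens = 50, [11, 3]
Ks = [rng.normal(size=(n, d)) for n in lens]
Vs = [rng.normal(size=(n, d)) for n in lens]
qs = rng.normal(size=(2, d))
pad = lambda X: np.pad(X, ((0, max(lens) - X.shape[0]), (0, 0)))
out = batched_decode_attention(qs, lens, np.stack([pad(K) for K in Ks]),
                               np.stack([pad(V) for V in Vs]))   # (2, 50)
```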
Beautiful adaptation of the fundamental ideas of paging, reference counting and copy-on-write. 👌
Full circle dynamic memory management and garbage collection. Great talk!
Such an elegant idea and amazingly clear explanation!
Very helpful and distinguished presentation!
Awesome 👍
Great talk and amazing work guys!
It was great. Thank you!
This is so great!
great work
I think the last question asked was about the impact on latency.
How do we calculate the memory used by the KV cache in PagedAttention? For example, for an input of 500 tokens and an output of 1000 tokens.
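A rough back-of-the-envelope way to do it (the model shape below is an assumed LLaMA-7B-like configuration: 32 layers, 32 heads of dim 128, fp16; substitute your own model's numbers):

```python
# Rough KV-cache size estimate. The model shape is an assumption, not taken
# from the talk: 32 layers, 32 heads x 128 head dim, fp16 (2 bytes/element).
num_layers, num_heads, head_dim, bytes_per_elem = 32, 32, 128, 2
block_size = 16                     # vLLM's default KV block size (tokens)

bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_elem  # K and V
seq_len = 500 + 1000                # prompt + generated tokens

# PagedAttention allocates whole blocks, so round the length up to the block size.
num_blocks = -(-seq_len // block_size)
total_bytes = num_blocks * block_size * bytes_per_token

print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"{total_bytes / 1024**2:.0f} MiB for {seq_len} tokens")
# -> 512 KiB per token, 752 MiB for 1500 tokens
```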
brilliant guys
sick
Is the vLLM engine running on the host?
You run the server on the host that has the GPU installed; the server can then be accessed remotely over its API using OpenAI's client.
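Roughly like this (the model name and host below are placeholders; check the vLLM docs for the exact server flags of your version):

```python
# On the GPU host, start vLLM's OpenAI-compatible server, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf
# Then, from any other machine, point the OpenAI client at that host.
from openai import OpenAI

client = OpenAI(
    base_url="http://gpu-host:8000/v1",   # replace with your server's address
    api_key="EMPTY",                      # vLLM doesn't require a real key by default
)

resp = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # must match the served model
    prompt="Hello, vLLM!",
    max_tokens=32,
)
print(resp.choices[0].text)
```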