Horace He joined us to walk us through what one can do with native PyTorch when it comes to accelerating inference! Also, if you need some GPUs check out Hyperstack: console.hyperstack.cloud/?Influencers&Aleksa+Gordi%C4%87 who are sponsoring this video! :)
Wow, this presentation was excellent. Straight to the point. No over-complicating, no over-simplifying, no trying to sound smart by obscuring simple things. Thank you Horace!
It was so informative
Wow! It was very educational and practical!
I liked the graphics in the presentation!
Great job by both of you!
Thanks!
I love this guy so much, it's unreal
Super interesting talk!! Do you guys have any idea how the compilation-generated decoding kernel compares against custom kernels like Flash-Decoding or Flash-Decoding++?
One thing that was not super clear to me: are we loading the next weight matrix (assuming there is enough SRAM) while the previous matmul+activation is being computed?
Within each matmul, the loading of data from main memory into registers occurs at the same time as the values are being computed.
So the answer to your question is "no, but it also wouldn't help because the previous matmul/activation is already saturating the bandwidth"
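To make that bandwidth argument concrete, here is a rough back-of-envelope sketch in Python. The model size and bandwidth figures below are illustrative assumptions, not measurements from the talk: at batch size 1, every weight byte has to be streamed from HBM once per generated token, so the memory bandwidth alone caps the tokens/s.

# Back-of-envelope: decoding one token must read every weight byte from HBM,
# so latency is roughly model_bytes / bandwidth (assumed numbers, for illustration).
def max_tokens_per_second(num_params, bytes_per_param, hbm_bandwidth_bytes_per_s):
    model_bytes = num_params * bytes_per_param
    seconds_per_token = model_bytes / hbm_bandwidth_bytes_per_s
    return 1.0 / seconds_per_token

# Assumed: a 7B-parameter model on a GPU with ~2 TB/s of memory bandwidth.
print(max_tokens_per_second(7e9, 2.0, 2e12))   # fp16: ~143 tok/s upper bound
print(max_tokens_per_second(7e9, 0.5, 2e12))   # int4: ~571 tok/s upper bound

Since the matmul itself already keeps that bandwidth saturated, prefetching the next layer's weights would not buy anything extra.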
@Chhillee Thank you, that makes sense.
I didn't understand one of the points made. On a couple of occasions Horace mentions that we are loading all the weights (into the registers, I assume) with every token - that's also what the diagram shows at ua-cam.com/video/18YupYsH5vY/v-deo.html . Is that what's happening? Can the registers hold all the model weights at once? If that were the case, why do you need to load them every time instead of leaving them untouched? I hope that's not too stupid of a question.
This is a good question! The big problem is that GPUs do not have enough registers (i.e. SRAM) to load all the model weights at once. A GPU has on the order of megabytes of registers/SRAM, while the weights require 10s of gigabytes to store.
Q: But what if we used hundreds of chips to have enough SRAM to store the entire model? Would generation be much faster then?
A: Yes, and that's what we have with Groq :)
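A quick sanity check of those orders of magnitude (the capacities below are rough, illustrative assumptions, not exact specs from the talk):

# Rough orders of magnitude (illustrative assumptions, not exact specs):
sram_bytes    = 50e6       # on-chip SRAM/registers per GPU: tens of MB
weights_bytes = 7e9 * 2    # e.g. a 7B-parameter model in fp16: ~14 GB

print(weights_bytes / sram_bytes)  # ~280x too big to keep resident on one chip
# Hence the weights get re-streamed from HBM for every generated token,
# unless you shard the model across enough chips that SRAM alone holds it.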
@Chhillee Thanks!! I appreciate the answer. I assume the diagram has been simplified for clarity, then.
Regarding your question about why gpt-fast is faster than the CUDA version: kernel fusion. Merging multiple kernels into one is faster than running several hand-written ones.
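As a minimal illustration of that fusion idea (a toy sketch, not the actual gpt-fast kernels): torch.compile can fuse a chain of pointwise ops into a single generated kernel, so the intermediates never round-trip through global memory.

import torch

# Eager mode launches one kernel per op and writes each intermediate to memory.
def gelu_mul(x, y):
    return torch.nn.functional.gelu(x) * y

# torch.compile traces the function and, on GPU, emits one fused kernel for the
# pointwise chain: x and y are read once and the result is written once.
compiled = torch.compile(gelu_mul)

x = torch.randn(4096, 4096, device="cuda")
y = torch.randn(4096, 4096, device="cuda")
out = compiled(x, y)  # first call compiles; later calls reuse the fused kernel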
How does perplexity (PPL) look with int4 quants? Also, with GPTQ, how high is the tokens/s with gpt-fast?
Is there a Discord for this channel's community?
Yes sir! Please see the video description.
Awesome talk! Can Triton target TPUs?
Speculative decoding is a major factor, right? If so, it's not a very fair comparison...
None of the results use speculative decoding except the ones we specifically mentioned as using it. I.e., we hit ~200 tok/s with int4 without spec-dec, and 225 or so with spec-dec.
But CTranslate2, as I understand it, is still faster?
Wondering why larger batch sizes in general don't work well with torch.compile? ua-cam.com/video/18YupYsH5vY/v-deo.html