GPT-Fast - blazingly fast inference with PyTorch (w/ Horace He)

  • Published 29 Jan 2025

COMMENTS • 21

  • @TheAIEpiphany
    @TheAIEpiphany  10 months ago +1

    Horace He joined us to walk us through what one can do with native PyTorch when it comes to accelerating inference! Also, if you need some GPUs, check out Hyperstack: console.hyperstack.cloud/?Influencers&Aleksa+Gordi%C4%87 who are sponsoring this video! :)

  • @xl0xl0xl0
    @xl0xl0xl0 10 months ago +5

    Wow, this presentation was excellent. Straight to the point. No over-complicating, no over-simplifying, no trying to sound smart by obscuring simple things. Thank you Horace!

  • @kaushilkundalia2197
    @kaushilkundalia2197 3 months ago +1

    It was so informative

  • @orrimoch5226
    @orrimoch5226 10 months ago +1

    Wow! It was very educational and practical!
    I liked the graphics in the presentation!
    Great job by both of you!
    Thanks!

  • @Cropinky
    @Cropinky 8 months ago +1

    I love this guy so much, it's unreal

  • @nikossoulounias7036
    @nikossoulounias7036 10 months ago

    Super interesting talk!! Do you guys have any idea how the compilation-generated decoding kernel compares against custom kernels like Flash-Decoding or Flash-Decoding++?

  • @xl0xl0xl0
    @xl0xl0xl0 10 months ago

    One thing that was not super clear to me: are we loading the next weight matrix (assuming there is enough SRAM) while the previous matmul+activation is being computed?

    • @Chhillee
      @Chhillee 10 months ago

      Within each matmul, the loading of data from main memory into registers occurs at the same time as the values are being computed.
      So the answer to your question is "no, but it also wouldn't help, because the previous matmul/activation is already saturating the bandwidth"

    • @xl0xl0xl0
      @xl0xl0xl0 10 months ago

      @@Chhillee Thank you, makes sense.
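
A rough back-of-the-envelope version of the point above (all numbers are illustrative assumptions, not measurements): during batch-1 decoding every weight has to be streamed from HBM once per generated token, so the matmuls are already bandwidth-bound and prefetching the next layer's weights would not buy anything.

```python
# Roofline-style estimate for batch-1 decoding (illustrative numbers only).
weight_bytes = 7e9 * 2      # assume a 7B-parameter model in fp16 -> ~14 GB of weights
hbm_bandwidth = 2.0e12      # assume ~2 TB/s of HBM bandwidth (A100-class GPU)

# Every generated token must read every weight from HBM at least once,
# so the best case is limited by memory bandwidth, not compute:
max_tokens_per_sec = hbm_bandwidth / weight_bytes
print(f"bandwidth-bound ceiling: ~{max_tokens_per_sec:.0f} tok/s")

# The arithmetic side is only ~2 FLOPs per weight per token, far below what the
# GPU can sustain, which is why overlapping the *next* weight load with the
# current matmul cannot help once the memory bus is already saturated.
```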

  • @XartakoNP
    @XartakoNP 10 months ago

    I didn't understand one of the points made. On a couple of occasions Horace mentions that we are loading all the weights (into the registers, I assume) with every token - that's also what the diagram shows at ua-cam.com/video/18YupYsH5vY/v-deo.html . Is that what's happening? Can the registers load all the model weights at once? If that were the case, why do you need to load them every time instead of leaving them untouched? I hope that's not too stupid of a question.

    • @Chhillee
      @Chhillee 10 months ago

      This is a good question! The big problem is that GPUs do not have enough registers (i.e. SRAM) to load all the model weights at once. A GPU has on the order of megabytes of registers/SRAM, while the weights require 10s of gigabytes to store.
      Q: But what if we used hundreds of chips to have enough SRAM to store the entire model? Would generation be much faster then?
      A: Yes, and that's what we have with Groq :)

    • @XartakoNP
      @XartakoNP 10 months ago

      @@Chhillee Thanks!! I appreciate the answer. I assume the diagram has been simplified for clarity then
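
To put rough numbers on the reply above (all figures are order-of-magnitude assumptions): on-chip SRAM is several orders of magnitude smaller than the weights, which is why they are re-streamed from HBM for every token, and why an SRAM-only setup needs many chips.

```python
# Order-of-magnitude comparison of on-chip storage vs. model weights (assumed figures).
sram_per_gpu = 40e6          # assume ~40 MB of registers/shared memory/L2 on one GPU
weights_fp16 = 70e9 * 2      # assume a 70B-parameter model in fp16 -> ~140 GB

print(f"weights are ~{weights_fp16 / sram_per_gpu:,.0f}x larger than on-chip SRAM")

# Spreading the model over many chips so the weights live entirely in SRAM
# (the Groq approach mentioned above) removes the per-token HBM traffic.
sram_per_chip = 200e6        # assume ~200 MB of SRAM per accelerator chip
print(f"chips needed to hold the weights in SRAM: ~{weights_fp16 / sram_per_chip:.0f}")
```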

  • @xmorse
    @xmorse 10 months ago +1

    Your question about why gpt-fast is faster than the CUDA version: kernel fusion. Merging kernels into one is faster than multiple hand-written ones.
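
A minimal sketch of what kernel fusion means here (a generic example, not gpt-fast's actual code): torch.compile can fuse a chain of element-wise ops into one generated kernel, so the intermediate tensors never make a round trip to global memory.

```python
import torch

def scaled_gelu(x, w):
    # In eager mode these three element-wise ops launch three separate kernels,
    # each reading and writing the full tensor in global memory.
    return torch.nn.functional.gelu(x) * w + 1.0

# torch.compile traces the function and, on CUDA, can emit a single fused Triton
# kernel, so x and w are read once and the result is written once.
scaled_gelu_compiled = torch.compile(scaled_gelu)

x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")
out = scaled_gelu_compiled(x, w)
```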

  • @SinanAkkoyun
    @SinanAkkoyun 10 months ago

    How does PPL look at int4 quants? Also, given GPTQ, how high is the tps with gpt-fast?

  • @mufgideon
    @mufgideon 10 months ago +1

    Is there any Discord for this channel community?

    • @TheAIEpiphany
      @TheAIEpiphany  10 months ago +2

      Yes sir! Pls see vid description

  • @tljstewart
    @tljstewart 10 months ago

    Awesome talks! Can Triton target TPUs?

  • @kimchi_taco
    @kimchi_taco 10 months ago

    Speculative decoding is a major thing, right? If so, it's not a very fair comparison...

    • @Chhillee
      @Chhillee 10 months ago

      None of the results use speculative decoding except the ones we specifically mentioned as using it. I.e., we hit ~200 tok/s with int4 without spec-dec, and 225 or so with spec-dec.
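
For context on what spec-dec adds, a simplified sketch of speculative decoding with greedy verification (function and variable names are placeholders; the actual method verifies sampled tokens with a rejection-sampling rule rather than exact-match checks, and this sketch assumes batch size 1):

```python
import torch

def speculative_decode_step(draft_model, target_model, tokens, k=5):
    """One speculative step: draft k tokens cheaply, then verify them in a
    single pass of the large model. Assumes batch size 1, greedy decoding."""
    n_prompt = tokens.shape[-1]

    # 1) The small draft model proposes k tokens autoregressively (cheap).
    draft = tokens
    for _ in range(k):
        next_tok = draft_model(draft)[:, -1].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=-1)

    # 2) The large target model scores all k proposals in ONE forward pass,
    #    which, being memory-bandwidth bound, costs about as much as one token.
    preds = target_model(draft).argmax(dim=-1)

    # 3) Accept draft tokens up to the first disagreement; the target model's
    #    own prediction is kept at the disagreement point, so each step yields
    #    between 1 and k tokens per large-model pass.
    accepted = tokens
    for i in range(k):
        target_tok = preds[:, n_prompt + i - 1 : n_prompt + i]
        accepted = torch.cat([accepted, target_tok], dim=-1)
        if not torch.equal(target_tok, draft[:, n_prompt + i : n_prompt + i + 1]):
            break
    return accepted
```

Because the target model's pass is bandwidth-bound, verifying several drafted tokens costs roughly the same as generating one, which is where the reported jump from ~200 to ~225 tok/s comes from.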

  • @kyryloyemets7022
    @kyryloyemets7022 10 months ago

    But CTranslate2, as I understand, is still faster?

  • @let-me-handle
    @let-me-handle 1 month ago

    Wondering why larger batch sizes in general don't work well with torch.compile? ua-cam.com/video/18YupYsH5vY/v-deo.html