LLM inference optimization: Architecture, KV cache and Flash attention

  • Published 15 Nov 2024

COMMENTS • 5

  • @cliffordino · 1 month ago · +1

    Nicely done and very helpful! Thank you!! FYI, the stress is on the first syllable of "INference", not the second ("inFERence").

    • @yanaitalk · 1 month ago

      Copy that! Thank you😊

  • @johndong4754 · 1 month ago

    I've been learning about LLMs over the past few months, but I haven't gone into too much depth. Your videos seem very detailed and technical. Which one(s) would you recommend starting off with?

    • @yanaitalk · 1 month ago

      There are excellent courses from DeepLearning.ai on Coursera. To go even deeper, I recommend reading the technical papers directly, which gives you a deeper understanding.

    • @HeywardLiu · 1 month ago

      1. Roofline model
      2. Transformer arch. > bottleneck of attention > flash attention
      3. LLM inference can be divided into a prefill stage (compute-bound) and a decoding stage (memory-bound); see the roofline sketch below
      4. LLM serving: paged attention, radix attention
      If you want to optimize inference performance, this review paper is awesome: LLM Inference Unveiled: Survey and Roofline Model Insights
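
To make points 1 and 3 of the comment above concrete, here is a minimal roofline-style sketch in Python. The hardware numbers (A100-class peak FLOPs and HBM bandwidth), the 7B parameter count, and the 2048-token prompt length are illustrative assumptions, not figures from the video; only the weight GEMMs are counted and activations/KV traffic are ignored. The point is simply that a full-prompt prefill pass has far higher arithmetic intensity than a one-token decode pass.

```python
# Minimal roofline-style sketch: arithmetic intensity of prefill vs. decode.
# Assumptions (not from the video): A100-class GPU specs, 7B-parameter model
# in bf16, and only weight GEMMs counted (activations/KV cache ignored).

PEAK_FLOPS = 312e12          # assumed peak tensor-core throughput, FLOP/s
MEM_BW = 2.0e12              # assumed HBM bandwidth, bytes/s (~2 TB/s)
RIDGE = PEAK_FLOPS / MEM_BW  # FLOP/byte needed to be compute-bound

PARAMS = 7e9                 # assumed model size (parameters)
BYTES_PER_PARAM = 2          # bf16 weights


def arithmetic_intensity(batch_tokens: int) -> float:
    """FLOPs per byte of weights streamed for one forward pass.

    Each token costs roughly 2 * PARAMS FLOPs (one multiply-add per weight),
    while the weights are read from HBM once per pass no matter how many
    tokens share that pass.
    """
    flops = 2 * PARAMS * batch_tokens
    bytes_moved = PARAMS * BYTES_PER_PARAM
    return flops / bytes_moved


for name, tokens in [("prefill (2048-token prompt)", 2048),
                     ("decode (1 new token)", 1)]:
    ai = arithmetic_intensity(tokens)
    regime = "compute-bound" if ai > RIDGE else "memory-bound"
    print(f"{name:28s}: {ai:7.1f} FLOP/byte vs ridge {RIDGE:.0f} -> {regime}")
```

With these assumed numbers, prefill lands well above the ridge point (compute-bound) while single-token decode sits far below it (memory-bound), which is why KV caching and batching are the levers that matter for the decode stage.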