How FlashAttention Accelerates Generative AI Revolution

COMMENTS • 38

  • @raayandhar6195
    @raayandhar6195 3 months ago +6

    Great video as always

  • @JanHolly-m5s
    @JanHolly-m5s 2 months ago +3

    Thank you for this video! Amazing explanation!

    • @jbhuang0604
      @jbhuang0604  2 months ago

      You’re welcome! Happy that you liked it.

  • @aritraroy3220
    @aritraroy3220 1 month ago +1

    Wow! What a beautiful explanation!!!!

  • @JanHolly-m5s
    @JanHolly-m5s 2 months ago +1

    Thank you for this video! Clearly explained!

  • @LarryLaiTW
    @LarryLaiTW 3 months ago +1

    Clearly explained! Highly recommended.

  • @davidlearnforus
    @davidlearnforus 1 month ago +1

    It's a very good explanation and video in general as well!

    • @jbhuang0604
      @jbhuang0604  1 month ago +2

      Glad it was helpful!

    • @VladimerKhasia
      @VladimerKhasia 27 days ago

      @@jbhuang0604 Love all of the content on this channel! Thank you so much for doing this. I am spreading info about this great channel everywhere :))

  • @柳沢-u3g
    @柳沢-u3g 3 months ago +1

    Great explanation and animation!

  • @jiaruixu4873
    @jiaruixu4873 1 month ago +1

    Really amazing video! May I ask what tools you use to create this video?

    • @jbhuang0604
      @jbhuang0604  1 month ago +1

      Thanks! The animation comes from PowerPoint. I edit the video with Adobe Premiere Pro.

  • @sourabhverma9034
    @sourabhverma9034 3 months ago +1

    Awesome!

  • @Тима-щ2ю
    @Тима-щ2ю 2 months ago +1

    Combining this video with Umar Jamil's implementation is useful.

  • @present-bk2dh
    @present-bk2dh 3 months ago

    Thanks for making this! The notation is a bit confusing @8:36: if S = Q.K^T and S = {x_1, x_2, ..., x_N}, then x_1, x_2, ... should be column vectors, but in the recurrence
    m_0 = -inf
    ...
    m_i = max(m_{i-1}, x_i)
    they are handled as scalar values. Perhaps there's a missing outer loop over q_j, where j goes through 1..N (if square matrices)? But then S would be
    S = {x_{1,1}, x_{1,2}, ..., x_{1,N};
         x_{2,1}, x_{2,2}, ..., x_{2,N};
         ...;
         x_{N,1}, x_{N,2}, ..., x_{N,N}}.
    Essentially I'm confused about whether O_N is a vector or a scalar value.
    Thanks again for this content, I really enjoyed it!

    • @jbhuang0604
      @jbhuang0604  3 months ago +1

      Thanks for the question. Yes, in general S is an N x N matrix, where N is the number of tokens.
      When explaining the online softmax, we only look at the attention coming from one query vector and all key vectors, so S = q * K^T. Here the query vector q is of size 1 x d_k and the key matrix K^T is of size d_k x N. Therefore, the "matrix" S is just a 1 x N vector.
      O_N is a vector of size 1 x d_v, where d_v is the value dimension (it's a weighted average of the value vectors, where the weights come from the attention).
      We only need to look at one query vector to understand the key idea of online softmax and FlashAttention. We can process multiple query vectors and key/value vectors at the same time in parallel (depending on the size of the on-chip SRAM). See the sketch at the end of this thread for a minimal version of the single-query computation.

    • @present-bk2dh
      @present-bk2dh 3 months ago +1

      @@jbhuang0604 O_N is of shape 1 by d_v, thank you so much for this answer and making this video! You really made it click!

    • @jbhuang0604
      @jbhuang0604  3 months ago +1

      Thanks a lot!
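
A minimal NumPy sketch of the single-query online-softmax attention discussed in the thread above. It assumes unscaled scores x_i = q · k_i; the function name, test data, and self-check are illustrative additions, not taken from the video.

```python
import numpy as np

def online_attention_single_query(q, K, V):
    """Online-softmax attention for a single query vector.

    q: (d_k,) query; K: (N, d_k) keys; V: (N, d_v) values.
    Streams over the scores x_i = q . k_i one at a time, keeping only a
    running max m, a running normalizer l, and a running output o, so the
    full 1 x N score vector never needs to be materialized.
    """
    m = -np.inf                   # running max of the scores seen so far
    l = 0.0                       # running sum of exp(x_j - m)
    o = np.zeros(V.shape[1])      # running (rescaled) partial output

    for k_i, v_i in zip(K, V):
        x_i = q @ k_i             # score for this key
        m_new = max(m, x_i)
        corr = np.exp(m - m_new)  # rescale old accumulators to the new max
        p_i = np.exp(x_i - m_new)
        l = l * corr + p_i
        o = o * corr + p_i * v_i
        m = m_new

    return o / l                  # 1 x d_v row: attention-weighted average of the values

# Self-check against the ordinary two-pass softmax attention.
rng = np.random.default_rng(0)
N, d_k, d_v = 8, 4, 5
q = rng.normal(size=d_k)
K = rng.normal(size=(N, d_k))
V = rng.normal(size=(N, d_v))
s = K @ q
w = np.exp(s - s.max()); w /= w.sum()
assert np.allclose(online_attention_single_query(q, K, V), w @ V)
```

This is the same m/l/o recurrence asked about in the question, with O_N kept as a 1 x d_v vector rather than a scalar.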

  • @t.w.7065
    @t.w.7065 2 months ago +2

    @11:18 not HMB but HBM

    • @jbhuang0604
      @jbhuang0604  2 months ago +3

      Good catch! Clearly I was just trying to make sure people are paying attention. :-p

  • @zhi-shengchen7950
    @zhi-shengchen7950 3 months ago +1

    Very, very clear explanation! Thank you, Professor, I learned a lot! PS: May I ask what software you use to make the animation?

    • @jbhuang0604
      @jbhuang0604  3 months ago +1

      Thank you! It’s mostly the Morph transition in MS PowerPoint.

  • @GapLoser42
    @GapLoser42 2 months ago +1

    This video is so cool!!! May I ask how you make these fantastic slides? Do you use Google Docs, Beamer, or something else?

    • @jbhuang0604
      @jbhuang0604  2 months ago +1

      Thanks! I used PowerPoint. The animation comes from the morph transition.

    • @GapLoser42
      @GapLoser42 2 months ago +1

      @@jbhuang0604 Thx a lot!

  • @Leo-cc9pi
    @Leo-cc9pi 3 months ago +1

    Thank you for your video. I have a simple question.
    The paper explains the outer and inner loops; is the loop order shown at around 10:40 right?

    • @jbhuang0604
      @jbhuang0604  3 months ago +2

      Yes, good catch! In FlashAttention-1, KV is in the outer loop and Q is in the inner loop. In FlashAttention-2, they swap the order: Q is in the outer loop and KV is in the inner loop, which avoids repeatedly writing the partial outputs to the HBM and allows better parallelization (the partial outputs always stay in the on-chip SRAM).
      I intentionally use the loop order from FlashAttention-2 to better illustrate how the partial results accumulate into the full output. I think it's easier for understanding the core concept. See the sketch below for this loop order.
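
A NumPy sketch of the tiled computation in the FlashAttention-2 loop order mentioned above (outer loop over query blocks, inner loop over key/value blocks). The block sizes Br/Bc and the function name are illustrative, and local variables stand in for the on-chip SRAM; this models only the loop structure, not the actual CUDA kernel.

```python
import numpy as np

def tiled_attention_fa2_order(Q, K, V, Br=4, Bc=4):
    """Tiled attention with the FlashAttention-2 loop order (outer loop over Q).

    Q: (N, d) queries; K: (N, d) keys; V: (N, d_v) values.
    For each query block, the running max m, normalizer l, and partial
    output o stay in local variables across the whole inner loop over
    K/V blocks, so each output block is written back exactly once.
    """
    N, d = Q.shape
    O = np.zeros((N, V.shape[1]))

    for qs in range(0, N, Br):                     # outer loop: query blocks
        Qi = Q[qs:qs + Br]
        m = np.full(Qi.shape[0], -np.inf)          # running row-wise max
        l = np.zeros(Qi.shape[0])                  # running normalizer
        o = np.zeros((Qi.shape[0], V.shape[1]))    # running partial output

        for ks in range(0, N, Bc):                 # inner loop: key/value blocks
            Kj, Vj = K[ks:ks + Bc], V[ks:ks + Bc]
            S = Qi @ Kj.T / np.sqrt(d)             # (Br, Bc) tile of scores
            m_new = np.maximum(m, S.max(axis=1))
            corr = np.exp(m - m_new)               # rescale old accumulators
            P = np.exp(S - m_new[:, None])
            l = l * corr + P.sum(axis=1)
            o = o * corr[:, None] + P @ Vj
            m = m_new

        O[qs:qs + Br] = o / l[:, None]             # single write per query block

    return O

# Self-check against plain softmax attention.
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(8, 5))
S = Q @ K.T / np.sqrt(4)
W = np.exp(S - S.max(axis=1, keepdims=True)); W /= W.sum(axis=1, keepdims=True)
assert np.allclose(tiled_attention_fa2_order(Q, K, V), W @ V)
```

With the loops swapped the other way (outer loop over K/V blocks), the running m, l, and o for every query block would have to be re-read and re-written between K/V tiles, which is exactly the HBM traffic the FlashAttention-2 order avoids.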

  • @duzx4541
    @duzx4541 1 month ago +1

    Uncle Roger???
    haha sorry, good explanation, cheers