Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

  • Published Sep 28, 2024

COMMENTS • 13

  • @husienvora9954
    @husienvora9954 5 months ago +1

    great vid Gabriel 👍

  • @TheVirgile27
    @TheVirgile27 5 months ago

    One thing I don't understand well: after training, how do we manage the final output? For a large input, do we "force" the model to respond directly, i.e. produce an output for each input, OR do we first feed in all of the input and then read the output at a certain point? Basically, one could wait until the reading part is complete and then force an answer. Maybe I'm not being clear (English isn't my first language), but it seems there are several ways to retrieve an output from this type of transformer. Please be kind, and thanks for the video :)
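
    (For context: the usual pattern with decoder-only models, and what the paper's streaming setup suggests, is the second option: feed the whole input segment by segment first, and only then sample output tokens. Below is a rough sketch; `model`, `init_memory`, `prefill`, and `decode_step` are hypothetical placeholders, not an API from the paper or the video.)

    ```python
    def answer(model, input_segments, max_new_tokens):
        # Hypothetical sketch: `model` and its methods are placeholders.
        memory = model.init_memory()

        # "Reading" phase: consume the entire input first, updating the
        # fixed-size memory segment by segment; nothing is generated yet.
        for segment in input_segments:
            memory = model.prefill(segment, memory)

        # "Answering" phase: only now generate tokens autoregressively,
        # each conditioned on the memory plus the tokens produced so far.
        tokens = []
        for _ in range(max_new_tokens):
            next_token, memory = model.decode_step(tokens, memory)
            tokens.append(next_token)
        return tokens
    ```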

  • @M-ed5ct
    @M-ed5ct 5 months ago

    Thanks for the video!
    Just one question: the H state has a fixed dimension, but it accumulates additional information as we proceed through the token sequence. After the first segment, H1 only "summarizes" one segment, but after segment N, Hn summarizes the current segment plus Hn-1, which is itself the summary of all the past context. Do you think it would make sense to increase H's dimension as the context proceeds, i.e. have the dimension of Hn grow with n? The idea is to keep the information per bit in H constant, so that we can truly grow to unlimited context without the state becoming a bottleneck.

    • @gabrielmongaras
      @gabrielmongaras 5 months ago

      I think it makes sense to increase the hidden state, though doing so would introduce a memory dependence on the sequence length during inference, which is currently a big problem. One can think of a softmax attention transformer as having an infinite hidden state (the keys/values are just stacked), whereas an RNN has a constant-size hidden state. Perhaps something in the middle would perform better than an RNN but not require as much memory as a Transformer?

    • @M-ed5ct
      @M-ed5ct 5 months ago

      @gabrielmongaras Yeah, the trick is to find a state update function x_{n+1} = S(x_n, segment_n) such that dim(x_{n+1}) > dim(x_n), i.e. projecting the vector x_n into a bigger space while preserving its semantics and folding in segment_n's new data. Indeed, because the state dimension in the paper is tailored to a fairly long context, with a growing state you could even start from a _smaller_ state x1 and grow it with the number of segments... so for a not-too-large context you might even get a memory reduction!
      But I don't see memory usage as a problem; you can always clamp it to a maximum if really needed, a kind of max_memory parameter... it can't be worse than the original.
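
      (A minimal sketch of the fixed-size memory being discussed, assuming the linear-attention-style update from the paper with an ELU+1 feature map; the variable names are my own. A growing-state variant like the one proposed above would let M change shape between segments instead of staying (d, d).)

      ```python
      import numpy as np

      def elu_plus_one(x):
          # Assumed feature map (ELU + 1) for the linear-attention-style memory.
          return np.where(x > 0, x + 1.0, np.exp(x))

      d = 64                    # key/value dimension, fixed up front
      M = np.zeros((d, d))      # compressive memory: size independent of context length
      z = np.zeros(d)           # normalization term

      def read_memory(M, z, Q_seg):
          # Retrieve from the memory accumulated over all previous segments.
          sQ = elu_plus_one(Q_seg)                       # (s, d)
          return (sQ @ M) / ((sQ @ z)[:, None] + 1e-6)   # (s, d); eps avoids 0/0 on segment 1

      def update_memory(M, z, K_seg, V_seg):
          # Fold one segment's keys/values into the fixed-size state.
          sK = elu_plus_one(K_seg)                       # (s, d)
          return M + sK.T @ V_seg, z + sK.sum(axis=0)    # M stays (d, d) forever

      # Streaming over segments: memory cost stays O(d^2), not O(sequence length).
      for _ in range(10):
          Q_seg, K_seg, V_seg = (np.random.randn(128, d) for _ in range(3))
          out = read_memory(M, z, Q_seg)
          M, z = update_memory(M, z, K_seg, V_seg)
      ```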

  • @ericl227
    @ericl227 5 months ago +2

    At 17:50, shouldn't H_i be a summation of k_j and v_j over j instead of i, where j goes from 1 to i?

    • @gabrielmongaras
      @gabrielmongaras 5 months ago +1

      Yep, nice catch! I put a note in the video about the reindexing.
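
      (For reference, a sketch of the corrected indexing, assuming k_j and v_j are the per-token key/value column vectors and σ is the feature map used in the video:)

      ```latex
      H_i = \sum_{j=1}^{i} \sigma(k_j)\, v_j^{\top}
      \qquad\text{equivalently}\qquad
      H_i = H_{i-1} + \sigma(k_i)\, v_i^{\top}
      ```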

  • @danieldeychakiwsky1928
    @danieldeychakiwsky1928 5 months ago

    Great vids! At around 3 minutes, when you get into the attention matrices, I think the dimensions aren't right: if Q is d by s and K is d by s and we take Q K-transpose, then a d by s matmul with an s by d gives a d by d matrix, but the video shows that matrix as s by s.

    • @gabrielmongaras
      @gabrielmongaras 4 months ago

      Thanks! In that part I transposed the diagram because I thought it looked a little better that way. Sometimes the diagrams I draw are transposed, but I try to label the dimensions to avoid ambiguity. So the s by s matrix is the resulting matrix, not a d by d one. An s by s matrix captures relations between sequence positions, while a d by d matrix captures relations between feature dimensions across the entire sequence.
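
      (A quick shape check of what's described above, with s = sequence length and d = head dimension; purely illustrative:)

      ```python
      import numpy as np

      s, d = 8, 64                      # sequence length, head dimension

      # Row-per-token convention: Q and K are (s, d)
      Q = np.random.randn(s, d)
      K = np.random.randn(s, d)
      attn = Q @ K.T                    # (s, d) @ (d, s) -> (s, s): token-to-token relations
      print(attn.shape)                 # (8, 8)

      # Column-per-token (transposed diagram) convention: Q and K are (d, s)
      Qt, Kt = Q.T, K.T
      print(np.allclose(Qt.T @ Kt, attn))   # True: same s x s matrix, only drawn transposed

      # A genuine d x d product would instead relate feature dimensions across the sequence
      dim_relations = Qt @ Kt.T         # (d, s) @ (s, d) -> (d, d)
      print(dim_relations.shape)        # (64, 64)
      ```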

  • @YEETSWORLDWIDE
    @YEETSWORLDWIDE 5 months ago +2

    so basically what you're telling me is the world is going to end

  • @Eniac2045
    @Eniac2045 5 months ago +1

    Thanks, another great vid!

  • @EobardUchihaThawne
    @EobardUchihaThawne 4 months ago

    I have to get used to this scientific notation; I struggle to write code from these articles myself.