LongNet: Scaling Transformers to 1,000,000,000 Tokens Explained

  • Published 14 Nov 2024

COMMENTS • 20

  • @TTTrouble • 1 year ago +1

    Thank you for this; it's currently the best-explained and most detailed review of the paper, which came out only days ago. Really appreciate your time and how quickly you put this out!

  • @berkk1993 • 1 year ago

    You are the best explainer!!! Period.

  • @VperVendetta1992 • 1 year ago +3

    Amazing explanation, thanks! Honestly the 1 billion token claim sounds more like a flex than anything else. They didn't actually build a model with 1 billion token context size, they just showed that it's mathematically possible with their algorithm.
    Let's see how it really performs against other models with standard quadratic attention and the same context size.
    To me it seems like the skipping of tokens to attend to is arbitrary, and important semantic meaning can be lost. I wonder if someone will come up with other strategies for choosing the best tokens that other tokens should attend to.

    • @gabrielmongaras • 1 year ago +1

      Yeah it definitely seems like a flex saying they could get to 1B tokens. That was even in the title and they didn't even train a model on 1B tokens.
      I do wonder how it performs against full attention. Most algorithms that have sub-quadratic attention tend to perform a lot worse as the amount of information they can process per layer is much smaller.
      I think whoever finds a way to do dynamic best-token selection that is also efficient would create a kind of revolutionary algorithm, though all the ways I can think of require the full attention to be computed anyway.

    • @zandrrlife • 1 year ago

      Idk bro. Think about how humans process data: we only care about and use the important bits of context from a sequence. We need more empirical results, honestly. Great content. Not a lot of REAL engineers on YouTube. Stay blessed.

    • @hi-gf5yl • 1 year ago

      @zandrrlife We likely learn what to ignore and what to pay attention to. Most sub-quadratic attention patterns are fixed in advance; an exception is the Routing Transformer.
      There's also research into analyzing eye movements while reading and comparing them to attention in a GPT.

    • @zandrrlife • 1 year ago

      @hi-gf5yl That's super interesting. Can you pass a brother a link to the paper, please 🙏?

    • @hi-gf5yl • 1 year ago

      @zandrrlife Eye gaze and self-attention:
      aclanthology.org/2022.cmcl-1.9.pdf

  • @Calphool222 • 1 year ago

    Without a doubt, naive dense attention is the bottleneck everybody is turning their brains toward.
    When we think about how humans process text while reading, it doesn't *seem* very similar to LongNet's approach. Instead, I suspect brain chemistry is involved. Some words "stick out" to us and mean more to us because of our experiences. We throw away noise ruthlessly as we process text, and we don't do so in a systematic way (like LongNet does). We build our own version of what the text says, which is a personally compressed version of the actual text. You see this most obviously when you're in a book club or poetry group and you realize that another person got something *entirely* different from the same text, because what was important for them to remember about what they were reading is very different from how you assessed the text.

  • @TTTrouble • 1 year ago +1

    I'm trying to understand how this doesn't just delete/ignore information in the dilated layers. I'm sure I'm missing something, but the method just skips certain tokens in the attention calculation; it doesn't average anything out, so... where does the information for all the attention not calculated in the sparser layers come from? In your explanation of the hops, I don't understand how the fact that token 3 attends to 5 provides any implicit meaning for the attention token 2 should get from token 5 by looking at 3. If that were the case and information were not lost, why not go to the natural extreme and have a local window size of 2 instead of 4, since it can just keep hopping along in the next layer to get information from token "x"?
    No worries if this question is too hard to explain and you don't have time to respond. I don't have a strong background in this field and may be missing some fundamentals, overlooking something obvious, or it may just be an answer I have to discover on my own to truly understand.

    • @gabrielmongaras • 1 year ago +2

      Good question! With full attention, each output token is basically a weighted average of all the other tokens, and the attention matrix holds those weights.
      Let's say we have three tokens: x = A, B, C
      Then the output of an attention block is:
      x2 = M @ x
      where M is our attention matrix, and if we expand this:
      A2 = a1*A + a2*B + a3*C
      B2 = b1*A + b2*B + b3*C
      C2 = c1*A + c2*B + c3*C
      (where an is the nth attention weight for the A embedding, bn is the nth attention weight for the B embedding ...)
      So our attention matrix looks like so:
      AA AB AC
      BA BB BC
      CA CB CC
      Now, what if I sparsify things:
      AA AB AC
      BA BB ---
      CA --- CC
      So token C cannot get information from token B, and token B cannot get information from token C? Well, no: since A2 is a linear combination of A, B, and C, information can flow through A to reach both B and C in the second attention layer:
      A2 = a1*A + a2*B + a3*C
      B2 = b1*A + b2*B + 0
      C2 = c1*A + 0 + c3*C
      A3 = a1*A2 + a2*B2 + a3*C2
      B3 = b1*A2 + b2*B2 + 0
      C3 = c1*A2 + 0 + c3*C2
      If we just look at B3:
      B3 = b1*A2 + b2*B2 + 0
      = b1*(a1*A + a2*B + a3*C) + b2*(b1*A + b2*B)
      showing that information from C does flow to B
      Of course, due to the softmax and the other layers, the information flows a bit differently, but this is the basic idea if we strip everything down to the attention calculation (and the softmax). Though we don't know exactly what the model does with these scores, given its black-box nature.
      All that said, if you attend to certain tokens in a smart way, then every token can still route information to every other token. If you do it even smarter, information can be routed between tokens more quickly than with other methods. In their method, they construct the matrix so that information can be routed to all other tokens within log N layers, while the attention matrix for each layer is essentially linear in the sequence length. Even though all tokens cannot attend to all other tokens in a single layer, they find that the log N relationship suffices, meaning all tokens didn't need to attend to all other tokens immediately in the first place.
      The authors could have used a random attention pattern or a different one, but they chose this scheme specifically so that the attention cost scales linearly with the sequence length and the routing of information happens within log N layers.
      Hope this helps!
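
      A minimal numerical sketch of this routing idea (my own illustration, not code from the paper; it assumes attention is stripped down to fixed mixing weights, with no softmax or value/output projections, and the weights themselves are made up):

```python
import numpy as np

# Toy embeddings for tokens A, B, C (one row each, 2-dimensional).
x = np.array([[1.0, 0.0],   # A
              [0.0, 1.0],   # B
              [2.0, 3.0]])  # C

# Sparsified attention matrix from the comment above:
# B cannot attend to C, and C cannot attend to B (those weights are zeroed out).
M = np.array([[0.4, 0.3, 0.3],   # A attends to A, B, C
              [0.5, 0.5, 0.0],   # B attends to A, B only
              [0.6, 0.0, 0.4]])  # C attends to A, C only

x2 = M @ x    # first sparse attention layer
x3 = M @ x2   # second sparse attention layer

# Two layers compose into an effective mixing matrix: x3 = (M @ M) @ x.
print(M @ M)
# The (B, C) entry of M @ M is 0.15, i.e. nonzero: after two layers,
# token B has received information that originated at token C,
# routed through token A, even though B never attends to C directly.
```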

    • @TTTrouble • 1 year ago +1

      @gabrielmongaras Ahh yes, the matrix expansion was really helpful and makes it a little clearer to me what's going on.
      It's hard to wrap my head around the fact that the model just figures out how to pass along the requisite information during training through what seems like a much narrower information channel compared to the standard method. But if I'm thinking about it right, that would support the idea that current models carry a lot of redundant information, and it would explain why LoRAs work. Super cool, thanks so much for explaining!

    • @zandrrlife • 1 year ago

      @TTTrouble Outlier features are meaningless 😂. This is true. Interesting stuff.

  • @Skinishh • 1 year ago

    What about positional embeddings? How do they deal with them?

    • @gabrielmongaras • 1 year ago

      Since they treat this like a normal transformer and claim that dilated attention can substitute for normal attention, I would imagine the positional embeddings would be the same as for a normal transformer. So it depends on the task, but likely relative positional embeddings for sequential data, as that seems to be the current standard. You would just add them as usual, which depends on the type of positional encoding.
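
      As a concrete illustration of "add them as usual" (a hypothetical sketch for the simple learned absolute case; the paper and video don't prescribe this exact setup), the positional embeddings are just summed into the token embeddings before the first dilated-attention block:

```python
import torch
import torch.nn as nn

class TokenAndPositionEmbedding(nn.Module):
    """Learned token + absolute position embeddings, summed as in a vanilla transformer.
    Dilated attention would consume the result exactly as normal attention would."""
    def __init__(self, vocab_size: int, max_len: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions)

emb = TokenAndPositionEmbedding(vocab_size=32000, max_len=4096, d_model=512)
x = emb(torch.randint(0, 32000, (2, 128)))  # would then feed into the dilated-attention stack
print(x.shape)  # torch.Size([2, 128, 512])
```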

  • @김화겸-y6e • 1 year ago

    Can't wait for the implementation code!

    • @gabrielmongaras • 1 year ago

      They open-sourced their implementation here: github.com/kyegomez/LongNet
      Looks like they made it really easy to add to your models.

  • @zhengzhou2076 • 1 year ago

    Kind of like hand-created conv kernels. Is it just me?

    • @gabrielmongaras • 1 year ago +1

      Kind of. Since convolutions deal with local information using static weights and attention deals with global information using dynamic weights, I think it's better to call it a hand-crafted attention kernel. Though I guess the naming doesn't really matter, as it's just some form of information routing in the end.
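
      To make that comparison concrete, here is a rough sketch of the selection pattern being discussed (my own simplified reading of dilated attention, not the authors' code): split the sequence into segments of length w, keep every r-th token inside each segment, and compute attention only among the kept tokens.

```python
import numpy as np

def dilated_indices(seq_len: int, segment_len: int, dilation: int, offset: int = 0):
    """Return, per segment, the token indices kept under a (segment_len, dilation) pattern.

    Simplified reading of LongNet-style dilated attention: attention is computed only
    among the kept tokens of each segment, so the per-segment cost drops from
    segment_len**2 to roughly (segment_len / dilation)**2.
    """
    kept = []
    for start in range(0, seq_len, segment_len):
        segment = np.arange(start, min(start + segment_len, seq_len))
        kept.append(segment[offset::dilation])  # keep every `dilation`-th token
    return kept

# 16 tokens, segments of 8, keep every 2nd token in each segment.
for group in dilated_indices(seq_len=16, segment_len=8, dilation=2):
    print(group)
# [0 2 4 6]
# [ 8 10 12 14]
# Different (segment_len, dilation) pairs are assigned to different heads/layers,
# which is what lets information eventually reach every token, as discussed above.
```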