RoFormer: Enhanced Transformer with Rotary Position Embedding Explained

  • Published Feb 4, 2025

COMMENTS • 27

  • @berkk1993 · 1 year ago

    You are awesome, I only understand when I watch you!

  • @sagumekishin5748 · 1 year ago +4

    9:54 The positional encoding should propagate to higher layers easily because of the skip connections

  • @dziubek3run · 1 year ago +1

    From what I understand, the sinusoidal positional encoding, which you called the "default" one, is not only an absolute encoding but also a relative encoding. It does both.

    • @gabrielmongaras · 1 year ago +1

      I think the reason sinusoidal positional encodings are referred to as "absolute" while rotary encodings are referred to as "relative" is the method of construction. Rotary encodings are created relative to each token, whereas sinusoidal encodings are created relative to the start of the sequence. However, as you said, the model probably learns some sort of relative encoding between the tokens when using sinusoidal encodings.
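
For illustration (not from the video or the comments): a minimal NumPy sketch of the construction difference described above, with all names my own. The sinusoidal encoding is a vector computed from the absolute position index and added to the embedding, while RoPE rotates query/key pairs so that the attention dot product depends only on the offset m - n.

```python
import numpy as np

d = 8                                            # toy embedding (head) dimension
base = 10000.0
inv_freq = base ** (-np.arange(0, d, 2) / d)     # one theta per 2-D pair

def sinusoidal(pos):
    """Absolute encoding: a vector computed from the position index itself."""
    angles = pos * inv_freq
    return np.concatenate([np.sin(angles), np.cos(angles)])

def rotate(x, pos):
    """RoPE: rotate each 2-D pair of x by pos * theta_i (applied to q and k)."""
    x1, x2 = x[0::2], x[1::2]
    cos, sin = np.cos(pos * inv_freq), np.sin(pos * inv_freq)
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

print(sinusoidal(5))   # depends only on the absolute index 5

q, k = np.random.randn(d), np.random.randn(d)
m, n = 7, 3
# The rotary score depends only on the relative offset m - n:
print(np.allclose(rotate(q, m) @ rotate(k, n), rotate(q, m - n) @ rotate(k, 0)))  # True
```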

  • @abc123634 · 1 year ago

    Nice one. Subscribed. Please keep up the great work!

  • @lukasslatter3651 · 1 year ago

    Pretty good! RoPE was difficult for me to understand, but it's clear after watching your video.

  • @TTTrouble · 1 year ago +3

    I'm assuming you were hinting at the Kaio Ken SuperHOT thing when you said you were reading this to better understand some developments in improving context length. I can't say I fully grasp the rotational encoding part, but I can't get over how simple and seemingly cool that discovery was (especially with the mutual credit between the two groups that independently discovered it). Trying to learn what I needed (rotational embeddings) to understand how that method works is how I first stumbled on your channel.
    Still working on fully grasping positional embeddings in all their different flavors, but my eventual takeaway essentially boiled down to: how do you count to 20 without going past 10? Easy, 1, 1.5, 2, 2.5…
    Followed by… WTF, THAT WORKS!?

    • @gabrielmongaras · 1 year ago +1

      Kind of. I was actually hinting at the recent paper Meta released, "Extending Context Window of Large Language Models via Position Interpolation", since they directly worked with these RoPE relative positional embeddings.
      I hadn't heard of the Kaio Ken SuperHOT repo, but the research looks pretty interesting. The odd part is that their repo also appears to use interpolation to extend context length, and I think it was released before the Meta paper. Kind of cool to see the same idea come up again. So I guess in a way I was indirectly hinting at their repo?
      The idea of positional encodings isn't too complicated overall; the math notation just makes it hard to look at sometimes. Basically, you put some type of series, like you said, into the model, whether indirectly (like this paper) or directly (like ALiBi or absolute encodings). Sometimes the series is w.r.t. the first token and sometimes w.r.t. the token being attended to. Either way, it's just a way for the model to know it's processing a sequence, not an unordered set. (See the interpolation sketch after this thread.)

    • @Sebastian-jf5cp · 1 year ago +2

      @@gabrielmongaras Kaio Ken's Reddit post was cited in Meta's paper. So...

    • @gabrielmongaras · 1 year ago

      @@Sebastian-jf5cp ah ok. That makes more sense now. Didn't realize that 😅
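
A minimal sketch (my own, not from the video) of the position-interpolation idea discussed in this thread, as in Meta's "Extending Context Window of Large Language Models via Position Interpolation" and the Kaio Ken SuperHOT work: instead of extrapolating RoPE to unseen position indices, new positions are rescaled back into the range seen during training. The function names and the 2048/8192 lengths are illustrative assumptions.

```python
import numpy as np

def rope_angles(position, d, base=10000.0):
    """Per-dimension RoPE rotation angles for a single position index."""
    theta = base ** (-np.arange(0, d, 2) / d)
    return position * theta

def interpolated_angles(position, d, train_len, target_len, base=10000.0):
    """Position interpolation: shrink the index so the extended context
    reuses rotations the model already saw during training."""
    scale = train_len / target_len            # e.g. 2048 / 8192 = 0.25
    return rope_angles(position * scale, d, base)

d, train_len, target_len = 128, 2048, 8192
# Position 6000 lies outside the 2048-token training range, but after
# interpolation it maps to index 1500, which was seen during training.
print(np.allclose(interpolated_angles(6000, d, train_len, target_len),
                  rope_angles(1500, d)))      # True
```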

  • @jonathandoucette3158 · 1 year ago

    Great explanation! Subscribed 😊

  • @gunnerstone120 · 4 months ago

    Am I correct in thinking that the rotational embedding goes from 0 to 360 degrees? In that case, won't the first word of the sequence be very close to the last word in the sequence? Did they account for this?

    • @gunnerstone120 · 4 months ago

      I've done some more digging on this for those interested. So yes, the theta values do indeed loop back around. However, this is why they have multiple values of theta in equation 15, up to d/2 unique values. Theta_i is defined as 10000^(-2(i-1)/d), so this set of angles varies logarithmically across the dimensions of the embedding vector. Because the exponent scales with d and 10000 is a large base, it would have to be an extremely long sequence before things start to 'loop back around' as a whole. What exactly that sequence length is? Not sure.
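
To make the wrap-around point above concrete, a small sketch (my own) computing the theta values of equation 15 for an illustrative head dimension, and how many positions each dimension takes to complete a full 2*pi turn: the fastest dimension wraps after a handful of tokens, the slowest only after tens of thousands, and it is the combination of all d/2 angles that stays distinct.

```python
import numpy as np

d = 128                                   # head dimension (illustrative)
i = np.arange(1, d // 2 + 1)              # i = 1 .. d/2, as in equation 15
theta = 10000.0 ** (-2 * (i - 1) / d)     # theta_i, decaying across dimensions

# Positions needed before the angle m * theta_i completes a full 2*pi turn:
wrap_after = 2 * np.pi / theta
print(wrap_after[0])    # ~6.3     fastest dimension wraps almost immediately
print(wrap_after[-1])   # ~54000   slowest dimension wraps only after ~54k tokens
```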

  • @Skinishh · 1 year ago +1

    Great explanation, thank you! In practice, how many times can an embedding be rotated before doing a 360º rotation? I guess that should be the maximum sequence length the model can deal with before the loss increases?

    • @gabrielmongaras · 1 year ago

      In this case, it depends on theta. They formalize theta in terms of the embedding dimension so that the largest per-token rotation step is well under 360º. So a single rotation step never amounts to a 360º rotation under their theta parameterization.

    • @Skinishh · 1 year ago

      But how many rotations are there for a specific embedding?

    • @gabrielmongaras · 1 year ago +1

      @@Skinishh For a d-dimensional embedding, the rotation step sizes are theta_i = 10000^(-2(i-1)/d), where i is in [1, d/2]. The largest step is exactly 1 radian (i = 1), and the smallest tends toward 10000^-1 = 1e-4 as d grows; this is essentially the per-position step size. So for a specific embedding, each step is somewhere between roughly 1e-4 and 1 radian, and no single step ever completes a full rotation. (See the sketch after this thread for concrete values.)

    • @Skinishh · 1 year ago +1

      I am trying to understand how many positions can be encoded with this kind of rotary embedding 🤔

    • @gabrielmongaras · 1 year ago

      @@Skinishh Since the rotation is continuous, theoretically an infinite number.
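
As a concrete check on the step sizes discussed in this thread, a small sketch (my own) of the largest and smallest per-position rotation steps for a few head dimensions, using theta_i = 10000^(-2(i-1)/d) with i in [1, d/2]:

```python
import numpy as np

def theta_range(d, base=10000.0):
    """Largest and smallest per-position rotation step (radians) for head dim d."""
    i = np.arange(1, d // 2 + 1)             # i = 1 .. d/2
    theta = base ** (-2 * (i - 1) / d)
    return theta.max(), theta.min()

for d in (64, 128, 1024):
    largest, smallest = theta_range(d)
    print(f"d={d}: largest step = {largest:.1f} rad, smallest step = {smallest:.2e} rad")
# The largest step is always exactly 1 radian (i = 1); the smallest approaches
# base**-1 = 1e-4 as d grows, so no single step is ever a full rotation.
```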

  • @junhanouyang6593 · 1 year ago

    It seems like what you say around the 15-minute mark about r = m - n is just how ALiBi works, right? It's just that ALiBi adds another m variable.
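
For comparison with the r = m - n discussion above, a minimal, simplified sketch (my own) of ALiBi's approach: rather than rotating q and k, it adds a head-specific slope times the relative offset directly to the attention logits.

```python
import numpy as np

def alibi_bias(seq_len, n_heads):
    """ALiBi-style additive bias: a head-specific slope times the relative
    distance between query position m and key position n."""
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)  # geometric slopes
    m = np.arange(seq_len)[:, None]          # query positions
    n = np.arange(seq_len)[None, :]          # key positions
    r = m - n                                # the same relative offset r = m - n
    return -slopes[:, None, None] * r        # shape (heads, seq, seq)

bias = alibi_bias(seq_len=5, n_heads=2)
print(bias[0])   # added to the attention logits before the causal mask and softmax
```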

  • @rajanghimire4022 · 1 year ago

    Amazing man... Thanks a lot.

  • @thomasjohnson4842 · 1 year ago

    What's the software you're using to view & mark up the pdf, and to sketch on the right hand side of the video?

    • @gabrielmongaras · 1 year ago +1

      I'm using the default Samsung Notes app and split my screen in two, so one side has the PDF of the paper and the other is for sketching. Nothing too special, as this app has everything I need and I haven't found a better free alternative.

  • @Skinishh · 1 year ago

    How does this compare with relative position embeddings?

    • @gabrielmongaras · 1 year ago +1

      These are relative positional embeddings as they are relative to the token in focus.

    • @Skinishh · 1 year ago +1

      Got it. Are they the SOTA relative positional encoding?

    • @gabrielmongaras · 1 year ago +1

      @@Skinishh They're one of the most widely used positional encodings, but they run into the same pitfall as absolute positional encodings: the extrapolation issue. ALiBi or a different type of positional encoding scheme would probably be better to use.