You are awesome, I only understand when I watch you!
9:54 The positional encoding should propagate to higher layers easily because of the skip connections
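That checks out: each block computes something like h = h + sublayer(h), so whatever gets added to the embeddings at the input (including the positional encoding) rides along the identity path into every later layer. A toy illustration of that point (the layer here is just a random stand-in for attention/MLP, not a real transformer block):

```python
import numpy as np

d_model, seq_len, n_layers = 16, 4, 12
rng = np.random.default_rng(0)
tok = rng.normal(size=(seq_len, d_model))       # token embeddings
pos = rng.normal(size=(seq_len, d_model))       # stand-in for the sinusoidal positional encoding
h = tok + pos                                   # positions are only added once, at the input

total_sublayer = np.zeros_like(h)
for _ in range(n_layers):                       # each block: h = h + sublayer(h)
    sublayer_out = np.tanh(h @ rng.normal(size=(d_model, d_model)))   # stand-in for attention / MLP
    h = h + sublayer_out
    total_sublayer = total_sublayer + sublayer_out

# Thanks to the skip connections, the top-layer activation is exactly
# (token embedding + positional encoding) + (everything the sublayers added on top),
# so the positional signal injected at the input is never overwritten on its way up.
print(np.allclose(h, tok + pos + total_sublayer))   # True
```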
From what I understand, the sinusoidal positional encoding, which you called the “default” one, is not only an absolute encoding but also a relative encoding. It does both.
I think the reason sinusoidal positional encodings are referred to as "absolute" while rotary encodings are referred to as "relative" is the method of construction. While rotary encodings are constructed relative to each token, sinusoidal encodings are constructed relative to the start of the sequence. However, as you said, the model probably learns some sort of relative encoding between the tokens even when using the sinusoidal encodings.
Nice one. Subscribed. Please keep up the great work!
Pretty good! It was difficult for me to understand RoPE, but it's clear after watching your video.
I'm assuming you were hinting at the Kaio Ken SuperHOT thing when you said you were reading this to better understand some developments in improving context length. I can't say I fully grasp the rotational encoding part, but I can't get over how simple and seemingly cool that discovery was (especially with the mutual accreditation from the two groups that independently discovered it). I think trying to learn what I needed (rotational embeddings) to understand how that method works is how I first stumbled on your channel.
Still working on fully grasping positional embeddings in all their different flavors, but my eventual takeaway essentially boiled down to… how do you count to 20 without going past 10? Easy: 1, 1.5, 2, 2.5…
Followed by….WTF THAT WORKS!?
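That counting trick is essentially the position-interpolation idea that comes up in the replies below: instead of handing the model positions it has never seen, you squeeze the new positions in between the ones it was trained on. A rough sketch of the idea (the 2048/4096 lengths and d = 64 are illustrative assumptions, not numbers from any particular source):

```python
import numpy as np

def rope_angles(positions, d=64, base=10000.0):
    """Rotation angles m * theta_i that RoPE applies at each position (one column per dimension pair)."""
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)                # theta_i: largest is 1 rad/step, smallest ~1e-4 rad/step
    return np.outer(positions, theta)

train_len, target_len = 2048, 4096                # hypothetical: trained on 2048 tokens, want 4096
scale = train_len / target_len                    # 0.5 -> "count 1, 1.5, 2, 2.5, ..."

positions = np.arange(target_len)
interpolated = rope_angles(positions * scale)     # angles stay (roughly) within the trained range
extrapolated = rope_angles(positions)             # angles past position 2047 were never seen in training

print(interpolated.max(), extrapolated.max())     # ~2047.5 vs ~4095.0 radians on the fastest pair
```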
Kind of. Was actually hinting at the recent paper Meta released, "Extending Context Window of Large Language Models via Position Interpolation", since they directly worked with these RoPE relative positional embeddings.
Haven't heard of the Kaio Ken Superhot repo, but the research looks pretty interesting. The weird part is their repo also looks to use interpolation to extend context length, but I think it was released before the Meta paper. Kind of cool to see the same idea come up again. So I guess in a way, I was indirectly hinting at their repo?
The idea of positional encodings isn't too complicated overall. The math notation just makes it hard to look at sometimes. Basically, just put some type of series, like you said, in the model, whether that's indirectly (like this paper) or directly (like ALiBi or absolute encodings). Sometimes the series is w.r.t. the first token and sometimes w.r.t. the token being attended to. Either way, it's just a way for the model to know it's a sequence, not an unordered set.
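To make that last distinction concrete: an "absolute" series indexes every token from the start of the sequence, while a "relative" one only cares about the offset between the attending token and the token being attended to. A tiny illustration (just for intuition, not code from the paper):

```python
import numpy as np

seq_len = 5
positions = np.arange(seq_len)     # token positions 0, 1, 2, 3, 4

# "Absolute" style: each token is tagged with its distance from the start of the sequence.
absolute = positions               # [0 1 2 3 4]

# "Relative" style: what matters is the offset between query position m and key position n.
m = positions[:, None]
n = positions[None, :]
relative = m - n                   # relative[m, n] = m - n

print(absolute)
print(relative)                    # e.g. row 3 is [ 3  2  1  0 -1]: offsets from token 3 to every token
```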
@@gabrielmongaras The Reddit post from Kaioken was cited in Meta's paper. So...
@@Sebastian-jf5cp ah ok. That makes more sense now. Didn't realize that 😅
Great explanation! Subscribed 😊
Am I correct in thinking that the rotational embedding goes from 0 to 360 degrees? In that case, won't the first word of the sequence end up very close to the last word in the sequence? Did they account for this?
I've done some more digging on this for those interested. So yes, the theta values do indeed loop back around. However, this is why they have multiple values of theta in equation 15: up to d/2 unique values. Theta_i is defined as 10,000^(-2(i-1)/d), so this set of angles varies logarithmically across the dimensions of the embedding vector. Because the exponent depends on d and 10,000 is a large base, it would have to be an extremely long sequence before things start to 'loop back around' as a whole. What exactly that sequence length is? Not sure.
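You can get a feel for the numbers by computing each theta and its period directly (this assumes a head dimension of d = 64; other sizes shift the numbers but not the overall picture):

```python
import numpy as np

d, base = 64, 10000.0                    # assumed head dimension
i = np.arange(1, d // 2 + 1)             # i = 1, ..., d/2, as in equation 15
theta = base ** (-2.0 * (i - 1) / d)     # theta_i
period = 2 * np.pi / theta               # positions needed before pair i completes a full 360 degrees

print(theta[0], theta[-1])               # 1.0 rad/step down to ~1.3e-4 rad/step
print(period[0], period[-1])             # ~6.3 positions vs ~47,000 positions
```

So the fastest pair wraps around every few tokens, but the slowest pairs take tens of thousands of tokens, which is why the encoding as a whole doesn't repeat at any realistic sequence length.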
Great explanation, thank you! In practice, how many times can an embedding be rotated before it completes a 360° rotation? I guess that should be the maximum sequence length the model can deal with before the loss starts increasing?
In this case, it depends on theta. The way they formalize theta, it's defined in terms of the embedding dimension, and the largest per-step value (at i = 1) is only 1 radian, nowhere near 360°. So a single rotation step will never be a full 360° rotation under their theta parameterization.
But how many rotations are there for a specific embedding?
@@Skinishh In the case of a d-dimensional embedding, the rotation step size is theta_i = 10000^(-2(i-1)/d), where i runs from 1 to d/2. The smallest step size (at i = d/2) is 10000^(-(d-2)/d), which tends toward 10000^-1 = 1e-4, while the largest step size (at i = 1) is exactly 1 radian. This is essentially the step size per position: each pair of dimensions in a specific embedding rotates somewhere between about 1e-4 and 1 radian per token, so a single step never comes anywhere close to a full rotation.
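For concreteness, here is roughly what applying those per-pair rotations looks like, using the complex-number view of RoPE (a sketch, not any particular library's implementation, which may lay the pairs out differently):

```python
import numpy as np

def apply_rope(x, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... of x by position * theta_i."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)    # per-pair step sizes, from 1 rad down to ~1e-4 rad
    angles = position * theta                         # total rotation of each pair at this position
    pairs = x[0::2] + 1j * x[1::2]                    # treat each consecutive pair as a complex number
    rotated = pairs * np.exp(1j * angles)             # rotating = multiplying by e^{i * angle}
    out = np.empty_like(x)
    out[0::2], out[1::2] = rotated.real, rotated.imag
    return out

q = np.random.randn(64)
# Dot products only depend on the difference between the two positions:
a = apply_rope(q, 10) @ apply_rope(q, 13)
b = apply_rope(q, 110) @ apply_rope(q, 113)
print(np.isclose(a, b))   # True: same relative offset, same score, no matter where it sits in the sequence
```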
I am trying to understand how many positions can be encoded with this kind of rotary embedding 🤔
@@Skinishh Since the rotation is continuous, theoretically an infinite number.
It seems like what you say around the 15 minute mark about r = m - n is just how ALiBi works, right? It's just that ALiBi adds another m variable.
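That's basically it: ALiBi biases the attention logits by a head-specific slope times that same relative distance. A rough sketch (the slope value here is made up; the actual ALiBi slopes are a fixed geometric set, one per head):

```python
import numpy as np

seq_len, slope = 6, 0.5                    # slope plays the role of the extra "m" variable per head

i = np.arange(seq_len)[:, None]            # query positions
j = np.arange(seq_len)[None, :]            # key positions
bias = -slope * (i - j)                    # keys further back get a larger penalty
bias = np.where(j <= i, bias, -np.inf)     # causal mask: no attending to future tokens

# attention_logits = (q @ k.T) / np.sqrt(d) + bias
print(bias)
```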
Amazing man... Thanks a lot.
What's the software you're using to view & mark up the pdf, and to sketch on the right hand side of the video?
I'm using the default Samsung Notes app with the screen split in two: one side showing the PDF of the paper and the other side a blank note page for sketching. Nothing too special, as this app has everything I need in it and I haven't found a better free alternative.
How does this compare with relative position embeddings?
These are relative positional embeddings as they are relative to the token in focus.
Got it. Are they the SOTA relative positional encoding?
@@Skinishh They're one of the most widely used positional encodings, but they run into the exact same pitfall as absolute positional encodings: the extrapolation issue. ALiBi or a different type of positional encoding scheme would probably be better to use.