Full code available as always: github.com/hkproj/python-longnet
Umar, your lectures are really very useful and very clear. Thank you!
Very clear explanation!
My understanding is that this new attention computes a subset of the attention pairs systematically, in order to scale the context window; in exchange, you lose precision. This is relatively new, but it would be great content to add to the channel: LongRoPE, a newer method that modifies the positional encoding rather than the attention mechanism.
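For intuition, here is a minimal sketch of that "subset of attention pairs" idea in PyTorch, assuming a single LongNet-style (segment length, dilation) pair: each segment attends only over every r-th of its own tokens, so far fewer pairs are scored than in full attention. The function name, the shapes and the missing causal mask are simplifications for illustration, not the paper's reference implementation:

import torch

def dilated_attention(q, k, v, segment_len=4, dilation=2):
    # q, k, v: (batch, seq_len, d_model); seq_len must be a multiple of segment_len.
    # Sketch only: real LongNet mixes several (segment_len, dilation) settings
    # so that every position is covered; here skipped positions stay zero.
    b, n, d = q.shape
    out = torch.zeros_like(q)
    for start in range(0, n, segment_len):
        # keep only every `dilation`-th token inside this segment
        idx = torch.arange(start, start + segment_len, dilation)
        qs, ks, vs = q[:, idx], k[:, idx], v[:, idx]
        scores = qs @ ks.transpose(-2, -1) / d ** 0.5
        out[:, idx] = scores.softmax(dim=-1) @ vs
    return out

# tiny usage example
q = k = v = torch.randn(1, 8, 16)
y = dilated_attention(q, k, v)   # positions skipped by this pattern remain zero in the sketch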
Keep up the good work! 👍 Thank you!
Great content Umar! It would be great if you provided us with a video on how to implement LongNet from scratch, or how to upgrade the transformer we built in the other video.
Excellent content!!
Hi Umar. Very good video! I love how you visualised the algorithm, great job!
Can you make a video about implementation and the distributed training algorithm? It sounds very easy to do in theory but implementing it is giving me challenges. Would love to have some help, thank you!
Big fan of yours 👍👍🤞🤞🤞🤞
I am new to the field of NLP; can you list your videos in chronological order?
Thank you for the video, educational and easy to understand!
Does it make sense to pick the 'most important' tokens from the smaller matrices (with no skip) and use them to compute the larger matrices (with skip)? And for multi-head, use a different 'importance' criterion for different heads? I guess it will be more expensive to compute because it introduces a sort operation.
The hard part is to understand what the "most important" tokens are :-)
@umarjamilai By "important" I meant the weights. For example: compute the first 4x4 (no skip), find the highest weight and remember its position (i1, j1), then find the second-highest value excluding (i1, j1) and remember (i2, j2). Compute the second 4x4 and repeat the picking process to get (i3, j3) and (i4, j4). The larger matrix would then use (i1, j3), (i1, j4), (i2, j3), (i2, j4). For another head, pick the lowest values, or those closest to the median. The idea is to somehow chop the weights into quantized ranges for different heads.
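A rough, hypothetical sketch of that selection step (function name and block sizes are made up for illustration, and the top-k over the scores is exactly where the extra sorting cost would come from):

import torch

def topk_positions(q_block, k_block, k=2):
    # Score a small no-skip block and return the (row, col) positions of
    # its k largest attention weights -- the "important" pairs described above.
    d = q_block.shape[-1]
    scores = (q_block @ k_block.transpose(-2, -1)) / d ** 0.5   # (w, w)
    top = torch.topk(scores.flatten(), k).indices
    rows = (top // scores.shape[-1]).tolist()
    cols = (top % scores.shape[-1]).tolist()
    return list(zip(rows, cols))

# e.g. two 4x4 (no-skip) blocks: keep the 2 strongest pairs from each,
# then reuse those row/column indices when building the larger, sparser block.
q1, k1 = torch.randn(4, 8), torch.randn(4, 8)
q2, k2 = torch.randn(4, 8), torch.randn(4, 8)
pairs_a = topk_positions(q1, k1)   # [(i1, j1), (i2, j2)]
pairs_b = topk_positions(q2, k2)   # [(i3, j3), (i4, j4)]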
25:24 bookmark
Bro, can you please make a video on Deformable DETR for object detection, like the one you made for transformers? It would really help me a lot.
How much does the quality of the resulting model suffer, if at all?
For sure, compared to full vanilla attention, dilated attention is less "precise" on very distant tokens, but you also need to consider that the vanilla transformer will never be able to attend over 1 billion tokens with current hardware at a reasonable cost. Dilated attention is a good compromise between full attention and reasonable cost.
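A quick back-of-the-envelope comparison of the number of scored pairs, assuming one hypothetical (segment length, dilation) setting; the real model combines several such settings:

# Full attention scores N^2 pairs; a single dilated pattern with segment
# length w and dilation r scores (w/r)^2 pairs per segment over N/w segments.
N = 1_000_000_000     # target sequence length
w, r = 16_384, 16     # hypothetical segment length and dilation rate

full_pairs    = N * N                      # ~1e18 scored pairs
dilated_pairs = (N // w) * (w // r) ** 2   # ~6.4e10 scored pairs

print(f"{full_pairs:.2e} vs {dilated_pairs:.2e}")   # orders of magnitude apart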
@umarjamilai Interestingly enough, I believe even that bottleneck could be addressed. I see a lot of synergy between this attention method and landmark tokens; it might help maintain a higher degree of "precision". High-level thoughts?
Imagine you run a convolutional network on the lower triangle with different window sizes; you should get the same result.