LongNet: Scaling Transformers to 1,000,000,000 tokens: Python Code + Explanation

  • Published 26 Sep 2024

COMMENTS • 19

  • @umarjamilai  1 year ago +2

    Full code available as always: github.com/hkproj/python-longnet

  • @davidlevinthal7085  5 months ago +1

    Umar, your lectures are really useful and very clear. Thank you!

  • @chenhuiyu1997  1 year ago +2

    Very clear explanation!

  • @benji6296  3 months ago

    My understanding is that this new attention computes a subset of the attention pairs systematically in order to scale the context window; in exchange, you lose some precision. This is relatively new, but it would be great content to add to the channel: LongRoPE, a newer method that modifies the positional encoding rather than the attention mechanism.
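
To make the "subset of attention pairs" idea above concrete, here is a minimal sketch of a LongNet-style dilated attention pattern. It is an illustrative approximation, not the video's or the paper's actual code; the function name, the segment length `w`, and the dilation rate `r` are assumptions chosen for the example.

```python
# Minimal sketch (not the official implementation): the sequence is split
# into segments of length w, and inside each segment only every r-th token
# attends to every r-th token. Shapes and names here are illustrative.
import numpy as np

def dilated_attention_mask(seq_len: int, w: int, r: int) -> np.ndarray:
    """Boolean mask: mask[i, j] is True if query i may attend to key j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for start in range(0, seq_len, w):                         # one segment at a time
        idx = np.arange(start, min(start + w, seq_len))[::r]   # dilated token positions
        mask[np.ix_(idx, idx)] = True                          # all pairs inside the segment
    return np.tril(mask)                                       # keep it causal

# Full attention evaluates seq_len**2 pairs; this pattern only roughly
# seq_len * w / r**2 of them, which is what makes long contexts affordable.
print(dilated_attention_mask(8, w=4, r=2).astype(int))
```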

  • @channel8048  1 year ago +1

    Keep up the good work! 👍 Grazie!

  • @user-xl3lp  1 year ago

    Great content, Umar! It would be great if you made a video on how to implement LongNet from scratch, or how to upgrade the transformer we built in the other video.

  • @zaidnadeem4918  6 months ago +1

    Excellent content!!

  • @Akbarable  1 year ago +2

    Hi Umar. Very good video! I love how you visualised the algorithm, great job!
    Can you make a video about the implementation and the distributed training algorithm? It sounds very easy to do in theory, but implementing it is giving me challenges. Would love to have some help, thank you!

  • @ummehabiba7249  1 year ago +1

    Big fan of you 👍👍🤞🤞🤞🤞

  • @softwaredeveloper-c5u  9 months ago

    I am new to the field of NLP; can you list your videos in chronological order?

  • @RomanLi-y9c  1 year ago

    Thank you for the video, educational and easy to understand!
    Does it make sense to pick the "most important" tokens from the smaller matrices (with no skip) and use them to compute the larger matrices (with skip)? And for multi-head, use a different "importance" for different heads? I guess it will be more expensive to compute because it introduces a sort operation.

    • @umarjamilai  1 year ago

      The hard part is understanding which tokens are the "most important" :-)

    • @RomanLi-y9c  1 year ago

      @umarjamilai By "important" I meant the attention weights. For example: compute the first 4x4 (no skip) block, find the highest weight and remember its position (i1, j1), then find the second-highest value excluding (i1, j1) and remember (i2, j2). Compute the second 4x4 block and repeat the picking process to get (i3, j3) and (i4, j4). The larger matrix would then use (i1, j3), (i1, j4), (i2, j3), (i2, j4). For another head, pick the lowest weights, or those closest to the median. The idea is to somehow chop the weights into quantized ranges for different heads.
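
The block-wise top-k idea above is easy to prototype. Below is a purely hypothetical sketch of it (not something from LongNet or the video's code): pick the positions of the largest weights in two small dense (no-skip) blocks and combine them into the pairs evaluated by a larger, skipped block. All names and the random toy data are made up for illustration.

```python
# Hypothetical sketch of the commenter's idea above; not part of LongNet.
import numpy as np

def topk_positions(weights: np.ndarray, k: int):
    """Return the (row, col) positions of the k largest attention weights."""
    flat = np.argsort(weights, axis=None)[::-1][:k]
    return [tuple(p) for p in np.array(np.unravel_index(flat, weights.shape)).T]

rng = np.random.default_rng(0)
block_a = rng.random((4, 4))   # first dense 4x4 block (no skip)
block_b = rng.random((4, 4))   # second dense 4x4 block (no skip)

rows = [i for i, _ in topk_positions(block_a, k=2)]   # i1, i2
cols = [j for _, j in topk_positions(block_b, k=2)]   # j3, j4

# The larger (skipped) block would then only be evaluated at these pairs,
# i.e. (i1, j3), (i1, j4), (i2, j3), (i2, j4):
pairs = [(i, j) for i in rows for j in cols]
print(pairs)

# As noted in the comment, the extra sort/top-k step is the main added cost.
```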

  • @dawidmalan8727  6 months ago

    25:24 bookmark

  • @abhinavgogadey9859  1 year ago

    Bro, can you please make a video on Deformable DETR for object detection, like you made for transformers? It will really help me a lot.

  • @tantzer6113  1 year ago

    How much does the quality of the resulting model suffer, if at all?

    • @umarjamilai  1 year ago

      Compared to full vanilla attention, dilated attention is for sure less "precise" on very distant tokens, but you also need to consider that a vanilla transformer will never be able to attend over 1 billion tokens with current hardware at a reasonable cost. Dilated attention is a good compromise between full attention and reasonable cost (a rough cost comparison is sketched after this thread).

    • @zandrrlife  1 year ago

      @umarjamilai Interestingly enough, I believe even that bottleneck can be mitigated. I see a lot of synergy between this attention method and landmark tokens; it might help maintain a higher degree of "precision". High-level thoughts?
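
To put rough numbers behind the cost trade-off Umar describes above, here is a back-of-the-envelope comparison of how many query/key pairs full attention and a LongNet-style dilated pattern evaluate. The segment lengths and dilation rates below are made-up example values, not the paper's exact schedule.

```python
# Rough pair-count comparison; the schedule of (segment length, dilation rate)
# values is only an illustrative assumption.
def full_attention_pairs(n: int) -> int:
    return n * n

def dilated_attention_pairs(n: int, schedule) -> int:
    # One dilated pattern per (w, r): n // w segments, (w // r) ** 2 pairs each.
    return sum((n // w) * (w // r) ** 2 for w, r in schedule)

n = 1_000_000
schedule = [(2048, 1), (8192, 4), (32768, 16)]
print(f"full:    {full_attention_pairs(n):,}")
print(f"dilated: {dilated_attention_pairs(n, schedule):,}")
# The dilated count grows roughly linearly with n, which is why giving up some
# precision on very distant tokens buys such a large reduction in compute.
```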

  • @xinyaoyin2238  3 months ago

    Imagine you run a convolutional network on the lower triangle with different window sizes; you should get the same result.