Full code available as always: github.com/hkproj/python-longnet
Umar, your lectures are really very useful and very clear. Thank you!
Very clear explanation!
My understanding is that this new attention computes a subset of the attention pairs systematically, in order to scale the context window; in exchange, you lose precision. This is relatively new, but it would be great content to add to the channel: LongRoPE, a newer method that modifies the positional encoding rather than the attention mechanism.
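For intuition, here is a minimal sketch of that "subset of attention pairs" idea in PyTorch, assuming a single LongNet-style (segment length, dilation) pair: each segment attends only over every r-th of its own tokens, so far fewer pairs are scored than in full attention. The function name, the shapes and the missing causal mask are simplifications for illustration, not the paper's reference implementation:

import torch

def dilated_attention(q, k, v, segment_len=4, dilation=2):
    # q, k, v: (batch, seq_len, d_model); seq_len must be a multiple of segment_len.
    # Sketch only: real LongNet mixes several (segment_len, dilation) settings
    # so that every position is covered; here skipped positions stay zero.
    b, n, d = q.shape
    out = torch.zeros_like(q)
    for start in range(0, n, segment_len):
        # keep only every `dilation`-th token inside this segment
        idx = torch.arange(start, start + segment_len, dilation)
        qs, ks, vs = q[:, idx], k[:, idx], v[:, idx]
        scores = qs @ ks.transpose(-2, -1) / d ** 0.5
        out[:, idx] = scores.softmax(dim=-1) @ vs
    return out

# tiny usage example
q = k = v = torch.randn(1, 8, 16)
y = dilated_attention(q, k, v)   # positions skipped by this pattern remain zero in the sketch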
Keep up the good work! 👍 Thank you!
Great content Umar! It would be great if you provided us with a video on how to implement LongNet from scratch, or how to upgrade the transformer we built in the other video.
Excellent content!!
Hi Umar. Very good video! I love how you visualised the algorithm, great job!
Can you make a video about implementation and the distributed training algorithm? It sounds very easy to do in theory but implementing it is giving me challenges. Would love to have some help, thank you!
Big fan of yours 👍👍🤞🤞🤞🤞
I am new to the field of NLP; can you list your videos in chronological order?
Thank you for the video, educational and easy to understand!
Does it make sense to pick the 'most important' tokens from the smaller matrices (with no skip) and use them to compute the larger matrices (with skip)? And for multi-head, use a different 'importance' criterion for different heads? I guess it will be more expensive to compute because it introduces a sort operation.
The hard part is to understand what the "most important" tokens are :-)
@umarjamilai By "important" I meant the weights. For example: compute the first 4x4 (no skip), find the highest weight and remember its position (i1, j1), then find the second-highest value excluding (i1, j1) and remember (i2, j2). Compute the second 4x4 and repeat the picking process to get (i3, j3) and (i4, j4). The larger matrix would then use (i1, j3), (i1, j4), (i2, j3), (i2, j4). For another head, pick the lowest values, or those closest to the median. The idea is to somehow chop the weights into quantized ranges for different heads.
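A rough, hypothetical sketch of that selection step (function name and block sizes are made up for illustration, and the top-k over the scores is exactly where the extra sorting cost would come from):

import torch

def topk_positions(q_block, k_block, k=2):
    # Score a small no-skip block and return the (row, col) positions of
    # its k largest attention weights -- the "important" pairs described above.
    d = q_block.shape[-1]
    scores = (q_block @ k_block.transpose(-2, -1)) / d ** 0.5   # (w, w)
    top = torch.topk(scores.flatten(), k).indices
    rows = (top // scores.shape[-1]).tolist()
    cols = (top % scores.shape[-1]).tolist()
    return list(zip(rows, cols))

# e.g. two 4x4 (no-skip) blocks: keep the 2 strongest pairs from each,
# then reuse those row/column indices when building the larger, sparser block.
q1, k1 = torch.randn(4, 8), torch.randn(4, 8)
q2, k2 = torch.randn(4, 8), torch.randn(4, 8)
pairs_a = topk_positions(q1, k1)   # [(i1, j1), (i2, j2)]
pairs_b = topk_positions(q2, k2)   # [(i3, j3), (i4, j4)]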
25:24 bookmark
Bro, can you please make a video on Deformable DETR for object detection, like the one you made for transformers? It would really help me a lot.
How much does the quality of the resulting model suffer, if at all?
For sure, compared to full vanilla attention, dilated attention is less "precise" on very distant tokens, but you also need to consider that the vanilla transformer will never be able to attend over 1 billion tokens with current hardware at a reasonable cost. Dilated attention is a good compromise between full attention and reasonable cost.
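A quick back-of-the-envelope comparison of the number of scored pairs, assuming one hypothetical (segment length, dilation) setting; the real model combines several such settings:

# Full attention scores N^2 pairs; a single dilated pattern with segment
# length w and dilation r scores (w/r)^2 pairs per segment over N/w segments.
N = 1_000_000_000     # target sequence length
w, r = 16_384, 16     # hypothetical segment length and dilation rate

full_pairs    = N * N                      # ~1e18 scored pairs
dilated_pairs = (N // w) * (w // r) ** 2   # ~6.4e10 scored pairs

print(f"{full_pairs:.2e} vs {dilated_pairs:.2e}")   # orders of magnitude apart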
@umarjamilai Interestingly enough, I believe even that bottleneck could be addressed. I see a lot of synergy between this attention method and landmark tokens; it might help maintain a higher degree of "precision". High-level thoughts?
Imagine you run a convolutional network on the lower triangle with different window sizes; you should get the same result.