Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

  • Published Sep 28, 2024

COMMENTS • 13

  • @husienvora9954
    @husienvora9954 5 months ago +1

    great vid Gabriel 👍

  • @TheVirgile27
    @TheVirgile27 5 months ago

    One thing I don't understand well: after training, how do we manage the final output? For a large input, do we "force" the model to respond directly, i.e. produce an output for each input, OR do we first feed in all of the input and then read the output at a certain point? Basically, one could wait until the reading part is complete and then force an answer. Maybe I'm not being clear (English isn't my first language), but it seems there are several ways to retrieve an output from this type of transformer. Please be kind, and thanks for the video :)
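
    (For context: the usual pattern with decoder-only models, and what the paper's streaming setup suggests, is the second option: feed the whole input segment by segment first, and only then sample output tokens. Below is a rough sketch; `model`, `init_memory`, `prefill`, and `decode_step` are hypothetical placeholders, not an API from the paper or the video.)

    ```python
    def answer(model, input_segments, max_new_tokens):
        # Hypothetical sketch: `model` and its methods are placeholders.
        memory = model.init_memory()

        # "Reading" phase: consume the entire input first, updating the
        # fixed-size memory segment by segment; nothing is generated yet.
        for segment in input_segments:
            memory = model.prefill(segment, memory)

        # "Answering" phase: only now generate tokens autoregressively,
        # each conditioned on the memory plus the tokens produced so far.
        tokens = []
        for _ in range(max_new_tokens):
            next_token, memory = model.decode_step(tokens, memory)
            tokens.append(next_token)
        return tokens
    ```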

  • @M-ed5ct
    @M-ed5ct 5 months ago

    Thanks for the video!
    Just one question: the H state has a fixed dimension, but it accumulates additional information as we proceed through the token sequence. After the first segment, H1 only "summarizes" one segment, but after segment N, Hn summarizes the current segment plus Hn-1, which is itself the summary of all the past context. Do you think it would make sense to increase H's dimension as the context proceeds, i.e. have the dimension of Hn grow with n? The idea is to keep the information per bit in H constant, so that we can truly grow to unlimited context without the state becoming a bottleneck.

    • @gabrielmongaras
      @gabrielmongaras 5 months ago

      I think it makes sense to increase the hidden state, though doing so would introduce a memory dependence on the sequence length during inference, which is currently a big problem. One can think of a softmax attention transformer as having an infinite hidden state (the keys/values are just stacked), whereas an RNN has a constant-size hidden state. Perhaps something in the middle would perform better than an RNN but not require as much memory as a Transformer?

    • @M-ed5ct
      @M-ed5ct 5 months ago

      @gabrielmongaras Yeah, the trick is to find a state update function x_{n+1} = S(x_n, segment_n) such that dim(x_{n+1}) > dim(x_n), i.e. projecting the vector x_n into a bigger space while preserving its semantics and folding in segment_n's new data. Indeed, because the state dimension in the paper is tailored to a fairly long context, with a growing state you could even start from a _smaller_ state x1 and grow it with the number of segments... so for a not-too-large context you might even get a memory reduction!
      But I don't see memory usage as a problem; you can always clamp it to a maximum if really needed, a kind of max_memory parameter... it can't be worse than the original.
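
      (A minimal sketch of the fixed-size memory being discussed, assuming the linear-attention-style update from the paper with an ELU+1 feature map; the variable names are my own. A growing-state variant like the one proposed above would let M change shape between segments instead of staying (d, d).)

      ```python
      import numpy as np

      def elu_plus_one(x):
          # Assumed feature map (ELU + 1) for the linear-attention-style memory.
          return np.where(x > 0, x + 1.0, np.exp(x))

      d = 64                    # key/value dimension, fixed up front
      M = np.zeros((d, d))      # compressive memory: size independent of context length
      z = np.zeros(d)           # normalization term

      def read_memory(M, z, Q_seg):
          # Retrieve from the memory accumulated over all previous segments.
          sQ = elu_plus_one(Q_seg)                       # (s, d)
          return (sQ @ M) / ((sQ @ z)[:, None] + 1e-6)   # (s, d); eps avoids 0/0 on segment 1

      def update_memory(M, z, K_seg, V_seg):
          # Fold one segment's keys/values into the fixed-size state.
          sK = elu_plus_one(K_seg)                       # (s, d)
          return M + sK.T @ V_seg, z + sK.sum(axis=0)    # M stays (d, d) forever

      # Streaming over segments: memory cost stays O(d^2), not O(sequence length).
      for _ in range(10):
          Q_seg, K_seg, V_seg = (np.random.randn(128, d) for _ in range(3))
          out = read_memory(M, z, Q_seg)
          M, z = update_memory(M, z, K_seg, V_seg)
      ```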

  • @ericl227
    @ericl227 5 months ago +2

    At 17:50, shouldn't H_i be a summation of k_j and v_j over j instead of i, where j goes from 1 to i?

    • @gabrielmongaras
      @gabrielmongaras 5 months ago +1

      Yep, nice catch! I put a note in the video about the reindexing.
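
      (For reference, a sketch of the corrected indexing, assuming k_j and v_j are the per-token key/value column vectors and σ is the feature map used in the video:)

      ```latex
      H_i = \sum_{j=1}^{i} \sigma(k_j)\, v_j^{\top}
      \qquad\text{equivalently}\qquad
      H_i = H_{i-1} + \sigma(k_i)\, v_i^{\top}
      ```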

  • @danieldeychakiwsky1928
    @danieldeychakiwsky1928 5 months ago

    Great vids! At around 3 minutes, when you get into the attention matrices, I think the dimensions aren't right: if Q is d by s and K is d by s and we take Q K-transpose, then a d by s matmul with an s by d gives a d by d matrix, but the video shows that matrix as s by s.

    • @gabrielmongaras
      @gabrielmongaras 4 months ago

      Thanks! In that part I transposed the diagram because I thought it looked a little better that way. Sometimes the diagrams I draw are transposed, but I try to label the dimensions to avoid ambiguity. So the s by s matrix is the resulting matrix, not a d by d one. An s by s matrix captures relations between sequence positions, while a d by d matrix captures relations between feature dimensions across the entire sequence.
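
      (A quick shape check of what's described above, with s = sequence length and d = head dimension; purely illustrative:)

      ```python
      import numpy as np

      s, d = 8, 64                      # sequence length, head dimension

      # Row-per-token convention: Q and K are (s, d)
      Q = np.random.randn(s, d)
      K = np.random.randn(s, d)
      attn = Q @ K.T                    # (s, d) @ (d, s) -> (s, s): token-to-token relations
      print(attn.shape)                 # (8, 8)

      # Column-per-token (transposed diagram) convention: Q and K are (d, s)
      Qt, Kt = Q.T, K.T
      print(np.allclose(Qt.T @ Kt, attn))   # True: same s x s matrix, only drawn transposed

      # A genuine d x d product would instead relate feature dimensions across the sequence
      dim_relations = Qt @ Kt.T         # (d, s) @ (s, d) -> (d, d)
      print(dim_relations.shape)        # (64, 64)
      ```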

  • @YEETSWORLDWIDE
    @YEETSWORLDWIDE 5 months ago +2

    so basically what you're telling me is the world is going to end

  • @Eniac2045
    @Eniac2045 5 months ago +1

    Thanks, another great vid!

  • @EobardUchihaThawne
    @EobardUchihaThawne 4 months ago

    I have to get used to this scientific notation; I struggle to write code from these articles myself.