awesome content, seriously
great video! this is much easier to understand than just reading the paper. what app are you using for annotating the paper and making notes?
Thanks! Glad you found my video helpful! I'm using the default Samsung Notes app to make all the annotations and notes.
Thanks for the video tutorial, it's really helpful! At 24:08, when you mention softmax, do you mean a softmax is done to compute the routing scalars? If so, as per my understanding they don't compute the routing scalars with a softmax. The scalars are computed just by taking an inner product of the token with the routing weights vector.
Oh yes, I see what you're talking about. On page 6, right above equation (1), they mention the rth weight is computed as the inner product between the routing weight vector and the token, which is different from normal MoE. I suppose this fixes the gradient problem I was talking about. Thanks for the clarification!
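For anyone following along, here's a minimal sketch of the difference being discussed (variable names and toy numbers are mine, not the paper's): MoD computes each token's routing weight as a plain inner product with a learned routing vector, whereas a standard MoE router would softmax over per-expert logits.

```python
import math

def routing_scalar(token, w_r):
    # MoD-style routing weight: a plain inner product of the token
    # with the learned routing weight vector -- no softmax involved.
    return sum(t * w for t, w in zip(token, w_r))

def moe_router(logits):
    # What a standard MoE router would do instead: softmax over
    # per-expert logits so the weights form a distribution.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

token = [0.5, -1.0, 2.0]   # toy 3-dim token embedding
w_r = [1.0, 0.0, 0.5]      # toy routing weight vector
r = routing_scalar(token, w_r)  # 0.5*1.0 + (-1.0)*0.0 + 2.0*0.5 = 1.5
```

Since r is a raw, unnormalized scalar, it can directly scale the block's output as in equation (1), rather than being squashed into a distribution over experts.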
I thought people theorised that transformers still use the 'slack' tokens for other purposes, so the compute isn't wasted. I guess this shows those theories needed to be rigorously tested. Although, since they only sandwich the layers, maybe it is fully used. This method effectively gives some tokens up to double the mixing time.
Awesome, btw maybe try using excalibur
I'm not sure I understand. Even though the sigmoids are independent, why would it allow for causal sampling if it was trained to mimic a distribution that isn't causal? It carries information from the future, albeit indirectly, no?
For example, if we were training on a distribution of a biased lottery, we would still be predicting the future from just some of the tokens?
Ah, I think you mention exactly that afterwards 😅 thanks
One more question: can these be added to existing models and trained separately? From the description it sounds like it's possible.
I don't think they talked about doing that in the paper. My intuition says it may be hard and probably wouldn't work as well as we might hope. The activations for attention are whatever the model needs for the attention mechanism. However, in this paper, the activations are also used for ranking. My first thought is that these two activation distributions are quite different, making the model start from a poor state. I wonder if Google did try something like this, but found it didn't work that well and decided not to include it in the paper? Would be totally worth trying if you have the compute though! Maybe you could start off by initializing routing to all tokens and slowly decrease this during fine-tuning.
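That last idea could be sketched like this — a speculative linear schedule (my own names and numbers, not from the paper) that starts by routing every token through the block and anneals down to the target capacity over fine-tuning:

```python
def capacity_fraction(step, total_steps, start=1.0, end=0.125):
    # Fraction of tokens routed through the block at a given step:
    # begin at 100% (so the block behaves like the pretrained dense
    # model) and linearly decay toward the target MoD capacity.
    t = min(step / total_steps, 1.0)
    return start + t * (end - start)

# e.g. at the start, middle, and end of fine-tuning:
fracs = [capacity_fraction(s, 1000) for s in (0, 500, 1000)]
```

The hope would be that the ranking use of the activations gets learned gradually while the model still sees dense computation early on.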
Why use lot words when few words do trick?
Yeah, definitely a problem I have 😅
Been trying to get better at it, and realized I could've explained the extra loss part in much fewer words after uploading. In general, sometimes it's hard to know if the explanation given is satisfying or not when trying to balance conciseness and length.
@@gabrielmongaras they might be referring to mixture of depths using only a few of the words. Personally, I thought your explanations were great
What field of math do you need to understand this?
Just an understanding of machine learning models at a high level and how transformers work. The experts themselves are just a linear layer or feed forward layer in MoE and the single expert in this paper is a transformer layer.
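To make the "single expert" concrete, here's a toy sketch (my own code, not the paper's implementation) of the token-choice side: score every token, send only the top-k through the transformer block, and let the rest skip it via the residual path:

```python
def top_k_tokens(scores, k):
    # Indices of the k highest-scoring tokens; only these pass
    # through the full transformer block, the rest skip it.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

scores = [0.2, 1.5, -0.3, 0.9]   # toy routing scores for 4 tokens
chosen = top_k_tokens(scores, 2)  # [1, 3]
```

So where MoE picks an expert per token, MoD picks tokens per layer — everything else falls out of standard transformer machinery.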
I'd add that a basic understanding of statistics helps, along with some introductory calculus. But for the most part there's more trial and error in these discoveries than you might believe. The understanding sometimes comes after ;)
@@gabrielmongaras what do you think helped you best understand neural networks? I have a shallow understanding of how transformers work. I know how the encoder works, but I don't really understand the decoder fully. I also know pytorch only well enough to build simple convolutional neural networks. I also have a really strong understanding of calculus and linear algebra.
Calculus and Linear Algebra.
@@tgugdevil thank you, sorry, I forgot to mention that I already have a strong understanding of those concepts