Mamba - a replacement for Transformers?

  • Published 26 Sep 2024

COMMENTS • 167

  • @qwerasdliop2810
    @qwerasdliop2810 9 months ago +220

    Insane, I loved the way you went through multiple important prior papers before talking about Mamba!

    • @looksintolasers
      @looksintolasers 8 months ago +4

      Depth-first search of the dependency tree of papers :)

  • @shiholololo1053
    @shiholololo1053 9 months ago +318

    Stanford's labs are thriving right now. To think all this work is made OPEN-SOURCE in a period of hostile and fierce competition among the big tech companies.

    • @nikoladjordjevic4477
      @nikoladjordjevic4477 9 months ago +25

      The original Transformer was open-sourced by Google.
      Also, GPT and GPT-2 were open source.
      This is no surprise to those in the community.

    • @8191-m8t
      @8191-m8t 9 months ago +2

      2 Timothy 3:16
      New World Translation of the Holy Scriptures (Study Edition)
      16 All Scripture is inspired of God+ and beneficial for teaching,+ for reproving, for setting things straight,+ for disciplining in righteousness,+

    • @patrickangel4880
      @patrickangel4880 9 months ago +5

      Like a knife, a weapon available to everyone is not a weapon anymore; it's just a mere tool... #hail_to_the_open_source_and_public_research

    • @peterbennett2301
      @peterbennett2301 9 months ago +1

      Is not Mathematics the language of God?

    • @dezh6345
      @dezh6345 8 months ago

      @@nikoladjordjevic4477 Those companies all turned closed source once money got involved.

  • @rabbit-hole-research
    @rabbit-hole-research 9 months ago +24

    Thank you for such a good survey of the prior work! Your effort is noted and appreciated!

  • @Fritz0id
    @Fritz0id 9 months ago +20

    Thanks for this, I feel caught up again! I've seen several papers popping up with alternatives to the transformer architecture, but I lacked a framework to grok them. The way you put this paper in a broader context, both in terms of the new Long Range Arena benchmark and the emphasis on "no free lunch" with regard to LTI vs SSM, was really helpful.

    • @triplanetary
      @triplanetary 9 months ago

      Can you send some links to those papers that list the alternatives to the transformer architecture?

  • @BradNeuberg
    @BradNeuberg 9 months ago +15

    Always appreciate your excellent video explanations of cutting edge papers, thanks!

  • @Rojfos
    @Rojfos 9 months ago +10

    That's really high-quality content. I also really like the way you highlight the text as you read over it; this makes it easier to follow along!

  • @MeanGeneHacks
    @MeanGeneHacks 9 months ago +139

    Hope the open source community builds on this

    • @dinoscheidt
      @dinoscheidt 9 months ago +40

      Well, get on it. The open source community is also 🫵

    • @ItsRyanStudios
      @ItsRyanStudios 9 months ago +22

      WE are the open source community ☺️

    • @rrestoring_faith
      @rrestoring_faith 9 months ago +18

      The authors already keep their code open source so the work is replicable. It's common practice in ML research.

    • @borregoayudando1481
      @borregoayudando1481 9 months ago +4

      All you need is Mambas?

    • @rjarpa
      @rjarpa 9 months ago

      except for GPT-3 and 4 XD @@rrestoring_faith

  • @adamshaw46
    @adamshaw46 9 months ago +134

    I really, really like the build-up of ideas through papers. It's a great way to introduce the idea while giving references that we can look up and trace ourselves, and for someone coming onto the scene with no context from the last few years of research it provides a neat overview.

    • @mkamp
      @mkamp 8 months ago +1

      Absolutely fantastic. Personally, I would be happy to watch a much longer video: same structure, just slower and broken down a bit more.
      This is not a complaint. The video is awesome as it is. Just feedback.

  • @SethuIyer95
    @SethuIyer95 9 months ago +44

    The crux of this network's performance lies in the fact that they use the coefficients of Legendre polynomials as a basis, which allows the information to be highly compressed with minimal information loss. Thinking about sequence memory this way moves away from iterative or recursive processing toward a more holistic, algebraic form of memory management. (See the sketch after this thread.)

    • @xyh6552
      @xyh6552 9 months ago +12

      In line with your viewpoint, this work is actually similar to using the FFT to do n-bit multiplication.

    • @christophkogler6220
      @christophkogler6220 9 months ago +2

      @@xyh6552 I think it basically is a high-dimensional FFT that's tracking location in the model's similarly high-dimensional memory/association space. It should provide near-perfect representation, recall, and higher efficiency for recurrent networks.

    • @derghiarrinde
      @derghiarrinde 9 months ago +1

      U lost me at "Legendre"

    • @SethuIyer95
      @SethuIyer95 9 months ago

      @@xyh6552 Yep, the FFT uses the Fourier basis; this is using the Legendre basis.

    • @xyh6552
      @xyh6552 9 months ago

      @christophkogler6220 Similar to your viewpoint, from the perspective of solving the Kakeya conjecture in finite fields, I believe the main idea is to utilize the rigidity of polynomials to achieve efficient compression. I speculate that the effect of utilizing the relationship between polynomials and roots in polynomial splitting fields is essentially replacing one "n" in the complexity with "log n".
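
A minimal sketch of the idea in the thread above: approximating a long 1-D signal with a fixed number of Legendre coefficients and reconstructing it. This is a toy offline least-squares fit in numpy, not the HiPPO recurrence from the paper (which maintains such coefficients online as the sequence streams in); it only illustrates how a small, fixed-size set of basis coefficients can compress a much longer sequence with little loss.

```python
import numpy as np
from numpy.polynomial import legendre as leg

# Toy signal: 1024 samples of a smooth function on [-1, 1].
t = np.linspace(-1.0, 1.0, 1024)
signal = np.sin(4 * np.pi * t) * np.exp(-t**2)

# "Compress" the sequence into 32 Legendre coefficients
# (a fixed-size memory, independent of the sequence length).
coeffs = leg.legfit(t, signal, deg=31)

# Reconstruct the full sequence from the compressed representation.
recon = leg.legval(t, coeffs)

print(f"{coeffs.size} coefficients for {signal.size} samples, "
      f"max error {np.max(np.abs(signal - recon)):.2e}")
```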

  • @alileevil
    @alileevil 9 months ago +4

    Honestly, how do you make sense of these papers? I've listened to the whole video and still haven't got a clue what it's about. There are quite a lot of brilliant people out there doing work like this.

  • @drayg0n806
    @drayg0n806 9 months ago +16

    I noticed that @havenhq has tuned a chat version of the pretrained Mamba-2.8B on Hugging Face. I played with it on Colab and it feels like a decent chatbot already. I'm very excited about the future of this architecture.

  • @johnny02199
    @johnny02199 9 months ago +4

    Thanks for the video; I would love to have a more detailed explanation based on the related works from before!

  • @Dart_ilder
    @Dart_ilder 7 months ago

    I liked this video so much that I reached for the like button 3 times while watching it.
    Awesome context on S4. This is extremely helpful for getting the context and stripping the hype to get to the meaning.
    That's definitely a sub and I am off to watch all the other videos

  • @kobilica999
    @kobilica999 9 months ago +9

    Man, those papers include hardcore numerical linear algebra :D

    • @HaganeNoGijutsushi
      @HaganeNoGijutsushi 6 months ago

      S4 seems to go the hardest with its convolutional trick, but then everyone else goes "fuck this complicated shit, it's too constraining, let's just parallelize more!" and honestly if I had been the one coming up with that clever math I'd feel so cheated 😂.

  • @Ben_D.
    @Ben_D. 9 months ago +58

    I need an ‘explain it like I’m five’ version of this. 😄
    But I hope it means something strong is coming down the pipe.

    • @christophkogler6220
      @christophkogler6220 9 months ago +87

      Actual ELI5: Many current AI models rely on 'MLP (Multi-Layer Perceptron)' and 'Transformer' blocks in their design. The "problematic" (but also usually the 'smart') one is the 'Transformer' block. These need more and more resources to process the context as the context size increases, making scaling up VERY difficult - for an 8x larger context you need about 64x the resources. This is because Transformers compare every part of the context to every other part of the context, every time.
      The Mamba architecture replaces both the MLP and Transformer blocks with the new 'Mamba' block. It needs the same amount of additional resources for an increase in context size no matter how large the context already is. For an 8x larger context, you would only need about 8x the resources. That means that - compared to a Transformer-based model - you could give it way more input at once and get way more output at once, with the same memory resources.
      If the method works at larger scales, Mamba could be another significant step forward for AI capabilities.
      Most current public-facing LLM models, like ChatGPT, use Transformers in their architecture. Transformers include 'self-attention', which basically weighs the importance of everything against everything else, all at once. This means they process any input in approximately O(N^2) time and memory (where N is the input length). As input / context length increases, their demands scale incredibly high. Anybody with a decent GPU technically CAN run a local LLM, it's just small, slow, and dumb. To run anything decent, you end up needing tens (maybe even hundreds) of gigabytes of extremely fast memory, which means workstation GPUs that cost thousands, or even entire GPU clusters. (A back-of-the-envelope sketch of this quadratic-vs-linear scaling follows after this thread.)
      The Mamba architecture is basically an entirely different TYPE of AI, more similar to a Recurrent Neural Network, and is both faster and more memory efficient. It processes and considers information sequentially, instead of all at once, but can ALSO ignore unimportant information. The architecture would be able to process an input in approximately O(n+L) time and memory, where n is essentially some constant and L is input length. If it continues to work so efficiently at increasingly large scales, it means literally orders of magnitude faster output and lessened memory requirements for a given context window, which can allow model context size to be increased massively while still using less computational resources than the previous methods.
      This part is basically educated guesswork, as this level of linear algebra / ML theory is a fair bit over my head: I think Legendre memory cells basically create a high dimensional Fast Fourier Transform from the Legendre polynomials (1:25), which fits neatly into my mental model of AI. In a certain mathematical sense, everything an AI knows can be represented as an incredibly complex interplay of gradients/associations between multidimensional vectors. A multidimensional FFT thus allows you to track your 'location' (the context) within this interplay as the model works, efficiently and with little (to near-zero) accuracy loss. They also allow you to accurately recompose the input from memory, thus allowing the model to efficiently recall or ignore only parts of the input that matter. The 'importance' of any specific part of the input to the output is probably based on some calculation of distance from the FFT to the 'current context'. If it isn't a 'near' association, it probably doesn't really matter to what you're doing, and so can be ignored.
      And here's a summary of the paper's results from ChatGPT, after I had a little Q&A with it:
      Summarize the benefits the Mamba architecture has over the Transformers architecture when at similar scales.
      The Mamba architecture offers several significant benefits over traditional Transformer architectures, particularly when both are at similar scales. Here's a summary of these advantages:
      1) Higher Throughput and Efficiency: Mamba achieves a much higher throughput in both inference and training compared to Transformers. Specifically, it has been noted to have 5× higher throughput during inference and up to 40× faster efficiency in training operations. This increased efficiency is especially beneficial when dealing with large-scale models and data​​.
      2) Linear Scaling with Sequence Length: Unlike Transformers, which have quadratic scaling with sequence length, Mamba scales linearly. This is a substantial advantage for processing long sequences, as it ensures more predictable and manageable growth in computational requirements and memory usage as sequence length increases​​.
      3) Improved Generation Throughput: In tasks like language modeling, Mamba not only outperforms Transformers of the same size but also matches or even exceeds the performance of Transformers that are twice its size. This indicates higher efficiency and effectiveness of Mamba in generating outputs​​.
      4) Effective Handling of Longer Sequences: Mamba is particularly adept at handling long sequences, outperforming Transformer models in tasks involving extended contexts. Its design allows it to focus on the most relevant parts of a sequence, enhancing its ability to generalize to much longer sequences than it was trained on​​.
      5) Simplified Architecture: By omitting attention and MLP blocks, Mamba’s architecture is more streamlined than that of traditional Transformers. This simplification contributes to its efficiency, especially in dealing with long sequences​​.
      6) Hardware Optimization: Mamba’s hardware-aware algorithm makes it more compatible with modern GPU architectures, leading to better performance on current hardware platforms. This optimization is crucial for achieving faster processing speeds and more efficient utilization of computational resources​​.
      In summary, Mamba offers significant improvements over Transformers in terms of efficiency, scalability, and effectiveness, particularly at similar scales. Its innovations in architecture and design enable it to handle longer sequences more efficiently, making it a strong candidate for various applications in fields requiring efficient sequence modeling.

    • @nartrab1
      @nartrab1 9 months ago +1

      Thank you! This was excellent.

    • @alexander191297
      @alexander191297 9 months ago +2

      I think this answer is wonderful… and I can tell it's ChatGPT generated 😅

    • @kevinaud6461
      @kevinaud6461 9 months ago +8

      @@christophkogler6220 I think this was more of an "explain like I have a bachelor's in CS," but that's exactly what I needed 🙂 Thanks for writing it out

    • @christophkogler6220
      @christophkogler6220 9 months ago +2

      @@alexander191297 Only the part after I mention ChatGPT :)
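
A back-of-the-envelope sketch of the quadratic-vs-linear scaling described in the ELI5 above. The unit costs below are arbitrary placeholders, not measurements from the paper; only the growth rates matter.

```python
# Rough comparison of how attention-style (quadratic) and Mamba-style
# (linear) sequence mixing grow with context length. The per-step costs
# are made-up units; only the trends are meaningful.

def attention_cost(seq_len: int) -> int:
    # Self-attention compares every token with every other token: O(L^2).
    return seq_len * seq_len

def recurrent_cost(seq_len: int, state_size: int = 16) -> int:
    # A recurrent/state-space scan touches each token once while carrying
    # a fixed-size state: O(L * state_size), i.e. linear in L.
    return seq_len * state_size

for length in (1_000, 8_000, 64_000):
    print(f"L={length:>6}: attention ~{attention_cost(length):>13,}"
          f"   recurrent ~{recurrent_cost(length):>10,}")
```

Going from 8k to 64k context multiplies the attention column by 64 but the recurrent column only by 8, which is the difference the comment above is pointing at.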

  • @fiery_transition
    @fiery_transition 9 months ago

    As a person new to the field, I greatly appreciate the way you presented things here!

  • @Kobe29261
    @Kobe29261 9 months ago +2

    This does it for my 'aspiration video' of the week.

  • @TobiasWeg
    @TobiasWeg 9 months ago +3

    Very interesting and well explained. Thanks a lot.

  • @freedom_aint_free
    @freedom_aint_free 9 months ago +4

    Amazing work ! Keep 'em coming !

  • @synapsomorphy
    @synapsomorphy 9 months ago +6

    Very encouraging that they included the situation in which S6 did poorly! If there are no other catches this looks incredible!

  • @광광이-i9t
    @광광이-i9t 8 months ago

    Thanks for your work!! It is really helpful to look through the related works 😮😮

  • @michaelparis6039
    @michaelparis6039 9 months ago

    I'm only at 7:13, right after 'spicy'. Subscribed. Great format and amazing delivery!

  • @JazevoAudiosurf
    @JazevoAudiosurf 9 months ago +4

    Tri Dao is one hell of a contributor

  • @XAheli
    @XAheli 9 months ago +1

    Keep these coming! Great video.

  • @Jeremy-e7u5y
    @Jeremy-e7u5y 9 months ago

    Thank you for bringing this to our eyes; it has been really insightful.

  • @JerryFederspiel
    @JerryFederspiel 9 months ago +3

    Just as complex numbers work well for SSMs in audio, I can't help but wonder whether split-complex numbers would help SSM performance in language tasks (considering the hyperbolic flavor of split-complex numbers and the benefits of hyperbolic embeddings when encoding hierarchical data).

    • @SamuelAlbanie1
      @SamuelAlbanie1 9 months ago +3

      It certainly seems plausible. In my experience, while hyperbolic embeddings make strong intuitive sense for hierarchical data, I've never seen them yield significant gains (the kinds of works I am familiar with are of this flavour: arxiv.org/abs/2304.09172). If your experience has been different, I'd be curious to hear.

  • @JorgetePanete
    @JorgetePanete 9 months ago +3

    Remember, the RWKV mentioned is the one from its paper, RWKV v4; there isn't yet a paper for v5 and v6, but v6 is similar to Mamba.
    Edit: it was updated today.

    • @JorgetePanete
      @JorgetePanete 9 months ago

      How similar? Well, I don't know; check the repo.

  • @BlayneOliver
    @BlayneOliver 9 months ago +3

    Would this help a regression-based transformer whose data is based on the stock market's price action?
    Or is it more for multimedia?

  • @KingPowa00
    @KingPowa00 9 months ago +3

    What sources do you suggest for understanding the algebra and math behind these works? I really struggled to understand most of the concepts, though I have a fairly good grasp of the math behind transformers.

    • @raul36
      @raul36 9 months ago +2

      First of all, I recommend 3Blue1Brown's linear algebra videos. Then, if you already have solid knowledge, I would recommend the book "Linear Algebra Done Right".

  • @MustafaAkben
    @MustafaAkben 9 months ago

    Great review! Looking forward to playing with it soon :)

  • @Robert_McGarry_Poems
    @Robert_McGarry_Poems 9 months ago +5

    This is my first time watching your channel.
    Impressive walkthrough.
    When I first heard of Q* my imagination started to build a very similar architecture... I don't follow too much of the technical, but I saw how the sandwiched gates, shown in the video, could be used almost in an analogue fashion. This is brilliant!
    Watching this made me grin like crazy...
    This might not be zero memory, but dang if it isn't a huge step in that direction. Using local memory is genius. And that token interpretation length, yes...
    So... physically, I guess, in my mind the next step is to localize the memory to the operation even more, but it looks like in that architecture it's as local as it's going to get...
    What about something like... "Sample-and-hold," from actual analogue circuits? That might be something to think about.

  • @vga7714
    @vga7714 9 months ago

    great summary and even better presenting voice.

  • @Shnugs
    @Shnugs 9 months ago +7

    When you stand back and squint your eyes at these papers they almost have a turbo encabulator quality to them.

    • @colejohnson2230
      @colejohnson2230 9 months ago

      Lol, yeah. I noticed that most fields tend towards that as you get towards the bleeding edge. Sometimes I have to stop what I'm working on and just appreciate how it looks like nonsense to an outside viewer

  • @NoNTr1v1aL
    @NoNTr1v1aL 9 months ago +2

    Amazing video! Subscribed.

  • @sup5356
    @sup5356 9 months ago

    beautifully developed narrative

  • @h3techsme
    @h3techsme 9 months ago +2

    This also raises the question of how the hardware-aware approach fares when the memory between the system and the GPU is fully shared...

  • @EigenA
    @EigenA 8 months ago

    Great video, thanks for sharing!

  • @matusstiller4219
    @matusstiller4219 9 months ago +9

    This video reminds me of the fact that I do not understand mathematics🙃

  • @iamr0b0tx
    @iamr0b0tx 9 months ago +2

    Thanks

  • @dfparker2002
    @dfparker2002 9 months ago

    How is Mamba similar to or different from multi-expert models?
    What is the minimum card spec (memory, CUDA, tensors, whatever) to run this model?

  • @luizpereira7165
    @luizpereira7165 6 months ago

    Can you use the Mamba architecture in conjunction with BitNet b1.58?

  • @6lack5ushi
    @6lack5ushi 9 months ago

    Is this not somewhat a proof of, or an addition to, Lee Cronin's Assembly Theory, if you can rebuild the input u from the components of m?

  • @grimsk
    @grimsk 9 months ago

    Feels like it's becoming more and more similar to physics... 🙂

  • @baab4229
    @baab4229 9 months ago +2

    Idk man, I kinda like the shapeshifting sapient robots fighting over their home planet Cybertron; why would you wanna replace them?

  • @TheApgreyd
    @TheApgreyd 9 months ago

    Thx YouTube for the recommendations

  • @TheGreatestJuJu
    @TheGreatestJuJu 9 months ago

    This makes so much sense. So obvious..

  • @Verrisin
    @Verrisin 9 months ago

    Turning an image into a flattened sequence... I wonder if they are using space-filling curves, or just going line by line? I wonder which "regularity" would be more useful? Or something else even?
    - To be fair, having no implicit notion of the "relative position of 2 pixels" (which I believe brains have) seems really expensive, if the model then has to fully recover that structure from just a sequence of tokens...

    • @SamuelAlbanie1
      @SamuelAlbanie1 9 months ago +1

      Yes - this is a good point. I think the reason flattening is performed without retaining 2d structure is precisely because it makes for a particularly challenging modelling task.

  • @honeymak
    @honeymak 9 months ago

    is it conversational? can it talk to itself or several instances?

  • @circulartext
    @circulartext 9 months ago

    super cool work

  • @aron2922
    @aron2922 9 months ago +2

    I think about 8 people followed what you were saying but I appreciate the effort

  • @patrickangel4880
    @patrickangel4880 9 months ago +1

    Like a knife, a weapon available to everyone is not a weapon anymore; it's just a mere tool... #hail_to_the_open_source_and_public_research

  • @qwertyuiop-ux6jk
    @qwertyuiop-ux6jk 9 months ago

    thanks for the video

  • @shyama5612
    @shyama5612 9 months ago +1

    Is Gemini based on this? The logo spiral seems to look like the Legendre polynomial graph.

  • @s11-informationatyourservi44
    @s11-informationatyourservi44 9 months ago

    can’t wait for a model named kobe to come out

  • @watcher8582
    @watcher8582 9 months ago

    cool presentation

  • @MemesnShet
    @MemesnShet 9 months ago +1

    Since the big companies are building their LLMs on transformers with all those resources and time, I doubt they'd change unless the results were dramatically better; so Mamba, while impressive, doesn't seem to be it.

  • @apidas
    @apidas 7 months ago

    god, these kids really find the cure for cancer

  • @ReflectionOcean
    @ReflectionOcean 8 months ago

    - Understand Mamba's significance by exploring its efficient state space model design and selective state mechanism (00:04).
    - Review the scale issues with Transformers and the emergence of efficient alternatives like Mamba for long sequence modeling (00:31).
    - Examine the Hippo Recurrent Memory and its application in sequence modeling for improved performance (01:29).
    - Recognize the role of kernel Fusion, parallel scan, and recomputation techniques in Mamba's efficient memory usage (09:55).
    - Consider the empirical results showcasing Mamba's high performance on various tasks, including long sequence modeling and DNA classification (13:02).
    - Analyze the trade-offs in model design, noting how selection mechanisms can impact performance on different data modalities (15:27).
    - Investigate the limitations of current empirical evaluations and the need to test Mamba on larger model sizes (15:43).
    - Dive into the released GitHub code to experiment with the Mamba model firsthand (15:59).
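
To make the selective-scan bullet (09:55) above concrete, here is a simplified, sequential numpy sketch of a selective state-space recurrence in the spirit of Mamba's S6 block. The shapes, projection matrices, and discretization below are my own simplifications for illustration; the actual model uses a hardware-aware fused parallel scan with kernel fusion and recomputation rather than a Python loop.

```python
import numpy as np

def selective_scan(x, A, B_proj, C_proj, dt_proj):
    """Sequential toy version of a selective SSM recurrence:
        h_t = exp(dt_t * A) * h_{t-1} + (dt_t * B_t) * x_t
        y_t = <h_t, C_t>
    where dt_t, B_t and C_t are computed from the current input x_t
    (this input dependence is the 'selection'). Shapes are simplified."""
    seq_len, d = x.shape          # sequence length, channels
    n = A.shape[1]                # state size per channel
    h = np.zeros((d, n))          # fixed-size recurrent state
    y = np.zeros((seq_len, d))
    for t in range(seq_len):
        dt = np.log1p(np.exp(x[t] @ dt_proj))   # softplus step size, shape (d,)
        B = x[t] @ B_proj                       # input-dependent B, shape (n,)
        C = x[t] @ C_proj                       # input-dependent C, shape (n,)
        A_bar = np.exp(dt[:, None] * A)         # discretized A, shape (d, n)
        h = A_bar * h + (dt[:, None] * B[None, :]) * x[t][:, None]
        y[t] = h @ C
    return y

# Smoke test with random weights (shapes are illustrative only).
rng = np.random.default_rng(0)
seq_len, d, n = 32, 4, 8
out = selective_scan(
    x=rng.standard_normal((seq_len, d)),
    A=-np.abs(rng.standard_normal((d, n))),     # negative A keeps the state stable
    B_proj=rng.standard_normal((d, n)),
    C_proj=rng.standard_normal((d, n)),
    dt_proj=rng.standard_normal((d, d)),
)
print(out.shape)  # (32, 4)
```

Each step touches a fixed-size state, which is where the linear-in-length compute and the small inference-time memory footprint come from.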

  • @RudyMartinInvest
    @RudyMartinInvest 9 months ago

    Thanks!

  • @JohnViguerie
    @JohnViguerie 6 months ago

    In the real world, LeCun and Hinton's ideas haven't yet been optimized and deployed at scale in commerce... 😂 But it's fun to try and keep up.

  • @peteroliver7975
    @peteroliver7975 8 months ago

    I want to see this applied to reasoning tokens

  • @porting400
    @porting400 9 months ago

    Great video

  • @zlatanmessi2095
    @zlatanmessi2095 8 months ago

    Added to my playlist on AI.

  • @KeepingUp_withAI
    @KeepingUp_withAI 2 months ago

    Here after Mistral released their code Mamba model 😄

  • @Sam-ri3hr
    @Sam-ri3hr 9 months ago

    Good video Sam

  • @DamaKubu
    @DamaKubu 9 months ago

    If you are interested in doing mechanistic interpretability on the Mamba model, hit me up with a DM.
    I'm thinking of writing something like Neel Nanda's TransformerLens for Mamba, or some lower-hanging fruit as a start.

  • @qwertasd7
    @qwertasd7 9 months ago

    any llm using it?

  • @Adovid
    @Adovid 9 months ago +1

    Transformers don't scale on long sequence operations because generative AI neural networks work better spreading attention over the parameters. We shall see if Mamba can do what it claims after a large model is doing inference.

  • @Kram1032
    @Kram1032 9 months ago

    finally apparently near-infinite contexts!

  • @Sai_r2d2
    @Sai_r2d2 8 months ago

    Lesssgo kobe ✨️

  • @ekstrajohn
    @ekstrajohn 9 months ago

    If transformers scale pretty well, I can't think of a reason why Mamba wouldn't scale. At least off the top of my head. Let's see what happens!

  • @luismeron4506
    @luismeron4506 9 months ago

    Kobe and Gigi 🏀8️⃣💛💜2️⃣4️⃣🖤

  • @stan-15
    @stan-15 9 months ago +1

    Cool beans

  • @Oler-yx7xj
    @Oler-yx7xj 9 months ago

    I'm so tired that I read this title literally and it took me some time to understand why it is probably not a video about using snakes in place of ChatGPT.

  • @garethjax
    @garethjax 9 months ago

    that's enough math for a lifetime. Amazing.

  • @reinerwilhelms-tricarico344
    @reinerwilhelms-tricarico344 8 months ago

    Interesting. But as usual it suffers from acronym overload.

  • @osbernperson
    @osbernperson 9 months ago

    Aha yes, this are the OK! 👍 becas I is smart here to, and No can be maybi. Good! Do it Now!

  • @dhrumil5977
    @dhrumil5977 9 months ago

    Whattttt 😵‍💫😵‍💫😵‍💫

  • @rkbiri5470
    @rkbiri5470 9 months ago

    Need an ELI5 section 😅😂

  • @iTXS
    @iTXS 9 months ago

    The machines now can get epilepsy lol

  • @flambr
    @flambr 9 months ago

    in the uk, mamba is the nickname for a hard drug

  • @bootblacking
    @bootblacking 7 months ago

    Why would a snake replace Transformers? It can't even turn into a truck.

  • @derghiarrinde
    @derghiarrinde 9 months ago

    Maybe you could better explain some sentences instead of just highlighting them and reading them aloud. I get that you want a shorter video, but sometimes you could speak to us like we're 10 years old. It would help with understanding. In the worst case, generate simpler explanations using a GPT ("explain this passage to me as if I was 15") and just read that. Thanks.

  • @supperenet9090
    @supperenet9090 9 months ago

    No, it's a replacement for conda.

  • @jasonandrewismail2029
    @jasonandrewismail2029 9 months ago +3

    superficial and misleading

  • @xyh6552
    @xyh6552 9 months ago +46

    The technique of solving long-term memory problems using polynomial projection is somewhat similar to using the FFT for multiplication. Essentially, both methods use highly efficient information representations with almost orthogonal channel capacity to represent the original information. (See the sketch after this thread.)

    • @krox477
      @krox477 8 months ago +1

      I don't understand anything.

    • @johnsherfey3675
      @johnsherfey3675 8 months ago

      Yeah, but only big math heads will actually ever fully understand it.

    • @pierrekinbrand
      @pierrekinbrand 8 months ago

      Ironically many of the ML concepts in the video went over my head but this Fourier analogy was more approachable for me.
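
A small sketch of the FFT-multiplication analogy in the comment above: the FFT maps coefficient vectors into an orthogonal frequency basis where convolution (i.e. polynomial or long-integer multiplication) becomes a cheap pointwise product, just as the Legendre basis gives a compact, near-orthogonal representation of a sequence. The connection to HiPPO/Mamba is the commenter's analogy, not a claim from the paper.

```python
import numpy as np

def poly_multiply_fft(a, b):
    """Multiply two polynomials given as coefficient vectors via the FFT:
    O(n log n) instead of the O(n^2) schoolbook convolution."""
    n = len(a) + len(b) - 1
    fa = np.fft.rfft(a, n)           # move into the (orthogonal) Fourier basis
    fb = np.fft.rfft(b, n)
    return np.fft.irfft(fa * fb, n)  # pointwise product there = convolution here

# (3 + 2x + x^2) * (1 + 4x) = 3 + 14x + 9x^2 + 4x^3
print(np.round(poly_multiply_fft([3, 2, 1], [1, 4])).astype(int))  # [ 3 14  9  4]
```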

  • @astridwilde
    @astridwilde 9 months ago

    great video

  • @couldntfindafreename
    @couldntfindafreename 9 months ago +14

    It is 100% certain that someone out there is already training a 7B+ Mamba model, most likely even bigger.

  • @imded4014
    @imded4014 9 months ago

    I can't be the only one who clicked on the video expecting the other Transformers...

  • @belzebubukas
    @belzebubukas 8 months ago

    what

  • @memenga260
    @memenga260 9 months ago

    I remember reading a paper on this in 2021; why wasn't it adopted earlier? Paper link in the reply.

    • @memenga260
      @memenga260 9 months ago

      drive.google.com/file/d/1-67LHZbCoDmzLWYp_4ZUXNzavcbGNMGa/view?usp=drivesdk

    • @SamuelAlbanie1
      @SamuelAlbanie1 9 months ago

      Good find. I guess mamba is a popular name...

  • @Verrisin
    @Verrisin 9 months ago +2

    "infinite" context length is effectively the main thing we needed. This is very exciting.