Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained)

  • Published May 16, 2024
  • #mamba #s4 #ssm
    OUTLINE:
    0:00 - Introduction
    0:45 - Transformers vs RNNs vs S4
    6:10 - What are state space models?
    12:30 - Selective State Space Models
    17:55 - The Mamba architecture
    22:20 - The SSM layer and forward propagation
    31:15 - Utilizing GPU memory hierarchy
    34:05 - Efficient computation via prefix sums / parallel scans
    36:01 - Experimental results and comments
    38:00 - A brief look at the code
    Paper: arxiv.org/abs/2312.00752
    Abstract:
    Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
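
    For readers who want to see the mechanism rather than read about it, here is a minimal, unoptimized sketch of the selective recurrence the abstract describes: a diagonal state matrix A with a step size delta, input matrix B, and readout C all computed from the current token, applied sequentially. This is my own illustrative NumPy code with made-up names, not the authors' reference implementation.

        import numpy as np

        def selective_ssm_scan(x, A, W_delta, W_B, W_C):
            # Sequential reference for one selective-SSM block.
            # x: (L, D) inputs; A: (D, N) diagonal state matrix (entries < 0);
            # W_delta (D, D), W_B (D, N), W_C (D, N): projections that make the
            # step size delta and the matrices B, C functions of the current token.
            L, D = x.shape
            N = A.shape[1]
            h = np.zeros((D, N))                          # fixed-size state: one N-vector per channel
            y = np.zeros((L, D))
            for t in range(L):
                delta = np.log1p(np.exp(x[t] @ W_delta))  # softplus -> positive step size, shape (D,)
                B = x[t] @ W_B                            # (N,), input-dependent
                C = x[t] @ W_C                            # (N,), input-dependent
                A_bar = np.exp(delta[:, None] * A)        # discretized transition, (D, N)
                B_bar = delta[:, None] * B[None, :]       # simplified discretization, (D, N)
                h = A_bar * h + B_bar * x[t][:, None]     # selectively keep or overwrite state
                y[t] = (h * C[None, :]).sum(axis=-1)      # readout per channel
            return y

        rng = np.random.default_rng(0)
        L_len, D, N = 16, 4, 8
        x = rng.standard_normal((L_len, D))
        A = -np.exp(rng.standard_normal((D, N)))          # negative entries keep the recurrence stable
        y = selective_ssm_scan(x, A, 0.1 * rng.standard_normal((D, D)),
                               0.1 * rng.standard_normal((D, N)),
                               0.1 * rng.standard_normal((D, N)))
        print(y.shape)                                    # (16, 4)
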
    Authors: Albert Gu, Tri Dao
    Links:
    Homepage: ykilcher.com
    Merch: ykilcher.com/merch
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: ykilcher.com/discord
    LinkedIn: / ykilcher
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 125

  • @YannicKilcher
    @YannicKilcher 4 months ago +18

    OUTLINE:
    0:00 - Introduction
    0:45 - Transformers vs RNNs vs S4
    6:10 - What are state space models?
    12:30 - Selective State Space Models
    17:55 - The Mamba architecture
    22:20 - The SSM layer and forward propagation
    31:15 - Utilizing GPU memory hierarchy
    34:05 - Efficient computation via prefix sums / parallel scans
    36:01 - Experimental results and comments
    38:00 - A brief look at the code

  • @HoriaCristescu
    @HoriaCristescu 4 months ago +41

    I have to say - your return to making more frequent videos is making me very happy. I always used to watch your videos before reading the papers.

    • @ousooo
      @ousooo 3 months ago

      good to see you here, Horia. :)

  • @SebastianRaschka
    @SebastianRaschka 4 months ago +11

    This is a great video with a great rundown of Mamba. Was traveling when the Mamba paper came out and coincidentally stumbled upon this video today. This was a big time-saver to catch me up on the gist of it. I'll make sure to watch more of your videos in the future. Big thumbs up!

  • @barni_7762
    @barni_7762 4 months ago +8

    I'd love another video diving deeper!

  • @OperationDarkside
    @OperationDarkside 4 months ago +2

    These kinds of videos are great on an early christmas morning. You know you are still not really awake. You won't get it anyway. But it kickstarts your brain into work mode.

  • @Summersault666
    @Summersault666 4 months ago +10

    Finally, I knew I could count on you!

  • @vikaspoddar9456
    @vikaspoddar9456 4 months ago +8

    I was waiting for your review

  • @rault.7108
    @rault.7108 4 months ago

    Thanks for talking about this model! ❤

  • @kaikapioka9711
    @kaikapioka9711 4 months ago +1

    Impressive, thx as always 🎉 happy holidays

  • @stephaneduhamel7706
    @stephaneduhamel7706 4 months ago +36

    11:15 They did experiments up to 3 billion parameters iirc. There is a mamba 3B model available on huggingface at least

    • @hunterkudo9832
      @hunterkudo9832 4 months ago +1

      How does it compare to a 3B transformer model?

    • @greengoblin9567
      @greengoblin9567 4 months ago +9

      @@hunterkudo9832 It's better

    • @TheRyulord
      @TheRyulord 4 months ago +20

      @@hunterkudo9832 It's roughly on par with a 7B param transformer

    • @bmoharryman5809
      @bmoharryman5809 4 months ago +6

      This is hot.

    • @erickmarin6147
      @erickmarin6147 4 months ago

      (___0__)__(____0__)
      \_________________/

  • @yannickpezeu3419
    @yannickpezeu3419 4 months ago +1

    Brilliant explanation! Thanks!

  • @pladselsker8340
    @pladselsker8340 4 months ago +3

    Thank you for the paper review, it always helps!! Happy holidays to everyone 🍾

  • @patrickl5290
    @patrickl5290 4 months ago

    Was waiting for this, thanks!

  • @jonatan01i
    @jonatan01i 4 months ago +15

    Could see this as a "memory" architecture for an actual transformer, remembering distinctive contexts for a long time, while using the transformer for the much more complicated and sophisticated logical reasoning where directed focus and attention are needed.

    • @albertmashy8590
      @albertmashy8590 4 months ago +1

      Huge for assistants, and agents with longer term memory, as well as AI companions

    • @dibbidydoo4318
      @dibbidydoo4318 4 months ago

      transformers have problems of their own for composition and logical reasoning right?

  • @theskydebreuil
    @theskydebreuil 4 months ago

    Nice! Was waiting for this 😁

  • @PicaPauDiablo1
    @PicaPauDiablo1 4 months ago +2

    This is so great.

  • @amirronen1913
    @amirronen1913 4 months ago

    excellent talk explaining a non trivial paper. Thanks!

  • @josephle5686
    @josephle5686 4 months ago +12

    I work with state space models as a control/optimization engineer on a daily basis. But that diagram of the state space model has got to be the most confusing thing I've seen in my life lol

    • @brunomaruszczak4016
      @brunomaruszczak4016 4 months ago +1

      Agreed, both state space models and the Kalman filter have been so thoroughly described in every control theory handbook, and I have never seen anything like this diagram 😅

    • @ivanzhelyazkov8099
      @ivanzhelyazkov8099 4 months ago +4

      can you guys recommend a handbook with a clearer representation?

    • @brunomaruszczak4016
      @brunomaruszczak4016 2 months ago

      @@ivanzhelyazkov8099 Well I think you could find some good introduction to state space systems in Signals and Systems: Fundamentals by Gang Li and others. Chapter 7 is on state space models but I would also recommend reading up on general properties of LTI systems. There could be better books though, I’ve read it quite some time ago

    • @brunomaruszczak4016
      @brunomaruszczak4016 2 months ago

      @@ivanzhelyazkov8099 Signals and Systems: Fundamentals by Gang Li and others. Chapter 7 deals with SS models, but recommend looking at chapter 1 and LTI systems as well

  • @vzxvzvcxasd7109
    @vzxvzvcxasd7109 4 months ago +91

    Papers! TBH, I don't even watch any other vids on this channel.

    • @sagetmaster4
      @sagetmaster4 4 months ago +7

      There are other videos?

    • @Hexanitrobenzene
      @Hexanitrobenzene 4 months ago +3

      @@sagetmaster4
      Yannic had pretty good ML news videos but then he got busy with OpenAssistant and probably other things...
      Now I usually watch "AI Explained" for ML news.

    • @EdFormer
      @EdFormer 3 months ago

      @@Hexanitrobenzene That OpenAI fanboy hype merchant, who probably can't define a perceptron, is no replacement for Yannic's ML News. Better than nothing, but while Kilcher's coverage gave me the feeling of being at an academic event for those who actually study and engineer ML techniques, to learn about new ideas they might want to apply and to help them stay up to date, AI Explained seems to just want to blow the minds of tech enthusiasts who are only interested in how close to AGI we are, and always leaves me with a strong smell of bull regarding the significance of whatever he's been making out is a huge leap forward. How much the general culture of the AI community has gone downhill since ChatGPT was released and caught the attention of pseudo-intellectuals is so depressing.

    • @Hexanitrobenzene
      @Hexanitrobenzene 3 months ago

      @@EdFormer
      I don't think such a strong criticism is warranted. You described him like he was ColdFusion or something like that :)
      Of course, if you want a raw engineering take, he is not a replacement, but considering all the countless clickbaity AI related videos on my side bar, he seems the most serious of those whose audience is semi-technical. He always reads the papers thoroughly, catches some less well known trends, does his own investigations. He even did a thorough analysis of MMLU dataset bugs. Not a NeurIPS level researcher, but still a serious guy.

    • @EdFormer
      @EdFormer 3 months ago

      @@Hexanitrobenzene absolutely fair response. I guess I have just channelled my dismay regarding the decline of the once great culture of our community at him specifically, since his channel is the only example of these new AI channels (which I think are illustrative of that decline) that I am prepared to engage with. I still watch his videos, so must find some value there. I just really miss being able to rely on regular, diverse, and legitimately technical/academic content from the likes of Yannic and Letitia that really added another dimension to my never ending literature review. Even Edan Meyer, an undergrad at the time, provided the perspective that I think is sorely missed. I feel these channels have struggled to keep up with the cookie cutter garbage that is now filling my recommendations, and again, I am probably focusing my blame for that on AI Explained. One valid criticism I have of AI Explained though is the misleading perspective I think he provides that is AI = LLMs. I'm probably very biased as someone with a background in computer vision, but there's so much more going on than just LLMs. I find it mind blowing, given how things were previously, that I cannot find a good deep dive video on the Segment Anything Model (SAM) or DINOv2.

  • @bibhabasumohapatra
    @bibhabasumohapatra 4 months ago +2

    More than the merits and demerits versus transformers, the best part is its ability to work across modalities: text, audio, and voice clips.

  • @kimchi_taco
    @kimchi_taco 4 months ago +18

    I think this is very similar to "Retentive Network", which Yannic covered a few months ago. The state transition model reminds me of a linear Kalman filter. Anyway, I can't believe a single-vector memory can carry all the necessary information for every token, one size fits all.

    • @andreaterlizzi
      @andreaterlizzi 4 months ago +3

      Well, they actually mention this in the paper, and yeah Kalman filters are a type of state-space model

    • @axe863
      @axe863 23 days ago

      @andreaterlizzi lol, if that's true, can it really handle non-Gaussian processes?
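
    For reference, the resemblance discussed above is exact on the linear part: an SSM layer runs the noise-free state equation that a Kalman filter assumes, while the Kalman filter adds Gaussian process and observation noise (the Gaussian assumption is what its optimality guarantees rely on; a learned SSM just fits the maps and makes no such claim). Roughly:

        \begin{aligned}
        \text{SSM layer:} \quad & h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, & y_t &= C\,h_t \\
        \text{Kalman model:} \quad & h_t = A\,h_{t-1} + B\,u_t + w_t, & z_t &= C\,h_t + v_t,
        \quad w_t \sim \mathcal{N}(0, Q),\ v_t \sim \mathcal{N}(0, R)
        \end{aligned}
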

  • @heyman620
    @heyman620 4 months ago +1

    You are just amazing.

  • @berkk1993
    @berkk1993 4 months ago +1

    Thanks, great!

  • @jsalsman
    @jsalsman 4 months ago +10

    Very happy transformers aren't the only game in town.

  • @albertmashy8590
    @albertmashy8590 4 months ago +18

    This is gonna be crazy if you think about it. It's like you could initialize an "assistant" or agent with a huge prompt, but rather than including that information every time, you "save" that state to save on compute when generating the next tokens, because the prompt doesn't need to be re-loaded every time. This also means that agents could all have their own different personalities and behaviors without significant fine-tuning requirements.
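
    In the recurrent view that is exactly what falls out: the layer's whole memory is a fixed-size state, so a long system prompt could in principle be processed once and its final state cached and reused. A hedged sketch of that caching pattern; model_step and init values here are hypothetical stand-ins, not a real Mamba API.

        def prefill(model_step, state, prompt_tokens):
            # Run the recurrence over the prompt once; the state that comes out
            # has a fixed size no matter how long the prompt was.
            for tok in prompt_tokens:
                state, _ = model_step(state, tok)
            return state  # cache this "persona"/context snapshot and reuse it

        def generate(model_step, state, token, n_tokens, pick_next):
            # Continue from a cached state without ever re-reading the prompt.
            out = []
            for _ in range(n_tokens):
                state, logits = model_step(state, token)
                token = pick_next(logits)
                out.append(token)
            return out

        # toy stand-in: "state" is one number and the "logits" just echo it
        toy_step = lambda s, tok: (0.9 * s + tok, 0.9 * s + tok)
        cached = prefill(toy_step, 0.0, [1, 2, 3, 4])     # long prompt, processed once
        print(generate(toy_step, cached, 0, 3, round))    # cheap continuations afterwards
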

  • @srh80
    @srh80 4 months ago +2

    Please do more papers, like before!

  • @apilaiteloc6520
    @apilaiteloc6520 5 days ago

    lovely

  • @leisha1519
    @leisha1519 4 months ago +1

    For A^n, will so many multiplications by A lead to an explosion of parameters?

  • @mohdil123
    @mohdil123 4 months ago

    Super nice

  • @corgirun7892
    @corgirun7892 4 months ago

    good video, Yannic has a strong ability to dissect and extract key points

  • @vladimirtchuiev2218
    @vladimirtchuiev2218 4 months ago +7

    This looks a lot like the state-space representation from control theory. What they are presenting is basically a learnable dynamic system with a linear state transition matrix A and an input-dependent input matrix B, which makes it non-linear, same as for the observation matrix C. Looks like a massive upgrade over transformers for stuff like music generation, maybe even ViT-based models. What isn't clear to me is how they learn the A matrix; it seems that the farther away the context is, the more severe the vanishing gradient problem, and the nearest elements in the sequence are by far the most significant.

    • @patrickl5290
      @patrickl5290 4 months ago

      Something about the “selective” piece maybe? My thought would be that the forgetting rules differ based on a lot of factors, so there are many opportunities for the model to be induced to not forget potentially relevant details

    • @user-kd3lr6ny4t
      @user-kd3lr6ny4t 4 months ago

      The A matrix is also input-dependent, but yeah, it's hard to believe that it can "remember" things that are 10^6 steps upstream (basically the K term in eq. 3a and 3b).

    • @axelmarora6743
      @axelmarora6743 1 month ago

      @@user-kd3lr6ny4t He says A is still a parameter at 30:13, and it shows in the paper as well. I think he made a mistake when he said A was input-dependent, because it seems not to be.
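
    For what it's worth, in the paper A itself is an input-independent learned parameter (kept diagonal); only Δ, B and C are computed from the token, and A enters each step through the discretization, so the effective transition still changes per position. Roughly (zero-order hold, with the simpler first-order rule commonly used for B̄):

        \bar{A}_t = \exp(\Delta_t A), \qquad
        \bar{B}_t = (\Delta_t A)^{-1}\bigl(\exp(\Delta_t A) - I\bigr)\,\Delta_t B_t \approx \Delta_t B_t, \qquad
        h_t = \bar{A}_t\,h_{t-1} + \bar{B}_t\,x_t .

    A small Δ_t leaves the previous state almost untouched, while a large Δ_t overwrites it with the current input; and with the real parts of A negative, exp(Δ_t A) stays below 1 in magnitude, so long-range contributions fade smoothly rather than exploding.
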

  • @ithaca2076
    @ithaca2076 4 months ago +3

    this is my christmas present

  • @jabowery
    @jabowery 4 months ago

    What happens if you limit the training corpus to enwik9 and then just hammer away?

  • @004307ec
    @004307ec 4 months ago +1

    😊 This is the second video I've watched for this paper, and I understand it better now. The paper is kind of too hard for me 😂

  • @jonsimonatwork
    @jonsimonatwork 4 months ago +3

    I think something is off about your explanation of the A_t prefix products around 35min. The dimensions given in Algorithm 2 imply that A remains constant across timesteps, since it has no L component.

    • @NLogSpace
      @NLogSpace 2 months ago +2

      I wanted to ask the same thing. I think it is a mistake in the video. Also, earlier he mentioned that A is constant and only B and C depend on the input.

  • @TheEbbemonster
    @TheEbbemonster 4 months ago +9

    Interesting that A^n actually works for long sequences. I would have expected a severe degradation of performance as sequences get longer...

    • @ItsRyanStudios
      @ItsRyanStudios 4 months ago +2

      my same thought; I need to implement this in code to understand how/ why this works

    • @Thien--Nguyen
      @Thien--Nguyen 4 months ago +1

      I think there are some parameterization constraints, like the eigenvalues of A cannot be positive, for A^n to be stable. The H3 and HiPPO papers talk more about those conditions (there are also diagonal and other structural constraints on A to make things more efficient, iirc).

    • @user-kd3lr6ny4t
      @user-kd3lr6ny4t 4 months ago

      Second this. If the eigenvalue is negative, then it will vanish even more quickly (consider a context length of 10^6).
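
    A toy illustration of the stability point discussed above (not the paper's actual parameterization): keeping the magnitude of the discretized transition below 1 prevents blow-up over very long contexts, at the price that old contributions fade.

        # powers of a scalar "transition" with magnitude just below vs just above 1
        for eig in (0.999, 1.001):
            for n in (10, 1_000, 100_000):
                print(f"eig={eig}, n={n}: {eig ** n:.3e}")
        # |eig| < 1: contributions of old inputs fade smoothly (no explosion);
        # |eig| > 1: they blow up, which is why A is constrained so that the
        # discretized transition exp(delta * A) has magnitude below 1.
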

  • @simonl1938
    @simonl1938 4 months ago +3

    SRAM has very low latency, but the total bandwidth is less than what the GPU-CPU link can do (especially with DMA). You could use something like Buffer Object Streaming (that's the name OpenGL has for it) to reduce memory usage massively. This would also allow for much larger batches, as they are computed concurrently with the same weights. Does anyone know if this is already being done? I can't find anything on the topic.

    • @afbf6522
      @afbf6522 4 months ago

      Do it. Can you talk about this in private? I'm very interested in the topic of optimizing operations right now

  • @user-rb3hb2ry9h
    @user-rb3hb2ry9h 22 days ago

    An interesting topic and interesting research explained in this paper. I recommend watching!

  • @varunsaagars
    @varunsaagars 4 months ago +19

    🎯 Key Takeaways for quick navigation:
    00:00 📜 *This video discusses the Mamba paper, which introduces a linear-time sequence modeling approach with Selective State Spaces.*
    01:19 🔄 *Transformers have the advantage of dynamic and selective attention but suffer from quadratic computation and memory requirements, while RNNs have limited memory and scalability.*
    03:05 🔄 *Backpropagation through time in RNNs can be memory-intensive and lead to gradient problems.*
    05:51 🔄 *State space models, like S4, offer an alternative to RNNs and Transformers with linear computation but lack data-dependent transitions.*
    09:34 🔄 *Mamba introduces Selective State Spaces, relaxing the input-independence constraint while retaining linear computation, making it a hybrid between SSM and LSTM.*
    11:39 🚀 *Mamba is competitive with Transformers, especially on long sequences and dense data like language and genomics.*
    12:22 📚 *The paper addresses computational efficiency and context-based reasoning, improving on previous SSM models.*
    16:12 🤖 *Mamba's architecture combines selective state spaces with other components, providing linear scaling in sequence length.*
    18:40 🚀 *Mamba offers fast training and inference, scaling linearly in sequence length during training and achieving performance on sequences up to one million in length.*
    26:33 🧮 *The video explains a sequence modeling technique involving the computation of Y3, which depends on various time steps and matrices.*
    28:28 🖥️ *The technique allows for the precomputation of certain parameters, making it possible to compute the output of a new sequence instantly.*
    29:38 🧩 *Mamba focuses on making parameters input-dependent, achieving efficiency through GPU memory optimization.*
    32:07 🚀 *The paper reduces data movement and utilizes fast memory (SRAM) for matrix multiplications, resulting in significant speed improvements.*
    34:21 🧬 *Mamba performs the scan operation differently, using a prefix-sum technique to accommodate input-dependent elements (see the sketch after this list).*
    36:18 📈 *Mamba shows promising scaling performance for large-scale sequence models, outperforming other attention-free models.*
    37:23 💻 *Inference throughput on an A100 GPU is good and improves as batch size increases when compared to Transformers.*
    37:36 🧪 *The paper discusses the intricacies of the efficient implementation, providing insights into memory transfers and cost reductions.*
    39:50 📊 *The code implementation includes various components such as input projection, 1D convolution, discretization, recurrence, and gating pathways for efficient sequence modeling.*
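
    On the 34:21 item: the recurrence h_t = a_t * h_{t-1} + b_t can be rewritten with an associative combine, which is what lets a work-efficient parallel scan (rather than a convolution) handle input-dependent coefficients. Below is a hedged, sequential sketch of that combine rule, checked against the naive loop; the real implementation fuses this into a hardware-aware CUDA kernel rather than Python.

        import numpy as np

        def combine(left, right):
            # An element (a, b) stands for the map h -> a * h + b; composing two
            # such maps gives another one of the same form, so the operation is
            # associative and a parallel (Blelloch-style) scan can evaluate it.
            a1, b1 = left
            a2, b2 = right
            return a1 * a2, a2 * b1 + b2

        def inclusive_scan(a, b):
            # Sequential reference: fold the combine left to right; with h_0 = 0
            # the running "b" component is exactly h_t at every step.
            acc = (np.ones_like(a[0]), np.zeros_like(b[0]))
            out = []
            for t in range(len(a)):
                acc = combine(acc, (a[t], b[t]))
                out.append(acc[1])
            return np.stack(out)

        # sanity check against the naive recurrence h_t = a_t * h_{t-1} + b_t
        rng = np.random.default_rng(1)
        a = rng.uniform(0.5, 1.0, size=(8, 3))
        b = rng.standard_normal((8, 3))
        h, ref = np.zeros(3), []
        for t in range(8):
            h = a[t] * h + b[t]
            ref.append(h.copy())
        assert np.allclose(inclusive_scan(a, b), np.stack(ref))
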

  • @sirati9770
    @sirati9770 4 months ago

    Regarding not having a facecam: for people still learning English, having a visual reference can drastically improve listening comprehension.

  • @erickmarin6147
    @erickmarin6147 4 months ago

    Would love to see some classification of machine learning itself using AI; there seems to be a lot of stuff with different names that is functionally very similar.

  • @ngayminh8463
    @ngayminh8463 2 months ago

    I haven't seen any Mamba implementations in PyTorch. Also, will it replace Transformers?

  • @aleksszukovskis2074
    @aleksszukovskis2074 4 months ago +1

    I don't understand what got me to even watch this at 2:00 in the morning

  • @user-mr7dd5ye8e
    @user-mr7dd5ye8e 20 days ago

    Wowww

  • @charmy1138
    @charmy1138 4 months ago +2

    What are the differences between RWKV, RetNet, and Mamba? Which here has the closest architecture to transformers?

    • @stellabiderman4056
      @stellabiderman4056 4 months ago +3

      This is maybe worth a video all to itself, but they're all different types / levels of tradeoff between a transformer and an RNN, basically

  • @user-gb5td1ld7s
    @user-gb5td1ld7s 1 month ago

    Great explanation. Will you please share the link to the github repo?

  • @feiyuchen1383
    @feiyuchen1383 2 months ago

    What does the "zoom" mean in the video?

  • @lexer_
    @lexer_ 4 months ago +4

    This very much triggers the same intuitive reservations and skepticism I feel towards linear transformers and other attempts at improving scaling by throwing away one of the essential components of what makes transformers work. I am not convinced that any of these architectures actually brings the same scaling we have seen with transformers so far, where the only limiting factor seems to be the amount and quality of the training data.

    • @sagetmaster4
      @sagetmaster4 4 months ago +1

      The future looks like collections of models working in concert, so having transformers doing most of the work in most systems with certain domains using a different architecture seems plausible

    • @dennismertens990
      @dennismertens990 4 months ago +1

      After "Text Embeddings Reveal (Almost) As Much As Text" came out, I became convinced that encoder-decoder transformers learn embeddings that behave just like an RNN's hidden/recurrent state. If you plot the embedding of a sentence unmasked sequentially, you get a trajectory. That paper shows this trajectory is quite informative w.r.t. the original text.
      This is interesting because that suggests you can train an RNN on the learned embeddings. Since you already have the embeddings, there is no need for back-prop through time. It would be like training a regular neural net on input-output examples, where the input is the embedding at time t and the output is the embedding at time t+1. It's only speculation, but it could be a practical way of distilling pre-trained transformers into RNNs for deployment.
      P.S. By unmasking a sentence sequentially, I mean for example "hel", "hell", "hello".
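
      The distillation idea above is the commenter's speculation, not something from either paper, but as a sketch it amounts to plain next-embedding regression on pairs from a frozen encoder. All names below are hypothetical, and a real attempt would use a stronger model than a linear map.

        import numpy as np

        def train_embedding_stepper(E, lr=1e-2, epochs=200, seed=0):
            # Fit a linear map W so that E[t] @ W approximates E[t + 1].
            # E: (T, d) embeddings of increasingly unmasked prefixes of one text
            # ("hel", "hell", "hello", ...), assumed to come from a frozen encoder.
            # Each (E[t], E[t+1]) pair is an ordinary regression example, so no
            # backpropagation through time is needed.
            rng = np.random.default_rng(seed)
            d = E.shape[1]
            W = 0.01 * rng.standard_normal((d, d))
            X, Y = E[:-1], E[1:]
            for _ in range(epochs):
                grad = X.T @ (X @ W - Y) / len(X)   # gradient of the mean squared error
                W -= lr * grad
            return W

        # fake embedding trajectory standing in for a real encoder's outputs
        fake_traj = np.cumsum(0.1 * np.random.default_rng(2).standard_normal((50, 16)), axis=0)
        W = train_embedding_stepper(fake_traj)
        print(np.mean((fake_traj[:-1] @ W - fake_traj[1:]) ** 2))
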

    • @lexer_
      @lexer_ 4 months ago

      ​@@dennismertens990 I agree that these architectures in general and Mamba in particular seem like they are a very good idea for more task specific deep learning. I should have specified that I was referring to the general purpose LLMs or more multi-modal models which develop some level of reasoning and general abstraction capability. For more constrained tasks like DNA modelling this might very well be a huge leap forward.
      In regards to sequential unmasking, it would of course still be token based, not character based, but I don't think you can get away with throwing away the attention matrix without a massive loss in performance even if you train an embedding.

    • @lexer_
      @lexer_ 4 months ago +2

      @@sagetmaster4 A lot of people have been saying this for many years at this point. And for a long time this seemed like a very reasonable assumption. If you cannot scale generally, you have to specialize. It's the same with general compute as well as many other areas. But I haven't seen a single example of this actually working to a degree that outperforms just putting the same amount of extra compute into a larger or longer-trained model. I am beginning to think that we are missing some very fundamental and essential insight to make this actually work. Sparse network designs like mixture of experts seem to have some real benefits here, but only for inference speed. But I would argue this is only tangentially related to the whole heterogeneous architecture idea.
      I for one think the next major architectural step probably needs to be intermediate higher-level memory instead of just scaling the context. Being able to store abstractions in memory seems like such a fundamental component of human thinking that I can't help but think it might have the potential for major improvements in an LLM as well.
      The other thing that will eventually be necessary is a way for models to spend significantly more time "thinking" before answering. There were a few attempts with traditional LLMs and a thinking-token to essentially allow more inference steps for the next token. And the results looked promising in the paper. But now it seems to have been mostly forgotten about. So a more fundamental way for the model to recurse internally might be necessary for introspective logic.

    • @TheRyulord
      @TheRyulord 4 months ago +4

      @@lexer_ If this paper's claims are actually true then it has better perplexity on the pile (general purpose language modeling) than traditional transformers. We've also seen attention actually struggles quite a bit on information retrieval with long sequences so while it is powerful it's not a perfect mechanism. A lot of older subquadratic transformer papers basically just did "attention but worse" (eg. window + random attention) and so they naturally had tradeoffs that a completely different mechanism like this isn't going to necessarily have.

  • @ayushtibrewal4535
    @ayushtibrewal4535 3 months ago

    Can we use this model for image classification, like ViT?

    • @SOFTWAREMASTER
      @SOFTWAREMASTER 3 months ago +1

      Yep, you can. Check out V Mamba (Vision Mamba).

    • @ayushtibrewal4535
      @ayushtibrewal4535 3 months ago

      @@SOFTWAREMASTER but there is no training code for vision mamba.

  • @mkamp
    @mkamp 4 months ago

    Can somebody help me understand this? In figure 1 in the paper it says “Structured SSMs independently map each channel (e.g. 𝐷 = 5) of an input 𝑥 to output 𝑦 through a higher dimensional latent state ℎ (e.g. 𝑁 = 4)”. How are four dimensions higher dimensional than five?
    I am not trolling and I understand that it says “e.g.” and that it could not be deliberate. But given the quality of the paper that seems unlikely.
    Is there another way to interpret higher dimensional? Does it just mean “not a scalar”?

    • @mkamp
      @mkamp 4 months ago +1

      I found the solution. Well, GPT4 did, but who’s counting? 😅
      Each of the five channels is mapped to four hidden dimensions. Of course, now it all makes sense.
      This is what the horse said: “The caption mentions that structured State Space Models (SSMs) map each channel of an input to an output through a higher dimensional latent state \( h \) (for example, \( N = 4 \)) while avoiding the materialization of a large effective state space. The inconsistency you're pointing out seems to be related to the dimensionality \( D \) given as an example (i.e., \( D = 5 \)) versus the term "higher dimensional."
      This likely means that each channel \( D \) is independently mapped, and the "higher dimensional" aspect refers to the combination of these mappings to create a state space with more dimensions than the individual inputs. The effective state space would then be \( D \times N \), which is larger than the dimensionality of each individual channel \( D \) alone. So, in this context, "higher dimensional" does not contradict the example dimensions given; rather, it points to the resultant multi-dimensional space created by the model's structure.”

    • @akolec
      @akolec 3 months ago

      Why did a "horse" say that? @@mkamp
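
    In terms of shapes, the reading worked out above is simply the following (a toy illustration, not the reference code): each of the D channels gets its own N-dimensional latent state, so the layer carries D x N numbers forward, which is "higher dimensional" than any single scalar channel even though N < D.

        import numpy as np

        B, L, D, N = 2, 10, 5, 4    # batch, sequence length, channels, state size per channel
        x = np.zeros((B, L, D))     # each channel carries one scalar per time step
        h = np.zeros((B, D, N))     # the SSM keeps an N-dimensional latent state per channel
        print(h[0].size)            # 20 = D * N numbers carried forward per batch element
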

  • @qingyangzhang6093
    @qingyangzhang6093 4 months ago +5

    will we use mamba to install mamba?

    • @GatlingNG
      @GatlingNG 4 months ago +5

      I hope not. The whole conda ecosystem should not be used for serious production code. Pyenv + poetry + pip is all you need.

    • @vcool
      @vcool 4 months ago +2

      @@GatlingNG conda can install things that pip can't.

    • @GatlingNG
      @GatlingNG 4 months ago

      @@vcool We have containers for that, not some badly half-assed, cobbled-together venv manager with barely maintained packages.

  • @PeterIsza
    @PeterIsza 4 months ago +1

    Okay, so the hidden layer is just a really big IIR filter: en.wikipedia.org/wiki/Infinite_impulse_response
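
    That reading checks out for the fixed (non-selective) case: a one-channel, scalar-state SSM is literally a first-order IIR filter. A quick numerical check of the equivalence, assuming scipy is available:

        import numpy as np
        from scipy.signal import lfilter

        a, b, c = 0.9, 0.5, 2.0          # scalar SSM: h[n] = a*h[n-1] + b*x[n], y[n] = c*h[n]
        x = np.random.default_rng(3).standard_normal(64)

        h, y = 0.0, []                   # explicit recurrence
        for xn in x:
            h = a * h + b * xn
            y.append(c * h)

        # the same system as a first-order IIR filter: y[n] = a*y[n-1] + (c*b)*x[n]
        y_iir = lfilter([c * b], [1.0, -a], x)
        assert np.allclose(y, y_iir)
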

  • @user-uw4iz4us5i
    @user-uw4iz4us5i 4 months ago +7

    If Mamba can handle such long sequences, does it need tokenization?

    • @simonl1938
      @simonl1938 4 months ago +1

      You can't really just input words like that into any model, but it might still be easier in other ways. In the S4 talk video Albert Gu talks about how much easier it was for people in other fields to use their model since it's a one size fits all with great performance without tuning.

    • @albertmashy8590
      @albertmashy8590 4 months ago +2

      Neither transformers nor Mamba need tokenization; it's more just for efficiency. But I'm curious to hear about the potential implications, and whether Mamba would be more capable of handling larger vocabularies.

  • @tjpld
    @tjpld 4 months ago +2

    Generative Pretrained Mamba

  • @alivecoding4995
    @alivecoding4995 4 months ago

    I am super sceptical towards these reduced models. But I think they are interesting as a kind of ablation study for the Transformer architecture, helping us understand its inner mechanisms better.

  • @vcool
    @vcool 4 months ago

    I want to see a model that requires memory logarithmic in sequence length, which could be better than requiring linear memory.

    • @johnzinhoinhoinho
      @johnzinhoinhoinho 4 months ago +1

      This model has constant memory at inference; it has linear memory in the training phase.

  • @NoNameAtAll2
    @NoNameAtAll2 3 months ago

    tragic loss of not seeing your sunglasses*
    fify

  • @worthstream
    @worthstream 4 months ago

    I've always preferred videos with no face cam, actually.

  • @monoham1
    @monoham1 3 months ago

    Yet again I can't understand a word of the maths except the intro.
    I've heard of LSTM,
    big O,
    matrix,
    FFT...
    that's about it.

    • @monoham1
      @monoham1 3 months ago

      "its just basic 2nd year maths" - i hate people who say this. they have no idea of the stuggles of people who have to work for 10 years just to afford 1 year of university or cant understand why paying 2 months of your income for 1 months of rent is a barrier to learning this IMPOSSIBLE maths!
      ITS NOT MATHS ITS LATIN CODE FOR BASIC SHIT ONLY YOU UNDERSTAND. THERES NO WAY TO LEARN THIS IF YOUR PARENTS ARENT MILLIONAIRES

    • @justtoleavecomments3755
      @justtoleavecomments3755 1 month ago

      @@monoham1 My parents were poor immigrants from Eastern Europe, and I took loans to go to a public university (Canadian) that cost $10k/yr. I got good grades and the govt (of Canada) wrote off half those loans. The US has cheap state schools too, right?
      You don't need to be a millionaire.
      You just need time and interest.

  • @thinkalinkle
    @thinkalinkle 4 months ago

    I can't help feeling like the claim that this will "replace" Transformers is a bit much... especially after watching the video.

  • @csabaczcsomps7655
    @csabaczcsomps7655 4 months ago +2

    Remember, training data contains influenced, wrong, and intermediate truths, high adv. data, school philosophy, wrong-time and era-optimized ideas... and a lot more. You somehow need to filter or devalue these values, or put them in some different value dimensions. This will make hallucination go down a little more. My noob opinion.

  • @krimdelko
    @krimdelko 4 months ago

    Why learn multiplication and division when all you need is addition and real numbers? Same principle here. Reduce complexity

  • @axe863
    @axe863 23 days ago

    Before most of you were born ..... lol 😅

  • @jondo7680
    @jondo7680 4 months ago +1

    It really makes no sense that all these efficient architectures come in small sizes like 3B models. If they trade off capabilities for performance, they should release 7B or 13B models; why would anyone run a 3B Mamba or RWKV if 7B Mistral runs on everything? Nice tech, but as long as it's under 7B it's just a demo.

    • @wenhanzhou5826
      @wenhanzhou5826 3 months ago

      This is research, not engineering. Maybe they just had limited compute resources.

  • @DanFrederiksen
    @DanFrederiksen 4 months ago

    I was surprised to learn that transformers rely on a classical algo hack to have a kind of memory. I'm quite sure that's a flawed premise that won't last. It reminds me of bag of words, which was a tremendously flawed premise. Despite how relatively amazingly well transformers work, most approaches so far seem hacky.
    I've thought ahead a bit, and I figure that any kind of real-world, well-rounded AI needs a vastly more complex and sophisticated architecture. That's not to say it couldn't happen relatively quickly, but LLMs ain't it. Just the ability to hold a thought and progressively work on it with shifting focus is way beyond LLMs. And that's analogous to image generation, which superficially looks very close to flawless but in reality is also very far from the holy grail, for much the same reason.

  • @BB-sd6sm
    @BB-sd6sm 4 months ago

    Opinion: over-engineering architectures won't actually solve the real problem at hand, which is intelligence.

    • @erkinalp
      @erkinalp 4 months ago

      overbuilding allows one to handle more unforeseen types of problems, though.

  • @hanyanglee9018
    @hanyanglee9018 4 months ago +1

    LSTM, transformers, and anything in a similar direction are just not gonna work.

    • @Japneets1
      @Japneets1 3 months ago

      what do you mean by 'not gonna work'? Transformers have already proven their worth, haven't they?