Mamba and S4 Explained: Architecture, Parallel Scan, Kernel Fusion, Recurrent, Convolution, Math

  • Published 2 Feb 2025

COMMENTS • 144

  • @andrewhaslam8785
    @andrewhaslam8785 1 year ago +53

    Brilliant - you are easily one of the most lucid and accessible teachers of deep learning.

  • @ItsRyanStudios
    @ItsRyanStudios 1 year ago +32

    This is absolutely FANTASTIC.
    I watched Albert Gu's Stanford lecture on state space models/Mamba, and it was a great high-level overview.
    But I really appreciate you taking it slower and going further into detail on the basic, fundamental concepts.
    A lot of us aren't mathematicians or ML engineers, so the help with those concepts is much appreciated.

    • @umarjamilai
      @umarjamilai 1 year ago +7

      Thank you for your kind words. Please share the video in your network, it would help me a lot. Thanks!

  • @danaosama4247
    @danaosama4247 10 months ago +10

    I rarely comment on videos, but this one was worth it. Thank you so much for such a clear explanation. You explained all the nuances that I previously did not understand in a very clear way. God bless you.

  • @anirudh514
    @anirudh514 1 year ago +10

    Your teaching approach is very good. You started from fundamental concepts and went deeper. This helped in building intuition and understanding, and in avoiding confusion in the later parts. Brilliant!

  • @remyshootingstars
    @remyshootingstars 1 year ago +7

    🙌 Still working through Transformers from scratch. Hopefully a Mamba from scratch is in the future!

  • @trungquang1581
    @trungquang1581 10 months ago +3

    I just read about mamba and wanted to find a detailed explanation video. All you covered in this video is everything I need, thank you so much, keep on cooking

  • @sid-prod
    @sid-prod 1 year ago +7

    I'm so glad I found this channel. You are a gold mine for such content, please keep it coming.

  • @SatyanarayanSenapati-b1s
    @SatyanarayanSenapati-b1s 5 months ago +1

    Words fall short of appreciating the work you put into creating these videos. Simply BRILLIANT.

  • @aruns.v9248
    @aruns.v9248 1 year ago +5

    The whole lecture was very intuitive. Thanks for the efforts put into building this video!

  • @Andy-paw-Jessica
    @Andy-paw-Jessica 7 months ago +1

    You are just too amazing! You understand this stuff in great detail, and then you take the time to explain it to us in educational videos. A true gem of a channel!

  • @Frederickawuahgyasi
    @Frederickawuahgyasi 3 months ago +1

    You're amazing. God bless you. You made this the best hour I've spent trying to understand MAMBA. Keep up the great work.

  • @RabeehKarimiCH
    @RabeehKarimiCH 25 days ago +1

    Best presentation I have seen so far, thanks so much

  • @trevorhobenshield
    @trevorhobenshield 1 year ago +2

    Very high quality, this is great. Hard to find good content like this. Thanks Umar!

  • @sari54754
    @sari54754 1 year ago +2

    After I saw this lecture, I subscribed to your channel. It is the easiest-to-understand Mamba lecture I've seen.

  • @mudassirkhan9054
    @mudassirkhan9054 1 year ago +1

    Thanks for explaining it in a way that anyone with some high school math background can understand, keep this up!

  • @ankush4617
    @ankush4617 1 year ago +2

    Thanks for the amazing work as usual! Keep it up. This is probably some of the highest-quality content on LLMs on YouTube.

  • @AUTO-g7s
    @AUTO-g7s 1 year ago +2

    As a university student from Beijing, thank you for sharing this analysis of the paper! Best wishes!

  • @optomosprime
    @optomosprime 1 year ago +1

    Excited for the video. I was searching for a video on Mamba and today I saw this. Your Transformer video helped me a lot previously. Keep it up!

  • @ankush4617
    @ankush4617 1 year ago +1

    Thanks!

  • @arvyzukai
    @arvyzukai 1 year ago +2

    This is gold! I really appreciate the attention to detail. Thank you Umar!

  • @purohitadey-bc9bg
    @purohitadey-bc9bg 8 months ago +1

    Understanding mamba couldn't be better than this !

  • @myfolder4561
    @myfolder4561 9 months ago +1

    Thank you so much. Lots of useful details, yet you move through them at such a good tempo, with easy-to-follow examples

  • @The_bioinformatician
    @The_bioinformatician 11 months ago +1

    This is the best deep learning video I've ever seen. I will surely use some of your slides to teach my students

  • @celestchowdhury2605
    @celestchowdhury2605 11 months ago +1

    Thank you so much for your detailed video, and for anticipating that we would need help with the equations! You are a savior!

  • @nishanthshetty435
    @nishanthshetty435 11 months ago

    Thanks a ton! Excellent explanation and great analogies to introduce the more advanced material. This is an absolute masterclass on how to teach advanced material.

  • @ActualCode0
    @ActualCode0 1 year ago

    This is one of the best ML explanations I've seen even though I didn't understand all of it but I definitely learnt something new.

  • @majidemami577
    @majidemami577 1 year ago +1

    Excellent video! Thank you. I have watched a few videos about mamba and this one was by far the best.

  • @周毅-b1h
    @周毅-b1h 4 months ago +1

    I'm very thankful for your explanation of this article, best wishes for you!

  • @Hello-tx7ug
    @Hello-tx7ug 11 months ago

    Thanks!

  • @mcHsyu
    @mcHsyu 1 year ago +1

    Great explanation!! This is the first video that makes me comprehend the whole Mamba paper.

  • @belamipro7073
    @belamipro7073 1 year ago

    Thanks!

    • @umarjamilai
      @umarjamilai 1 year ago

      Thank you very very very much for your generous support! Let's connect on LinkedIn!

  • @fabiogomez8250
    @fabiogomez8250 1 year ago +2

    Best MAMBA video at the moment!

  • @GenAiWarrior
    @GenAiWarrior 1 year ago +1

    Thank you so much for your efforts to make such an amazing video on Mamba architecture !!

  • @wayneqwele8847
    @wayneqwele8847 1 year ago +1

    Thank you. I appreciate the approach you took in explaining the major concepts.

  • @raaminakbari
    @raaminakbari 10 months ago

    Thank you for this great and smooth explanation. I think the model you are showing at 36:14 is valid if the matrix A is diagonal (and B too, so that each input is sent directly to the corresponding SSM). In that case each element of the hidden state vector (each canonical direction) is independent of the others. If A is not diagonal, then, assuming an eigendecomposition exists, there is an equivalent SSM whose components are independent once we change to the eigenbasis.
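
The change of basis described in this comment can be checked numerically. What follows is a minimal sketch, assuming a discrete SSM h_t = A h_{t-1} + B x_t, y_t = C h_t with a diagonalizable 2 x 2 matrix A. The values of V, LAM, B and C and all helper names are made up for illustration; they are not taken from the video or the paper.

```python
def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def matvec(P, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]

# Illustrative values: A = V @ diag(LAM) @ V_INV, so LAM holds the eigenvalues of A.
V = [[1.0, 1.0], [1.0, 2.0]]
V_INV = [[2.0, -1.0], [-1.0, 1.0]]    # exact inverse of V (det V = 1)
LAM = [0.9, 0.5]
A = matmul(matmul(V, [[LAM[0], 0.0], [0.0, LAM[1]]]), V_INV)
B = [1.0, 0.5]
C = [1.0, -1.0]

def run_dense(xs):
    """Original basis: the state components are coupled through A."""
    h, ys = [0.0, 0.0], []
    for x in xs:
        Ah = matvec(A, h)
        h = [Ah[i] + B[i] * x for i in range(2)]
        ys.append(sum(c * hi for c, hi in zip(C, h)))
    return ys

def run_diagonal(xs):
    """Eigenbasis z = V_INV @ h: each component evolves on its own."""
    B_e = matvec(V_INV, B)                                            # V_INV @ B
    C_e = [sum(C[i] * V[i][j] for i in range(2)) for j in range(2)]   # C @ V
    z, ys = [0.0, 0.0], []
    for x in xs:
        z = [LAM[i] * z[i] + B_e[i] * x for i in range(2)]   # no cross terms
        ys.append(sum(c * zi for c, zi in zip(C_e, z)))
    return ys

xs = [1.0, -2.0, 0.5, 3.0]
dense, diag = run_dense(xs), run_diagonal(xs)
print(all(abs(a - b) < 1e-9 for a, b in zip(dense, diag)))
```

Both runs produce the same outputs up to floating-point error, although the second one never mixes state components. For a real matrix A with complex eigenvalues the eigenbasis is complex, so the equivalent diagonal SSM would need complex-valued states.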

  • @Erosis
    @Erosis 1 year ago +1

    As others have mentioned, you have a keen ability to explain difficult topics succinctly and completely. Keep up the awesome work! I could have used this when I took a class on time-series modeling! Hah!

  • @흰강아지-s4v
    @흰강아지-s4v 4 months ago +1

    this is just a pure art; thanks so much

  • @Mirai12377
    @Mirai12377 11 months ago +2

    very good video!!! thanks a lot for your efforts!!!!

  • @akshikaakalanka
    @akshikaakalanka 9 months ago +1

    This is really helpful for another talk I am doing on Mamba. Thank you very much for putting this out.

  • @beincheekym8
    @beincheekym8 8 months ago +1

    Brilliant video! Really clear and with just the right amount of details!

  • @selayan4985
    @selayan4985 7 months ago +1

    Such brilliant work you have done. Really learned a lot, thanks!!!

  • @BooleanDisorder
    @BooleanDisorder 1 year ago +1

    Even I understood much of this. I have no education. Thank you! Mamba looks really cool. Especially like the long context and further refinement. It looks like a model that could be made to learn as it goes. Plasticity potential

  • @TheRohit901
    @TheRohit901 1 year ago

    Amazing explanation. I love this video because it covers sufficient depth and explains each concept with proper examples. I've subscribed instantly, and look forward to more such videos on recent papers.

  • @SpandanMishra-z4r
    @SpandanMishra-z4r 1 year ago +1

    OMG! This is such an amazing description, you made my day

  • @shoubhikdasguptadg9911
    @shoubhikdasguptadg9911 10 months ago +1

    Ohhh Man, why did I discover this gem so late :( This guy is a rockstar!

  • @prashlovessamosa
    @prashlovessamosa 1 year ago +2

    Salute to consistency
    Thanks Umar sir.

  • @a123s1l
    @a123s1l 4 months ago

    Thanks for your clear explanation of MAMBA. Coming from a control theory background, I very much appreciate its usage in LLMs. One small error I noted: the A matrix must be N x N to map the previous N-dimensional hidden state h(t-1) to h(t). I believe the A matrix is also time-varying, to produce selective output tokens.
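
On the point about A being time-varying: as far as the Mamba paper describes it, A itself is a learned parameter that does not depend on the input, while the step size Δ is computed from the input, so the discretized matrix Ā = exp(ΔA) changes from token to token. A minimal sketch, assuming a diagonal A and zero-order-hold discretization; the scalar projection `w`, the `bias`, and the function names are hypothetical and only illustrate the idea.

```python
import math

def softplus(v):
    """Smooth positive function, used here to keep the step size positive."""
    return math.log1p(math.exp(v))

def step_size(x_t, w=1.0, bias=0.0):
    """Hypothetical input-dependent step size: Delta_t = softplus(w * x_t + bias)."""
    return softplus(w * x_t + bias)

def discretize_a(delta_t, a_diag):
    """A_bar_t = exp(Delta_t * A) for a diagonal A, computed elementwise."""
    return [math.exp(delta_t * a) for a in a_diag]

a_diag = [-1.0, -2.0, -3.0]   # the same A at every time step (illustrative values)
for x_t in (-2.0, 0.0, 2.0):
    print(x_t, discretize_a(step_size(x_t), a_diag))
```

With negative diagonal entries, a larger Δ pushes Ā toward zero (the state forgets more), while a smaller Δ keeps Ā close to one (the state is mostly preserved).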

  • @TheFitsome
    @TheFitsome 5 months ago +1

    some people are just born to teach.

  • @soroushmehraban
    @soroushmehraban 10 months ago +1

    Love it! Keep up the amazing work.

  • @danamics
    @danamics 6 months ago +1

    Great job on this video! I learned a lot

  • @kwanhowong5065
    @kwanhowong5065 11 months ago +1

    Really an amazing video! You saved me a lot of time! Thank you!

  • @lewylondon
    @lewylondon 11 months ago +1

    I did learn a lot! Many thanks for making this video.

  • @tunatuncer5639
    @tunatuncer5639 10 months ago +1

    wow that's a great explanation , thanks for the efforts!

  • @jason988081
    @jason988081 6 months ago

    Dear Umar, referring to 53:50: the recurrent SSM is indeed similar to a prefix sum (i.e., y=x_0+x_1+...+x_N), but the difference is that h_t=Ah_{t-1}+Bx_t, where h_{t-1} depends on h_{t-2}. I know how the Blelloch parallel prefix scan works for calculating a sum of constants, but I do not know how the parallel scan works for h_t=Ah_{t-1}+Bx_t. Could you please elaborate on it? Thank you. @Umar
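
One standard way to apply a scan to this recurrence is to treat each time step as an affine map h -> a_t * h + b_t (with b_t = B x_t) and to scan over the composition of these maps, which is associative. The sketch below assumes a scalar (or per-element diagonal) a_t and a zero initial state. It uses a simple Hillis-Steele style scan rather than the Blelloch scan from the video, and all names and values are illustrative.

```python
def combine(left, right):
    """Compose two steps h -> a*h + b, where `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def sequential_scan(a, bx):
    """Reference: h_t = a_t * h_{t-1} + bx_t with h_{-1} = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, bx):
        h = a_t * h + b_t
        out.append(h)
    return out

def scan_in_rounds(a, bx):
    """Inclusive scan over `combine`; every round could run in parallel."""
    elems = list(zip(a, bx))
    n, step = len(elems), 1
    while step < n:
        # All positions i are independent within one round.
        elems = [elems[i] if i < step else combine(elems[i - step], elems[i])
                 for i in range(n)]
        step *= 2
    # With a zero initial state, the offset of the composed map is h_t.
    return [b for _, b in elems]

a = [0.5, 0.9, 0.8, 0.7, 0.6]
bx = [1.0, 2.0, 3.0, 4.0, 5.0]
seq, par = sequential_scan(a, bx), scan_in_rounds(a, bx)
print(max(abs(s - p) for s, p in zip(seq, par)) < 1e-12)
```

For a full matrix A the same operator applies with matrix products in place of the scalar ones. This variant needs about log2(n) rounds but does more total work than the sequential loop; the work-efficient Blelloch variant uses an up-sweep and a down-sweep, roughly 2 log2(n) rounds.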

  • @umuthalil5001
    @umuthalil5001 10 months ago

    Hi, I was wondering if you could explain 36:40 a bit more, where you talk about multi-head attention. From what I understand, each head in multi-head attention looks at the whole input vector. Our key, value and query matrices are all of size D x (head_size), with D the embedding dimension, so to find the key we compute key = X @ key_matrix, where X is a C x D matrix and C is the context length. This means each head looks at the whole embedding dimension D and represents it as a head_size vector, so the arrows going into each head should point at every single input dimension.

  • @我我-p3z
    @我我-p3z 6 months ago +1

    The clearest explanation!

  • @EkShunya
    @EkShunya 1 year ago +1

    I always eagerly wait for your explainers. They are 🤯.
    thank you :)

  • @bryanbocao4906
    @bryanbocao4906 7 months ago

    Thanks for the video! Very informative! Just to check: at 1:03:42, should step 3 be "... save back the result to HBM"?

  • @bulat_15
    @bulat_15 1 year ago +1

    Thanks man! This helped me a lot

  • @whisperlast6548
    @whisperlast6548 10 months ago +1

    This video is of great help!!Thank you very much.

  • @erfanasgari21
    @erfanasgari21 1 month ago

    At 57:00, isn't the time complexity reduced to 2*lg(n) with the parallel scan? Thanks for the amazing explanation, by the way. 💚

  • @m1k3b7
    @m1k3b7 11 months ago +1

    Brilliant explanations. Thanks.

  • @samuelbeaussant3097
    @samuelbeaussant3097 11 months ago

    Very good lecture! Thank you very much for putting this on YouTube for free :) I have a question, though. If my understanding of the HiPPO framework is correct, the A matrix is built to approximate the input signal uniformly (named HiPPO LegS in the paper): "Our novel scaled Legendre measure (LegS) assigns uniform weight to all history [0, t]". However, at 41:49 you explain that it decays exponentially, similarly to HiPPO LagT. Do they opt for HiPPO LagT when moving to S4 and Mamba, or am I missing something?

  • @deepikagurung9410
    @deepikagurung9410 3 months ago

    Are you going to code it as well?
    I really liked the video; it was easy to follow and very comprehensive.

  • @divgill6062
    @divgill6062 1 year ago

    Amazing! So detailed. Well done sir

  • @amitshukla1495
    @amitshukla1495 1 year ago +1

    Absolutely amazing 🎉

  • @810602jay
    @810602jay 1 year ago

    Thanks Umar! 🥰Very amazing learning material for Mamba!

  • @mahmoudreda5054
    @mahmoudreda5054 3 months ago +1

    thank you for this video , really helped me

  • @eafadeev
    @eafadeev 10 months ago

    You're making very useful content, thank you!!! Maybe you could consider using larger text, so that one could read easily from a phone. Also a plus would be if the presentation were white on black (or bright color on black), it is less tiring to look at a dark screen for long periods of time.

  • @walidmaly3
    @walidmaly3 11 months ago +1

    One of the best! I have one question: if we apply the convolution in S4 to a sequence of length L, what will the size of the conv layer be?
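
On the kernel-size question: in the convolutional view of a time-invariant SSM, the kernel has one entry per time step, K = (C B̄, C Ā B̄, ..., C Ā^{L-1} B̄), so its length equals the sequence length L. A minimal sketch, assuming an already discretized diagonal Ā and a zero initial state; function names and values are illustrative.

```python
def ssm_kernel(a_diag, b, c, length):
    """K[k] = C @ A_bar^k @ B_bar for k = 0..length-1, with a diagonal A_bar."""
    kernel, power = [], [1.0] * len(a_diag)   # `power` holds the diagonal of A_bar^k
    for _ in range(length):
        kernel.append(sum(ci * pi * bi for ci, pi, bi in zip(c, power, b)))
        power = [pi * ai for pi, ai in zip(power, a_diag)]
    return kernel

def run_convolution(xs, kernel):
    """Causal convolution: y_t = sum over k = 0..t of K[k] * x_{t-k}."""
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1))
            for t in range(len(xs))]

def run_recurrence(xs, a_diag, b, c):
    """Reference: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h, ys = [0.0] * len(a_diag), []
    for x in xs:
        h = [ai * hi + bi * x for ai, hi, bi in zip(a_diag, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys

a_diag, b, c = [0.9, 0.5], [1.0, 0.5], [1.0, -1.0]
xs = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
kernel = ssm_kernel(a_diag, b, c, len(xs))
y_conv = run_convolution(xs, kernel)
y_rec = run_recurrence(xs, a_diag, b, c)
print(len(kernel), all(abs(u - v) < 1e-9 for u, v in zip(y_conv, y_rec)))
```

The kernel is as long as the input, which is why it is usually applied with an FFT-based convolution rather than the direct double loop shown here.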

  • @nguyenhuuuc2311
    @nguyenhuuuc2311 1 year ago +1

    Thanks for the awesome content! Hope the next one will be about DPO and coding it from scratch ❤

    • @umarjamilai
      @umarjamilai 9 months ago

      You're welcome: ua-cam.com/video/hvGa5Mba4c8/v-deo.html

    • @nguyenhuuuc2311
      @nguyenhuuuc2311 9 months ago +1

      @@umarjamilai Thank you!!! You're so talented at research and teaching!!!!

  • @akashkumar-jg4oj
    @akashkumar-jg4oj 1 year ago +1

    Great explanation!

  • @allengeng6660
    @allengeng6660 9 months ago +1

    Very nice talk, thank you.

  • @immakiku
    @immakiku 6 months ago

    Trying to follow the rationale on the last slides: one advantage of SSMs/RNNs was that they would scale to infinite context. But Mamba reintroduced parameters of length L. Why does this not limit the architecture the same way it limits Transformers? Qualitatively, it seems the only remaining advantage over Transformers is that inference is cheaper. Could you help clarify? Thanks

  • @GoogleColab003
    @GoogleColab003 11 months ago +1

    absolutely fantastic

  • @alainrieger6905
    @alainrieger6905 6 months ago

    Awesome video as usual

  • @팽도리-v6s
    @팽도리-v6s 8 months ago +1

    Amazing video.

  • @杨辉-l2g
    @杨辉-l2g 1 year ago +1

    excellent work! Thank you

  • @pcwang7803
    @pcwang7803 10 months ago

    Great lecture! It is easier for me to understand the work with your lecture.
    Can you give one for Reinforcement learning?

  • @undefined-mj6oi
    @undefined-mj6oi 11 months ago

    Hey! Thanks for the details in this video.
    I'm confused about the HiPPO matrix, which seems to be fixed given N?
    However the paper stated that delta, A, B, C are all trainable. What did I miss?

    • @undefined-mj6oi
      @undefined-mj6oi 11 months ago

      is HiPPO the initialization of A?

    • @umarjamilai
      @umarjamilai 11 months ago

      Yeah, just the initialization

    • @undefined-mj6oi
      @undefined-mj6oi 11 months ago

      Thanks for the clarification.
      Could you please further explain how A has (D, N) parameters in S4? If I have D SSMs, one for each embedding dimension, shouldn't A have DN^2 parameters?

  • @artaasadi9497
    @artaasadi9497 11 months ago +1

    Thanks a lot that was very useful!

  • @buh357
    @buh357 10 months ago +1

    you are the best.

  • @abrahamsong6913
    @abrahamsong6913 1 year ago

    This is so far the only video I have found that describes the math part of the Mamba model. Thanks a lot.
    One small issue: at 37:00, for the attention model, you mentioned that each head takes only a portion of the input dimensions. Can you confirm this? I believe each head actually uses all input dimensions.

    • @abrahamsong6913
      @abrahamsong6913 1 year ago

      It might be true for LLMs, but I believe this is not true for the original transformer model.

    • @umarjamilai
      @umarjamilai 1 year ago

      Hello! First of all thanks for the kind words.
      Yes, in multi-head attention, the idea is that each head sees the entire sequence, but a different portion of the embedding of each token. This is to make each head relate tokens in different ways. This mechanism is described in my previous video on the Transformer model.
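
The two views in this thread can be reconciled with a small sketch, assuming the common implementation in which one full D x D projection is applied and the result is then split across heads. Each head receives only a slice of the projected query, yet every entry of that slice is computed from all D input dimensions. Sizes, weights and names below are illustrative.

```python
def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

D, HEADS = 4, 2
HEAD_SIZE = D // HEADS

X = [[1.0, 2.0, 3.0, 4.0],      # 2 tokens, each with D = 4 embedding dimensions
     [0.5, -1.0, 0.0, 2.0]]
W_Q = [[float(i * D + j) for j in range(D)] for i in range(D)]   # D x D projection

Q_full = matmul(X, W_Q)          # every output column uses all D input dimensions

for head in range(HEADS):
    lo, hi = head * HEAD_SIZE, (head + 1) * HEAD_SIZE
    # View 1: the head takes a slice of the projected query.
    q_slice = [row[lo:hi] for row in Q_full]
    # View 2: the head has its own D x HEAD_SIZE matrix applied to the full input.
    w_head = [row[lo:hi] for row in W_Q]
    q_head = matmul(X, w_head)
    print(head, q_slice == q_head)
```

Slicing the projected query and using a per-head D x HEAD_SIZE weight matrix give the same result, so "a portion of the embedding" refers to the projected vector and not to the raw input.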

  • @albertmashy8590
    @albertmashy8590 11 months ago +1

    Amazing video

  • @rezagholipoor7900
    @rezagholipoor7900 5 months ago +1

    It was very informative

  • @Huawei_Jiang
    @Huawei_Jiang 10 months ago

    I have one question about the example you provided, 'the number of buddies'. I think the function should be like this: b(t)=5squ(3)^λt. Please let me know if I am wrong.

  • @НикитаБуров-ъ6р
    @НикитаБуров-ъ6р 10 months ago

    I've just started watching, but I guess this video will be very useful

  • @passarodavide
    @passarodavide 1 year ago +1

    Beautiful video, thank you!

  • @dotori-hj
    @dotori-hj 11 months ago +1

    Fantastic

  • @andreanegreanu8750
    @andreanegreanu8750 6 months ago

    Hi Professor! Very good explanation as always. However, I have great difficulty understanding the dimensions of the objects. Why the hell would the A matrix have (D,N) dimensions, since it is used to project a vector h_t-1 of N dimensions into N dimensions? By the way, why is it written "Represents structured N x N matrix"?!
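
One way to read the (D, N) shape, assuming the diagonal parameterization used in Mamba: each of the D channels has its own N x N state matrix, but because that matrix is diagonal, only its N diagonal entries are stored. The sketch below uses made-up sizes and values, and the helper names are hypothetical.

```python
D, N = 3, 4   # illustrative sizes: D channels, N state dimensions per channel

# a_diag[d][n] is the n-th diagonal entry of channel d's N x N matrix.
a_diag = [[0.9 - 0.1 * n - 0.01 * d for n in range(N)] for d in range(D)]

def step(h, x, b):
    """One recurrence step for all channels; h and b are D x N, x has D entries."""
    return [[a_diag[d][n] * h[d][n] + b[d][n] * x[d] for n in range(N)]
            for d in range(D)]

def as_full_matrix(diag):
    """Expand the N stored entries into the N x N diagonal matrix they represent."""
    size = len(diag)
    return [[diag[i] if i == j else 0.0 for j in range(size)] for i in range(size)]

stored = sum(len(row) for row in a_diag)   # D * N numbers actually kept
dense = D * N * N                          # what D unstructured N x N matrices need
print(stored, dense)                       # prints: 12 48
```

Multiplying by a diagonal matrix is an elementwise product, which is why `step` never forms an N x N matrix: the stored (D, N) array still "represents" D structured N x N matrices.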

  • @pawanpatil4715
    @pawanpatil4715 1 year ago +4

    Hi Umar, amazing video. You are the best teacher. You are Karpathy 2.0. :) Please make a video on DPO :)

    • @umarjamilai
      @umarjamilai 9 months ago +1

      Done: ua-cam.com/video/hvGa5Mba4c8/v-deo.html

    • @pawanpatil4715
      @pawanpatil4715 9 months ago

      @@umarjamilai thank you so much 😃

  • @sayandas13
    @sayandas13 6 months ago

    Awesome explanation. Really appreciate such content. Can you please make a similar explanation video on the Mamba-2 paper?

  • @Eateryy
    @Eateryy 11 months ago

    amazing explanation
    waiting for new video
    please upload soon

  • @ShubhamAshokGandhi
    @ShubhamAshokGandhi 11 months ago +1

    Great explanation. Very thorough. Loved it. I struggled with understanding the SSM paper. You explained all the bits beautifully

  • @SandeepS-i4e
    @SandeepS-i4e 6 months ago +1

    Great❤

  • @HosseinKhosravipour
    @HosseinKhosravipour 7 months ago +1

    very great

  • @Charles-Darwin
    @Charles-Darwin 11 months ago +1

    Thank you

  • @123456ewr
    @123456ewr 1 year ago +1

    Thanks, I hope you explain RWKV

  • @venkateshr6127
    @venkateshr6127 1 year ago +1

    Can you please make a video on optimizers like Adam, AdaGrad, ...

  • @edsonjr6972
    @edsonjr6972 1 year ago

    Excellent video! I'm looking forward to a coding one, if you make it. Thank you so much for your work for the AI community

    • @umarjamilai
      @umarjamilai 1 year ago +2

      Coding one is not very interesting, because the most interesting part is the selective scan algorithm, which is a CUDA Kernel. The architecture is not so different from any other language model. Of course it would be super cool to code the CUDA kernel from scratch ;-)