Mamba and S4 Explained: Architecture, Parallel Scan, Kernel Fusion, Recurrent, Convolution, Math

  • Published 6 Jun 2024
  • Explanation of the paper Mamba: Linear-Time Sequence Modeling with Selective State Spaces
    In this video I will explain Mamba, a new sequence modeling architecture that can compete with the Transformer. I will start by introducing the various sequence modeling architectures (RNN, CNN and Transformer) and then dive deep into State Space Models. To fully understand State Space Models, we need some background in differential equations, so I will provide a brief introduction to differential equations (in 5 minutes!) and then derive the recurrent formula and the convolutional formula from first principles. I will also prove mathematically (with the help of visual diagrams) why State Space Models can be run as a convolution. I will explain what the HiPPO matrix is and how it helps the model "memorize" the input history in a finite state. (A small numerical sketch of the discretized recurrence and its equivalent convolution is included after the chapter list.)
    In the second part of the video, I will explore Mamba and in particular the Selective Scan algorithm: I first explain what the scan operation is and how it can be parallelized, and then show how the authors further improved the algorithm with kernel fusion and activation recomputation. I will also give a brief lesson on the GPU memory hierarchy and why some operations may be IO-bound. (A small sketch of the parallel scan idea is also included after the chapter list.)
    In the last part of the video we will explore the architecture of Mamba and some performance results to compare it with the Transformer.
    Slides PDF and Parallel Scan (excel file): github.com/hkproj/mamba-notes
    Chapters
    00:00:00 - Introduction
    00:01:46 - Sequence modeling
    00:07:12 - Differential equations (basics)
    00:11:38 - State Space Models
    00:13:53 - Discretization
    00:23:08 - Recurrent computation
    00:26:32 - Convolutional computation
    00:34:18 - Skip connection term
    00:35:21 - Multidimensional SSM
    00:37:44 - The HiPPO theory
    00:43:30 - The motivation behind Mamba
    00:46:56 - Selective Scan algorithm
    00:51:34 - The Scan operation
    00:54:24 - Parallel Scan
    00:57:20 - Innovations in Selective Scan
    00:58:00 - GPU Memory Hierarchy
    01:01:23 - Kernel Fusion
    01:01:48 - Activations recomputation
    01:06:48 - Mamba architecture
    01:10:18 - Performance considerations
    01:12:54 - Conclusion
  • Science & Technology
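
Below is a minimal numerical sketch of the state-space ideas described above (discretization, the recurrent formula, and the equivalent convolution). It is not the code from the video or the paper; the matrix sizes, the diagonal choice of A and the step size delta are purely illustrative.

```python
import numpy as np

# Toy sizes: state size N, sequence length L, one input/output channel.
N, L = 4, 8
rng = np.random.default_rng(0)
A_diag = -rng.uniform(0.5, 1.5, N)          # stable, diagonal continuous-time A (illustrative)
A = np.diag(A_diag)
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
x = rng.standard_normal(L)                  # input sequence x_0 ... x_{L-1}
delta = 0.1                                 # discretization step

# Zero-order-hold discretization:
#   A_bar = exp(delta * A),  B_bar = A^{-1} (exp(delta * A) - I) B
A_bar = np.diag(np.exp(delta * A_diag))
B_bar = np.diag((np.exp(delta * A_diag) - 1.0) / A_diag) @ B

# 1) Recurrent view:  h_k = A_bar h_{k-1} + B_bar x_k,   y_k = C h_k
h = np.zeros((N, 1))
y_recurrent = []
for k in range(L):
    h = A_bar @ h + B_bar * x[k]
    y_recurrent.append((C @ h).item())

# 2) Convolutional view:  y = K * x  with kernel  K_j = C A_bar^j B_bar
K = np.array([(C @ np.linalg.matrix_power(A_bar, j) @ B_bar).item() for j in range(L)])
y_convolution = [sum(K[j] * x[k - j] for j in range(k + 1)) for k in range(L)]

print(np.allclose(y_recurrent, y_convolution))   # True: the two views agree
```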
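And a sketch of the scan idea behind the selective scan: because each step's update h_k = a_k * h_{k-1} + b_k can be composed with an associative operator, the prefix results can be computed by a divide-and-conquer (parallel) scan instead of a purely sequential loop. Scalar states and a fixed a_k are used here only for brevity; this only illustrates the associativity argument, not the fused CUDA kernel from the paper.

```python
def combine(first, second):
    # Compose two affine updates h -> a*h + b; applying `first` then `second` gives
    # h -> (a2*a1)*h + (a2*b1 + b2). This operation is associative.
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)

def sequential_scan(pairs):
    h, out = 0.0, []
    for a, b in pairs:
        h = a * h + b
        out.append(h)
    return out

def parallel_scan(pairs):
    # Divide-and-conquer inclusive scan: both halves can be scanned independently
    # (in parallel), then the left half's total is folded into the right half's prefixes.
    if len(pairs) == 1:
        return list(pairs)
    mid = len(pairs) // 2
    left, right = parallel_scan(pairs[:mid]), parallel_scan(pairs[mid:])
    carry = left[-1]
    return left + [combine(carry, p) for p in right]

a_bar = 0.9
xs = [1.0, 2.0, -1.0, 0.5]
pairs = [(a_bar, x) for x in xs]             # b_k = B_bar * x_k, with B_bar = 1 here

print(sequential_scan(pairs))                # hidden states h_0 ... h_3
print([b for _, b in parallel_scan(pairs)])  # same values, via the associative scan
```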

COMMENTS • 116

  • @andrewhaslam8785
    @andrewhaslam8785 5 months ago +39

    Brilliant - you are easily one of the most lucid and accessible teachers of deep learning.

  • @ItsRyanStudios
    @ItsRyanStudios 4 months ago +24

    this is absolutely FANTASTIC
    I watched Albert Gu's Stanford lecture on state space models/Mamba, and it was a great high-level overview.
    But I really appreciate you taking it slower and going further into detail on the basic/fundamental concepts.
    A lot of us aren't mathematicians or ML engineers, so it's much appreciated to be helped along with those concepts.

    • @umarjamilai
      @umarjamilai  4 months ago +5

      Thank you for your kind words. Please share the video in your network, it would help me a lot. Thanks!

  • @danaosama4247
    @danaosama4247 2 months ago +7

    I rarely comment on videos, but this one was worth it. Thank you so much for such a clear explanation. You explained all the nuances that I previously did not understand in a very clear way. God bless you.

  • @anirudh514
    @anirudh514 4 months ago +9

    Your teaching approach is very good. You started from fundamental concepts and went deeper. This helped in building intuition and understanding, and avoided confusion in the later parts. Brilliant!

  • @purohitadey-bc9bg
    @purohitadey-bc9bg 15 days ago +1

    Understanding Mamba couldn't get any better than this!

  • @trungquang1581
    @trungquang1581 2 months ago +3

    I just read about Mamba and wanted to find a detailed explanation video. Everything you covered in this video is exactly what I needed, thank you so much, keep on cooking

  • @beincheekym8
    @beincheekym8 8 days ago +1

    Brilliant video! Really clear and with just the right amount of detail!

  • @sid-prod
    @sid-prod 4 months ago +5

    I'm so glad I found this channel, you are a gold mine for such content, please keep them coming.

  • @aruns.v9248
    @aruns.v9248 4 months ago +5

    The whole lecture was very intuitive. Thanks for the efforts put into building this video!

  • @ankush4617
    @ankush4617 5 months ago +2

    Thanks for the amazing work as usual! Keep it up - this is probably some of the highest-quality content on LLMs on YouTube.

  • @trevorhobenshield
    @trevorhobenshield 4 months ago +2

    Very high quality, this is great. Hard to find good content like this. Thanks Umar!

  • @arvyzukai
    @arvyzukai 5 months ago +2

    This is gold! I really appreciate attention to the details. Thank you Umar!

  • @remyshootingstars
    @remyshootingstars 4 months ago +6

    🙌 Still working through Transformers from scratch. Hopefully a Mamba from scratch is in the future!

  • @user-dk9rn9bc7b
    @user-dk9rn9bc7b 4 months ago +2

    As a university student from Beijing, thank you for sharing this analysis of the paper! Best wishes!

  • @sari54754
    @sari54754 4 months ago +2

    After I saw this lecture, I subscribed to your channel. It is the easiest-to-understand Mamba lecture I've seen.

  • @mudassirkhan9054
    @mudassirkhan9054 4 months ago +1

    Thanks for explaining it in a way that anyone with some high school math background can understand, keep this up!

  • @optomosprime
    @optomosprime 4 months ago +1

    Excited for the video. I was searching for a video on Mamba and today I saw this. Your Transformer video helped me a lot previously. Keep it up!

  • @celestchowdhury2605
    @celestchowdhury2605 3 months ago +1

    Thank you so much for your detailed video, and for thoughtfully anticipating that we would need help with the equations! You are a savior!

  • @wayneqwele8847
    @wayneqwele8847 4 months ago +1

    Thank you. I appreciate the approach you took in explaining the major concepts.

  • @nishanthshetty435
    @nishanthshetty435 3 months ago

    Thanks a ton! Excellent explanation and great analogies to introduce the more advanced material. This is an absolute masterclass on how to teach advanced material.

  • @majidemami577
    @majidemami577 4 months ago +1

    Excellent video! Thank you. I have watched a few videos about Mamba and this one was by far the best.

  • @myfolder4561
    @myfolder4561 1 month ago +1

    Thank you so much. Lots of useful details, yet you move through them at such a good tempo, with easy-to-follow examples.

  • @ActualCode0
    @ActualCode0 4 months ago

    This is one of the best ML explanations I've seen; even though I didn't understand all of it, I definitely learnt something new.

  • @GenAiWarrior
    @GenAiWarrior 4 months ago +1

    Thank you so much for your efforts to make such an amazing video on Mamba architecture !!

  • @mcHsyu
    @mcHsyu 4 months ago +1

    Great explanation!! This is the first video that makes me comprehend the whole Mamba paper.

  • @Erosis
    @Erosis 4 months ago +1

    As others have mentioned, you have a keen ability to explain difficult topics succinctly and completely. Keep up the awesome work! I could have used this when I took a class on time-series modeling! Hah!

  • @The_bioinformatician
    @The_bioinformatician 3 months ago +1

    This is the best deep learning video I've ever seen. I will surely use some of your slides to teach my students

  • @user-lb8sh2vh3r
    @user-lb8sh2vh3r 4 months ago +1

    OMG! This is such an amazing description, you made my day.

  • @fabiogomez8250
    @fabiogomez8250 5 months ago +2

    Best MAMBA video at the moment!

  • @prashlovessamosa
    @prashlovessamosa 5 months ago +2

    Salute to consistency
    Thanks Umar sir.

  • @akshikaakalanka
    @akshikaakalanka 1 month ago +1

    This is really helpful for another talk I am doing on Mamba. Thank you very much for putting this out.

  • @TheRohit901
    @TheRohit901 4 months ago

    Amazing explanation. I love this video because it covers sufficient depth and explains each concept with proper examples. I've subscribed instantly, and look forward to more such videos on recent papers.

  • @junhaoliu9436
    @junhaoliu9436 3 months ago +2

    very good video!!! thanks a lot for your efforts!!!!

  • @EkShunya
    @EkShunya 5 months ago +1

    I always eagerly wait for your explainers. They are 🤯.
    Thank you :)

  • @whisperlast6548
    @whisperlast6548 2 months ago +1

    This video is of great help!! Thank you very much.

  • @luisrperaza
    @luisrperaza 3 months ago +1

    I did learn a lot! Many thanks for making this video.

  • @kwanhowong5065
    @kwanhowong5065 3 months ago +1

    Really an amazing video! You save me a lot of time! Thank you!

  • @divgill6062
    @divgill6062 4 months ago

    Amazing! So detailed. Well done sir

  • @soroushmehraban
    @soroushmehraban 2 months ago +1

    Love it! Keep up the amazing work.

  • @810602jay
    @810602jay 5 months ago

    Thanks Umar! 🥰 Very amazing learning material for Mamba!

  • @tunatuncer5639
    @tunatuncer5639 2 months ago +1

    Wow, that's a great explanation, thanks for the efforts!

  • @akashkumar-jg4oj
    @akashkumar-jg4oj 5 months ago +1

    Great explanation!

  • @user-dh3up2iw7o
    @user-dh3up2iw7o 13 hours ago +1

    Amazing video.

  • @bulat_15
    @bulat_15 4 months ago +1

    Thanks man! This helped me a lot

  • @amitshukla1495
    @amitshukla1495 5 months ago +1

    Absolutely amazing 🎉

  • @BooleanDisorder
    @BooleanDisorder 4 months ago +1

    Even I understood much of this. I have no education. Thank you! Mamba looks really cool. Especially like the long context and further refinement. It looks like a model that could be made to learn as it goes. Plasticity potential

  • @m1k3b7
    @m1k3b7 3 months ago +1

    Brilliant explanations. Thanks.

  • @shoubhikdasguptadg9911
    @shoubhikdasguptadg9911 2 months ago +1

    Ohhh Man, why did I discover this gem so late :( This guy is a rockstar!

  • @user-hh5cu5ir2e
    @user-hh5cu5ir2e 4 months ago +1

    excellent work! Thank you

  • @allengeng6660
    @allengeng6660 1 month ago +1

    Very nice talk, thank you.

  • @user-tr8ic8jo6g
    @user-tr8ic8jo6g 3 months ago +1

    absolutely fantastic

  • @raminakbari394
    @raminakbari394 2 months ago

    Thank you for this great and smooth explanation. I think the model you are showing at 36:14 is valid if matrix A (and also B, so that each input goes directly to the corresponding SSM) is diagonal. In that case each hidden-state component along a different canonical direction (i.e. a different element of the vector) is independent of the others. If A is not diagonal, then assuming an eigendecomposition exists, we may say there exists an equivalent SSM whose components are independent (if we change the basis to the eigenbasis).
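
A tiny numpy check of the observation above, under the assumption that A is diagonalizable: changing basis with the eigenvectors of A (h~ = V^-1 h) turns the coupled recurrence into independent scalar recurrences. Sizes and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 3
A = rng.standard_normal((N, N))              # generic, non-diagonal state matrix
B = rng.standard_normal((N, 1))
x = rng.standard_normal(5)

lam, V = np.linalg.eig(A)                    # A = V diag(lam) V^{-1}
V_inv = np.linalg.inv(V)

# Coupled recurrence in the original basis: h_k = A h_{k-1} + B x_k
h = np.zeros((N, 1))
for xk in x:
    h = A @ h + B * xk

# Decoupled recurrences in the eigenbasis: each component evolves on its own
h_tilde = np.zeros(N, dtype=complex)
B_tilde = (V_inv @ B).ravel()
for xk in x:
    h_tilde = lam * h_tilde + B_tilde * xk   # elementwise: N independent scalar SSMs

print(np.allclose(V @ h_tilde.reshape(-1, 1), h))   # True (up to numerical error)
```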

  • @albertmashy8590
    @albertmashy8590 3 months ago +1

    Amazing video

  • @artaasadi9497
    @artaasadi9497 3 months ago +1

    Thanks a lot, that was very useful!

  • @nguyenhuuuc2311
    @nguyenhuuuc2311 5 months ago +1

    Thanks for the awesome content! Hope the next one will be about DPO and coding it from scratch ❤

    • @umarjamilai
      @umarjamilai  1 month ago

      You're welcome: ua-cam.com/video/hvGa5Mba4c8/v-deo.html

    • @nguyenhuuuc2311
      @nguyenhuuuc2311 1 month ago +1

      @@umarjamilai Thank you!!! You're so talented at research and teaching!!!!

  • @passarodavide
    @passarodavide 4 months ago +1

    Beautiful video, thank you!

  • @dotori-hj
    @dotori-hj 3 months ago +1

    Fantastic

  • @eafadeev
    @eafadeev 2 months ago

    You're making very useful content, thank you!!! Maybe you could consider using larger text, so that one could read easily from a phone. Also, a plus would be if the presentation were white on black (or a bright color on black); it is less tiring to look at a dark screen for long periods of time.

  • @buh357
    @buh357 2 months ago +1

    you are the best.

  • @pcwang7803
    @pcwang7803 2 months ago

    Great lecture! It was much easier for me to understand the work with your lecture.
    Can you make one for reinforcement learning?

  • @toxicbisht4344
    @toxicbisht4344 3 months ago

    Amazing explanation.
    Waiting for the new video.
    Please upload soon.

  • @mdbayazid6837
    @mdbayazid6837 5 months ago

    Jazakallah Khairan

  • @edsonjr6972
    @edsonjr6972 5 months ago

    Excellent video! I'm looking forward to a coding one, if you make it. Thank you so much for your work for the AI community.

    • @umarjamilai
      @umarjamilai  5 months ago +2

      Coding one is not very interesting, because the most interesting part is the selective scan algorithm, which is a CUDA Kernel. The architecture is not so different from any other language model. Of course it would be super cool to code the CUDA kernel from scratch ;-)

  • @user-zu2sy2lq6t
    @user-zu2sy2lq6t 2 months ago

    I've just started watching, but I guess this video will be very useful.

  • @123456ewr
    @123456ewr 5 months ago +1

    Thanks, I hope you explain RWKV.

  • @walidmaly3
    @walidmaly3 3 months ago +1

    One of the best! I have one question: if we apply the convolution in S4 to a sequence of length L, what will the size of the convolution kernel be?

  • @ankush4617
    @ankush4617 5 months ago +1

    Thanks!

  • @Charles-Darwin
    @Charles-Darwin 3 months ago +1

    Thank you

  • @kunchangli9319
    @kunchangli9319 4 months ago +1

    Brilliant! Awesome!

  • @samuelbeaussant3097
    @samuelbeaussant3097 3 months ago

    Very good lecture! Thank you very much for putting this on YouTube for free :) I have a question though: if my understanding of the HiPPO framework is correct, the A matrix is built to uniformly approximate the input signal (named HiPPO-LegS in the paper): "Our novel scaled Legendre measure (LegS) assigns uniform weight to all history [0, t]". However, at 41:49 you explain that it decays exponentially, similarly to HiPPO-LagT. Do they opt for HiPPO-LagT when moving to S4 and Mamba, or am I missing something?

  • @aamir122a
    @aamir122a 4 months ago

    As a suggestion for your next video, you could cover a GPT decoder-based multimodal model.

  • @ShubhamAshokGandhi
    @ShubhamAshokGandhi 3 months ago +1

    Great explanation. Very thorough. Loved it. I struggled with understanding the SSM paper; you explained all the bits beautifully.

  • @heewoongchoi27
    @heewoongchoi27 3 months ago

    you are so smart!

  • @baiyouheng5365
    @baiyouheng5365 5 months ago +1

    great😀😀

  • @umuthalil5001
    @umuthalil5001 2 months ago

    Hi, I was wondering if you could explain 36:40 a bit more, where you talk about multi-head attention. From what I understand, each head in multi-head attention looks at the whole input vector. Our key, value and query matrices are all of size D x head_size, where D is the embedding dimension, so to find the keys we do key = X @ key_matrix, where X is a C x D matrix and C is the context length. This means each head looks at the whole embedding dimension D and represents it as a head_size vector, meaning that the arrows going into each head should point at every single input dim.

  • @cicerochen313
    @cicerochen313 5 months ago +1

    Awesome! Great!

  • @venkateshr6127
    @venkateshr6127 5 months ago +1

    Please can you make a video on optimizers like Adam, Adagrad, ...?

  • @pawanpatil4715
    @pawanpatil4715 4 months ago +4

    Hi Umar, amazing video. You are the best teacher. You are Karpathy 2.0. :) Please make a video on DPO :)

    • @umarjamilai
      @umarjamilai  1 month ago +1

      Done: ua-cam.com/video/hvGa5Mba4c8/v-deo.html

    • @pawanpatil4715
      @pawanpatil4715 1 month ago

      @@umarjamilai thank you so much 😃

  • @GrifinsBrother
    @GrifinsBrother 5 months ago

    Need more code from scratch videos!

  • @user-mr7dd5ye8e
    @user-mr7dd5ye8e 1 month ago

    You are amazing! How did you learn all this?

  • @belamipro7073
    @belamipro7073 4 months ago

    Thanks!

    • @umarjamilai
      @umarjamilai  4 months ago

      Thank you very very very much for your generous support! Let's connect on LinkedIn!

  • @LukasSmith827
    @LukasSmith827 5 months ago

    You're extremely underrated; I don't think I'll be able to use much of the valuable info tbh.

  • @RahulPrajapati-jg4dg
    @RahulPrajapati-jg4dg 5 months ago

    Hi Umar, can you please upload a video with a detailed explanation of the GPT architecture?

  • @Huawei_Jiang
    @Huawei_Jiang 2 months ago

    I have one question about the example you provided, 'the number of buddies'. I think the function should be like this: b(t)=5squ(3)^λt. Please comment if I am wrong.

  • @undefined-mj6oi
    @undefined-mj6oi 3 months ago

    Hey! Thanks for the details in this video.
    I'm confused about the HiPPO matrix, which seems to be fixed given N?
    However, the paper states that delta, A, B and C are all trainable. What did I miss?

    • @undefined-mj6oi
      @undefined-mj6oi 3 months ago

      is HiPPO the initialization of A?

    • @umarjamilai
      @umarjamilai  3 months ago

      Yeah, just the initialization

    • @undefined-mj6oi
      @undefined-mj6oi 3 months ago

      Thanks for the clarification.
      Could you please further explain how the parameter A has shape (D, N) in S4? If I have D SSMs, one for each embedding dimension, shouldn't A have D·N² parameters?

  • @abrahamsong6913
    @abrahamsong6913 4 months ago

    This is so far the only video I've found that describes the math part of the Mamba model, thanks a lot.
    One small issue: at 37:00, for the attention model, you mentioned that each head takes only a portion of the input dimensions. Can you confirm this? I believe each head actually uses all input dimensions.

    • @abrahamsong6913
      @abrahamsong6913 4 months ago

      It might be true for LLMs, but I believe this is not true for the original transformer model.

    • @umarjamilai
      @umarjamilai  4 months ago

      Hello! First of all thanks for the kind words.
      Yes, in multi-head attention, the idea is that each head sees the entire sequence, but a different portion of the embedding of each token. This is to make each head relate tokens in different ways. This mechanism is described in my previous video on the Transformer model.
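
A small shape sketch (illustrative names and sizes only, not code from the video) that reconciles the two viewpoints in this thread: each head's query can be seen either as its own D x d_head projection of the full embedding (so every input dimension feeds every head), or as one slice of a single D x D projection, with each head then owning a different portion of the projected embedding. Both descriptions give the same per-head tensors.

```python
import numpy as np

C_len, D, n_heads = 6, 16, 4          # context length, embedding dim, number of heads
d_head = D // n_heads
rng = np.random.default_rng(2)

X = rng.standard_normal((C_len, D))   # token embeddings
W_Q = rng.standard_normal((D, D))     # combined query projection (same idea for K and V)

# View 1: each head has its own D x d_head projection that reads ALL input dimensions.
Q_per_head = [X @ W_Q[:, h * d_head:(h + 1) * d_head] for h in range(n_heads)]

# View 2: project once to D dimensions, then split so each head gets its own slice.
Q_full = X @ W_Q                              # (C_len, D)
Q_split = np.split(Q_full, n_heads, axis=1)   # n_heads slices of shape (C_len, d_head)

print(all(np.allclose(a, b) for a, b in zip(Q_per_head, Q_split)))   # True
```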

  • @Huawei_Jiang
    @Huawei_Jiang 2 months ago

    Can we run Mamba on a normal GPU?

  • @user-jb3ht1wq5l
    @user-jb3ht1wq5l 2 days ago

    PLEASE explain spacetimeformer

  • @user-hd7xp1qg3j
    @user-hd7xp1qg3j 5 months ago

    You're the GOAT

    • @umarjamilai
      @umarjamilai  5 months ago

      GOAT? 🐐 Beeeehhh 😅😅

    • @user-hd7xp1qg3j
      @user-hd7xp1qg3j 5 months ago

      @umarjamilai Yeah, you're the Greatest Of All Time (GOAT).

  • @user-ud3rv5xo6z
    @user-ud3rv5xo6z 4 months ago

    Umar, please do a "train Mamba from scratch" video. Everybody wants that (even on the Mamba GitHub there are a lot of requests, but the authors said they have not published the training loop). I hope and believe you will fix this knowledge gap.

  • @easydoesitismist
    @easydoesitismist 4 months ago +1

    Orange 🧡 place brought me here.

  • @techw4y
    @techw4y 5 months ago

    Listened for about half an hour, didn't get a clue about this topic, and stopped it! Thanks for the attempt though.

  • @shamraiznazir1755
    @shamraiznazir1755 4 months ago

    I also want to learn something.

  • @ZatoichiRCS
    @ZatoichiRCS 2 months ago

    Am I watching a total rip-off of the Fourier transform and Z-transform in all of AI/ML? The differential equation approach is brute force. We use Laplace.

    • @umarjamilai
      @umarjamilai  2 months ago +1

      Yeah, you can do everything in the S space with Laplace transform, but most ML researchers do not have a controls engineering background, so we stick to differential equations 😉

  • @Hello-tx7ug
    @Hello-tx7ug 3 months ago

    Thanks!