Emergency Pod: Mamba, Memory, and the SSM Moment

  • Published 15 Sep 2024

COMMENTS • 38

  • @arinco3817
    @arinco3817 8 months ago +9

    I'm only about 10 mins in but this is a joy to watch! I try to have conversations about how we think etc with work colleagues and they roll their eyes lol. So watching this feels like eating popcorn and watching an awesome movie. Thanks for putting the effort in to create this! It's a service!

  • @mattscheper115
    @mattscheper115 8 months ago +5

    I really like this episode a lot. Thanks a lot for making this.

  • @jeffspaulding43
    @jeffspaulding43 8 months ago +4

    I love these deep dives

  • @jackied962
    @jackied962 8 months ago +4

    Great episode, haven't really seen anyone else talking about this

  • @henrylawler3264
    @henrylawler3264 8 months ago +1

    Best content I've seen in making Mamba explainable. There's also no way you can convince me you're not Trevor from Whitest Kids U Know

  • @JazevoAudiosurf
    @JazevoAudiosurf 8 months ago +1

    I searched for a long time to find someone who gets it and thinks in first principles; I fully share your vision. I'm deeply convinced we will get heavily fine-tuned agents doing neural architecture search: they will create ideas, mutate them, and write and benchmark the results. And since that's something I can do with my limited resources as a web dev, they are probably already doing it. The smarter the bots get, the better the ideas become, a recursive loop to craziness. Furthermore, there is no "in 10 years"; I don't see the slightest chance that this stuff won't take off soon. The simplicity of every part of the chain is just overwhelming.

  • @couldntfindafreename
    @couldntfindafreename 8 months ago +5

    I've tested the `state-spaces/mamba-2.8b` model. The published Mamba models were trained with only a 2k context length, so the long-context support (1M+ tokens), which would be this architecture's most important contribution, cannot be tested with the published weights. It would take continued training on longer and longer contexts until the model reaches 1M tokens. Quoting the authors: "That extrapolation was for a simple synthetic task (induction head). For language modeling it remains to be seen." (A rough sketch of this kind of test follows below.)

    • @nathanlabenz
      @nathanlabenz 8 months ago +2

      Yes, this is the biggest proof point currently missing. I don't see any reason it won't work well enough to at least complement the attention mechanism in frontier systems, but... time will tell!
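
A minimal sketch of the kind of test described above. The `state-spaces/mamba-2.8b-hf` repo id (a transformers-compatible repackaging of the checkpoint the commenter names) and the standard `generate` call are assumptions, not taken from the comment:

```python
# Hedged sketch: load a published Mamba checkpoint and sample a completion.
# Assumes the "state-spaces/mamba-2.8b-hf" repo, a transformers version with
# Mamba support, and a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "state-spaces/mamba-2.8b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "State space models differ from Transformers because"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Generation cost grows linearly with prompt length and the recurrent state is
# fixed-size, but the released weights only saw ~2k-token contexts in training,
# so long-context quality is exactly the open question raised above.
output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```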

  • @anonymousaustralianhistory2081
    @anonymousaustralianhistory2081 8 months ago +2

    This channel is great and underappreciated; very good talk.

  • @palfers1
    @palfers1 7 months ago

    First off, good job on the analysis, and it's good to see you've actually tried things out.
    A couple of thoughts I had about Mamba:
    1. The origins of SSMs lie in the 1960s, when the Kalman filter was born. It's **provably optimal** (with certain caveats), and that should tell you there's some serious theoretical meat on the SSM bone. I worked in radar processing in the 1970s and everything was Kalman. Another Kalman factoid: it's **extensible** in various directions pertaining to nonlinearity of the input space. Which leads to the second thought:
    2. Mamba is extensible in the same way because it's an SSM variant. Thus, instead of making its parameters a function of just the input, they can further be made a function of the hidden state too.
    Now, the second point might make the hardware optimisation moot on Nvidia GPUs, but why should we care in principle? Hardware is not only Nvidia, although the reverse is true. Coming along for AI processing are analog systems, spiking systems, and indeed combinations of both. Neuromorphic chips will eventually take over the role currently occupied by GPUs, perhaps sooner than we think.
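
For readers who haven't met it, the Kalman filter is itself a linear state space model: a hidden state evolves by a transition matrix, is seen only through noisy observations, and the filter updates its estimate of that state optimally under linear-Gaussian assumptions. A minimal illustrative sketch, not tied to any Mamba code:

```python
import numpy as np

# One predict/update cycle of a linear Kalman filter: the same
# "hidden state evolves by A, is observed through H" structure that
# SSM layers inherit, plus the optimal Gaussian belief update.
def kalman_step(x, P, z, A, H, Q, R):
    """x: state estimate, P: state covariance, z: new observation."""
    # Predict: propagate the state and its uncertainty through the dynamics.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q

    # Update: blend prediction and measurement via the Kalman gain.
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Toy example: track 1-D position and velocity from noisy position readings.
dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity dynamics
H = np.array([[1.0, 0.0]])              # we only observe position
Q = 0.01 * np.eye(2)                    # process noise covariance
R = np.array([[0.5]])                   # measurement noise covariance

x, P = np.zeros(2), np.eye(2)
for z in (1.0, 2.1, 2.9, 4.2):
    x, P = kalman_step(x, P, np.array([z]), A, H, Q, R)
print(x)  # estimated [position, velocity]
```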

  • @toniarbona
    @toniarbona 8 months ago +4

    Amazing window into the future. States may be the missing ingredient for System 2.

  • @genegray9895
    @genegray9895 8 months ago +5

    200k tokens is a LOT of tokens. Consider that there are 24 × 3,600 = 86,400 seconds in a day, and you're asleep for a quarter to a third of those, so you'd have to take in about 3.5 tokens per second to reach 200k in your waking hours. Admittedly, with vision, hearing, etc., you could argue you're taking in thousands of tokens per second, but we're not really far off from that in terms of extending MLLM context lengths. If you generously assume 10k tokens per second, that's around 600 million tokens during the waking hours of the day. There are already techniques in the literature that allow us to scale context lengths beyond this size, into the billions of tokens.
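
The back-of-the-envelope numbers above, written out (assuming roughly 16 waking hours):

```python
# Rough check of the figures in the comment above.
seconds_per_day = 24 * 3600                  # 86,400
waking_seconds = seconds_per_day * 2 // 3    # asleep ~1/3 of the day -> 57,600
print(200_000 / waking_seconds)              # ~3.5 tokens/s to fill a 200k context
print(10_000 * waking_seconds)               # ~576M tokens/day at 10k tokens/s
```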

  • @albertmashy8590
    @albertmashy8590 8 months ago +4

    Holy shit, this is going to be crazy if you think about it. You could initialize an "assistant" or agent with a huge prompt, but rather than including that information every time, you "save" the state to cut the compute needed for generating the next tokens, because the prompt doesn't need to be re-processed every time. It also means agents could each have their own personalities and behaviors without significant fine-tuning requirements.
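
A toy illustration of that idea, using a generic linear recurrence as a stand-in for an SSM layer (not Mamba's actual API): because the model compresses everything it has read into a fixed-size state, the state after a long system prompt can be saved once and restored for every new conversation.

```python
import numpy as np

# Toy linear recurrence standing in for an SSM layer: the hidden state h
# is a fixed-size summary of everything the model has read so far.
rng = np.random.default_rng(0)
d_state, d_in = 16, 8
A = 0.9 * np.eye(d_state)                 # state transition (kept simple here)
B = rng.normal(size=(d_state, d_in))      # input projection

def step(h, x):
    return A @ h + B @ x

# "System prompt": process it once, then checkpoint the resulting state.
system_prompt = rng.normal(size=(1000, d_in))   # stand-in for 1,000 token embeddings
h = np.zeros(d_state)
for x in system_prompt:
    h = step(h, x)
np.save("assistant_state.npy", h)               # the saved "personality"

# Later: every new conversation starts from the saved state directly,
# paying nothing to re-read the original prompt.
h = np.load("assistant_state.npy")
user_turn = rng.normal(size=(20, d_in))
for x in user_turn:
    h = step(h, x)
```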

  • @chandreshkk
    @chandreshkk 8 months ago +1

    This was very helpful, thanks

  • @brianmulder4920
    @brianmulder4920 8 months ago +1

    Listening to your insight on "state decay" reminds me of this recent paper that highlights Hebbian memory as one potential strategy.
    "Memoria: Hebbian Memory Architecture for Human-Like Sequential Processing"

    • @nathanlabenz
      @nathanlabenz 8 months ago +1

      Thank you - will check it out

  • @IlEagle.1G
    @IlEagle.1G 8 months ago +2

    YES

  • @seanbergman8927
    @seanbergman8927 8 months ago +1

    Always appreciate your insights. You mentioned on 80,000 Hours that you're thinking about a more organized AI scouting community. That piqued my interest; it's something I'm looking for. It would be interesting to closely follow and contribute to this state space architecture as it develops, from the start.

  • @logmusic1993
    @logmusic1993 8 months ago +1

    Love that you're setting this up by explaining human cognition. Are you aware of the best resources for understanding the state of the art on human brain regions and how they operate? I feel like the way we get closest to human-like cognition is by blending the key brain regions into AI architecture.

  • @grainbow24
    @grainbow24 8 months ago +1

    Insightful video, though tbh a big part of it sounds like excitement about good old RNNs/LSTMs (plus input dependence and hardware awareness).
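
That "plus input dependence" is doing a lot of work: in a classic linear RNN/SSM the transition and input matrices are the same for every token, whereas Mamba's selective SSM derives them from the current input, which is what lets the state decide what to keep and what to forget. A schematic contrast; this toy code conveys the flavor of the idea, not the actual Mamba parameterization or kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in = 8, 4
A = 0.95 * np.eye(d_state)          # fixed state decay
B = rng.normal(size=(d_state, d_in))
w_gate = rng.normal(size=d_in)      # used to derive a per-token gate

def classic_step(h, x):
    # Classic linear recurrence: A and B are identical for every token,
    # so every input is mixed into the state in exactly the same way.
    return A @ h + B @ x

def selective_step(h, x):
    # Input-dependent ("selective") recurrence, schematically: a gate
    # computed from the current token rescales how much old state is
    # kept versus how much of the new input is written. Mamba does this
    # via per-token discretization of its SSM parameters; this gate only
    # illustrates the input dependence.
    delta = 1.0 / (1.0 + np.exp(-(w_gate @ x)))   # in (0, 1), depends on x
    return (1.0 - delta) * (A @ h) + delta * (B @ x)

h_fixed = np.zeros(d_state)
h_selective = np.zeros(d_state)
for x in rng.normal(size=(10, d_in)):
    h_fixed = classic_step(h_fixed, x)
    h_selective = selective_step(h_selective, x)
```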

  • @user-wi3gk5mj2i
    @user-wi3gk5mj2i 8 months ago

    I would really appreciate a list of references somewhere

  • @swfsql
    @swfsql 8 months ago

    1:26:41 I think the next-level RNN would be one that could choose when and which input to read (including repeated readings), and when and which output to write (including overwriting earlier outputs). I'm not sure such a thing would be trainable, but maybe just allowing it to loop back over the input and hold off on outputting would be a step towards that.
    This is for the case where you're fed unending pure noise and can be asked arbitrary questions about it; I guess the only way to deal with that is to be able to re-read the input.

  • @joynohemi
    @joynohemi 8 months ago

    Very informative and scary. Thank you for going so in-depth! Re the question of why they published it this way, I think it's mainly the authors' identity; they seem to be very pro open-source, fast-paced, collaborative approaches. Tri is even part of a community providing open-source LLMs, if I'm not mistaken. I do wonder about your own reasons for publishing this: calling it out so much and naming it an emergency will mainly lead to more attention, and an increased chance of potentially highly capable AI, don't you think?

  • @corley-ai
    @corley-ai 8 months ago +1

    Where did you go for your follow-up research? What else do we have on Mamba?

    • @nathanlabenz
      @nathanlabenz 8 months ago +2

      The other papers I read most deeply include the original HiPPO memory-encoding paper, the recent Block-State Transformers paper from DeepMind, and the StripedHyena announcement from Together AI. I had also previously read earlier papers from the authors, including the Hungry Hungry Hippos (H3) paper, and other attempts to match Transformer expressiveness like RetNet from Microsoft & Tsinghua.

  • @augmentos
    @augmentos 8 months ago

    Regarding the memory token concept, isn't that SPR?

  • @kwikstah
    @kwikstah 8 months ago +1

    Feels like this could use some visual aids

    • @nathanlabenz
      @nathanlabenz 8 months ago +1

      Agreed. Speed of delivery vs. production quality was a real trade-off here!

  • @tyc00n
    @tyc00n 8 months ago

    I don't know, man. I have seen people come up with a similar idea of training a QLoRA for each customer, which is basically state, and the results have been poor compared with using the prompt as state.

    • @nathanlabenz
      @nathanlabenz 8 months ago +1

      Time will tell, for sure, but fine-tuning generally doesn't seem to store facts well; I haven't even been able to reliably teach a model my name that way. Compressed history, on the other hand, these models do seem able to work with.

  • @movieblues4614
    @movieblues4614 8 months ago

    8 minutes. Eight whole minutes on a Shopify ad? Fail. Shame on you. An insult to intelligence. Bye.

    • @nathanlabenz
      @nathanlabenz 8 months ago

      That's a labeling issue, FWIW - the ad is normal length.

    • @movieblues4614
      @movieblues4614 8 months ago

      @nathanlabenz A lack of discipline and creativity.

    • @udoedelmann8132
      @udoedelmann8132 8 months ago

      A few taps take care of it and you roll forward through it 😊

  • @tyc00n
    @tyc00n 8 months ago +4

    How have you managed to make your video production worse over time? That Apple zoom-in/zoom-out cropped effect looks trashy, is distracting, and spills past your black background. These aren't recorded live, so a second high-quality camera recording would be the simplest hack ever. And the hat... it's like you ran into a wall wearing it and said "this is fine" 😄

    • @nathanlabenz
      @nathanlabenz 8 months ago +22

      All substance, no style! :)

    • @benshums
      @benshums 8 months ago +3

      We love you

    • @rockapedra1130
      @rockapedra1130 8 months ago +2

      @nathanlabenz good one