I'm only about 10 mins in but this is a joy to watch! I try to have conversations about how we think etc with work colleagues and they roll their eyes lol. So watching this feels like eating popcorn and watching an awesome movie. Thanks for putting the effort in to create this! It's a service!
I really like this episode. Thanks a lot for making it.
I love these deep dives
Great episode, haven't really seen anyone else talking about this
Best content I've seen in making Mamba explainable. There's also no way you can convince me you're not Trevor from Whitest Kids U Know
i searched for a long time to find someone who gets it and thinks in first principles. i fully share your vision. i'm deeply convinced we will get heavily fine-tuned agents doing neural architecture search. they will create ideas, mutate them, and write and benchmark them. and since that's something i can do with my limited resources as a web dev, they are probably already doing it. and the smarter the bots get, the better the ideas, a recursive loop to craziness. furthermore there is no "in 10 years", i don't see the slightest chance that this stuff won't take off soon. the simplicity of every part of the chain is just overwhelming
I've tested the `state-spaces/mamba-2.8b` model. The published Mamba models were trained at only 2k context length, so the long-context support (1M+ tokens), which would be this model's most important addition, cannot be tested with the published weights. It would need continued training on longer and longer context lengths until it reaches 1M tokens. Quoting: "That extrapolation was for a simple synthetic task (induction head). For language modeling it remains to be seen."
Yes this is the biggest proof point currently missing. I don’t see any reason it won’t work well enough to at least complement the attention mechanism in frontier systems but.. time will tell!
This channel is great and underappreciated. Very good talk.
First off, good job on the analysis and it's good to see you've actually tried stuff out.
A couple of thoughts I had about Mamba:
1. The origins of SSM lie in the 1960s, and that was when the Kalman filter was born. It's **provably optimal** (with certain caveats) and that should tell you that there's some serious theoretical meat on the SSM bone. I worked in radar processing in the 1970s and everything was Kalman. Another Kalman factoid: it's **extensible** in various directions pertaining to the nonlinearity of the input space. Which leads to the second thought:
2. Mamba is extensible in the same way because it's an SSM variant. Thus, instead of making its parameters be a function of just the input, they can further be made a function of the hidden state too.
Now the second point might make the hardware optimisation moot on Nvidia GPUs, but why should we care in principle? Hardware is not Nvidia, although the reverse is true. Coming along for AI processing are analog systems, spiking systems, and indeed combinations of both. Neuromorphic chips will eventually take over the role currently occupied by GPUs. Perhaps sooner than we think.
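To make point 1 concrete for readers who haven't met it: the Kalman filter is itself a tiny state space model, with the same predict/update state recurrence that SSMs inherit. A minimal 1-D sketch (illustrative toy, not production filtering code):

```python
# Minimal 1-D Kalman filter: estimate a constant value from noisy readings.
# Static state with no process noise, so the predict step is trivial and
# only the measurement update remains.
def kalman_1d(measurements, meas_var=1.0):
    x, p = 0.0, 1e6              # state estimate and its variance (vague prior)
    for z in measurements:
        k = p / (p + meas_var)   # Kalman gain: how much to trust the new data
        x = x + k * (z - x)      # pull the estimate toward the measurement
        p = (1 - k) * p          # uncertainty shrinks with every update
    return x

print(kalman_1d([1.1, 0.9, 1.05, 0.95]))  # converges near the true value 1.0
```

The per-step recurrence (new state = function of old state and current input) is exactly the structural skeleton that S4/Mamba build on, just with learned parameters instead of derived gains.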
Amazing window into the future. States may be the missing ingredient for System 2.
200k tokens is a LOT of tokens. If you consider there are 24x3600 = 86 400 seconds in a day, and you're asleep for a quarter to a third of those, you'd have to take in about 3.5 tokens per second to reach 200k in your waking hours. Admittedly with vision, hearing, etc, you could argue you're taking in thousands of tokens per second, but we're not really far off from that in terms of extending MLLM context lengths. If you assume 10k tokens per second, generously, that's around 600 million tokens during the waking hours of the day. There are already techniques in the literature that allow us to scale context lengths beyond this size, into the billions of tokens.
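The back-of-envelope numbers in the comment above check out; here is the same arithmetic as a quick script:

```python
# Sanity-checking the token-budget estimates above.
seconds_per_day = 24 * 3600             # 86,400 seconds in a day
waking = seconds_per_day * (2 / 3)      # asleep ~1/3 of the day -> 57,600 s awake
rate_for_200k = 200_000 / waking        # ~3.5 tokens/s fills a 200k context
daily_at_10k = 10_000 * waking          # generous multimodal rate -> ~576M tokens/day
print(round(rate_for_200k, 2), int(daily_at_10k))
```

At 10k tokens/second that comes to 576 million tokens per waking day, matching the "around 600 million" figure.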
Holy shit, this is gonna be crazy if you think about it. You could initialize an "assistant" or agent with a huge prompt, but rather than including that information every time, you "save" that state space, saving compute when generating the next tokens because the prompt doesn't need to be re-processed every time. It also means agents could each have their own personalities and behaviors without significant fine-tuning requirements.
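The save-the-state idea can be sketched with a toy recurrent model (not real Mamba code; the class and update rule here are illustrative stand-ins for a fixed-size SSM hidden state):

```python
# Toy illustration: a recurrent model's fixed-size state can be snapshotted
# after a long "personality" prompt and restored later, so the prompt is
# never reprocessed. Names and the update rule are hypothetical.
import copy

class ToyRecurrentLM:
    def __init__(self):
        self.state = 0.0                 # stand-in for a fixed-size hidden state

    def step(self, token_id):
        # toy decay-and-accumulate update; a real SSM does h' = A(x)h + B(x)x
        self.state = 0.9 * self.state + token_id
        return self.state

lm = ToyRecurrentLM()
for tok in [3, 1, 4, 1, 5]:              # stand-in for a huge system prompt
    lm.step(tok)
saved_state = copy.deepcopy(lm.state)    # snapshot: the whole prompt, compressed

# Later, every new conversation resumes from the snapshot, no re-read needed.
lm.state = saved_state
lm.step(9)                               # first user token
```

The key property is that the snapshot is constant-size regardless of prompt length, unlike a Transformer's KV cache, which grows with every token.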
This was very helpful, thanks
Listening to your insight on "state decay" reminds me of this recent paper that highlights Hebbian memory as one potential strategy.
"Memoria: Hebbian Memory Architecture for Human-Like Sequential Processing"
Thank you - will check it out
YES
Always appreciate your insights. You mentioned on 80,000 Hours that you're thinking about a more organized AI scouting community. That piqued my attention. It's something I'm looking for. Would be interesting to closely follow and contribute to this state space architecture as it develops, from the start.
Love that you’re setting this up with explaining human cognition. Are you aware of the best resources to understand the state-of-the-art of human brain regions and how they operate? I feel like the way we get closest to human-like cognition is by blending the key brain regions into AI architecture
insightful video, though tbh a big part of it sounds like excitement about good old RNNs/LSTMs (+ input dependence & hardware awareness)
I would really appreciate a list of references somewhere
1:26:41 I think the next-level RNN would be one that could choose when and which input to read (including repeated readings), and when and which output to write (including overwriting outputs). Not sure if such a thing would be trainable, though. Maybe just allowing it to circle around the input and pause outputting would be a step towards that.
This is considering a task where you have unending pure noise and can be asked arbitrary questions about it. I guess the only way to deal with that is to be able to re-read the input.
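The read/re-read/emit idea above can be sketched as a recurrent loop driven by a policy that picks the next input position and decides when to write. Everything here is an illustrative stand-in (the policy is hard-coded, not learned); making these discrete choices trainable would likely need RL or a soft attention-style relaxation:

```python
# Recurrent controller where the policy chooses WHICH input position to read
# next (re-reads allowed via wrap-around) and WHEN to emit or halt.
def run_controller(inputs, policy, max_steps=20):
    state, pos, outputs = 0, 0, []
    for _ in range(max_steps):
        state += inputs[pos]                  # "read": fold chosen input into state
        action, pos = policy(state, pos, len(inputs))
        if action == "halt":
            break
        if action == "emit":
            outputs.append(state)             # the model chose this moment to write
    return outputs

# Stand-in policy: sweep forward and wrap around (so the input gets re-read),
# emit on even running sums, halt once the state exceeds 15.
def demo_policy(state, pos, n):
    if state > 15:
        return "halt", pos
    action = "emit" if state % 2 == 0 else "read"
    return action, (pos + 1) % n

print(run_controller([3, 1, 4], demo_policy))
```

The wrap-around re-reading is the "circle around the input" behavior from the comment; the emit gate is the "freeze outputting" part.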
Very informative and scary. Thank you for going so in-depth! Re the question of why they published it this way, I think it's mainly the authors' identity: they seem super pro open-source, fast-paced, collaborative approaches. Tri is even part of a community providing open-source LLMs, if I'm not mistaken. I do wonder about your own reasons for publishing this, as calling it out so much and naming it an emergency will mainly lead to more attention, and an increased chance of potentially highly capable AI, don't you think?
Where did you go for your follow-up research? What else do we have on mamba?
The other papers I read most deeply include the original HiPPO memory-encoding paper, the recent Block-State Transformers paper from DeepMind, and the StripedHyena announcement from Together AI. I had also previously read earlier papers from the authors, including the Hungry Hungry Hippos (H3) paper, and other attempts to match Transformer expressiveness like RetNet from Microsoft & Tsinghua.
Regarding the memory token concept, isn't that SPR?
Feels like this could use some visual aids
agree. speed of delivery vs production quality was a real trade-off here!
I don't know man, I have seen people try a similar idea, training a QLoRA for each customer (which is basically state), and the results have been poor compared with using the prompt as state.
Time will tell, for sure, but fine-tuning generally doesn't seem to store facts well. I haven't even been able to teach a model my name reliably that way. Compressed history, on the other hand, these models do seem able to work with.
8 minutes. Eight whole minutes on a Shopify ad? Fail. Shame on you. An insult to intelligence. Bye.
that's a labeling issue, fwiw - ad is normal length
@@nathanlabenz A lack of discipline and creativity.
A few taps take care of it and you roll forward through it 😊
How have you managed to make your video production worse over time? That Apple zoom-in/zoom-out crop effect looks trash, is distracting, and exceeds your black background. These aren't recorded live, so a second high-quality camera recording would be the simplest hack ever. And the hat... it's like you ran into a wall wearing it and said "this is fine" 😄
All substance, no style! :)
We love you
@@nathanlabenz good one