He is one of the lead authors of this new Mamba architecture
And S4, and the SSM paper before that lol
Excellent presentation. Thank you!
Amazing talk, and impressive research. Thanks.
Impressive presentation. Thank you!
Excellent presentation
Awesome!
Super interesting! Thanks for the presentation.
I work in game development for now, but cool to see how things are going in the ML world 😊
Excellent presentation and impressive research! I only wonder why SSMs are efficient as recurrences (video timestamp: 32:27).
Suppose k is the token length of the input history. A general sequence model such as a transformer has O(k^2) time complexity. SSMs, on the other hand, still need to encode all of the stateful history recurrently. The S4 paper also addresses this issue (multiplying A up to k-1 times to build the K_bar kernel, which also ends up nearly O(k^2)) by diagonalizing the matrix.
So it seems SSM recurrence isn't "naturally" efficient, but requires some linear algebra technique.
Any suggestion will be appreciated!
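For reference, a sketch of the construction the question refers to (notation as in the S4 paper; L is the sequence length):

```latex
% Discrete SSM recurrence and its convolutional unrolling (as in the S4 paper)
x_k = \bar{A} x_{k-1} + \bar{B} u_k, \qquad y_k = \bar{C} x_k
% Unrolling the recurrence over L steps yields a convolution with the kernel
\bar{K} = \left( \bar{C}\bar{B},\ \bar{C}\bar{A}\bar{B},\ \dots,\ \bar{C}\bar{A}^{L-1}\bar{B} \right),
\qquad y = \bar{K} * u
```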
Why do you need to learn the delta? For example, in the ECG example, you already know the sampling rate of the data, right?
So good!
Would subspace identification help to initialize A, B, C, and D?
Thanks for a very nice presentation.
At 44:17 (Algorithm 1) you mentioned that "we've been developing simplifications of the model that allow you to bypass all of this and do things much more simply."
Has that already been done by now?
There were two follow-ups on simpler diagonal state space models: DSS (arxiv.org/abs/2203.14343) and S4D (arxiv.org/abs/2206.11893). The code for these is also available from the main repository
Thanks for the amazing talk and work! Maybe it's trivial, but I wonder how you actually reconstruct the signal from the hidden state, i.e., what does C look like? (at 23:50)
Just as A and B have specific formulas, there is a corresponding formula for C (related to evaluations of Legendre polynomials) that can be used for reconstruction. Notebooks for reproducing the plots in this talk are available in the official repository
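Roughly, reconstruction evaluates the Legendre basis weighted by the state coefficients. A minimal sketch of the idea, assuming the HiPPO-LegT convention (state coordinate n holds the coefficient of the n-th normalized Legendre polynomial over a lookback window); `reconstruct` and its arguments are illustrative names, not the repository's API:

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

def reconstruct(x, window, num_points=200):
    """Approximate the recent input from hidden state x (length N).

    Assumes x[n] is the coefficient of the n-th orthonormal Legendre
    basis function over [0, window] (HiPPO-LegT-style convention).
    """
    s = np.linspace(0.0, window, num_points)  # positions inside the window
    u = 2.0 * s / window - 1.0                # map [0, window] -> [-1, 1]
    f_hat = np.zeros_like(s)
    for n, coeff in enumerate(x):
        # Orthonormal basis element: sqrt(2n + 1) * P_n(u)
        f_hat += coeff * np.sqrt(2 * n + 1) * Legendre.basis(n)(u)
    return s, f_hat
```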
Regarding the speech classification example (53:53):
Theoretically, I'm not convinced why the model should work when it is tested at a different sampling rate than it was trained on.
As we know, A_bar and B_bar are calculated from delta_t (as well as from A and B). So the sampling rate affects A_bar and B_bar, and therefore we are effectively training A_bar and B_bar for that specific sampling rate.
Can you please clarify what I am missing here?
Thank you in advance!
Instead of training A_bar and B_bar, the parameters that are trained are A, B, and Delta. At test time on a different sampling rate, Delta can simply be multiplied by the relative change in rate (for the given experiment, Delta would be doubled at test time without retraining any parameters)
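A minimal NumPy sketch of this, assuming the bilinear (Tustin) discretization from the S4 paper; the toy A, B, and `delta_train` values are made up for illustration:

```python
import numpy as np

def discretize(A, B, delta):
    """Bilinear discretization used in S4:
    A_bar = (I - delta/2 * A)^{-1} (I + delta/2 * A)
    B_bar = (I - delta/2 * A)^{-1} * delta * B
    """
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - (delta / 2.0) * A)
    return inv @ (I + (delta / 2.0) * A), inv @ (delta * B)

A = -np.eye(4)            # toy stable A, not the actual HiPPO matrix
B = np.ones((4, 1))
delta_train = 1e-3        # hypothetical train-time step size

A_bar, B_bar = discretize(A, B, delta_train)

# Test time at half the sampling rate: samples are twice as far apart
# in continuous time, so only Delta changes; A and B are reused as-is.
A_bar_test, B_bar_test = discretize(A, B, delta_train * 2.0)
```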
I hope there will be Chinese subtitles; my English listening isn't great.
Isn't there auto-translation?