Another interesting architecture is the Tolman-Eichenbaum Machine which is inspired by the hippocampus and lends some interesting abilities to infer latent relationships in the data.
Just as they start etching the transformer architecture onto silicon ha!
that also made me chuckle
Just bullshit...
The way you say hello community is a ray of sunshine 🌞 😊
Big smile.
That's the truth! Always love the enthusiastic hellos!
Best way to learn
It's clear transformers can be improved. Excited to see this proposal play out. Thanks for the update!
First video I've watched from you and I'm very impressed! Looking forward to watching more.
One of the problems I face when trying to implement simple models that use a latent space is the volatility of their input and output sizes. A model should never require truncation, nor should it allow inaccuracies. How, for example, do you model a compression algorithm (encode-decode) for any and all data that can exist? You are required to fix the latent space before the model, so it effectively becomes part of the preprocessing step.
This is, of course, expected and within reason.
I tend to think the solution to this problem is one that would upend most of the field.
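To make the size problem concrete, here is a toy sketch (my own illustration; FixedAutoencoder and preprocess are made-up names, not anything from the video): the input and latent dimensions are frozen when the model is built, so variable-length data has to be padded or truncated in preprocessing, which is exactly the "latent space before the model" step described above.
```python
# Toy illustration: fixed-size autoencoder, so sizing happens in preprocessing.
import torch
import torch.nn as nn

class FixedAutoencoder(nn.Module):
    def __init__(self, input_dim: int = 256, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Linear(input_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(torch.relu(self.encoder(x)))

def preprocess(data: torch.Tensor, input_dim: int = 256) -> torch.Tensor:
    # The "latent space before the model" step: pad short inputs with zeros,
    # truncate long ones. Anything beyond input_dim is lost before the model runs.
    if data.shape[-1] >= input_dim:
        return data[..., :input_dim]
    pad = torch.zeros(*data.shape[:-1], input_dim - data.shape[-1])
    return torch.cat([data, pad], dim=-1)

model = FixedAutoencoder()
x = preprocess(torch.randn(4, 300))   # arbitrary-length input forced to 256
recon = model(x)                      # lossy: the truncation already happened
```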
My intuition: transformers to capture tightly linked concepts and words within each chapter and its summarization, and Mamba for the union and interconnection of all the summarized ideas (not linked words, but linking groups of ideas that are very dispersed and distributed across chapters).
That sounds like a cool combo
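That combo could be sketched very roughly like this (my own toy code, not an existing model; SimpleSSM and HybridBlockStack are made-up names, and the SSM part here is a bare linear recurrence rather than the actual Mamba/S6 block): attention handles short-range, strongly linked tokens, and the recurrent state-space pass mixes information across the whole sequence.
```python
# Toy hybrid: attention layer for local linkage + simple linear SSM for long range.
import torch
import torch.nn as nn

class SimpleSSM(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.decay_logit = nn.Parameter(torch.zeros(dim))  # per-channel decay in (0, 1)
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, seq, dim)
        a = torch.sigmoid(self.decay_logit)                 # keep the recurrence stable
        h = torch.zeros(x.shape[0], x.shape[2])
        ys = []
        for t in range(x.shape[1]):                         # recurrent scan, O(seq)
            h = a * h + self.b * x[:, t]
            ys.append(self.c * h)
        return torch.stack(ys, dim=1)

class HybridBlockStack(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ssm = SimpleSSM(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local, _ = self.attn(x, x, x)   # short-range, tightly linked tokens
        return self.ssm(x + local)      # long-range mixing across the sequence

out = HybridBlockStack()(torch.randn(2, 128, 64))
```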
The GPT family of models is a decoder-only architecture, which is not covered by the patent.
GPT (Generative Pretrained Transformer)
@code4AI Yes, GPT models are transformers. But they are not the type of transformer architecture covered by Google’s patent. Google’s patent is for the original encoder-decoder architecture only. GPT models are decoder-only, which is a different type of architecture.
Excellent high level overview.
Can you make more content on state space models?
What's stored in the real space if not the position? Isn't the example phase space storing an even bigger vector, because it now stores not only the position of the center of mass but also the velocity?
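For what it's worth, here is a tiny worked example of that trade-off (my own toy code, a mass on a spring, not necessarily the example from the video): the phase-space vector per time step is indeed bigger, but carrying velocity alongside position is what makes the next state computable from the current one alone, as a single first-order update.
```python
# Toy phase-space example: state = [position, velocity] for a mass on a spring.
import numpy as np

k, m, dt = 1.0, 1.0, 0.01               # spring constant, mass, time step
A = np.array([[0.0, 1.0],               # d/dt [x, v] = A @ [x, v]
              [-k / m, 0.0]])

state = np.array([1.0, 0.0])            # start at x = 1 with zero velocity
for _ in range(1000):
    state = state + dt * (A @ state)    # simple Euler step: first-order update

print(state)                            # position AND velocity after 10 time units
```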
Great coverage, and thanks once again. One issue I am grappling with is attention, which for transformers is managed at "run-time" (i.e. inference) on the prompt, whereas Mamba seems to capture this concept entirely during training. No need for an attention matrix, as with transformers. Very long context windows, improved access to early information from the stream, and faster performance. Love all this.
My concern / reasoning: removing the "run-time" attention at inference means we're relying on statistical understandings of language from training. For prompts that differ quite a bit from the training data, can Mamba LLMs excel at activities that aim for creativity and brainstorming?
It also seems to me that training Mamba LLMs on multiple languages may degrade predictability in any one language, since the "attention" (conceptually) is calculated at training time. But I am still pondering this; I certainly may be wrong as I wrap my head around it!
Like your question. I am struggling to find benchmarks on the ICL performance of Mamba-like systems. Actual performance data in direct comparison with current-generation LLMs is also missing. And some authors hint that the few-shot ability might be associated with the self-attention mechanism itself? That would pose some serious limitations for state space systems, linear RNNs and the like, if I lose the ability to inject new data and info in my prompt and have the system understand the new semantic configuration and its semantic correlations (e.g. for reasoning).
But I trust the open source community to come up with advanced solutions ....
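For anyone curious, a back-of-the-envelope sketch of the contrast being discussed here (my own toy code, nothing from the video): attention builds a fresh seq-by-seq score matrix over the prompt at inference time, while a recurrent/state-space pass folds the prompt into a fixed-size state, one token at a time, which is where the in-context-learning worry comes from.
```python
# Toy contrast: O(seq^2) attention over the prompt vs. O(seq) recurrent state.
import numpy as np

rng = np.random.default_rng(0)
seq, dim = 512, 64
tokens = rng.normal(size=(seq, dim))

# Transformer-style: pairwise scores over the *current* prompt at inference.
scores = tokens @ tokens.T / np.sqrt(dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attended = weights @ tokens                     # every token sees every other token

# Recurrent/SSM-style: a single running state; per-pair weights never materialize.
decay = 0.95
state = np.zeros(dim)
for t in range(seq):
    state = decay * state + (1 - decay) * tokens[t]

print(attended.shape, state.shape)              # (512, 64) vs. (64,)
```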
Conceptually, this is brilliant: Savoir-Faire for Accuracy and Precision.
However, a deeper treatment of the non-matrix mathematics and the challenges of serial hardware engineering would be greatly appreciated.
I do think that an artificial brain should plug into many engines step by step, such as an arithmetic calculator, a logical reasoner, a theorem prover, etc., turning it into something cyborg-like.
Thank you!
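A minimal sketch of that "plug into many engines" idea (my own toy code; the tool table and the route() heuristic are hypothetical stand-ins, not any real framework): each step of the model's output gets routed to an external engine instead of being answered from the weights.
```python
# Toy tool dispatch: route each step to an external "engine".
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy only
    "echo": lambda text: text,  # stand-in for a logical reasoner / theorem prover
}

def route(step: str) -> str:
    # Crude routing heuristic: arithmetic-looking steps go to the calculator.
    name = "calculator" if any(ch in step for ch in "+-*/") else "echo"
    return TOOLS[name](step)

for step in ["2*(3+4)", "prove: all men are mortal"]:
    print(step, "->", route(step))
```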
I appreciate your attempt at simplifying and introducing how state spaces are used in a very particular application to dynamical systems. However, I am afraid you are missing quite a lot and are, perhaps, confused about the mathematics.
How so?
Please further elaborate your claim.
Your comment made it into my next video on BEYOND MAMBA (ua-cam.com/video/C2fFL8pVX2M/v-deo.html) and provided a beautiful transition from the origins of State Space mathematics to overcoming the limitations of the current S4 and S6 State Space Models. Hope the new video clears up your mistake and that you learned that interdisciplinary work (from physics to statistics and time series) is something beautiful. Thanks for your comment.
I’m a new fan.
Hi, I am developing an offline chatbot with RAG. Should I use Llama 7B as the LLM, or should I choose the Zephyr 7B model? It needs to work locally without internet.
I don't know the details of your project, but I had the best experience with dolphin-mistral 7b 2.2.1.
@qwertydump4720 I am building an offline chatbot as a graduation project, so I may need a lot of information about the model I'm going to use.
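Whichever model you pick, here is a minimal local-only sketch (my own example; the model directory is just a placeholder for whatever 7B weights you download ahead of time) of wiring retrieved context into the prompt with Hugging Face transformers: once the weights are on disk, no internet access is needed at run time.
```python
# Toy local RAG answer step: pre-downloaded weights, no network calls at run time.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "./models/my-7b-model"   # placeholder: pre-downloaded Llama/Zephyr/Mistral weights

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, local_files_only=True)

def answer(question: str, retrieved_context: str) -> str:
    # RAG step: paste the retrieved passages into the prompt.
    prompt = f"Context:\n{retrieved_context}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```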
I came to see a new better means of AC voltage conversion. I was disappointed.
Interesting
(And also all the replies here; there doesn't seem to be a place anymore where thinkers can exchange ideas.)
Do you know of a model using this concept (to try out in LM Studio or in a Jupyter notebook)?
Personally I think the way LLMs work/are trained is not the way to go.
Too many useless facts inside them; for facts they should just use a callout to Wikipedia or other sites.
LLMs' 'world domain' should be language: no politics, no famous people, but theoretical skills, translations, medicine, law, math, physics, coding, etc. Not who Trump or JF Kennedy or Madonna was. Those gigs should be removed.
This is not good for that startup that is building transformer chips.
exactly what I thought
Rip startup. Died before birth.
I think the jury is still out. There isn't enough real-world usage [yet] to say how well the Mamba arch really performs against Transformers. Over the past couple of years, we (at large) have been able to evaluate a variety of business use cases for Transformer-based LLMs. We have no idea how the Mamba arch will compare in those same use cases.