I've been following RWKV for a few months now. It was one of the first models to handle a long 10k-token context, before many other models could.
Thank you for making this video on RWKV. Very interesting.
It is shameful that big Indian corporates are yet to train a Llama 2-equivalent Indian-language LLM from the ground up, not just a fine-tuned LLM. Let me know if there are any from the likes of TCS or Infosys.
Please don't make this a nationalist competition. Indians have made fantastic contributions to AI, partly because open source AI is an international effort.
@@OccamsPlasmaGun yeah, but still... what he says is true. Have Indian corps done anything in terms of taking advantage of the AI boom to boost their infra?
Having that many languages in the multilingual test is absolutely fair. Many people need good performance in foreign languages like Japanese. While Mixtral does do decently at Japanese for example, it's still beaten by 3.5 turbo at certain tasks, like proper display of hiragana given a word in kanji.
That being said, you can mix and match LLMs to leverage the best of all of them. For example, using Mixtral to translate, then using 3.5 turbo to break down the sentence with pronunciations provided. By mixing models in this way, you get GPT-4 level results at a much higher speed and much lower cost.
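Roughly, that mix-and-match workflow amounts to chaining two model calls, where the first model's output becomes the second model's prompt. A minimal sketch (the `call_*` functions here are hypothetical stand-ins for whatever API client you actually use, not a real library):

```python
# Sketch of chaining two LLMs: one call translates, a second call
# annotates the translation with pronunciations. Both model functions
# are placeholders that echo their prompt; in practice they would hit
# real inference endpoints.

def call_mixtral(prompt: str) -> str:
    # Placeholder: would call a Mixtral endpoint in practice.
    return f"[mixtral output for: {prompt}]"

def call_gpt35(prompt: str) -> str:
    # Placeholder: would call the 3.5-turbo endpoint in practice.
    return f"[gpt-3.5 output for: {prompt}]"

def translate_then_annotate(text: str) -> str:
    # Step 1: the cheaper/faster model does the translation.
    translation = call_mixtral(f"Translate to Japanese: {text}")
    # Step 2: the second model breaks down the result with readings.
    return call_gpt35(f"Break down this sentence with readings: {translation}")

print(translate_then_annotate("good morning"))
```

The point of the split is that each model only does the sub-task it is strong at, so the whole pipeline can be faster and cheaper than a single call to a bigger model.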
thanks a lot for the quick update
We gonna have to rename you Mamba
It wrote the Python I needed on the first go. Multi-language support is key for coding skills: it improves reasoning and gives access to more code that isn't in English.
You are correct. That is not a funny joke :-D
Thank you :)
Bit strange that neither of the two biggest pros of RNN language models (higher inference performance, "cheaper" long context) was covered or measured in the demo, which limits the context window to a measly 300 tokens.
As mentioned in the video, I'm waiting for this to be integrated into transformers so I can test it on Colab. Right now the queue is huge, and there are often errors due to the queue capacity!
@@1littlecoder cool. Sorry missed that part
What about inference speeds?
RNNs are anything but mature; we literally abandoned them because they didn't work. We only figured out a year or two ago that we can use the logarithmic magic of the FFT not only to parallelize the computations but also to make them big-O faster.
None of your statements is correct:
1. RNNs are absolutely mature, and in many tasks they simply can't be replaced by transformers, especially those where an infinitely growing KV-cache is unacceptable.
2. FFT is not even present in many efficient RNNs; what matters instead is an IO-bandwidth-aware architecture, optimized operators carefully written with lower-level tools, etc.
3. Performing an FFT is in fact big-O slower than a "vanilla" RNN, which scales in O(n) compared to O(n log n) for the FFT. In particular, RWKV is O(n) in both memory and time during training and O(1) in memory during inference. No FFT. No prefix sum. Moreover, big-O complexity is not a tangible measure for every use case.
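The O(1)-memory-at-inference point just means the recurrent state has a fixed size no matter how long the sequence gets, unlike a transformer's KV-cache, which grows with every token. A toy illustration (an exponential moving average as the state update; this is deliberately not the actual RWKV cell):

```python
# Toy recurrent cell: processes tokens one at a time, keeping only a
# fixed-size hidden state. Memory does not grow with sequence length,
# unlike a transformer KV-cache. Illustrative only; the real RWKV
# update rule is more involved.

def run_rnn(tokens, state=0.0, decay=0.9):
    for t in tokens:
        # The entire "memory" is this one fixed-size state value.
        state = decay * state + (1 - decay) * t
    return state

# The state after 10 tokens and after 10,000 tokens occupies the same
# memory: a single float.
short = run_rnn([1.0] * 10)
long = run_rnn([1.0] * 10_000)
print(short, long)
```

Training-time is also O(n) here: one constant-cost update per token, with no attention over all previous positions.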
With 1 trillion tokens across all languages, it seems good. What if it were 1T tokens of English only?
Where do you guys get information about early developments like this architecture? How can I stay up to date with it? Of course, apart from the great work of 1littlecoder.
Follow AK on Twitter. My go to news source!
@@1littlecoder I don't know who AK is, can you provide a link?
paperswithcode
Andrej Karpathy?
Assuming yes
@@tonym4953
Interesting, but let me know when the next open-source model beats Mixtral 8x7B in cognitive performance.
It's only possible with better data.
How can one contact you for consulting engagements?
please email 1littlecoder at gmail dot com
Hi bro, a favour to ask. My son is in B.Tech (IT). I want him to learn AI/ML at an engineering level. He is only in his second semester. Please suggest a starting course online for him to take. I want him to do well at the core level instead of just the prompting level.
Please tell him to do the fast learning course. That is a really good starting point, and then they have a second part to it.
@@1littlecoder I guess you meant FastAI course, didn't you?
@vrynstudios
@@sammathew535 Thank you, Sam. My bad. Yes, the FastAI course by Jeremy Howard.
@@1littlecoder Thanks bro. I will surely tell him. Thanks again.
The Elon joke was either awful or meaningless, so don't worry, it definitely wasn't clear 😹
Really? I thought recurrent networks were the old school that didn't work as well as transformers? Nice.
Is there any model for Tamil?
It's the v5 paper.
The output was insane, like it was trained on 4chan 🤣, so rude.
Still waiting on the diffusion language models to dominate.
Noticed it is claiming much faster inference in terms of CUDA commands; I wonder how the memory usage during inference compares. Obviously, if it takes 10x the RAM but runs 10x faster, that would limit its desirability. Also, how did the training expense compare?
These guys seem quite heavily invested in the notion of making multilingual models and are complaining that the multilingual approach inhibits performance on the English benchmarks. Rather sad to see it as another monolithic model instead of building on the breakout success of Mixtral; that seems like the approach to emulate, and since it is a mixture of experts, it would be more apt to have some experts focused on languages without spoiling performance on other languages. I want to see an 8x2B knockoff of Mixtral. And I want to be able to plug in different experts: maybe pick a couple that are good at language and drop some coding and science ones, treating them like cards in your Pokémon deck.
That is not how true MoE works.
@@kalilinux8682 The experts are trained on different datasets. At inference, tokens are routed to two experts and the output of one of them is selected. Quite sure I am correct. Now, the routing engine may need work to allow for swapping experts in and out, but that hardly seems insurmountable.
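For reference, Mixtral-style top-2 routing combines the outputs of both selected experts with softmax weights rather than selecting just one of them. A toy sketch of that gating step (hard-coded gate scores and lambda experts standing in for learned networks):

```python
import math

# Toy top-2 mixture-of-experts routing: the gate scores every expert,
# keeps the two highest-scoring ones, and mixes BOTH outputs using
# softmax weights over their gate scores.

def top2_moe(x, experts, gate_scores):
    # Indices of the two highest-scoring experts.
    top2 = sorted(range(len(gate_scores)),
                  key=lambda i: gate_scores[i], reverse=True)[:2]
    # Softmax over just the two selected scores.
    exps = [math.exp(gate_scores[i]) for i in top2]
    weights = [e / sum(exps) for e in exps]
    # Weighted sum of the two expert outputs (not a hard selection).
    return sum(w * experts[i](x) for w, i in zip(weights, top2))

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x - 3]
print(top2_moe(2.0, experts, [0.1, 2.0, -1.0]))
```

Swapping experts in and out would indeed mostly be a question of retraining or recalibrating that gate, since the experts themselves are independent feed-forward blocks.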
I gave it a try on Hugging Face; your title is a false statement. This is far from being a good model.
My title is based on the metrics. Also, as a matter of fact, the model on Hugging Face is a base model, not a fine-tuned one. A new architecture needs more community members to chime in; I'm spreading the word for that to happen.
@@1littlecoder As you explained. Thanks, good vid!