RNN just outperformed TRANSFORMERS!!!

  • Published 27 Oct 2024

COMMENTS • 45

  • @hackandtech24 • 9 months ago • +8

    RWKV is something I've been following for a few months now. It was one of the first models to handle a long 10k context, well before many other models.

  • @marcfruchtman9473 • 9 months ago

    Thank you for making this video on RWKV. Very interesting.

  • @suvirmisra • 9 months ago • +3

    It is shameful that big Indian corporates have yet to train a Llama 2-equivalent Indian-language LLM from the ground up, not just a fine-tuned LLM. Let me know if there are any from the likes of TCS or Infosys.

    • @OccamsPlasmaGun • 8 months ago

      Please don't make this a nationalist competition. Indians have made fantastic contributions to AI, partly because open source AI is an international effort.

    • @BlakeTedKord • 8 months ago • +1

      @@OccamsPlasmaGun Yeah, but still... what he says is true. Have Indian corporations done anything to take advantage of the AI boom to boost their infrastructure?

  • @blisphul8084 • 9 months ago • +2

    Having that many languages in the multilingual test is absolutely fair. Many people need good performance in foreign languages like Japanese. While Mixtral does decently at Japanese, for example, it's still beaten by GPT-3.5 Turbo at certain tasks, like correctly giving the hiragana reading for a word written in kanji.

    • @blisphul8084 • 9 months ago • +1

      That being said, you can mix and match LLMs to leverage the best of each. For example, use Mixtral to translate, then use GPT-3.5 Turbo to break down the sentence with pronunciations. By mixing models this way, you get GPT-4-level results at much higher speed and much lower cost.
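
A rough sketch of that kind of two-model pipeline, assuming both models are reachable through OpenAI-compatible chat endpoints (the base URL, API keys, model names, and prompts below are placeholders, not the commenter's actual setup):

```python
from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint for Mixtral; swap in your own provider.
mixtral = OpenAI(base_url="https://example-provider.com/v1", api_key="MIXTRAL_KEY")
openai_client = OpenAI(api_key="OPENAI_KEY")

def translate_then_explain(japanese_sentence: str) -> str:
    # Step 1: Mixtral handles the raw translation.
    translation = mixtral.chat.completions.create(
        model="mixtral-8x7b-instruct",  # placeholder model name
        messages=[{"role": "user",
                   "content": f"Translate to English: {japanese_sentence}"}],
    ).choices[0].message.content

    # Step 2: GPT-3.5 Turbo breaks the sentence down with readings.
    breakdown = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": (f"Sentence: {japanese_sentence}\n"
                               f"Translation: {translation}\n"
                               "Break the sentence down word by word and "
                               "give the hiragana reading of each kanji.")}],
    ).choices[0].message.content
    return breakdown

print(translate_then_explain("今日は天気がいいですね。"))
```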

  • @amortalbeing • 9 months ago

    Thanks a lot for the quick update.

  • @matten_zero • 9 months ago • +2

    We gonna have to rename you Mamba

  • @KevinKreger • 9 months ago

    It wrote the Python I needed on the first go. Multilingual ability is key for coding skills: it improves reasoning and gives access to more code that isn't in English.

  • @actorjohanmatsfredkarlsson2293 • 9 months ago • +1

    You are correct. That is not a funny joke :-D

  • @alx8439 • 9 months ago • +1

    A bit strange that neither of the two biggest pros of RNN language models (higher inference performance, "cheaper" long context) was covered or measured in the demo, which limits the context window to a measly 300 tokens.

    • @1littlecoder • 9 months ago • +1

      As mentioned in the video, I'm waiting for this to be integrated with transformers so I can test it on Colab. Right now the queue is huge, and there are often errors due to the queue capacity!

    • @alx8439 • 9 months ago

      @@1littlecoder Cool. Sorry, I missed that part.

  • @MrSur512 • 9 months ago

    What about inference speeds?

  • @jonatan01i • 9 months ago • +2

    RNNs are anything but mature; we literally abandoned them because they didn't work. We only figured out a year or two ago that we can use the logarithmic magic of the FFT to not only parallelize the computations but also make them asymptotically faster.

    • @theodorlemerle4232 • 9 months ago

      None of your statements is correct:
      1. RNNs are absolutely mature, and in many tasks they simply can't be replaced by transformers, especially those where an infinitely growing KV cache is unacceptable.
      2. FFT is not even present in many efficient RNNs; instead they rely on IO-bandwidth-aware architectures, optimized operators carefully written with lower-level tools, etc.
      3. Performing an FFT is in fact asymptotically slower than a "vanilla" RNN, which scales as O(n), compared to O(n log n) for the FFT. In particular, RWKV is O(n) in both memory and time during training and O(1) in memory during inference. No FFT. No prefix sum. Moreover, big-O complexity is not a meaningful measure for every use case.
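
To make the O(n)-time, O(1)-memory inference claim concrete, here is a toy sketch of a decayed linear recurrence in the spirit of RWKV-style models. The scalar keys, fixed decay constant, and normalization are simplifications for illustration; this is not the actual RWKV kernel.

```python
import numpy as np

def linear_rnn_generate(keys, values, decay=0.9):
    """Toy linear recurrence: the running (num, den) state is a fixed-size
    summary of the entire history, so per-token memory is O(1) in sequence
    length and total time is O(n)."""
    d = values.shape[-1]
    num = np.zeros(d)   # decayed sum of exp(k_t) * v_t
    den = 0.0           # decayed sum of exp(k_t)
    outputs = []
    for k_t, v_t in zip(keys, values):
        w = np.exp(k_t)
        num = decay * num + w * v_t
        den = decay * den + w
        outputs.append(num / (den + 1e-8))  # weighted average of past values
    return np.stack(outputs)

# 1000-token toy sequence with scalar keys and 8-dim values.
T, d = 1000, 8
out = linear_rnn_generate(np.random.randn(T), np.random.randn(T, d))
print(out.shape)  # (1000, 8) -- the state size never grew with T
```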

  • @MrSur512 • 9 months ago

    With 1 trillion tokens across all languages, it seems good. What if it were 1T tokens of English only?

  • @VeioooOOO • 9 months ago

    Where do you guys get information about early developments like this architecture? How can I stay up to date with it? Apart from the great work of 1littlecoder, of course.

    • @1littlecoder • 9 months ago • +1

      Follow AK on Twitter. My go-to news source!

    • @purecheese9012 • 9 months ago

      @@1littlecoder I don't know who AK is; can you provide a link?

    • @quebono100 • 9 months ago

      paperswithcode

    • @tonym4953 • 9 months ago

      Andrej Karpathy?

    • @augmentos • 9 months ago

      @@tonym4953 Assuming yes.

  • @lexuscrow1932 • 9 months ago

    Interesting, but let me know when the next open-source model beats Mixtral 8x7B in cognitive performance.

    • @MrSur512 • 9 months ago

      It's only possible with better data.

  • @AjayKumar-nt7lx • 9 months ago

    How can one contact you for consulting engagements?

    • @1littlecoder • 9 months ago

      please email 1littlecoder at gmail dot com

  • @vrynstudios • 9 months ago

    Hi bro, a small favor: my son is doing a B.Tech (IT). I want him to learn AI/ML at an engineering level. He is only in his second semester. Please suggest a starting online course for him to take. I want him to do well at the core level rather than just the prompting level.

    • @1littlecoder • 9 months ago

      Please tell him to do the fast learning course. That's a really good starting point, and there's a second part to it as well.

    • @sammathew535 • 9 months ago • +1

      @@1littlecoder I guess you meant the FastAI course, didn't you? @vrynstudios

    • @1littlecoder • 9 months ago • +1

      @@sammathew535 Thank you, Sam. My bad. Yes, the FastAI course by Jeremy Howard.

    • @vrynstudios • 9 months ago • +1

      @@1littlecoder Thanks bro. I will surely tell him. Thanks again.

  • @randombubby1 • 9 months ago • +2

    The Elon joke was either awful or meaningless, so don't worry, it definitely wasn't clear 😹

  • @d3mist0clesgee12 • 9 months ago

    Really? I thought these neural networks were the old-school approach that didn't work as well as transformers? Nice.

  • @SR-zi1pw • 9 months ago

    Is there any model for Tamil?

  • @augmentos • 9 months ago

    It's the v5 paper.

  • @damien2198 • 9 months ago

    The output was insane, like it was trained on 4chan 🤣. So rude.

  • @zyxwvutsrqponmlkh • 9 months ago

    Still waiting on diffusion language models to dominate.
    I noticed it claims much faster inference in terms of CUDA commands. I wonder how the memory usage during inference compares; obviously, if it takes 10x the RAM but runs 10x faster, that would limit its desirability. Also, how did the training expense compare?
    These guys seem heavily invested in the idea of multilingual models, and they complain that the multilingual approach hurts performance on the English benchmarks. It's rather sad to see another monolithic model instead of building on the breakout success of Mixtral; that seems like the approach to emulate, and since it's a mixture of experts, it would be better suited to having some experts focused on languages without spoiling performance on others. I want to see an 8x2B knockoff of Mixtral. And I want to be able to plug in different experts: maybe pick a couple that are good at languages and drop in some coding and science ones, treating them like cards in your Pokémon deck.

    • @kalilinux8682 • 9 months ago

      That is not how true MoE works.

    • @zyxwvutsrqponmlkh • 9 months ago

      @@kalilinux8682 The experts are trained on different datasets. At inference, tokens are routed to two experts and the output of one of them is selected. I'm quite sure I'm correct. The routing engine might need work to allow experts to be swapped in and out, but that hardly seems insurmountable.
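
For reference, here is a toy sketch of how top-2 routing in a Mixtral-style MoE layer is usually described: a learned gate scores every expert for each token, the two highest-scoring experts run, and their outputs are combined with softmax weights from the gate. The dimensions and the random "experts" below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Purely illustrative "experts": one linear map each (real ones are MLPs).
experts = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
           for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

def moe_layer(x):
    """Route one token through its top-2 experts and mix their outputs."""
    logits = x @ gate_w                    # gate score for every expert
    top = np.argsort(logits)[-top_k:]      # indices of the 2 best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (16,) -- a weighted blend of two expert outputs
```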

  • @Aiducateur • 9 months ago

    I gave it a try on Hugging Face; your title is a false statement. This is far from being a good model.

    • @1littlecoder • 9 months ago • +1

      My title is based on the metrics. Also, as a matter of fact, the model on Hugging Face is a base model, not a fine-tuned one. A new architecture needs more community members to chime in, and I'm spreading the word so that happens.

    • @augmentos • 9 months ago

      @@1littlecoder As you explained. Thanks, good vid.