RWKV: Reinventing RNNs for the Transformer Era (Paper Explained)

  • Published 6 Jun 2024
  • #gpt4 #rwkv #transformer
    We take a look at RWKV, a highly scalable architecture between Transformers and RNNs.
    Fully Connected (June 7th in SF) Promo Link: www.fullyconnected.com/?promo...
    OUTLINE:
    0:00 - Introduction
    1:50 - Fully Connected In-Person Conference in SF June 7th
    3:00 - Transformers vs RNNs
    8:00 - RWKV: Best of both worlds
    12:30 - LSTMs
    17:15 - Evolution of RWKV's Linear Attention
    30:40 - RWKV's Layer Structure
    49:15 - Time-Parallel vs Sequence Mode
    53:55 - Experimental Results & Limitations
    58:00 - Visualizations
    1:01:40 - Conclusion
    Paper: arxiv.org/abs/2305.13048
    Code: github.com/BlinkDL/RWKV-LM
    Abstract:
    Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, which parallelizes computations during training and maintains constant computational and memory complexity during inference, leading to the first non-transformer architecture to be scaled to tens of billions of parameters. Our experiments reveal that RWKV performs on par with similarly sized Transformers, suggesting that future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling the trade-offs between computational efficiency and model performance in sequence processing tasks.
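    For a concrete picture of the recurrent "RNN mode" described in the abstract, here is a minimal Python sketch (ours, not the authors' code) of the WKV recurrence used in the paper's time-mixing block. Token-shift interpolation, layer norm, and the numerical-stability trick are omitted, and the parameter names are our own.

    ```python
    import numpy as np

    def rwkv_time_mixing_step(x, state, params):
        """One recurrent (inference-mode) step of RWKV-style time mixing.

        Simplified sketch of the paper's WKV recurrence: a decayed running
        numerator/denominator replaces the quadratic attention matrix.
        """
        a_prev, b_prev = state                # per-channel running numerator / denominator
        W_r, W_k, W_v, W_o, w, u = params     # w: per-channel decay, u: current-token bonus

        r = 1.0 / (1.0 + np.exp(-(W_r @ x)))  # receptance (sigmoid gate)
        k = W_k @ x
        v = W_v @ x

        # weighted average over the past, with the current token weighted separately
        wkv = (a_prev + np.exp(u + k) * v) / (b_prev + np.exp(u + k))

        # decay the running sums and fold in the current token
        a = np.exp(-w) * a_prev + np.exp(k) * v
        b = np.exp(-w) * b_prev + np.exp(k)

        return W_o @ (r * wkv), (a, b)        # constant memory per step, regardless of history length
    ```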
    Authors: Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, Rui-Jie Zhu
    Links:
    Homepage: ykilcher.com
    Merch: ykilcher.com/merch
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: ykilcher.com/discord
    LinkedIn: / ykilcher
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 125

  • @YannicKilcher
    @YannicKilcher 1 year ago +14

    Fully Connected (June 7th in SF) Promo Link: www.fullyconnected.com/?promo=ynnc

    • @xxdaggerxx5
      @xxdaggerxx5 1 year ago

      stop with these 1hr videos and summarize this shit

    • @akashkarnatak6581
      @akashkarnatak6581 1 year ago +3

      @@xxdaggerxx5 this is for people who want to understand the paper in depth. If you want a summary, read the abstract.

  • @Fanney3
    @Fanney3 1 year ago +56

    Can't believe someone is just doing this work and sharing it. Amazing.

  • @MindFactoryAI
    @MindFactoryAI 1 year ago +61

    Always impressed how you record these in a single take. Great explanation, thanks!

  • @TTTrouble
    @TTTrouble 1 year ago +111

    Jesus, keeping up with the literature in this field must be absolutely exhausting for those of you who actually work in it.

    • @arshzahed1970
      @arshzahed1970 1 year ago +19

      In the past year, my backlog of papers to go through has grown exponentially. Just staying up to date is a full-time job now.

    • @victoraranda3349
      @victoraranda3349 1 year ago +3

      Sometimes it do be like that

    • @mlopolis
      @mlopolis 1 year ago +10

      You should use LLMs to get the most important points from each paper and then you can stay on top 😊

    • @Will-kt5jk
      @Will-kt5jk 1 year ago +5

      @@mlopolis at a human-machine system level, that sounds like a self-improving augmentation. Maybe it helps explain the exponential increase in papers…😅

    • @raynhardtvanzyl4729
      @raynhardtvanzyl4729 1 year ago +1

      Yup...

  • @mgostIH
    @mgostIH 1 year ago +29

    Keep in mind that at 21:00, regarding memory usage of attention, current approaches like "FlashAttention" and "Attention doesn't need O(N^2) memory" have drastically reduced the memory needed to run transformers, which is what allows approaches like ChatGPT to have such a long context.

    • @erickmacias5153
      @erickmacias5153 1 year ago

      But attention in GPT does use N^2 memory, doesn't it?

    • @mgostIH
      @mgostIH 1 year ago +9

      @@erickmacias5153 In older public models like GPT-2, yes, but the papers I mentioned above provide implementations that are mathematically equivalent to the standard way of doing attention; you can use them as drop-in replacements and get improved performance during training and inference.
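      For context, here is a minimal sketch (ours, not the FlashAttention kernel) of the chunked online-softmax idea those papers build on: exact attention without ever materializing the full N x N score matrix, though the time cost stays quadratic. No causal mask, for brevity.

      ```python
      import numpy as np

      def chunked_attention(q, k, v, chunk=128):
          """Exact softmax attention, processing keys/values in chunks.

          Memory is O(N*d + chunk*d) instead of O(N^2); time is still O(N^2).
          """
          n, d = q.shape
          out = np.zeros_like(v, dtype=float)
          row_max = np.full(n, -np.inf)            # running max of scores per query
          row_sum = np.zeros(n)                    # running softmax denominator per query

          for start in range(0, n, chunk):
              kc, vc = k[start:start + chunk], v[start:start + chunk]
              s = q @ kc.T / np.sqrt(d)            # scores against this key chunk only

              new_max = np.maximum(row_max, s.max(axis=1))
              scale = np.exp(row_max - new_max)    # rescale previously accumulated results
              p = np.exp(s - new_max[:, None])

              out = out * scale[:, None] + p @ vc
              row_sum = row_sum * scale + p.sum(axis=1)
              row_max = new_max

          return out / row_sum[:, None]
      ```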

  • @YvesQuemener
    @YvesQuemener 1 year ago +15

    Can't say thank you enough! Diving into RWKV has been on my todo list for two months at least, and when I saw the YouTube alert I immediately felt relieved that instead of a full day of trying to understand the paper and the code, you would provide the important parts in one hour. And it delivered! I agree that it is kind of stretching the definition of attention to call what they are doing "linear attention". I am not sure that calling it a ConvNet is actually less stretchy btw :-) But anyway thanks a lot!

  • @itayatelis2898
    @itayatelis2898 11 months ago +1

    Amazing! Thank you for doing this! You're amazing! I hope you keep doing it weekly.

  • @halocemagnum8351
    @halocemagnum8351 1 year ago

    Amazing explanation! Great video. I had been reading all of the RWKV posts on the r/MachineLearning subreddit, but I don't think I fully grasped it till this review.

  • @johnnypeck
    @johnnypeck 1 year ago +3

    This is awesome. I've seen the use of RNNs percolating on Twitter for a bit. Glad you're covering it. That is a lot of authors.

  • @andres_pq
    @andres_pq 1 year ago +2

    Great to see you do paper explanations again!

  • @hansdietrich1496
    @hansdietrich1496 1 year ago +4

    The best in-depth AI channel out there, chapeau!

  • @dairin0d
    @dairin0d 1 year ago +4

    @YannicKilcher would be interesting to hear your take on hyperdimensional computing / vector symbolic architectures :-)
    It seems like a really cool idea, though I can't quite wrap my head around (or maybe wasn't able to find a clear explanation) how it's actually supposed to interface with non-symbolic inputs (e.g. images) or learn complex structured concepts from data.

  • @sheevys
    @sheevys 1 year ago +4

    Haha, that's a quick reaction, your "all you need" pun was defo not intended.

  • @mattanimation
    @mattanimation 1 year ago +1

    was waiting for this one, thanks!

  • @jondo7680
    @jondo7680 1 year ago +2

    Just to give feedback, that example with "I'm the word cat" was just great. It helped me make sure whether I understood you right or not.

  • @Sciencehub-oq5go
    @Sciencehub-oq5go 1 year ago

    Thankful for your work!

  • @ChaseFreedomMusician
    @ChaseFreedomMusician 1 year ago

    THANK GOD! Somebody is finally talking about RWKV!

  • @noagarnett
    @noagarnett 11 months ago

    Thanks Yannic for (another) great video! Really amazing that you do all this work and share it. It's worth a lot for me and the likes of me. The paper discussed is also very impressive.
    I might be wrong, but I think there is a confusion in the explanation. At 24:53 and 30:21, you claim that the k_i modulation is defined by the current token ("if I am "cat" I should probably look 3 tokens behind"), but if I understand correctly, it is defined by the referred token and is the same for all following positions ("if I am "cat" I should probably have a big influence on all the tokens following me").
    Did I mix it up?
    Thanks!
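    For reference, the weighting being discussed has the form below (our transcription of the paper's time-mixing equation): the decay term depends only on the distance between positions, while k_i is computed from token i itself, i.e. from the token being attended to.

    $$
    wkv_t \;=\; \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)\,w + k_i} \odot v_i \;+\; e^{\,u + k_t} \odot v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)\,w + k_i} \;+\; e^{\,u + k_t}}
    $$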

  • @smnt
    @smnt 1 year ago

    Hey Yannic, quick question:
    What do you mean when you say RNNs don't scale well, or that you might "just need models that scale"? What does model scaling mean to you? I've definitely seen people stack RNNs and it seemingly works just fine. I thought the issue with RNNs was that they lose context pretty quickly even though their context length is "infinite".
    Thanks for the video as always, love it!

  • @killers31337
    @killers31337 1 year ago +2

    If recalling information in long contexts is the problem, perhaps throwing in a few transformer layers would solve that?
    E.g. something like language parsing can be done using just an RNN, as the information is largely local.
    E.g. if you have 20 layers in total, layers 1..10 would be RNNs, then layer 11 is a transformer, then 12..20 are RNNs again. Then the "quadratic" part is only 1/20th of the NN.
    Yes, it would route only 1/20th of the information a full transformer would, but if only a few important pieces of the context are necessary, that might be enough.

  • @strawberryfield891
    @strawberryfield891 11 months ago

    Thank you very much for the great video!!
    Are channels roughly equivalent to multi-heads in transformers?

  • @addoul99
    @addoul99 1 year ago

    Hi, are the weights for the linear layer Wv tied between a pair of channel-mixing and time-mixing blocks?

  • @justfoundit
    @justfoundit 1 year ago +6

    If we revisit ideas, maybe we could try shared-weight transformers. It worked for CNNs. Minor memory footprint of the model; it's easy to achieve the effect of hundreds of billions of parameters by just repeating the same layer multiple times.
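    If anyone wants to try the idea, here is a minimal PyTorch-style sketch (ours; ALBERT and Universal Transformers explore this direction) of reusing a single encoder layer across depth:

    ```python
    import torch.nn as nn

    class SharedWeightEncoder(nn.Module):
        """Applies the same Transformer encoder layer `depth` times."""

        def __init__(self, d_model=512, nhead=8, depth=24):
            super().__init__()
            self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.depth = depth

        def forward(self, x):
            for _ in range(self.depth):
                x = self.layer(x)   # same weights reused at every depth step
            return x
    ```

    Note that the parameter count stays at a single layer's worth; what grows with repetition is the effective depth and the compute.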

  • @clementdato6328
    @clementdato6328 1 year ago

    Does it explain how the error signal is passed through time, or is it implicitly assumed that BPTT is used?

  • @simonstrandgaard5503
    @simonstrandgaard5503 1 year ago

    Great explanation

  • @debanjandas7738
    @debanjandas7738 10 months ago

    In the AFT attention equation, the weight associated with token i for input token t is given by w(t,i)+k(i) => how do we add a scalar to a vector? Wouldn't it have been more appropriate to do w(t,i)*k(i)?
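    As we read the AFT formulation, w(t,i) is a learned scalar position-pair bias and k(i) is a vector, so the addition broadcasts over the feature dimension; and because it happens inside the exponential, the additive form is already a multiplicative re-weighting after the exp (a worked identity, not a claim about which choice is better):

    $$
    \exp\!\big(w_{t,i} + k_i\big) \;=\; e^{\,w_{t,i}} \cdot e^{\,k_i},
    \qquad
    Y_t \;=\; \sigma(q_t) \odot \frac{\sum_{i \le t} e^{\,w_{t,i}}\, e^{\,k_i} \odot v_i}{\sum_{i \le t} e^{\,w_{t,i}}\, e^{\,k_i}}
    \quad \text{(causal variant).}
    $$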

  • @erickmarin6147
    @erickmarin6147 1 year ago +1

    Balding king you dropped this 👑

  • @banseoklee392
    @banseoklee392 1 year ago

    Awesome!! Thank you sooooooooo much

  • @edwardfanboy
    @edwardfanboy 1 year ago

    It looks like it would be possible to parallelize the WKV step across time using something akin to a parallel prefix sum.
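    That intuition seems right: the WKV numerator and denominator are first-order linear recurrences of the form s_t = d * s_{t-1} + x_t, and such recurrences admit an associative combine rule, so a parallel scan applies. A toy numpy sketch (ours, not the RWKV CUDA kernel) of a Hillis-Steele style scan:

    ```python
    import numpy as np

    def decayed_cumsum_parallel(x, decay):
        """s_t = decay * s_{t-1} + x_t for all t, in O(log N) doubling steps.

        Each position holds an affine map s -> d*s + b over its window; every step
        composes it with the map of the window 2**k positions to its left
        (Hillis-Steele inclusive scan; the combine rule is associative).
        """
        n = len(x)
        d = np.full(n, float(decay))             # multiplicative part of each map
        b = np.array(x, dtype=float)             # additive part of each map
        shift = 1
        while shift < n:
            d_left = np.concatenate([np.ones(shift), d[:-shift]])    # identity maps as padding
            b_left = np.concatenate([np.zeros(shift), b[:-shift]])
            b = d * b_left + b                   # compose: left window first, then current one
            d = d * d_left
            shift *= 2
        return b                                 # value of each composed map at s_0 = 0

    # check against the sequential recurrence
    x = np.random.rand(37)
    ref, acc = [], 0.0
    for xi in x:
        acc = 0.9 * acc + xi
        ref.append(acc)
    print(np.allclose(decayed_cumsum_parallel(x, 0.9), ref))  # True
    ```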

  • @NeoShameMan
    @NeoShameMan 1 year ago +2

    Based on my experiments you won't need NNs for long; the distribution is the same at input and output, and fine-tuning is just a skew of that distribution towards the fine-tuning corpus. Better, there is a very high probability that it won't be a black box for long and we can extract an optimal entropy encoding, no more weird sparsity. I'm just waiting for a new hard drive to test more.

  • @sortysciaofiscia
    @sortysciaofiscia 1 year ago +1

    I have a question at the halfway mark of the video:
    if the importance of attention to tokens decreases based on how far back they are, does that mean that by the end of the answer, it will forget what it started talking about? What stops this approach from repeating itself?
    I'm trying to wrap my head around: "The brown fox jumped over a lazy old dog, and then ...." In this example, will the next word be computed based on the dog reference MORE than the fox one?
    I'd assume the transformers look at every other token in this sentence and compute. Whereas from your explanation I gather that importance drops off the further back the token is, right?
    sorry, I'm new to this.

    • @zhenyuanzhang
      @zhenyuanzhang 11 months ago +1

      Not really. Almost half of the channels in the middle-to-high layers do not decay at all (after training). The important information stored there could last forever, in theory. As long as the model is aware that this piece of information is important, it won't forget it easily.
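      To put numbers on the decay being discussed (our illustration, assuming a non-negative per-channel decay parameter w_c as we understand the paper's setup): the weight a channel places on a token Δ positions back is e^{-Δ w_c}, so

      $$
      w_c = 1:\; e^{-1}\!\approx\!0.37,\; e^{-2}\!\approx\!0.14,\; e^{-3}\!\approx\!0.05,\;\dots
      \qquad
      w_c = 0:\; 1,\,1,\,1,\,\dots \text{ (no decay).}
      $$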

  • @alles_moegliche73
    @alles_moegliche73 1 year ago +1

    Can you also take a look at the Meta Megabyte Paper?

  • @dreamphoenix
    @dreamphoenix 1 year ago

    Thank you.

  • @djfl58mdlwqlf
    @djfl58mdlwqlf 10 months ago

    Hi, I am not convinced that the absence of non-linearity helped parallelization (45:00).
    The paper asserts that this is possible along two different dimensions (batch, time).
    Can anyone give me a brief explanation of this?

  • @user-ys2nd2bg6r
    @user-ys2nd2bg6r 1 year ago

    17:35 (EDIT: and 20:50) That is just a matrix multiplication, therefore an inner product instead of an outer one, right?
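    Both readings are compatible (notation ours, in the linear-attention formulation that part of the video builds on): computed over the whole sequence it is the matrix product K^T V, and that same product is exactly a sum of per-token outer products, which is what gives the recurrent form:

    $$
    K^\top V \;=\; \sum_i k_i v_i^\top,
    \qquad
    S_t \;=\; S_{t-1} + k_t v_t^\top,
    \qquad
    y_t^\top \;=\; q_t^\top S_t .
    $$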

  • @spoonikle
    @spoonikle 1 year ago +2

    We need to focus on more complex multi-step models. Humans take notes, humans ruminate, humans speak aloud.
    Multi-modality is key, and tools built into the model to compensate for shortfalls are key. Design the model with a calculator, let it train on the tool, design the model with outputs hooked into dozens of tools and reward correct tool use, force the influence of tools on the output, and train a model that no longer wastes time reinventing calculators.
    We trained a model to make paintings instead of making a model that calls Adobe APIs to paint - now that LLMs exist we have seen the light, we see the true power of AI... calculators are better left to the programmers.

  • @Veptis
    @Veptis 1 month ago +1

    Google tried to train a 500B LSTM, so one of those "first" claims might be incorrect.

  • @oneman7094
    @oneman7094 1 year ago +2

    Can you do S4?

  • @edhofiko3168
    @edhofiko3168 11 months ago +1

    I unironically love this paper even though it absolutely lacks theoretical analysis. I've been following RWKV since before they made the paper. I would really love it if PyTorch implemented a discounted cumulative sum, since this is exactly what the RWKV attention uses and it is also what people in RL use.

    • @alexeykrylov9995
      @alexeykrylov9995 11 months ago +1

      I agree that it'd be good to have it as a primitive. But as long as it's unavailable, it can be implemented in O(N log N) time (instead of O(N) if it was a primitive) by decomposing it into a convolution of several dilated exponential kernels (I mean, for example: 1st conv: dilation 1, kernel size 4, geometric progression factor k; 2nd: dilation 4, size 4, factor k^4; 3rd: dilation 16, size 4, factor k^16; etc.). It worked well in practice (I did this trick for my colleague's project once).
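      A toy numpy sketch of that decomposition (ours, following the description above; a real implementation would use conv1d kernels on the GPU):

      ```python
      import numpy as np

      def discounted_cumsum_conv(x, k):
          """y_t = sum_{i<=t} k**(t-i) * x_i, built from stacked dilated geometric kernels.

          Stage s applies the causal kernel [1, r, r**2, r**3] with r = k**(4**s) and
          dilation 4**s; each stage multiplies the receptive field by 4, so roughly
          log4(N) stages cover the whole prefix.
          """
          y = np.asarray(x, dtype=float).copy()
          n = len(y)
          dilation = 1
          while dilation < n:
              r = k ** dilation
              out = y.copy()                        # j = 0 tap of the kernel
              for j in (1, 2, 3):
                  shift = j * dilation
                  if shift >= n:
                      break
                  out[shift:] += (r ** j) * y[:n - shift]
              y = out
              dilation *= 4
          return y

      # check against the sequential recurrence y_t = k * y_{t-1} + x_t
      x = np.random.rand(50)
      ref, acc = [], 0.0
      for xi in x:
          acc = 0.9 * acc + xi
          ref.append(acc)
      print(np.allclose(discounted_cumsum_conv(x, 0.9), ref))  # True
      ```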

  • @LostMekkaSoft
    @LostMekkaSoft 1 year ago +1

    44:10 "so thats what they mean by states, if they say... they dont mean the united states, im sorry, they mean these values."
    this is such a wonderful feynman moment ^^
    on a more serious note: i wonder if those approaches could be combined... like a standard transformer based model part that is really good short term and one like this that is really good long term that somehow complement each other?
    i think my best idea for that so far would be if you let the RWKV model "summarize" the long context and produce not a sequence of output tokens, but a sequence of internal representation values that act as a kind of compressed version of the complete context. then the transformer model could go to town with its superior capabilities but with a shorter context window and pick out which parts of the summary it wants to attend to.
    would that be feasible, or am i thinking out of my ass? :D

  • @corgirun7892
    @corgirun7892 1 year ago

    Amazing!

  • @schwajj
    @schwajj 10 months ago

    Halfway through, but I have a question. Transformers have been applied outside the domain of language modeling (or even more generally, outside of sequence modeling), e.g. Vision Transformers. In building our intuition, Yannic talks in terms of how much RWKV pays attention to the past for each internal feature learned by the model. Does this imply that RWKV is more specialized to sequence modeling than classic Transformers? i.e. would RWKV *not* work well if you try to apply it to image-based input? Or is this an open question? Is there reason to lean one way or the other?
    (probably most people who would answer this already saw the video a month ago, but fingers crossed for an answer)

    • @clehaxze
      @clehaxze 9 months ago

      No answer, but RWKV-4-neo supports image input by basically slapping a CLIP as input into one of its layers. This way it can use the representations as an understanding during a conversation.

    • @giuliavirgili1660
      @giuliavirgili1660 1 month ago

      LineRWKV

  • @antonioILbig
    @antonioILbig 1 year ago +1

    Yannic, good guess! Scalability could be the real deal. Deep architectures have different "lego blocks" (transformers, LSTM, conv, residual, ...). When you build a big model, the meaning of its pieces is lost. What stays is computational efficiency, scalability and optimization behaviour.

  • @serta5727
    @serta5727 1 year ago +1

    General idea for transformers: evolutionary attention heads come to mind. Instead of training multiple attention heads in a transformer, how about just having one that branches off, with the best evolved version getting merged back into the original? That way, at inference time there is only one attention head, to save compute.

    • @AntoshaPushkin
      @AntoshaPushkin 1 year ago +1

      Assume your task is to add numbers like
      123456 + 987654 = ?
      You will need at least 2 attention heads to attend to two numbers.
      Not saying that you should use transformers to add up numbers; it's just a random example of a situation where it's clear that you need multiple attention heads.

    • @schwajj
      @schwajj 10 months ago +2

      @@AntoshaPushkin That doesn't sound right to me: you're essentially saying that a separate attention head would self-assign to each number. It's not completely implausible, but I'd like to see some rigorous analysis that indicates that transformers have been observed to operate in that manner. Are you aware of such research? I'd be grateful for any pointers you could provide.

  • @michael05242002
    @michael05242002 10 months ago +3

    🎯 Key Takeaways for quick navigation:
    00:14 🔄 RWKV is a highly scalable model architecture with some properties of both Transformers and RNNs.
    01:21 ⚖️ In some settings the RWKV model is on par with large Transformer models in performance.
    03:29 📚 RWKV is a language-modeling model: it predicts the next word or token in a text.
    05:15 🧠 RNNs only need a fixed amount of memory for inference, but each inference step can only consider the current memory and the previous token.
    10:17 📈 RWKV is the first non-Transformer architecture that scales to tens of billions of parameters.
    11:12 🧠 The LSTM is a type of RNN that addresses the vanishing-gradient problem with gating mechanisms and has long-term memory.
    14:44 🚪 LSTMs use gates, including a forget gate and an input gate, to control updates to the hidden state and cell state.
    16:09 📊 An LSTM update involves several nonlinear computations, which forces sequential computation and prevents parallelization.
    18:37 🤔 Attention can dynamically assign attention weights to aggregate information, but the computation is heavy and sequential.
    20:33 🔄 Attention-free Transformers try to redefine attention without token-to-token interaction, to reduce memory requirements.
    23:43 ⚖️ RWKV uses a fixed attention pattern applied to all data points, but it can be modulated by adding keys, which gives the pattern some flexibility.
    24:37 🔄 RWKV's attention is modulated additively rather than multiplicatively; compared to the Transformer's multiplicative interaction this has a weaker effect.
    25:30 🔍 RWKV's fixed attention cannot take the meaning of the current token into account and only defines a fixed attention pattern; by comparison, the original attention mechanism is more powerful and flexible.
    28:10 💡 RWKV proposes a new attention mechanism in which a vector W modulates the attention pattern, thereby taking past information into account.
    30:09 📝 RWKV builds a model with a repeating structure by applying it over a sequence of tokens, in order to process sequential data.
    33:02 🔍 Each block in RWKV keeps part of its computation and passes it on, similar to the state passing in an LSTM but carried out layer by layer.
    34:01 💡 RWKV's channel-mixing block mixes channels using linear layers, a nonlinearity, and element-wise multiplication.
    36:25 📝 RWKV implements a time or token shift by adding the previous time step's input to the current one and linearly interpolating.
    38:14 🌟 RWKV's time-mixing block computes in a Transformer-like way, with linear layers and a weighted sum.
    42:40 ✨ With an unrestricted weighted sum, RWKV can aggregate values over the entire past without being limited by a fixed-size attention matrix.
    44:05 📝 RWKV uses linear interpolation and weighted sums to implement the time or token shift.
    45:10 🌟 RWKV's hidden state is computed as a linear function via linear interpolation, with no nonlinearity, so training can be parallelized.
    46:03 ✨ The token-shift operation lets each element access the element before it, giving a receptive field that grows with depth.
    47:36 💡 RWKV implements a linear aggregation that weights and sums past values, which lets it look back at the past efficiently.
    51:49 🚀 Compared to Transformers and LSTMs, RWKV sits in the middle in its ability to look back at the past and to do complex computation, but stacking more layers can increase its expressiveness.
    54:54 ⚡ Compared to Transformers and LSTMs, RWKV's advantages in handling complex computation and longer contexts are less pronounced.
    55:31 ✨ Increasing the context length lowers the language-modeling loss.
    56:27 📉 The linear attention mechanism may limit model performance on long-context tasks.
    56:53 🏗️ RWKV depends more on carefully designed prompts than standard Transformer models; this needs further exploration and confirmation.
    58:15 🌍 RWKV considers past information layer by layer along the channel dimension; higher layers tend to look at longer time spans.
    Made with HARPA AI

  • @OperationDarkside
    @OperationDarkside 1 year ago +1

    53:50 for a 5-second summary of the paper

  • @agsystems8220
    @agsystems8220 1 year ago +2

    So it specifically chooses the representation of the internal state to be already decomposed into its eigenvectors with respect to time decay, meaning that we can infer relevance forward with a simple fixed matrix? That is pretty cool. I guess you could do something similar with any transformer where you have some natural definition of distance that can be precomputed. For the initial layers at least they seem to be very interested in nearby features (both in language and images), so this definitely seems a natural specialisation/optimisation. If it is going to be doing something like this anyway, we might as well give it an architecture that does it well. Later layers don't seem to care about those features though, so this technique would cease to be valuable pretty fast, I think. For more abstract inferences the order the pieces of information are fed in is not relevant, so the exponential term would tend to one and the whole system would collapse to fully attention-free. You cannot build something able to make abstract inferences with a compact representation using this architecture. A nice piece of work, but a local optimisation rather than an improvement IMO.
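    That reading can be made concrete (notation ours, following the paper's recurrent form): per channel, the running state is a diagonal linear recurrence, so the fixed "transition matrix" is diag(e^{-w}) and the per-channel decays play the role of its eigenvalues:

    $$
    a_t \;=\; e^{-w} \odot a_{t-1} + e^{k_t} \odot v_t,
    \qquad
    b_t \;=\; e^{-w} \odot b_{t-1} + e^{k_t}.
    $$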

    • @jackhe4336
      @jackhe4336 1 year ago

      IMO, it's hard to extract a compact representation of the data without hurting expressiveness and generality of the representation. Could you recommend some papers that address this issue?

    • @schwajj
      @schwajj 10 months ago

      Great comment. A question: if you’re correct that this would work OK on early layers, but less well on later layers which deal with more abstract concepts, could you use a hybrid where some layers use RWKV and later layers use classic attention? I suppose that asymptotically it would still use O(n**2) space; this would only improve things by a constant factor (e.g. if only half of the layers use classic attention, the memory savings will asymptotically approach 50%).
      Do you see any value in such an approach?

  • @andres_pq
    @andres_pq 10 months ago +1

    please make a video about RetNet :)

  • @thntk
    @thntk 11 months ago

    Didn't Schmidhuber do this already in the 90s?

  • @cassandrasinclair8722
    @cassandrasinclair8722 1 year ago

    transformers too are convnets ;) they do convolution over a graph :D Attention is just one instance of graph convolution.

  • @gunale925
    @gunale925 5 months ago

    I still don't get how the time-mixing can be trained in parallel. It must depend on the previous state.

    • @summer_tree3821
      @summer_tree3821 4 months ago

      Me too. Do you understand it now? 😘

    • @gunale925
      @gunale925 4 months ago

      @@summer_tree3821 Yep. During training, the previous state is computed directly from the actual (given) data.
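      To make that concrete, a small numpy check (ours; it tracks only the running numerator and ignores the current-token bonus and normalization): during training the whole sequence is known, so the state at every position is a plain weighted sum of the given inputs and all positions can be formed at once.

      ```python
      import numpy as np

      T, C = 8, 4                        # sequence length, channels
      k = np.random.randn(T, C)
      v = np.random.randn(T, C)
      w = np.random.rand(C)              # per-channel decay

      # sequential "RNN mode": a_t = e^{-w} * a_{t-1} + e^{k_t} * v_t
      a_seq, a = np.zeros((T, C)), np.zeros(C)
      for t in range(T):
          a = np.exp(-w) * a + np.exp(k[t]) * v[t]
          a_seq[t] = a

      # time-parallel "training mode": a_t = sum_{i<=t} e^{-(t-i)*w + k_i} * v_i
      # every row depends only on the known inputs, never on the model's own previous output
      t_idx = np.arange(T)[:, None, None]                     # target position t
      i_idx = np.arange(T)[None, :, None]                     # source position i
      decay = np.where(i_idx <= t_idx, np.exp(-(t_idx - i_idx) * w), 0.0)   # (T, T, C)
      a_par = (decay * (np.exp(k) * v)[None, :, :]).sum(axis=1)

      print(np.allclose(a_seq, a_par))   # True
      ```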

  • @hanskraut2018
    @hanskraut2018 1 year ago

    It should be trained on equations and text embedded in irrelevant numbers and text, where the far-back equation/text is needed to compute the final numeric/text result; that way a neural net would automatically learn to selectively pay attention based on the output.

    • @schwajj
      @schwajj 10 months ago

      That’s the whole trade-off here. The classic 2017 transformer model does what you say (to a certain extent). The model being discussed here is worse at the sort of task you’re proposing, but has the benefit of not using O(n**2) space.

  • @rolfengstrand9838
    @rolfengstrand9838 1 year ago +1

    THANK YOU for pointing out that the use of the word "attention" in the context of transformers has strayed far away from the meaning of "attention" in other contexts. We have to accept that this is happening, of course. But it is important that introductory material explains clearly what "attention" is intended to mean here. It would be wrong to assume that a newbie reader has the same concept associated with the word "attention".

  • @panofilossas6564
    @panofilossas6564 1 year ago

    Looks like a good candidate for running on low-spec hardware

  • @Will-kt5jk
    @Will-kt5jk 1 year ago

    1:02:05 - The Matrix looks different to what I remember from the movie

  • @HD-Grand-Scheme-Unfolds
    @HD-Grand-Scheme-Unfolds 1 year ago +2

    Just a very loose and wild thought that came to mind: what if "transformer architectures" could be employed as an imitation of "System 2" (more logical, critical and decisive), while "RWKV" is used for "System 1" (a bit fuzzy in accuracy, but capturing the essence of the lifelong experience had by the AI agent, hence a derived ability to exercise intuition or instinct-like thinking and responses to situations)? Both could be combined in a pseudo-cognitive-architecture approach to tackle the AGI challenge. Wouldn't that be something to see 😄.

  • @hachembetrouni6731
    @hachembetrouni6731 1 year ago +3

    😅Yannic makes LSTMs sound like prehistory

  • @danplt
    @danplt 1 year ago +5

    so many authors, from so many different institutions

    • @marshallmcluhan33
      @marshallmcluhan33 1 year ago

      The Neko Institute of Science and the Waifu Research Department are still at the top of the charts on Hugging Face. I'm not sure these archaic institutions are as mobile, so they have to team up to stay relevant.

    • @TheThunderSpirit
      @TheThunderSpirit 1 year ago

      need more institutions. can u give me?

  • @toddnedd2138
    @toddnedd2138 1 year ago +3

    Perform badly, this approach will, if speak like Yoda you do. ;) Thank you for the detailed explanation of the paper and the effort you put into this.

  • @alyzst
    @alyzst 1 year ago

    How does it compare with Hyena?

  • @danylaley
    @danylaley 11 months ago

    This is just a fancy convolutional LSTM

  • @cexploreful
    @cexploreful 1 year ago +5

    what's next? a transformer model of RNN subunits composed by a stack of transformer models designed as an RNN structure of transformers units aligned with a convolutional recurrent boltzman gate.

    • @guillaumevermeillesanchezm2427
      @guillaumevermeillesanchezm2427 1 year ago

      do it! do it!

    • @erickmarin6147
      @erickmarin6147 1 year ago +1

      Learning in a layer wise manner

    • @filoautomata
      @filoautomata 1 year ago +1

      Quantum Multi-Modal Transformer Model LSTM using an Ensemble of Neutrosophic-Logic-based Attention Models for an Interpretable 'Human Extinction Capable' Military Grade Artificial General Intelligence.

    • @erickmarin6147
      @erickmarin6147 1 year ago +1

      With active dendrite modeling for multi-task approaches

    • @vivienseguy
      @vivienseguy 1 year ago +1

      Yes, all learned end-to-end

  • @lancemarchetti8673
    @lancemarchetti8673 1 year ago

    Really excited about this! Love from Sunny South Africa

  • @deeplerg7913
    @deeplerg7913 1 year ago

    I can't understand anything here but I'm sure it's something very interesting :P

  • @bertobertoberto242
    @bertobertoberto242 1 year ago

    the convnet explanation reminds me a lot of WaveNet from DeepMind...

  • @howuhh8960
    @howuhh8960 1 year ago +5

    I really don't like the very strong statements in the paper, such as "surpasses the capabilities of any existing RNN". lol, ZERO comparisons with other new RNNs based on S4 for example...

    • @iOhadRubin
      @iOhadRubin 1 year ago

      There are no public open source S4 models of this size

  • @haraldtopfer5732
    @haraldtopfer5732 1 year ago

    53:41 my model has a linear scaling where everything else goes *Brrrrrrruummm* .... story of my life

  • @akashkarnatak3014
    @akashkarnatak3014 1 year ago +3

    If you had uploaded this video 3 days ago, it would have helped me with my assignment as well. Anyway, great video.

  • @almoni127
    @almoni127 1 year ago

    Why do people still claim that transformers require memory that is quadratic in the sequence length when it was shown to be avoidable? (See the work on flash attention for example)
    It is still true, however, that it requires quadratic time.

    • @schwajj
      @schwajj 10 months ago

      Flash attention is still quadratic in the sequence length (more precisely, context length). It just massively improves the constant factor via more efficient use of the GPU memory hierarchy.

  • @chrisBruner
    @chrisBruner 1 year ago

    So a couple of thoughts. 1. For the intelligent prompt generation, you could just use a small transformer dedicated to that task. 2. Because of its parallel nature, you could have one of these things working on a bunch of Raspberry Pis, or.... a world-wide network of computers sharing the task. That would more than make up for the limitations compared to transformers. 3. It seems to me that these guys get fuzzy in recall of "minutiae", but there is no reason you can't have several hooked together so the recall can occur by asking another set. Just some thoughts.

  • @anglikai9517
    @anglikai9517 10 months ago

    Tested it today; it's too slow compared to Llama 2 GGML. Hope the GGML version of RWKV becomes more user-friendly.

  • @kimchi_taco
    @kimchi_taco 1 year ago

    I'm not sure. It looks like a complicated MLP-Mixer.

  • @nyyotam4057
    @nyyotam4057 1 year ago

    Was like "Great, so now they'll implement it and my Alpaca will stop consuming so much memory". But then I got to the "tradeoff with computation" part 🙂.

  • @novelspace
    @novelspace 1 year ago

    Galaxy 🧠 stuff

  • @sebastianp4023
    @sebastianp4023 1 month ago

    53:53

  • @7200darkcharm
    @7200darkcharm 1 year ago

    This abstract is summarizing a research paper that presents a new model architecture for natural language processing (NLP) tasks called Receptance Weighted Key Value (RWKV).
    Here's a breakdown of the abstract:
    Problem with Transformers: Transformers are a type of model that have been very successful in NLP tasks. However, they have a major drawback: their memory and computational needs increase quadratically with the length of the sequences they process. This means that as the input data (like a sentence or document) gets longer, the resources needed to process it grow very quickly, which can make them impractical for very large datasets or very long sequences.
    Problem with Recurrent Neural Networks (RNNs): RNNs, another type of model, have memory and computational needs that grow linearly with sequence length, which is more efficient than Transformers. However, they tend to perform worse on NLP tasks because they are harder to train in parallel (meaning, it's harder to split the work of training them across multiple machines or processors), and they don't scale as well (meaning, their performance doesn't improve as much when you add more data or make them bigger).
    The Proposed Solution - Receptance Weighted Key Value (RWKV): The authors propose a new model, the RWKV, that aims to combine the best of both worlds. It can be trained in parallel like a Transformer, which makes it efficient to train, and it has linear memory and computational complexity like an RNN, which makes it efficient to use once trained. This is achieved by using a linear attention mechanism, which is a method for deciding which parts of the input data the model should pay most attention to.
    Results: The authors scaled the RWKV model to tens of billions of parameters (which is a measure of the model's size and complexity) and found that it performs similarly to a Transformer of the same size. This suggests that it could be a useful alternative to Transformers for large-scale NLP tasks.
    Conclusion: This work represents a significant step towards reconciling the trade-off between computational efficiency (how much computing resources a model needs) and model performance (how well the model does its job) in sequence processing tasks. The hope is that future work can build on this to create even more efficient models.
    So in essence, the abstract is saying, "We've developed a new model that combines the best parts of two existing types of models. Our new model can handle large amounts of data and perform as well as the best current models, while using less computational resources. This is a big step forward for the field."

  • @anatalelectronics4096
    @anatalelectronics4096 11 months ago

    OK, one more attempt without pasting the link to the paper: Apple's AFT has been around since 2015, before the 2017 AIAYN paper, remarkable. It seems I can't paste the link, hence no more info I can give. Look for "An Attention Free Transformer" to get to the paper.

    • @schwajj
      @schwajj 10 months ago

      No it’s not. It’s from 2021. Its citations include many papers from 2021, so obviously it wasn’t written in 2015.

  • @triplea657aaa
    @triplea657aaa 8 months ago

    I think RWKV in combination with a transformer model to generate the prompts could be really powerful

  • @qwerty123443wifi
    @qwerty123443wifi 1 year ago +2

    Does being an author on an ML paper mean anything anymore? There are so many authors on some of these papers that it seems a bit ridiculous

    • @wujacob4642
      @wujacob4642 1 year ago +2

      That's because the paper was written in an open-source way. The main author, Bo Peng, mentioned that in his blog.

  • @fo.c.horton
    @fo.c.horton 1 year ago

    please do like 10% more work making the annotations neater

  • @alivecoding4995
    @alivecoding4995 10 months ago

    Two months later. Have you seen adoption of these ideas?

  • @Sciencehub-oq5go
    @Sciencehub-oq5go 1 year ago

    The paper isn't very well written, and is too short / confusing in parts.
    What exactly do they mean by "channels"? The components of the embedding vectors?

  • @girrajjangid4681
    @girrajjangid4681 1 year ago

    Which mic are you using for the video? It's amazing. @YannicKilcher

  • @adi-ee8zj
    @adi-ee8zj 1 year ago +2

    CNN is all you need?

  • @yilei1051
    @yilei1051 11 months ago

    I lost interest halfway through the explanation... The most profound results are often simple and coherent architectures; this work required so much explanation that it feels like it's just playing with scalability and performance, without revealing much raw science.

  • @arzigogolato1
    @arzigogolato1 1 year ago

    Why, why didn't they think of a better name? RWKV is really bad marketing...

  • @xxdaggerxx5
    @xxdaggerxx5 1 year ago

    I can't watch a 1hr video man, summarize this shit

  • @klammer75
    @klammer75 1 year ago

    Amazing amazing amazing! I’ve been delving sooo much into the code side of implementation I forgot how much I love the maths side of the architecture and this walkthrough so expertly done by Yannic has lit my maths brain on fire once again! I can’t thank you enough for that, was a thrilling explanation and you are by far my favorite technical AI explainer out there! You sir are an asset to humanity and I for one tip my hat to you! And to think that there’s billions if not trillions of these weights/equations/parameters or whatever you want to call them in these models which give rise to the results we see is truly mind boggling….I feel like I just took an address watching that🤪😂🥳🦾🤓🤫