Retentive Network: A Successor to Transformer for Large Language Models (Paper Explained)

  • Published 7 Jun 2024
  • #ai #retnet #transformers
    Retention is an alternative to Attention in Transformers that can be written in both a parallel and a recurrent fashion. This means the architecture achieves training parallelism while maintaining low-cost inference. The experiments in the paper look very promising.
    OUTLINE:
    0:00 - Intro
    2:40 - The impossible triangle
    6:55 - Parallel vs sequential
    15:35 - Retention mechanism
    21:00 - Chunkwise and multi-scale retention
    24:10 - Comparison to other architectures
    26:30 - Experimental evaluation
    Paper: arxiv.org/abs/2307.08621
    Abstract:
    In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost O(1) inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at this https URL.
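    As a rough illustration of the parallel/recurrent duality claimed in the abstract, here is a minimal single-head sketch (my own toy code, not the paper's implementation: the xPos rotation, multi-head split, GroupNorm and gating are all omitted, and gamma = 0.9 is an arbitrary decay value):
    ```python
    import torch

    # Toy single-head retention: the parallel and recurrent forms give the same output.
    torch.manual_seed(0)
    T, d = 6, 4                       # sequence length, head dimension
    gamma = 0.9                       # decay factor (assumed value)
    Q, K, V = (torch.randn(T, d) for _ in range(3))

    # Parallel form: O(T^2), fully parallel over the time axis (training).
    n = torch.arange(T)
    D = (gamma ** (n[:, None] - n[None, :])) * (n[:, None] >= n[None, :])  # causal decay mask
    out_parallel = (Q @ K.T * D) @ V

    # Recurrent form: a single d x d state updated per step (inference).
    S = torch.zeros(d, d)
    outs = []
    for t in range(T):
        S = gamma * S + K[t][:, None] @ V[t][None, :]   # decayed state update
        outs.append(Q[t] @ S)
    out_recurrent = torch.stack(outs)

    print(torch.allclose(out_parallel, out_recurrent, atol=1e-5))  # True
    ```
    The recurrent form is what gives the O(1)-per-token inference mentioned above; the parallel form is what keeps training GPU-friendly.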
    Authors: Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei
    Links:
    Homepage: ykilcher.com
    Merch: ykilcher.com/merch
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: ykilcher.com/discord
    LinkedIn: / ykilcher
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 250

  • @YannicKilcher
    @YannicKilcher  8 місяців тому +22

    OUTLINE:
    0:00 - Intro
    2:40 - The impossible triangle
    6:55 - Parallel vs sequential
    15:35 - Retention mechanism
    21:00 - Chunkwise and multi-scale retention
    24:10 - Comparison to other architectures
    26:30 - Experimental evaluation

    • @cerealpeer
      @cerealpeer 8 місяців тому

      I asked ChatGPT to design a ChatGPT-like LLM based on RetNet, and this is what it came up with. I hope at least some part of it is useful.
      #chatgpt #openai
      Creating a complete ChatGPT-like model based on the RetNet architecture would require extensive resources and training data, which cannot be provided within this text-based environment. However, I can provide you with a simplified code structure that combines a RetNet-based architecture with a basic chat interaction. Please note that this code serves as a starting point for building such a model and is highly simplified.
      First, we'll define the RetNet components, and then integrate them into a chat interaction loop:
      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      # Define the components of the RetNet-based ChatGPT

      class MultiScaleRetention(nn.Module):
          def __init__(self, input_dim, num_heads):
              super(MultiScaleRetention, self).__init__()
              self.multihead_attention = nn.MultiheadAttention(input_dim, num_heads)

          def forward(self, x):
              # Apply multihead attention
              x, _ = self.multihead_attention(x, x, x)
              return x

      class FeedForwardNetwork(nn.Module):
          def __init__(self, input_dim, hidden_dim):
              super(FeedForwardNetwork, self).__init__()
              self.linear1 = nn.Linear(input_dim, hidden_dim)
              self.gelu = nn.GELU()
              self.linear2 = nn.Linear(hidden_dim, input_dim)

          def forward(self, x):
              # Apply the feedforward network
              x = self.linear1(x)
              x = self.gelu(x)
              x = self.linear2(x)
              return x

      class RetNetBlock(nn.Module):
          def __init__(self, input_dim, hidden_dim, num_heads):
              super(RetNetBlock, self).__init__()
              self.multi_scale_retention = MultiScaleRetention(input_dim, num_heads)
              self.feed_forward = FeedForwardNetwork(input_dim, hidden_dim)
              self.layer_norm1 = nn.LayerNorm(input_dim)
              self.layer_norm2 = nn.LayerNorm(input_dim)

          def forward(self, x):
              # Apply multi-scale retention with a residual connection
              x = self.layer_norm1(x + self.multi_scale_retention(x))
              # Apply feedforward network with a residual connection
              x = self.layer_norm2(x + self.feed_forward(x))
              return x

      class RetNet(nn.Module):
          def __init__(self, input_dim, hidden_dim, num_heads, num_layers):
              super(RetNet, self).__init__()
              self.blocks = nn.ModuleList([RetNetBlock(input_dim, hidden_dim, num_heads) for _ in range(num_layers)])

          def forward(self, x):
              # Apply multiple RetNet blocks
              for block in self.blocks:
                  x = block(x)
              return x

      # Define the ChatGPT architecture
      class ChatGPT(nn.Module):
          def __init__(self, input_dim, hidden_dim, num_heads, num_layers, vocab_size):
              super(ChatGPT, self).__init__()
              self.retnet = RetNet(input_dim, hidden_dim, num_heads, num_layers)
              self.embedding = nn.Embedding(vocab_size, input_dim)

          def forward(self, input_seq):
              embedded_input = self.embedding(input_seq)
              retnet_output = self.retnet(embedded_input)
              return retnet_output

      # Example usage:
      input_dim = 512    # Input dimension
      hidden_dim = 1024  # Hidden dimension for the feedforward network
      num_heads = 8      # Number of attention heads
      num_layers = 6     # Number of RetNet blocks
      vocab_size = 10000 # Vocabulary size

      chat_gpt = ChatGPT(input_dim, hidden_dim, num_heads, num_layers, vocab_size)
      input_seq = torch.randint(0, vocab_size, (32, 10))  # Batch size of 32, sequence length of 10
      output = chat_gpt(input_seq)
      print(output.shape)  # Print the output shape
      ```
      This code defines a simplified ChatGPT architecture based on the RetNet components. In practice, you would need to fine-tune this model on a large dataset of conversational data and integrate a decoding mechanism for generating responses. Additionally, a more extensive vocabulary and tokenization scheme should be used to handle natural language text effectively.

    • @cerealpeer
      @cerealpeer 8 місяців тому

      ok i asked chatgpt for a chatgpt and this is lol... idk, man... uhh... 🤯 i have a headache from caffeine and nicotine:
      i hope this does a thing... i hate it when i'm stupid.
      #chatgpt #openai
      #gtforeally
      Incorporating the described system into a process that improves itself over iterations is a complex task. It requires a continuous feedback loop and the ability to modify the system based on user feedback and learning from interactions. Below is a simplified example of how the system can iterate to improve itself:
      ```python
      class ChatSystem:
          def __init__(self):
              self.responses = {}
              self.learning_iterations = 3  # Number of learning iterations

          def user_interaction(self):
              user_input = input("User: ")
              if user_input.lower() == "exit":
                  return False  # Exit the interaction
              else:
                  response = self.generate_response(user_input)
                  print("System:", response)
                  feedback = input("Was this response helpful? (yes/no): ").lower()
                  if feedback == "yes":
                      self.store_feedback(user_input, response)
                  return True  # Continue interaction

          def generate_response(self, user_input):
              return "Thank you for your input."

          def store_feedback(self, user_input, response):
              if user_input in self.responses:
                  self.responses[user_input].append(response)
              else:
                  self.responses[user_input] = [response]

          def dynamic_learning(self):
              for _ in range(self.learning_iterations):
                  for user_input, responses in self.responses.items():
                      if len(responses) > 1:
                          improved_response = " ".join(responses)  # Simplified improvement
                          self.responses[user_input] = [improved_response]

          def run(self):
              print("Chat System - Type 'exit' to end the conversation.")
              while True:
                  continue_interaction = self.user_interaction()
                  if not continue_interaction:
                      break
              self.dynamic_learning()
              print("Chat System - Learning and Feedback Collection Completed.")
              # Print peak utility
              print("\nPeak Utility of the System:")
              for user_input, responses in self.responses.items():
                  print("User Input:", user_input)
                  print("Improved Response:", responses[0])  # Assuming the first response is the most improved

      if __name__ == "__main__":
          chat_system = ChatSystem()
          chat_system.run()
      ```
      In this code:
      - The system collects feedback and stores it.
      - After the conversation, it goes through a specified number of learning iterations to improve responses.
      - Finally, it prints the peak utility, assuming that the first improved response in the stored feedback represents the peak utility.
      Please note that this is a simplified example for demonstration purposes. In a real-world scenario, dynamic learning and improvement would involve more complex NLP techniques and data handling processes.

    • @cerealpeer
      @cerealpeer 8 місяців тому

      i'm trying to get chatgpt to make a self-developing chatgpt:
      #chatgpt
      Integrating the simplified Selective Black Box System into a Retentive Large Language Model (LLM) internal environment, while linking it to both internal and external black boxes, along with supervised user interactions for live automatic security and development, is a complex endeavor. Here, I'll provide a conceptual overview of how such an integration might work:
      ```python
      import random

      class SelectiveBlackBoxSystem:
          def __init__(self):
              self.code = ""
              self.history = []
              self.security_vulnerabilities = []

          def generate_response(self, user_input):
              response = f"System response to user input: {user_input}"
              # Save user input and response to history
              self.history.append((user_input, response))
              # Analyze user input and detect security vulnerabilities (simplified for demonstration)
              if "sql injection" in user_input.lower():
                  self.security_vulnerabilities.append("SQL Injection detected in user input.")
              # Check if it's time to self-develop or request user input for improvements
              if len(self.history) % 5 == 0:
                  self.self_develop()
                  response += "\nSystem has self-developed based on history."
              else:
                  response += "\nSystem is waiting for user input or self-development."
              return response

          def self_develop(self):
              # Simplified self-development process (replace with actual logic)
              new_code = random.choice(["print('Hello, World!')", "for i in range(10): print(i)"])
              self.code += new_code + "\n"

          def report_security_vulnerabilities(self):
              if self.security_vulnerabilities:
                  return "\n".join(self.security_vulnerabilities)
              else:
                  return "No security vulnerabilities detected."

      class RetentiveLLMEnvironment:
          def __init__(self):
              self.llm_code = ""
              self.selective_black_box = SelectiveBlackBoxSystem()

          def generate_llm_response(self, user_input):
              # Process user input with the LLM and get a response
              llm_response = f"LLM response to user input: {user_input}\n"
              self.llm_code += llm_response  # Store LLM-generated code in the environment
              # Pass user input to the Selective Black Box System for further processing
              black_box_response = self.selective_black_box.generate_response(user_input)
              # Return both LLM and Black Box responses
              return llm_response + black_box_response

          def report_security_vulnerabilities(self):
              # Report security vulnerabilities detected by the Selective Black Box System
              return self.selective_black_box.report_security_vulnerabilities()

      if __name__ == "__main__":
          environment = RetentiveLLMEnvironment()
          while True:
              user_input = input("User: ").strip()
              if user_input.lower() == 'exit':
                  print("Goodbye!")
                  break
              system_response = environment.generate_llm_response(user_input)
              print("System:", system_response)
              # Check for security vulnerabilities and report if any
              security_report = environment.report_security_vulnerabilities()
              if security_report:
                  print("Security Report:")
                  print(security_report)
      ```
      In this conceptual integration:
      - The `SelectiveBlackBoxSystem` is encapsulated within the `RetentiveLLMEnvironment`.
      - User input is first processed by the LLM to generate a response and store the LLM-generated code in the environment.
      - Then, the same user input is passed to the `SelectiveBlackBoxSystem` for further processing.
      - Both the LLM response and the Black Box System response are provided to the user.
      - Security vulnerabilities detected by the Black Box System are reported.
      This integration combines the capabilities of the LLM, the Selective Black Box System, and user interactions for security monitoring and code development within the internal environment. It serves as a simplified demonstration, and in practice, more sophisticated logic, security measures, and integration points would be necessary.

  • @kristoferkrus
    @kristoferkrus 8 місяців тому +336

    First they published the ResNet, now they publish the RetNet... There is only one conclusion to be drawn from this: In 2031, Microsoft Research will publish The ReuNet.

    • @NoNameAtAll2
      @NoNameAtAll2 8 місяців тому +25

      I'll implement only RewNet. RevNet was too weak

    • @justfoundit
      @justfoundit 8 місяців тому +15

      Extrapolation beyond model size

    • @alles_moegliche73
      @alles_moegliche73 8 місяців тому +12

      the next one is RestNet, trust me

    • @jantuitman
      @jantuitman 8 місяців тому +2

      I think they try to slowly work up to RedmondNet

    • @anmolt3840051
      @anmolt3840051 8 місяців тому

      I'm waiting for ReyNet

  • @jjkjkjka
    @jjkjkjka 8 місяців тому +51

    Training parallelism refers to parallel computation along the time axis, not to the ability to have a loss on every token (both transformers and RNNs can do this). Basically, in transformers you can use parallel_for(batch and time), but in RNNs you can't do parallel_for along time, because along the time dimension it's causally dependent on the output of the previous token, not on the input (as in transformers).
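    A minimal sketch of the distinction this comment makes (toy shapes and weights, not a real model):
    ```python
    import torch

    B, T, d = 8, 16, 32
    x = torch.randn(B, T, d)
    W_in, W_h = torch.randn(d, d), torch.randn(d, d)

    # Transformer-style layer: every position is a function of the *inputs*,
    # so the whole time axis goes through one batched matmul (parallel over batch and time).
    h_parallel = torch.tanh(x @ W_in)                  # (B, T, d) in one shot

    # RNN-style layer: step t depends on the *output* of step t-1,
    # so the loop over time cannot be parallelized; only the batch dimension can.
    h = torch.zeros(B, d)
    states = []
    for t in range(T):
        h = torch.tanh(x[:, t] @ W_in + h @ W_h)       # causal dependence on the previous state
        states.append(h)
    h_recurrent = torch.stack(states, dim=1)           # (B, T, d)
    ```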

  • @kristoferkrus
    @kristoferkrus 8 місяців тому +90

    This feels familiar. Isn't this almost precisely what they did in the paper "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" back in 2020, except for the discount factor that allows the retention to decay exponentially?

    • @kimchi_taco
      @kimchi_taco 8 місяців тому +11

      I feel the same, and I think the discount factor is not a feature but tech debt.

    • @lexer_
      @lexer_ 8 місяців тому +41

      Throughout the entire explanation I had the impression the main contribution of this paper was to accumulate and actually try out existing ideas from other papers. This is of course also a very valuable contribution but I dislike the way this is publicized as this revolutionary invention from Microsoft research.

    • @Supreme_Lobster
      @Supreme_Lobster 8 місяців тому +13

      @@kimchi_taco In the paper they mention that Linear Attention (which basically removes the softmax as well) has problems with modelling positional encoding, so this might be the intuition they had for how to solve it.

    • @hanyanglee9018
      @hanyanglee9018 8 місяців тому +2

      It's about protecting the output of each layer, since it doesn't have the softmax.

    • @donglixp
      @donglixp 8 місяців тому +30

      Any question is welcome!
      The modelling is different between them. For example, linear attention still tries to approximate softmax, while retention is motivated from a different angle. The comparisons between them are described in the paper. In terms of running the arch as RNNs, we indeed were inspired a lot by previous work, which was also clearly presented in the paper.

  • @lexer_
    @lexer_ 8 місяців тому +44

    I am always really curious with these paper explanations which parts of these concepts are actually novel and which are novel combinations of known principles and concepts. Maybe you could consider just very briefly mentioning this aspect while going through the explanation? Sometimes you implicitly touch on this aspect but I would love a more explicit pointing out of which of this you think is novel. I don't ask for some rigorous lookup through all of ML and math publications of course! Just a gut-based best guess is fine.

  • @IcyyDicy
    @IcyyDicy 8 місяців тому +41

    I'm so glad I took a linear algebra course at uni, I can actually kinda understand what's going on!
    Thanks for going into the math of this

    • @petevenuti7355
      @petevenuti7355 8 місяців тому

      I wish I did, heck, I wish I had that and some calc in high school if not middle school...

    • @sunnohh
      @sunnohh 8 місяців тому

      It's amazing the garbage people do with linear algebra

    • @KEKW-lc4xi
      @KEKW-lc4xi 8 місяців тому +3

      You must have had an excellent math teacher. I struggled with math in college, even up to calculus 2. I often need to search for the meanings of math symbols and how they're used in different situations. In my opinion, research papers often fail to explain their work clearly, maybe because they're worried about job security or something?
      Whenever I read a research paper and find it needlessly complex, I'm reminded of the quote:
      "An idiot admires complexity, where a genius admires simplicity"
      - Terry Davis

    • @IcyyDicy
      @IcyyDicy 8 місяців тому

      ​@@KEKW-lc4xi I fully agree with you. So many good details are hidden within formulas and notation that it's very hard to understand if you don't know what to look for. I personally find code way more readable than formulas, and I wish that papers would write out processes as charts or code to be clearer.
      But since that's what we've been doing since the beginning, it's expected, so that unfortunately isn't likely to change anytime soon.

  • @ai.srirambandi
    @ai.srirambandi 8 місяців тому +22

    Next do Bayesian Flow Networks by Alex Graves et al.

  • @alanjlockett
    @alanjlockett 8 місяців тому +6

    Firstly, I appreciate these videos, which help me stay up to date with the state of research. Second, two quibbles. First, you cannot ignore the \Thetas, which do not cancel but instead implement rotary position embeddings. Second, although RetNet certainly looks like a Linear Transformer, I think the per-head GroupNorm operation is actually implementing a kind of nonlinearity (scaling by the variance). It doesn't matter that it's applied after v is multiplied in; if you reimagine softmax as a Boltzmann distribution and think of extracting the division by the partition function outside the expectation, I think you'll see what I mean, though RetNet is certainly not taking a weighted average of the vs like transformer attention does.

  • @ItIsKnown42
    @ItIsKnown42 8 місяців тому +5

    Sounds huge! Please keep posting.

    • @Schaelpy
      @Schaelpy 8 місяців тому

      I do not trust a computer science paper with an unrelated quote and an optical illusion in it

  • @joshuascholar3220
    @joshuascholar3220 8 місяців тому +8

    I love your channel. Please keep going!

  • @mloewen248
    @mloewen248 8 місяців тому +1

    Fantastic as always Yannic!

  • @yannickpezeu3419
    @yannickpezeu3419 8 місяців тому +3

    Thanks ! You are a great teacher !

  • @clray123
    @clray123 8 місяців тому +7

    17:48 No, this decay is not really positional encoding. I suspect it's for numerical stability. For positional encoding they use xPos, which is a kind of RoPE (which also has built-in decay).
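    For intuition, the "built-in decay" of an xPos-style scaling can be seen in a toy sketch (my own illustration; the real xPos also applies a rotary rotation, which is omitted here, and zeta = 0.98 is an arbitrary value):
    ```python
    import torch

    # Scale queries by zeta**n and keys by zeta**(-m); the score between positions
    # n and m then picks up a relative factor zeta**(n - m), i.e. a decay with distance.
    zeta = 0.98
    T, d = 8, 4
    q, k = torch.randn(T, d), torch.randn(T, d)
    n = torch.arange(T, dtype=torch.float32)

    scores_plain = q @ k.T
    q_scaled = q * zeta ** n[:, None]
    k_scaled = k * zeta ** (-n)[:, None]
    scores_scaled = q_scaled @ k_scaled.T

    expected = scores_plain * zeta ** (n[:, None] - n[None, :])
    print(torch.allclose(scores_scaled, expected, atol=1e-5))  # True
    ```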

  • @mathematicalninja2756
    @mathematicalninja2756 8 місяців тому +3

    Softmax is used to model multinomial probability distributions; they have modelled it like a Poisson process.

  • @JTMoustache
    @JTMoustache 8 місяців тому

    Good to have you back

  • @dr.mikeybee
    @dr.mikeybee 4 місяці тому

    Very good work, Yannic.

  • @dr.mikeybee
    @dr.mikeybee 4 місяці тому

    The softmax is a filter on the signal, so we are getting an unfiltered signal in retention networks. That means part of the signal we need is in there.

  • @abderrahimbenmoussa4359
    @abderrahimbenmoussa4359 8 місяців тому +2

    I am not very good at maths and this was very clear. Thanks.

  • @araldjean-charles3924
    @araldjean-charles3924 8 місяців тому +7

    Hi Yannic Kilcher. Let me start by saying thank you for what you do! You are a great teacher. You are doing an excellent service to humanity. Nevertheless, maestro, around time frame 11:26 you make a statement about non-linearity (about the activation function). If you use ReLU, you are dealing with a piece-wise linear transformation; one piece maps everything negative to zero (a null-space mapping, sort of), and the other is an identity mapping. Putting softmax aside, you have only linear maps and ReLU activations; the whole neural network must have a more straightforward representation as a piece-wise linear map. My intuitive argument is as follows. Without non-linear activation, the entire network would collapse into one matrix operator by composition. Suppose we perturb the above so that all negative values of a dimension get mapped to zero (restricted to negative input values, this is linear) while all positive values of an input dimension are identically mapped; identity mapping is also linear. Why can't we have a piece-wise linear decomposition of such operators and call it a day? Doesn't this hole scream at you? (I am talking just about a neural network with ReLU, leaving out the softmax at the output layer.)
    Thank you for the voodoo that you do so well!
    (I am from Haiti. Voodoo has a totally different meaning for me.)
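    A quick numerical sketch of the point being made in this comment (toy sizes, my own illustration):
    ```python
    import torch

    torch.manual_seed(0)
    W1, W2 = torch.randn(16, 8), torch.randn(4, 16)
    x = torch.randn(8)

    # Without any activation, two stacked linear layers collapse into one matrix.
    y_two_layers = W2 @ (W1 @ x)
    y_collapsed = (W2 @ W1) @ x
    print(torch.allclose(y_two_layers, y_collapsed, atol=1e-5))  # True

    # With ReLU in between, the map is only piece-wise linear: which linear piece
    # applies depends on the sign pattern of W1 @ x, so there is no single matrix.
    y_relu = W2 @ torch.relu(W1 @ x)
    print(torch.allclose(y_relu, y_collapsed, atol=1e-5))        # False (in general)
    ```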

    • @drdca8263
      @drdca8263 8 місяців тому +2

      I believe some work has been done on that.
      One thing is: the number of components in the piecewise-linear decomposition grows like exponentially in the number of layers, I think?
      In any case, there ends up being *a lot* of such components.
      But people have done stuff like, studying how the size of the components varies throughout the input space?

    • @araldjean-charles3924
      @araldjean-charles3924 8 місяців тому +2

      @@drdca8263 Thank you so much for sharing. The "Winning the lottery concept," pruning, and low-rank approximation scream that most space does not matter. The solution is concentrated in a small volume of a large space. We don't know how to approach it directly, so we presently use a shotgun shot approach (so-called winning the lotto).

    • @edh615
      @edh615 8 місяців тому +1

      If negative values get mapped to 0, that is not reversible; the ability to exclude a part of the input going forward is what gives the network its expressivity.

    • @drdca8263
      @drdca8263 8 місяців тому

      @@araldjean-charles3924 I found the paper I was thinking of. It is called “Interpreting Neural Networks through the Polytope Lens”.

    • @clray123
      @clray123 8 місяців тому +2

      It is not about removing non-linearity from the neural network (the FFN layers in between retention layers still have it). It is about removing non-linearity from calculating attention scores.

  • @DamianReloaded
    @DamianReloaded 8 місяців тому +17

    A model that can run decently on disconnected end user devices to do language processing, maybe even speech recognition would be awesome.

    • @unkarsthug4429
      @unkarsthug4429 8 місяців тому

      I mean, pixel phones since the integration of the tensor chip already do this.

    • @unkarsthug4429
      @unkarsthug4429 8 місяців тому

      And by that, I mean onboard translation and speech processing.

  • @meisherenow
    @meisherenow 8 місяців тому +3

    Seems like you could get at least some of the computational advantage of ditching the softmax but still have a non-linearity, as long as it's a point-wise non-linearity--relu, square, etc.

  • @RuslanLagashkin
    @RuslanLagashkin 8 місяців тому +2

    It is not "greening out", it is you are breaking the Matrix

  • @woohooliving8683
    @woohooliving8683 8 місяців тому

    Great video. The cross-chunk part is attention across chunks. Is that what you meant by accumulating the distant past?

  • @SimonJackson13
    @SimonJackson13 8 місяців тому

    So the attention decay exponential effectively becomes the non-linear process? Wouldn't a Monte-Carlo on the best causal "mess" sequence-distance be in order?

  • @robinwang6399
    @robinwang6399 2 місяці тому

    Remembering stuff from the Performer paper, it seems that if you pick Q and K carefully, retention reduces to approximately attention with extra steps. So I think it is a good approximation-plus-corrections situation, at least in the chunkwise case.

  • @siminfan9300
    @siminfan9300 8 місяців тому +1

    what’s the group link to the Saturday paper discussion? 👀

  • @edhofiko3168
    @edhofiko3168 8 місяців тому

    If you somehow think this is familiar because you have read a lot about linear transformers, then this is indeed familiar.
    Linear transformers introduced the idea of using a kernel to calculate softmax so we can multiply K and V first and then with Q, resulting in a linear space complexity requirement. Then Performer came along and provided more theoretical background about what kernel to use. But then researchers realized the whole thing resembles the fast weight programmer architecture (this is actually the recurrent form of RetNet, with maybe slight differences) from way back in the 90s, which is actually better than RNNs but was kind of slow (I guess?). Now they decided that maybe training by multiplying Q and K first for parallel training, but inferencing by multiplying K and V first for optimizing space and computation, might be the best of both worlds (see the sketch below).
    But why does linear attention, in the sense of dropping softmax altogether, work? Well, there are a few papers, like the Intention mechanism, which model transformer attention as linear regression with a softmax kernel. So they decided to ditch the softmax kernel altogether, and it actually works better in some cases.
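    A minimal sketch of that regrouping (softmax removed, so plain matmul associativity applies; toy sizes, not the paper's code):
    ```python
    import torch

    T, d = 256, 32
    Q, K, V = (torch.randn(T, d, dtype=torch.double) for _ in range(3))

    out_qk_first = (Q @ K.T) @ V   # O(T^2 d): materializes the T x T score matrix (parallel training)
    out_kv_first = Q @ (K.T @ V)   # O(T d^2): only ever holds a d x d "state" (recurrent-style inference)

    print(torch.allclose(out_qk_first, out_kv_first))  # True: same result, different cost
    ```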

  • @mgostIH
    @mgostIH 8 місяців тому +4

    14:00 isn't necessarily true in works like "Self-attention Does Not Need O(n^2) Memory" and implementations like Flash Attention. Better implementations of the standard design are why it's very hard for new works to displace the original.

    • @donglixp
      @donglixp 8 місяців тому +5

      Thanks for raising the question. FlashAtt is still O(N log(N)) or O(N sqrt(N)) in memory, and O(N^2) in computation.

  • @charstringetje
    @charstringetje 8 місяців тому

    With associativity you can change the order of operations (move the parentheses around): (a+b)+c ≡ a+(b+c). For a commutative function, you can change the order of the operands: a+b ≡ b+a
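    In the retention context the relevant operation is matrix multiplication, which is associative but not commutative; a tiny check (my own sketch):
    ```python
    import torch

    A, B, C = (torch.randn(3, 3, dtype=torch.double) for _ in range(3))
    print(torch.allclose((A @ B) @ C, A @ (B @ C)))  # True: associative, so (Q K^T) V can be regrouped as Q (K^T V)
    print(torch.allclose(A @ B, B @ A))              # False (in general): not commutative
    ```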

    • @AKumar-co7oe
      @AKumar-co7oe 8 місяців тому

      so it's both in this case

    • @AKumar-co7oe
      @AKumar-co7oe 8 місяців тому

      man, kind of ridiculous that i have to remember this stuff when I've had multiple interns from MIT

  • @darshank8748
    @darshank8748 8 місяців тому +3

    What about a presentation of Hyena? Really cool too

  • @nikilragav
    @nikilragav 8 місяців тому

    So the goal is that at inference you can easily run it on some computer that doesn't do matrix operations (no gpu) or has small memory?

  • @FD286
    @FD286 8 місяців тому

    Thank you : D

  • @NeoShameMan
    @NeoShameMan 8 місяців тому +2

    You can basically boil down neural networks at large into two main functions: sorting input data together, then tagging it with classes. LLMs go one step beyond and dissociate the learning of the input (embedding training) from the output (token generation), i.e. they're virtually deeper than the "physical" model, because really it's two models chained together. LLMs are "special" in machine learning because the input distribution and the output distribution are the same. Embeddings prevent the network from learning an arbitrarily sparse input, i.e. the data is self-sorted; the self-similarity encodes the distribution bias that defines the inherent semantics of the input.
    Most big LLMs are very sparse, with neurons only reacting to specific n-grams (aka bags of words). I encourage people to look at how chatbots using ChatScript are programmed too (Suzette and Rosette). The reason is that we tokenized the input, allowing better composition (i.e. higher-level tokens), reproducing an ad hoc parsing tree. This poses the question: what limits does a DAG (the typical network architecture) impose on that structure? The transformer is merely a clever acceleration structure, but if you look at the evolution of the power of neural networks, limiting the input field (the attention mechanism) has been a way to accelerate learning and increase power.
    But I would go much deeper. I would ask you: what function are the neurons ACTUALLY doing? I don't mean logistic regression and the like, I mean *concretely*: try to separate each mechanism of the neuron (beyond the BS brain analogy or obscure math) and ask yourself what they do (the mul, the add, the bias, the activation). After all, the math is merely the same as the grocery store receipt you look at to see if you went over budget or not...
    IMHO LLMs will go the way of NeRF vs Gaussian splatting... YAGNNNI! (You Aren't Going to Need Neural Networks for Intelligence)

    • @millionare5446
      @millionare5446 8 місяців тому +3

      dunning kruger

    • @clray123
      @clray123 8 місяців тому

      @@millionare5446 Given that small (L)LMs are increasingly demonstrated as able to reproduce the behavior/benchmark scores of huge LLMs, I would not be so fast to judge

    • @NeoShameMan
      @NeoShameMan 8 місяців тому

      @@millionare5446 Proof? You can't just say something and get off scot-free 🤣
      I can literally back up assertions with code, I can demonstrate and cite papers. I can point to historical models.
      I recommend the paper "Neurons in Large Language Models: Dead, N-gram, Positional".
      I implemented neural circuits from scratch to test hypotheses. I'm pretty confident in my claim that we will crack neurons. Here is an exercise for you, to see if you aren't the one with Dunning-Kruger: implement a small neural circuit that encodes ordinality, such that a token only detects and discriminates the previous and following token, e.g. abc but not bdf. Try to figure out what impact it has on the graph structure of the circuit, and conclude how the typical DAG structure of neural networks impacts the emergence of such circuitry.

    • @NeoShameMan
      @NeoShameMan 8 місяців тому

      @@millionare5446 Here is another claim you can test: any neural net trained for embeddings will converge to a similar distribution given similar dimensionality, such that we can align both distributions by taking the relations of a group of cluster centroids.
      I'm well aware that the audience of this channel is sophisticated; when I say go back to what a neuron does, I'm exhorting people to look at stuff more concretely in terms of function. I was kind of amused by someone studying a toy Othello model being amazed that the model learns which color it is playing from the state of a single neuron; when you are accustomed to creating neural circuits by hand, it's not surprising, because XORing is how you implement a switch in representation.
      But more importantly, I'm implicitly trying to handwave in a particular direction. I made a prediction pretty early that we could train a language model close to the entropy limit; then the "gzip is all you need" paper happened, surprising a lot of people but not me. Compressors are a type of predictor: feed them random data and they will produce a distribution similar to what they "learned". But compressors are overfitters; that's a different way to look at the problem, overfit then relax to generalize. What "gzip is all you need" did was show that the joint probability of the data is enough to create some form of discrimination. I.e. when I talk about data self-similarity, I'm talking about compression by another name.
      And the reason I bring up ChatScript specifically, and not, say, chatbots made with AIML, is to link back to the notion of embedding, notably explicit embeddings, unlike the current version. Implicit embeddings are a black box that surely captures more semantics than explicit embeddings; we have a history of failing to use explicit embeddings to make any notable breakthrough, IMHO, such as ConceptNet.
      However, these were not useless; it was simply what I call the dictionary problem: it costs a lot to exhaustively list all the minutiae of the world in a model. To bring it back to ChatScript, they leveraged WordNet as an ontology to help generalize, plus a lot of hand-made rules to operate (i.e. the dictionary problem); however, it works to a degree afforded by the limitations they were operating under. Initially I wanted to mention the Stanford parser but chose ChatScript on purpose.
      The big breakthrough LLMs really add is shattering the dictionary problem at the cost of black-boxing. Pointing back to these historical models, I'm positing there is a way to learn to bridge the gap. Imagine you have an explicit embedding and an implicit embedding; you can probably learn a joint probability that will translate hidden classes into known classes, and measure the difference to uncover novel classes, thus better understanding.
      One thing neural networks, and language models in general, do is confuse the nature of classes as humans use them. Consider the phrase "the capital of France is ___". What type is the next word? At a syntactic level we are waiting for a noun or a noun group; at an ontological level, we wait for a city name. Models don't explicitly differentiate between those two types; those are just clusters of words, or bags of words from another perspective. We know models store semantics in a structured latent space, such that words are points in that space: to infer Paris, it would take the vectors for capital and France, then use a learned vector to point to another space with the cluster of words that encodes the ontology "city", where the differential capital/country has a joint distribution with the differential country/capital. That also explains why LLMs or NNs can be quantized, except for some vectors (it depends on the equivalence of words in a cluster); it also explains why some neurons have abnormally large activations: they are cross-domain links of the internal graph model. We can safely conclude that models learn more complex and abstract structures the deeper they are, but also that the DAG model means there is redundancy at each layer or pass-through neuron.
      By finding a way to express compressors as models, extending them to the parser domain and using explicit embeddings, that looks like a desirable end game that removes the black-boxing of neural nets. Thus me making bold claims.

    • @millionare5446
      @millionare5446 8 місяців тому

      @@NeoShameMan which laboratory is letting you create neural circuits with biological neurons?

  • @samanthaqiu3416
    @samanthaqiu3416 8 місяців тому +7

    8:45 But I don't get how the causal mask by itself helps transformers achieve token-level training parallelism; in fact you could say the same about recurrent networks, since they have the causal mask built into their latent propagation architecture

    • @sluicedeuce4367
      @sluicedeuce4367 8 місяців тому +4

      A causal mask allows a single sequence to be deconstructed into many training examples within one forward pass.
      For a given input: 1234 and a given target output: 2345, there are 4 training examples from the one sequence:
      1 -> 2
      12 -> 3
      123 -> 4
      1234 -> 5
      The causal mask enforces this structure in the outputs by masking the attention matrices' upper triangular. A visualization can help cement how the math works, but this is the fundamental idea for autoregressive training.
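      A small sketch of this idea (toy values, my own illustration):
      ```python
      import torch

      # Causal mask: row t of the attention matrix may only look at positions <= t,
      # so one forward pass scores every prefix of the sequence at once.
      T = 4
      scores = torch.randn(T, T)                     # raw attention scores (illustrative)
      mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
      attn = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
      print(attn)                                    # upper triangle is exactly zero

      # The training pairs come from shifting the sequence by one position.
      tokens = torch.tensor([1, 2, 3, 4, 5])
      inputs, targets = tokens[:-1], tokens[1:]      # inputs 1234 -> targets 2345
      print(inputs.tolist(), targets.tolist())
      ```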

    • @-mwolf
      @-mwolf 8 months ago

      @@sluicedeuce4367 But wouldn't just using multiple batches have the same effect, regarding parallelism? (except of course not utilizing the data to its full extent)

    • @anmolt3840051
      @anmolt3840051 8 місяців тому +2

      @@-mwolf AFAIK training RNNs with batched data requires all the samples (within the batch) to have the same length, so with an RNN you'd have to run "123 -> 4" and "1234 -> 5" on different batches and recompute the common states (corresponding to inputs 1, 2 and 3) whereas a transformer can optimize on both "123 -> 4" and "1234 -> 5" objectives within the same batch execution without having to recompute anything. At least, that is my understanding

    • @sluicedeuce4367
      @sluicedeuce4367 8 місяців тому +2

      @@-mwolf Yes exactly right. Autoregressive prediction is all about predicting the next token, but using those sub-sequence examples is what makes training so powerful!

  • @Akshaylive
    @Akshaylive 8 місяців тому +4

    I haven't read the paper yet, but I'm not sure non-linearity is completely taken away. The complex conjugation is a non-linear operation, which is still very much in use here.

    • @Akshaylive
      @Akshaylive 8 місяців тому

      Additionally, since positional embedding is kinda baked in, I'm not sure how this will perform on non-textual tasks like working on images where positional embeddings are quite crucial and non-trivial.

    • @clray123
      @clray123 8 місяців тому +1

      @@Akshaylive It's not baked in; if you look at the implementation, you see they use xPos for positional embedding, which you could replace with something else like traditional RoPE (in fact xPos is supposed to be an improvement over RoPE that attenuates cyclical swings in attention scores over the sequence length).

    • @clray123
      @clray123 8 місяців тому

      @@whannabi The idea is that these specialized faculties/subregions do not necessarily need to be designed as such, but can spontaneously arise within that single unified architecture. When you look at a living organism, you see many specialized organs, but all of them are somehow encoded in the same DNA, using just 4 base pairs, and they evolved over millions of years by executing the same sort of "code" over and over again.

    • @clray123
      @clray123 8 місяців тому

      @@whannabi Yes, although multimodal models such as ChatGPT currently do in fact include some designed subcomponents ("mixture of experts" architecture), so your thinking that a single architecture to rule them all may not be enough is not wrong either.

  • @kevinamiri909
    @kevinamiri909 8 місяців тому +1

    I suppose the next will be FullNet, purely written with the .NET framework, with high compatibility, but high performance only on Windows

  • @yorth8154
    @yorth8154 8 місяців тому +6

    The fact that the jury is still out on this paper is already the best possible outcome/praise. I came into this video thinking that halfway through it the bullshit would already be too smelly to bear.

    • @100c0c
      @100c0c 8 місяців тому

      What's the significance of this? Is it just more efficient training/inference or does it add new capabilities to language models?

    • @yorth8154
      @yorth8154 8 місяців тому +2

      @@100c0c faster inference time and longer context window. There is also the fact that it apparently just scales better with parameter count, which alone could be reason enough to switch. However, no peer review yet and no serious foundation model has been released yet so it's still in "good potential" category.

    • @etopowertwon
      @etopowertwon 8 місяців тому

      @@100c0c I suspect little to none: it's not the first time RNNs have been introduced to transformers. See Transformer-XL.
      Better yet, see the paper on YaRN. Same quadratic transformers. GovReport: ~3.5-~4.5 perplexity, against 16 for the "successor".
      YaRN gives perplexities for 4 methods. Add them all together, to drive the point home, and the PPL would be 15.5. That's still better than RetNet.
      It's almost as if you can't put super long context into a fixed-size memory that is being decayed and erased with each token and get a good result.
      Llama 2 uses "vanilla" transformers. No XL version. No other RNN-based approach.

    • @clray123
      @clray123 8 місяців тому

      The paper's authors are a little overenthusiastic and cringeworthy in their selling of the invention. However, that is understandable if what they claim turns out to be true. Anyhow, the maths is sound, and nothing* keeps you from comparing RetNet performance with a transformer architecture yourself (in fact there are already two good implementations out there you could use straight away). If it were "all bullshit", you would have heard about the impossibility of reproducing the results already.
      *nothing besides the money required to perform a large-scale LLM training, of course; which was also the paper authors' problem

    • @100c0c
      @100c0c 8 місяців тому +1

      @@clray123 So it just makes things cheaper and enhances current LLM abilities, but doesn't add any new functionality?

  • @elanm4311
    @elanm4311 8 місяців тому

    Transformers with a KV cache (all modern transformers) generate each token in linear complexity by saving previous key-value pairs. Once people figure out how to keep/compress the KV cache to a fixed size, we will have all the benefits of retentive networks for regular transformers.
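    A toy decode loop makes the trade-off concrete (my own sketch, projections omitted): the cache, and therefore the per-token attention cost, grows with the number of generated tokens, whereas RetNet keeps a fixed-size d x d state.
    ```python
    import torch

    d = 16
    k_cache, v_cache = [], []

    def decode_step(x_t: torch.Tensor) -> torch.Tensor:
        """x_t: (d,) hidden state of the newest token; a real model would project it to K/V."""
        k_cache.append(x_t)
        v_cache.append(x_t)
        K = torch.stack(k_cache)                       # (t, d): grows by one row every step
        V = torch.stack(v_cache)
        attn = torch.softmax(K @ x_t / d ** 0.5, dim=0)
        return attn @ V                                # attends over all t cached tokens

    for _ in range(5):
        _ = decode_step(torch.randn(d))
    print(len(k_cache))  # 5: memory (and per-token work) grows with sequence length
    ```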

    • @AM-yk5yd
      @AM-yk5yd 8 місяців тому

      Sounds like Memory Transformer.

    • @daniellawson9894
      @daniellawson9894 8 місяців тому

      Wouldn't this be O(n) per token, not constant? You don't have to redo attention for previous tokens, but for generating the current token you still need to attend to the previous O(n) tokens.

  • @gr8ape111
    @gr8ape111 8 місяців тому

    For "simple" NLP tasks with data this seems like a very good model!

  • @triforce42
    @triforce42 8 місяців тому

    Hmm, could be an erratum. Do you call the sigmoidal activation the same thing as softmax around 20-22 mins? Also, don't transformers only use softmax in later layers? Isn't it just a ReLU activation for the nonlinearity?

    • @pi5549
      @pi5549 8 місяців тому +2

      There's a softmax in the attention mechanism. Say there are 256 tokens. You'll get a 256x256 attention matrix and you softmax over each row. So 256 softmaxes, each softmaxing a 256-length vector.

    • @triforce42
      @triforce42 8 місяців тому

      @@pi5549 oh that clears up a misconception on my part. But isn't it still unconventional to use the sigma to denote softmax or is this a convention I'm not aware of? Usually sigma denotes the sigmoidal activation function right?

    • @a1k0n
      @a1k0n 8 місяців тому

      @@triforce42 sigmoid and softmax are literally the same thing; sigmoid is what you get when you have a softmax between only two alternatives.

    • @triforce42
      @triforce42 8 місяців тому

      @@a1k0n if I understand correctly, sigmoid and softmax are not "literally the same thing," sigmoid takes a scalar input and softmax takes a vector input. Notice the summation present in the softmax formula that's not present in the sigmoid one on Wikipedia, for example

    • @a1k0n
      @a1k0n 8 місяців тому

      @@triforce42 To see this, say we have softmax(x1, x2) and we're looking at the result for x1: e^x1 / (e^x1 + e^x2). Normalize by dividing through by e^x1 to get 1/(1 + e^(x2-x1)), which is exactly sigmoid(x1-x2).
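      A quick numerical check of that identity (my own sketch):
      ```python
      import torch

      x1, x2 = torch.tensor(0.7), torch.tensor(-1.3)
      softmax_x1 = torch.softmax(torch.stack([x1, x2]), dim=0)[0]
      sigmoid_diff = torch.sigmoid(x1 - x2)
      print(torch.allclose(softmax_x1, sigmoid_diff))  # True
      ```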

  • @burkhardstackelberg1203
    @burkhardstackelberg1203 8 місяців тому

    Is the RetNet completely linear, or does it contain nonlinearities at some point? If not, I would expect some representation failure at some point.

    • @Luke-kv9ew
      @Luke-kv9ew 8 місяців тому

      They use a swish gate in each layer: arxiv.org/pdf/1710.05941v1.pdf?source=post_page

    • @donglixp
      @donglixp 8 місяців тому

      It has LN and swish gating to add non-linearity.

  • @hkj4276
    @hkj4276 8 місяців тому +1

    20:00 "associativity" is what you wanted to say haha 😆

  • @nabileljabri2768
    @nabileljabri2768 8 місяців тому +1

    Where can I join your Discord?

  • @jondo7680
    @jondo7680 8 місяців тому +3

    Yeah, I was also thinking that they got RWKV wrong. So I'm curious how the two really compare.

    • @donglixp
      @donglixp 8 місяців тому

      You could refer to the RWKV implementation, which is not parallel along the sequence. This is also what RWKV claims in its paper, i.e., recurrent for both training and testing. BTW, mathematically you could also train it sequence-parallel, but as soon as you try, there are numerical issues; at least it's non-trivial to solve. I would not say it's an issue of RWKV; instead, it's caused by AFT. The key difference is that AFT/RWKV has a normalization for keys, and it's also elementwise.

  • @TooYoober
    @TooYoober 8 місяців тому +2

    Retention is all you need?

  • @TheReferrer72
    @TheReferrer72 8 місяців тому +1

    Microsoft Research is doing some solid stuff.
    Phi-1.5 is groundbreaking.

    • @clementdato6328
      @clementdato6328 8 місяців тому +1

      What is Phi-1.5?

    • @TheReferrer72
      @TheReferrer72 8 місяців тому

      @@clementdato6328 A small 1B-parameter LLM that outperforms Llama 2 models that are 10x bigger.
      Search for the "Textbooks Are All You Need" technical paper.

    • @fontende
      @fontende 8 місяців тому

      Microsoft is not related to this paper; there are no foreigners involved even for review, it's purely from Chinese academia. But I would take it skeptically, knowing that academia there operates with communist political pliers on the balls.

    • @clray123
      @clray123 8 місяців тому

      @@clementdato6328 A small model with great results on plausible reasoning tasks, which you can't use for anything practical because of Microsoft's license.

    • @potential900
      @potential900 8 months ago

      @@clementdato6328 Name of a new model iirc, you'll find the technical report on it online.

  • @billyf3346
    @billyf3346 8 місяців тому

    Awesome. Another paper to increase model size and power without any insight at the symbolic level or any features to push forward mechinterp.

    • @st0a
      @st0a 8 місяців тому +1

      I hope Yannic will do an episode on model compression, now that's an interesting topic.

    • @athelstanrex
      @athelstanrex 8 місяців тому

      Best timeline

    • @clray123
      @clray123 8 місяців тому

      It mostly increases inference speed over long sequences... and decreases memory consumption. And these are important aspects, quite regardless of the topics you mention.

  • @Embassy_of_Jupiter
    @Embassy_of_Jupiter 8 місяців тому

    I imagine multi-scale retention just as different heads screaming at each other about what part of the input is most important.
    A new word comes in and head 5 that deals with the 57th to 63rd percentile of the input screams "AHHHHH I HAVE SOMETHING IMPORTANT TO SAY"

  • @xyh6552
    @xyh6552 7 місяців тому

    So, can we still pass off a paper just by not using softmax? This is second-grade-level work

  • @agentds1624
    @agentds1624 8 місяців тому +4

    I really do appreciate the humorous scepticism (or should I just call it sarcasm 🤔). Anyways, your videos would only be half as enjoyable without it. Also, being able to mathematically explain why softmax prevents a recurrent implementation, yet describing the associative property as "the property that sort of allows us to change the order of computation" (which is, by all means, correct), gives me hope for my PhD 😂.

  • @lejoon
    @lejoon 8 місяців тому

    I don't really understand why you say that the causal mask allows for parallel computing? I mean it's simply that the computation of a token doesn't really depend on its position because of self attention.

    • @etopowertwon
      @etopowertwon 8 місяців тому

      Transformers take N tokens as input and produce N tokens as output. Each output token is used for training.
      If it weren't for the causal mask, the network would learn that output[i] = input[i+1]. With the causal mask, tokens can't see the future, and therefore you can calculate the loss from (output[i], input[i+1]). Thus you can train on N tokens in parallel from a single batch. In an RNN you need to call the network N times. In transformers you feed the entire batch and calculate the loss for every token in one go.

    • @lejoon
      @lejoon 8 місяців тому

      Sure, but that's not actually dependent on the causal mask itself but on how the attention works. You can drop the causal mask and train a model on missing tokens and it would still be parallelizable. @@etopowertwon

  • @clray123
    @clray123 8 місяців тому +17

    I'd say, if the RetNet "we don't need no softmax" trick turns out to be justified, it would be a huge embarrassment for the top LLM researchers out there who have been overlooking this for multiple years. But then, every great discovery appears trivial in hindsight. The worry is that this potential for embarrassment (and also the considerable sums of money already sunk because of the mistake) will actually stifle the adoption and testing of this new architecture.

    • @clray123
      @clray123 8 місяців тому +6

      Strangely enough, I don't see top researchers jumping to protect their reputation and assert "we have tried it without softmax before and it did not work for ... reason"...

    • @anonymousanon4822
      @anonymousanon4822 8 місяців тому +11

      Softmax will not turn out to be a huge embarrassment. It is incredibly cheap computationally and steers models to improve performance. It has driven the AI boom of the last 10 years. I'm not saying that it's impossible for it to be replaced or become redundant, but it is certainly not a "huge embarrassment"

    • @clray123
      @clray123 8 місяців тому +1

      @@anonymousanon4822 It's not about softmax in general, it's about softmax in attention scores.
      Designing an algorithm to be O(n^2) where O(n) suffices or O(n) where O(1) suffices SHOULD be a huge embarrassment for any CS expert.
      And because the models are monolithic beasts and you would have to retrain them to fix the mistake, it means that such a design error incurs very considerable costs on model training even if it pertains to the inference part of the algorithm (where it still incurs considerable costs on shared hosting and implementation complexity).
      Having such a mistake go unnoticed for 8 years while the faulty design has been pushed as brilliant would further add to the embarrassment.
      Everything hinges on whether the softmax in that location is really necessary or not for the (larger) models' performance.
      As Yannic mentioned, the jury is still out on that, but the competing researchers should all be scrambling to find out and to prove the RetNet authors wrong if they want to uphold their own reputation. Rather than ignore this as "just another paper" or, worse, insinuate that the authors have purposely or accidentally withheld some critical information to make the results appear in their favor (which might be the case, but the story is too big to simply assume so). Given that the design decision in question (now) quite obviously (and again, embarrassingly) appears to be lacking theoretical support, this is the least one would expect from intelligent and honest people who came up with it. Not investigating it would reflect negatively on the entire field.

    • @quAdxify
      @quAdxify 8 місяців тому

      Linearization has been attempted many times and never worked too well but maybe it does this time, who knows. I mean you can approximate any non-linear function by a piece-wise linear one if my memory serves me well.

    • @clray123
      @clray123 8 місяців тому

      @@quAdxify It would be interesting to know why the non-linear function was used at this place in the architecture to start with.
      But hey, such subtle design decisions, which cost hundreds of millions of dollars, are not really documented in AI engineering/research. There's probably a nice picture with it in someone's paper - and that's it.

  • @05xpeter
    @05xpeter 8 місяців тому

    I'm confused: the reason we invented neural networks was that linear models did not perform, and what I have seen time and time again is that if we don't introduce non-linearities, it is just a linear model under the hood with similar performance. There must be some non-linearities; a linear model can't perform this well. Looking forward to reading the paper.

    • @lennart6520
      @lennart6520 8 місяців тому +1

      I was first thinking the same and I believe the answer is that the FFNs in each layer are still non-linear.

  • @klaudialustig3259
    @klaudialustig3259 8 місяців тому +1

    Did anybody else notice the hardware they trained on? The paper says: 500 AMD MI200 GPUs

    • @fontende
      @fontende 8 місяців тому +2

      Plausible, A100 & H100 already sanctioned for any exportation into China, but recently also UAE was sanctioned by smuggling cards into China. A huge diplomatic scandal in fact.

  • @acasualviewer5861
    @acasualviewer5861 6 місяців тому

    I'm hoping someone pretrains a huge model that we can all play with. The architecture is one thing. The trained model is another. It would be great if we could replace Llama or RoBERTa or other heavier models with these that can be run on simpler machines.
    It definitely scales better. I trained a 234M parameter model on Google Colab with no problems (just 3 hours of my life). So it's doable.

  • @mtopscriterion
    @mtopscriterion 7 місяців тому

    Hey Yannic. Would RetNet be a solution for regression-based problems, as I am coding a transformer to work on stock market data? In other words, I don't want to overfit the model if this mostly offers benefits for classification/LLM-style settings.

  • @clray123
    @clray123 8 місяців тому +4

    Now, if I wanted to attack RetNet, my preferred angle would be that it might not play well with the concept of shared paged KV caches (see e.g. vLLM), so while it reduces the absolute amount of memory for a single session, it may not provide benefits when batching multiple sessions at scale. Not sure if this is really the case, but at least something that would need to be considered when pushing RetNet as a replacement for KV-cache-based architectures.

  • @hanyanglee9018
    @hanyanglee9018 8 місяців тому +1

    I predict that the next successor is called ReuNet

  • @nicholaswalker6996
    @nicholaswalker6996 8 місяців тому +3

    I've trained a number of medium GPT-2 transformers and RetNets, and RetNet training is slower (compared to a transformer with fast attention) and their perplexity is lower (which they admit in the paper - they say it gets better with larger networks). RetNets are much better for inference. There is something strange about the role of the GroupNorm which I'm trying to work out.......

    • @nicholaswalker6996
      @nicholaswalker6996 8 місяців тому +2

      Sorry Perplexity is larger - loss is worse.

    • @cajampa
      @cajampa 8 місяців тому

      @@nicholaswalker6996 You know you can edit a YouTube comment, right?

  • @FelixPetersen
    @FelixPetersen 8 місяців тому

    🎉

  • @JerryFederspiel
    @JerryFederspiel 8 місяців тому +1

    When I read this I just couldn't tell what was new enough that would produce the claimed jump in performance. It seemed kind of premature to declare this a transformer successor.

    • @clray123
      @clray123 8 місяців тому +3

      I see a problem in that everyone is TOO skeptical and nobody seems to be checking these claims as a result. People are still training old architectures because of this "that can't possibly work" assumption.

    • @etopowertwon
      @etopowertwon 8 місяців тому +1

      @@clray123 MS didn't release the weights, so it's not really possible to check results

    • @clray123
      @clray123 8 місяців тому +1

      @@etopowertwon They did release the implementation, though (and there's even an independent one), so you could produce your own results and become famous by disproving their exaggerated claims (if there are any).

    • @JerryFederspiel
      @JerryFederspiel 8 місяців тому

      @@clray123 That's a very reasonable concern.

    • @potential900
      @potential900 8 місяців тому +1

      @@clray123 So what's the Nvidia equivalent of 500 AMD MI200 GPUs?

  • @kan_drio
    @kan_drio 4 місяці тому

    I have a genuine, possibly dumb question. Regarding training parallelism, I don't understand why we care about it so much. When the training is parallel, it is also faster of course, but the memory usage is quadratic. Right? Why do we care about speed, but we don't care about the memory usage "exploding" quadratically? Both time and memory are costly resources, right?

  • @AlexanderGambaryan
    @AlexanderGambaryan 8 місяців тому +1

    Great video, wanted to send it to my best friend but he's got epilepsy 🤷🏿‍♂️

    • @clray123
      @clray123 8 місяців тому +1

      Just make him shut his eyes at strategic moments.

    • @mshonle
      @mshonle 8 місяців тому

      I was surprised he used the S word to describe the flickering screen, but that may be a cultural difference… think back to Tiger Woods' unfortunate use of the term in 2006. Other than that, I appreciate the video, but I wanted to raise awareness that this is a very sensitive term in the UK.

    • @AlexanderGambaryan
      @AlexanderGambaryan 8 місяців тому

      @@mshonle what's an S word?

    • @mshonle
      @mshonle 8 місяців тому

      @@AlexanderGambaryan I tried to include a link to the Tiger Woods incident from 2006, but I can't include links. You can search for the story to see what it was. In this context, it was in reference to the flickering in the video. I don't think there was any harm intended, but it is a sensitive term in the UK. It's kind of like how "the R word" was used a lot in the US in the 1980s as a general insult, but we don't say it now because that's not how we speak about disabilities.

    • @AlexanderGambaryan
      @AlexanderGambaryan 8 місяців тому

      @@mshonle I'm sorry, English is not my first (or second) language and I'm totally lost in the forest of S, R, N and any other words that people make such a big deal about

  • @ryanfranz6715
    @ryanfranz6715 8 місяців тому

    I don't know if this paper in particular will do it (it might, I'm just not sure), but it seems to me that, conceptually, there is no reason recurrence can't be re-applied to transformers after the fact in a way that still retains high performance. That's just a breakthrough away, and maybe this was it. Another thing I think is just one breakthrough away is the ability of a transformer-like architecture to "reason" over text input prior to producing an output. Transformers clearly show what I call "shallow reasoning" over the text, but they instantly spit out their "gut feeling" (which has become quite good in large models) without the ability to go back and think more or less carefully about specific pieces of text. Sure, there are prompting techniques that vaguely accomplish this, like asking it to list out its steps (which already presents an opportunity for progressive self-improvement), but I'm talking about having that much deeper in the architecture. I think there's nothing theoretically stopping it, just a breakthrough away, and if a huge model like GPT-4 suddenly had that ability… it's like if we had a chess computer that could only think one move ahead, and that by itself was already as good as most humans, and suddenly we let that computer think many moves ahead. I think we're in for some interesting stuff.

    • @potential900
      @potential900 8 місяців тому

      Have you looked at the recent Graph of Thoughts paper?

  • @fanero3046
    @fanero3046 8 місяців тому

    I have no idea how to picture how those math equations turn into an AI.

  • @Summersault666
    @Summersault666 8 місяців тому

    Are linear transformers like low-rank adaptation applied to a transformer?

    • @TheRyulord
      @TheRyulord 8 місяців тому +7

      No. It's a transformer variant that has linear time complexity with respect to sequence length instead of quadratic like in a vanilla transformer.

    • @Summersault666
      @Summersault666 8 місяців тому +2

      @@TheRyulord Well, LoRA has linear complexity too.

    • @Summersault666
      @Summersault666 8 місяців тому

      I read the paper; I think I understand what he means now. On every iteration you only recalculate the attention contribution of the newly predicted tokens, reusing the past ones you already calculated.
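
      That's roughly the recurrent view: each new token folds into a fixed-size state instead of re-attending over the whole history. A minimal sketch (the decay value, shapes, multi-scale heads and normalization are placeholders here, not the paper's exact setup):

```python
import torch

d = 64
gamma = 0.97                          # illustrative decay, not the paper's per-head values
state = torch.zeros(d, d)             # fixed-size summary of everything seen so far

def retention_step(state, q_t, k_t, v_t, gamma=gamma):
    # decay the old summary, fold in the new key/value outer product
    state = gamma * state + k_t.unsqueeze(1) @ v_t.unsqueeze(0)
    out_t = q_t @ state               # O(d^2) per token, independent of history length
    return out_t, state

for _ in range(5):                    # pretend we decode 5 tokens
    q_t, k_t, v_t = (torch.randn(d) for _ in range(3))
    out_t, state = retention_step(state, q_t, k_t, v_t)
```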

    • @TheRyulord
      @TheRyulord 8 місяців тому +2

      @@Summersault666 The additional cost of LoRAs during training is linear but the whole model is still quadratic if it's a regular transformer.
      It's also worth mentioning that the additional cost of LoRAs during inference can be zero because you can fuse the adapter into the model itself.

    • @Summersault666
      @Summersault666 8 місяців тому

      @@TheRyulord Yes, but you can build low-rank transformers by default, in which case it will be linear. But I understand now that the point he was referring to in the paper is that the computation is reused and only recalculated for each newly predicted output.

  • @fire17102
    @fire17102 8 місяців тому +1

    Has anyone seen anyone implementing an LLM with RetNet?

    • @donglixp
      @donglixp 8 місяців тому +2

      We are working on this.

    • @clray123
      @clray123 8 місяців тому +2

      I did some testing, but I am not in a position to train base LLMs. From what I see in small-scale training (~1.5B) - and without much formal validation - it has the same performance as equivalent-size transformers. It's certainly not "bullshit". What I notice is that it often repeats a single token ONCE (stutters), but it does not seem to suffer from the "repeat a token subsequence endlessly" problem as much as a transformer does (without artificial countermeasures such as a repeat penalty or banning seen n-grams). But this perhaps has to do with the improved positional embedding.

    • @clray123
      @clray123 8 місяців тому +1

      @@donglixp Would it be possible for you to retrain phi-1.5 using RetNet? Do the respective groups in Microsoft Research compete or cooperate?

  • @darogajee3286
    @darogajee3286 8 місяців тому

    Welcome Agent Smith

  • @julius4858
    @julius4858 8 місяців тому

    Can you make a video on the current state of OpenAssistant? Kinda out of the loop about the current landscape of open-source alternatives to ChatGPT.

  • @jawadmansoor6064
    @jawadmansoor6064 8 місяців тому +2

    Any open-source model based on this paper?

    • @fontende
      @fontende 8 місяців тому

      No, it's a 100% Chinese paper and closed research.

    • @jawadmansoor6064
      @jawadmansoor6064 8 місяців тому

      Microsoft is Chinese? @@fontende

    • @fontende
      @fontende 8 місяців тому

      @@jawadmansoor6064 It's not really tied to Microsoft: there aren't any foreign names on it, and Western companies are forbidden in China by law unless they have independent local offices, like offshores; there are many ways to get included just by associating one person with a local Chinese subsidiary of Microsoft. Financing by foreign companies is also forbidden; you must open a new local clone company.
      Microsoft tested many things in China, including AI you've never seen, but that was in the past. It's only still present in China because Windows is still used by the bureaucracy, but the recent unveiling of a "state OS" built on an Ubuntu Linux core is a sign for Microsoft to leave officially. Knowing their history with Google and the other giants (the Yahoo China scandal made it into Congressional hearings), there's no way they'll keep any presence in China.

    • @fontende
      @fontende 8 місяців тому

      @@jawadmansoor6064 Like all the Chinese research papers written in English (with only a few people in China speaking it) - and they produce tonnes of them every year - I think it's a race for promotions or salary bonuses. In the USSR nobody cared about translating anything into English except a few; in isolation it produced its own patents and research, also not published abroad, because many blatantly copied Western examples (it was even encouraged by high officials).
      Legally it's strange: international patents and copyright aren't enforceable in China (Michael Jordan lost his own trademark in court to local brand pirates), and if they publish internationally it won't be protected abroad either. There's no logic in doing this except collecting "social rating" and the few exchange programmes still alive.

  • @corgirun7892
    @corgirun7892 8 місяців тому +6

    The claim in this paper that RWKV cannot be trained in parallel is absolutely incorrect. RWKV can be trained in parallel, just like GPT.

    • @donglixp
      @donglixp 8 місяців тому

      BTW, mathematically you could also train it in parallel along the sequence. But as soon as you try, there are numerical issues; at the very least, it's non-trivial to solve.

    • @donglixp
      @donglixp 8 місяців тому

      I would not say it's an issue of RWKV itself; instead, it's caused by AFT.

  • @anmolt3840051
    @anmolt3840051 8 місяців тому

    Associative property
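
    That's the key observation: because the retention scores skip softmax, the matrix products can be regrouped. A tiny numeric check (sketched without the decay factor and causal mask, which are exactly what the recurrent and chunkwise forms take care of):

```python
import torch

T, d = 1024, 64                                   # sequence length, head dimension
q = torch.randn(T, d, dtype=torch.float64)
k = torch.randn(T, d, dtype=torch.float64)
v = torch.randn(T, d, dtype=torch.float64)

# attention-style grouping: materializes a T x T matrix -> O(T^2 * d)
out_quadratic = (q @ k.T) @ v

# regrouped via associativity: only a d x d "state" -> O(T * d^2)
out_linear = q @ (k.T @ v)

print(torch.allclose(out_quadratic, out_linear))  # True - no softmax in the way
```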

  • @GabrieldeOliveira86
    @GabrieldeOliveira86 8 місяців тому +1

    If we could just replace Transformers with RetNet, we'll soon have faster and smaller LLMs.

    • @lucamatteobarbieri2493
      @lucamatteobarbieri2493 8 місяців тому +2

      Early LLMs will soon be seen as dinosaurs. In a way, neural networks are a lossy compression of the training dataset. So yes, given the recent advancements, a huge increase in performance is to be expected, just as early compression algorithms got better and better through a series of improvements. In nature, brains specialize to save energy, so I expect something similar to happen in portable devices: at a certain point, extra intelligence becomes a waste of energy and resources.

  • @pvlis
    @pvlis 8 місяців тому +1

    I think they have misrepresented the difference between RetNet and RWKV. To me, it seems that at the end of the day, RetNet is to RWKV as multi-head attention is to Transformers. I believe RWKV still trains in parallel, just like RetNet.

    • @donglixp
      @donglixp 8 місяців тому

      You could refer to the RWKV implementation, which is not parallel along the sequence. This is also what RWKV claims in the paper, i.e., recurrent for both training and testing.

    • @donglixp
      @donglixp 8 місяців тому

      BTW, mathematically you could also train it in parallel along the sequence. But as soon as you try, there are numerical issues; at the very least, it's non-trivial to solve.

    • @donglixp
      @donglixp 8 місяців тому

      I would not say it's an issue of RWKV itself; instead, it's caused by AFT.

    • @donglixp
      @donglixp 8 місяців тому

      The key difference is that AFT/RWKV has a normalization over the keys, and it's also elementwise.
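
      A very loose sketch of that contrast, for intuition only (scalar decay, no time-dependent biases or gating, so this is neither the actual RWKV/AFT nor the actual RetNet update): the AFT/RWKV style keeps an elementwise, key-normalized running average, while retention keeps an unnormalized d x d matrix state:

```python
import torch

d = 8
decay = 0.9                                    # illustrative scalar decay

# AFT/RWKV-flavoured state: elementwise running numerator and denominator.
num, den = torch.zeros(d), torch.zeros(d)
# Retention-flavoured state: an unnormalized d x d matrix.
S = torch.zeros(d, d)

for _ in range(5):                             # a few decoding steps
    q_t, k_t, v_t = torch.randn(d), torch.randn(d), torch.randn(d)

    # elementwise update: exp(keys) are normalized by a running denominator
    num = decay * num + torch.exp(k_t) * v_t
    den = decay * den + torch.exp(k_t)
    out_elementwise = torch.sigmoid(q_t) * num / den

    # matrix-state update: no normalization of the keys
    S = decay * S + k_t.unsqueeze(1) @ v_t.unsqueeze(0)
    out_matrix = q_t @ S
```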

  • @Aldraz
    @Aldraz 8 місяців тому +1

    Woah, this will literally change the world. This single paper. I can already see this. It will change everything. Bet.

    • @ilusions4
      @ilusions4 8 місяців тому +3

      bet fam

    • @hermestrismegistus9142
      @hermestrismegistus9142 8 місяців тому +5

      I doubt it. There will likely be significant tradeoffs. Free lunches are rare.

    • @Aldraz
      @Aldraz 8 місяців тому

      @@hermestrismegistus9142 This seriously reminds me of the RWKV architecture. I've been on their Discord for a long time; it seems like everything is going great now with RWKV-5, and it will be better than transformers by all measures, both in training and inference. I don't see why this couldn't be even better.

    • @clray123
      @clray123 8 місяців тому

      @@hermestrismegistus9142 Based on that argument alone, you could literally stop all future research.

    • @hermestrismegistus9142
      @hermestrismegistus9142 8 місяців тому

      @@clray123 Why? Because there are tradeoffs? Cars generally aren't as strong as trucks, but that hasn't stopped them from succeeding in the niches where speed is more important.
      Linear attention may well have its niche, but it is computationally cheaper than traditional attention, which implies there is a tradeoff - unless the extra computation of traditional attention is completely wasteful, which I highly doubt. As I said, there are few free lunches.

  • @Timotheeee1
    @Timotheeee1 8 місяців тому

    Can you make a video about the InstaFlow paper?

  • @Seek_Solve
    @Seek_Solve 8 місяців тому

    The "Failed to fetch" error on HF Chat ain't a good look. It's a shame; I've been holding down Open Assistant since release. Hating it. All the best, hope you guys can get it sorted. I know there's gotta be a lot going on. NASA congrats.

  • @MCroppered
    @MCroppered 8 місяців тому

    Why are you wearing sunglasses though?

  • @Mulnader
    @Mulnader 8 місяців тому

    WTF am I doing watching this at 5:22 in the morning instead of going to sleep? It took me over 10 minutes to realize that I have no idea why Top Gun Guy calculates the alphabet and gets "cat" as the answer. #hyperfocus mode off!!!

  • @syncrossus
    @syncrossus 8 місяців тому +1

    So it's just a linear transformer with an explicit bias to the attention heads, cool.

  • @ssssssstssssssss
    @ssssssstssssssss 8 місяців тому +2

    I wish researchers would just state what the pros and cons are. If they don't state the cons, either they are being dishonest and keeping them from the reader, or they haven't done enough work yet.

    • @donglixp
      @donglixp 8 місяців тому +1

      The con is that we need to retrain our Transformers.

    • @clray123
      @clray123 8 місяців тому

      @@donglixp YES! And also admit that we may have been wrong about some basic stuff. Which is a GREAT barrier to some people, especially if their mistake made other people spend lots more money than they would otherwise have had to.

  • @christianszegedy3694
    @christianszegedy3694 8 місяців тому

    Your formula for softmax is a bit "original" - i.e., off - but nice explanation otherwise.

  • @mtopscriterion
    @mtopscriterion 7 місяців тому

    having to re-record is the worst

  • @kimchi_taco
    @kimchi_taco 8 місяців тому +1

    I think it's a good proof by counterexample of why softmax is needed: if there is no softmax, n key/value vectors are no different from a single vector 😂
    Of course, I don't believe their shady scalability claim.

  • @fontende
    @fontende 8 місяців тому

    So, a challenge from China: there isn't a single foreigner in the credits, only citizens of China. That they weren't able, or had no will, to invite any reputable foreign researcher drags the whole paper down for me (ideology, politics, military? racism?).
    I would take it with a huge grain of salt, considering the high corruption levels in Chinese academia, with free luxury apartments and expensive conference trips from the state but salaries at poverty level - most of the bureaucracy there is built like that.
    And "planned" socialist goals in all industries, including science, push people to produce even false papers to meet the state plan and get salary bonuses. As USSR history shows, the smartest people are smart enough to run away, which is why certain scientists there were basically kept as hostages; the local academia, with few exceptions, was always below average, because smart people are politically dangerous, and Russian scientists called this social method negative selection, like a scythe through the grass.
    The biggest mystery for me is why China even publishes all their papers in English and abroad. The USSR didn't do that; there was no point, since world patents and copyrights aren't enforceable in China (their courts have proved that), and the same goes abroad for publications from China - they declined all international agreements, even on driver's licenses (the Paris treaty).

  • @Adhil_parammel
    @Adhil_parammel 8 місяців тому

    New hype from China, like LK-99.

  • @swyveu
    @swyveu 8 місяців тому

    why the stupid sunglasses? They are too big btw...

  • @sagetmaster4
    @sagetmaster4 8 місяців тому +1

    Hype?

    • @justfoundit
      @justfoundit 8 місяців тому +1

      Probably not. There's probably no obvious reason the weights on the value vectors need to be actual percentages; I think any weighting can work, hence this solution.

  • @shaycray7662
    @shaycray7662 8 місяців тому

    Oh look, the woke censorious grifter is back. Nice!