Model Based RL Finally Works!

  • Published Oct 5, 2024

COMMENTS • 66

  • @Geosquare8128 · 1 year ago +6

    great explanation! thanks

  • @-mwolf · 1 year ago +8

    Wow! Exactly what I was hoping to see / implement myself one day. Very impressive

    • @RS-303 · 1 year ago +3

      Yes, the ability to train a model with fixed hyperparameters on a wide range of environments is a significant accomplishment in the field of reinforcement learning. The ability to generalize to unseen tasks is a key goal of AI research, and the DreamerV3 algorithm is a step in that direction. However, it is important to note that the real-world applications of this technology, particularly in the field of autonomous systems, raise important ethical concerns that should be considered.

    • @mgetommy · 1 year ago

      @@RS-303 chatgpt ass comment

    • @Nahte001 · 1 year ago +13

      @@RS-303 chatgpt response fr

  • @FuzzyJeffTheory · 1 year ago +3

    Great explanations of the different components of the loss and the diagrams. It’s really cool the hyperparameters are fixed

  • @simonyin9229 · 1 year ago +7

    I feel this approach has many of the ingredients necessary for animalistic intelligence. It goes back to the Bayesian brain theory, which frames the brain as a machine that builds a model of the world to predict future sensory input and uses its surprise to improve the model. I believe a level-2 intelligent system will need to 1. have a goal or goals, 2. be able to generate predictions of future sensory inputs, and 3. be able to perform actions on the environment to see how the actions influence the predictions. I think all of these are necessary to have a hypothesis-testing-based learning cycle, which I think will lead us to artificial consciousness. I see all of these requirements represented in this approach, which makes me very excited for the future.
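
    The predict-compare-update loop described here can be sketched in a few lines (a toy illustration with a made-up linear model and learning rate, not anything from DreamerV3):

      import numpy as np

      rng = np.random.default_rng(0)
      true_dynamics = np.array([0.5, -1.0, 2.0])   # unknown mapping from input to next input
      w = np.zeros(3)                              # the agent's "world model"
      lr = 0.1

      for step in range(1000):
          x = rng.normal(size=3)                   # current sensory input
          target = x @ true_dynamics               # next sensory input actually observed
          prediction = w @ x                       # model's guess at the next input
          surprise = target - prediction           # prediction error = "surprise"
          w += lr * surprise * x                   # update the model to be less surprised next time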

  • @kristoferkrus · 1 year ago +7

    11:37 The symlog function is definitely fully differentiable.
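
    A quick numerical check (a minimal sketch, using the paper's definition symlog(x) = sign(x) * ln(1 + |x|); its derivative 1 / (1 + |x|) is defined everywhere, including at 0, where symlog behaves like the identity):

      import numpy as np

      def symlog(x):
          # symlog(x) = sign(x) * ln(1 + |x|): compresses large magnitudes, ~identity near 0
          return np.sign(x) * np.log1p(np.abs(x))

      def symlog_grad(x):
          # closed-form derivative, valid for every x (equal to 1 at x = 0)
          return 1.0 / (1.0 + np.abs(x))

      for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
          eps = 1e-6
          numeric = (symlog(x + eps) - symlog(x - eps)) / (2 * eps)
          print(x, symlog_grad(x), numeric)   # analytic and central-difference values agree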

  • @petemoss3160 · 1 year ago +1

    today they're mining diamonds... tomorrow building Redstone botnets.
    what a time to be alive!

  • @coolhead20 · 1 year ago

    This is a super helpful breakdown. Thank you!

  • @NoNameAtAll2 · 1 year ago +10

    Can you cover the forward-forward algorithm?
    The net that doesn't do backpropagation.

    • @EdanMeyer · 1 year ago +8

      Already in the works 😉

  • @giantbee9763 · 1 year ago +3

    Why don't we see curiosity being used in the Dreamer series? I think that could be one way to get past stochastic sampling not getting to break blocks. There is, after all, a visual cue.

    • @RS-303 · 1 year ago +1

      DreamerV2 and DreamerV3 do not use explicit curiosity-driven exploration. Instead, they use a combination of intrinsic motivation techniques to drive exploration, such as predicting the future state of the world and regularizing the policy towards more random actions.

    • @VctorDeInflictor · 1 year ago +3

      @@RS-303 I am very new to this topic, but you sparked my curiosity.
      Wouldn't just a small change be needed to integrate curiosity into this kind of technique? Something along the lines of:
      Use intrinsic motivation techniques to drive exploration; however, if you get too good at predicting a certain action, make repeating that action a lot less rewarding (something like boredom for the machine), so it is better for it to try to predict a different action linked with a different outcome, and it learns different stuff. This seems feasible in my mind, since when you click a single block in Minecraft it starts changing its model (some cracks slowly start appearing), so it might get curious about it after trying it one too many times.
      This seems better to me than just trying to push it to take random actions every once in a while, since you don't want it to do a random action after clicking a block for a millisecond; you want it to be curious about the cracks, and for it to try to predict what would happen if it keeps clicking.
      I'm very curious about this, and I have no experience in this field, so I would love to hear why this is not the way it is implemented.
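
      A minimal sketch of that "boredom" idea (purely illustrative; the names and the scale factor are invented here, and this is not how Dreamer actually explores):

        import numpy as np

        def intrinsic_bonus(predicted_next_obs, actual_next_obs, scale=0.1):
            # Curiosity bonus proportional to the world model's prediction error.
            # Once the model predicts an action's outcome well, the bonus decays
            # toward 0, which is exactly the "boredom" effect described above.
            error = np.mean((predicted_next_obs - actual_next_obs) ** 2)
            return scale * error

        # Schematic use during training (all names are placeholders):
        # total_reward = env_reward + intrinsic_bonus(predicted_next_obs, next_obs)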

  • @kwillo4 · 1 year ago +2

    This was awesome! Thank you for the explanation. I still want to know the details of the neural net architectures, and whether they use PPO or something better. Will check the paper now.

    • @RS-303 · 1 year ago +3

      DreamerV3 utilizes a hierarchical architecture that includes both a low-level and a high-level controller. The low-level controller is responsible for handling sensorimotor information and the high-level controller is responsible for handling abstract reasoning and planning. DreamerV3 uses a more sophisticated sensorimotor architecture, which includes a 3D environment embedding to handle 3D inputs, not present in DreamerV2. The sensorimotor architecture includes a control network, which generates the control signals that are sent to the actuators. In DreamerV3, the control network is composed of several layers of fully connected layers, which are designed to learn the mapping from the high-level representation of the state to the control signals. DreamerV3 uses a distributional reinforcement learning algorithm, which models the distribution of returns rather than the expected return. This allows for more accurate estimation of value and better handling of rare events. DreamerV3 also utilizes a more advanced state representation, which is able to reason about objects and their properties in a more flexible manner. DreamerV3 uses an actor-critic architecture with the PPO algorithm. PPO is used to optimize the policy and SAC is used to optimize the value function. The combination of PPO and SAC allows the agent to learn more efficiently and effectively, particularly in challenging environments with high-dimensional state spaces and stochastic dynamics.
      DreamerV3 uses a variant of the PPO algorithm called Distributional PPO (DPPO). In DPPO, the algorithm learns a distribution of the values for each state-action pair instead of a single value. This allows for better handling of uncertainty and improves the stability and sample efficiency of the algorithm. This is achieved by using a quantile regression technique, where the model tries to learn the quantiles of the value distribution for each state-action pair instead of the mean value. This allows the model to better handle extreme events and rare scenarios which is important in real-world environments like industrial robots.
      In reinforcement learning, distributional learning is a method for estimating the distribution of returns for a given policy or value function, rather than just the expected return. This can provide a more complete understanding of the uncertainty and potential variability of the returns. The distributional perspective can also be used to improve the stability and robustness of learning algorithms. In DreamerV3, the authors use distributional reinforcement learning to estimate the full distribution of returns for different actions, rather than just the expected return. They also use a novel distributional value function to improve the stability and robustness of the learning process. Additionally, the authors use a technique called quantile regression to estimate the distribution of returns, which allows for more accurate estimation of the tails of the distribution and improved performance in rare or out-of-distribution states.

    • @kwillo4 · 1 year ago

      @@RS-303 thank you very much! That makes it a lot clearer :)

  • @bluehorizon9547 · 1 year ago +2

    In such papers it is always about defining a path of steps for the AI in such a way that its minimal ability to learn can "jump" across the gaps; if the steps are dense enough, even a dumb genetic algorithm will find diamonds.

  • @thesofakillers · 1 year ago +1

    Is the model-based nature of the work instrumental? I wonder if similar generalisation can be achieved with model-free approaches. At 17:40 it seems like you're suggesting that the model-based portion of the work is indeed the important one.

    • @EdanMeyer · 1 year ago +2

      It is clear that the model-based component allows for a high level of sample efficiency. It may also help as a sort of auxiliary task that helps learn features, but the degree to which that is the case is unclear.
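
      For intuition on the sample-efficiency point: once a world model is learned, the actor can be trained on imagined rollouts instead of real environment steps. A schematic sketch, with a toy linear model standing in for the learned latent dynamics (none of this is the actual DreamerV3 code):

        import numpy as np

        A = np.array([[0.9, 0.1], [0.0, 0.95]])   # stands in for the learned latent dynamics
        B = np.array([[0.0], [0.5]])

        def learned_dynamics(z, a):
            return A @ z + B @ a                   # predicted next latent state, no env call

        def learned_reward(z):
            return float(-z @ z)                   # stands in for the learned reward head

        def imagine(z0, policy, horizon=15):
            # Roll the policy out entirely inside the model: many "imagined" trajectories
            # can be generated from a single real starting state, which is where the
            # sample efficiency comes from.
            z, total = z0, 0.0
            for _ in range(horizon):
                a = policy(z)
                z = learned_dynamics(z, a)
                total += learned_reward(z)
            return total                           # imagined return used to update the actor

        policy = lambda z: np.array([-0.3 * z[1]])
        print(imagine(np.array([1.0, 1.0]), policy))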

  • @nowithinkyouknowyourewrong8675

    It's interesting that IRIS kept up, since it's much simpler. I hope that Dreamer v4 tries simplifying and consolidating the tricks.

  • @RaoBlackWellizedArman · 1 year ago

    symlog is differentiable. It doesn't matter if the individual components are not.

  • @RickeyBowers · 1 year ago

    Danijar Hafner, "Note that a reward is provided only for the first diamond per episode, so the agent is not incentivized to pick up additional diamonds."
    I don't know if this also implies no other rewards were given. Looking forward to the code release.

    • @EdanMeyer · 1 year ago +2

      Rewards are given for several milestones leading up to diamonds; there is a standard benchmark you can find online and a paper with details.

  • @Anders01 · 1 year ago +1

    DreamerV3 seems like a step towards artificial general intelligence (AGI). The AI can learn by itself like how a human child learns by itself. And DreamerV3 could be bootstrapped with a transformer architecture so that it starts in a safer way before learning by itself on the internet and in the real world.

    • @RS-303 · 1 year ago +2

      DreamerV3 is indeed a step towards AGI, in the sense that it is a model-based RL agent that can learn to perform a wide range of tasks from raw visual inputs, without any task-specific supervision. However, it is still a long way from achieving true AGI, which would require the ability to learn and reason about a wide range of concepts and domains, as well as the ability to understand and use natural language.
      Bootstrapping DreamerV3 with a transformer architecture could be beneficial as it could provide the agent with a better ability to process and understand visual inputs, which would make it easier for the agent to learn the task. Additionally, using a transformer architecture can also help in providing the agent with some sort of prior knowledge that could help it to learn the task more efficiently. However, it is important to note that this would still require a lot of research and development to be done in order to build a safe and robust AGI agent.

  • @thekiwininja99 · 2 months ago

    I'm confused, how is this considered model-based? Doesn't model-based mean that a mapping from each possible state and action to a transition probability over next states exists? That clearly isn't the case here. Did the definition change, or...?

  • @Okarin_Time_Wizard · 1 year ago

    Great video! I must ask, what PDF viewer are you using? It's not Adobe, right?

  • @youngsdiscovery8909 · 1 year ago +1

    Nice... I don't have the computing power, though.

  • @dianboliu505 · 5 months ago

    I wonder if the decoder is really needed

    • @christrifinopoulos8639 · 2 months ago

      The decoder exists to evaluate the efficiency of the latent representation. For example, when the encoder uses an encoding that discards all the important information, the decoder fails to reconstruct the input, forcing the model to learn a more efficient encoding.
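
      In other words, the reconstruction loss is what pushes the latent code to keep information. A minimal autoencoder-style sketch with toy linear maps (not the Dreamer architecture; sizes and learning rate are arbitrary):

        import numpy as np

        rng = np.random.default_rng(0)
        enc = rng.normal(size=(4, 16)) * 0.1      # toy encoder: 16-dim observation -> 4-dim latent
        dec = rng.normal(size=(16, 4)) * 0.1      # toy decoder: latent -> reconstructed observation
        lr = 0.01

        for _ in range(2000):
            x = rng.normal(size=16)               # observation
            z = enc @ x                           # latent code
            x_hat = dec @ z                       # reconstruction
            err = x_hat - x                       # reconstruction error drives both maps:
            grad_dec = np.outer(err, z)           # if z discarded useful information, this
            grad_enc = np.outer(dec.T @ err, x)   # error stays high and the encoder is
            dec -= lr * grad_dec                  # pushed to keep that information in the
            enc -= lr * grad_enc                  # latent code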

  • @billykotsos4642 · 1 year ago +1

    200M is tiny compared to some of the NLP models out there...

    • @RS-303 · 1 year ago +2

      Yes, that is correct. 200M parameters is considered a relatively small model size compared to some of the large language models used in natural language processing (NLP) tasks, which can have billions or trillions of parameters. However, for reinforcement learning tasks, 200M parameters is still considered a large model size and allows for significant improvement in performance and data efficiency.

  • @drdca8263 · 1 year ago +1

    If changing the action for a few frames is a problem, what if you changed the representation of actions such that there is “click” (mouse down and then mouse up), and separately, “mouse down” and “mouse up”, where if you take the “mouse down” action, it stays down until you either take the “mouse up” action, or you take the “click” action?

    • @EdanMeyer · 1 year ago +6

      If the goal was just to do well in Minecraft, then there are several simple approaches like what you mention. The thing is, I would imagine the authors are more interested in building an algorithm that works generally across many problems, so changing specific environments just to get good numbers on them defeats the purpose of evaluating on a range of environments in many cases.

    • @drdca8263 · 1 year ago

      @@EdanMeyer I guess, but it seems like a smaller modification to me than modifying the game to make blocks mine faster? But maybe not.

    • @deltamico · 1 year ago +2

      Or, if you find yourself often doing a specific sequence of actions, you wrap the sequence into one action. Once you overcome learning that mining is beneficial, you'll be able to call the mine sequence and continue learning with this new ability. It's more likely you'll find out about hardness this way. I also feel the compressibility of the sequence might be correlated with how difficult it would be to come up with it.
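
      A minimal sketch of wrapping a repeated primitive into a single macro-action (illustrative only; it assumes a generic env whose step(action) returns (obs, reward, done, info), not the actual MineRL interface):

        class MacroActionWrapper:
            """Expose one extra action that replays a fixed primitive sequence,
            e.g. 'attack' held for 40 steps to break a block."""

            def __init__(self, env, macro, n_primitive_actions):
                self.env = env
                self.macro = macro                   # list of primitive actions
                self.macro_id = n_primitive_actions  # index of the new macro-action

            def step(self, action):
                if action != self.macro_id:
                    return self.env.step(action)
                total_reward, obs, done, info = 0.0, None, False, {}
                for a in self.macro:                 # replay the sequence as one decision
                    obs, reward, done, info = self.env.step(a)
                    total_reward += reward
                    if done:
                        break
                return obs, total_reward, done, info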

  • @alozzk · 1 year ago

    A great watch!
    I want to ask though, could you link me to a reference to learn about the distributional learning paradigm? I toyed with a deep Q-learning model for a card game for a while to learn about RL, but I think that precise issue of having a stochastic environment whose rewards vary holds my model down.

    • @RS-303 · 1 year ago +1

      DreamerV3 is based on the distributional learning paradigm, which is a way of representing and learning from the distribution of rewards in a reinforcement learning setting. This is in contrast to traditional Q-learning, which only represents the expected reward for each state-action pair. The distributional learning paradigm aims to provide a more complete representation of the reward distribution, which can lead to more robust and efficient learning in environments where rewards are stochastic or vary widely. DreamerV3 uses this paradigm by using the symlog function to compress the magnitude of rewards, and by training the world model to predict not only the expected reward but also the full distribution of rewards. This allows DreamerV3 to learn more efficiently in environments with varying reward distributions.
      One of the main references for this paradigm is "A Distributional Perspective on Reinforcement Learning" by Bellemare, Dabney, and Munos (2017). This paper introduces the Categorical DQN algorithm, which uses a categorical distribution to model the distribution of returns.
      Other references include "Distributional Reinforcement Learning with Quantile Regression" by Dabney et al. (2018) and "Rainbow: Combining Improvements in Deep Reinforcement Learning" by Hessel et al. (2018), which also use the distributional learning paradigm and improve upon the Categorical DQN algorithm.
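
      As a concrete example of the categorical idea in this line of work, a scalar return target can be spread over fixed value bins (a "two-hot" target) and the critic trained with a cross-entropy loss instead of scalar regression. A minimal sketch; the bin range here is arbitrary:

        import numpy as np

        def two_hot(target, bins):
            # Put all probability mass on the two bins bracketing the target, weighted by
            # distance, so that sum(probs * bins) recovers the original scalar exactly.
            target = np.clip(target, bins[0], bins[-1])
            k = min(np.searchsorted(bins, target, side="right") - 1, len(bins) - 2)
            w_hi = (target - bins[k]) / (bins[k + 1] - bins[k])
            probs = np.zeros(len(bins))
            probs[k], probs[k + 1] = 1.0 - w_hi, w_hi
            return probs

        bins = np.linspace(-20.0, 20.0, 41)   # fixed support of the return distribution
        p = two_hot(3.7, bins)                # mass 0.3 on bin 3.0 and 0.7 on bin 4.0
        print(p @ bins)                       # 3.7, the expected value is preserved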

    • @alozzk · 1 year ago

      @@RS-303 wow, thanks, I really appreciate the thought-out response!!

  • @Xavier-es4gi · 1 year ago +11

    symlog looks pretty differentiable to me

    • @deepdata1 · 1 year ago +1

      It's only differentiable once.

    • @EdanMeyer · 1 year ago +6

      The individual operators, absolute value and sign, are not differentiable (at least not at 0). Perhaps there's another reason but I don't know what that would be.

    • @Xavier-es4gi · 1 year ago +2

      @@EdanMeyer Also thanks for this great video!

    • @RS-303 · 1 year ago +5

      @@EdanMeyer
      Yes, the individual operators absolute value and sign are not differentiable at 0. However, in DreamerV3, the authors use a variant of the symlog function which is differentiable at all points. The symlog function uses a combination of logarithm and exponential functions, and is designed to compress large values while preserving the details of small values. This allows the model to efficiently learn from both large and small values in the inputs. Additionally, by using the symlog function in this way, the authors can avoid issues with the non-differentiability of the absolute value and sign operators when computing gradients during training.

    • @dimitriognibene8945 · 1 year ago +1

      @@EdanMeyer I think that if you apply the definition of differentiability directly, instead of using the decomposition, differentiability is easy to prove.

  • @ethanwmonster9075 · 1 year ago +2

    Hmmm, collecting diamonds without training data, from scratch... woah, that's a big deal.

    • @RS-303 · 1 year ago

      Yes, the ability for DreamerV3 to collect diamonds in the popular video game Minecraft from scratch, given only sparse rewards, is a significant achievement in the field of artificial intelligence. It demonstrates the algorithm's robustness and scalability, as well as its ability to learn in a complex, procedurally generated environment without the need for human data or domain-specific heuristics. Additionally, the fact that DreamerV3 is able to accomplish this task using the same hyperparameters across all domains, and outperforming specialized model-free and model-based algorithms in a wide range of benchmarks and data-efficiency regimes, further highlights the algorithm's potential as a general-purpose reinforcement learning solution.

    • @Nnm26 · 1 year ago

      @@RS-303 did you write this using chatgpt?

    • @RS-303 · 1 year ago

      @@Nnm26
      No, chatgpt KAI. Can't you write using chatgpt?

  • @zramsey11 · 1 month ago

    Being the first independent agent to mine diamonds in Minecraft is such a flex, and the authors know it.

  • @billykotsos4642 · 1 year ago

    I counted at least 5 nets? Am I wrong? This is just too complex to actually work!

  • @billykotsos4642 · 1 year ago +1

    Do I still need a quantum computer to run this? :(

    • @georgesmith4768 · 1 year ago +3

      It's reinforcement learning. You're probably going to want at least a datacenter.

    • @painperdu6740 · 1 year ago

      @@georgesmith4768 Even for inference, and not just training?

    • @georgesmith4768 · 1 year ago +4

      @@painperdu6740 Depends on the model, but it's really just the training that's a problem. Inference usually isn't any worse than for anything else. You'll probably get to use some actual RL models soon; it's just making them yourself that will be an expensive pain.

    • @painperdu6740 · 1 year ago

      @@georgesmith4768 True, especially from scratch. But is fine-tuning an RL model just like fine-tuning a feed-forward NN, or is the process completely different?

    • @howuhh8960 · 1 year ago +4

      No, all experiments in this paper were done on a single V100, so it is computationally cheap to say the least.

  • @dancar2537 · 1 year ago

    This is unrealistic. Your expectation of finding diamonds is unrealistic. This is just a paper showing an advance in searching a bit further while taking fewer steps, or being simpler to implement. Diamonds are a human discovery, and it is common knowledge that they lie buried deep down. So it should ask ChatGPT, which has pre-trained knowledge of where to find diamonds, and start digging. A fundamentally different approach should be taken.