What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study (Paper Explained)

  • Published 14 Jun 2024
  • #ai #research #machinelearning
    Online Reinforcement Learning is a flourishing field with countless methods for practitioners to choose from. However, each of those methods comes with a plethora of hyperparameter choices. This paper builds a unified on-policy framework, evaluates it on five continuous control tasks, and investigates the effects of these choices in a large-scale study. As a result, the authors come up with a set of recommendations for future research and applications.
    OUTLINE:
    0:00 - Intro & Overview
    3:55 - Parameterized Agents
    7:00 - Unified Online RL and Parameter Choices
    14:10 - Policy Loss
    16:40 - Network Architecture
    20:25 - Initial Policy
    24:20 - Normalization & Clipping
    26:30 - Advantage Estimation
    28:55 - Training Setup
    33:05 - Timestep Handling
    34:10 - Optimizers
    35:05 - Regularization
    36:10 - Conclusion & Comments
    Paper: arxiv.org/abs/2006.05990
    Abstract:
    In recent years, on-policy reinforcement learning (RL) has been successfully applied to many different continuous control tasks. While RL algorithms are often conceptually simple, their state-of-the-art implementations take numerous low- and high-level design decisions that strongly affect the performance of the resulting agents. Those choices are usually not extensively discussed in the literature, leading to discrepancy between published descriptions of algorithms and their implementations. This makes it hard to attribute progress in RL and slows down overall progress (Engstrom'20). As a step towards filling that gap, we implement over 50 such "choices" in a unified on-policy RL framework, allowing us to investigate their impact in a large-scale empirical study. We train over 250'000 agents in five continuous control environments of different complexity and provide insights and practical recommendations for on-policy training of RL agents.
    Authors: Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, Olivier Bachem
    Links:
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 32

  • @herp_derpingson
    @herp_derpingson 3 years ago +12

    I think PPO is a good candidate for a [Classic] paper.
    .
    0:00 So many authors! I think the authors are pooling their research GPU/TPU hours to make this research feasible.
    .
    19:45 If I remember correctly, these environments have an action space between -1 and 1. So perhaps tanh is better because it keeps the output in that range.
    .
    34:50 Oh yes, the fabled 3e-4. Wow, it also does its magic in reinforcement learning?

    • @bdennyw1
      @bdennyw1 3 years ago +3

      +1 on the PPO paper
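
    A minimal sketch of the PPO clipped surrogate objective mentioned in the thread above, written in PyTorch with illustrative tensor names; it is not code from the paper's framework.

        import torch

        def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
            """Clipped surrogate objective from PPO, negated so it can be minimized."""
            # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
            ratio = torch.exp(new_log_probs - old_log_probs)
            # Unclipped and clipped surrogate terms.
            unclipped = ratio * advantages
            clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
            # Taking the elementwise minimum removes the incentive to push the
            # ratio outside the [1 - eps, 1 + eps] interval.
            return -torch.min(unclipped, clipped).mean()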

  • @timofeyabramski492
    @timofeyabramski492 3 years ago +3

    Very new to your channel, but I have to say I love it. Keep up the great work; you get far too few views for such good work.

  • @hangzhiguo1857
    @hangzhiguo1857 3 years ago +1

    Very interesting paper and engaging explanation. I am wondering whether there exist similar papers investigating what matters in deep neural networks for supervised learning. Can someone list some?

  • @siddhantrai7529
    @siddhantrai7529 3 years ago +1

    Pretty good explanation. Which software do you use while reading research papers, like the one you used in the video? It would be really useful to have an assisting tool like that while reading papers.

  • @edbeeching
    @edbeeching 3 years ago +3

    Such a shame they did not test in more challenging partially observable environments with recurrent agents, where V-trace etc. would actually make a difference.

  • @firedrive45
    @firedrive45 3 years ago +1

    Yannic, what do you think about optics-based NNs? They use light temporal path efficiency as their backpropagation feedback and achieve much higher computational efficiency.

    • @YannicKilcher
      @YannicKilcher  3 years ago

      that's cool, but ultimately it will come down to dollars, not raw speed

  • @sedi4361
    @sedi4361 3 years ago +1

    Isn't tanh logically preferred as the policy network's activation function? I mean, our policy outputs the mean (and variance) of a distribution for each action. Using ReLU might be (and, as the paper shows, is) counterproductive for that task.

    • @YannicKilcher
      @YannicKilcher  3 years ago

      I agree with that intuition, but in deep learning, you can never know unless you test :)

    • @sedi4361
      @sedi4361 3 years ago

      @@YannicKilcher True, but I thought this was already standard for continuous action spaces; at least several other papers have shown it. I'm kind of disappointed by the large-scale papers Google has been doing lately.
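
    A minimal PyTorch sketch of the point made in this thread (layer sizes and the state-independent log-std are illustrative assumptions, not the paper's exact setup): squashing the Gaussian mean with tanh keeps it inside the [-1, 1] action bounds, whereas a ReLU output could only produce non-negative, unbounded means.

        import torch
        import torch.nn as nn

        class GaussianPolicy(nn.Module):
            def __init__(self, obs_dim, act_dim, hidden=64):
                super().__init__()
                # Small MLP trunk with tanh hidden activations.
                self.trunk = nn.Sequential(
                    nn.Linear(obs_dim, hidden), nn.Tanh(),
                    nn.Linear(hidden, hidden), nn.Tanh(),
                )
                self.mean_head = nn.Linear(hidden, act_dim)
                # One learnable log standard deviation per action dimension.
                self.log_std = nn.Parameter(torch.zeros(act_dim))

            def forward(self, obs):
                h = self.trunk(obs)
                # tanh keeps the predicted mean inside the [-1, 1] action range.
                mean = torch.tanh(self.mean_head(h))
                return torch.distributions.Normal(mean, self.log_std.exp())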

  • @drdca8263
    @drdca8263 3 years ago +3

    21:00 “The key recipe appears is to initialize [...]”? Should this say “The key recipe appears to be to initialize [...]”?

  • @jeffreylim5920
    @jeffreylim5920 3 years ago

    28:30 We should not use GAE with the PPO loss? This is surprising to me, since PPO always comes with GAE!

  • @jeffreylim5920
    @jeffreylim5920 3 years ago

    Is the code still not available?

  • @cycman98
    @cycman98 3 years ago +1

    32:00 I don't get this part. Why are they reusing old data? Wasn't it supposed to be on-policy RL?

    • @clee5653
      @clee5653 3 years ago +1

      These data are collected using only the latest version of the policy.

    • @cycman98
      @cycman98 3 years ago +2

      @@clee5653 It still doesn't work for me. At 32:00 he says: "you should always go back to this dataset, recompute these estimates with your current value network, then do the whole shuffling thing again and then do ANOTHER EPOCH and then basically come back here again and RECOMPUTE the advantages". But why do we recompute advantages on data from the previous epoch?

    • @clee5653
      @clee5653 3 years ago +1

      @@cycman98 If you read the paper (Section 3.5), you'll find that's an improvement to PPO that the authors propose. Computing the advantage requires the value estimate, so the advantage has to be recomputed at each iteration.

    • @cycman98
      @cycman98 3 years ago +1

      @@clee5653 ok, I will read the paper xd thank you
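
    A structural sketch, in Python-flavoured pseudocode, of the loop discussed in this thread; collect_rollouts, compute_gae, shuffle_and_split and update_minibatch are hypothetical placeholders, not the paper's API. The batch always comes from the latest policy, so training stays on-policy, but because the value network changes during the update, the advantages are recomputed with the current value network before every pass over that same batch.

        def train_iteration(policy, value_fn, env, num_passes=3):
            # On-policy: collect a fresh batch of rollouts with the current policy.
            batch = collect_rollouts(env, policy)
            for _ in range(num_passes):
                # Recompute value predictions and GAE advantages with the *current*
                # value network before each pass over the same batch.
                values = value_fn(batch.states)  # includes the bootstrap state
                advantages, returns = compute_gae(batch.rewards, values, batch.dones)
                # Shuffle, split into minibatches, and apply the PPO and value updates.
                for minibatch in shuffle_and_split(batch, advantages, returns):
                    update_minibatch(policy, value_fn, minibatch)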

  • @mdmishfaqahmed5523
    @mdmishfaqahmed5523 3 years ago +1

    number 7 will surprise you :D :D

  • @jonathanballoch
    @jonathanballoch 3 years ago

    Kinda frustrating that they didn't do TRPO, in light of the Madry group's NeurIPS 2020 paper (which shows that PPO's improvements over TRPO are mostly a result of improved implementations, not a better loss).

  • @jwstolk
    @jwstolk 3 years ago +2

    32:16 "It makes a lot of sense." Do you sell canvas prints?

  • @bishalsantra
    @bishalsantra 3 years ago +2

    What are GAE and V-trace?

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      GAE = Generalized Advantage Estimation, and V-trace comes from the IMPALA paper.
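
    To spell that out a bit: GAE (generalized advantage estimation) builds the advantage as a discounted, exponentially weighted sum of one-step TD errors, while V-trace (from the IMPALA paper) is a related estimator that additionally applies truncated importance weights to correct for off-policy data. A minimal, self-contained GAE sketch in PyTorch (argument names and shapes are illustrative):

        import torch

        def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
            """GAE over a single trajectory.

            rewards, dones: tensors of length T; values: length T + 1
            (the last entry is the bootstrap value of the final state).
            """
            T = rewards.shape[0]
            advantages = torch.zeros(T)
            gae = 0.0
            for t in reversed(range(T)):
                nonterminal = 1.0 - dones[t]
                # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
                delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
                # A_t = delta_t + gamma * lambda * A_{t+1}, cut at episode boundaries.
                gae = delta + gamma * lam * nonterminal * gae
                advantages[t] = gae
            returns = advantages + values[:-1]  # regression targets for the value net
            return advantages, returns

    With lam = 1 the sum telescopes to the discounted return minus the value baseline, and with lam = 0 it reduces to the one-step TD error.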