What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study (Paper Explained)

  • Published 14 Jun 2024
  • #ai #research #machinelearning
    Online Reinforcement Learning is a flourishing field with countless methods for practitioners to choose from. However, each of those methods comes with a plethora of hyperparameter choices. This paper builds a unified on-policy framework, evaluates it on five continuous control tasks, and investigates the effects of these choices in a large-scale study. As a result, the authors come up with a set of recommendations for future research and applications.
    OUTLINE:
    0:00 - Intro & Overview
    3:55 - Parameterized Agents
    7:00 - Unified Online RL and Parameter Choices
    14:10 - Policy Loss
    16:40 - Network Architecture
    20:25 - Initial Policy
    24:20 - Normalization & Clipping
    26:30 - Advantage Estimation
    28:55 - Training Setup
    33:05 - Timestep Handling
    34:10 - Optimizers
    35:05 - Regularization
    36:10 - Conclusion & Comments
    Paper: arxiv.org/abs/2006.05990
    Abstract:
    In recent years, on-policy reinforcement learning (RL) has been successfully applied to many different continuous control tasks. While RL algorithms are often conceptually simple, their state-of-the-art implementations take numerous low- and high-level design decisions that strongly affect the performance of the resulting agents. Those choices are usually not extensively discussed in the literature, leading to discrepancy between published descriptions of algorithms and their implementations. This makes it hard to attribute progress in RL and slows down overall progress (Engstrom'20). As a step towards filling that gap, we implement over 50 such "choices" in a unified on-policy RL framework, allowing us to investigate their impact in a large-scale empirical study. We train over 250'000 agents in five continuous control environments of different complexity and provide insights and practical recommendations for on-policy training of RL agents.
    Authors: Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, Olivier Bachem
    Links:
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 32

  • @herp_derpingson
    @herp_derpingson 3 years ago +12

    I think PPO is a good candidate for a [Classic] paper.
    .
    0:00 So many authors! I think the authors are pooling their research GPU/TPU hours to make this research feasible.
    .
    19:45 If I remember correctly, these environments have an action space between -1 and 1. So perhaps tanh is better because it keeps the output in that range.
    .
    34:50 Oh yes, the fabled 3e-4. Wow, it also does its magic in reinforcement learning?

    • @bdennyw1
      @bdennyw1 3 years ago +3

      +1 on the PPO paper
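
    A minimal sketch of the PPO clipped surrogate objective mentioned in the thread above, written in PyTorch with illustrative tensor names; it is not code from the paper's framework.

        import torch

        def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
            """Clipped surrogate objective from PPO, negated so it can be minimized."""
            # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
            ratio = torch.exp(new_log_probs - old_log_probs)
            # Unclipped and clipped surrogate terms.
            unclipped = ratio * advantages
            clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
            # Taking the elementwise minimum removes the incentive to push the
            # ratio outside the [1 - eps, 1 + eps] interval.
            return -torch.min(unclipped, clipped).mean()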

  • @timofeyabramski492
    @timofeyabramski492 3 years ago +3

    Very new to your channel, but I have to say I love it. Keep up the great work; you get far too few views for such good work.

  • @hangzhiguo1857
    @hangzhiguo1857 3 years ago +1

    Very interesting paper and engaging explanation. I am wondering whether there exist similar papers investigating what matters in deep neural networks for supervised learning. Can someone list some?

  • @siddhantrai7529
    @siddhantrai7529 3 years ago +1

    Pretty good explanation. Which software do you use while reading research papers, like the one you used in the video? It would be really useful to have an assisting tool like that while reading papers.

  • @edbeeching
    @edbeeching 3 years ago +3

    Such a shame they did not test in more challenging partially observable environments with recurrent agents, where V-trace etc. would actually make a difference.

  • @firedrive45
    @firedrive45 3 years ago +1

    Yannic, what do you think about optics-based NNs? They use light temporal path efficiency as their backpropagation feedback and achieve much higher computational efficiency.

    • @YannicKilcher
      @YannicKilcher  3 years ago

      that's cool, but ultimately it will come down to dollars, not raw speed

  • @sedi4361
    @sedi4361 3 years ago +1

    Isn't tanh logically preferred as the policy network's activation function? I mean, our policy outputs the mean (and variance) of a distribution for each action. Using ReLU might be (and, as the paper shows, is) counterproductive for that task.

    • @YannicKilcher
      @YannicKilcher  3 years ago

      I agree with that intuition, but in deep learning, you can never know unless you test :)

    • @sedi4361
      @sedi4361 3 years ago

      @@YannicKilcher True, but I thought this was already standard for continuous action spaces; at least several other papers have shown it. I'm kind of disappointed by the large-scale papers Google has been doing lately.
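
    A minimal PyTorch sketch of the point made in this thread (layer sizes and the state-independent log-std are illustrative assumptions, not the paper's exact setup): squashing the Gaussian mean with tanh keeps it inside the [-1, 1] action bounds, whereas a ReLU output could only produce non-negative, unbounded means.

        import torch
        import torch.nn as nn

        class GaussianPolicy(nn.Module):
            def __init__(self, obs_dim, act_dim, hidden=64):
                super().__init__()
                # Small MLP trunk with tanh hidden activations.
                self.trunk = nn.Sequential(
                    nn.Linear(obs_dim, hidden), nn.Tanh(),
                    nn.Linear(hidden, hidden), nn.Tanh(),
                )
                self.mean_head = nn.Linear(hidden, act_dim)
                # One learnable log standard deviation per action dimension.
                self.log_std = nn.Parameter(torch.zeros(act_dim))

            def forward(self, obs):
                h = self.trunk(obs)
                # tanh keeps the predicted mean inside the [-1, 1] action range.
                mean = torch.tanh(self.mean_head(h))
                return torch.distributions.Normal(mean, self.log_std.exp())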

  • @drdca8263
    @drdca8263 3 years ago +3

    21:00 “The key recipe appears is to initialize [...]”? Should this say “The key recipe appears to be to initialize [...]”?

  • @jeffreylim5920
    @jeffreylim5920 3 years ago

    28:30 We should not use GAE with the PPO loss? This is surprising to me, since PPO always comes with GAE!

  • @jeffreylim5920
    @jeffreylim5920 3 years ago

    Is the code still not available?

  • @cycman98
    @cycman98 3 years ago +1

    32:00 I don't get this part. Why are they reusing old data? Wasn't it supposed to be on-policy RL?

    • @clee5653
      @clee5653 3 years ago +1

      These data are collected using only the latest version of the policy.

    • @cycman98
      @cycman98 3 years ago +2

      @@clee5653 It still doesn't work for me. At 32:00 he says: "you should always go back to this dataset, recompute these estimates with your current value network, then do the whole shuffling thing again and then do ANOTHER EPOCH and then basically come back here again and RECOMPUTE the advantages". But why do we recompute advantages on data from the previous epoch?

    • @clee5653
      @clee5653 3 years ago +1

      @@cycman98 If you read the paper (Section 3.5), you'll find that's an improvement to PPO that the authors propose. Computing the advantage requires the value estimate, so the advantage has to be recomputed at each iteration.

    • @cycman98
      @cycman98 3 years ago +1

      @@clee5653 ok, I will read the paper xd thank you
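
    A structural sketch, in Python-flavoured pseudocode, of the loop discussed in this thread; collect_rollouts, compute_gae, shuffle_and_split and update_minibatch are hypothetical placeholders, not the paper's API. The batch always comes from the latest policy, so training stays on-policy, but because the value network changes during the update, the advantages are recomputed with the current value network before every pass over that same batch.

        def train_iteration(policy, value_fn, env, num_passes=3):
            # On-policy: collect a fresh batch of rollouts with the current policy.
            batch = collect_rollouts(env, policy)
            for _ in range(num_passes):
                # Recompute value predictions and GAE advantages with the *current*
                # value network before each pass over the same batch.
                values = value_fn(batch.states)  # includes the bootstrap state
                advantages, returns = compute_gae(batch.rewards, values, batch.dones)
                # Shuffle, split into minibatches, and apply the PPO and value updates.
                for minibatch in shuffle_and_split(batch, advantages, returns):
                    update_minibatch(policy, value_fn, minibatch)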

  • @mdmishfaqahmed5523
    @mdmishfaqahmed5523 3 years ago +1

    number 7 will surprise you :D :D

  • @jonathanballoch
    @jonathanballoch 3 years ago

    Kinda frustrating that they didn't do TRPO, in light of the Madry group's NeurIPS 2020 paper (which shows that PPO's improvements over TRPO are mostly a result of improved implementations, not a better loss).

  • @jwstolk
    @jwstolk 3 years ago +2

    32:16 "It makes a lot of sense." Do you sell canvas prints?

  • @bishalsantra
    @bishalsantra 3 years ago +2

    What are GAE and V-trace?

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      GAE = Generalized Advantage Estimation, and V-trace comes from the IMPALA paper.
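
    To spell that out a bit: GAE (generalized advantage estimation) builds the advantage as a discounted, exponentially weighted sum of one-step TD errors, while V-trace (from the IMPALA paper) is a related estimator that additionally applies truncated importance weights to correct for off-policy data. A minimal, self-contained GAE sketch in PyTorch (argument names and shapes are illustrative):

        import torch

        def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
            """GAE over a single trajectory.

            rewards, dones: tensors of length T; values: length T + 1
            (the last entry is the bootstrap value of the final state).
            """
            T = rewards.shape[0]
            advantages = torch.zeros(T)
            gae = 0.0
            for t in reversed(range(T)):
                nonterminal = 1.0 - dones[t]
                # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
                delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
                # A_t = delta_t + gamma * lambda * A_{t+1}, cut at episode boundaries.
                gae = delta + gamma * lam * nonterminal * gae
                advantages[t] = gae
            returns = advantages + values[:-1]  # regression targets for the value net
            return advantages, returns

    With lam = 1 the sum telescopes to the discounted return minus the value baseline, and with lam = 0 it reduces to the one-step TD error.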