When tuning PPO, you should always track your KL divergence, since a high KL divergence may indicate over-exploration.
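(In SB3's PPO you get this more or less for free: with verbose logging it reports an approximate KL per update, and you can also cap it via target_kl. A minimal sketch, with the environment and threshold chosen only for illustration:)

```python
from stable_baselines3 import PPO

# target_kl stops the epoch loop early when the approximate KL between the
# updated and the old policy grows too large; the value here is illustrative.
model = PPO("MlpPolicy", "CartPole-v1", target_kl=0.03, verbose=1)
model.learn(total_timesteps=50_000)  # logged output includes train/approx_kl
```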
Awesome video! Can you also talk about how to tune these hyperparameters generally? It would be very helpful!
Hello Sir. Do you have any insight about the "use_sde" parameter in PPO in Stable Baselines3? It supposedly activates generalized State-Dependent Exploration (gSDE), but I did not find any clear results about the pros and cons of this.
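(For anyone wondering where that switch even lives: a minimal, untested sketch of enabling gSDE in Stable Baselines3 PPO. The environment and numbers are just placeholders, and gSDE only applies to continuous action spaces.)

```python
from stable_baselines3 import PPO

# Placeholder env and values, only to show where the gSDE switches sit.
model = PPO(
    "MlpPolicy",
    "Pendulum-v1",       # gSDE needs a continuous action space
    use_sde=True,        # exploration noise becomes state-dependent (gSDE)
    sde_sample_freq=4,   # resample the noise matrix every 4 steps (-1 = once per rollout)
    verbose=1,
)
model.learn(total_timesteps=100_000)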
Make a video on using FinRL.
Very nice video! Could you maybe make a video about explaining and setting the hyperparameters of PPO in sb3? Keep up the good work!
Thanks! Any particular parameter(s) that you are most interested in?
@@rlhugh The ones that are less self-explanatory, such as clip_range, normalize_advantage, ent_coef, max_grad_norm and use_sde.
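(For reference, here is roughly where those knobs sit in the SB3 PPO constructor. The values below are illustrative, mostly close to the library defaults at the time of writing, and are not tuning advice.)

```python
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    "CartPole-v1",            # placeholder environment
    clip_range=0.2,           # width of the PPO clipping interval for the probability ratio
    normalize_advantage=True, # standardize advantages within each mini-batch
    ent_coef=0.0,             # weight of the entropy bonus in the loss
    max_grad_norm=0.5,        # gradient clipping threshold
    use_sde=False,            # generalized State-Dependent Exploration (continuous actions only)
    verbose=1,
)
model.learn(total_timesteps=100_000)
```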
Every episode, my PPO agent's cumulative reward seems very "noisy". The average cumulative reward increases, but the instantaneous cumulative reward looks like a noisy signal. I tried tips for designing a reward function with a gradient, and tried changing the entropy loss weight, yet it just does not converge to a consistent policy.
I feel like pulling my hair out now.
Somehow I missed this comment earlier. Yeah, the reward usually is very noisy. In TensorBoard, there is an option to smooth the graph; the same option exists in MLflow, and probably Weights & Biases too. But... what do you mean by 'instantaneous cumulative reward'? Isn't the cumulative reward by definition the sum of all rewards from time 0 until some time T?
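(If you want that kind of smoothing outside TensorBoard, an exponential moving average along these lines is roughly what the slider does; the exact formula may differ, and the returns below are just a fake noisy series for illustration.)

```python
import random

def smooth(values, weight=0.9):
    """Exponential moving average; weight=0 means no smoothing."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed

# Fake noisy "episode return" curve, just to demonstrate the effect.
returns = [0.01 * t + random.gauss(0, 1.0) for t in range(500)]
print(returns[-5:])          # raw values jump around
print(smooth(returns)[-5:])  # smoothed values show the underlying trend
```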
@@rlhugh hi Hugh. Thanks for the tip. By "instantaneous" I meant the cumulative reward at the end of every episode.
I used MATLAB for designing the agent. I ended up using a double DQN with a discrete action space. It ended up learning a lot faster and smoother. Maybe my knowledge of PPO sucks. I tried extending the training time, but the PPO agent gets stuck somehow.
Interesting. Good info. Thank you! Do you have any thoughts on what about your task might make it more amenable to value-function learning? What are some characteristics of your input and output space that might be different from, e.g., playing Doom using the screen as input?
I have the exact same problem of overfitting - my agent learns very useful stuff, but at some point it just overfits to one action. This is why I take the checkpoint from before the overfitting, but this is a nasty fix.
I just incorporated the entropy regularization and my model is training. The data is incredibly noisy; I will let you know about the result.
In the meantime, I am wondering how kl_coeff influences the whole process. What do you think about it, and about the relation between entropy regularization and kl_coeff? I would appreciate a video or a comment.
Cheers,
petar
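(For what it's worth, kl_coeff is the name used by KL-penalty PPO variants such as RLlib's; SB3's PPO uses clipping plus an optional target_kl early stop instead of a penalty coefficient. Below is a rough sketch of how the two terms pull in different directions in a penalty-style loss; the function name, arguments, and coefficients are purely illustrative.)

```python
import torch

def ppo_penalty_loss(ratio, advantage, kl_div, entropy,
                     clip_range=0.2, kl_coeff=0.2, ent_coef=0.01):
    # Clipped surrogate objective (negated because we minimize).
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantage
    policy_loss = -torch.min(unclipped, clipped).mean()

    # kl_coeff penalizes drifting away from the old policy (restrains each update),
    # while ent_coef rewards higher entropy (encourages exploration).
    return policy_loss + kl_coeff * kl_div.mean() - ent_coef * entropy.mean()
```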
Great videos, really enjoy your style of communication and thoughts. Thanks for making them :)
It might be more helpful to explain and demo what entropy regularization is, what it does, and the history of the concept and different forms of it. The rest would be pretty intuitive.
Thank you for the feedback. Very useful, and I appreciate it :)
So the entreg is the same as ent_coef in PPO, or did I misunderstand you?
Yes, that's correct.
really fantastic videos 🎉
Thank you!
It's a great video. I am tuning the Kp and Ki gains with PPO reinforcement learning. The result is also a constant across the whole trajectory of the robot's movement, so I would like to know why this result is a constant. Maybe I am doing something wrong? Or is it fine? I really appreciate your comments. Thanks!
Wait until we reach C-3PO, instead of 2PO, that would be very interesting. 😁
What's 100k steps? You run 100 times 1 epoch of learning on 1000 frames?
Steps relate to the simulation, not to the learning. A step is one iteration of: receive an observation, take one action. Epochs of learning etc are configured separately. You can choose to run 5 epochs of learning over each batch of steps, for example, which would result in each step being used in 5 different training epochs.
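For concreteness, here is roughly how that split looks in Stable Baselines3 PPO; the numbers are only illustrative.

```python
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    "CartPole-v1",   # placeholder environment
    n_steps=1000,    # environment steps collected per rollout (per env)
    n_epochs=5,      # optimization passes over each collected rollout
    batch_size=250,  # mini-batch size used inside each epoch
)
# With a single env, 100_000 total timesteps is roughly 100 rollouts of
# 1000 steps, each reused for 5 epochs of gradient updates.
model.learn(total_timesteps=100_000)
```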
@@rlhugh Ok, thanks. That's what I expected but I just wanted a confirmation.
I regret only seeing this 6 months later.
Can you make a video on custom env creation for systems like user experience for a new app, or a trading bot?
So, firstly, I don't have experience with using RL for trading. But secondly, my gut intuition is that one uses RL when one's actions affect the environment, or at least the current state. However, unless you are making giant trades, your trading actions will not much affect your environment, i.e. the price, I think? The state does include things like how much money you have and what stock you own. However, I'm not sure that how much stock you own and how much money you have will much affect an estimate of the value of a stock? I would imagine that supervised learning is all you need, and will be much more efficient? What makes you feel that RL could be appropriate for estimating the value of a stock, or for taking actions on stocks?
(I suppose one option could be to create a simulator, by using stock prices from a year or so ago, and assuming that one's stock trades do not affect market price?)
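(A bare-bones sketch of that simulator idea, assuming a recent gymnasium-based Stable Baselines3 setup; the class name, reward scheme, and fake random-walk prices are all placeholders, not a working trading strategy.)

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class HistoricalPriceEnv(gym.Env):
    """Toy backtest env: prices are replayed from history and trades don't move the market."""

    def __init__(self, prices, window=10):
        super().__init__()
        self.prices = np.asarray(prices, dtype=np.float32)
        self.window = window
        # Observation: the last `window` prices; action: 0 = hold cash, 1 = hold the stock.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(window,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = self.window
        return self.prices[self.t - self.window:self.t], {}

    def step(self, action):
        # Reward: next-step log return if holding the stock, zero if holding cash.
        log_ret = np.log(self.prices[self.t] / self.prices[self.t - 1])
        reward = float(log_ret) if action == 1 else 0.0
        self.t += 1
        terminated = self.t >= len(self.prices)
        obs = self.prices[self.t - self.window:self.t] if not terminated else self.prices[-self.window:]
        return obs, reward, terminated, False, {}

# Quick sanity check on a fake random-walk price series.
prices = 100 * np.exp(np.cumsum(np.random.normal(0, 0.01, size=1000)))
env = HistoricalPriceEnv(prices)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```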
What timeframe were you thinking of using for each step of RL? E.g. 5 minutes? 1 day? 1 week? 1 month? Do you know where one could obtain prices for the several stocks that you are interested in trading, from e.g. 1 year ago, at the level of granularity that you are interested in training RL on?
@@rlhugh keep in mind that price does not mean anything in trading.
@@p4ros960 Can you elaborate on that? AFAIK, all securities with stocks as the underlying asset do have a value that depends on the price of the underlying stock? For example, if you sell a call, the more the price of the underlying stock goes up, the more money you will lose when that call is exercised, I think?
thanks!