I have followed along through your DDPG, PPO (PyTorch), and deep Q-learning videos. Resources like this are scarce. Deep RL is relatively new, very difficult for newcomers, and, I believe, not as popular as other deep-learning fields like computer vision and NLP. You, sir, are doing a tremendous service! So many questions I had about RL clarified! 👏
Thank you for the kind words.
Been waiting for this one! Awesome video!
I agree!
Hard to believe it took a year to get to it... I recall having some issues getting it working towards the end of 2020, but I had no problems coding it up last night. Hope it helped!
Phil continuing his amazing work as usual! I would encourage viewers to also read an ICLR 2020 accepted paper about small caveats of PPO and TRPO, "Implementation Matters in Deep RL: A Case Study on PPO and TRPO". A suggestion from a frustrated deep RL researcher. :')
Thanks Ashish, I'll have to check it out. It might be a good paper for an analysis video.
Wonderful lecture on PPO. I have a small doubt, sir: can I integrate a CNN architecture into PPO? Could you please share your thoughts on this?
Thanks for this video, Dr. Phil. It is important for me, as I come from PyTorch and couldn't work out the differences between TF2 and PyTorch. This video helps me a lot.
Thank you for your videos again Phil!!
In this case, how can I introduce an entropy bonus into the loss?
Can I just compute dist.entropy() and add it to actor_loss like this: beta * dist.entropy()?
Yup!
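For anyone else wondering, here is a minimal sketch of what that looks like inside the actor loss. The dummy tensors stand in for the batch pulled from memory in learn(); beta is a new hyperparameter that isn't in the video's code:

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Minimal sketch of adding an entropy bonus to the clipped PPO actor loss.
beta = 0.01           # entropy coefficient (assumed value)
policy_clip = 0.2

probs = tf.constant([[0.7, 0.3], [0.4, 0.6]])   # new policy outputs (dummy)
old_log_probs = tf.constant([-0.36, -0.51])     # stored at rollout time (dummy)
actions = tf.constant([0, 1])
advantage = tf.constant([1.2, -0.4])

dist = tfp.distributions.Categorical(probs=probs)
new_log_probs = dist.log_prob(actions)

prob_ratio = tf.math.exp(new_log_probs - old_log_probs)
weighted = advantage * prob_ratio
clipped = tf.clip_by_value(prob_ratio, 1 - policy_clip, 1 + policy_clip) * advantage

entropy = tf.reduce_mean(dist.entropy())

# Subtracting beta * entropy lowers the loss when entropy is high,
# which encourages the policy to keep exploring.
actor_loss = tf.reduce_mean(-tf.minimum(weighted, clipped)) - beta * entropy
```

Subtracting the term means the optimizer is rewarded for keeping the policy's entropy up, which slows premature collapse to a deterministic policy.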
Hello Phil, thank you very much for your video!
I tested your implementation for 10000 episodes.
I chose to use the episodic return as the evaluation tool for the RL algorithm (instead of the average over the last 100 episodes), and I notice that the episodic return oscillates between 0 and 200 throughout the whole training (I guess we would expect a "generally increasing" behavior, but it does not look like that). Instead, when using the average return of the last 100 episodes as the evaluation tool, we see this average return increasing at first, but then it saturates at about a value of 100. It is my impression, though, that the average return of the last episodes is expected to saturate at a value close to 100 even with random sampling (no training), given the nature of the CartPole-v0 environment (episodic return from 0 to 200). Having said that, I am curious whether this particular agent is actually learning or not. Have you come across this issue? Thank you in advance for your time. I tried my best to make my comment clear enough.
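One quick way to answer the "is it actually learning?" question is to measure a random-policy baseline on the same environment and compare it to the trained agent's average score. A minimal sketch using the classic gym API, with the environment name taken from the comment:

```python
import gym
import numpy as np

# Minimal sketch: average episodic return of a purely random policy on
# CartPole-v0, to compare against the trained agent's 100-episode average.
env = gym.make('CartPole-v0')
scores = []
for _ in range(100):
    obs = env.reset()
    done, score = False, 0
    while not done:
        action = env.action_space.sample()            # random action
        obs, reward, done, info = env.step(action)
        score += reward
    scores.append(score)

print('random-policy average return:', np.mean(scores))
```

If the trained agent's average sits well above this baseline, it is learning something, even if the raw episodic returns look noisy.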
Very nice video as usual ;) thanks. Are you planning to implement PPO for continuous action spaces? I've been struggling with this task for a couple of days, and my own implementation seems to be useless in environments like Mountain Car Continuous or Pendulum. And this happens despite the fact that it is based on similar implementations of, e.g., SAC or TD3.
Anyway, thanks again for the video. I find implementing algorithms from scratch to be the best way of learning to really understand ML models.
Best regards
Yup, starting from the paper is the best way to learn, IMO. Honestly, I'm on the fence about continuous action spaces. I may do it for YouTube, or I may stick it in a course. I'm working on a solution to make the paid content more accessible in the very near future, and PPO is something I'd include close to the beginning.
@@MachineLearningwithPhil Sounds good to me. I have browsed a lot of public GitHub repositories, and most people implement PPO for discrete action spaces. What's funny is that many repos repeat the same pattern (like clipping actions or rescaling them) after the (mean, std) output. Still, most of the time... it simply doesn't work for environments other than the specific one the code was written for (e.g. BipedalWalker, etc.).
Right now I'm playing with your discrete-action-space PPO implementation for TF2.0, trying to turn it into a continuous-action agent, but it still looks absolutely random on LunarLanderContinuous. Something is not right.
I guess the problem is in the action parametrization: PPO requires calculating new log probs in the training phase, and at that time the actions have already been rescaled and clipped.
Best regards,
Filip W.
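For what it's worth, one common fix for that parametrization issue is to sample and store the raw Gaussian action, take log probs on that raw sample, and only rescale or clip when stepping the environment. A minimal sketch, not from the video, with dummy network outputs:

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Minimal sketch of continuous-action sampling for PPO (not from the video).
# Log probs are computed on the raw Gaussian sample; clipping/rescaling to the
# environment's bounds happens only for env.step(), while the raw action and
# its log prob are what get stored and re-evaluated during training.
mu = tf.constant([[0.3, -0.1]])            # actor network mean output (dummy)
log_std = tf.Variable([[-0.5, -0.5]])      # learned log std (dummy)

dist = tfp.distributions.Normal(loc=mu, scale=tf.exp(log_std))
raw_action = dist.sample()                                    # store this
log_prob = tf.reduce_sum(dist.log_prob(raw_action), axis=-1)  # store this too

env_action = tf.clip_by_value(raw_action, -1.0, 1.0)          # only for env.step()
```

During the update, the distribution is rebuilt from the current network and log_prob is called on the stored raw action, never on the clipped one, so the ratio in the surrogate objective stays consistent.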
I was wondering, have you considered trying to make a self-play agent for something like Connect 4 or maybe even Tic Tac Toe?
This looks great. But where is the video on the previous pytorch implementation?
https://www.youtube.com/watch?v=hlv79rcHws0
Hi Phil, nice course. I have a question about the critic loss: why are you using the advantage in the critic loss? I have seen many other implementations just use the return as the target. Is there any paper supporting this?
It's in the source paper for PPO.
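To spell out the pattern: the advantage isn't used as the critic's loss by itself. In implementations like the one in the video (if I recall the code correctly), it is added back onto the stored state values to form a return target, and the critic regresses on that. A minimal sketch with dummy numbers:

```python
import tensorflow as tf

# Minimal sketch: adding the stored state values back onto the GAE advantages
# gives a return estimate, and the critic regresses on that return.
advantage = tf.constant([0.8, -0.3, 1.1])    # GAE advantages (dummy values)
values = tf.constant([1.0, 2.0, 0.5])        # V(s) stored at rollout time
critic_value = tf.constant([1.2, 1.9, 0.7])  # current critic predictions

returns = advantage + values                 # GAE-based return target
critic_loss = tf.reduce_mean((returns - critic_value) ** 2)
```

So it ends up equivalent to regressing on a GAE-based return rather than using the raw advantage as a loss.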
Hi, thanks for the video, but could you please specify the versions of the libraries you used? Thanks.
I'm running this setup on LunarLander right now with a rollout length of 4096, a batch size of 512, and 10 epochs of updates with a clipped value loss, and I'm noticing that at every step the updates to the policy are massive, with the agent doing something almost completely different. Do you have any advice on what to do?
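One standard safeguard for exactly this symptom, not covered in the video, is to stop the update epochs early once the approximate KL divergence between the old and current policy grows past a threshold (the approach used in implementations such as OpenAI Spinning Up). A self-contained sketch with dummy log probs:

```python
import numpy as np

# Minimal sketch (not from the video): early-stop PPO's update epochs once the
# approximate KL divergence between the old and current policy grows too large.
# The log probs here are dummies standing in for the ones computed in learn().
target_kl = 0.015                                       # assumed threshold
rng = np.random.default_rng(0)

for epoch in range(10):
    # In a real agent these come from the stored rollout and the current actor.
    old_log_probs = rng.normal(-0.7, 0.05, size=512)
    new_log_probs = old_log_probs - 0.005 * (epoch + 1)  # policy drifting away

    approx_kl = float(np.mean(old_log_probs - new_log_probs))
    # ... gradient steps on this epoch's minibatches would go here ...
    if approx_kl > 1.5 * target_kl:
        print(f'early stop at epoch {epoch}, approx KL {approx_kl:.4f}')
        break
```

Lowering the learning rate or the number of update epochs per rollout are the other usual levers when a single update moves the policy too far.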
Phil, I'm deciding whether to study electrical engineering or physics. The thing is, afterwards I would love to work in robotics and AI. Since I know you studied physics, I would like your advice on whether it's useful for those areas. Great video, keep up the good content.
Go for EE and take physics classes as your workload permits.
@@MachineLearningwithPhil thanks a lot!
Hi Phil, when I adjust your code and run it, it shows InvalidArgumentError: required broadcastable shapes [Op:Mul] when running agent.learn(), at:
dist = tfp.distributions.Categorical(probs=probs)
new_probs = dist.log_prob(actions)
I want to ask: should probs and actions be the same size?
Which tensorflow version is this?
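For reference, a small shape check for that call: probs should be (batch, n_actions) and actions should be (batch,), so log_prob returns (batch,). The Op:Mul broadcast error usually comes from a later multiplication (e.g. advantage times the prob ratio) mixing tensors with different batch sizes:

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Shape check for the Categorical log-prob call: probs is (batch, n_actions),
# actions is (batch,), and log_prob returns (batch,). A broadcast error in a
# later multiply means one of those batch dimensions doesn't match.
probs = tf.constant([[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])  # (3, 2)
actions = tf.constant([1, 0, 0])                           # (3,)

dist = tfp.distributions.Categorical(probs=probs)
new_log_probs = dist.log_prob(actions)
print(new_log_probs.shape)   # (3,) -- must match the advantage's batch size
```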
Hi Phil, just discovered your channel, thanks for your work! I just had a question: my custom environment takes a 20*20 action space but the choose_action function only seems to return 1 action. Is there a way around this or am I doing something wrong?
Welcome aboard.
I would honestly look for ways to simplify that action space. Even if you unrolled it, it has 400 elements. That's impractically large for most off the shelf algorithms.
@@MachineLearningwithPhil Thanks for the reply. The action itself is a pattern of stimulation into a cardiac model and the 20*20 action space is already reduced from the original 100*100 cardiac layer. Reducing it further removes the accuracy we need to test these patterns unfortunately. I'm all ears to any suggestions! Do you think I would need some other framework?
Shoot me an email at phil@neuralnet.ai
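In the meantime, one common way to keep a grid-structured action space tractable is to factor the policy into independent heads, e.g. a 20-way row head and a 20-way column head (40 outputs instead of 400). This is only a hypothetical sketch and assumes one cell is selected per step, which may not match the stimulation-pattern setup:

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Hypothetical sketch: factor a 20x20 grid action into a row head and a column
# head (20 + 20 outputs) instead of one 400-way softmax.
row_logits = tf.random.normal((1, 20))   # would come from the actor network
col_logits = tf.random.normal((1, 20))

row_dist = tfp.distributions.Categorical(logits=row_logits)
col_dist = tfp.distributions.Categorical(logits=col_logits)

row, col = row_dist.sample(), col_dist.sample()
action = (int(row.numpy()[0]), int(col.numpy()[0]))        # grid coordinate

# Under the independence assumption the joint log prob is the sum of the heads.
log_prob = row_dist.log_prob(row) + col_dist.log_prob(col)
```

If the action is a full binary stimulation pattern over the grid rather than a single cell, a set of independent Bernoulli heads would be the analogous factorization, but that is a bigger change to the agent.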
Hi, Sir! I'm lucky to know you. I have enrolled in your Udemy class and have completed the course. By the way, this is a nice video as usual ;) thanks.
I was looking for a tutorial on how to handle a continuous-action environment with PPO using TensorFlow, but so far I haven't found one. Can you tell us how to solve the problem, please?
Can someone tell me the compatible Python version for this video?
Thanks so much for this content, but I have a question: is it necessary to learn calculus for deep reinforcement learning?
From what I see, we don't use it in the coding. Thanks!
It's not used for coding, so long as you're using frameworks and not trying to build something completely new.
@@MachineLearningwithPhil So it is not necessary?
Well, if you want to have a deep understanding of what's going on you will need it. If you just want to implement stuff, it's not.
@@MachineLearningwithPhil If I learn the basics of the math, it would be better, right?
Can PPO be improved using an attention network?
Interesting question. It would probably depend on the environment.
@@MachineLearningwithPhil I don't know if my message was blocked because of a documentation link, so I'll send it without the link... I found PPO with attention in a framework called RLlib; there is an example in that framework using PPO with an Attention Net (GTrXL). It looks good, I'll do some tests.
I've heard good things about rllib, and intend to look into it in the near future. Let me know how it works out for you.
@@MachineLearningwithPhil I created a custom env using OpenAI Gym to play a game called slither.io, and I'm using PPO from the stable-baselines framework, but it is taking too long to learn. So now I'm searching for different agents to see if one learns faster.
I'll test RLlib, and if it works better I'll post a new comment here about it.
Lol I was literally working on the same problem!
Perfect timing.