I have followed along through your DDPG, PPO (PyTorch), and deep Q-learning videos. Resources like this are scarce. Deep RL is relatively new, very difficult for newcomers, and, I believe, not as popular as other deep-learning fields like computer vision and NLP. You, sir, are doing a tremendous service! So many questions I had about RL clarified! 👏
Thank you for the kind words.
Been waiting for this one! Awesome video!
I agree!
Hard to believe it took a year to get to it... I recall having some issues getting it working towards the end of 2020, but I had no problems coding it up last night. Hope it helped!
Phil continuing his amazing work as usual! I would encourage viewers to also read an ICLR 2020 accepted paper about small caveats of PPO and TRPO, "Implementation Matters in Deep RL: A Case Study on PPO and TRPO". A suggestion from a frustrated deep RL researcher. :')
Thanks Ashish, I'll have to check it out. It might be a good paper for an analysis video.
Wonderful lecture on PPO. I have a small doubt, sir: can I integrate a CNN architecture into PPO? Could you please share your thoughts on this?
Thanks for this video, Dr. Phil. It is important for me, as I come from PyTorch and couldn't work out the differences between TF2 and PyTorch. This video helps me a lot.
Thank you for your videos again Phil!!
In this case, how can I introduce an entropy bonus into the loss?
Can I just compute dist.entropy() and add it to actor_loss like this: beta * dist.entropy()?
Yup!
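For anyone else wondering, here is a minimal sketch of what that looks like inside the actor loss. The dummy tensors stand in for the batch pulled from memory in learn(); beta is a new hyperparameter that isn't in the video's code:

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Minimal sketch of adding an entropy bonus to the clipped PPO actor loss.
beta = 0.01           # entropy coefficient (assumed value)
policy_clip = 0.2

probs = tf.constant([[0.7, 0.3], [0.4, 0.6]])   # new policy outputs (dummy)
old_log_probs = tf.constant([-0.36, -0.51])     # stored at rollout time (dummy)
actions = tf.constant([0, 1])
advantage = tf.constant([1.2, -0.4])

dist = tfp.distributions.Categorical(probs=probs)
new_log_probs = dist.log_prob(actions)

prob_ratio = tf.math.exp(new_log_probs - old_log_probs)
weighted = advantage * prob_ratio
clipped = tf.clip_by_value(prob_ratio, 1 - policy_clip, 1 + policy_clip) * advantage

entropy = tf.reduce_mean(dist.entropy())

# Subtracting beta * entropy lowers the loss when entropy is high,
# which encourages the policy to keep exploring.
actor_loss = tf.reduce_mean(-tf.minimum(weighted, clipped)) - beta * entropy
```

Subtracting the term means the optimizer is rewarded for keeping the policy's entropy up, which slows premature collapse to a deterministic policy.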
Hello Phil, thank you very much for your video!
I tested your implementation for 10000 episodes.
I chose to use the episodic return as the evaluation tool for the RL algorithm (instead of the average over the last 100 episodes), and I notice that the episodic return oscillates between 0 and 200 throughout the whole training (I guess we would expect a "generally increasing" behavior, but it does not look like that). Instead, when using the average return of the last 100 episodes as the evaluation tool, we see this average return increasing at first, but then it saturates at about a value of 100. It is my impression, though, that the average return of the last episodes is expected to saturate at a value close to 100 even with random sampling (no training), given the nature of the CartPole-v0 environment (episodic return from 0 to 200). Having said that, I am curious whether this particular agent is actually learning or not. Have you come across this issue? Thank you in advance for your time. I tried my best to make my comment clear enough.
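One quick way to answer the "is it actually learning?" question is to measure a random-policy baseline on the same environment and compare it to the trained agent's average score. A minimal sketch using the classic gym API, with the environment name taken from the comment:

```python
import gym
import numpy as np

# Minimal sketch: average episodic return of a purely random policy on
# CartPole-v0, to compare against the trained agent's 100-episode average.
env = gym.make('CartPole-v0')
scores = []
for _ in range(100):
    obs = env.reset()
    done, score = False, 0
    while not done:
        action = env.action_space.sample()            # random action
        obs, reward, done, info = env.step(action)
        score += reward
    scores.append(score)

print('random-policy average return:', np.mean(scores))
```

If the trained agent's average sits well above this baseline, it is learning something, even if the raw episodic returns look noisy.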
Very nice video as usual ;) thanks. Are you planning to implement PPO for continuous action spaces? I've been struggling with this task for a couple of days, and my own implementation seems to be useless in environments like Mountain Car Continuous or Pendulum. And this happens despite the fact that it is based on similar implementations of, e.g., SAC or TD3.
Anyway, thanks again for the video. I find implementing algorithms from scratch to be the best way of learning to really understand ML models.
Best regards
Yup, starting from the paper is the best way to learn, IMO. Honestly, I'm on the fence about continuous action spaces. I may do it for YouTube, or I may stick it in a course. I'm working on a solution to make the paid content more accessible in the very near future, and PPO is something I'd include close to the beginning.
@@MachineLearningwithPhil Sounds good to me. I have browsed a lot of public GitHub repositories, and most people implement PPO for discrete action spaces. What's funny is that many repos repeat the same pattern (like clipping actions or rescaling them) after the (mean, std) output. Still, most of the time... it simply doesn't work for environments other than the specific one the code was written for (e.g. BipedalWalker, etc.).
Right now I'm playing with your discrete-action-space PPO implementation for TF2.0, trying to turn it into a continuous-action agent, but it still looks absolutely random on LunarLanderContinuous. Something is not right.
I guess the problem is in the action parametrization: PPO requires calculating new log probs in the training phase, and at that time the actions have already been rescaled and clipped.
Best regards,
Filip W.
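For what it's worth, one common fix for that parametrization issue is to sample and store the raw Gaussian action, take log probs on that raw sample, and only rescale or clip when stepping the environment. A minimal sketch, not from the video, with dummy network outputs:

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Minimal sketch of continuous-action sampling for PPO (not from the video).
# Log probs are computed on the raw Gaussian sample; clipping/rescaling to the
# environment's bounds happens only for env.step(), while the raw action and
# its log prob are what get stored and re-evaluated during training.
mu = tf.constant([[0.3, -0.1]])            # actor network mean output (dummy)
log_std = tf.Variable([[-0.5, -0.5]])      # learned log std (dummy)

dist = tfp.distributions.Normal(loc=mu, scale=tf.exp(log_std))
raw_action = dist.sample()                                    # store this
log_prob = tf.reduce_sum(dist.log_prob(raw_action), axis=-1)  # store this too

env_action = tf.clip_by_value(raw_action, -1.0, 1.0)          # only for env.step()
```

During the update, the distribution is rebuilt from the current network and log_prob is called on the stored raw action, never on the clipped one, so the ratio in the surrogate objective stays consistent.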
I was wondering, have you considered trying to make a self-play agent for something like Connect 4 or maybe even Tic Tac Toe?
This looks great. But where is the video on the previous pytorch implementation?
https://www.youtube.com/watch?v=hlv79rcHws0
Hi Phil, nice course. I have a question about the critic loss: why are you using the advantage in the critic loss? I have seen many other implementations just use the return as the target. Is there any paper supporting this?
It's in the source paper for PPO.
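To spell out the pattern: the advantage isn't used as the critic's loss by itself. In implementations like the one in the video (if I recall the code correctly), it is added back onto the stored state values to form a return target, and the critic regresses on that. A minimal sketch with dummy numbers:

```python
import tensorflow as tf

# Minimal sketch: adding the stored state values back onto the GAE advantages
# gives a return estimate, and the critic regresses on that return.
advantage = tf.constant([0.8, -0.3, 1.1])    # GAE advantages (dummy values)
values = tf.constant([1.0, 2.0, 0.5])        # V(s) stored at rollout time
critic_value = tf.constant([1.2, 1.9, 0.7])  # current critic predictions

returns = advantage + values                 # GAE-based return target
critic_loss = tf.reduce_mean((returns - critic_value) ** 2)
```

So it ends up equivalent to regressing on a GAE-based return rather than using the raw advantage as a loss.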
Hi, thanks for the video, but could you please specify the versions of the libraries you used? Thanks.
I'm running this setup on LunarLander right now with a rollout length of 4096, a batch size of 512, and 10 epochs of updates with a clipped value loss, and I'm noticing that at every step the updates to the policy are massive, with the agent doing something almost completely different. Do you have any advice on what to do?
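One standard safeguard for exactly this symptom, not covered in the video, is to stop the update epochs early once the approximate KL divergence between the old and current policy grows past a threshold (the approach used in implementations such as OpenAI Spinning Up). A self-contained sketch with dummy log probs:

```python
import numpy as np

# Minimal sketch (not from the video): early-stop PPO's update epochs once the
# approximate KL divergence between the old and current policy grows too large.
# The log probs here are dummies standing in for the ones computed in learn().
target_kl = 0.015                                       # assumed threshold
rng = np.random.default_rng(0)

for epoch in range(10):
    # In a real agent these come from the stored rollout and the current actor.
    old_log_probs = rng.normal(-0.7, 0.05, size=512)
    new_log_probs = old_log_probs - 0.005 * (epoch + 1)  # policy drifting away

    approx_kl = float(np.mean(old_log_probs - new_log_probs))
    # ... gradient steps on this epoch's minibatches would go here ...
    if approx_kl > 1.5 * target_kl:
        print(f'early stop at epoch {epoch}, approx KL {approx_kl:.4f}')
        break
```

Lowering the learning rate or the number of update epochs per rollout are the other usual levers when a single update moves the policy too far.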
Phil, I'm deciding whether to study electrical engineering or physics. The thing is, afterwards I would love to work in robotics and AI. Since I know you studied physics, I would like your advice on whether it's useful for those areas. Great video, keep up the good content.
Go for EE and take physics classes as your workload permits.
@@MachineLearningwithPhil thanks a lot!
Hi Phil, when I adjust your code and run it, it shows InvalidArgumentError: required broadcastable shapes [Op:Mul] when running agent.learn(), at:
dist = tfp.distributions.Categorical(probs=probs)
new_probs = dist.log_prob(actions)
I want to ask: should probs and actions be the same size?
Which tensorflow version is this?
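For reference, a small shape check for that call: probs should be (batch, n_actions) and actions should be (batch,), so log_prob returns (batch,). The Op:Mul broadcast error usually comes from a later multiplication (e.g. advantage times the prob ratio) mixing tensors with different batch sizes:

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Shape check for the Categorical log-prob call: probs is (batch, n_actions),
# actions is (batch,), and log_prob returns (batch,). A broadcast error in a
# later multiply means one of those batch dimensions doesn't match.
probs = tf.constant([[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])  # (3, 2)
actions = tf.constant([1, 0, 0])                           # (3,)

dist = tfp.distributions.Categorical(probs=probs)
new_log_probs = dist.log_prob(actions)
print(new_log_probs.shape)   # (3,) -- must match the advantage's batch size
```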
Hi Phil, just discovered your channel, thanks for your work! I just had a question: my custom environment takes a 20*20 action space but the choose_action function only seems to return 1 action. Is there a way around this or am I doing something wrong?
Welcome aboard.
I would honestly look for ways to simplify that action space. Even if you unrolled it, it has 400 elements. That's impractically large for most off the shelf algorithms.
@@MachineLearningwithPhil Thanks for the reply. The action itself is a pattern of stimulation into a cardiac model and the 20*20 action space is already reduced from the original 100*100 cardiac layer. Reducing it further removes the accuracy we need to test these patterns unfortunately. I'm all ears to any suggestions! Do you think I would need some other framework?
Shoot me an email at phil@neuralnet.ai
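In the meantime, one common way to keep a grid-structured action space tractable is to factor the policy into independent heads, e.g. a 20-way row head and a 20-way column head (40 outputs instead of 400). This is only a hypothetical sketch and assumes one cell is selected per step, which may not match the stimulation-pattern setup:

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Hypothetical sketch: factor a 20x20 grid action into a row head and a column
# head (20 + 20 outputs) instead of one 400-way softmax.
row_logits = tf.random.normal((1, 20))   # would come from the actor network
col_logits = tf.random.normal((1, 20))

row_dist = tfp.distributions.Categorical(logits=row_logits)
col_dist = tfp.distributions.Categorical(logits=col_logits)

row, col = row_dist.sample(), col_dist.sample()
action = (int(row.numpy()[0]), int(col.numpy()[0]))        # grid coordinate

# Under the independence assumption the joint log prob is the sum of the heads.
log_prob = row_dist.log_prob(row) + col_dist.log_prob(col)
```

If the action is a full binary stimulation pattern over the grid rather than a single cell, a set of independent Bernoulli heads would be the analogous factorization, but that is a bigger change to the agent.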
Hi, Sir! I'm lucky to know you. I have enrolled in your Udemy class and have completed the course. By the way, this is a nice video as usual ;) thanks.
I was looking for a tutorial on how to handle a continuous-action environment with PPO using TensorFlow, but so far I haven't found one. Can you tell us how to solve the problem, please?
Can someone tell me the compatible Python version for this video?
Thanks so much for this content, but I have a question: is it necessary to learn calculus for deep reinforcement learning?
From what I see, we don't use it in the coding. Thanks!
It's not used for coding, so long as you're using frameworks and not trying to build something completely new.
@@MachineLearningwithPhil So it is not necessary?
Well, if you want to have a deep understanding of what's going on you will need it. If you just want to implement stuff, it's not.
@@MachineLearningwithPhil If I learn the basics of the math, it would be better, right?
Can PPO be improved using an attention network?
Interesting question. It would probably depend on the environment.
@@MachineLearningwithPhil I don't know if my message was blocked because of a documentation link, so I'll send it without the link... I found PPO with attention in a framework called RLlib; there is an example in that framework using PPO with an Attention Net (GTrXL). It looks good, I'll do some tests.
I've heard good things about rllib, and intend to look into it in the near future. Let me know how it works out for you.
@@MachineLearningwithPhil I created a custom env using OpenAI Gym to play a game called slither.io, and I'm using PPO from the stable-baselines framework, but it is taking too long to learn. So now I'm searching for different agents to see if one learns faster.
I'll test RLlib, and if it works better I'll post a new comment here about it.
Lol I was literally working on the same problem!
Perfect timing.