Proximal Policy Optimization (PPO) is Easy With PyTorch | Full PPO Tutorial
- Published 23 Dec 2020
- Proximal Policy Optimization is an advanced actor-critic algorithm designed to improve performance by constraining updates to our actor network. It's relatively straightforward to implement in code, and in this full tutorial you're going to get a mini lecture covering the essential concepts behind the PPO algorithm, as well as a complete implementation in the PyTorch framework. We'll test our algorithm in a simple OpenAI Gym environment: the cartpole.
Code for this video is here:
github.com/philtabor/UA-cam-...
A written crash course in PPO can be found here:
www.neuralnet.ai/a-crash-cour...
Learn how to turn deep reinforcement learning papers into code:
Get instant access to all my courses, including the new Prioritized Experience Replay course, with my subscription service. $29 a month gives you instant access to 42 hours of instructional content plus access to future updates, added monthly.
Discounts available for Udemy students (enrolled longer than 30 days). Just send an email to sales@neuralnet.ai
www.neuralnet.ai/courses
Or, pick up my Udemy courses here:
Deep Q Learning:
www.udemy.com/course/deep-q-l...
Actor Critic Methods:
www.udemy.com/course/actor-cr...
Curiosity Driven Deep Reinforcement Learning:
www.udemy.com/course/curiosit...
Natural Language Processing from First Principles:
www.udemy.com/course/natural-...
Reinforcement Learning Fundamentals:
www.manning.com/livevideo/rei...
Here are some books / courses I recommend (affiliate links):
Grokking Deep Learning in Motion: bit.ly/3fXHy8W
Grokking Deep Learning: bit.ly/3yJ14gT
Grokking Deep Reinforcement Learning: bit.ly/2VNAXql
Come hang out on Discord here:
/ discord
Need personalized tutoring? Help on a programming project? Shoot me an email! phil@neuralnet.ai
Website: www.neuralnet.ai
Github: github.com/philtabor
Twitter: / mlwithphil
This content is sponsored by my Udemy courses. Level up your skills by learning to turn papers into code. See the links in the description.
Best Christmas gift, I'll definitely buy your course next year
Truly an incredible video. I have looked all over the internet for a simple implementation of PPO - this was the best and only true example I could find.
Wow, that was an awesome introduction and overview of how the PPO algorithm is structured. I understand basic PyTorch and could follow most of this. I just get confused on those squeeze operations. Many thanks. Impressive watching live code being written.
Hey Phil, thank you for the video! Your content really helped me get into RL.
I think I have spotted a potential improvement for the code:
In your introduction you stated that we sample batches and perform 4 epochs of updates on each of them. However, I think the way the for loops are arranged in this code has it the other way around: sampling batches of 5 four times and then performing a single update on each of them.
Thank you Phil, for a year I had no full understanding of PPO. But after watching your tutorial, now I understand what PPO is. You are an amazing person, Phil.
Thank you very much for this, sir. Great tutorial and Merry Christmas!
With your help I was able to implement PPO for my own NN project :) it's learning like I've never seen before in my billion attempts at throwing code against a wall.
Thanks a lot Phil, this is really great!
Thank you for getting to PPO. Will you do a video on a TensorFlow implementation as well?
Great video, code ran perfectly the first time, just like in the video!
Thank you so much for this Phil, this will be my Christmas present! hahaha
Merry Christmas!
Best video I've ever seen, thanks so much for taking the time from your real life to make it, and I'm happy to see more of your content. I think one sentence to clarify "why we need log probs" would have been great (maybe it's something pretty straightforward); it's not 100% clear to me why we do that log probs collection and then the exp(). Sorry for not knowing :/ have a great day!
these videos are so valuable, many thanks!
I've been trying to understand PPO for a while now and this is by *far* the clearest and well-structured (and in PyTorch!) -- thank you very much!!
Glad you found it helpful!
@@MachineLearningwithPhil One thing I'm still a little unclear about is the trajectories, most implementations and pseudocode use trajectories but you don't seem to do this (or at least it seems you only learn from one trajectory at a time?) -- is there a reason why? Or am I missing something? Is it because it's an Actor-Critic implementation of PPO? If that's the case, is there a specific reason to only use one trajectory for AC? Thanks again :)
Just ease of implementation. In the paper they do both ways so I opted for the single thread GPU option.
@@MachineLearningwithPhil That makes sense, thanks for the response!
You can make the code a little easier and more efficient by using dictionaries to store the states, probs, vals, etc. That way you can avoid typo errors. Also, PyCharm is great for spotting where things are missing or auto-suggesting once you start typing.
In the init, I put:
self.sd = {"states": [], "probs": [], "vals": [], "actions": [], "rewards": [], "dones": []}
And in the generate_batches, I use:
return [np.array(x) for key, x in self.sd.items()], batches
store_memory just takes in a dict with the same keys, and then:
def store_memory(self, state_list):
    for key, val in state_list.items():
        self.sd[key].append(val)
Anyways, dictionaries are a great tool for organizing a lot of lists.
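For anyone who wants to try this suggestion, here is a self-contained sketch of the dictionary-based memory. The method names mirror the video's PPOMemory class, but the exact signatures here are my own assumptions, not Phil's code:

```python
import numpy as np

class PPOMemory:
    """Dictionary-backed memory (a sketch; key names follow the
    comment above, signatures are assumptions)."""
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.sd = {"states": [], "probs": [], "vals": [],
                   "actions": [], "rewards": [], "dones": []}

    def store_memory(self, transition):
        # transition is a dict with the same keys as self.sd
        for key, val in transition.items():
            self.sd[key].append(val)

    def generate_batches(self):
        # shuffle indices, then slice them into batch-sized chunks
        n_states = len(self.sd["states"])
        indices = np.arange(n_states)
        np.random.shuffle(indices)
        batches = [indices[i:i + self.batch_size]
                   for i in range(0, n_states, self.batch_size)]
        return [np.array(v) for v in self.sd.values()], batches

    def clear_memory(self):
        for key in self.sd:
            self.sd[key] = []
```

One caveat: unpacking the returned list relies on dict insertion order, which Python 3.7+ guarantees, so the caller must unpack in the same order the keys were declared.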
Hi Phil, very nice and helpful content, as always!
I have a question:
What changes have to be made when solving an environment where the most important reward comes at the end of an episode, like in Lunar Lander? I implemented PPO once, but it had bad performance on Lunar Lander because the timesteps taken are smaller than an episode length, so the important information is lost. I modified the algorithm so that it plays 2000 timesteps of complete episodes, except for the last episode of that PPO run. It then solves Lunar Lander, but not as well as SAC (with automated temperature optimization, two Q networks, and no V network) or TD3. I didn't expect PPO to perform this poorly since the algorithm has such "hype".
Do you have some thoughts or solutions on the problem of solving environments like Lunar Lander with PPO?
Thanks for this much awaited video! One remark: I think the buffer size should be much larger than the episode length, not the other way around. The reason is that samples within an episode are correlated and (heavily) so would be the mini batches, increasing the gradient variance.
Your argument is perfectly logical.
I double checked the paper and it indeed says "where T is much less than the episode length". I wonder if the smoothing parameter in the advantage calculation makes a big enough difference.
@Machine Learning with Phil interesting! I think they do that (with the cut-off advantage summation) since they want to learn "as soon as possible" and not wait until an episode ends. This is of course important if episodes are very long, as is the case (judging by the name) in the "mujoco one-million-timesteps benchmark", but otherwise I don't see a big advantage.
In the smallest task they have T=128 timesteps, with correlation of course diminishing over time, and additionally they have N=8 actors in parallel and I would guess that they mix the N*T= 1024 observations from different actors for their update, greatly reducing the variance. For a non-parallel implementation with relatively short episodes, it seems less clear whether setting T below the episode length is good. By the way you are right that the GAE parameter allows looking no more than 128 steps into the future, since it makes contributions of timesteps beyond that vanishingly small
Hi Phil,
Great implementation and love your videos. Just a small detail you might like to add is the "early stopping" criterion that many researchers use while implementing PPO. It basically says: if LogP/LogP_old (which is what we compute in the inner loop of the learn function) has exceeded a threshold, stop the iteration. You can find it in OpenAI's Spinning Up PPO docs as well.
Thanks! You have been great help in my research for converting theory to practice.
Great modification, I'll have to look into it. Thanks for the tip Ashish.
Correction: it's not the I.S. ratio itself but the approximate KL divergence that should not go beyond the threshold, because even with the clip, the policy sometimes collapses, as pointed out by a lot of people.
They use:
approx_KL_div = (LogP - LogP_old).mean()
if approx_KL_div > threshold:
    break
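A runnable sketch of that check. Note that OpenAI's Spinning Up computes the approximate KL with the old log probs first and stops when it exceeds 1.5x a target; 0.015 is an illustrative target value, not one taken from the video:

```python
import torch

def should_stop_early(new_log_probs, old_log_probs, target_kl=0.015):
    # Approximate KL divergence between the old and new policies,
    # using Spinning Up's convention: mean of (logp_old - logp_new).
    approx_kl = (old_log_probs - new_log_probs).mean().item()
    return approx_kl > 1.5 * target_kl
```

Inside the learn function you would call this after each minibatch update and break out of the epoch loop when it returns True.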
Thank you for the tutorial. I had one question: in some implementations I have seen, people generally use "next_state" values to calculate the advantage, but here we are not using next_state anywhere. Why?
Best Course EVER!!!
Very nice explanation of PPO. I've been working on it for about 3 weeks and still couldn't get it to work. I believe this video will help me a lot in my project on multi agent deep reinforcement learning
Comment for the algorithm and of course smashing that like button!
Damn dude, you're fast. Thanks as always.
When using the policy after training, should we sample an action from the distribution or select the action with the highest probability value?
Amazing video, I have one question though (which was probably addressed in another comment, but just to be sure):
Shouldn't each batch of data contain the steps of only one episode? Because of the way the PPOMemory class is organized, we are storing individual steps in lists and then picking a sequence of steps at random. But the sequence of steps we pick might belong to two different episodes, which makes the calculations done in line 160 useless, as we are computing advantages over steps from different episodes.
Plus, wouldn't it make sense to compute these advantages only once, and then store them in the PPOMemory to avoid recomputing them in every epoch, every time we train the agent?
At the end of the agent.learn() function the memory is cleared so there should only be T (in this case 20) timesteps of data from one singular episode in the memory, at any given time.
Also, is there a video where you explain Shannon entropy? Probably not a complex concept, but coming from you specifically would be great. Great coverage of concepts, concise, clear speech. Cheers and stay well!
Thanks for your tutorial!! I have some confusion; I'll be grateful if you can help me. I wonder what's the difference between state and obs. And what about the x in the paper, which means "some state information, could consist of the observations of all agents"?
Thanks so much for continuing to share these great videos Phil! They help so much. Would you happen to plan on implementing codes on cooperative multi-agent? I've seen a couple of codes by others but they are quite unclear. Would be fantastic if you do :)
Does that mean yes? lol
I'll be working on it for sure. I'm slow, sorry
@@MachineLearningwithPhil Lovely
Is there any way to select the parameter N (memory buffer) and the batch size? On what parameters do they depend?
Hey Phil, could you explain why you use the return as the approximation of the value function instead of the expected return?
Brilliant content.
much thanks for your video !
I was wondering about the advantage function A hat sub t, that looks very similar to the temporal difference function. Are they the same thing?
Amazing video. Huge contribution to a coding dumb like me. Immense gratitude to you.
Thank you for watching
Thanks Phil for this great episode! One question: are you sure you got the advantage calculation right? Your discount variable begins at one, but it is not reset when the game is done. The PPO memory can contain multiple games.
I'm working on my own implementation and am using something like this:
delta = reward_arr[k] + Gamma * value_arr[k+1] * mask - value_arr[k];
adv = delta + Gamma * Lambda * mask * adv;
I've been wondering that as well. Right, multiple episodes aren't handled correctly.
It's also calculating the advantage again for every epoch even though it should stay the same. It might be better to calculate the advantage after each episode and only save that instead of everything you need to calculate it later.
I believe this is a valid issue. Can you raise it on my GitHub? Link in description.
@@MachineLearningwithPhil I couldn't find the issue raised in github but I will propose a fix there. For anyone reading this that wants to know a fix, I think this will work: change the line
discount *= self.gamma * self.gae_lambda
to
discount = discount * self.gamma * self.gae_lambda * (1 - int(dones_arr[k])) + int(dones_arr[k])
so if it is done, the first term is 0 and the second term "turns on" to set it equal to 1. otherwise, the first term is as before and unaffected.
I was having a lot of issues in a different application where the NN would do well and then basically die off and get stuck in a suboptimal policy (that it never got out of ). This seemed to fix it.
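An equivalent and perhaps cleaner way to apply this fix is the standard backward GAE recursion, where a done mask zeroes both the bootstrap value and the running advantage at episode boundaries. This is a sketch of that idea, not the exact loop from the video:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation with episode-boundary masking.
    `values` must have one extra bootstrap entry beyond `rewards`."""
    n = len(rewards)
    advantages = np.zeros(n, dtype=np.float32)
    last_adv = 0.0
    for t in reversed(range(n)):
        mask = 1.0 - float(dones[t])       # 0 at episode boundaries
        # TD error: bootstrap value is masked out on terminal steps
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        # running advantage also resets at episode boundaries
        last_adv = delta + gamma * gae_lambda * mask * last_adv
        advantages[t] = last_adv
    return advantages
```

Because the recursion runs backward in one pass, it is O(T) instead of the O(T^2) nested loops, and the mask handles multiple episodes in the same buffer automatically.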
Hey Phil, great video! Just wondering if there would be a change in performance from switching the network implementation of the DQN agent in your Udemy course to nn.Sequential?
Not sure, I haven't tried.
Hi Phil, I am exploring RL for autonomous driving on my channel by trying to outsource RL to Stable Baselines3; however, I very quickly ran into library conflicts in SB3. Now I was looking for ground-up builds of RL and found this video. Thanks. I will try to implement it. If you have any other tips, please reach out. Hopefully my videos describe what I would like to do. Thanks. Vadim
Hello. Thanks for the great tutorial. I've got a question. According to your code the very last advantage in the trajectory always remains 0. Why so? Shouldn't it be equal to reward(t) - v(t)?
Did you ever figure this out, I'm having a problem with this
This tutorial is so good, I bought your courses on DQL and actor-critic on Udemy. Are you planning to add a full length course for PPO as well?
I'll be adding the PPO module soon
@@MachineLearningwithPhil Hello from Ukraine, im waiting too)
@@MachineLearningwithPhil A PPO module sounds great!!!
Thank you!
Hello, I know it's been 2 years since the initial release of the video, but I had a question regarding the implementation. When we randomize each entry in the batch, aren't we losing the concept of a trajectory? Now we have a mini-batch of uncorrelated state-action pairs, yet we still sum the respective advantages as if they were from sequential timesteps (a trajectory).
Unless I made a mistake, we calculate advantages using the full trajectory and then shuffle. As long as we keep the state and new state pairs together it's sufficient for learning.
For more refined code, see my protorl repository on GitHub.
GitHub/philtabor/protorl
13:09, I think the exponent on the smoothing/decay term should be -1, not +1: try to iterate with t=2 and T=5, for example, and you'll see what I mean.
Thank you so much for putting in the effort to do the whole implementation, which is relatively easier to grasp than the paper. I am very new to RL and I have a rather weird question (because no one actually addressed it, but ignore me if I am being stupid): when you call the learn function for the first time after doing 20 steps, wouldn't the new_probs be equal to the old_probs, because essentially the neural network hasn't learned anything yet? Would both these values be random until several iterations have passed? And if they are random, how is the agent learning?
Yup, new probs and old probs are the same on the first iteration. The probs get updated based on the sampled rewards from the environment, through the advantage calculation.
Hey Phil, why do you use the advantages to calculate the returns? Shouldn't the returns be the discounted rewards?
Yeah, great question. Couple reasons:
1) I was basing my code off the work of another (as mentioned in the video), so perhaps I was influenced by that a little more than I usually would be. I saw other solutions use the same thing, so this is one situation where I deferred to what other people do, since the original paper isn't particularly clear.
2) However, there is a theoretical basis for doing it this way. They aren't pulling it out of a hat. The returns serve as an approximation of Q, while the critic serves as the approximation of the value function V. The advantage is defined as A = Q - V, so we can substitute and say (approximately) that the returns are A + V.
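In code, that substitution comes down to a critic target of advantage plus old value. A minimal sketch (the tensor names are mine, not the video's):

```python
import torch
import torch.nn.functional as F

def critic_loss(advantages, values, critic_values):
    # returns ~ Q = A + V: the advantage plus the stored value
    # estimate stands in for the empirical return
    returns = advantages + values
    return F.mse_loss(critic_values, returns)
```

So the critic is regressed toward (advantage + old value) rather than toward raw discounted rewards, which is the choice being discussed above.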
Is there any way to create a custom environment similar to the ones in OpenAI Gym?
Can you explain why you used L_VF as the MSE of (advantage + value, critic_value) instead of (value, critic_value)?
I have the same question...
Hi Phil. I might have spotted an error in your code. At line 160 in the file ppo_torch.py, the for loop should break each time an episode terminates (when the done flag is true), because the advantage should be calculated per episode rather than over all the steps in memory, which is the way you were doing it. Please correct me if I am wrong. Thank you very much.
Yeah, I gotta take a look. I'll make an announcement on the community tab and update the github when I do. Can't modify this video and I don't want to redo the whole thing.
@@MachineLearningwithPhil Thank you for your fast reply. Please do change the code on GitHub, as I was using this code for experiments before I discovered this mistake and will be using the updated code when it is ready. Thank you very much.
@@MachineLearningwithPhil Hi Phil. I am sorry for my comment. Your implementation is right and my understanding was wrong.
@@beizhou2488 Are you sure? He is mixing steps of different episodes in the batches, but they are being treated like they belong to the same episode. Or is my understanding also wrong?
Hi Phil,
You explained that the batches are divided into subsets, such as groups of five, and prior to this, the indices are shuffled. My query relates to this shuffling process. By shuffling, each batch could consist of unrelated samples. How, then, can we accurately compute the advantage within these batches, given the potential lack of internal correlation?
Your insights would be greatly appreciated and I eagerly look forward to your explanation.
Great question. We calculate the advantage first, and shuffle concurrently. So each time step has the correct advantage associated with it.
i have a question about AI, what is a neuromorphic computer ?
Hello, nice video. I have a question. You said that the length of the memory is 20 steps, which should be less than the length of the episode, which is 200 steps. Shouldn't the two be the same? I would terminate the episode with the last step and then update the policy parameters. Instead, the policy update can occur multiple times within the same episode?
Nope, they're independent. The size of the replay buffer is a hyperparameter, while the episode length is fixed.
@@MachineLearningwithPhil Hi, I am really sorry to bother you again, but I can't figure it out. I understood that if I select 200 steps, the length of my episode is 200 steps (because the episode should be however many interactions there are between the agent and the environment), and so will be the dimension of the batch, which contains the experience gained during the episode. At the next episode, the agent will perform another 200 steps, and the size of the batch will still be 200 steps.
These days, I'm doubting what I've written above
Amazing video Phil. How do we determine V(s_t+1) and so on? Thanks
Value states are estimated using the critic network, which is trained using the environment rewards. Simply pass the state through the network and you will get an estimate.
I downloaded your code, but whenever it tries to save the models it gives a directory error. I saw that you made the tmp and ppo directories in the video to solve the issue, but for me that doesn't fix it. Do you need to make the files for PyTorch to save over them?
No, you shouldn't have to make the files. What error are you getting?
Hey @magik, I came across the same error as you did. Did you find a way of making it work? :)
Would it be possible to get the link to William's code?
Thank you so much for the content. Can you please explain the modification that should be done in the code for the multidiscrete action space?
You can check my GitHub for the code. It's under the advanced actor critic methods repo.
@@MachineLearningwithPhil Thank you so much for your help. Can you also explain the modifications to the code for MAPPO and IPPO in the multi-agent setting?
impressive programming skills
Thank you for the kind words
Hello, can I kindly ask you to implement multi-agent actor-critic? This is a very important and interesting topic, and there is no good implementation on the internet.
MADDPG is on the list
@@MachineLearningwithPhil Lovely, waiting with interest!
Hey Phil, I'd love to hear an update on the nn.Sequential performance issue if you happen to have an update.
same
im also curious
I guess the problem may come from the re-use of the batchnorm layer.
Can this implementation work for computation offloading in edge computing?
Hello,
I have a question.
Your code won't work with a continuous action space, or am I wrong? Should I make any modifications?
It needs modifications. The actor will output a mean and sigma that you feed into a normal distribution.
Hello Phil! This video is dope but I noticed that this solution has been implemented for Discrete action spaces. Is it possible for you to implement a continuous action space tutorial? I am currently working on a continuous action space problem (Unity ML-Agents' crawler problem) using DDPG and the agent's performance doesn't improve at all and even if it improves, the performance suddenly falls off a cliff. I have however found that many people including Unity itself have completed the Crawler problem using PPO network. A categorical distribution is used in discrete action space, right? Others are using some kind of a normal distribution and I simply couldn't understand what is going on. It would be great if you can do the continuous action space version of PPO. Thank you in advance.
hey kishore, have you figured it out by now how you change it from discrete to continous? i am facing the same problem
@@mannequins_ Not yet Amon. I will let you know if I implement it though.
Yeah, so bro, all you have to do is output two neurons per action: one for the mean and one for the variance. Then return (say, for 2 actions with a continuous domain) a MultivariateNormal(mus, covariance matrix with the variances on the diagonal) distribution object. Now in the choose_action function just use action = dist.sample(). The rest will remain the same, I guess.
@@ashishj2358 Hey Ashish thank you. I asked it 2 months ago and figured it out later but thank you nonetheless as it would help others who have the same doubt. Cheers!!!
Please share the code with me if you have modified it for a continuous action space!
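For anyone looking for a concrete starting point, here is a minimal sketch of the modification described above: the actor outputs a mean per action plus a learned log standard deviation, which together parameterize a Normal distribution. Class name, layer sizes, and hidden width are all illustrative, not taken from the video:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class ContinuousActor(nn.Module):
    """Sketch of a continuous-action PPO actor (assumed names/sizes)."""
    def __init__(self, input_dims, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(input_dims, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions))          # outputs the mean
        # state-independent log std, one per action dimension
        self.log_std = nn.Parameter(torch.zeros(n_actions))

    def forward(self, state):
        mu = self.body(state)
        std = self.log_std.exp()
        return Normal(mu, std)
```

choose_action then samples from the returned distribution and sums log_prob over the action dimensions; the clipped-ratio loss in the learn function stays the same.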
If you are storing all the states of different episodes in a single array, then how are you accurately calculating the advantage? Don't rewards from other episodes increase the variance of the advantage?
It's a mistake, the advantage calculation only makes sense when used on the same episode. I think you can solve this by resetting the discount factor to 1 if the dones array equals 1, so basically resetting the discount factor after each episode.
Thanks for your video. But I think this code is the REINFORCE policy gradient instead of actor-critic, because the advantage value is generated from reward_arr rather than the critic network.
I recommend reading the paper.
Finally 😃
Why doesn't the same code work for the Pendulum-v0 env?
At 47:30, why is the total loss = actor_loss + 0.5 * critic_loss? Why does only the critic loss get halved?
As far as I understand, you can view this 0.5 coefficient as yet another hyperparameter. If you look at the paper ("Proximal Policy Optimization Algorithms" by J. Schulman et al.), in formula (9) this is the c1 coefficient.
Thank you for the informative content! Unfortunately, as of April 2024, the code throws the following error:
File "...\main.py", line 31, in <module>
    action, prob, val = agent.choose_action(observation)
File "...\ppo_torch.py", line 136, in choose_action
    state = T.tensor([observation], dtype=T.float).to(self.actor.device)
ValueError: expected sequence of length 4 at dim 2 (got 0)
I know 'CartPole-v0' is outdated, but updating it does not solve the problem.
The newest gym returns the observation and info from reset.
It also returns new observation, reward, done, truncated, info from the step function.
Hey
Why is entropy required in this context, and why does it not appear in your code?
Entropy is used to prevent premature convergence. I don't think it was required here.
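For reference, adding the entropy bonus is a one-line change to the total loss. A sketch, with ent_coef standing in for the paper's c2 coefficient; 0.01 is an illustrative value, not one from the video:

```python
import torch
from torch.distributions import Categorical

def total_loss(actor_loss, critic_loss, dist, ent_coef=0.01):
    # Subtracting the entropy rewards more stochastic policies,
    # discouraging premature convergence to a deterministic one.
    return actor_loss + 0.5 * critic_loss - ent_coef * dist.entropy().mean()
```

Here dist is the Categorical distribution already built in choose_action, so no extra network output is needed.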
Why do I get this error when running main:
File "C:\Users\user\anaconda3\lib\site-packages\torch\serialization.py", line 193, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'tmp/ppo\\actor_torch_ppo'
?
You have to do a mkdir
Did you find out how to do it, @oly? I created the directories but I don't know what's not working.
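One portable fix is to create the checkpoint directory from Python before saving, so no manual mkdir is needed on any platform. The path matches the video's code; os.makedirs with exist_ok=True is safe to call repeatedly:

```python
import os

# Create tmp/ and tmp/ppo/ if they don't exist yet, so torch.save
# never hits FileNotFoundError when writing checkpoints.
chkpt_dir = os.path.join('tmp', 'ppo')
os.makedirs(chkpt_dir, exist_ok=True)
# e.g. T.save(self.actor.state_dict(),
#             os.path.join(chkpt_dir, 'actor_torch_ppo'))
```

Using os.path.join also sidesteps the mixed 'tmp/ppo\\actor_torch_ppo' separator seen in the Windows error above.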
Maybe a dict-type return is more convenient than a tuple return.
Hello, I want to have multiple outputs so I can use this in my environment. How do I do this?
Change the number of outputs on the final layer of the policy and set the activation function
@@MachineLearningwithPhil Thank you, but there is another problem: when I do that, the neural network outputs a number from 0 to 25 instead of 26 numbers. I tried outputting probabilities, but these always sum to 1 and are very small numbers, so it doesn't work. How do I solve this?
What keyboard are you using?
Cooler master something or other. It's from 2014 so I don't think they still make the specific model. It's a mechanical with the cherry switches
How do I modify it so that it can be used in a continuous action space?
You can use a beta distribution instead of categorical. Check my GitHub for the advanced actor critic course repo.
Traceback (most recent call last):
File "W:\Programming\PPO\main.py", line 12, in <module>
    agent = Agent(n_actions=env.action_space.n, batch_size=batch_size, alpha=alpha, n_epochs=n_epochs, input_dims=env.observation_space.shape)
File "W:\Programming\PPO\agent.py", line 16, in __init__
    self.critic = CriticNetwork(input_dims, alpha)
TypeError: __init__() takes 1 positional argument but 3 were given
I hate Python... does anybody know what is going on?
Merry Christmas, or whatever holiday you do or don't observe.
Merry Christmas Andrew!
Any plan for PPG?
I will add it to the pile. Thanks for the suggestion.
Hey Phil,
thank you so much for making this available for free!
I encountered a problem when running in a non-CUDA-enabled environment. Did anyone have a similar problem?
File "main.py", line 32, in <module>
    action, prob, val = agent.choose_action(observation)
File "/home/jovyan/UA-cam-Code-Repository/ReinforcementLearning/PolicyGradient/PPO/torch/ppo_torch.py", line 138, in choose_action
    state = T.tensor(observation, dtype=T.float).to(self.actor.device)
ValueError: expected sequence of length 4 at dim 1 (got 0)
Yes, I have this too. Did you solve the issue?
In the new gym interface, reset returns both an observation and the debug info.
Step now returns an additional variable: truncated.
So you need to take these into account when getting information back from the environment. You will also need to terminate the while loop when either done or truncated is true.
I assume in real life you don't just type out a paper in Python and debug the code in slightly over an hour, right? Or am I wrong?
No, I definitely spend a large amount of time getting stuff to work. I'm working off a cheat sheet in these videos.
Use mkdir -p to create folders like /tmp/ppo please (mkdir -p /tmp/ppo), not mkdir tmp followed by mkdir tmp/ppo; one single command is enough!
Also, stop using :wq with vi/vim; use :x instead, it does exactly the same thing with one letter!
9:35 "The advantage is just a measure of the goodness of each state", that is not correct. The advantage is a measure of how much better a particular action is compared to the average action taken from the same state.
31:30
8
I am sorry, but I gave up on this tutorial quite fast. I can't really understand something new when the person teaching tells me, right at the start, about all the problems, all the edge cases, the parameters, the algorithms, and everything else before actually explaining the first step. This is like taking your first course in calculus, and before you learn what limits even are, the lecturer tries to explain the problems with taking derivatives of double integrals over the whole plane.
This was an awful presentation. It makes no sense whatsoever unless you already have a rough idea of what PPO is and the key definitions. How do I get back the 15 minutes I wasted watching it?
Make sure to smash that dislike button
It's honestly astonishing that one would take the time to write such a comment. Phil is the best
Hey, i am getting "FileNotFoundError: [Errno 2] No such file or directory: 'tmp/ppo/actor_torch_ppo'" - any reasons why?
Do a mkdir tmp && mkdir tmp/ppo