Proximal Policy Optimization (PPO) is Easy With PyTorch | Full PPO Tutorial

  • Published 23 Dec 2020
  • Proximal Policy Optimization is an advanced actor-critic algorithm designed to improve performance by constraining updates to our actor network. It's relatively straightforward to implement in code, and in this full tutorial you're going to get a mini lecture covering the essential concepts behind the PPO algorithm, as well as a complete implementation in the PyTorch framework. We'll test our algorithm in a simple OpenAI Gym environment: CartPole.
    Code for this video is here:
    github.com/philtabor/UA-cam-...
    A written crash course in PPO can be found here:
    www.neuralnet.ai/a-crash-cour...
    Learn how to turn deep reinforcement learning papers into code:
    Get instant access to all my courses, including the new Prioritized Experience Replay course, with my subscription service. $29 a month gives you instant access to 42 hours of instructional content plus access to future updates, added monthly.
    Discounts available for Udemy students (enrolled longer than 30 days). Just send an email to sales@neuralnet.ai
    www.neuralnet.ai/courses
    Or, pickup my Udemy courses here:
    Deep Q Learning:
    www.udemy.com/course/deep-q-l...
    Actor Critic Methods:
    www.udemy.com/course/actor-cr...
    Curiosity Driven Deep Reinforcement Learning
    www.udemy.com/course/curiosit...
    Natural Language Processing from First Principles:
    www.udemy.com/course/natural-...
    Reinforcement Learning Fundamentals
    www.manning.com/livevideo/rei...
    Here are some books / courses I recommend (affiliate links):
    Grokking Deep Learning in Motion: bit.ly/3fXHy8W
    Grokking Deep Learning: bit.ly/3yJ14gT
    Grokking Deep Reinforcement Learning: bit.ly/2VNAXql
    Come hang out on Discord here:
    / discord
    Need personalized tutoring? Help on a programming project? Shoot me an email! phil@neuralnet.ai
    Website: www.neuralnet.ai
    Github: github.com/philtabor
    Twitter: / mlwithphil

COMMENTS • 157

  • @MachineLearningwithPhil
    @MachineLearningwithPhil  3 years ago +11

    This content is sponsored by my Udemy courses. Level up your skills by learning to turn papers into code. See the links in the description.

  • @dennismusingila3012
    @dennismusingila3012 3 years ago +7

    Best Christmas gift, I'll definitely buy your course next year

  • @CryptoWizards
    @CryptoWizards 1 year ago +4

    Truly an incredible video. I have looked all over the internet for a simple implementation of PPO - this was the best and only true example I could find.

  • @juleswombat5309
    @juleswombat5309 2 years ago

    Wow, that was an awesome introduction and overview of how the PPO algorithm is structured. I understand basic PyTorch and could follow most of this. I just get confused by those squeeze operations. Many thanks. Impressive watching live code being written.

  • @felixh.7743
    @felixh.7743 2 years ago

    Hey Phil, thank you for the video! Your content really helped me get into RL.
    I think I have spotted a potential improvement for the code:
    In your introduction you stated that we sample batches and perform 4 epochs of updates on each of them. However, I think the way the for loops are arranged in this code has it the other way around: sampling batches of 5 four times and then performing a single update on each of them.

  • @handokosupeno5425
    @handokosupeno5425 2 years ago

    Thank you Phil. For a year I had no real understanding of PPO, but after watching your tutorial I now understand what PPO is. You are an amazing person, Phil.

  • @arnabsarkar8511
    @arnabsarkar8511 3 years ago +2

    Thank you very much for this sir, great tutorial and Merry Christmas!

  • @TJPancakes
    @TJPancakes 8 months ago +1

    With your help I was able to implement PPO for my own NN project :) It's learning like I've never seen before in my billion attempts at throwing code against a wall.

  • @maarten8238
    @maarten8238 3 years ago +2

    Thanks a lot Phil, this is really great!

  • @Jaegg
    @Jaegg 3 years ago +7

    Thank you for covering PPO. Will you do a video on a TensorFlow implementation as well?

  • @fredchache6319
    @fredchache6319 2 years ago +1

    Great video, code ran perfectly the first time, just like in the video!

  • @MemeSurreal
    @MemeSurreal 3 years ago +5

    Thank you so much for this Phil, this will be my Christmas present! hahaha

  • @ademord
    @ademord 2 years ago

    Best video I've ever seen, thanks so much for taking the time from your real life to make it, and I am happy to see more of your content. I think a sentence or two to clarify "why we need log probs" (maybe it's something pretty straightforward) would have been great; it is not 100% clear to me why we do that log probs collection and then the exp(). Sorry for not knowing :/ Have a great day!

  • @zVincoo
    @zVincoo 2 years ago

    these videos are so valuable, many thanks!

  • @zacheberhart1564
    @zacheberhart1564 3 years ago +2

    I've been trying to understand PPO for a while now and this is by *far* the clearest and best-structured explanation (and in PyTorch!) -- thank you very much!!

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago

      Glad you found it helpful!

    • @zacheberhart1564
      @zacheberhart1564 3 years ago

      @@MachineLearningwithPhil One thing I'm still a little unclear about is the trajectories, most implementations and pseudocode use trajectories but you don't seem to do this (or at least it seems you only learn from one trajectory at a time?) -- is there a reason why? Or am I missing something? Is it because it's an Actor-Critic implementation of PPO? If that's the case, is there a specific reason to only use one trajectory for AC? Thanks again :)

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago

      Just ease of implementation. In the paper they do both ways so I opted for the single thread GPU option.

    • @zacheberhart1564
      @zacheberhart1564 3 years ago

      @@MachineLearningwithPhil That makes sense, thanks for the response!

  • @jeremiahjohnson6052
    @jeremiahjohnson6052 1 year ago +4

    You can make the code a little easier and more efficient by using dictionaries to store the states, probs, vals, etc. That way you can avoid typos. Also, PyCharm is great for spotting where things are missing or auto-suggesting once you start typing.
    In the init, I put:
    self.sd = {"states": [], "probs": [], "vals": [], "actions": [], "rewards": [], "dones": []}
    And in generate_batches, I use:
    return [np.array(x) for key, x in self.sd.items()], batches
    store_memory just takes a dict with the same keys:
    def store_memory(self, state_list):
        for key, val in state_list.items():
            self.sd[key].append(val)
    Anyway, dictionaries are a great tool for organizing a lot of lists.
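    Pulling those pieces together, here is a minimal sketch of what such a dictionary-based memory could look like. The class name, batch_size handling, and clear_memory method are illustrative assumptions, not Phil's exact code:

    import numpy as np

    class DictPPOMemory:
        def __init__(self, batch_size):
            self.batch_size = batch_size
            self.sd = {"states": [], "probs": [], "vals": [],
                       "actions": [], "rewards": [], "dones": []}

        def store_memory(self, transition):
            # transition is a dict keyed exactly like self.sd
            for key, val in transition.items():
                self.sd[key].append(val)

        def generate_batches(self):
            n = len(self.sd["states"])
            indices = np.arange(n, dtype=np.int64)
            np.random.shuffle(indices)                      # shuffle indices, not the stored values
            batches = [indices[i:i + self.batch_size]
                       for i in range(0, n, self.batch_size)]
            return [np.array(x) for x in self.sd.values()], batches

        def clear_memory(self):
            for key in self.sd:
                self.sd[key] = []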

  • @marku7z
    @marku7z 3 years ago +3

    Hi Phil, very nice and helpful content, as always!
    I have a question:
    What changes have to be made when solving an environment where the most important reward comes at the end of the game, like Lunar Lander? I implemented PPO once, but it had bad performance on Lunar Lander because the timesteps taken are smaller than the episode length, so the important information is lost. I modified the algo so that it plays 2000 timesteps of complete episodes, except for the last episode of that PPO run. It then solves Lunar Lander, but not as well as SAC with automated temperature optimization, 2 Q networks and no V network, or TD3. I didn't expect PPO to perform this poorly given how much "hype" the algorithm has.
    Do you have any thoughts or solutions on the problem of solving environments like Lunar Lander with PPO?

  • @dermitdembrot3091
    @dermitdembrot3091 3 years ago +3

    Thanks for this much awaited video! One remark: I think the buffer size should be much larger than the episode length, not the other way around. The reason is that samples within an episode are correlated and (heavily) so would be the mini batches, increasing the gradient variance.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago +2

      Your argument is perfectly logical.
      I double checked the paper and it indeed says "where T is much less than the episode length". I wonder if the smoothing parameter in the advantage calculation makes a big enough difference.

    • @dermitdembrot3091
      @dermitdembrot3091 3 years ago +2

      @Machine Learning with Phil interesting! I think they do that (with the cut-off advantage summation) since they want to learn "as soon as possible" and not wait until an episode ends. This is of course important if episodes are very long, as seems to be the case judging by the name of the "mujoco one-million-timesteps benchmark", but otherwise I don't see a big advantage.
      In the smallest task they have T=128 timesteps, with correlation of course diminishing over time, and additionally they have N=8 actors in parallel, and I would guess that they mix the N*T = 1024 observations from different actors for their update, greatly reducing the variance. For a non-parallel implementation with relatively short episodes, it seems less clear whether setting T below the episode length is good. By the way, you are right that the GAE parameter allows looking no more than 128 steps into the future, since it makes contributions of timesteps beyond that vanishingly small.

  • @ashishj2358
    @ashishj2358 3 years ago +1

    Hi Phil,
    Great implementation, and I love your videos. Just a small detail you might like to add is the "early stopping" criterion that many researchers use while implementing PPO. It basically says that if LogP/LogP_old (which is what we compute in the inner loop of the learn function) has exceeded a threshold, stop the iteration. You can find it in OpenAI's Spinning Up PPO docs as well.
    Thanks! You have been a great help in my research for converting theory to practice.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago

      Great modification, I'll have to look into it. Thanks for the tip Ashish.

    • @ashishj2358
      @ashishj2358 3 years ago +2

      Correction: it's not the I.S. ratio itself but the approximate KL divergence that should not go beyond a threshold, because even with the clip the policy sometimes collapses, as pointed out by a lot of people.
      They use:
      approx_KL_div = (LogP - LogP_old).mean()
      if approx_KL_div > threshold:
          break
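      For readers wondering where such a check could go, here is a minimal sketch of an update loop with KL-based early stopping, roughly in the Spinning Up style. target_kl, old_log_probs, actor, states, actions and n_epochs are placeholders for whatever your own implementation provides, not Phil's exact variables:

      target_kl = 0.01                                    # tunable threshold
      for epoch in range(n_epochs):
          dist = actor(states)                            # distribution under the current policy
          new_log_probs = dist.log_prob(actions)
          approx_kl = (old_log_probs - new_log_probs).mean().item()
          if approx_kl > 1.5 * target_kl:                 # policy has drifted too far from the old one
              break                                       # stop updating on this batch of data
          prob_ratio = (new_log_probs - old_log_probs).exp()
          # ... clipped surrogate loss, critic loss, and optimizer step as usual ...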

  • @pranavkulkarni6489
    @pranavkulkarni6489 3 years ago +2

    Thank you for the tutorial. I had one question: in some implementations I have seen, people generally use "next_state" values to calculate the advantage, but here we are not using next_state anywhere. Why?

  • @user-bq8vb7es6o
    @user-bq8vb7es6o 3 years ago +1

    Best Course EVER!!!

  • @ArMeD217
    @ArMeD217 1 year ago

    Very nice explanation of PPO. I've been working on it for about 3 weeks and still couldn't get it to work. I believe this video will help me a lot in my project on multi agent deep reinforcement learning

  • @JousefM
    @JousefM 3 years ago +4

    Comment for the algorithm and of course smashing that like button!

  • @beizhou2488
    @beizhou2488 3 years ago +1

    When using the policy after training, should we sample an action from the distribution or select the action with the highest probability value?

  • @andrewspanopoulos1115
    @andrewspanopoulos1115 2 years ago +10

    Amazing video, I have one question though (which was probably addressed in another comment, but just to be sure):
    Shouldn't each batch of data contain the steps of only 1 episode? Because of the way the PPOMemory class is organized, we are storing individual steps in lists and then picking a sequence of steps at random. But the sequence of steps we pick might belong to two different episodes, which makes the calculations done in line 160 useless, as we are computing advantages for steps of different episodes.
    Plus, wouldn't it make sense to compute these advantages only once, and then store them in the PPOMemory in order to avoid recomputing them in every epoch, every time we train the agent?

    • @nathanjohnson5762
      @nathanjohnson5762 1 year ago +1

      At the end of the agent.learn() function the memory is cleared so there should only be T (in this case 20) timesteps of data from one singular episode in the memory, at any given time.

  • @ademord
    @ademord 2 years ago

    Also, is there a video where you explain Shannon entropy? Probably not a complex concept, but hearing it from you specifically would be great. Great coverage of concepts, concise, clear speech. Cheers and stay well!

  • @beiyang7057
    @beiyang7057 2 years ago

    Thanks for your tutorial!! I have some confusion. I'll be grateful if you can help me. I wonder what the difference is between state and obs. And what about the x in the paper, which means "some state information, could consist of the observations of all agents"?

  • @_jiwi2674
    @_jiwi2674 3 years ago +1

    Thanks so much for continuing to share these great videos Phil! They help so much. Would you happen to plan on implementing code for cooperative multi-agent settings? I've seen a couple of implementations by others but they are quite unclear. Would be fantastic if you do :)

  • @pranavkulkarni6489
    @pranavkulkarni6489 3 years ago +1

    Is there any way to select the parameter N (memory buffer size) and the batch size? What do they depend on?

  • @kunyanglin1053
    @kunyanglin1053 3 years ago

    Hey Phil, could you explain why you use the return as the approximation of the value function instead of the expected return?

  • @anshumansharma6758
    @anshumansharma6758 3 years ago

    Brilliant content.

  • @end-quote
    @end-quote 2 years ago

    Many thanks for your video!

  • @Slanimero
    @Slanimero 3 years ago

    I was wondering about the advantage function A-hat sub t; it looks very similar to the temporal difference function. Are they the same thing?

  • @roomo7time
    @roomo7time 1 year ago

    Amazing video. A huge contribution to a coding dummy like me. Immense gratitude to you.

  • @LudvigPedersen
    @LudvigPedersen 3 years ago +3

    Thanks Phil for this great episode! One question: are you sure you got the advantage calculation right? Your discount variable begins at one, but it is not reset if the game is done. The PPO memory can contain multiple games.

    • @LudvigPedersen
      @LudvigPedersen 3 years ago

      I'm working on my own implementation and am using something like this:
      delta = reward_arr[k] + Gamma * value_arr[k+1] * mask - value_arr[k];
      adv = delta + Gamma * Lambda * mask * adv;

    • @lbers238
      @lbers238 3 years ago

      I've been wondering that as well. Right, multiple episodes aren't handled correctly.
      It's also calculating the advantage again for every epoch even though it should stay the same. It might be better to calculate the advantage after each episode and only save that instead of everything you need to calculate it later.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago

      I believe this is a valid issue. Can you raise it on my GitHub? Link in description.

    • @jimmyrisk4412
      @jimmyrisk4412 1 year ago

      @@MachineLearningwithPhil I couldn't find the issue raised on GitHub, but I will propose a fix there. For anyone reading this who wants a fix, I think this will work: change the line
      discount *= self.gamma * self.gae_lambda
      to
      discount = discount * self.gamma * self.gae_lambda * (1 - int(dones_arr[k])) + int(dones_arr[k])
      so if it is done, the first term is 0 and the second term "turns on" to set it equal to 1. Otherwise, the first term is as before and unaffected.
      I was having a lot of issues in a different application where the NN would do well and then basically die off and get stuck in a suboptimal policy (that it never got out of). This seemed to fix it.
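      For anyone comparing these two fixes, here is a minimal sketch of a backward GAE pass that masks across episode boundaries. reward_arr, values, dones_arr, gamma and gae_lambda are assumed to mean the same things as in the video, but this is an illustration rather than Phil's implementation:

      import numpy as np

      advantage = np.zeros(len(reward_arr), dtype=np.float32)
      last_adv = 0.0
      for t in reversed(range(len(reward_arr) - 1)):
          mask = 1.0 - float(dones_arr[t])                     # 0 at a terminal step, 1 otherwise
          delta = reward_arr[t] + gamma * values[t + 1] * mask - values[t]
          last_adv = delta + gamma * gae_lambda * mask * last_adv
          advantage[t] = last_adv                              # last index stays 0, as in the video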

  • @kyriakos98apoel
    @kyriakos98apoel 2 years ago

    Hey Phil, great video! Just wondering if there would be a change in performance from switching the network implementation of the DQN agent in your Udemy course to nn.Sequential?

  • @FullSimDriving
    @FullSimDriving 17 days ago

    Hi Phil, I am exploring RL for autonomous driving on my channel by trying to outsource RL to Stable Baselines3; however, I very quickly ran into library conflicts in SB3. Now I was looking for ground-up builds of RL and found this video. Thanks. I will try to implement it. If you have any other tips, please reach out. Hopefully my videos describe what I would like to do. Thanks, Vadim

  • @andriiartomov237
    @andriiartomov237 2 years ago

    Hello. Thanks for the great tutorial. I've got a question. According to your code the very last advantage in the trajectory always remains 0. Why so? Shouldn't it be equal to reward(t) - v(t)?

    • @GroupChase
      @GroupChase 1 year ago

      Did you ever figure this out? I'm having a problem with this.

  • @juliahk
    @juliahk 3 years ago +3

    This tutorial is so good, I bought your courses on DQL and actor-critic on Udemy. Are you planning to add a full length course for PPO as well?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago +7

      I'll be adding the PPO module soon

    • @oleksandr8482
      @oleksandr8482 2 years ago +1

      @@MachineLearningwithPhil Hello from Ukraine, I'm waiting too)

    • @kevinayers7144
      @kevinayers7144 2 years ago +2

      @@MachineLearningwithPhil A PPO module sounds great!!!

  • @theshortcut101
    @theshortcut101 3 years ago +1

    Thank you!

  • @joelbaptista9725
    @joelbaptista9725 8 months ago

    Hello, I know it's been 2 years since the initial release of the video, but I had a question regarding the implementation. When we randomize each entry in the batch, aren't we losing the concept of a trajectory? Now we have a mini-batch of uncorrelated state and action pairs; however, we still sum the respective advantages as if they were from sequential timesteps (a trajectory).

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  8 months ago

      Unless I made a mistake, we calculate advantages using the full trajectory and then shuffle. As long as we keep the state and new state pairs together, it's sufficient for learning.
      For more refined code, see my protorl repository on GitHub:
      github.com/philtabor/protorl
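      A minimal sketch of the kind of index shuffling being described, where advantages are computed over the stored trajectory in time order and only the indices are shuffled into mini-batches; the names here (states, batch_size) are illustrative:

      import numpy as np

      n_states = len(states)                         # full stored trajectory, in time order
      indices = np.arange(n_states, dtype=np.int64)
      np.random.shuffle(indices)                     # shuffle indices, not the advantage values
      batches = [indices[i:i + batch_size] for i in range(0, n_states, batch_size)]
      # advantage[idx] still lines up with states[idx] for every idx in every batch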

  • @sunaxes
    @sunaxes 8 months ago

    13:09 - I think the exponent on the smoothing/decay term should be -1, not +1. Try to iterate with t=2 and T=5, for example, and you'll see what I mean.

  • @smitasingh9764
    @smitasingh9764 2 years ago

    Thank you so much for putting in the effort to do the whole implementation, which is a bit easier to grasp than the paper. I am very new to RL and I have a rather weird question (because no one has actually addressed it, but ignore me if I am being stupid): when you call the learn function for the first time after doing 20 steps, wouldn't the new_probs be equal to the old_probs, because essentially the neural network hasn't learned anything yet? Would both these values be random until several iterations in? And if they are random, how is the agent learning?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  2 years ago

      Yup, new probs and old probs are the same on the first iteration. The probs get updated based on the sampled rewards from the environment, through the advantage calculation.

  • @johannesmeier5958
    @johannesmeier5958 3 years ago

    Hey Phil, why do you use the advantages to calculate the returns? Shouldn't the returns be the discounted rewards?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago +1

      Yeah, great question. A couple of reasons:
      1) I was basing my code off the work of another (as mentioned in the video), so perhaps I was influenced by that a little more than I usually would be. I saw other solutions use the same thing, so this is one situation where I deferred to what other people do, since the original paper isn't particularly clear.
      2) However, there is a theoretical basis for doing it this way. They aren't pulling it out of a hat. The returns serve as an approximation of Q, while the critic serves as the approximation of the value function. The advantage is defined A = Q - V, so we can substitute and say (approximately) that the returns are A + V.
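      In code terms, that reasoning corresponds to something like the sketch below; the variable names (advantage, values, batch, critic, states) are assumptions that mirror the kind of learn function being discussed, not a quote of the exact implementation:

      returns = advantage[batch] + values[batch]        # returns approximate Q = A + V
      critic_value = critic(states[batch]).squeeze()    # V(s) from the critic network
      critic_loss = ((returns - critic_value) ** 2).mean()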

  • @souravsanyal2612
    @souravsanyal2612 2 years ago

    Is there any way to create a custom environment similar to the ones in OpenAI Gym?

  • @maarhybrid3037
    @maarhybrid3037 2 years ago +1

    Can you explain why you used L_VF as the MSE of (advantage + value, critic_value) instead of (value, critic_value)?

  • @beizhou2488
    @beizhou2488 3 years ago +1

    Hi Phil. I might have spotted an error in your code. At line 160 in the file ppo_torch.py, the for loop should break each time an episode terminates (when the done flag is true), because the advantage should be calculated per episode rather than over all the steps in the memory, which is the way you were doing it. Please correct me if I am wrong. Thank you very much.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago +1

      Yeah, I gotta take a look. I'll make an announcement on the community tab and update the github when I do. Can't modify this video and I don't want to redo the whole thing.

    • @beizhou2488
      @beizhou2488 3 years ago

      @@MachineLearningwithPhil Thank you for your fast reply. Please do change the code on GitHub, as I was using this code for experiments before I discovered this mistake and will be using the updated code when it is ready. Thank you very much.

    • @beizhou2488
      @beizhou2488 3 years ago

      @@MachineLearningwithPhil Hi Phil. I am sorry for my comment. Your implementation is right and my understanding was wrong.

    • @andrewspanopoulos1115
      @andrewspanopoulos1115 2 years ago

      @@beizhou2488 Are you sure? He is mixing steps of different episodes in the batches, but they are being treated like they belong to the same episode. Or is my understanding also wrong?

  • @user-ws2zi5oc8k
    @user-ws2zi5oc8k 10 months ago

    Hi Phil,
    You explained that the batches are divided into subsets, such as groups of five, and prior to this the indices are shuffled. My query relates to this shuffling process. By shuffling, each batch could consist of unrelated samples. How, then, can we accurately compute the advantage within these batches, given the potential lack of internal correlation?
    Your insights would be greatly appreciated and I eagerly look forward to your explanation.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  10 months ago

      Great question. We calculate the advantage first, and shuffle concurrently. So each time step has the correct advantage associated with it.

  • @ijknm2531
    @ijknm2531 3 years ago +1

    I have a question about AI: what is a neuromorphic computer?

  • @user-qm4zv3nv2k
    @user-qm4zv3nv2k 11 months ago

    Hello, nice video. I've got a question. You said that the length of the memory is 20 steps, which should be less than the length of the episode, which is 200 steps. Shouldn't the two be the same? I would terminate the episode with the last step and then update the policy parameters. Instead, could the policy update occur multiple times within the same episode?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  11 months ago

      Nope, they're independent. The size of the replay buffer is a hyperparameter, while the episode length is fixed.

    • @user-qm4zv3nv2k
      @user-qm4zv3nv2k 10 months ago

      @@MachineLearningwithPhil Hi, I am really sorry to bother you again, but I can't figure it out. I understood that if I select 200 steps, the length of my episode is 200 steps (because the episode should be how many interactions there are between the agent and the environment), and so will be the dimension of the batch, which contains the experience gained during the episode. In the next episode, the agent will perform another 200 steps, and the size of the batch will still be 200 steps.
      These days, I'm doubting what I've written above.

  • @cbasile22
    @cbasile22 1 year ago

    Amazing video Phil. How do you determine V(s_t+1) and so on? Thanks

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  1 year ago

      Value states are estimated using the critic network, which is trained using the environment rewards. Simply pass the state through the network and you will get an estimate.

  • @magikarplore4941
    @magikarplore4941 3 years ago +1

    I downloaded your code, but whenever it tries to save the models it gives a directory error. I saw that you made the tmp and ppo directories in the video to solve the issue, but for me that doesn't fix it. Do you need to make the files for PyTorch to save over them?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago

      No, you shouldn't have to make the files. What error are you getting?

    • @theoz441
      @theoz441 2 years ago

      Hey @magik, I came across the same error as you did. Did you find a way of making it work? :)

  • @thomashirtz
    @thomashirtz 2 years ago

    Would it be possible to get a link to William's code?

  • @salwamostafa1332
    @salwamostafa1332 1 year ago

    Thank you so much for the content. Can you please explain the modifications needed in the code for a multi-discrete action space?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  1 year ago +1

      You can check my GitHub for the code. It's under the advanced actor critic methods repo.

    • @salwamostafa1332
      @salwamostafa1332 1 year ago

      @@MachineLearningwithPhil Thank you so much for your help. Can you also explain the modifications to the code for MAPPO and IPPO in the multi-agent setting?

  • @user-kn8tp7jo3c
    @user-kn8tp7jo3c 8 months ago

    Impressive programming skills.

  • @mahdieskandari3161
    @mahdieskandari3161 3 years ago +1

    Hello, can I kindly ask you to implement multi-agent actor-critic? This is a very important and interesting topic, and there is no good implementation on the internet.

  • @aidankennedy6973
    @aidankennedy6973 3 years ago +1

    Hey Phil, I'd love to hear an update on the nn.Sequential performance issue if you happen to have an update.

  • @RishiPratap-om6kg
    @RishiPratap-om6kg 1 year ago

    Can this implementation work for computation offloading in edge computing?

  • @mahdieskandari3161
    @mahdieskandari3161 3 years ago

    Hello,
    I have a question.
    Your code won't work with a continuous action space, or am I wrong? Should I make any modifications?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago

      It needs modifications. The actor will output a mean and sigma that you feed into a normal distribution.
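      A minimal sketch of what such a continuous actor head could look like, assuming a Normal distribution with a state-independent learned log-std; the class name, layer sizes and structure are illustrative, not Phil's code:

      import torch
      import torch.nn as nn
      from torch.distributions import Normal

      class ContinuousActor(nn.Module):
          def __init__(self, input_dims: int, n_actions: int, fc_dims: int = 256):
              super().__init__()
              self.body = nn.Sequential(
                  nn.Linear(input_dims, fc_dims), nn.ReLU(),
                  nn.Linear(fc_dims, fc_dims), nn.ReLU(),
              )
              self.mu = nn.Linear(fc_dims, n_actions)                 # mean of each action dimension
              self.log_sigma = nn.Parameter(torch.zeros(n_actions))   # learned log standard deviation

          def forward(self, state):
              mu = self.mu(self.body(state))
              sigma = self.log_sigma.exp().expand_as(mu)
              # sample with dist.sample(); joint log prob via dist.log_prob(action).sum(-1)
              return Normal(mu, sigma)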

  • @KishoreKumar-uz8ir
    @KishoreKumar-uz8ir 3 years ago +1

    Hello Phil! This video is dope but I noticed that this solution has been implemented for Discrete action spaces. Is it possible for you to implement a continuous action space tutorial? I am currently working on a continuous action space problem (Unity ML-Agents' crawler problem) using DDPG and the agent's performance doesn't improve at all and even if it improves, the performance suddenly falls off a cliff. I have however found that many people including Unity itself have completed the Crawler problem using PPO network. A categorical distribution is used in discrete action space, right? Others are using some kind of a normal distribution and I simply couldn't understand what is going on. It would be great if you can do the continuous action space version of PPO. Thank you in advance.

    • @mannequins_
      @mannequins_ 3 years ago

      Hey Kishore, have you figured out by now how to change it from discrete to continuous? I am facing the same problem.

    • @KishoreKumar-uz8ir
      @KishoreKumar-uz8ir 3 years ago

      @@mannequins_ Not yet Amon. I will let you know if I implement it though.

    • @ashishj2358
      @ashishj2358 3 years ago

      Yeah, so all you have to do is output two neurons per action, one for the mean and one for the variance. Then return (say, for 2 actions with a continuous domain) a MultivariateNormal distribution object (both mus, and a covariance matrix with the variances on the diagonal). Now in the choose_action function just use action = dist.sample(). The rest stays the same, I guess.

    • @KishoreKumar-uz8ir
      @KishoreKumar-uz8ir 3 years ago

      @@ashishj2358 Hey Ashish thank you. I asked it 2 months ago and figured it out later but thank you nonetheless as it would help others who have the same doubt. Cheers!!!

    • @tienichweb
      @tienichweb 3 years ago

      Please share the code with me if you have modified it for a continuous action space!

  • @Himanshu-xe7ek
    @Himanshu-xe7ek 2 years ago +1

    If you are storing all the states of different episodes in a single array, then how are you accurately calculating the advantage? Don't rewards from other episodes increase the variance of the advantage?

    • @morty6159
      @morty6159 2 years ago

      It's a mistake, the advantage calculation only makes sense when used on the same episode. I think you can solve this by resetting the discount factor to 1 if the dones array equals 1, so basically resetting the discount factor after each episode.

  • @CC-ec1il
    @CC-ec1il 3 years ago

    Thanks for your video. But I think this code is REINFORCE policy gradient instead of actor-critic, because the advantage value is generated from the reward array rather than the critic network.

  • @aliamiri4524
    @aliamiri4524 3 years ago +1

    Finally 😃

  • @maths_physique_informatiqu2925
    @maths_physique_informatiqu2925 4 months ago

    Why doesn't the same code work for the Pendulum-v0 env?

  • @janschnyder5137
    @janschnyder5137 3 years ago

    At 47:30, why is the total loss = actor_loss + 0.5 * critic_loss? Why does only the critic loss get halved?

    • @andriiartomov237
      @andriiartomov237 2 years ago

      As far as I understand, you can view this 0.5 coefficient as yet another hyperparameter. If you look at the paper ("Proximal Policy Optimization Algorithms" by J. Schulman et al.), in formula (9) this is the c1 coefficient.

  • @DANIELCOLOMBARO
    @DANIELCOLOMBARO 1 month ago

    Thank you for the informative content! Unfortunately, as of April 2024, the code throws the following error:
    File "...\main.py", line 31, in
        action, prob, val = agent.choose_action(observation)
    File "...\ppo_torch.py", line 136, in choose_action
        state = T.tensor([observation], dtype=T.float).to(self.actor.device)
    ValueError: expected sequence of length 4 at dim 2 (got 0)
    I know 'CartPole-v0' is outdated, but updating it does not solve the problem.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  1 month ago

      The newest gym returns the observation and info from reset.
      It also returns new observation, reward, done, truncated, info from the step function.

  • @user-ws2zi5oc8k
    @user-ws2zi5oc8k 9 months ago

    Hey
    Why is entropy required in this context, and why does it not appear in your code?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  9 months ago

      Entropy is used to prevent premature convergence. I don't think it was required here.

  • @olylpetit6188
    @olylpetit6188 2 years ago

    Why do I get this error when running main?
    File "C:\Users\user\anaconda3\lib\site-packages\torch\serialization.py", line 193, in __init__
        super(_open_file, self).__init__(open(name, mode))
    FileNotFoundError: [Errno 2] No such file or directory: 'tmp/ppo\\actor_torch_ppo'

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  2 years ago +1

      You have to do a mkdir.
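      As a sketch of a more hands-off alternative, you could also create the directory from Python before saving; os.makedirs with exist_ok is a generic workaround, not part of the original code:

      import os
      os.makedirs('tmp/ppo', exist_ok=True)  # creates tmp/ and tmp/ppo if they don't already exist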

    • @theoz441
      @theoz441 2 years ago

      Did you find out how to do it, @oly? I created the repos but I don't know what's not working.

  • @ouni127
    @ouni127 2 years ago

    Maybe a dict-type return is more convenient than a tuple return.

  • @Graverman
    @Graverman 1 year ago

    Hello, I want to have multiple outputs so I can use this in my environment. How do I do this?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  1 year ago

      Change the number of outputs on the final layer of the policy and set the activation function

    • @Graverman
      @Graverman 1 year ago

      @@MachineLearningwithPhil Thank you, but there is another problem: when I do that, the neural network outputs a number from 0 to 25 instead of 26 numbers. I tried outputting probabilities, but these always sum to 1 and are very small numbers, so it doesn't work. How do I solve this?

  • @RS-go2sn
    @RS-go2sn 3 years ago

    What keyboard are you using?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago

      Cooler Master something or other. It's from 2014 so I don't think they still make the specific model. It's a mechanical with Cherry switches.

  • @chuncheng2632
    @chuncheng2632 8 months ago

    How can I modify it so that it can be used in a continuous action space?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  8 months ago +1

      You can use a beta distribution instead of categorical. Check my GitHub for the advanced actor critic course repo.

  • @fmj.mytube8846
    @fmj.mytube8846 1 year ago

    Traceback (most recent call last):
    File "W:\Programming\PPO\main.py", line 12, in
        agent = Agent(n_actions=env.action_space.n, batch_size=batch_size, alpha=alpha, n_epochs=n_epochs, input_dims=env.observation_space.shape)
    File "W:\Programming\PPO\agent.py", line 16, in __init__
        self.critic = CriticNetwork(input_dims, alpha)
    TypeError: __init__() takes 1 positional argument but 3 were given
    I hate Python... does somebody know what is going on?

  • @myselfremade
    @myselfremade 3 years ago +1

    Merry Christmas, or whatever holiday you do or don't observe.

  • @Mahesha999
    @Mahesha999 3 years ago

    Any plan for PPG?

  • @BenjaminKelm
    @BenjaminKelm 1 year ago

    Hey Phil,
    thank you so much for making this available for free!
    I encounter a problem when running in a non-CUDA-enabled environment. Did anyone have a similar problem?
    File "main.py", line 32, in
    action, prob, val = agent.choose_action(observation)
    File "/home/jovyan/UA-cam-Code-Repository/ReinforcementLearning/PolicyGradient/PPO/torch/ppo_torch.py", line 138, in choose_action
    state = T.tensor(observation, dtype=T.float).to(self.actor.device)
    ValueError: expected sequence of length 4 at dim 1 (got 0)

    • @ajstatus9014
      @ajstatus9014 1 month ago

      Yes, I also have it. Did you solve this issue?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  1 month ago +1

      In the new gym interface, reset returns both an observation and the debug info.
      Step now returns an additional variable: truncated.
      So you need to take these into account when getting information back from the environment. You will also need to terminate the while loop when either done or truncated is true.
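      A minimal sketch of what that updated loop might look like with the newer Gym/Gymnasium API; the environment name is illustrative, and the agent object with choose_action and remember methods is assumed from the video rather than reproduced exactly:

      import gymnasium as gym                    # or a recent gym release with the same API

      env = gym.make('CartPole-v1')
      observation, info = env.reset()            # reset now returns (observation, info)
      done, truncated = False, False
      while not (done or truncated):             # stop on either termination or truncation
          action, prob, val = agent.choose_action(observation)
          observation_, reward, done, truncated, info = env.step(action)   # step returns 5 values
          agent.remember(observation, action, prob, val, reward, done)
          observation = observation_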

  • @pensiveintrovert4318
    @pensiveintrovert4318 1 year ago

    I assume in real life you don't just type out a paper in Python and debug the code in slightly over an hour, right? Or am I wrong?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  1 year ago +1

      No, I definitely spend a large amount of time getting stuff to work. I'm working off a cheat sheet in these videos.

  • @Manu-fk7ct
    @Manu-fk7ct 6 months ago

    Use mkdir -p to create folders like /tmp/ppo, please (mkdir -p /tmp/ppo), instead of mkdir tmp and then mkdir tmp/ppo; a single command is enough!
    Also, stop using :wq with vi/vim... use :x instead; it does exactly the same thing in one letter!

  • @einsteinsapples2909
    @einsteinsapples2909 7 days ago

    9:35 "The advantage is just a measure of the goodness of each state", that is not correct. The advantage is a measure of how much better a particular action is compared to the average action taken from the same state.

  • @bbence123
    @bbence123 1 year ago

    31:30

  • @LTL1204
    @LTL1204 4 months ago

    8

  • @eofirdavid
    @eofirdavid 1 month ago

    I am sorry, but I gave up on this tutorial quite fast. I can't really understand something new when the person teaching tells me, right at the start, about all the problems, edge cases, parameters, algorithms, and everything else before actually explaining the first step. This is like taking your first course in calculus and, before you even learn what limits are, the lecturer tries to explain the problems with taking derivatives of double integrals over the whole plane.

  • @pattiknuth4822
    @pattiknuth4822 3 years ago

    This was an awful presentation. It makes no sense whatsoever unless you already have a rough idea of what PPO is and the key definitions. How do I get back the 15 minutes I wasted watching it?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago +3

      Make sure to smash that dislike button

    • @aidankennedy6973
      @aidankennedy6973 3 years ago +2

      It's honestly astonishing that one would take the time to write such a comment. Phil is the best

  • @TheAusrali
    @TheAusrali 1 year ago

    Hey, I am getting "FileNotFoundError: [Errno 2] No such file or directory: 'tmp/ppo/actor_torch_ppo'" - any reason why?