An introduction to Policy Gradient methods - Deep Reinforcement Learning

  • Published 22 Dec 2024

COMMENTS • 195

  • @paulstevenconyngham7880 6 years ago +216

    This is the best explanation of PPO on the net hands down

  • @Alex-gc2vo 6 years ago +10

    Easily the best explanation of PPO I've ever seen. Most papers and lectures get too tangled up in the probabilistic principles and advanced mathematical derivations and completely lose sight of what these models are doing in high-level terms.

  • @bigdreams5554 2 years ago +2

    This guy actually knows what he's talking about. Excellent video.

  • @maloxi1472 4 years ago +8

    The value you provide in these videos is insane!
    Thank you very much for guiding our learning process ;)

  • @arkoraa 6 years ago +18

    I'm loving this RL series. Keep it up!

  • @sarahjamal86 5 years ago +4

    As someone who is working in the RL field... you did a very good job.

  • @DavidSaintloth 6 years ago +74

    I actually understood your explanation cover to cover on first view and thought the 19 minutes felt more like 5.
    Outstanding work.

    • @oguretsagressive 5 years ago +5

      One view, no pauses?! Not willing to be mean, but how can you be sure you've truly understood and weren't conquering Mount Stupid the whole time?

  • @Navhkrin 5 years ago +2

    12:19
    The min operator also prefers undoing a bad policy update:
    IF the advantage is positive but the probability of taking that action decreased, the min operator selects the unclipped objective here so the bad update gets undone.
    IF the advantage is negative but the probability of taking that action increased, the min operator also selects the unclipped objective to undo the bad update, just as mentioned in the video.
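
    A minimal Python sketch of the case analysis above (ε = 0.2 and the numbers are illustrative values only, not taken from the video): the min operator leaves the objective unclipped exactly when the last update made things worse, so the gradient can undo it.

        import numpy as np

        def clipped_surrogate(ratio, advantage, eps=0.2):
            # PPO's clipped objective: take the more pessimistic of the two terms.
            unclipped = ratio * advantage
            clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
            return np.minimum(unclipped, clipped)

        # Advantage > 0 but the action's probability decreased (ratio < 1 - eps):
        print(clipped_surrogate(0.5, +2.0))   # 1.0  -> the unclipped term is selected
        # Advantage < 0 but the action's probability increased (ratio > 1 + eps):
        print(clipped_surrogate(1.5, -2.0))   # -3.0 -> the unclipped term is selected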

  • @tyson96 1 year ago

    Explained so well, and it was intuitive as well. I learnt more from this video than from all the articles I found on the internet. Great job.

  • @yuktikaura 1 year ago

    Keep it up. Brevity is the soul of wit; it is indeed a skill to summarize the crux of a concept in such a lucid way!

  • @alializadeh8095 6 years ago +4

    Amazing! This was the best explanation of PPO I have seen so far

  • @BoltronRacingTeam 2 years ago +1

    Excellent video! Wonderful resource for anyone participating in AWS DeepRacer competitions.

  • @akshatagrawal819 5 years ago +325

    He is actually much better than Siraj Raval.

    • @oracletrading3000 5 years ago +4

      @@ahmadayazamin3313 what kind of scandal?

    • @oracletrading3000 5 years ago +1

      @@ahmadayazamin3313 I don't know; just watch one or two of his videos demonstrating RL for trading

    • @DeanRKern 4 years ago +1

      He seems to know what he's talking about.

    • @joirnpettersen 4 years ago +10

      He (Siraj) never explained any of what he was saying, and that was why I stopped watching him. He just rushed through the context and the results, explaining nothing.

    • @revimfadli4666 4 years ago +5

      @@oracletrading3000 'How to predict the stock market in 5 minutes'? More like how to expose oneself as a fraud & end a career in that time

  • @BDEvans 4 years ago

    By far the best explanation on YouTube.

  • @4.0.4 6 years ago +2

    Thank you for including links for learning more on the description.

  • @MShahbazKharal 4 years ago

    It is a long video, no doubt, but once you finish watching it you realize it was much better than actually reading the paper. Thanks man!

  • @berin4427 4 years ago

    Fantastic review of policy gradients, and PPO as well! Best place for a refresh

  • @scienceofart9121 4 years ago +2

    I watched this video more than 5 times and this was the best video about PPO. Thank you for making great videos like this and keep up the good work. P.S.: Your explanation was even simpler than that of the algorithm's creator, Schulman.

  • @curumo_curunir 2 years ago

    Thank you for the video, it is very helpful. The key concepts have been explained in just 20 min, bravo. I would like to see more videos from your channel. Thank you.

  • @ColinSkow 6 years ago +3

    Great breakdown of PPO. You've simplified a lot of complex concepts to make them understandable! Hahaha... and you can't beat an octopus slap!!!

  • @xiguo2783 4 years ago

    Great explanation with enough details! Thumbs up for all the free knowledge on the internet!

  • @fktudiablo9579 4 years ago

    One of the best overviews of PPO, clean.

  • @Խչո 1 year ago

    Wonderful, this is the first video I've seen on this channel. I suspect it won't be the last!

  • @zeyudeng3223 5 years ago +1

    I watched all your videos today, great work! Love them!

  • @cherguioussama1611 3 years ago

    best explanation of PPO I've found. Thanks

  • @Rnjeazy 6 years ago +2

    Dude, your channel is awesome! So glad I found it!

  • @Fireblazer41 5 years ago +1

    Thank you so much for this video! This is way more insightful and intuitive than simply reading the papers!

  • @Samuel-wl4fw 3 years ago

    Coming back to this after thoroughly understanding Q-learning and looking into the advantage function in another network: this explanation is FAST. I wonder who would understand everything that is happening without background knowledge.

    • @bigdreams5554 2 years ago

      Well, for AI/ML some background info is needed. If you're taking multivariable calculus, it's assumed you already know calculus. For those who already work in machine learning, this video is amazing. If I didn't get something I could research what he's talking about, because he's using the proper technical terms, not dumbing it down. It's a wake-up call for what I need to know to be knowledgeable. Great video.

  • @EddieSmolansky 6 years ago +2

    This video was very well done, I definitely got a lot of value out of it. Thank you for your work!

  • @m33pr0r 4 years ago +2

    Thank you for such a clear explanation. I was able to watch this at 2x speed and follow everything, which is a testament to your clarity. It really helped that you tied PPO back to previous work (e.g. TRPO).

  • @jeremydesmond638 5 years ago +2

    Third video in a row. Really enjoy your work. Keep it up! And thank you!!!

  • @labreynth 4 months ago

    This topic is so far from my comprehension, and yet you got me to understand it within 3 minutes

  • @jeffreylim5920 5 years ago +1

    16:46 How do you make use of the GPU in PPO? For me, it was hard to implement experience collection with CUDA PyTorch; actually, it seems even OpenAI didn't use the GPU in the collection process.

    • @DVDmatt 5 years ago +1

      You need to vectorize the environments. PPO2 now does this (see twitter.com/openai/status/931226402048811008?lang=en). In my experience GPU usage is still very low during collection, however.
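
      A rough sketch of the "vectorize the environments" idea using the gymnasium API (an assumption on my part; the reply itself refers to OpenAI baselines PPO2, and the policy is replaced by random actions here):

          import gymnasium as gym

          num_envs = 8  # illustrative number of parallel environments
          envs = gym.vector.SyncVectorEnv(
              [lambda: gym.make("CartPole-v1") for _ in range(num_envs)]
          )

          obs, info = envs.reset(seed=0)            # obs is batched: shape (num_envs, obs_dim)
          for _ in range(128):                      # collect a fixed-length rollout
              actions = envs.action_space.sample()  # placeholder for a batched policy forward pass
              obs, rewards, terminated, truncated, info = envs.step(actions)
          envs.close()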

  • @Navhkrin 5 years ago

    Much cleaner than deep learning boot camp explanation

  • @DavidCH12345 5 years ago +1

    I love how you take the formula apart and look at it step by step. Great work!

  • @junjieli9253 5 years ago

    Thank you for helping me understand PPO faster; good explanation with useful resources included.

  • @sarvagyagupta1744 5 years ago

    I don't know if I'll get answers here, but I have some questions:
    1) Why are we taking the "min" in the loss function?
    2) Are we using 1 in 1-e and 1+e because the reward we give for each positive action is 1? My question here is about the scaling factor.

  • @anonymous_user-s3s 1 month ago

    Fantastic intuitive explanation, thank you.

  • @Bardent 3 years ago

    This video is absolutely amazing!!

  • @ravichunduru834 5 years ago +3

    Great video, but I have a couple of doubts:
    1. In PPO, how does changing the objective help in restricting the updates to the policy? Wouldn’t it make more sense to restrict the gradient so that we don’t update the policy too much in one go?
    2. In PPO, when A

  • @umuti5ik 4 years ago

    Excellent algorithm and explanation!

  • @antoinemathu7983 5 years ago +5

    I watch Siraj Raval for the motivation, but I watch Arxiv Insights for the explanations

  • @Corpsecreate 6 years ago

    I have some questions! Taking a quick step back to the policy gradient loss for a sec, we had:
    Loss = E ( [log prob] * advantage )
    If my understanding is correct, then we actually have two neural networks here. One that calculates the probabilities of each action (this is the policy network we are trying to optimise), and one entirely different neural network that tries to guess the value of being in the current state. Q1 - does the value network simply learn off mean-squared error by minimising ([actual discounted reward] - [value net prediction])^2? Is there no way to use policy gradient methods without running 2 networks?
    Q2 - How do we actually calculate the discounted reward for a neural network where only the probabilities of each action are output? For example, if at time step 0, our NN produces:
    Act 1 : 20%
    Act 2 : 30%
    Act 4 : 50%
    I can only take one of these actions to end up in a new state. Do we take the highest one? Or do we, for each trajectory, randomly pick one based on their probability of being chosen? Do we do this for every time step t = 1 to T?
    After the trajectory of T timesteps, we get one 'actual' value for G that is attributed to the timestep t = 0. Does this mean we can only perform gradient descent on this single observation? If we do a minibatch, do we need multiple trajectories, say 50, each of length T, and then do gradient descent on the 50, where only the G value for t = 0 has been calculated for each of them?
    My apologies for the questions, hopefully they make sense and I'm just looking to confirm my understanding :)

    • @ArxivInsights 6 years ago

      Hi, really good questions!
      Q1: you can train a policy gradient method without using a value function by just training the policy network, but using a value function to estimate the expected return from the current state tends to make things much more stable.
      Q2: You're correct that this might seem a bit weird, but indeed you have to probabilistically sample an action at each timestep and then play out the episode along that specific path in state space. However, on average & over time you can see that each action will in fact get selected according to its probability. So stochastically every action gets played!
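
      A small PyTorch sketch of the two-network setup and the probabilistic action sampling described in this reply (layer sizes, the dummy observation and the return value are placeholder assumptions):

          import torch
          import torch.nn as nn

          obs_dim, n_actions = 4, 3
          policy_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
          value_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))

          state = torch.randn(1, obs_dim)                   # placeholder observation
          dist = torch.distributions.Categorical(logits=policy_net(state))
          action = dist.sample()                            # sample stochastically, don't take the argmax
          log_prob = dist.log_prob(action)

          discounted_return = torch.tensor([1.7])           # G from the played-out episode (dummy value)
          value = value_net(state).squeeze(-1)
          advantage = (discounted_return - value).detach()  # baseline-subtracted return

          policy_loss = -(log_prob * advantage).mean()            # maximise log-prob * advantage
          value_loss = (discounted_return - value).pow(2).mean()  # critic trained with MSE (Q1)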

  • @CommanderCraft98 3 years ago

    At 10:27 he says that the first part in the min() expression is the "default policy gradient objective", but I do not really see how, since the objective function is usually J = E[R_t]. Does someone understand this?

  • @yonistoller1 1 year ago

    Thanks for sharing this! I may be misunderstanding something, but it seems like there might be a mistake in the description. Specifically, the claim in 12:50 that "this is the only region where the unclipped part... has a lower value than the clipped version".
    I think this claim might be wrong, because there could be another case where the unclipped version would be selected:
    For example, if the ratio is e.g. 0.5 (and we assume epsilon is 0.2), that would mean the ratio is smaller than the clipped version (which would be 0.8), and it would be selected.
    Is that not the case?

  • @jcdmb 4 years ago

    Amazing explanation. Keep up the good work.

  • @arianvc8239 6 years ago

    This is really great! Keep up the good work!

  • @petersilie9702 4 years ago +1

    Thank you so much. I'm watching this video for the 10th time :-D

  • @MyU2beCall 4 years ago

    Great video. Excellent intro to this topic.

  • @francesco.messina88 5 years ago

    Congrats! You have a special skill to explain AI.

  • @victor-iyi 5 years ago +1

    Hi Andrew,
    Can you please make a video explaining the Transformer model, Google's BERT, and OpenAI's GPT & GPT-2 models?
    I can't seem to wrap my head around them.

  • @pawanbhandarkar4199 5 years ago

    Hats off, mate. This is fantastic.

  • @joshuajohnson4339 6 years ago +6

    Just thought I would let you know that I just shared your video with my cohort in the Udacity Reinforcement Learning Nanodegree. We are going through PPO now and this video is relevant and timely - especially with regard to the clip-region explanations. Any ideas on how to convert the outputs from a discrete to a continuous action space?

    • @ArxivInsights 6 years ago +1

      Sounds great, thx for sharing! Well, as I mentioned, the PPO policy head outputs the parameters of a Gaussian distribution (so means and variances) for each action. At runtime, you can then sample from these distributions to get continuous output values and use the reparametrization trick to backpropagate gradients through this non-differentiable block --> check out my video on Variational Autoencoders for all the details on this!
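
      A short sketch of the continuous-action head described here (sizes are assumptions, and the state-independent log-std is just one common choice); rsample() is the reparametrized sampling mentioned in the reply:

          import torch

          action_dim = 6
          mean_head = torch.nn.Linear(64, action_dim)              # maps policy features to action means
          log_std = torch.nn.Parameter(torch.zeros(action_dim))    # learned, state-independent log-stds

          features = torch.randn(1, 64)                            # placeholder policy features
          dist = torch.distributions.Normal(mean_head(features), log_std.exp())
          action = dist.rsample()                                  # reparametrized, differentiable sample
          log_prob = dist.log_prob(action).sum(-1)                 # log-prob used in the PPO ratio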

  • @akramsystems 6 years ago +2

    Love your videos!!

  • @meddh1065 2 years ago

    There is something I didn't understand: if you clip r, then how will you do backpropagation? The gradient will be just zero in the case r > 1+epsilon or r

  • @sainijagjit 5 months ago

    Thank you for the clean explanation

  • @rutvikreddy772 4 years ago

    Great video! I had a question though: at 6:50, the objective function, which you called the loss, is actually the function that we'd want to maximize, right? I mean, calling it a loss gave me the idea that we should minimize it. Correct me if I am wrong, please.

    • @gregh6586 4 years ago +1

      Yes, we are trying to maximise the advantage. It is called "loss" simply because it plays the same role as (true) loss functions in other domains. It might get tricky when you implement a multi-head neural network with actor-critic methods, where you combine different loss terms (GAE for the actor, lambda returns for the critic, entropy for exploration), as you have to keep track of which "loss" you aim to maximise and which to minimise.
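
      An illustrative PyTorch snippet of the sign handling this reply describes (the scalar values and coefficients are assumptions): terms we want to maximise enter the minimised loss with a negative sign.

          import torch

          surrogate_objective = torch.tensor(0.8, requires_grad=True)  # actor term (to maximise)
          value_loss = torch.tensor(0.3, requires_grad=True)           # critic MSE (to minimise)
          entropy = torch.tensor(1.2, requires_grad=True)              # exploration bonus (to maximise)

          c1, c2 = 0.5, 0.01                       # assumed weighting coefficients
          loss = -surrogate_objective + c1 * value_loss - c2 * entropy
          loss.backward()                          # descending on `loss` ascends the surrogate and entropy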

  • @cherrysun7054 5 years ago

    I really love your video, professional and informative, thanks.

  • @connor-shorten 5 years ago +1

    Thank you! Learned a lot from this!

  • @suertem1 3 years ago

    Great explanation and references

  • @abhishekkapoor7955 6 years ago

    Keep up the good work, sir. Thanks for this awesome explanation.

  • @alizerg 1 year ago

    Thanks buddy, really appreciated!

  • @YuZhang-f1z 5 years ago

    Great video for PPO! Thanks a lot for your work!

  • @apetrenko_ai 6 years ago +2

    It was a great explanation!
    Please do a video on Soft Actor-Critic and Maximum Entropy RL! That would be amazing!

  • @siddharthmittal9355 6 years ago

    More, just more videos. So well explained.

  • @Guytron95 4 years ago

    3:54 to 4:10 or so, why does that section remind me of the method used for ray marching in image rendering?

  • @samidelhi6150 5 years ago

    How do I tackle the moving-target problem using methods from RL, where I have more than one reward, 3 possible actions to take, and of course a state which includes many factors/sources of information about the environment?
    Your help is highly appreciated

  • @luck3949 6 years ago +2

    Do we actually need machine learning for single-agent situations? It seems that if we don't have any adversarial factors, then we only need path planning, which should be doable much better by a SAT/SMT solver or variations of A*. To me it seems that RL in cases like moving a cube with a hand is the same as "let's throw a bunch of neural networks at our problem and wait until they invent an approximation of multi-dimensional A* for us". And it still doesn't provide any guarantees that it will use the best trajectories (while path planning algorithms do).

    • @ArxivInsights 6 years ago +5

      The problem is that path planning requires you to have access to a somewhat accurate forward model of the environment + requires quite a lot of computation at runtime, since you need a significant number of sampled forward trajectories in order to get decent performance. A trained policy network avoids both those constraints.
      But I do agree that current RL methods are far from optimal. The biggest problem from my point of view is that we currently have no idea how to do meaningful abstraction/generalization. What works is overfitting on a dense sampling of data from the problem space, but things like transfer learning / one-shot generalization are very big problems right now and we'll need some radically new approaches to tackle those!

    • @luck3949 6 years ago

      @@ArxivInsights [I wanted to write that it would be interesting to try using ML to find the rules and then use a solver to achieve the goal, but then I recalled a project called AIRIS that does exactly that.]

  • @arkasaha4412 6 years ago +2

    Great video as usual! Just a suggestion, maybe instead of diving directly into deep RL you can make videos (shorter if you don't have much time to devote) on simpler RL algorithms like DQN, Q-learning. That way someone who wants to know more about RL can build up the knowledge through more vanilla stuff. I admire your style and content like many others and would love to see it grow more :)

    • @ArxivInsights 6 years ago +11

      You're totally right that this is not easy to step into if you are new to RL, but I feel like there are tons and tons of good introduction resources out there for the simpler stuff.
      I'm really trying to focus on the niche of people that already have this background and want to go a bit further :p

    • @52VaultBoy 6 years ago +2

      And I think that makes this channel amazing. Practically following the state of the art is a fantastic concept. As I am just curious about AI as a hobby and do science in a different sector, I am glad that I don't need to go through tons of articles by myself; you show me the direction where I should look to stay in the picture. Thank you.

    • @M0481 6 years ago +1

      I was about to say what @Arxiv Insights is saying. There's an amazing book called Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto, which introduces all the basic concepts that you're talking about. The second edition of this book has been made publicly available as a PDF and will be available on Amazon next month (don't quote me on the release date please :P).

    • @ColinSkow 6 years ago +1

      I'm doing beginner level videos on my channel and would love your feedback... ua-cam.com/channels/rRTWfso9OS3D09-QSLA5jg.html

    • @arkasaha4412 6 years ago

      Sure, thanks for the videos :)

  • @thiyagutenysen8058 2 years ago

    log(probabilities) will be negative, right? So if we take a bad action the advantage function is negative, so Lpg = negative * negative = positive. So Lpg blows up when we take bad actions. L represents the objective and not the loss function, right?

  • @ruslanuchan8880 6 years ago

    Subscribed because the topic's so cool!

  • @conlanrios 9 months ago

    Great breakdown and links for additional resources

  • @adefirmanfauzi5500 4 years ago +7

    9:56
    "Looks surprisingly simple... right?"
    ...
    :(

  • @maraoz 4 years ago

    Thanks for this video! Really good teaching skills! :)

  • @anthonydawson9700 3 years ago

    Pretty good explanation and very understandable, thanks!

  • @ConsultingjoeOnline 4 years ago

    *Great* video. *Great* explanation!

  • @vizart2045 2 years ago

    I need to dive deeper into this.

  • @tumaaatum 3 years ago

    Can you do a video about DDPG?
    Also, the PPO I know simply uses the discounted future returns in the loss. Is the variant using the advantage instead the standard one?

  • @hadsaadat8283 2 years ago

    simply the best ever

  • @bikrammajhi3020 9 months ago

    This is gold!!

  • @jrkirby93 6 years ago +2

    I'm really confused about what the epsilon is and why it's there. Epsilon is generally used to refer to "a very small number" used to give very small bounds on things. So clip(x, 1-e, 1+e) is basically just 1, right? Why isn't the objective just min(r(θ)A, 1)?

    • @frederikdesmedt4419 6 years ago +3

      Epsilon is not that small. You can imagine epsilon as a hyperparameter specifying how much the new policy can differ from the old policy, so if you want PPO to update the policy more radically you increase epsilon, and if you want to make smaller updates you decrease epsilon (closer to 0, but not so close that you could replace clip(x, 1-e, 1+e) by 1).

    • @ArxivInsights 6 years ago +3

      In PPO, this epsilon value is something like 0.2, so you're clipping r(θ) to within the range [0.8, 1.2] and then multiplying that value with the advantage estimate.
      But as I explain in the video, the final result after the min() operator is dependent on the sign of A (pos or neg). So, for A>0 the 1-e clip doesn't matter, since whenever r(θ) becomes smaller than 0.8, its unclipped version will still get returned by the min() operator. Analogous for A

    • @DavidSaintloth 6 years ago +3

      This really encapsulates the brilliance of the method: it's a dynamic regulator, like a differential gear, keeping the policy converging on regimes that are "proximal" to ones previously shown to be optimal.

    • @williamchamberlain2263 6 years ago

      @@DavidSaintloth like bounded sampling in some GA: don't sample too far from the current best/working parameter region.

  • @jeffreylim5920 5 years ago +1

    12:56 The real power of clipping is that it automatically ignores outlier samples. Not just decreasing their influence, but totally ignoring them! This is because the gradient of outlier samples is 0.
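
    A quick autograd check of this point (ε and the advantage are illustrative values): once the ratio falls outside the clip range and the clipped term is the minimum, that sample contributes zero gradient.

        import torch

        eps, advantage = 0.2, 1.0
        ratio = torch.tensor(1.5, requires_grad=True)   # well above 1 + eps

        surrogate = torch.min(ratio * advantage,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
        surrogate.backward()
        print(ratio.grad)                               # tensor(0.) -- the outlier sample is ignored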

  • @tuliomoreira7494 4 years ago

    Amazing explanation.
    Also, I just noticed that at 9:12 the seal slaps the guy with an octopus o.O

  • @阮雨迪 3 years ago

    really good explanation!

  • @alex4good 6 years ago

    Hey @Arxiv Insights, I do have a basic question regarding Reinforcement Learning and would really appreciate your help. What is the basic difference between Reinforcement Learning, Deep Learning and Deep Reinforcement Learning? Does basic Reinforcement Learning take advantage of neural networks to find the best solution and therefore use Deep Learning? Thank you very much in advance; I'm trying to get an overview and understand all the differences for my Master's thesis at the moment...

  • @Arctus1491 5 years ago

    At 2:54 you talked about online vs offline learning, but the screen shows some comparisons between off-policy and on-policy learning. Otherwise cool video!

  • @rayanelhelou2009 4 years ago

    A comment about the PPO paper, not this video: there's a minor typo in Eq. (10, 11).
    The terms in the exponent should read T-t-1 rather than T-t+1.
    Would you agree?

  • @absimaldata 3 years ago

    Why do we take the log of the policy in the loss?

  • @RishiPratap-om6kg 1 year ago

    Can I use this algorithm for "computation offloading in edge computing"?

  • @chid3835 3 years ago

    Very nice videos. FYI: Please watch at 0.75 speed for better understanding, LOL!

  • @kenfuliang 4 years ago

    Thank you so much. Very helpful

  • @SG-tz7jj 1 year ago

    Great explanation.

  • @fabiocescon3772 5 years ago

    Thank you, it's really a good explanation

  • @benjaminf.3760 5 years ago

    Very well explained, thank you

  • @MdelaRE1 5 years ago +1

    Amazing work :D

  • @Enerdzizer 4 months ago

    2:59 Are you sure you wrote the difference between on-policy and off-policy correctly? On-policy means the agent chooses the next step using the same policy it is currently updating. Off-policy means that we choose steps according to one policy, the exploratory (behaviour) policy, and learn an entirely different policy, the target policy. I think that's the difference; all the other differences mentioned are not valid.

  • @learningdaily4533 3 years ago +2

    Hi, I'm really interested in your videos and I found that they are really helpful for people who are learning RL in particular and AI in general. Your way of presenting the notions and key ideas behind these algorithms is amazing. It's too sad to find that you haven't updated your videos for 2 years. Do you have any other channel or anything else I can learn from you? Please let me know :( It would be my pleasure

    • @ademord 2 years ago

      I came here to say this

  • @anshulpagariya6881 3 years ago

    A big thanks for the video :)

  • @vigneshamudha821 6 years ago

    Hi, can you talk about reinforcement learning in AdaNet, which automatically builds a good structure for the neural network?

  • @idabagusdiaz 4 years ago

    YOU ARE AWESOME!