An introduction to Policy Gradient methods - Deep Reinforcement Learning

  • Published 21 May 2024
  • In this episode I introduce Policy Gradient methods for Deep Reinforcement Learning.
    After a general overview, I dive into Proximal Policy Optimization: an algorithm designed at OpenAI that tries to find a balance between sample efficiency and code complexity. PPO is the algorithm used to train the OpenAI Five system and is also used in a wide range of other challenges like Atari and robotic control tasks.
    If you want to support this channel, here is my patreon link:
    / arxivinsights --- You are amazing!! ;)
    If you have questions you would like to discuss with me personally, you can book a 1-on-1 video call through Pensight: pensight.com/x/xander-steenbr...
    Links mentioned in the video:
    ⦁ PPO paper: arxiv.org/abs/1707.06347
    ⦁ TRPO paper: arxiv.org/abs/1502.05477
    ⦁ OpenAI PPO blogpost: blog.openai.com/openai-baseli...
    ⦁ Aurelien Geron: KL divergence and entropy in ML: • A Short Introduction t...
    ⦁ Deep RL Bootcamp - Lecture 5: • Deep RL Bootcamp Lect...
    ⦁ RL-adventure PyTorch implementation: github.com/higgsfield/RL-Adve...
    ⦁ OpenAI Baselines TensorFlow implementation: github.com/openai/baselines
  • Science & Technology

COMMENTS • 190

  • @paulstevenconyngham7880
    @paulstevenconyngham7880 5 years ago +200

    This is the best explanation of PPO on the net hands down

  • @maloxi1472
    @maloxi1472 4 years ago +7

    The value you provide in these videos is insane !
    Thank you very much for guiding our learning process ;)

  • @akshatagrawal819
    @akshatagrawal819 5 years ago +319

    He is actually much better than Siraj Raval.

    • @oracletrading3000
      @oracletrading3000 4 years ago +4

      @@ahmadayazamin3313 what kind of scandal?

    • @oracletrading3000
      @oracletrading3000 4 years ago +1

      @@ahmadayazamin3313 I don't know about it; just watch one or two videos of him demonstrating RL for trading

    • @DeanRKern
      @DeanRKern 4 years ago +1

      He seems to know what he's talking about.

    • @joirnpettersen
      @joirnpettersen 4 years ago +10

      He (Siraj) never explained any of what he was saying, and that was why I stopped watching him. He just rushed through the context and the results, explaining nothing.

    • @revimfadli4666
      @revimfadli4666 3 years ago +5

      @@oracletrading3000 'how to predict the stock market in 5 minutes'? More like how to expose oneself as a fraud & end one's career in that time

  • @arkoraa
    @arkoraa 5 years ago +17

    I'm loving this RL series. Keep it up!

  • @Alex-gc2vo
    @Alex-gc2vo 5 years ago +5

    Easily the best explanation of PPO I've ever seen. Most papers and lectures get too tangled up in the probabilistic principles and advanced mathematical derivations and completely lose sight of what these models are doing in high-level terms.

  • @tyson96
    @tyson96 10 months ago

    Explained so well, and it was intuitive as well. I learnt more from this video than from all the articles I found on the internet. Great job.

  • @alializadeh8095
    @alializadeh8095 5 years ago +4

    Amazing! This was the best explanation of PPO I have seen so far

  • @jeremydesmond638
    @jeremydesmond638 4 years ago +2

    Third video in a row. Really enjoy your work. Keep it up! And thank you!!!

  • @EddieSmolansky
    @EddieSmolansky 5 years ago +2

    This video was very well done, I definitely got a lot of value out of it. Thank you for your work!

  • @Fireblazer41
    @Fireblazer41 4 years ago +1

    Thank you so much for this video! This is way more insightful and intuitive than simply reading the papers!

  • @zeyudeng3223
    @zeyudeng3223 5 years ago +1

    I watched all your videos today, great work! Love them!

  • @4.0.4
    @4.0.4 5 years ago +2

    Thank you for including links for learning more in the description.

  • @DavidSaintloth
    @DavidSaintloth 5 years ago +72

    I actually understood your explanation cover to cover on first view and thought the 19 minutes felt more like 5.
    Outstanding work.

    • @oguretsagressive
      @oguretsagressive 5 years ago +5

      One view, no pauses?! Not willing to be mean, but how can you be sure you've truly understood and weren't conquering Mount Stupid the whole time?

  • @Rnjeazy
    @Rnjeazy 5 years ago +2

    Dude, your channel is awesome! So glad I found it!

  • @xiguo2783
    @xiguo2783 4 years ago

    Great explanation with enough details! Thumbs up for all the free knowledge on the internet!

  • @berin4427
    @berin4427 3 years ago

    Fantastic review of policy gradients, and PPO as well! Best place for a refresh

  • @sarahjamal86
    @sarahjamal86 4 years ago +3

    As someone who is working in the RL field... you did a very good job.

  • @scienceofart9121
    @scienceofart9121 3 years ago +2

    I watched this video more than 5 times and this was the best video about PPO. Thank you for making great videos like this and keep up the good work. P.S.: Your explanation was even simpler than that of the algorithm's creator, Schulman.

  • @ColinSkow
    @ColinSkow 5 years ago +3

    Great breakdown of PPO. You've simplified a lot of complex concepts to make them understandable! Hahaha... and you can't beat an octopus slap!!!

  • @BoltronRacingTeam
    @BoltronRacingTeam 1 year ago +1

    Excellent video! Wonderful resource for anyone participating in AWS DeepRacer competitions.

  • @junjieli9253
    @junjieli9253 4 years ago

    Thank you for helping me understand PPO faster; a good explanation with useful resources included.

  • @curumo_curunir
    @curumo_curunir 1 year ago

    Thank you for the video, it is very helpful. The key concepts have been explained in just 20min, bravo. I would like to see more videos from your channel. Thank you.

  • @m33pr0r
    @m33pr0r 3 years ago +2

    Thank you for such a clear explanation. I was able to watch this at 2x speed and follow everything, which is a testament to your clarity. It really helped that you tied PPO back to previous work (e.g. TRPO).

  • @BDEvans
    @BDEvans 4 years ago

    By far the best explanation on YouTube.

  • @yuktikaura
    @yuktikaura 1 year ago

    Keep it up. Brevity is the soul of wit; it is indeed a skill to summarize the crux of a concept in such a lucid way!

  • @arianvc8239
    @arianvc8239 5 years ago

    This is really great! Keep up the good work!

  • @bigdreams5554
    @bigdreams5554 1 year ago +1

    This guy actually knows what he's talking about. Excellent video.

  • @fktudiablo9579
    @fktudiablo9579 3 years ago

    One of the best overviews of PPO, clean.

  • @cherguioussama1611
    @cherguioussama1611 3 years ago

    best explanation of PPO I've found. Thanks

  • @apetrenko_ai
    @apetrenko_ai 5 years ago +3

    It was a great explanation!
    Please do a video on Soft Actor-Critic and Maximum Entropy RL! That would be amazing!

  • @Bardent
    @Bardent 2 years ago

    This video is absolutely amazing!!

  • @user-fu3nb7pd3y
    @user-fu3nb7pd3y 10 months ago

    Wonderful, this is the first video I've seen on this channel. I suspect it won't be the last!

  • @DavidCH12345
    @DavidCH12345 4 years ago +1

    I love how you take the formula apart and look at it step by step. Great work!

  • @abhishekkapoor7955
    @abhishekkapoor7955 5 years ago

    Keep up the good work, sir. Thanks for this awesome explanation.

  • @jcdmb
    @jcdmb 4 years ago

    Amazing explanation. Keep up the good work.

  • @umuti5ik
    @umuti5ik 3 years ago

    Excellent algorithm and explanation!

  • @MShahbazKharal
    @MShahbazKharal 4 years ago

    It is a long video, no doubt, but once you finish watching it you realize it was much better than actually reading the paper. Thanks man!

  • @cherrysun7054
    @cherrysun7054 4 years ago

    I really love your video, professional and informative, thanks.

  • @user-dy3lm9bh6x
    @user-dy3lm9bh6x 4 years ago

    Great video on PPO! Thanks a lot for your work!

  • @connorshorten6311
    @connorshorten6311 4 years ago +1

    Thank you! Learned a lot from this!

  • @akramsystems
    @akramsystems 5 years ago +2

    Love your videos!!

  • @MyU2beCall
    @MyU2beCall 3 years ago

    Great video. Excellent intro to this topic.

  • @Navhkrin
    @Navhkrin 5 years ago +2

    12:19
    The min operator also helps undo a bad policy update:
    IF the advantage is positive but the probability of taking that action decreased, the min operator selects the unclipped objective here to undo the bad update.
    IF the advantage is negative but the probability of taking that action increased, the min operator also selects the unclipped objective to undo the bad update, just as mentioned in the video.
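
    A minimal PyTorch sketch of the clipped surrogate being discussed (function and argument names are illustrative; the formula follows the PPO paper):

      import torch

      def clipped_surrogate(new_logp, old_logp, adv, clip_eps=0.2):
          ratio = torch.exp(new_logp - old_logp)          # r_t(theta) = pi_theta / pi_theta_old
          unclipped = ratio * adv
          clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
          # Taking the element-wise min makes the objective a pessimistic bound:
          # whenever the ratio moved against the sign of the advantage, the
          # unclipped term is the smaller one and its gradient can undo the update.
          return torch.min(unclipped, clipped).mean()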

  • @siddharthmittal9355
    @siddharthmittal9355 5 years ago

    More, just more videos. So well explained.

  • @maraoz
    @maraoz 3 years ago

    Thanks for this video! Really good teaching skills! :)

  • @anthonydawson9700
    @anthonydawson9700 3 years ago

    Pretty good explanation and very understandable, thanks!

  • @Samuel-wl4fw
    @Samuel-wl4fw 2 years ago

    Coming back to this after thoroughly understanding Q-learning and looking into the advantage function in another network, this explanation is FAST. I wonder who would understand all that is happening without background knowledge.

    • @bigdreams5554
      @bigdreams5554 1 year ago

      Well, for AI/ML some background info is needed. If you're taking multivariable calculus, it's assumed you know calculus already. For those who already work in machine learning, this video is amazing. If I didn't get something, I can research what he's talking about, because he's using the proper technical terms, not dumbing it down. It's a wake-up call for what I need to know to be knowledgeable. Great video.

  • @ConsultingjoeOnline
    @ConsultingjoeOnline 3 years ago

    *Great* video. *Great* explanation!

  • @benjaminf.3760
    @benjaminf.3760 4 years ago

    Very well explained, thank you

  • @alizerg
    @alizerg 1 year ago

    Thanks buddy, really appreciated!

  • @ruslanuchan8880
    @ruslanuchan8880 5 years ago

    Subscribed because the topic is so cool!

  • @suertem1
    @suertem1 3 years ago

    Great explanation and references

  • @pawanbhandarkar4199
    @pawanbhandarkar4199 5 years ago

    Hats off, mate. This is fantastic.

  • @fabiocescon3772
    @fabiocescon3772 4 years ago

    Thank you, it's really a good explanation

  • @MdelaRE1
    @MdelaRE1 4 years ago +1

    Amazing work :D

  • @francesco.messina88
    @francesco.messina88 4 years ago

    Congrats! You have a special skill to explain AI.

  • @julinamaharjan6987
    @julinamaharjan6987 1 year ago

    Great explanation!

  • @Navhkrin
    @Navhkrin 5 years ago

    Much cleaner than the Deep RL Bootcamp explanation

  • @kenfuliang
    @kenfuliang 4 years ago

    Thank you so much. Very helpful

  • @maximeg3659
    @maximeg3659 1 year ago

    awesome explanation !

  • @SG-tz7jj
    @SG-tz7jj 10 months ago

    Great explanation.

  • @petersilie9702
    @petersilie9702 3 years ago +1

    Thank you so much. I've watched this video for the 10th time :-D

  • @user-rn8rd6dk2c
    @user-rn8rd6dk2c 5 years ago

    Great video! Thanx a lot

  • @anshulpagariya6881
    @anshulpagariya6881 3 years ago

    A big thanks for the video :)

  • @yinghaohu8784
    @yinghaohu8784 4 days ago

    very good explanations

  • @vizart2045
    @vizart2045 2 years ago

    I need to dive deeper into this.

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 5 years ago

    Great video!

  • @conlanrios
    @conlanrios 1 month ago

    Great breakdown and links for additional resources

  • @bikrammajhi3020
    @bikrammajhi3020 2 months ago

    This is gold!!

  • @ravichunduru834
    @ravichunduru834 5 years ago +3

    Great video, but I have a couple of doubts:
    1. In PPO, how does changing the objective help in restricting the updates to the policy? Wouldn’t it make more sense to restrict the gradient so that we don’t update the policy too much in one go?
    2. In PPO, when A

  • @idabagusdiazagasatya9900
    @idabagusdiazagasatya9900 4 years ago

    YOU ARE AWESOME!

  • @hadsaadat8283
    @hadsaadat8283 1 year ago

    simply the best ever

  • @Sherlockarim
    @Sherlockarim 5 years ago +1

    great content keep up man

  • @letranthu5165
    @letranthu5165 3 years ago

    oh thank you so muchhhhhh

  • @tuliomoreira7494
    @tuliomoreira7494 3 years ago

    Amazing explanation.
    Also, I just noticed that at 9:12 the seal slaps the guy with an octopus o.O

  • @nobutaka5548
    @nobutaka5548 5 years ago

    Hey! Your videos are great, but I'm wondering if you could go over some actual lines of code (implementation) in a new PPO video, since this seems to be the algorithm that OpenAI uses as their default. Thanks for the link to the PyTorch implementation by RL-Adventure though! That is the clearest explanation of how to output mu and std from the actor network! I have been searching for 3 days for something like that because I just cannot implement it from scratch based on my meager abilities.
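
    For reference, a minimal sketch of such a Gaussian actor head in PyTorch (class name, layer sizes and the state-independent log-std are illustrative choices, not the video's or RL-Adventure's exact code):

      import torch
      import torch.nn as nn

      class GaussianActor(nn.Module):
          """Actor for continuous actions: outputs a Normal distribution per action dim."""
          def __init__(self, obs_dim, act_dim, hidden=64):
              super().__init__()
              self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                        nn.Linear(hidden, hidden), nn.Tanh())
              self.mu_head = nn.Linear(hidden, act_dim)          # per-dimension mean
              self.log_std = nn.Parameter(torch.zeros(act_dim))  # learned, state-independent

          def forward(self, obs):
              h = self.body(obs)
              mu = self.mu_head(h)
              std = self.log_std.exp().expand_as(mu)
              return torch.distributions.Normal(mu, std)

      # Usage: dist = actor(obs); a = dist.sample(); logp = dist.log_prob(a).sum(-1)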

  • @ariel415el
    @ariel415el 5 years ago

    Man you are good !

  • @unoqualsiasi7341
    @unoqualsiasi7341 5 years ago

    Thanks!

  • @learningdaily4533
    @learningdaily4533 3 years ago +2

    Hi, I'm really interested in your videos and I find them really helpful for people who are learning RL in particular and AI in general. Your way of presenting the notions and key ideas behind these algorithms is amazing. It's sad to find that you haven't updated the channel in 2 years. Do you have another channel or anything else I can learn from you? Please let me know :( It would be my pleasure.

    • @ademord
      @ademord 2 years ago

      I came here to say this

  • @arkasaha4412
    @arkasaha4412 5 years ago +2

    Great video as usual! Just a suggestion, maybe instead of diving directly into deep RL you can make videos (shorter if you don't have much time to devote) on simpler RL algorithms like DQN, Q-learning. That way someone who wants to know more about RL can build up the knowledge through more vanilla stuff. I admire your style and content like many others and would love to see it grow more :)

    • @ArxivInsights
      @ArxivInsights  5 years ago +11

      You're totally right that this is not easy to step into if you are new to RL, but I feel like there are tons and tons of good introduction resources out there for the simpler stuff.
      I'm really trying to focus on the niche of people that already have this background and want to go a bit further :p

    • @52VaultBoy
      @52VaultBoy 5 years ago +2

      And I think that makes this channel amazing. Practically following the state of the art is a fantastic concept. As I am just curious about AI as a hobby and do science in a different sector, I am glad that I don't need to go through tons of articles by myself; you show me the direction where I should look to stay in the picture. Thank you.

    • @M0481
      @M0481 5 years ago +1

      I was about to say what @Arxiv Insights is saying. There's an amazing book called RL: An Introduction by Richard S. Sutton and Andrew G. Barto, which introduces all the basic concepts that you're talking about. The second edition of this book has been made publicly available as a PDF and will be available on Amazon next month (don't quote me on the release date please :P).

    • @ColinSkow
      @ColinSkow 5 years ago +1

      I'm doing beginner level videos on my channel and would love your feedback... ua-cam.com/channels/rRTWfso9OS3D09-QSLA5jg.html

    • @arkasaha4412
      @arkasaha4412 5 years ago

      Sure, thanks for the videos :)

  • @antoinemathu7983
    @antoinemathu7983 4 years ago +4

    I watch Siraj Raval for the motivation, but I watch Arxiv Insights for the explanations

  • @joshuajohnson4339
    @joshuajohnson4339 5 years ago +6

    Just thought I would let you know that I just shared your video with my cohort in the Udacity Reinforcement Learning Nanodegree. We are going through PPO now and this video is relevant and timely, especially with respect to the clip region explanations. Any ideas on how to convert the outputs from a discrete to a continuous action space?

    • @ArxivInsights
      @ArxivInsights  5 years ago +1

      Sounds great, thx for sharing! Well, as I mentioned, the PPO policy head outputs the parameters of a Gaussian distribution (so means and variances) for each action. At runtime, you can then sample from these distributions to get continuous output values and use the reparametrization trick to backpropagate gradients through this non-differentiable block --> check out my video on Variational Autoencoders for all the details on this!
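
      As a rough illustration of that reparametrization step (assuming a PyTorch Normal policy head as described above; the numbers are made up): rsample() draws a = mu + std * eps with eps ~ N(0, 1), so gradients flow back into the distribution parameters, whereas a plain sample() would block them.

        import torch
        from torch.distributions import Normal

        mu = torch.tensor([0.1, -0.3], requires_grad=True)
        log_std = torch.tensor([-0.5, -0.5], requires_grad=True)
        dist = Normal(mu, log_std.exp())

        action = dist.rsample()        # reparametrized, differentiable sample
        loss = (action ** 2).sum()     # stand-in for whatever objective uses the action
        loss.backward()
        print(mu.grad, log_std.grad)   # both populated thanks to rsample()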

  • @sarvagyagupta1744
    @sarvagyagupta1744 4 years ago

    I don't know if I'll get answers here, but I have some questions:
    1) Why are we taking the "min" in the loss function?
    2) Are we using 1 in 1-e and 1+e because the reward we give for each positive action is 1? My question here is about the scaling factor.

  • @anasalnuaimi
    @anasalnuaimi 2 years ago

    Can you do a video about DDPG?
    Also, the PPO I know simply uses the discounted future returns in the loss. Is the variant using the advantage instead the standard one?
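
    For what it's worth, the PPO paper weights the ratio by an advantage estimate (typically GAE); using raw discounted returns amounts to an advantage with a zero baseline. A tiny sketch of the relationship, with purely illustrative numbers:

      import torch

      def discounted_returns(rewards, gamma=0.99):
          # Monte-Carlo returns G_t = r_t + gamma * G_{t+1}
          out, g = [], 0.0
          for r in reversed(rewards):
              g = r + gamma * g
              out.append(g)
          return torch.tensor(list(reversed(out)))

      returns = discounted_returns([1.0, 0.0, 1.0])
      values = torch.tensor([0.8, 0.5, 0.9])   # critic's value estimates (made-up numbers)
      advantages = returns - values            # simplest advantage: return minus baseline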

  • @victor-iyi
    @victor-iyi 5 years ago +1

    Hi Andrew,
    Can you please make a video explaining OpenAI's transformer model, Google's BERT & OpenAI's GPT&GPT-2 model?
    I can't seem to wrap my head around them.

  • @alex4good
    @alex4good 5 years ago

    Hey @Arxiv Insights, I do have a basic question regarding Reinforcement Learning and would really appreciate your help. What is the basic difference between Reinforcement Learning, Deep Learning and Deep Reinforcement Learning? Does basic Reinforcement Learning take advantage of Neural Networks to find the best solution and therefore use Deep Learning? Thank you very much in advance; I'm trying to get an overview and understand all the differences for my Master's thesis at the moment...

  • @chid3835
    @chid3835 3 years ago

    Very nice videos. FYI: Please watch at 0.75 speed for better understanding, LOL!

  • @Guytron95
    @Guytron95 3 years ago

    3:54 to 4:10 or so, why does that section remind me of the method used for ray marching in image rendering?

  • @billzito1035
    @billzito1035 5 years ago +2

    Video is very helpful, thank you! Personally I find the background music distracting and would prefer it if it didn't exist, but I know others may feel differently.

  • @samidelhi6150
    @samidelhi6150 4 years ago

    How do I tackle the moving-target problem using methods from RL, where I have more than one reward, 3 possible actions to take and, of course, a state which includes many factors / sources of information about the environment?
    Your help is highly appreciated

  • @vigneshamudha821
    @vigneshamudha821 5 years ago

    Hi, can you talk about reinforcement learning in AdaNet, which automatically builds a good structure for the neural network?

  • @yonistoller1
    @yonistoller1 5 months ago

    Thanks for sharing this! I may be misunderstanding something, but it seems like there might be a mistake in the description. Specifically, the claim at 12:50 that "this is the only region where the unclipped part... has a lower value than the clipped version".
    I think this claim might be wrong, because there could be another case where the unclipped version would be selected:
    For example, if the ratio is e.g. 0.5 (and we assume epsilon is 0.2), that would mean the ratio is smaller than the clipped version (which would be 0.8), and it would be selected.
    Is that not the case?
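
    A quick plain-Python check of the case described above (the epsilon and advantage values are just illustrative):

      def surrogate_terms(ratio, adv, eps=0.2):
          unclipped = ratio * adv
          clipped = max(min(ratio, 1 + eps), 1 - eps) * adv
          return unclipped, clipped, min(unclipped, clipped)

      print(surrogate_terms(0.5, adv=+1.0))  # (0.5, 0.8, 0.5)    -> unclipped term selected
      print(surrogate_terms(0.5, adv=-1.0))  # (-0.5, -0.8, -0.8) -> clipped term selected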

  • @heenashaikh8422
    @heenashaikh8422 3 years ago

    Thanks

  • @rutvikreddy772
    @rutvikreddy772 4 years ago

    Great video! I had a question though: at 6:50, the objective function, which you called a loss, is actually the function that we'd want to maximize, right? Calling it a loss gave me the idea that we should minimize it. Correct me if I am wrong, please.

    • @gregh6586
      @gregh6586 4 years ago +1

      Yes, we are trying to maximise the advantage-weighted objective. It is called a "loss" simply because it plays the same role as (true) loss functions in other domains. It can get tricky when you implement a multi-head neural network with Actor-Critic methods, where you combine different loss terms (GAE for the actor, lambda returns for the critic, entropy for exploration), as you have to keep track of which "loss" you aim to maximise and which to minimise.
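
      As a rough sketch of how that sign convention is often handled in code (coefficient names and values are illustrative, loosely following the combined objective in the PPO paper): the optimizer minimizes, so the surrogate and the entropy bonus enter with a minus sign.

        import torch

        def ppo_total_loss(new_logp, old_logp, adv, values, returns, entropy,
                           clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
            ratio = torch.exp(new_logp - old_logp)
            surrogate = torch.min(ratio * adv,
                                  torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()
            value_loss = (returns - values).pow(2).mean()
            # Minimizing this total loss maximizes the clipped surrogate and the entropy bonus.
            return -surrogate + vf_coef * value_loss - ent_coef * entropy.mean()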

  • @JohnDoe-cq3ic
    @JohnDoe-cq3ic 5 years ago

    Very good, thorough walkthrough... but take a breath when you're talking... it takes a moment to catch up to what you're saying sometimes, but I guess that's the benefit of pause and play again. Nice job!

  • @RishiPratap-om6kg
    @RishiPratap-om6kg 1 year ago

    Can I use this algorithm for "computation offloading in edge computing"?

  • @meddh1065
    @meddh1065 2 years ago

    There is something I didn't understand: if you clip r, then how will you do backpropagation? The gradient will be just zero in the case r > 1+epsilon or r
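
    As an illustration of the behaviour in question (a small PyTorch check; the values are arbitrary): the clipped branch does have zero gradient outside the interval, and that is intentional; because of the min with the unclipped term, the gradient only vanishes once the ratio has already moved far enough in the direction the advantage favours.

      import torch

      eps = 0.2
      r = torch.tensor([0.7, 1.0, 1.5], requires_grad=True)   # one ratio per sample
      torch.clamp(r, 1 - eps, 1 + eps).sum().backward()
      print(r.grad)   # tensor([0., 1., 0.]) -- zero gradient where the ratio was clipped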

  • @CommanderCraft98
    @CommanderCraft98 2 years ago

    At 10:27 he says that the first part in the min() expression is the "default policy gradient objective", but I do not really see how, since the objective function is usually J = E[R_t]. Does someone understand this?
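
    One way to see it (a standard identity, stated here independently of the video's notation): the gradient of the importance-sampled surrogate, evaluated at the old parameters, equals the usual policy-gradient estimator,

      \nabla_\theta \,\mathbb{E}\Big[\tfrac{\pi_\theta(a\mid s)}{\pi_{\theta_{\mathrm{old}}}(a\mid s)}\,\hat{A}\Big]\Big|_{\theta=\theta_{\mathrm{old}}}
        = \mathbb{E}\Big[\tfrac{\nabla_\theta \pi_\theta(a\mid s)}{\pi_{\theta_{\mathrm{old}}}(a\mid s)}\,\hat{A}\Big]\Big|_{\theta=\theta_{\mathrm{old}}}
        = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a\mid s)\,\hat{A}\big],

    which is why the unclipped ratio-times-advantage term is referred to as the default policy gradient objective.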

  • @jeffreylim5920
    @jeffreylim5920 4 years ago +1

    16:46 How do you make use of the GPU in PPO? For me, it was hard to implement experience collection with CUDA PyTorch; actually, it seems even OpenAI didn't use the GPU in the collection process.

    • @DVDmatt
      @DVDmatt 4 years ago +1

      You need to vectorize the environments. PPO2 now does this (see twitter.com/openai/status/931226402048811008?lang=en). In my experience GPU usage is still very low during collection, however.
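
      In case it helps, a rough sketch of what vectorizing buys you during collection (assuming classic Gym-style envs whose reset() returns an observation and step() returns (obs, reward, done, info), and a policy like the Gaussian actor sketched earlier; episode resets and buffer storage are omitted for brevity):

        import torch

        def collect_rollout(envs, policy, steps=128, device="cpu"):
            obs = torch.stack([torch.as_tensor(e.reset(), dtype=torch.float32) for e in envs])
            for _ in range(steps):
                with torch.no_grad():
                    # One batched forward pass for all environments at once,
                    # instead of one tiny forward pass per environment.
                    dist = policy(obs.to(device))
                    actions = dist.sample().cpu()
                results = [e.step(a.numpy()) for e, a in zip(envs, actions)]
                obs = torch.stack([torch.as_tensor(o, dtype=torch.float32)
                                   for o, reward, done, info in results])
                # ... store (obs, actions, rewards, dones, log-probs) in a rollout buffer here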

  • @Arctus1491
    @Arctus1491 5 years ago

    At 2:54 you talked about online vs offline learning, but the screen shows some comparisons between off-policy and on-policy learning. Otherwise cool video!