RL Course by David Silver - Lecture 7: Policy Gradient Methods

  • Published 6 Jun 2024
  • Reinforcement Learning Course by David Silver - Lecture 7: Policy Gradient Methods (updated video thanks to: John Assael)
    Slides and more info about the course: goo.gl/vUiyjq

COMMENTS • 193

  • @akshatgarg6635
    @akshatgarg6635 3 years ago +197

    People who feel like quitting at this stage: relax, take a break, watch this video over and over again, and read Sutton and Barto. Do everything but don't quit. You are amongst the 10% who came this far.

    • @matejnovosad9152
      @matejnovosad9152 2 years ago +6

      Me in high school trying to make a Rocket League bot: :0

    • @TY-il7tf
      @TY-il7tf 2 years ago +9

      I finally came to this stage of learning after watching his videos over and over. He explains everything very well, but RL knowledge differs from other ML, and it takes time to learn and get used to.

    • @juliansuse1
      @juliansuse1 2 years ago +11

      Bro thanks for the encouragement

    • @SuperBiggestking
      @SuperBiggestking 1 year ago +6

      Brother Akshat! This was the most timely comment ever on YouTube. I was watching these lectures and I felt brain-dead since they are pretty long. This is a great encouragement. May God bless you for this!

    • @jamesnesfield7288
      @jamesnesfield7288 1 year ago +5

      My best advice is to apply each algorithm in the Sutton and Barto textbook to problems in OpenAI Gym to help understand all this...
      You can do it, you got this...

  • @xicocaio
    @xicocaio 3 years ago +164

    This is what I call commitment: David Silver explored a not-showing-his-face policy, received less reward, and then switched back to the previous lectures' optimal policy.
    Nothing like learning from this "one stream of data called life."

    • @Sanderex
      @Sanderex 3 years ago +6

      What a treasure of a comment

    • @camozot
      @camozot 3 years ago +4

      And we experienced a different state, realized we got much less reward from it, and updated our value function! Then David adjusts his policy to our value function, like actor-critic? Or is that a little stretch? Meh, I think there's truly some link between his value function and ours here; he wants us to do well, because he's a legend!

    • @stevecarson7031
      @stevecarson7031 2 years ago +1

      Genius!

    • @js116600
      @js116600 1 year ago +2

      Can't discount face value!

  • @alexanderyau6347
    @alexanderyau6347 6 years ago +303

    Oh, I can't concentrate without seeing David.

    • @terrykarekarem9180
      @terrykarekarem9180 4 years ago +7

      Exactly the same here.

    • @musicsirme3604
      @musicsirme3604 4 years ago +2

      Me too. This is much better than the 2018 one.

    • @mathavraj9378
      @mathavraj9378 3 years ago +42

      How will I understand expected return without him walking to a spot and summing up everything along the path after it?

    • @DavidKristoffersson
      @DavidKristoffersson 3 years ago +2

      Turn on subtitles. Helps a lot.

  • @JonnyHuman
    @JonnyHuman 6 years ago +215

    For those confused:
    - whenever he speaks of the u vector he's talking about the theta vector (slides don't match).
    - At 1:02:00 he's talking about slide 4.
    - At 1:16:35 he says Vhat but slides show Vv
    - He refers to Q in Natural Policy Gradient, which is actually Gtheta in the slides
    - At 1:30:30 the slide should be 41 (the last slide), not the Natural Actor-Critic slide

    • @illuminatic7
      @illuminatic7 6 years ago +6

      Also, when he starts talking about DPG at 1:26:10 it helps a lot to have a look at his original paper (proceedings.mlr.press/v32/silver14.pdf) pages 4 and 5 in particular.
      I think the DPG slides he is actually referring to are not available online.

    • @MiottoGuilherme
      @MiottoGuilherme 5 years ago +8

      @50:00 he mentions G_t, but the slides show v_t, right?

    • @gregh6586
      @gregh6586 4 years ago +3

      @@MiottoGuilherme Yes. G_t as in Sutton/Barto's book, i.e. the future discounted reward.

    • @mhtb32
      @mhtb32 4 years ago +6

      @@illuminatic7 Unfortunately that link doesn't work anymore, here is an alternative: hal.inria.fr/file/index/docid/938992/filename/dpg-icml2014.pdf

  • @finarwa3602
    @finarwa3602 1 month ago +1

    I have to listen repeatedly because I could not concentrate without seeing him. I have to imagine what he was trying to show through his gestures. This is a gold-standard lecture for RL. Thank you, Professor David Silver.

  • @saltcheese
    @saltcheese 7 years ago +280

    Ahhh... where did you go, David? I loved your moderated gesturing.

    • @lucasli9225
      @lucasli9225 7 years ago +11

      I have the same question too. His gestures are really helpful in learning the course!

    • @Chr0nalis
      @Chr0nalis 7 years ago +3

      He probably forgot to turn on the cam capture :)

    • @NoName-vw7wf
      @NoName-vw7wf 7 years ago +22

      It's hard to focus without seeing David :( Sad.

    • @DoingEasy
      @DoingEasy 6 years ago

      Lecture 7 is optional

    • @danielc4267
      @danielc4267 6 years ago +19

      I think Lecture 10 is optional. Lecture 7 seems rather important.

  • @michaellin9407
    @michaellin9407 4 years ago +113

    This course should be called: "But wait, there's an even better algorithm!"

    • @mathavraj9378
      @mathavraj9378 3 years ago +4

      Lol, all of machine learning is like that.

    • @akshatgarg6635
      @akshatgarg6635 3 years ago +8

      That, my friend, is the core principle of any field of engineering. That's how computers went from a room-sized contraption to a handheld device: because somebody said, wait, there is an even better way of doing this.

  • @krishnanjanareddy2067
    @krishnanjanareddy2067 2 years ago +30

    And it turns out that this is still the best course to learn RL from, even after 6 years.

    • @snared_
      @snared_ 6 months ago

      Really? What were you able to do with this information?

  • @NganVu
    @NganVu 4 years ago +38

    3:24 Introduction
    26:39 Finite Difference Policy Gradient
    33:38 Monte-Carlo Policy Gradient
    52:55 Actor-Critic Policy Gradient

  • @yasseraziz1287
    @yasseraziz1287 3 years ago +25

    1:30 Outline
    3:25 Policy-Based Reinforcement Learning
    7:40 Value-Based and Policy-Based RL
    10:15 Advantages of Policy Based RL
    14:10 Example: Rock-Paper-Scissors
    16:00 Example: Aliased Gridworld
    20:45 Policy Objective Function
    23:55 Policy Optimization
    26:40 Policy Gradient
    28:30 Computing Gradients by Finite Differences
    30:30 Training AIBO to Walk by Finite Difference Policy Gradient
    33:40 Score Function
    36:45 Softmax Policy
    39:28 Gaussian Policy
    41:30 One-Step MDPs
    46:35 Policy Gradient Theorem
    48:30 Monte-Carlo Policy Gradient (REINFORCE)
    51:05 Puck World Example
    53:00 Reducing Variance Using a Critic
    56:00 Estimating the Action-Value Function
    57:10 Action-Value Actor-Critic
    1:05:04 Bias in Actor-Critic Algorithms
    1:05:30 Compatible Function Approximation
    1:06:00 Proof of Compatible Function Approximation Theorem
    1:06:33 Reducing Variance using a Baseline
    1:12:05 Estimating the Advantage Function
    1:17:00 Critics at Different Time-Scales
    1:18:30 Actors at Different Time-Scales
    1:21:38 Policy Gradient with Eligibility Traces
    1:23:50 Alternative Policy Gradient Directions
    1:26:08 Natural Policy Gradient
    1:30:05 Natural Actor-Critic

  • @Wuu4D
    @Wuu4D 7 years ago +139

    Damn. It was a lot easier understanding it with gestures.

    • @fktudiablo9579
      @fktudiablo9579 4 years ago +4

      He could describe his gestures in the subtitles.

  • @T4l0nITA
    @T4l0nITA 2 years ago

    By far the best video about policy gradient methods on YouTube.

  • @omeryilmaz6653
    @omeryilmaz6653 5 years ago

    You are fantastic David. Thanks for the tutorial.

  • @JakobFoerster
    @JakobFoerster 8 years ago +2

    Thank you for creating the video John, this is really great!

  • @florentinrieger5306
    @florentinrieger5306 9 months ago +4

    It is unfortunate that exactly this episode is without David on the screen. It is again quite a complex topic, and David jumping and running around and pointing out the relevant parts makes it much easier to digest.

  • @AliRafieiAliRafiei
    @AliRafieiAliRafiei 8 years ago

    Thank you many times, dear Karolina. Cheers.

  • @georgegvishiani736
    @georgegvishiani736 5 years ago +31

    It would have been great if it had been possible to recreate David in this lecture based on his voice, using some combination of RL frameworks.

  • @chrisanderson1513
    @chrisanderson1513 7 years ago +40

    Starts at 1:25.
    Actor critic at 52:55.

  • @akarshrastogi5145
    @akarshrastogi5145 4 years ago +15

    This lecture was immensely difficult to follow owing to David's absence and the mismatch of slides.

  • @liamroche1473
    @liamroche1473 6 years ago +15

    I am not sure exactly how this video was created, but the right slide is often not displayed (especially near the end, but elsewhere as well). It is probably better to download the slides for the lecture and find your own way through them while listening to the audio.

  • @charles101993
    @charles101993 5 years ago +1

    What is the log policy exactly? Is it just the log of the policy's output for some state-action pair?

  • @ck5300045
    @ck5300045 8 years ago +1

    This really helps. Thanks

  • @japneetsingh5015
    @japneetsingh5015 4 years ago

    Will these policy gradient methods work better than the previous methods based on generalized policy iteration (MC, TD, and SARSA)?

  • @MrCmon113
    @MrCmon113 4 years ago +7

    Unfortunately the slides do not match what is said. It's a pity they don't seem to put much effort into these videos. David is surely one of the best people to learn RL from.

  • @sengonzi2010
    @sengonzi2010 2 years ago

    Fantastic lectures

  • @rylanschaeffer3248
    @rylanschaeffer3248 7 years ago

    At 1:03:49, shouldn't the action be sampled from \pi_\theta(s', a)?

  • @alenmanoocherian631
    @alenmanoocherian631 8 years ago +1

    Hello Karolina,
    Is there any real video for this class?

  • @kyanas1750
    @kyanas1750 6 years ago +1

    Why is there not a single implementation in MATLAB?

  • @jurgenstrydom
    @jurgenstrydom 7 years ago +67

    I wanted to see the AIBO training :(

    • @felixt1250
      @felixt1250 6 years ago +21

      Me too. If you look at the paper by Nate Kohl and Peter Stone where they describe it, they reference a web page for the videos, and surprisingly it is still online. You can find it at www.cs.utexas.edu/users/AustinVilla/?p=research/learned_walk

    • @tchz
      @tchz 3 years ago

      @@felixt1250 not anymore :'(

    • @gunsodo
      @gunsodo 3 years ago +2

      @@tchz I think it is still there but you have to copy and paste the link.

  • @zhaoxuanzhu4364
    @zhaoxuanzhu4364 5 years ago

    I am guessing the slides shown in the video are slightly different from the ones they used in the lecture.

  • @SSS.8320
    @SSS.8320 5 years ago +2

    We miss you David

  • @BramGrooten
    @BramGrooten 3 years ago +2

    Is there perhaps a link to the videos of AIBOs running? (supposed to be shown at 31:55)

    • @BramGrooten
      @BramGrooten 3 years ago +1

      @@oDaRkDeMoNo Thank you!

  • @serendipity0306
    @serendipity0306 3 years ago

    Wish to see David in person.

  • @jorgelarangeira7013
    @jorgelarangeira7013 6 years ago +4

    It took me a while to realize that the policy function pi(s, a) is alternately used as the probability of taking a certain action in state s and as the action proper (a notation overload that comes from the Sutton book). I think specific notation for each usage would avoid a lot of confusion.

  • @jk9165
    @jk9165 7 years ago

    Thank you for the lecture. I was wondering: are you constrained to use the same state-action feature vectors for the actor and the critic? The weights are, of course, different, but does \phi(s,a) need to be the same? (57:18)

    • @narendiranchembu5893
      @narendiranchembu5893 6 years ago

      As far as my understanding goes, the feature vectors of the actor and the critic are completely different. The feature vector of the critic is more like the state-space and action-space representation, as you have seen in the Value Function Approximation lecture. But for the actor, the feature vectors (mostly) parameterize the probabilities of taking an action in a given state.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      Of course they don't have to be the same. The state-action value features are features that approximate the state-action value function well, and the policy features are features that approximate a generally useful policy well. For example, look at what compatible function approximation imposes in order to flow along the real gradient at 1:05:52. How are you going to achieve that condition with the same features?

  • @Vasha88
    @Vasha88 4 years ago +1

    First time in my life I had to DECREASE the speed of a video and not increase it... man, he talks REALLY fast, while at the same time showing new slides filled with equations.

  • @LucGendrot
    @LucGendrot 5 years ago +5

    Is there any particular reason that, in the basic TD(0) QAC pseudocode (1:00:00), we don't update the Q weights first before doing the theta gradient update?

    • @alvinphantomhive3794
      @alvinphantomhive3794 4 years ago

      I think you can start with an arbitrary value for the weights, since the weights will also be adjusted in proportion to the TD error and get better as the number of iterations grows.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      Super good question. I am guessing the reason is computational, right? You want to reuse the computation you did for Q_w(s,a) when computing delta, instead of computing it again with new weights when doing the gradient ascent update of the policy parameters (theta). However, what you propose seems more solid, just more costly.
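
      For concreteness, a minimal Python sketch of one QAC TD(0) step in the order the slide uses (actor updated with the already-computed Q_w(s,a), critic updated afterwards). The names featurize, n_actions and the step sizes are illustrative assumptions, not the lecture's code:

      import numpy as np

      def qac_step(s, a, r, s_next, theta, w, featurize, n_actions,
                   alpha=0.01, beta=0.01, gamma=0.99):
          """One TD(0) actor-critic step: sample a', compute delta with the
          current w, update theta (actor), then update w (critic)."""
          # Linear softmax actor: pi(b|s') proportional to exp(phi(s',b) . theta)
          prefs = np.array([featurize(s_next, b) @ theta for b in range(n_actions)])
          probs = np.exp(prefs - prefs.max())
          probs /= probs.sum()
          a_next = np.random.choice(n_actions, p=probs)

          # Linear critic: Q_w(s,a) = phi(s,a) . w
          phi_sa = featurize(s, a)
          delta = r + gamma * featurize(s_next, a_next) @ w - phi_sa @ w  # TD error

          # Score function of the softmax policy at (s,a): phi(s,a) - E_pi[phi(s,.)]
          prefs_s = np.array([featurize(s, b) @ theta for b in range(n_actions)])
          probs_s = np.exp(prefs_s - prefs_s.max())
          probs_s /= probs_s.sum()
          score = phi_sa - sum(p * featurize(s, b) for b, p in enumerate(probs_s))

          theta = theta + alpha * score * (phi_sa @ w)  # actor: ascend grad log pi * Q_w
          w = w + beta * delta * phi_sa                 # critic: TD(0) update of Q_w
          return a_next, theta, w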

  • @shaz7163
    @shaz7163 6 years ago +4

    Can someone explain how he got the score function from the maximum likelihood expression at 22:39?

    • @sravanchittupalli2333
      @sravanchittupalli2333 3 years ago +4

      I am 2 years late, but this might help someone else 😅😅
      It is simple differentiation:
      the gradient of log(a) w.r.t. a is grad(a)/a.
      He just applied this backwards.
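
      For reference, the identity being applied backwards (the likelihood ratio / score function trick), restated in LaTeX:

      \nabla_\theta \pi_\theta(s,a) = \pi_\theta(s,a)\,\frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} = \pi_\theta(s,a)\,\nabla_\theta \log \pi_\theta(s,a)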

  • @shivajidutta8472
      @shivajidutta8472 7 years ago

    Does anyone know about the reading material David mentions in the previous class?

    • @hantinglu8050
      @hantinglu8050 7 years ago

      I guess it is this one: "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto.

  • @sarahjamal86
    @sarahjamal86 5 years ago +1

    Where did you go David :-(

  • @d4138
    @d4138 5 years ago +2

    36:20, could anyone please explain what kind of expectation we are computing (I only see the gradients)? And why is the expectation of the right-hand side easier to compute than that of the left-hand side?

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      You want to maximise J at 43:25, which is the expected immediate reward. Note that thanks to the computation at 36:20, the gradient of J at 43:25 becomes an expectation. The expectation is computed over the full state-action space of the MDP with policy pi_\theta. Note that without the term pi_\theta(s,a) in the sum, that quantity would not be an expectation anymore, so you COULD NOT APPROXIMATE IT BY SAMPLING.
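
      Written out for the one-step MDP (restating the slide at 43:25, with d(s) the state distribution), so it is clear where the expectation comes from:

      J(\theta) = \sum_{s} d(s) \sum_{a} \pi_\theta(s,a)\, \mathcal{R}_{s,a},
      \qquad
      \nabla_\theta J(\theta) = \sum_{s} d(s) \sum_{a} \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)\, \mathcal{R}_{s,a}
      = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, r\right]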

  • @fndTenorio
    @fndTenorio 5 years ago

    28:52 So J(theta) is the average reward your agent gets following the policy with parameters theta, and pi_theta(s, a) is the probability of taking action a under that policy?

    • @AM-kx4ue
      @AM-kx4ue 4 years ago

      J(theta) is the objective function; check the slide at 22:30.

  • @d4138
    @d4138 6 years ago +1

    Could anyone please explain the slide at 45:51? In particular, I don't understand how the big $R_{s,a}$ becomes just $r$ when we compress the gradient into the expectation E. What is the difference between the big R and the small one?

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      r is the immediate reward understood as a RANDOM VARIABLE. This is useful because we want to compute the expectation of r along the state space generated by the MDP given a fixed policy pi. This is a measure of how good our policy is. R_{s,a} is the expectation of r given that you are at state s and carry out action a, i.e. R_{s,a} = E[r | s,a].

  • @helinw
    @helinw 5 years ago +1

    Just to make sure: at 36:22, the purpose of the likelihood ratio trick is that the gradient of the objective function gets converted into an expectation again? Just as David said at 44:33, "... that's the whole point of using the likelihood ratio trick".

    • @AM-kx4ue
      @AM-kx4ue 4 years ago

      I'm not sure about it either.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      That's exactly right. Once you convert it into an expectation, you can approximate it by sampling, so that trick is very practical.

  • @thomasmezzanine5470
    @thomasmezzanine5470 7 years ago +1

    Thanks for updating the lectures. I have some trouble understanding the state-action feature vector \phi(s, a). I know the environment features mentioned in the last lecture could be some kind of observation of the environment, but how should I understand this state-action feature vector?

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      The state-action features in the last lecture and this lecture are different, since in the last lecture they were used to approximate the VALUE Q of a particular state-action pair, and in this lecture they are used to approximate a POLICY PI. State-action features filter important information about the state and action, used to approximate the action-value function or maybe the policy, depending on the context.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      Say we are in a 2D grid world. The possible actions are up, down, left and right. Every time I move up, I get +1 reward, every time I move down, left or right, I get 0 reward. Define two features as (1,0) if I choose to go up, and (0,1) otherwise. Note that now I can compute the value function EXACTLY as a linear combination of my features, since they contain all the relevant information. My optimal policy is also a linear combination of those features only.
      PS: you are asking about the linear case, but for me the most interesting case is the nonlinear one.

    • @guptamber
      @guptamber 1 year ago

      @@edmonddantes4705 Wow, that is a response to a 6-year-old question. Thanks for taking the time.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      @@guptamber Haha, I do it to practise!

  • @ErwinDSouza
    @ErwinDSouza 5 years ago +1

    At 45:36 I think the notation he is describing is different from that shown in these slides. I think his "capital R" is the small "r" for us. And the "curly R" is the "Rs,a" for us.

    • @MrCmon113
      @MrCmon113 4 years ago +2

      Also u is theta.

    • @AdamCajf
      @AdamCajf 3 years ago

      Yes, fully agree. I believe this is important, so to reiterate the small correction: the lowercase r is the random reward, the actual reward that the agent/we experience, while the curly uppercase R is the expected reward from the MDP (Markov Decision Process).

  • @hyunjaecho1415
    @hyunjaecho1415 3 years ago

    What does phi(s) mean at 1:18:05 ?

  • @pabsan-0-ltu910
    @pabsan-0-ltu910 2 years ago +3

    Wouldn't it be G instead of v_t at 50:52? Not only is he saying G, but it also makes more sense to estimate Q with G, since they both model the total expected future reward, right?

    • @Alejandra-jq4xf
      @Alejandra-jq4xf 1 year ago

      Yes, it is G. If you look at Sutton and Barto's book, they do use G instead of v there.
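
      A minimal Python sketch of computing those returns from one episode's rewards (illustrative, not from the lecture):

      def discounted_returns(rewards, gamma=0.99):
          """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... for every step t of one episode.
          rewards[t] is the reward received after the action at step t (i.e. R_{t+1})."""
          G = 0.0
          returns = [0.0] * len(rewards)
          for t in reversed(range(len(rewards))):
              G = rewards[t] + gamma * G
              returns[t] = G
          return returns

      # REINFORCE then uses returns[t] in place of Q(s_t, a_t):
      # theta += alpha * grad_log_pi(s_t, a_t) * returns[t]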

  • @Darshanhegde
    @Darshanhegde 8 years ago +2

    Thanks for updating the lectures :)
    I sort of got stuck on this lecture because the video wasn't available :P Now I have no excuse for not finishing the course!

    • @Darshanhegde
      @Darshanhegde 8 years ago +5

      I thought this was a real video! I was wrong! David keeps referring to equations on the slides, but the audio and slides are not synced, which is confusing sometimes. Still better than just audio, though!

    • @jingchuliu8635
      @jingchuliu8635 7 years ago +2

      The slides are synced with the audio most of the time, but the slides on "compatible function approximation" are not in the right order, and the slides on "deterministic policy gradient" are missing.

  • @MrHailstorm00
    @MrHailstorm00 5 years ago +1

    The slides are outdated; judging by David's speech, he apparently changed notation and added a few slides in the last 30 minutes or so.

  • @brandomiranda6703
    @brandomiranda6703 4 years ago +1

    Does he talk about REINFORCE in this talk/lecture? If yes when?

  • @BigDickEnergy777
    @BigDickEnergy777 6 years ago +3

    One idea that struck me recently is the possibility of developing a VISION-to-SOUND recognition/generation AI algorithm. Imagine you could feed this AI lots and lots of videos in order to output what on the screen is creating the different sound(s), with the AI being capable of recreating such sounds from a 3D image or other videos as a basis (of course, this will require much better image recognition AI, since noise in the image may interfere with the recognition process).
    The goal with such a task is basically:
    1) To create, or help create, a General AI*
    2) To use this new technology for needed tasks in our society, such as security, scene recreation (by using sounds to create the images, as an inverse function), better voice recognition, better audio technologies, entertainment, etc.
    *I know what I say might look like it doesn't make a lot of sense, but humans, for instance, navigate the world around them with their senses; there is something intuitive about the idea that an AI that uses the same senses as us to understand and navigate the world will be more likely to have general intelligence. So, in the end, even if we don't create a General AI just because of that, it's undeniable that such technology would bring us closer, because an AI which can identify images PLUS sounds is no doubt better than one which can only do the first (even more so if it can connect both senses, which is the goal).
    Gods... I only wish this suggestion would fall on the ears of someone with the power to act inside Google DeepMind, because I know they can make this happen...

  • @claudiocimarelli
    @claudiocimarelli 7 years ago

    Is the slide about deterministic policy gradient at the end the one with compatible function approximation (like in the middle of the presentation)? :S
    Luckily Silver's paper is available online :)
    Very good videos. This one would have been better with the video, but thanks anyway.

    • @illuminatic7
      @illuminatic7 6 years ago +1

      Not really, the slide you are referring to does not have the gradient of the Q-Function in the equations, which is the main point of what he is talking about.
      It helps a lot to have a look at the original paper (pages 4 and 5 in particular) to understand his explanation of DPG which can be found here: proceedings.mlr.press/v32/silver14.pdf

  • @ffff4303
    @ffff4303 2 years ago

    While I don't know how generalizable the solution to this specific problem in an adversarial game would be, I can't help but wonder how these Policy Gradient Methods could solve it. The problem I am considering is one where the agent is out-matched, out-classed, or an "underdog" with less range, damage, or resources than its opponent, in an adversarial game where it is known that the opponent's vulnerability increases with time or proximity.
    Think of Rocky Balboa vs Apollo Creed in Rocky 2 (where Rocky draws punches for many rounds to tire Apollo and then throws a train of left punches to secure the knockout), being pursued by a larger vessel in water or space (where the opponent has longer-range artillery or railguns but less maneuverability due to its greater size), eliminating a gunman in a foxhole with a manually thrown grenade, or besieging a castle.
    If we assume that the agent can only win these games by concentrating all the actions that actually give measurable or estimable reward in the last few sequences of actions in the small fraction of possible episodes that reach the goal, how would any of these Policy Gradient Methods be able to find a winning solution?
    Given that all actions for many steps from the initial state would require receiving consistent negative rewards (either through glancing blows with punches for many rounds, evasive actions like maneuvering the agent's ship to dodge or incur nonvital damage from the longer-range attacks, or simply losing large portions of an army to get from the field to the castle walls and ascend them), I imagine the solution would have to be some bidirectional search with some nonlinear step between minimizing negative rewards from the initial state and maximizing positive reward from the goal.
    But can any of these Policy Gradient Methods ever capture such strategies if they are model-free (what if they have to be online or in partially observable environments as well)? It seems that TD lambda with both forward and backward views might be able to, but would the critical states of transitioning between min-negative and max-positive reward be lost in a "smoothing out" over all action-sequence steps, or never found given the nonlinearity between the negative and positive rewards? What if the requisite transitions were also the most dangerous for the underdog agent (i.e. t_100 rewards: -100, +0; t_101 rewards: -1000, +5)?
    If the environment is partially observable, and there really is no real benefit in strictly following the min-negative reward, given that the only true reward that matters is surviving and eliminating the opponent, some stochasticity would be required in action selection on the forward-view to explore states that are nonoptimal locally for the min-negative reward and required for ever experiencing the global terminal reward state, but this stochasticity may not be affordable on the backward view where the concentration of limited resource use cannot be wasted.
    I guess the only assailable method is if the network captured a function in the feature vector of the opponent's vulnerability as a function of time, resources exhausted, and/or proximity, but what still remains is this concern of increased danger for the agent as it gets closer to the goal. I realize that one could bound the negative reward minimization from zero damage to "anything short of death", but normalizing that with the positive rewards at the final steps of the game or episode would be interesting to understand. In this strategy it seems odd for an algorithm at certain states to effectively be "saying" things like:
    "Yes! You just got punched in the face 27 times in a row! (+2700 reward)";
    "Congratulations! 2/3s of your ship has lost cabin pressure! (+6600 reward)";
    "You have one functional leg, one functional arm, and suffering acute exsanguination! (+10,000 reward)"
    "Infantry death rate increases 200x! (+200,000 reward)".
    Any thoughts?

    • @snared_
      @snared_ 6 months ago +1

      Did you figure it out yet? It's been a year; hopefully you've had time to sit down and make actual progress towards creating this?

  • @andreariba7792
    @andreariba7792 1 year ago

    It's a pity to see only the slides compared to the previous lectures; the change of format makes it very hard to follow.

  • @MinhVu-fo6hd
    @MinhVu-fo6hd 6 years ago

    In the line "Sample a ~ pi_theta" of the actor-critic algorithm around 58:00:
    from what I understand, pi_theta(s, a) = P[a | s, theta], but I don't clearly understand how we can pick an action a given s and theta. Do we have to calculate phi(s, a) * theta for all possible actions a at state s, and then choose an action according to their probabilities?
    If yes, how can we take an action in continuous action domains?
    If no, how can we pick an action then?

    • @chukybaby
      @chukybaby 5 years ago

      Something like this
      a = np.random.choice(action_space, 1, p=action_probability_distribution)
      See docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.choice.html

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      In continuous action domains, pi_theta(s,a) could be a Gaussian for fixed s (just an example). In discrete action spaces, for every state s, there is a probability of every action given by pi_theta(s,a). They sum to one of course.
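
      A minimal Python sketch of both cases (a linear softmax policy for discrete actions and a Gaussian policy for continuous actions, as in the lecture's examples; featurize and sigma are illustrative assumptions):

      import numpy as np

      def sample_discrete(s, theta, featurize, n_actions):
          # Softmax policy: pi(a|s) proportional to exp(phi(s,a) . theta).
          prefs = np.array([featurize(s, a) @ theta for a in range(n_actions)])
          probs = np.exp(prefs - prefs.max())
          probs /= probs.sum()
          return np.random.choice(n_actions, p=probs)

      def sample_continuous(s, theta, featurize, sigma=1.0):
          # Gaussian policy: a ~ N(mu(s), sigma^2) with mean mu(s) = phi(s) . theta.
          mu = featurize(s) @ theta
          return np.random.normal(mu, sigma)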

  • @nirmalnarasimha9181
    @nirmalnarasimha9181 1 year ago

    Made me cry after a very long time :( given the professor's absence and the slide mismatch.

  • @gregh6586
    @gregh6586 5 years ago +5

    Hado van Hasselt gives basically the same lecture here: ua-cam.com/video/bRfUxQs6xIM/v-deo.html. I still like David's lecture much more, but perhaps this other lecture can fill some of the gaps that appeared with David's disappearance.

  • @emrahe468
    @emrahe468 6 years ago +14

    This has good sound quality, but is missing the nice body language...

  • @alexanderyau6347
    @alexanderyau6347 6 years ago

    What is state aliasing in reinforcement learning?

    • @sam41619
      @sam41619 6 years ago +1

      It's like when two different states are represented with the same features, or when two different states are encoded/represented using the same encoding. Though they are different (and have different rewards), due to the aliasing they appear the same, so it gets difficult for the algorithm or approximator to differentiate between them.

  • @Sickkkkiddddd
    @Sickkkkiddddd 1 month ago

    Isn't the state value function 'useless' to an agent, considering it 'chooses' actions but can't 'choose' its state?

  • @MinhVu-fo6hd
    @MinhVu-fo6hd 6 years ago

    How about non-MDP problems? Does anyone have experience with those?

    • @Erain616
      @Erain616 6 years ago +2

      Minh Vu, a non-MDP problem can be artificially converted so that MDP methods can be used to solve it, like a quasi-Markov chain.

  • @robert780612
    @robert780612 5 years ago

    David disappeared, but CC subtitles are coming!!

  • @randalllionelkharkrang4047
    @randalllionelkharkrang4047 1 year ago

    Around 1:00:00, in the action-value actor-critic algorithm, to update w he uses \beta * \delta * feature. Why is he taking the feature here? In model-free evaluation he used the eligibility trace, but why the feature here?

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      He is using linear function approximation for Q, so the gradient of Q_w with respect to w is just the feature vector, and the TD update becomes \beta * \delta * feature. It is a choice.
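
      Spelled out, assuming the linear critic on the slide:

      Q_w(s,a) = \phi(s,a)^{\top} w, \qquad \nabla_w Q_w(s,a) = \phi(s,a), \qquad \Delta w = \beta\,\delta\,\nabla_w Q_w(s,a) = \beta\,\delta\,\phi(s,a)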

  • @SuperBiggestking
    @SuperBiggestking 1 year ago

    Following this lecture is like learning math by listening to a podcast.

  • @VladislavProkhorov-sr2mf
    @VladislavProkhorov-sr2mf 7 years ago +5

    How does he get the score function at 37:41?

    • @blairfraser8005
      @blairfraser8005 7 years ago +13

      I've seen this question a few places around the net so I answered it here: math.stackexchange.com/questions/2013050/log-of-softmax-function-derivative/2340848#2340848

    • @alexanderyau6347
      @alexanderyau6347 6 years ago

      Thank you, very elaborate answer!

    • @MinhVu-fo6hd
      @MinhVu-fo6hd 6 years ago

      So, how do you get a score function for a deep NN?

    • @AM-kx4ue
      @AM-kx4ue 4 years ago

      @@blairfraser8005 Could you do it for dummies? I don't understand why you put the terms inside logs.

    • @blairfraser8005
      @blairfraser8005 4 years ago +2

      Our goal is to get a score function by taking the gradient of softmax. It looks like a difficult problem so I need to break it down into a simpler form. The first way to break it down is to separate the numerator and denominator using the log identity: log(x/y) = log(x) - log(y). Now I can apply the gradient to the left and right side independently. I also know that anytime I see something in the form e^x there is a good chance I can simplify and get at the guts of the exponent by taking the log of it. That helps simplify the left side. Next, the right side also takes advantage of a log property - namely that the gradient of the log of f(x) can be written in the form of gradient of f(x) / f(x). This is just the chain rule from calculus. Now the gradients of both the left and right sides are easier.
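
      For reference, the end result of that derivation for the linear softmax policy, as on the slides:

      \pi_\theta(s,a) = \frac{e^{\phi(s,a)^{\top}\theta}}{\sum_{b} e^{\phi(s,b)^{\top}\theta}}
      \quad\Longrightarrow\quad
      \nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \sum_{b} \pi_\theta(s,b)\,\phi(s,b) = \phi(s,a) - \mathbb{E}_{\pi_\theta}\!\left[\phi(s,\cdot)\right]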

  • @emilfilipov169
    @emilfilipov169 5 years ago +5

    I love how there is always someone moaning or chewing food near the camera/microphone.

  • @mohammadfarzanullah5549
    @mohammadfarzanullah5549 3 years ago +1

    He teaches much better than Hado van Hasselt; it makes things much easier.

  • @nirajabcd
    @nirajabcd 3 years ago +2

    The lectures were going great until someone decided not to show David's gestures. God, I was learning so much just from his gestures.

  • @p.z.8355
    @p.z.8355 2 years ago

    So when can we determine that there is state aliasing?

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      Basically when you feel like your features are not representing the MDP very well. The solution is changing the features or improving them.

  • @karthik-ex4dm
    @karthik-ex4dm 5 years ago +1

    Came with high hopes from the last video...
    Without the video, I am unable to tell what he is pointing to.

  • @TillMiltzow
    @TillMiltzow 2 years ago

    I feel like for the last 15 minutes the slides and what he says are not in sync anymore. :(

  • @divyanshushekhar5118
    @divyanshushekhar5118 4 years ago

    1:07:28 What does Silver mean when he says: "We can reduce the variance without changing the expectation"?

    • @alvinphantomhive3794
      @alvinphantomhive3794 4 years ago

      There are several ways to reduce the variance, but reducing it by using a "critic", as at 53:02, may keep changing and updating the expectation as learning goes on.
      This slide shows a way to reduce the variance without changing the expectation. The idea is that subtracting a "baseline function B(s)" from the policy gradient does the job.
      The expectation equation on the slide shows, after a few algebra steps, that the baseline term ends up as B(s) multiplied by the gradient of the policy probabilities, which sum to 1. The gradient of a constant (1) is zero, so the whole baseline term has expectation exactly zero (the identity is written out at the end of this thread).
      That means you can use the baseline function B(s) as a trick to control the variance without changing the expectation: the baseline does not affect the expectation, since its contribution to it is zero.

    • @alvinphantomhive3794
      @alvinphantomhive3794 4 years ago

      Sorry if the explanation is not straightforward and a bit complicated lol.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      \nabla \log \pi(s,a) A(s,a) and \nabla \log \pi(s,a) Q(s,a) have the same expectation in the MDP space. However, which one has the larger variance? V[X] = E[X^2] - E[X]^2. Obviously E[X]^2 is the same for both. However, which expectation is larger, that of |\nabla \log \pi(s,a)|^2 |A(s,a)|^2 or that of |\nabla \log \pi(s,a)|^2 |Q(s,a)|^2? Note that A just centers Q, so typically its square is smaller.
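
      For reference, the baseline identity both replies above rely on, written out (consistent with the "Reducing Variance Using a Baseline" slide; d^{\pi_\theta} is the state distribution):

      \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, B(s)\right]
      = \sum_{s} d^{\pi_\theta}(s)\, B(s) \sum_{a} \nabla_\theta \pi_\theta(s,a)
      = \sum_{s} d^{\pi_\theta}(s)\, B(s)\, \nabla_\theta \sum_{a} \pi_\theta(s,a)
      = \sum_{s} d^{\pi_\theta}(s)\, B(s)\, \nabla_\theta 1 = 0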

  • @ProfessionalTycoons
    @ProfessionalTycoons 5 years ago +8

    Man, without the gestures it's not the same, the lecture is not the same...

  • @chongsun7872
    @chongsun7872 1 month ago

    A little mismatch between the voice and the slides...

  • @erener7897
    @erener7897 4 years ago

    Sometimes David's words and the slides don't correspond to each other, and I don't know what to do: listen to David or read the slides. For example at 1:29:55, when he speaks about the deterministic policy gradient theorem.

    • @binjianxin7830
      @binjianxin7830 3 years ago +1

      David has a paper about DPG, which he mentioned was published "last year" (2014), and later a DDPG one. Just check them out.

  • @fktudiablo9579
      @fktudiablo9579 4 years ago +2

    1:00:54, this man got a -1 reward and restarted a new episode.

  • @20a3c5f9
    @20a3c5f9 2 years ago

    51:58 - "you get this very nice smooth learning curve... but learning is slow because rewards are high variance"
    Any idea why the learning curve is smooth despite the high variance of returns? We use returns directly in the gradient formula, so intuitively I'd guess they'd affect the behavior of the learning curve as well.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      I mean, look at the scale, it is massive. I bet that if you zoom in, it is not going to be very smooth. Let's say we have an absorbing MDP with pretty long trajectories and we calculate the mean returns by applying MC. By the central limit theorem, the mean experimental returns converge to the real returns, but it will take many iterations due to the high variance of those returns. The smoothness you would see when zooming out (when looking at how the mean returns converge) would be due to the central limit theorem. Note that I am simply drawing a parallel. In the case of MC policy gradient, that smoothness is due to its convergence properties, which rely on the fact that the MC returns are unbiased samples of the real returns; the curve is very bumpy when you zoom in, precisely due to the variance.

  • @samlaf92
    @samlaf92 4 years ago

    I don't understand why he says that value-based methods can't learn stochastic policies. By definition, epsilon-greedy is stochastic. If we find two actions with the same value, we could simply have a stochastic policy that assigns probability 1/2 to each. And thus value-function-based methods could also solve the aliased gridworld example around 20:00.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      The convergence theorems in CONTROL require epsilon --> 0. If you read papers, you will often see assumptions of GLIE type (greedy in the limit with infinite exploration), which go towards a deterministic policy. David also mentions this (lecture 5 I think).

  • @xingyuanzhang5989
    @xingyuanzhang5989 5 years ago +5

    I need David! It's hard to understand some pronouns without seeing him.

  • @jiansenxmu
    @jiansenxmu 6 years ago

    I'm looking for the robot.. 32:49

  • @hyunghochrischoi6561
    @hyunghochrischoi6561 2 years ago +1

    First time making it this far. But is it just me, or did a lot of the notation change?

    • @hyunghochrischoi6561
      @hyunghochrischoi6561 2 years ago

      Also, he seems to be speaking in one notation while the screen is showing something else.

  • @alexanderyau6347
    @alexanderyau6347 6 years ago +1

    Hi guys, how can I get the v_t at 50:53?

    • @narendiranchembu5893
      @narendiranchembu5893 6 years ago +5

      Since we have the rewards of the entire episode, we can calculate the returns the Monte Carlo way. Here, v_t is really G_t: v_t = R_{t+1} + gamma*R_{t+2} + ... + gamma^{T-t-1}*R_T.

    • @alexanderyau6347
      @alexanderyau6347 6 years ago

      Thank you!

  • @alxyok
    @alxyok 5 years ago +2

    Rock-paper-scissors problem:
    would it not be a better strategy to try to fool the opponent into thinking we are following a policy other than random play, so that we can exploit the consequences of his decisions?

    • @BGasperov
      @BGasperov 4 years ago +1

      Whatever strategy you come up with cannot beat the uniform random strategy - that's why it is considered optimal.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      In real life it could be good, but theoretically of course not, since it is not a Nash Equilibrium. It can be exploited. Watch lecture 10.

  • @tomwon5451
    @tomwon5451 6 years ago +1

    v: critic param, u: actor param

  • @helloworld9478
    @helloworld9478 3 years ago +1

    31:40 "...until your graduate students collapse it..." LoL

    • @wumps7573
      @wumps7573 3 years ago +1

      The optimization method also known as Grad Student Descent.

  • @hardikmadhu584
    @hardikmadhu584 6 years ago +2

    Someone forgot to hit Record!!

  • @kunkumamithunbalajivenkate8893
    @kunkumamithunbalajivenkate8893 2 years ago +1

    32:38 - AIBO Training Video Links: www.cs.utexas.edu/~AustinVilla/?p=research/learned_walk

  • @Kalernor
    @Kalernor 2 years ago +1

    Why do all lecture videos on Policy Gradient Methods use the exact same set of slides lol

  • @guptamber
    @guptamber 1 year ago

    I find Prof. Silver brilliant, but the concepts in this lecture are by and large not explained concretely, just illustrated from the book. Moreover, the earlier lectures showed where on the slide the professor was pointing, and that is missing too.

  • @ProfessionalTycoons
    @ProfessionalTycoons 5 years ago

    Slides: www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf

  • @TillMiltzow
    @TillMiltzow 1 year ago

    When adding the baseline, there is an error. The gradient is zero when multiplied by the baseline because the function B(s) does not depend on theta. But then he uses B(s) = V^{\pi_\theta}(s), which does depend on theta. :( So this is at most a motivation rather than a mathematical proof.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      No error. That gradient is not hitting the baseline B, so it does not matter that B depends on theta. The gradient inside the sum is zero because the policy coefficients sum to one for fixed s.
      This is a well-known classical thing anyway. It was originally proven in Sutton's paper "Policy Gradient Methods for Reinforcement Learning with Function Approximation".

  • @lex6709
    @lex6709 1 year ago

    whoa that was fast

  • @ks3562
    @ks3562 2 years ago

    I lost it after he started talking about bias and reducing variance in actor-critic algorithms, after 1:05:03

  • @MinhVu-fo6hd
    @MinhVu-fo6hd 5 years ago

    Ohh math! Student confused lol.

  • @phuongdh
    @phuongdh 3 years ago

    Someone should superimpose their gestures on top of this video

  • @hippiedonut1
    @hippiedonut1 3 years ago

    this lecture was difficult

  • @riccardoandreetta9520
    @riccardoandreetta9520 7 years ago +1

    There should be more real examples... the Udacity course is even heavier.

  • @mohamedakrout9742
    @mohamedakrout9742 6 years ago +2

    24 dislikes? I cannot believe you can watch David and hit dislike at the end. Some people are really strange.

  • @alighahramani2347
    @alighahramani2347 1 year ago

    🍍