CS885 Lecture 7b: Actor Critic

  • Published 11 Dec 2024

COMMENTS • 14

  • @chicagogirl9862 • 4 years ago • +2

    I've seen many videos on various machine learning methods, and this is the best one. You really explain everything completely and in detail. Hope to see more videos from you! BR

  • @rawandjalal5984 • 4 years ago

    The way he explains the material makes me want to sit in front of the PC for a long time. Excellent job, Professor Poupart, and thank you again for sharing these videos with us.

    • @avimohan6594 • 4 years ago

      Yes. But only if you watch it at 1.5x. Otherwise it's snoozeville.

    • @akarshrastogi3682 • 4 years ago

      @avimohan6594 Idk, it felt like his answers to the doubts were shaky and hand-wavy.

  • @interstella5555 • 3 years ago

    10:48 I believe the \gamma^n term should not be in the update for the value function, since effectively we are using MC prediction to estimate the value function.
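
    To make the point above concrete, here is a minimal sketch of the two updates being compared, assuming the usual REINFORCE-with-baseline notation (\alpha_\theta and \alpha_w are the actor and critic learning rates); this is an illustration, not the slide verbatim:

    ```latex
    % Sketch only: the \gamma^n factor comes from the policy-gradient theorem,
    % so it belongs in the actor update; the commenter's point is that the
    % critic is fit by plain Monte Carlo regression, where no \gamma^n is needed.
    \begin{aligned}
    \delta_n &= G_n - V_w(s_n) \\
    \theta   &\leftarrow \theta + \alpha_\theta \,\gamma^n \,\delta_n \,\nabla_\theta \log \pi_\theta(a_n \mid s_n) \\
    w        &\leftarrow w + \alpha_w \,\delta_n \,\nabla_w V_w(s_n)
    \end{aligned}
    ```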

  • @wb9981 • 4 years ago

    The instruction and explanations are great! Thanks for sharing your knowledge

  • @OmerBoehm • 2 years ago

    Thanks for the clear and detailed explanations.

  • @maxmaxxx6563 • 6 years ago

    34:19 "in practice, DPG and in fact all the other algorithms for actor critiques that that I showed before would all use a replay buffer and and a target network" I probably misunderstood this, but isn't this incorrect because the use of Experience Replay requires an off-policy algorithm, but only DPG is off-policy. The standard Actor-Critic and A2C algorithm that you showed are afaik on-policy and don't include importance sampling ,like ACER does, to make it off-policy.
    "However, experience replay has several drawbacks: . . . ; and it requires off-policy learning algorithms that can update from data generated by an older policy." - Asynchronous Methods for Deep Reinforcement Learning

    • @interstella5555 • 3 years ago

      I think the A2C algorithm (at least the version shown here) is off-policy, because we are computing the advantage function as r_n + max Q(s_{n+1}, a_{n+1}) rather than taking the expectation w.r.t. the current policy (which would then make it on-policy); this is pretty similar to the difference between SARSA and Q-learning updates.
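
      To make the SARSA-vs-Q-learning analogy concrete, here is a minimal Python sketch of the two bootstrap targets; the function names, array names, and numbers are purely illustrative assumptions, not values from the lecture:

      ```python
      import numpy as np

      def td_target_q_style(r_n, gamma, q_next):
          """Off-policy-flavoured target: bootstrap with the greedy action,
          i.e. r_n + gamma * max_a Q(s_{n+1}, a), as discussed above."""
          return r_n + gamma * np.max(q_next)

      def td_target_on_policy(r_n, gamma, q_next, pi_next):
          """On-policy-flavoured target: bootstrap with the expectation of
          Q(s_{n+1}, .) under the current policy pi(. | s_{n+1})."""
          return r_n + gamma * np.dot(pi_next, q_next)

      # Hypothetical numbers: 3 actions available in state s_{n+1}.
      q_next = np.array([1.0, 2.0, 0.5])   # Q(s_{n+1}, a) for each action
      pi_next = np.array([0.2, 0.5, 0.3])  # current policy over those actions
      print(td_target_q_style(0.0, 0.99, q_next))             # 1.98   (uses the max, 2.0)
      print(td_target_on_policy(0.0, 0.99, q_next, pi_next))  # 1.3365 (uses the expectation, 1.35)
      ```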

  • @muhammedsaeed4099 • 4 years ago

    What is the difference between DPG and DDPG, other than that DDPG uses deep neural networks?

  • @ugurkaraaslan9285 • 4 years ago

    Hello, why do we use max Q(s_{n+1}, a_{n+1}) instead of V_w(s_{n+1})? Thank you.

    • @akarshrastogi3682 • 4 years ago • +1

      Because the advantage quantity measures how much better the best action is than the average expected value.
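
      In symbols, using the notation from the question above (a sketch under the assumption that V_w is the state-value baseline and Q is bootstrapped one step ahead):

      ```latex
      % Advantage of action a_n: how much better it is than the policy's baseline value of s_n.
      A(s_n, a_n) \;=\; Q(s_n, a_n) - V_w(s_n)
                 \;\approx\; r_n + \gamma \max_{a'} Q(s_{n+1}, a') \;-\; V_w(s_n)
      ```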

  • @sidharthkumar5517 • 7 days ago

    Video watchable at 1.5x, too slow otherwise, but informative