I've seen many videos on various machine learning methods, and this is the best one. You really explain everything completely and in detail. Hope to see more videos from you! BR
The way he explains the material makes me want to sit in front of the PC for a long time. Excellent job, Professor Poupart, and thank you again for sharing these videos with us.
Yes. But only if you watch it at 1.5x. Otherwise it's snoozeville.
@avimohan6594 Idk, it felt like his answers to the questions raised were shaky and hand-wavy.
10:48 I believe the \gamma^n term should not be in the update for the value function, since effectively we are using MC prediction to estimate the value function.
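For context, assuming the slide follows the standard REINFORCE-with-baseline / actor-critic formulation (the notation below is a reconstruction, not necessarily the slide's), the two updates being contrasted would look roughly like:

\[
\begin{aligned}
w &\leftarrow w + \alpha_w \,\big(G_n - V_w(s_n)\big)\,\nabla_w V_w(s_n) && \text{(critic: MC prediction, no } \gamma^n\text{)}\\
\theta &\leftarrow \theta + \alpha_\theta \,\gamma^{n}\,\big(G_n - V_w(s_n)\big)\,\nabla_\theta \log \pi_\theta(a_n \mid s_n) && \text{(actor: policy gradient)}
\end{aligned}
\]

where G_n is the return observed from time n. The \gamma^n factor arises from differentiating the discounted objective with respect to the policy parameters, which is why it belongs in the actor update but not in the value-function regression.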
The instruction and explanations are great! Thanks for sharing your knowledge.
Thanks for the clear and detailed explanations.
34:19 "in practice, DPG and in fact all the other algorithms for actor critics that I showed before would all use a replay buffer and a target network" I probably misunderstood this, but isn't this incorrect, because the use of Experience Replay requires an off-policy algorithm, and only DPG is off-policy? The standard Actor-Critic and A2C algorithms that you showed are, afaik, on-policy and don't include importance sampling (like ACER does) to make them off-policy.
"However, experience replay has several drawbacks: . . . ; and it requires off-policy learning algorithms that can update from data generated by an older policy." - Asynchronous Methods for Deep Reinforcement Learning
I think the A2C algorithm (at least the version shown here) is off-policy, because we are computing the advantage function as r_n + \max_{a_{n+1}} Q(s_{n+1}, a_{n+1}) rather than taking the expectation w.r.t. the current policy (which would then make it on-policy). This is pretty similar to the difference between SARSA and Q-learning updates.
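A minimal sketch of the SARSA-vs-Q-learning distinction being drawn here, with a hypothetical tabular critic and made-up variable names (not the lecture's code): the on-policy target bootstraps from the action the current policy actually takes next, while the off-policy, max-based target bootstraps from the greedy action, which is what lets it learn from replayed data generated by an older policy.

```python
import numpy as np

# Hypothetical tabular critic for illustration: Q[s, a] estimates action values.
n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))
gamma, alpha = 0.99, 0.1

def sarsa_target(r, s_next, a_next):
    """On-policy target: bootstrap from the action the current policy actually chose."""
    return r + gamma * Q[s_next, a_next]

def q_learning_target(r, s_next):
    """Off-policy target: bootstrap from the greedy action, regardless of what was taken."""
    return r + gamma * Q[s_next].max()

def td_update(s, a, target):
    """Move Q(s, a) toward the chosen target."""
    Q[s, a] += alpha * (target - Q[s, a])

# Example transition (s, a, r, s_next, a_next), e.g. drawn from a replay buffer.
s, a, r, s_next, a_next = 0, 1, 1.0, 2, 0

# The max-based target is what makes replayed, stale-policy data usable;
# the SARSA target is only meaningful for data generated by the current policy.
td_update(s, a, q_learning_target(r, s_next))
td_update(s, a, sarsa_target(r, s_next, a_next))
```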
What is the difference between DPG and DDPG, other than that DDPG uses deep neural networks?
Hello, why do we use \max_{a_{n+1}} Q(s_{n+1}, a_{n+1}) instead of V_w(s_{n+1})? Thank you.
Because the advantage is measuring how much better the best action is than the average expected value of the state.
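Assuming the standard definitions (and with the discount written explicitly, which the comments above leave implicit), the advantage and the max-based one-step estimate discussed here would be:

\[
A(s_n, a_n) \;=\; Q(s_n, a_n) - V(s_n)
\;\approx\; r_n + \gamma \max_{a_{n+1}} Q(s_{n+1}, a_{n+1}) - V_w(s_n),
\]

i.e. how much better taking a_n (and then acting greedily, in this variant) looks compared to the baseline value V_w(s_n) of the state, which is the "average" the reply above refers to.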
Video watchable at 1.5x, too slow otherwise, but informative