This is SO GOOD, very clear and straight to the point
Best video out there for soft Q-learning (SQL) math, particularly some of the derivations that aren't properly explained in the paper.
You’re a great teacher. Please do more videos. Thanks
I AM THE LIVING EMBODIMENT OF THIS
At 36:36 (SAC): why do we need to use a network to approximate softmax(Q_w \ l) for the policy? (Can we not just use it directly?)
Wow!!!! Amazing. It really helped me a lot. Can you do a video on Option Critic and Hierarchical Reinforcement Learning
Hi, I wanted to ask about 17:31: if the sum in the denominator over all actions in any state is equal to 1, which holds by the definition of a stochastic policy, wouldn't that mean we are actually just dividing the numerator by 1, and that the softmax, in theory, isn't needed and is only there for practical/implementation reasons?
And also, shouldn't we have used that same condition, that the policy's probabilities over the different actions in a state sum to 1, in the objective function?
@miroslavkosanic2917 The softmax values summed over all actions equal 1, not the denominator. Not entirely sure of this.
The sum of π over actions should be 1, so it is a constrained optimization and the derivation should use a Lagrangian.
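For anyone curious, a sketch of that constrained derivation, assuming the entropy-regularized objective with temperature α (my assumption about the video's notation):

% Per-state objective with the constraint that the action probabilities sum to 1:
%   maximize_\pi  \sum_a \pi(a|s)\,\big(Q(s,a) - \alpha \log \pi(a|s)\big)
%   subject to    \sum_a \pi(a|s) = 1
\mathcal{L}(\pi,\lambda) = \sum_a \pi(a|s)\big(Q(s,a) - \alpha\log\pi(a|s)\big) + \lambda\Big(1 - \sum_a \pi(a|s)\Big)
% Setting the derivative w.r.t. \pi(a|s) to zero:
Q(s,a) - \alpha\big(\log\pi(a|s) + 1\big) - \lambda = 0
\quad\Rightarrow\quad \pi(a|s) = \exp\!\Big(\tfrac{Q(s,a)-\lambda}{\alpha} - 1\Big)
% Choosing \lambda so that the constraint holds recovers the softmax:
\pi(a|s) = \frac{\exp\big(Q(s,a)/\alpha\big)}{\sum_{a'} \exp\big(Q(s,a')/\alpha\big)}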
This course is amazing! Very nice job
(I was struggling with Soft policies, now I think I understand much more :) )
I have some quick questions though:
- at 16:15 you say that the objective is concave, but why is that?
- at 25:30 we try to prove that Q grows for any pair (a,s). I can understand that the sum over actions of Q(a,s) will grow, but I am confused about the "any pair" part: if at one state the Q value of an action was overestimated and we then adjust it as we learn, the Q value of that action will decrease and the Q values of the other actions will increase, no? I am just confused how the Q values can monotonically increase, even for bad actions.
- at 31:00 I am very confused how repeatedly applying the derivation can give Q^{pi+1}
- at 31:39 where does the epsilon come from? (I didn't see it anywhere before)
Thanks again for posting this class :)
1) Because the derivative dJ/dπ has the form Q - α(log π + 1): it is strictly decreasing in π, so the second derivative is negative and the objective is concave in π.
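Making that argument explicit (again assuming the entropy-regularized objective with temperature α, which is my reading of the video's setup):

% Objective at a fixed state, as a function of \pi(a|s):
J(\pi) = \sum_a \pi(a|s)\big(Q(s,a) - \alpha\log\pi(a|s)\big)
% First derivative w.r.t. \pi(a|s):
\frac{\partial J}{\partial \pi(a|s)} = Q(s,a) - \alpha\big(\log\pi(a|s) + 1\big)
% Second derivative is strictly negative for \pi(a|s) > 0, so J is concave:
\frac{\partial^2 J}{\partial \pi(a|s)^2} = -\frac{\alpha}{\pi(a|s)} < 0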
Thanks for the video! But I think slide 3 is slightly wrong, or at least phrased confusingly: in some cases the optimal policy is not deterministic. For example, in rock-paper-scissors, the optimal policy is stochastic.
For single-agent problems where we are trying to maximize the expectation of a reward, there will always exist a deterministic optimal policy. Rock-paper-scissors is a two-player zero-sum game, so it does not fall under the standard reinforcement learning setup.
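A rough sketch of the standard MDP argument behind that claim (not from the video):

% Let Q^*(s,a) be the optimal action-value function of the MDP. The greedy policy
%   \pi^*(s) = \arg\max_a Q^*(s,a)
% is deterministic, and for any (possibly stochastic) policy \pi:
V^\pi(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}\big[Q^\pi(s,a)\big] \le \max_a Q^\pi(s,a) \le \max_a Q^*(s,a) = V^{\pi^*}(s)
% so a deterministic policy already attains the optimal expected return.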
Sir, is this an undergraduate course or a graduate course?