I have to say, he is amazing at answering questions. In over 90% of cases, he fully understands what is being asked and responds with a well formulated, straightforward, answer.
The questions from the students are in very high quality by being so informative and representative. And no double they help to make the lecture itself more clear.
Ya the questions from students right from the first lecture are consistently the best , I have ever seen in any course. I wonder who the students were? grad students? or were they professionals, who already hold advanced degrees in other domains.
Looking at these lectures the second time clears up things very well. David Silver is an excellent professor, his articulations are precise and calculated.
His explanations are great! Really had a hard time trying to understand these RL topics from the stanford lectures. This makes it much clearer now. Love how he always comes back to how each algorithm fits into the larger context.
Great Lecture about inner works of Dynamic Programming! Prof. Silver's explanations and intuitions are very helpful.Thanks for release this material for free!
After the first 3 lectures of the series, I would say that it's the best lecture about Reinforcement learning I have took. It helped me a lot when I studied Optimal Robotics control in my master program. Thanks D.Silver for this course.
awesome lectures !! you made it easy to achieve a really good understanding of the reinforcement learning (im talking after watching the first 3 videos), thanks for this great lectures !!
28:14 All 6 fields converges exactly to 20.(0) by using equation. It is because this values represent expected reward starting from one of this field and folowing policy 1/4 for each direction and also -1 reward for each transition. More interestingly is to programm this grid with a pawn running on each direction randomly, with policy 1/4, each time spawning in such spot after achieving termination spot. Over a minute after 1000000 runs you will see that average reward converges to -20.0...., it looks almost like a magic and works perfeclty of course for any other field. Author was just confused a little by a question. Respect for such good lectures.
The person asking the question at 28:07 is mistaken, and the lecturer's answer is incorrect. The incorrect premise of the question is that the only "terminal state" is the lower right corner, and so the square (3,3) (labelled from the bottom left) should be "further" from the terminal point than (4,3). This is wrong: the top left corner is also part of the terminal state, and is closer to (3,3). Dr. Silver's answer does not address this. Instead, he (apparently incorrectly) says the discrepancy is due to rounding error. However, the numbers shown are consistent with an exact solution of the problem by solving Poisson's equation for the random walk.
But you can see that the algorithm has converged to the optimal policy, which would choose the shortest path from the nearest terminal to each state. This would lead to the bottom right terminal for (2, 4), with a higher "reward" (lower cost) than either terminal for (2, 3) (the states asked about were actually (2, 3) and (2,4)). For large k (and k = infinity here), you'll get many more iterations through the optimal policy, outweighing the rewards from initial policies.
This is incorrect. You are making the implicit assumption that the value function will reproduce the distance to nearest terminal state. This is not so. If it were, the states labelled 22, which are at distance 3 from either terminal point, would have the same value as the state you label (2,3), which has value 20. Even though the greedy policy will find the shortest distance to the nearest terminal state, the value function itself is actually the expected time, under a random walk, to either terminal state. So there is a trade-off in being nearer to one state but further to the other, which explains the two 20s the person was asking about. The point here is that we don't update the policy used to iterate the value function. We're just finding the value function corresponding to a random walk. The greedy policy with respect to that value function happens to be optimal in this case.
Ah yes I see. I misspoke; didn't notice this was for the policy evaluation section of the talk. Seems a bit tiresome to compute the weighted probabilities for all the possible paths of (2, 3) and (2,4). Does it come out to be the same thing then? Even if if not, I agree your explanation is more comprehensive than "rounding error."
I made a test program for that 4*4 grids and verified the numbers in the grids are correct. github.com/qqiang00/reinforce/blob/master/python/small_grid_world.py
I found it amazing how he's able to understand the questions. The guy with all the questions is ridiculous, and yet DS manages to get his questions and answer them effortlessly.
The thing that David keeps on mentioning while working backwards is called as recursive leap of faith. We assume Nth state to be present and then make corresponding recursive call, which inturn breaks the problem into subproblem, like finding Fibonacci (n) and assuming our function would return us that. Then breaking into subproblem, just like we do in mathematical induction.
@1:16:00 In value iteration we produce a sequence v1->v2->v3->... He says that, say, v3 might not correspond to any policy. This makes sense, but we're really interested in finding a greedy policy using v3, which we can do as before. I find Silver's exposition slightly confusing at times. edit: I think a student pointed out the same thing immediately after.
In value iteration, you are not interested in acting greedily, you are interesting in converging to v^*. Once you are close to v^* you can act greedily. Say you act greedily after n iterations because you are close to convergence. The value v^n might not correspond with any policy, but if the error between v^n and v^* is sufficiently small, acting greedily will work pretty well anyway.
-- At 28:00 I can see an *error* on the k = 3 left grid. For (2 row 3 column cell) it should be arrows in all directions (from -3.0 to all -2.9). And or course the same is for the (3 row 3 column cell). So "optimal policy" is not yet here for k=3 -- Also at 25:00 on the right grids: all cells on the border cannot have arrows directed outside. It is important because we don't calculate probability of steps to outside. Say, cells in the corners have only two possible steps which has 0.5 probability etc.
First, the 2.9 value on the diagonal is truncated. It's greater than the other 2.9 value. Second, all actions are allowed in border and corner states. The only difference is, that some action might leave you in your state. You choose to go right, but the wind might blow you back.
fyi. at 45:22 he mentions that there is no cost for moving cars between locations. In the Sutton and Barto RL textbook on page 93 where this example is from, it says this: "Jack can move them between the two locations overnight, at a cost of *$2* per car moved."
In the toy model with car rentals some conservation laws are not kept right. It is minor detail, but still is important first to analyze the problem well. If in the first spot cars are taken and returned with probability 3 , 3 per day this means that after a month our total cars will be the same, while the other dealer spot averages 4 by 2 this means after some finite time interval the dealer will run out of cars. The truth is that first spot averages more than 3 returns a day. Minor, but important, keep in mind that questioning the assumptions we make when defining the problem is as important as solving it.
To the person caring about increase/decrease of state values by policy/value iteration, for the grid case, actually the state values are decreasing after each iteration in both iteration types.
I feel like the "Problems" (Prediction and Control, @10:00 and @1:24:17) could have better names. "Prediction" refers to estimating the value for (a) future timepoint(s), at least, to my statistician ears. "Evaluation" seems more accurate. On the other hand, "Control" speaks more of the act of choosing an action, while in the video, it means finding the best policy and value function. Shouldn't it be called something like "fitting" or "solving"?
In the grid world example around 28:00, why are the final v values so small? In words, it means 'expectation of cumulative discounted reward'. Then I guess they should be between -1~-5, because at any tile, we can reach the terminal state only with 1 to 5 steps. Can anyone please explain why do the final v values are much smaller than that? (they range from -14 to -22)
It's assuming a random walk. Even in those states where you are next to the goal, 3/4 of the time you will not be moving to it but instead going somewhere farther away. It is only when you are following an optimal policy that the state values become the negative of steps left.
4:45 I guess in that example of the shortest path, by breaking it down into two subproblems, you need an additional constraint. For example, the shortest path between two points A and B is a straight line. If you take the midpoint C, and try to find the shortest path between AC then CB, then "add" them back together, yes you can get the shortest "length". But there are infinite many solutions to the "path" itself because the path can be "bent", i.e. the line ACB is not a perfect straight line. Here we assume we have Euclidean geometry. Is that true for other problems that we need additional constraints?
It's called dynamic programming because the Air Force general Bellman was working under was somewhat short sighted and didn't appreciate the value of quantitative research. Thus Bellman chose "Dynamic" for the name of his brand of programming because he felt there was no way to use dynamic in a pejorative sense. Also "...we can think of the value function as caching the solutions to all of our subproblems...." is great stuff. Many thanks.
I was confused about the value iteration having v_(k+1) >= v_(k) but my understanding of it now is this: The policy improvement value function inequality only holds for an evaluated policy (i.e. v_pi(s)). So for successived policies, the value of the current policy will be greater than the policy before. This is not true between successive v's during policy evaluation (i.e. for v_1, v_2, ..., v_pi). v_(k+1) >= v_(k) will not necessarily hold.
It is definitely not true that v_(k+1) >= v_(k) for value iteration. I am not sure where you got that from, but not true. Even at 1:07:00 you can see that in that particular example, exactly the opposite happens. Anyway, there is no guarantee that the value grows at every iteration in value iteration, although there are error bounds coming from Banach fixed point theorem. However, in policy iteration, the value always increases since greedy actions ensure that. Maybe you meant policy iteration and not value iteration.
That expectation depends on the policy pi, which affects all the future immediate returns. So it is not exactly the value function in MRPs, since there is no policy in MRPs, even though the formula might look the same. However, the value function of an MDP is the value function of the associated MRP, where the transition probabilities P_{ss'} and the expected immediate rewards R_s are calculated naturally by using the fixed policy pi.
at 28:12 when he asks about the values being the same at grid square 2, and grid square 6 (both -20). I implemented this in python and ran until the difference of absolute values between iterations was
I don't know why youtube put a line through what I wrote. hopefully it's still readable, the true value does appear to be the same for both those squares. I implemented in python and got the same value to 8 decimal places.
I don't find it counter-intuitive. Note that the state at (2,4) is close to one of the goal states, but far from the other, and there are many paths that might lead to the other, since the policy is uniform. Those paths would incur in total negative return. The state (3,3) is further from one of the goal states, but closer to the other. The returns balance.
At 19:00, it was stated that there always is a unique optimal solution for the MDP, what about non-MDPs where the solution may not be unique? i.e., we don't have enough information about the state to create an MDP problem.
Your question is too general to be interesting. You cannot ask whether there is uniqueness for non-MDPs, since "non-MDP" means absolutely nothing. There is no definition for that, you have not specified which assumptions you want to change or break. You first need to define your framework and then ask questions. It is not clear whether you are referring to discrete stochastic processes which are not Markov, to game strategies where your opponent could change their policy, or to some weird object that is not even a stochastic process. Anything that is not an MDP is a "non-MDP", hence my cat, a stone, and a cup of tea are non-MDPs. Anyway, the answer is that in general obviously there are no guarantees of uniqueness if you are not even defining what you problem is. I will also add that the optimal policy of an MDP is NOT UNIQUE like you stated, what is unique is ITS VALUE FUNCTION AND ACTION VALUE FUNCTION.
At 32:28, what does he mean by imporving the policy? As in, we are taking the same greedy policy in every iteration, rit? So, what is it we are improving here? Thanks.
It's like learning from an agent that is more naive than you, for example in second iterationsyou are going to do what the random policy agent thinks is the best, and subsequently you will do what next agent will think is the best. The choice of best is taken by checking the value function.
@@ashutoshsingh544 the best action would be in the direction of a state with msximum value. So, the policy here is "max(V(s))". Isn't this like a hard coded policy that cannpt be updated? Thanks.
@@surajbonagiri I think that you misunderstand the meaning of the "policy". A policy is a distribution over actions. So max(V(s)) is not a policy actually, rather it's a way that helps us to find the policy. Here it allows us to find a deterministic policy.
The lecture is impressive! But I have to say I was so "patient" looking forward to the moment where the Contraction Mapping will be explained! Spoiler alert: . . . . . It wasn't🙈
Mate, the two locations have different probability distributions. Not to be offensive, but are you blind or what? 😂. It is clearly stated in the slide you linked!
Under the policy evaluation - policy improvement framework (35:45), after at least one step policy evaluation, does the value initializes randomly or based on the best value of the last step?
basically, you find the greedy policy and then compute the values again using the values from the last steps. and not the best values from the last step
In the car rental example, how do you act greedily w.r.t. the value function? Do you obtain your value function surface and then try to reinstate the state with the highest value by moving cars around? Say, for example, at iteration 0, the state (18,8) has the highest value, so from state (19,7) we obtain +1 as the optimal action for the next iteration?
This lecture is model-based, so the first thing you should do is calculate and store your transition probabilities P_{ss'}^a. Then at every step you first apply policy evaluation (various iterations of Bellman expectation equation by using your transition probabilities P_{ss'}^a), which gives you an approximation of v(s) for every s, and second you act greedily by calculating q(s,a) for every s and a (use again the transition probabilities P_{ss'}^a) and taking argmax in a. And that's one iteration of policy iteration. Where is the problem? Also, in case it is not clear, the reward at state s is simply the number of cars you expect to sell that day multiplied by 10 (since every car costs 10 dollars). That expected car number is easily computed using the given distributions.
Is it possible to compute posterior action value? i.e., q(a_t = a | S_t+1 = s) using Bayesian equation so the agent knows which action to choose to arrive at a certain state and plan for its actions.
If you have the optimal state value function, by the Bellman principle of optimality, the optimal policy is the greedy policy. This happens because, as you know the optimal state value function, you know how to act optimally from time t+1 onwards.
It does not actually. You need to calculate q* first after getting v*. The premise by the person who answered you is not true. You need the action-value function, not the state-value function.
Question: @21:56. IF this is policy evaluation, This means you evaluate (update) the value base on the previous policy (policy will be updated in each iter), is this right or wrong? If yes, in k = 2, the cell(2,1) will not be -1.7. Because base on previous updated policy (k=1), you can ONLY go up. Hence, value of cell (2,1) when k= 2 should be 1 * (-1 + 0.0) = -1 (first 1 is policy prob, you can go up only; -1 is rewards you leave state, 0.0 is value of cell(1,1) at k = 1). you do use the previous updated policy to calculate the current new value. Can someone verify my understanding? Thank you so much for your time.
yes your understanding is correct. So for the first step, you compute the value function by using a random policy and after that you get the new values. and then using these new values you define the policy greedily. Now using this updated greedy policy you compute the value function.
someone could answer this question please ? @35:28 I do understand that: during the learning process, the policy is getting better and better. However, does this means, during the learning, you always use the updated policy to decide where to "jump" (pick action) ? Or you just do random action all the time to update the policy ? I am asking this is because, if you use the updated policy during the learning, you make stuck at some states, am I right? that's why you should not follow the policy but just random pick action during the learning. Am I correct?
As he said that to iterate you use the greedy approach, so like in the gridworld example there will be more than one directions in some states/grids to move, which after 1-sweep(policy evaluation) and one policy update, will change the nearby state values (s'), thus for each step of evaluation+update you move through the whole state map finally converging at the point where further updates to the policy doesn't affect the state values much. You can try to simulate the same here, cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
@@suryansudash0502 thanks for your answer. May I confirm this with you, when you say "iterate you use the greedy approach", this means, I do need to follow the updated greedy policy to walk over all the gridworld map, instead of just random walk ? Is this correct? thank you again
@@43SunSon As far as I have understood, that seems to be correct, since you need to go as per as the arrow directions in the greedy approach instead of a random, or to say an exploring search methodology, as from the 2nd chapter of the book. Do correct me if I am wrong, and glad to help :)
@@43SunSon Same, I am studying on my own too. As for the book, I referred to the "Reinforcement Learning: An Introduction" by Andrew Barto and Richard S. Sutton. These lectures are based on that book only, so I am watching these + reading it parallelly. You can also do the RL specialization on Coursera by University of Alberta, which uses that book's material for teaching and exercises.
It's not clear, when we update the values v_{k+1} , in what order should we update the states. It seems if we know the terminal state, we can first update the states which are closer to the terminal states. But again this will not be feasible when there are no terminal states, or the policy is stochastic.
the thing that most surprises me is the replacement of the infinite discount sum with just `the` value from the current dim state. why is that equivalent ? and is that just some approximation.
After the first Gridworld example, the lecturer explained policy iteration process. Seems the policy expectation always uses the updated policy. But in the previous example, the left hand side sticks to random move policy and the right hand side manages to converges to the optimal. Why?
That's what he said. If we start with any random policy and evaluate the value function corresponding to that policy COMPLETELY, we can generate the optimal policy from the value function values. (Reason given at the end of the video). However, policy iteration is different. You're NOT COMPLETELY evaluating the value function for the policy. After every iteration of generating value function, you update the policy. Then you generate a new value function, and a new policy and so on. The reason this algorithm reaches to optimal policy is proved in the end of the policy iteration section. Don't confuse it with policy evaluation. Infact that's what the girl asks at 58:20 .
@@xiaohongdeng825 the lecturer does mention somewhere in this lecture that it doesn't happen often that the optimal policy is found only by looking at the value function update of the initial policy, which is the random policy here.
@@himanshuladia9099 Your reasoning is completely wrong. Let me give you a trivial counterexample. Say we have two states s_1 and s_2. Define two actions at every state: either stay or move, i.e. the MDP is actually deterministic given the choice of the action. Let me define the immediate rewards: staying in s_2 gives reward 1, moving from any state gives reward 0, and staying in s_1 gives reward 2. I am going to choose a particular policy pi, which consists in always moving (in both states s_1 and s_2). Obviously, the value of that policy is 0. There is no way of going from the value of that policy to the optimal policy, which is moving at s_2 and staying at s_1.
There is an ambiguity for me here. Can anyone help? Ok, I agree that when no better policy can be found, the current policy is the optimal one for the set of values we already have. But how it is guaranteed that no better policy can be found while other paths of finding a better policy and value function have been selected?
q_pi(s, pi(s)) : follow pi policy for one step (from s), and then follow it for ever after. v_pi(s) : follow pi policy from the state s (for all along) These two values are the same.
Since policy iteration in each iteration results in realizable policies, would it mean that policy iteration, in general, will require fewer iterations to converge?
Why? Vanilla policy iteration is way more expensive than value iteration and by the Banach fixed point theorem, value iteration converges at rate gamma^n, whereas every evaluation in the evaluation part of policy iteration converges at rate gamma^n. Every backup in policy evaluation is 0(n²m), where n is the number of states and m is the number of actions. You need to do many evaluations of 0(n²m) and then act greedily, for which you need to calculate q* from v*, which is O(n²m) as well. However, every backup in value iteration is O(n²m) and you only need to act greedily once at the end, which is 0(n²m).
In terms of Policy Evaluation, the initial uniform random policy is changing at every iteration, how come that is evaluation instead of optimisation (of policy)?
I kind of understand. The uniform random policy actually does not change for all iterations since all choosable actions (can vary at each iteration) at each state share the same probability, which makes the policy still uniform (random/stochastic), although the actual probability value may change. So the probability distribution of action-state-map changing does not necessarily mean the policy changes, thus policy iteration indeed leads to the final optimal state values guided by this policy. Just in this simple grid example, the final results of policy and value iteration seemly have the same action-state mapping (optimal policy), but I believe in more complex cases they don't. So policy iteration indeed solves policy evaluation (prediction/planning), and value iteration solves policy optimisation (control).
That is a very good question! You need to compute q*(s,a) for all a in A and all s in S by using the transition probabilities and the values of v* (look at the formulas in the previous lecture) and then act greedily with respect to q*. I was bothered by this as well, he skipped it and definitely should have been mentioned. You cannot improve your policy only with v*, you need q* to act greedily.
HAHAHAHAHA I love this comment. But basically yes, the assumptions are stupid. It also assumes that you only get 10 dollars independently of how many days the car you rented is gone 😂. It could be gone forever for 10 dollars :)))
Grate lecture, but I have a question. You show us that a optimal policy must satisfy Bellman’s Optimality Equation. And you show us this iterative method that can only terminate at a policy that satisfies Bellman’s Optimality Equation. However: policy is optimal => policy satisfies Bellman’s Optimality Equation (1) is not the same as policy satisfies Bellman’s Optimality Equation => policy is optimal (2) Has some one proved (2), or do we just hope that it is true?
@@lindalinsefors6578 Banach fixed point theorem gives you uniqueness when \gamma < 1. Just download the slides and read the extra material at the end. Also, he does not prove (1), he says it is true and carries out a hand-wavy argument, but it is slightly tricky to prove it rigorously. I did this in my notes and it is pretty interesting. Basically you prove the two inequalities.
2:25 Introduction
12:38 Policy Evaluation
29:40 Policy Iteration
1:01:45 Value Iteration
1:28:53 Extensions to Dynamic Programming
1:38:30 Contraction Mapping
I have to say, he is amazing at answering questions. In over 90% of cases, he fully understands what is being asked and responds with a well formulated, straightforward, answer.
I compare Udacity class with this. This works better for me
Fully agree.
that is the gift good teachers have!
謝其宏 ditto regarding classes on Coursera. Those are not bad classes, but these lectures seem to make learning effortless by comparison.
Questions are distracting
The questions from the students are in very high quality by being so informative and representative. And no double they help to make the lecture itself more clear.
Really?! I felt exactly the opposite!
ya same here.
Ya the questions from students right from the first lecture are consistently the best , I have ever seen in any course.
I wonder who the students were? grad students? or were they professionals, who already hold advanced degrees in other domains.
jerin antony, can you elaborate on that?
This is from a course at University College London. They are probably grad students with sufficient background
Looking at these lectures the second time clears up things very well. David Silver is an excellent professor, his articulations are precise and calculated.
His explanations are great! Really had a hard time trying to understand these RL topics from the stanford lectures. This makes it much clearer now. Love how he always comes back to how each algorithm fits into the larger context.
dynamic programming - fundamental block to solve bellman euqation.
0:36 outline
2:25 Introduction
12:38 Policy Evaluation
29:40 Policy Iteration
1:01:45 Value Iteration
1:28:53 Extensions to Dynamic Programming
1:38:30 Contraction Mapping
Great Lecture about inner works of Dynamic Programming! Prof. Silver's explanations and intuitions are very helpful.Thanks for release this material for free!
after all these months of head scratching, I now finally understand policy and value iteration.
Thanks to Mr. Silver
Dr. (or Prof.) I believe
After the first 3 lectures of the series, I would say that it's the best lecture about Reinforcement learning I have took. It helped me a lot when I studied Optimal Robotics control in my master program. Thanks D.Silver for this course.
Thank you David Silver for best explanation of RL I have ever seen.
awesome lectures !! you made it easy to achieve a really good understanding of the reinforcement learning (im talking after watching the first 3 videos), thanks for this great lectures !!
This is quite possibly the best lecture I have sat through…. Either in-person or online !
Thank you very much for making this public I've learned a lot!
His tutorials are Gold..Thanks Deepmind
28:14 All 6 fields converges exactly to 20.(0) by using equation. It is because this values represent expected reward starting from one of this field and folowing policy 1/4 for each direction and also -1 reward for each transition. More interestingly is to programm this grid with a pawn running on each direction randomly, with policy 1/4, each time spawning in such spot after achieving termination spot. Over a minute after 1000000 runs you will see that average reward converges to -20.0...., it looks almost like a magic and works perfeclty of course for any other field. Author was just confused a little by a question. Respect for such good lectures.
The person asking the question at 28:07 is mistaken, and the lecturer's answer is incorrect. The incorrect premise of the question is that the only "terminal state" is the lower right corner, and so the square (3,3) (labelled from the bottom left) should be "further" from the terminal point than (4,3). This is wrong: the top left corner is also part of the terminal state, and is closer to (3,3). Dr. Silver's answer does not address this. Instead, he (apparently incorrectly) says the discrepancy is due to rounding error. However, the numbers shown are consistent with an exact solution of the problem by solving Poisson's equation for the random walk.
But you can see that the algorithm has converged to the optimal policy, which would choose the shortest path from the nearest terminal to each state. This would lead to the bottom right terminal for (2, 4), with a higher "reward" (lower cost) than either terminal for (2, 3) (the states asked about were actually (2, 3) and (2,4)). For large k (and k = infinity here), you'll get many more iterations through the optimal policy, outweighing the rewards from initial policies.
This is incorrect. You are making the implicit assumption that the value
function will reproduce the distance to nearest terminal state. This is
not so. If it were, the states labelled 22, which are at distance 3
from either terminal point, would have the same value as the state you
label (2,3), which has value 20.
Even though the greedy policy will find the shortest distance to the
nearest terminal state, the value function itself is actually the
expected time, under a random walk, to either terminal state. So there
is a trade-off in being nearer to one state but further to the other,
which explains the two 20s the person was asking about.
The point here is that we don't update the policy used to iterate the
value function. We're just finding the value function corresponding to a random walk. The greedy policy with respect to that value function happens to be optimal in this case.
Ah yes I see. I misspoke; didn't notice this was for the policy evaluation section of the talk. Seems a bit tiresome to compute the weighted probabilities for all the possible paths of (2, 3) and (2,4). Does it come out to be the same thing then? Even if if not, I agree your explanation is more comprehensive than "rounding error."
I also noticed this problem, and I programmed to verify these two confusing value function are all 20 even. It's not rounding error.
I made a test program for that 4*4 grids and verified the numbers in the grids are correct. github.com/qqiang00/reinforce/blob/master/python/small_grid_world.py
Great courses. Very intuitive about abstract concepts. Just I tends to sit on the right side of my TV to make the image make sense.
Cant believe this lecture is free and open! thank you!
I found it amazing how he's able to understand the questions. The guy with all the questions is ridiculous, and yet DS manages to get his questions and answer them effortlessly.
The thing that David keeps on mentioning while working backwards is called as recursive leap of faith. We assume Nth state to be present and then make corresponding recursive call, which inturn breaks the problem into subproblem, like finding Fibonacci (n) and assuming our function would return us that. Then breaking into subproblem, just like we do in mathematical induction.
@1:16:00 In value iteration we produce a sequence v1->v2->v3->... He says that, say, v3 might not correspond to any policy. This makes sense, but we're really interested in finding a greedy policy using v3, which we can do as before. I find Silver's exposition slightly confusing at times.
edit: I think a student pointed out the same thing immediately after.
In value iteration, you are not interested in acting greedily, you are interesting in converging to v^*. Once you are close to v^* you can act greedily. Say you act greedily after n iterations because you are close to convergence. The value v^n might not correspond with any policy, but if the error between v^n and v^* is sufficiently small, acting greedily will work pretty well anyway.
-- At 28:00 I can see an *error* on the k = 3 left grid.
For (2 row 3 column cell) it should be arrows in all directions (from -3.0 to all -2.9).
And or course the same is for the (3 row 3 column cell).
So "optimal policy" is not yet here for k=3
-- Also at 25:00 on the right grids: all cells on the border cannot have arrows directed outside. It is important because we don't calculate probability of steps to outside. Say, cells in the corners have only two possible steps which has 0.5 probability etc.
First, the 2.9 value on the diagonal is truncated. It's greater than the other 2.9 value. Second, all actions are allowed in border and corner states. The only difference is, that some action might leave you in your state. You choose to go right, but the wind might blow you back.
fyi. at 45:22 he mentions that there is no cost for moving cars between locations. In the Sutton and Barto RL textbook on page 93 where this example is from, it says this: "Jack can move them between the two locations overnight, at a cost of *$2* per
car moved."
Maybe he simplified the problem statement to deliver the intuition to us.
Thank you for this amazing material. It is really interesting stuff. I am benefiting vastly
In the toy model with car rentals some conservation laws are not kept right. It is minor detail, but still is important first to analyze the problem well. If in the first spot cars are taken and returned with probability 3 , 3 per day this means that after a month our total cars will be the same, while the other dealer spot averages 4 by 2 this means after some finite time interval the dealer will run out of cars. The truth is that first spot averages more than 3 returns a day. Minor, but important, keep in mind that questioning the assumptions we make when defining the problem is as important as solving it.
I learnt a lot from these lectures. Thanks for sharing Prof.
To the person caring about increase/decrease of state values by policy/value iteration, for the grid case, actually the state values are decreasing after each iteration in both iteration types.
Thank you so much for sharing these lectures!
Thank you so much for this amazing lecture!!!
I feel like the "Problems" (Prediction and Control, @10:00 and @1:24:17) could have better names.
"Prediction" refers to estimating the value for (a) future timepoint(s), at least, to my statistician ears. "Evaluation" seems more accurate.
On the other hand, "Control" speaks more of the act of choosing an action, while in the video, it means finding the best policy and value function. Shouldn't it be called something like "fitting" or "solving"?
Amazing explaination. Thnx Pr Silver
In the grid world example around 28:00, why are the final v values so small? In words, it means 'expectation of cumulative discounted reward'. Then I guess they should be between -1~-5, because at any tile, we can reach the terminal state only with 1 to 5 steps. Can anyone please explain why do the final v values are much smaller than that? (they range from -14 to -22)
It's assuming a random walk. Even in those states where you are next to the goal, 3/4 of the time you will not be moving to it but instead going somewhere farther away. It is only when you are following an optimal policy that the state values become the negative of steps left.
1:15 optimal policy, solution metrics
4:45 I guess in that example of the shortest path, by breaking it down into two subproblems, you need an additional constraint.
For example, the shortest path between two points A and B is a straight line. If you take the midpoint C, and try to find the shortest path between AC then CB, then "add" them back together, yes you can get the shortest "length". But there are infinite many solutions to the "path" itself because the path can be "bent", i.e. the line ACB is not a perfect straight line. Here we assume we have Euclidean geometry.
Is that true for other problems that we need additional constraints?
It's called dynamic programming because the Air Force general Bellman was working under was somewhat short sighted and didn't appreciate the value of quantitative research. Thus Bellman chose "Dynamic" for the name of his brand of programming because he felt there was no way to use dynamic in a pejorative sense. Also "...we can think of the value function as caching the solutions to all of our subproblems...." is great stuff. Many thanks.
Looks like you were trolled by Goeff Hinton, who keeps perpetuating this myth.
Was it Bellend?
Caching subproblems is known as memoization.
@@marcosrodriguez2496 Bellman himself pointed this out in his autobiography. So I don't think this is a myth but actually a really funny fact.
If you are loooking for the answer to 55:25 , i would recomend you to read cauchy sequences
Best course🎉
At 1:16:00, Am I right in saying that the optimal value Dynamic Programming outputs is equivalent to the value given by ((I-\gamma P)^-1)*R.
By next year, he would lead the AlphaGo team to win the world championship by a triumphant 4:1. A legend already.
Simply amazing! Thank you!
I was confused about the value iteration having v_(k+1) >= v_(k) but my understanding of it now is this:
The policy improvement value function inequality only holds for an evaluated policy (i.e. v_pi(s)). So for successived policies, the value of the current policy will be greater than the policy before.
This is not true between successive v's during policy evaluation (i.e. for v_1, v_2, ..., v_pi). v_(k+1) >= v_(k) will not necessarily hold.
It is definitely not true that v_(k+1) >= v_(k) for value iteration. I am not sure where you got that from, but not true. Even at 1:07:00 you can see that in that particular example, exactly the opposite happens. Anyway, there is no guarantee that the value grows at every iteration in value iteration, although there are error bounds coming from Banach fixed point theorem. However, in policy iteration, the value always increases since greedy actions ensure that. Maybe you meant policy iteration and not value iteration.
In policy iteration you do policy evaluation - policy improvement iteratively. Is policy improvement not the same as "updating the policy"? 1:01:08
excellent lecture!
30:51 Why are we using the value function from MRP? Shouldn't we be using that of MDP, given that we have actions and rewards in the grid world?
Yes, value function from MDP is better since we have policy here, the value function from MRP is more general though.
That expectation depends on the policy pi, which affects all the future immediate returns. So it is not exactly the value function in MRPs, since there is no policy in MRPs, even though the formula might look the same. However, the value function of an MDP is the value function of the associated MRP, where the transition probabilities P_{ss'} and the expected immediate rewards R_s are calculated naturally by using the fixed policy pi.
at 28:12 when he asks about the values being the same at grid square 2, and grid square 6 (both -20). I implemented this in python and ran until the difference of absolute values between iterations was
I don't know why youtube put a line through what I wrote. hopefully it's still readable, the true value does appear to be the same for both those squares. I implemented in python and got the same value to 8 decimal places.
What gamma did you use? You used it for the fixed policy (uniformly random) right?
jmp.sh/kEfOxGV
This is the code. Gamma=1, action policy 0.25 for each direction.
I don't find it counter-intuitive. Note that the state at (2,4) is close to one of the goal states, but far from the other, and there are many paths that might lead to the other, since the policy is uniform. Those paths would incur in total negative return. The state (3,3) is further from one of the goal states, but closer to the other. The returns balance.
Very excellent materials
At 19:00, it was stated that there always is a unique optimal solution for the MDP, what about non-MDPs where the solution may not be unique? i.e., we don't have enough information about the state to create an MDP problem.
Your question is too general to be interesting. You cannot ask whether there is uniqueness for non-MDPs, since "non-MDP" means absolutely nothing. There is no definition for that, you have not specified which assumptions you want to change or break. You first need to define your framework and then ask questions. It is not clear whether you are referring to discrete stochastic processes which are not Markov, to game strategies where your opponent could change their policy, or to some weird object that is not even a stochastic process. Anything that is not an MDP is a "non-MDP", hence my cat, a stone, and a cup of tea are non-MDPs. Anyway, the answer is that in general obviously there are no guarantees of uniqueness if you are not even defining what you problem is. I will also add that the optimal policy of an MDP is NOT UNIQUE like you stated, what is unique is ITS VALUE FUNCTION AND ACTION VALUE FUNCTION.
He makes me want to learn maths. Can you believe that.
Nop
@@shilashm5691 How can I write math formulas like that ?
In the car example at 39:00, is each iteration of the policy evaluated over many runs of the game?
Thanks! Good lecture!
At 32:28, what does he mean by imporving the policy? As in, we are taking the same greedy policy in every iteration, rit? So, what is it we are improving here? Thanks.
It's like learning from an agent that is more naive than you, for example in second iterationsyou are going to do what the random policy agent thinks is the best, and subsequently you will do what next agent will think is the best. The choice of best is taken by checking the value function.
@@ashutoshsingh544 the best action would be in the direction of a state with msximum value. So, the policy here is "max(V(s))". Isn't this like a hard coded policy that cannpt be updated? Thanks.
@@surajbonagiri I think that you misunderstand the meaning of the "policy". A policy is a distribution over actions. So max(V(s)) is not a policy actually, rather it's a way that helps us to find the policy. Here it allows us to find a deterministic policy.
The lecture is impressive! But I have to say I was so "patient" looking forward to the moment where the Contraction Mapping will be explained!
Spoiler alert:
.
.
.
.
.
It wasn't🙈
It is in the extra material at the end of the slides, pretty easy to follow :). Just download the slides.
38:35 Both locations have the same probability distribution, but why is the policy map not symmetrical?
Mate, the two locations have different probability distributions. Not to be offensive, but are you blind or what? 😂. It is clearly stated in the slide you linked!
Under the policy evaluation - policy improvement framework (35:45), after at least one step policy evaluation, does the value initializes randomly or based on the best value of the last step?
basically, you find the greedy policy and then compute the values again using the values from the last steps. and not the best values from the last step
Its crazy when you know how the Shortest Path Bellman Ford algo was derived
In the car rental example, how do you act greedily w.r.t. the value function? Do you obtain your value function surface and then try to reinstate the state with the highest value by moving cars around? Say, for example, at iteration 0, the state (18,8) has the highest value, so from state (19,7) we obtain +1 as the optimal action for the next iteration?
This lecture is model-based, so the first thing you should do is calculate and store your transition probabilities P_{ss'}^a. Then at every step you first apply policy evaluation (various iterations of Bellman expectation equation by using your transition probabilities P_{ss'}^a), which gives you an approximation of v(s) for every s, and second you act greedily by calculating q(s,a) for every s and a (use again the transition probabilities P_{ss'}^a) and taking argmax in a. And that's one iteration of policy iteration. Where is the problem? Also, in case it is not clear, the reward at state s is simply the number of cars you expect to sell that day multiplied by 10 (since every car costs 10 dollars). That expected car number is easily computed using the given distributions.
It was a great lecture. I only have one doubt. Can someone help me with some resources for approximate DP.
Where will the professor discuss about contraction mapping ?? he said that he would do it later.
Is it possible to compute posterior action value? i.e., q(a_t = a | S_t+1 = s) using Bayesian equation so the agent knows which action to choose to arrive at a certain state and plan for its actions.
Sorry in 11:40
It was said that optimal state-value function implies optimal policy? It's not clear how exactly?
If you have the optimal state value function, by the Bellman principle of optimality, the optimal policy is the greedy policy. This happens because, as you know the optimal state value function, you know how to act optimally from time t+1 onwards.
After listening to all lectures, I'm a little surprised that i was asking this question. Now it makes sense. Thank you for the detailed explanation.
It does not actually. You need to calculate q* first after getting v*. The premise by the person who answered you is not true. You need the action-value function, not the state-value function.
Question: @21:56. IF this is policy evaluation, This means you evaluate (update) the value base on the previous policy (policy will be updated in each iter), is this right or wrong?
If yes, in k = 2, the cell(2,1) will not be -1.7. Because base on previous updated policy (k=1), you can ONLY go up. Hence, value of cell (2,1) when k= 2 should be 1 * (-1 + 0.0) = -1 (first 1 is policy prob, you can go up only; -1 is rewards you leave state, 0.0 is value of cell(1,1) at k = 1).
you do use the previous updated policy to calculate the current new value. Can someone verify my understanding? Thank you so much for your time.
yes your understanding is correct. So for the first step, you compute the value function by using a random policy and after that you get the new values. and then using these new values you define the policy greedily. Now using this updated greedy policy you compute the value function.
@@Chitrang62 thank you so much, buddy. If I got more questions, is there a way I could ask you again?
I am also still learning. You can contact me via email chitrang6@gmail.com
It can go to any of the four directions with pi=0.25 each.
@@zyzhang6406 ?? You mean, the last guy explanation is wrong ?
someone could answer this question please ?
@35:28 I do understand that: during the learning process, the policy is getting better and better. However, does this means, during the learning, you always use the updated policy to decide where to "jump" (pick action) ? Or you just do random action all the time to update the policy ?
I am asking this is because, if you use the updated policy during the learning, you make stuck at some states, am I right? that's why you should not follow the policy but just random pick action during the learning. Am I correct?
As he said that to iterate you use the greedy approach, so like in the gridworld example there will be more than one directions in some states/grids to move, which after 1-sweep(policy evaluation) and one policy update, will change the nearby state values (s'), thus for each step of evaluation+update you move through the whole state map finally converging at the point where further updates to the policy doesn't affect the state values much. You can try to simulate the same here, cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
@@suryansudash0502 thanks for your answer. May I confirm this with you, when you say "iterate you use the greedy approach", this means, I do need to follow the updated greedy policy to walk over all the gridworld map, instead of just random walk ? Is this correct? thank you again
@@43SunSon As far as I have understood, that seems to be correct, since you need to go as per as the arrow directions in the greedy approach instead of a random, or to say an exploring search methodology, as from the 2nd chapter of the book. Do correct me if I am wrong, and glad to help :)
@@suryansudash0502 oh, thanks,buddy. Btw, I am studying this on my own. when you say 2nd chapter of the book, which book ?
@@43SunSon Same, I am studying on my own too. As for the book, I referred to the "Reinforcement Learning: An Introduction" by Andrew Barto and Richard S. Sutton. These lectures are based on that book only, so I am watching these + reading it parallelly. You can also do the RL specialization on Coursera by University of Alberta, which uses that book's material for teaching and exercises.
God bless you.
It's not clear, when we update the values v_{k+1} , in what order should we update the states. It seems if we know the terminal state, we can first update the states which are closer to the terminal states. But again this will not be feasible when there are no terminal states, or the policy is stochastic.
Actually, it is partly answered towards end of the lecture ~1:29:00. More concrete answers for faster learning could be in recent papers.
the thing that most surprises me is the replacement of the infinite discount sum with just `the` value from the current dim state. why is that equivalent ? and is that just some approximation.
It's equivalent because of the Bellman equation. It's an approximate arrive by monte carlo, hence you have to iterate to get a better estimate
Τhey say that a lecture is a review of book chapter…. But I don’t know if 188 slides can be considered a ‘short’/‘abridged’ version of a chapter !!!
For the first equation on the slide at 53:00, why does the last equality hold?
Because pi is a deterministic policy. A particular action is always chosen with probability one at state s.
After the first Gridworld example, the lecturer explained policy iteration process. Seems the policy expectation always uses the updated policy. But in the previous example, the left hand side sticks to random move policy and the right hand side manages to converges to the optimal. Why?
That's what he said. If we start with any random policy and evaluate the value function corresponding to that policy COMPLETELY, we can generate the optimal policy from the value function values. (Reason given at the end of the video). However, policy iteration is different. You're NOT COMPLETELY evaluating the value function for the policy. After every iteration of generating value function, you update the policy. Then you generate a new value function, and a new policy and so on. The reason this algorithm reaches to optimal policy is proved in the end of the policy iteration section. Don't confuse it with policy evaluation. Infact that's what the girl asks at 58:20 .
Himanshu Ladia Actually in the example I was concerned with optimal policy is not guaranteed. See Sutton&Barto Figure 4.1
@@xiaohongdeng825 the lecturer does mention somewhere in this lecture that it doesn't happen often that the optimal policy is found only by looking at the value function update of the initial policy, which is the random policy here.
@@himanshuladia9099 Your reasoning is completely wrong. Let me give you a trivial counterexample. Say we have two states s_1 and s_2. Define two actions at every state: either stay or move, i.e. the MDP is actually deterministic given the choice of the action. Let me define the immediate rewards: staying in s_2 gives reward 1, moving from any state gives reward 0, and staying in s_1 gives reward 2. I am going to choose a particular policy pi, which consists in always moving (in both states s_1 and s_2). Obviously, the value of that policy is 0. There is no way of going from the value of that policy to the optimal policy, which is moving at s_2 and staying at s_1.
@@qqcat3678 Yeah, he is wrong.
There is an ambiguity for me here. Can anyone help?
Ok, I agree that when no better policy can be found, the current policy is the optimal one for the set of values we already have. But how it is guaranteed that no better policy can be found while other paths of finding a better policy and value function have been selected?
@50:00 How did q_pi(s, pi(s)) = v_pi(s) ?
q_pi(s, pi(s)) : follow pi policy for one step (from s), and then follow it for ever after.
v_pi(s) : follow pi policy from the state s (for all along)
These two values are the same.
Because pi is deterministic. One action is chosen with probability one, so the value of the state is the value of the action that is chosen.
Gute gemacht, vielen Danke!
Thanks for nice lecture. It would have been more good if there was more time to finish last part.
Only flaw we can find on this video is the sound volume... Other than that it's masterpiece
Since policy iteration in each iteration results in realizable policies, would it mean that policy iteration, in general, will require fewer iterations to converge?
Why? Vanilla policy iteration is way more expensive than value iteration and by the Banach fixed point theorem, value iteration converges at rate gamma^n, whereas every evaluation in the evaluation part of policy iteration converges at rate gamma^n. Every backup in policy evaluation is 0(n²m), where n is the number of states and m is the number of actions. You need to do many evaluations of 0(n²m) and then act greedily, for which you need to calculate q* from v*, which is O(n²m) as well. However, every backup in value iteration is O(n²m) and you only need to act greedily once at the end, which is 0(n²m).
How great are the audience questions??
In terms of Policy Evaluation, the initial uniform random policy is changing at every iteration, how come that is evaluation instead of optimisation (of policy)?
I kind of understand. The uniform random policy actually does not change for all iterations since all choosable actions (can vary at each iteration) at each state share the same probability, which makes the policy still uniform (random/stochastic), although the actual probability value may change. So the probability distribution of action-state-map changing does not necessarily mean the policy changes, thus policy iteration indeed leads to the final optimal state values guided by this policy. Just in this simple grid example, the final results of policy and value iteration seemly have the same action-state mapping (optimal policy), but I believe in more complex cases they don't. So policy iteration indeed solves policy evaluation (prediction/planning), and value iteration solves policy optimisation (control).
If I used value iteration to solve for MDP, I only get the optimal v*(s). How do we get the optimal policy pi* from v*(s)?
That is a very good question! You need to compute q*(s,a) for all a in A and all s in S by using the transition probabilities and the values of v* (look at the formulas in the previous lecture) and then act greedily with respect to q*. I was bothered by this as well, he skipped it and definitely should have been mentioned. You cannot improve your policy only with v*, you need q* to act greedily.
Hei guys, does anyone jknow where can we find the appendices with the additional proofs?
www.cs.ubc.ca/∼poole/demos/mdp/vi.html is not available..it says page not found :(
car example - didn't get at all. How to read those charts?
he explains it later in the video
Fellow Indians pause on 43:24 for few seconds :)
Any anyone, tell me how much time you take to complete this course..??
Can someone enable the automatic CC?
how?
Just learn proper English then. It's really not that hard and I'm not even a native speaker.
Sometimes the audio is too low. CC would help in such cases.
you dont have to be a dick about it
@@011azr
The amount of times that I hear coughing in this video is alarming duing these COVID times HAHA
188 slides
*has slight heart attack*
27:50
1:28:27
Is the link at 1:23:52 still working?
Nope
is planning by dynamic programming is a model-based algorithm?
Yes as the entire environment dynamics is known and given by the MDP.
Impossible to image a lecture with so many people coughing all the time in 2021...
But where are the cars going? The problem basically describes that on each day, 2 cars are lost ?!?
HAHAHAHAHA I love this comment. But basically yes, the assumptions are stupid. It also assumes that you only get 10 dollars independently of how many days the car you rented is gone 😂. It could be gone forever for 10 dollars :)))
I wish there was less white noise
Where can I find the algorithms/pseudocodes? Thanks.
They are in the book he mentioned, Reinforcement Learning - An Introduction by Richard Sutton and Andrew Barto. It's freely available online.
@Nnotm and is THE standard textbook for just about any RL course you will take.
The guy sitting near the recording system is producing unnecessary noise.
Grate lecture, but I have a question.
You show us that a optimal policy must satisfy Bellman’s Optimality Equation. And you show us this iterative method that can only terminate at a policy that satisfies Bellman’s Optimality Equation.
However:
policy is optimal => policy satisfies Bellman’s Optimality Equation (1)
is not the same as
policy satisfies Bellman’s Optimality Equation => policy is optimal (2)
Has some one proved (2), or do we just hope that it is true?
Basically, has someone proved that Bellman's Optimality Equation has a unique solution?
I think this can be easily proofed by contradiction with Bellman's Optimality Equation.
@@lindalinsefors6578 Banach fixed point theorem gives you uniqueness when \gamma < 1. Just download the slides and read the extra material at the end. Also, he does not prove (1), he says it is true and carries out a hand-wavy argument, but it is slightly tricky to prove it rigorously. I did this in my notes and it is pretty interesting. Basically you prove the two inequalities.
I wish the audio was better
Did anybody manage to run the Value Iteration simulator from the link?
First of all value iteration algo seems to be wrong.Bcoz in each iteration agent will do a single sweep(calculate value for single state)
01:28:20
mooi, dank u wel!
question volume is too low ㅠㅠ
If possible don't watch this with headphones on... GOD! Whosoever is coughing right next to the mic!!!