RL Course by David Silver - Lecture 3: Planning by Dynamic Programming

Google DeepMind

Додати в
- Мій плейлист
- Переглянути пізніше
Поділитися

Поділитися

Вставка

Розмір відео:

Показувати елементи керування програвачем

Автоматичне відтворення

Автоповтор

Опубліковано 12 січ 2025

КОМЕНТАРІ • 252

@NganVu 4 роки тому ⁺¹⁶⁹
2:25 Introduction
12:38 Policy Evaluation
29:40 Policy Iteration
1:01:45 Value Iteration
1:28:53 Extensions to Dynamic Programming
1:38:30 Contraction Mapping
@johnhefele5432 6 років тому ⁺²⁶⁹
I have to say, he is amazing at answering questions. In over 90% of cases, he fully understands what is being asked and responds with a well formulated, straightforward, answer.
@謝其宏-p3z 5 років тому ⁺⁴
I compare Udacity class with this. This works better for me
@孙兵兵-s7q 5 років тому ⁺¹
Fully agree.
@TheMikiwar 4 роки тому ⁺⁹
that is the gift good teachers have!
@donrosenthal5864 4 роки тому ⁺⁴
謝其宏 ditto regarding classes on Coursera. Those are not bad classes, but these lectures seem to make learning effortless by comparison.
@fallenangel8785 Рік тому
Questions are distracting
@sujiang5102 9 років тому ⁺¹²⁹
The questions from the students are in very high quality by being so informative and representative. And no double they help to make the lecture itself more clear.
@jerinantony007 7 років тому ⁺²⁵
Really?! I felt exactly the opposite!
@nautul14 7 років тому ⁺²
ya same here.
@riyaazshaik1540 7 років тому ⁺⁴
Ya the questions from students right from the first lecture are consistently the best , I have ever seen in any course.
I wonder who the students were? grad students? or were they professionals, who already hold advanced degrees in other domains.
@marcosrodriguez2496 7 років тому
jerin antony, can you elaborate on that?
@Alley00Cat 7 років тому ⁺¹
This is from a course at University College London. They are probably grad students with sufficient background
@akarshrastogi3682 5 років тому ⁺³⁵
Looking at these lectures the second time clears up things very well. David Silver is an excellent professor, his articulations are precise and calculated.
@ReatchOfficial 4 роки тому ⁺¹²
His explanations are great! Really had a hard time trying to understand these RL topics from the stanford lectures. This makes it much clearer now. Love how he always comes back to how each algorithm fits into the larger context.
@zhongchuxiong 2 роки тому ⁺³
dynamic programming - fundamental block to solve bellman euqation.
0:36 outline
2:25 Introduction
12:38 Policy Evaluation
29:40 Policy Iteration
1:01:45 Value Iteration
1:28:53 Extensions to Dynamic Programming
1:38:30 Contraction Mapping
@marloncajamarca2793 6 років тому ⁺²¹
Great Lecture about inner works of Dynamic Programming! Prof. Silver's explanations and intuitions are very helpful.Thanks for release this material for free!
@zahash1045 5 років тому ⁺²⁰
after all these months of head scratching, I now finally understand policy and value iteration.
Thanks to Mr. Silver
@ryanwoodruff4247 3 роки тому ⁺¹
Dr. (or Prof.) I believe
@ThuongPham-wg4bc Рік тому
After the first 3 lectures of the series, I would say that it's the best lecture about Reinforcement learning I have took. It helped me a lot when I studied Optimal Robotics control in my master program. Thanks D.Silver for this course.
@mind-set 3 роки тому ⁺²
Thank you David Silver for best explanation of RL I have ever seen.
@sergigalve 7 років тому ⁺¹⁰
awesome lectures !! you made it easy to achieve a really good understanding of the reinforcement learning (im talking after watching the first 3 videos), thanks for this great lectures !!
@billykotsos4642 3 роки тому
This is quite possibly the best lecture I have sat through…. Either in-person or online !
@LightStorm14 6 років тому ⁺⁸
Thank you very much for making this public I've learned a lot!
@karthik-ex4dm 6 років тому ⁺⁴
His tutorials are Gold..Thanks Deepmind
@СергейВоробьёв-ы5л8б 3 роки тому
28:14 All 6 fields converges exactly to 20.(0) by using equation. It is because this values represent expected reward starting from one of this field and folowing policy 1/4 for each direction and also -1 reward for each transition. More interestingly is to programm this grid with a pawn running on each direction randomly, with policy 1/4, each time spawning in such spot after achieving termination spot. Over a minute after 1000000 runs you will see that average reward converges to -20.0...., it looks almost like a magic and works perfeclty of course for any other field. Author was just confused a little by a question. Respect for such good lectures.
@shapeoperator 8 років тому ⁺³³
The person asking the question at 28:07 is mistaken, and the lecturer's answer is incorrect. The incorrect premise of the question is that the only "terminal state" is the lower right corner, and so the square (3,3) (labelled from the bottom left) should be "further" from the terminal point than (4,3). This is wrong: the top left corner is also part of the terminal state, and is closer to (3,3). Dr. Silver's answer does not address this. Instead, he (apparently incorrectly) says the discrepancy is due to rounding error. However, the numbers shown are consistent with an exact solution of the problem by solving Poisson's equation for the random walk.
@sharonxu6548 8 років тому
But you can see that the algorithm has converged to the optimal policy, which would choose the shortest path from the nearest terminal to each state. This would lead to the bottom right terminal for (2, 4), with a higher "reward" (lower cost) than either terminal for (2, 3) (the states asked about were actually (2, 3) and (2,4)). For large k (and k = infinity here), you'll get many more iterations through the optimal policy, outweighing the rewards from initial policies.
@shapeoperator 8 років тому ⁺¹³
This is incorrect. You are making the implicit assumption that the value
function will reproduce the distance to nearest terminal state. This is
not so. If it were, the states labelled 22, which are at distance 3
from either terminal point, would have the same value as the state you
label (2,3), which has value 20.
Even though the greedy policy will find the shortest distance to the
nearest terminal state, the value function itself is actually the
expected time, under a random walk, to either terminal state. So there
is a trade-off in being nearer to one state but further to the other,
which explains the two 20s the person was asking about.
The point here is that we don't update the policy used to iterate the
value function. We're just finding the value function corresponding to a random walk. The greedy policy with respect to that value function happens to be optimal in this case.
@sharonxu6548 8 років тому ⁺¹
Ah yes I see. I misspoke; didn't notice this was for the policy evaluation section of the talk. Seems a bit tiresome to compute the weighted probabilities for all the possible paths of (2, 3) and (2,4). Does it come out to be the same thing then? Even if if not, I agree your explanation is more comprehensive than "rounding error."
@thomasmezzanine5470 8 років тому ⁺³
I also noticed this problem, and I programmed to verify these two confusing value function are all 20 even. It's not rounding error.
@Qiang12345 7 років тому ⁺⁶
I made a test program for that 4*4 grids and verified the numbers in the grids are correct. github.com/qqiang00/reinforce/blob/master/python/small_grid_world.py
@Alexnotjones Рік тому
Great courses. Very intuitive about abstract concepts. Just I tends to sit on the right side of my TV to make the image make sense.
@yunhokim7846 3 роки тому ⁺²
Cant believe this lecture is free and open! thank you!
@nabilbaalbaki7816 5 років тому ⁺³
I found it amazing how he's able to understand the questions. The guy with all the questions is ridiculous, and yet DS manages to get his questions and answer them effortlessly.
@siddhantrai7529 3 роки тому ⁺¹
The thing that David keeps on mentioning while working backwards is called as recursive leap of faith. We assume Nth state to be present and then make corresponding recursive call, which inturn breaks the problem into subproblem, like finding Fibonacci (n) and assuming our function would return us that. Then breaking into subproblem, just like we do in mathematical induction.
@kiuhnmmnhuik2627 7 років тому ⁺²
@1:16:00 In value iteration we produce a sequence v1->v2->v3->... He says that, say, v3 might not correspond to any policy. This makes sense, but we're really interested in finding a greedy policy using v3, which we can do as before. I find Silver's exposition slightly confusing at times.
edit: I think a student pointed out the same thing immediately after.
@edmonddantes4705 Рік тому ⁺¹
In value iteration, you are not interested in acting greedily, you are interesting in converging to v^*. Once you are close to v^* you can act greedily. Say you act greedily after n iterations because you are close to convergence. The value v^n might not correspond with any policy, but if the error between v^n and v^* is sufficiently small, acting greedily will work pretty well anyway.
@LeonidGaneline 8 років тому ⁺¹
-- At 28:00 I can see an *error* on the k = 3 left grid.
For (2 row 3 column cell) it should be arrows in all directions (from -3.0 to all -2.9).
And or course the same is for the (3 row 3 column cell).
So "optimal policy" is not yet here for k=3
-- Also at 25:00 on the right grids: all cells on the border cannot have arrows directed outside. It is important because we don't calculate probability of steps to outside. Say, cells in the corners have only two possible steps which has 0.5 probability etc.
@martinseiler630 8 років тому ⁺¹
First, the 2.9 value on the diagonal is truncated. It's greater than the other 2.9 value. Second, all actions are allowed in border and corner states. The only difference is, that some action might leave you in your state. You choose to go right, but the wind might blow you back.
@adamnoack2154 7 років тому ⁺⁷
fyi. at 45:22 he mentions that there is no cost for moving cars between locations. In the Sutton and Barto RL textbook on page 93 where this example is from, it says this: "Jack can move them between the two locations overnight, at a cost of *$2* per
car moved."
@surajbonagiri 6 років тому ⁺⁵
Maybe he simplified the problem statement to deliver the intuition to us.
@arriva1256 8 років тому ⁺⁵
Thank you for this amazing material. It is really interesting stuff. I am benefiting vastly
@chavdardanchev9584 4 роки тому ⁺¹
In the toy model with car rentals some conservation laws are not kept right. It is minor detail, but still is important first to analyze the problem well. If in the first spot cars are taken and returned with probability 3 , 3 per day this means that after a month our total cars will be the same, while the other dealer spot averages 4 by 2 this means after some finite time interval the dealer will run out of cars. The truth is that first spot averages more than 3 returns a day. Minor, but important, keep in mind that questioning the assumptions we make when defining the problem is as important as solving it.
@sengonzi2010 2 роки тому
I learnt a lot from these lectures. Thanks for sharing Prof.
@princehan920453024 3 роки тому
To the person caring about increase/decrease of state values by policy/value iteration, for the grid case, actually the state values are decreasing after each iteration in both iteration types.
@nphardness 6 років тому ⁺³
Thank you so much for sharing these lectures!
@FirespyWins 8 років тому ⁺²
Thank you so much for this amazing lecture!!!
@juliencorriveau-trudel295 Рік тому ⁺¹
I feel like the "Problems" (Prediction and Control, @10:00 and @1:24:17) could have better names.
"Prediction" refers to estimating the value for (a) future timepoint(s), at least, to my statistician ears. "Evaluation" seems more accurate.
On the other hand, "Control" speaks more of the act of choosing an action, while in the video, it means finding the best policy and value function. Shouldn't it be called something like "fitting" or "solving"?
@hadjseddikyousfi00 4 роки тому
Amazing explaination. Thnx Pr Silver
@마민욱-s9j 6 років тому ⁺³
In the grid world example around 28:00, why are the final v values so small? In words, it means 'expectation of cumulative discounted reward'. Then I guess they should be between -1~-5, because at any tile, we can reach the terminal state only with 1 to 5 steps. Can anyone please explain why do the final v values are much smaller than that? (they range from -14 to -22)
@ricardoV94 6 років тому ⁺⁴
It's assuming a random walk. Even in those states where you are next to the goal, 3/4 of the time you will not be moving to it but instead going somewhere farther away. It is only when you are following an optimal policy that the state values become the negative of steps left.
@SphereofTime 5 місяців тому ⁺¹
1:15 optimal policy, solution metrics
@michaelcheung6290 Рік тому
4:45 I guess in that example of the shortest path, by breaking it down into two subproblems, you need an additional constraint.
For example, the shortest path between two points A and B is a straight line. If you take the midpoint C, and try to find the shortest path between AC then CB, then "add" them back together, yes you can get the shortest "length". But there are infinite many solutions to the "path" itself because the path can be "bent", i.e. the line ACB is not a perfect straight line. Here we assume we have Euclidean geometry.
Is that true for other problems that we need additional constraints?
@ThomasWoodGo 8 років тому ⁺⁸
It's called dynamic programming because the Air Force general Bellman was working under was somewhat short sighted and didn't appreciate the value of quantitative research. Thus Bellman chose "Dynamic" for the name of his brand of programming because he felt there was no way to use dynamic in a pejorative sense. Also "...we can think of the value function as caching the solutions to all of our subproblems...." is great stuff. Many thanks.
@marcosrodriguez2496 7 років тому ⁺¹
Looks like you were trolled by Goeff Hinton, who keeps perpetuating this myth.
@kinkokonko 7 років тому
Was it Bellend?
@JaysonSunshine 6 років тому
Caching subproblems is known as memoization.
@jackflash9123 6 років тому ⁺¹
@@marcosrodriguez2496 Bellman himself pointed this out in his autobiography. So I don't think this is a myth but actually a really funny fact.
@akshatgarg6635 4 роки тому
If you are loooking for the answer to 55:25 , i would recomend you to read cauchy sequences
@georgehu8652 Рік тому
Best course🎉
@dhanpalsiddharth4667 Рік тому
At 1:16:00, Am I right in saying that the optimal value Dynamic Programming outputs is equivalent to the value given by ((I-\gamma P)^-1)*R.
@yli6050 Рік тому
By next year, he would lead the AlphaGo team to win the world championship by a triumphant 4:1. A legend already.
@ijawadi 4 роки тому
Simply amazing! Thank you!
@SuperBiggestking 2 роки тому
I was confused about the value iteration having v_(k+1) >= v_(k) but my understanding of it now is this:
The policy improvement value function inequality only holds for an evaluated policy (i.e. v_pi(s)). So for successived policies, the value of the current policy will be greater than the policy before.
This is not true between successive v's during policy evaluation (i.e. for v_1, v_2, ..., v_pi). v_(k+1) >= v_(k) will not necessarily hold.
@edmonddantes4705 Рік тому
It is definitely not true that v_(k+1) >= v_(k) for value iteration. I am not sure where you got that from, but not true. Even at 1:07:00 you can see that in that particular example, exactly the opposite happens. Anyway, there is no guarantee that the value grows at every iteration in value iteration, although there are error bounds coming from Banach fixed point theorem. However, in policy iteration, the value always increases since greedy actions ensure that. Maybe you meant policy iteration and not value iteration.
@stefanherbek2025 5 років тому
In policy iteration you do policy evaluation - policy improvement iteratively. Is policy improvement not the same as "updating the policy"? 1:01:08
@k-bala-vignesh 7 років тому ⁺¹
excellent lecture!
@anandsaha01 7 років тому ⁺¹
30:51 Why are we using the value function from MRP? Shouldn't we be using that of MDP, given that we have actions and rewards in the grid world?
@qqcat3678 3 роки тому
Yes, value function from MDP is better since we have policy here, the value function from MRP is more general though.
@edmonddantes4705 Рік тому
That expectation depends on the policy pi, which affects all the future immediate returns. So it is not exactly the value function in MRPs, since there is no policy in MRPs, even though the formula might look the same. However, the value function of an MDP is the value function of the associated MRP, where the transition probabilities P_{ss'} and the expected immediate rewards R_s are calculated naturally by using the fixed policy pi.
@richardfredlund3802 6 років тому
at 28:12 when he asks about the values being the same at grid square 2, and grid square 6 (both -20). I implemented this in python and ran until the difference of absolute values between iterations was
@richardfredlund3802 6 років тому
I don't know why youtube put a line through what I wrote. hopefully it's still readable, the true value does appear to be the same for both those squares. I implemented in python and got the same value to 8 decimal places.
@satyanathbhat 6 років тому
What gamma did you use? You used it for the fixed policy (uniformly random) right?
@richardfredlund3802 5 років тому
jmp.sh/kEfOxGV
This is the code. Gamma=1, action policy 0.25 for each direction.
@edmonddantes4705 Рік тому
I don't find it counter-intuitive. Note that the state at (2,4) is close to one of the goal states, but far from the other, and there are many paths that might lead to the other, since the policy is uniform. Those paths would incur in total negative return. The state (3,3) is further from one of the goal states, but closer to the other. The returns balance.
@emilyrong3767 6 років тому
Very excellent materials
@yuxinzhang9403 3 роки тому
At 19:00, it was stated that there always is a unique optimal solution for the MDP, what about non-MDPs where the solution may not be unique? i.e., we don't have enough information about the state to create an MDP problem.
@edmonddantes4705 Рік тому
Your question is too general to be interesting. You cannot ask whether there is uniqueness for non-MDPs, since "non-MDP" means absolutely nothing. There is no definition for that, you have not specified which assumptions you want to change or break. You first need to define your framework and then ask questions. It is not clear whether you are referring to discrete stochastic processes which are not Markov, to game strategies where your opponent could change their policy, or to some weird object that is not even a stochastic process. Anything that is not an MDP is a "non-MDP", hence my cat, a stone, and a cup of tea are non-MDPs. Anyway, the answer is that in general obviously there are no guarantees of uniqueness if you are not even defining what you problem is. I will also add that the optimal policy of an MDP is NOT UNIQUE like you stated, what is unique is ITS VALUE FUNCTION AND ACTION VALUE FUNCTION.
@icantfindausernamehe 2 роки тому
He makes me want to learn maths. Can you believe that.
@shilashm5691 2 роки тому
Nop
@icantfindausernamehe 2 роки тому
@@shilashm5691 How can I write math formulas like that ?
@deepgems2246 4 роки тому
In the car example at 39:00, is each iteration of the policy evaluated over many runs of the game?
@alexanderyau6347 7 років тому ⁺¹
Thanks! Good lecture!
@surajbonagiri 6 років тому ⁺¹
At 32:28, what does he mean by imporving the policy? As in, we are taking the same greedy policy in every iteration, rit? So, what is it we are improving here? Thanks.
@ashutoshsingh544 6 років тому
It's like learning from an agent that is more naive than you, for example in second iterationsyou are going to do what the random policy agent thinks is the best, and subsequently you will do what next agent will think is the best. The choice of best is taken by checking the value function.
@surajbonagiri 6 років тому
@@ashutoshsingh544 the best action would be in the direction of a state with msximum value. So, the policy here is "max(V(s))". Isn't this like a hard coded policy that cannpt be updated? Thanks.
@qqcat3678 3 роки тому
@@surajbonagiri I think that you misunderstand the meaning of the "policy". A policy is a distribution over actions. So max(V(s)) is not a policy actually, rather it's a way that helps us to find the policy. Here it allows us to find a deterministic policy.
@mhdmedfa 2 роки тому
The lecture is impressive! But I have to say I was so "patient" looking forward to the moment where the Contraction Mapping will be explained!
Spoiler alert:
.
.
.
.
.
It wasn't🙈
@edmonddantes4705 Рік тому
It is in the extra material at the end of the slides, pretty easy to follow :). Just download the slides.
@Korea_University_of_Medicine Рік тому
38:35 Both locations have the same probability distribution, but why is the policy map not symmetrical?
@edmonddantes4705 Рік тому
Mate, the two locations have different probability distributions. Not to be offensive, but are you blind or what? 😂. It is clearly stated in the slide you linked!
@jushkunjuret4386 7 років тому
Under the policy evaluation - policy improvement framework (35:45), after at least one step policy evaluation, does the value initializes randomly or based on the best value of the last step?
@Chitrang62 4 роки тому
basically, you find the greedy policy and then compute the values again using the values from the last steps. and not the best values from the last step
@varunpritham7291 7 місяців тому
Its crazy when you know how the Shortest Path Bellman Ford algo was derived
@kaspar8292 4 роки тому
In the car rental example, how do you act greedily w.r.t. the value function? Do you obtain your value function surface and then try to reinstate the state with the highest value by moving cars around? Say, for example, at iteration 0, the state (18,8) has the highest value, so from state (19,7) we obtain +1 as the optimal action for the next iteration?
@edmonddantes4705 Рік тому
This lecture is model-based, so the first thing you should do is calculate and store your transition probabilities P_{ss'}^a. Then at every step you first apply policy evaluation (various iterations of Bellman expectation equation by using your transition probabilities P_{ss'}^a), which gives you an approximation of v(s) for every s, and second you act greedily by calculating q(s,a) for every s and a (use again the transition probabilities P_{ss'}^a) and taking argmax in a. And that's one iteration of policy iteration. Where is the problem? Also, in case it is not clear, the reward at state s is simply the number of cars you expect to sell that day multiplied by 10 (since every car costs 10 dollars). That expected car number is easily computed using the given distributions.
@parvmaheshwari5650 3 роки тому
It was a great lecture. I only have one doubt. Can someone help me with some resources for approximate DP.
@jayantpriyadarshi9266 3 роки тому
Where will the professor discuss about contraction mapping ?? he said that he would do it later.
@yuxinzhang9403 3 роки тому
Is it possible to compute posterior action value? i.e., q(a_t = a | S_t+1 = s) using Bayesian equation so the agent knows which action to choose to arrive at a certain state and plan for its actions.
@wisemen94 6 років тому
Sorry in 11:40
It was said that optimal state-value function implies optimal policy? It's not clear how exactly?
@anselmoufc 6 років тому ⁺¹
If you have the optimal state value function, by the Bellman principle of optimality, the optimal policy is the greedy policy. This happens because, as you know the optimal state value function, you know how to act optimally from time t+1 onwards.
@wisemen94 6 років тому ⁺¹
After listening to all lectures, I'm a little surprised that i was asking this question. Now it makes sense. Thank you for the detailed explanation.
@edmonddantes4705 Рік тому
It does not actually. You need to calculate q* first after getting v*. The premise by the person who answered you is not true. You need the action-value function, not the state-value function.
@43SunSon 4 роки тому
Question: @21:56. IF this is policy evaluation, This means you evaluate (update) the value base on the previous policy (policy will be updated in each iter), is this right or wrong?
If yes, in k = 2, the cell(2,1) will not be -1.7. Because base on previous updated policy (k=1), you can ONLY go up. Hence, value of cell (2,1) when k= 2 should be 1 * (-1 + 0.0) = -1 (first 1 is policy prob, you can go up only; -1 is rewards you leave state, 0.0 is value of cell(1,1) at k = 1).
you do use the previous updated policy to calculate the current new value. Can someone verify my understanding? Thank you so much for your time.
@Chitrang62 4 роки тому
yes your understanding is correct. So for the first step, you compute the value function by using a random policy and after that you get the new values. and then using these new values you define the policy greedily. Now using this updated greedy policy you compute the value function.
@43SunSon 4 роки тому
@@Chitrang62 thank you so much, buddy. If I got more questions, is there a way I could ask you again?
@Chitrang62 4 роки тому
I am also still learning. You can contact me via email chitrang6@gmail.com
@zyzhang6406 4 роки тому
It can go to any of the four directions with pi=0.25 each.
@43SunSon 4 роки тому
@@zyzhang6406 ?? You mean, the last guy explanation is wrong ?
@43SunSon 4 роки тому
someone could answer this question please ?
@35:28 I do understand that: during the learning process, the policy is getting better and better. However, does this means, during the learning, you always use the updated policy to decide where to "jump" (pick action) ? Or you just do random action all the time to update the policy ?
I am asking this is because, if you use the updated policy during the learning, you make stuck at some states, am I right? that's why you should not follow the policy but just random pick action during the learning. Am I correct?
@suryansudash0502 4 роки тому
As he said that to iterate you use the greedy approach, so like in the gridworld example there will be more than one directions in some states/grids to move, which after 1-sweep(policy evaluation) and one policy update, will change the nearby state values (s'), thus for each step of evaluation+update you move through the whole state map finally converging at the point where further updates to the policy doesn't affect the state values much. You can try to simulate the same here, cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
@43SunSon 4 роки тому ⁺¹
@@suryansudash0502 thanks for your answer. May I confirm this with you, when you say "iterate you use the greedy approach", this means, I do need to follow the updated greedy policy to walk over all the gridworld map, instead of just random walk ? Is this correct? thank you again
@suryansudash0502 4 роки тому
@@43SunSon As far as I have understood, that seems to be correct, since you need to go as per as the arrow directions in the greedy approach instead of a random, or to say an exploring search methodology, as from the 2nd chapter of the book. Do correct me if I am wrong, and glad to help :)
@43SunSon 4 роки тому
@@suryansudash0502 oh, thanks,buddy. Btw, I am studying this on my own. when you say 2nd chapter of the book, which book ?
@suryansudash0502 4 роки тому
@@43SunSon Same, I am studying on my own too. As for the book, I referred to the "Reinforcement Learning: An Introduction" by Andrew Barto and Richard S. Sutton. These lectures are based on that book only, so I am watching these + reading it parallelly. You can also do the RL specialization on Coursera by University of Alberta, which uses that book's material for teaching and exercises.
@ShortVine 4 роки тому
God bless you.
@JagdeepPaniAtEarth 3 роки тому
It's not clear, when we update the values v_{k+1} , in what order should we update the states. It seems if we know the terminal state, we can first update the states which are closer to the terminal states. But again this will not be feasible when there are no terminal states, or the policy is stochastic.
@JagdeepPaniAtEarth 3 роки тому
Actually, it is partly answered towards end of the lecture ~1:29:00. More concrete answers for faster learning could be in recent papers.
@zinking3 7 років тому
the thing that most surprises me is the replacement of the infinite discount sum with just `the` value from the current dim state. why is that equivalent ? and is that just some approximation.
@Alley00Cat 7 років тому
It's equivalent because of the Bellman equation. It's an approximate arrive by monte carlo, hence you have to iterate to get a better estimate
@billykotsos4642 3 роки тому
Τhey say that a lecture is a review of book chapter…. But I don’t know if 188 slides can be considered a ‘short’/‘abridged’ version of a chapter !!!
@ytshortsgogogo 6 років тому
For the first equation on the slide at 53:00, why does the last equality hold?
@edmonddantes4705 Рік тому
Because pi is a deterministic policy. A particular action is always chosen with probability one at state s.
@xiaohongdeng825 6 років тому
After the first Gridworld example, the lecturer explained policy iteration process. Seems the policy expectation always uses the updated policy. But in the previous example, the left hand side sticks to random move policy and the right hand side manages to converges to the optimal. Why?
@himanshuladia9099 6 років тому ⁺¹
That's what he said. If we start with any random policy and evaluate the value function corresponding to that policy COMPLETELY, we can generate the optimal policy from the value function values. (Reason given at the end of the video). However, policy iteration is different. You're NOT COMPLETELY evaluating the value function for the policy. After every iteration of generating value function, you update the policy. Then you generate a new value function, and a new policy and so on. The reason this algorithm reaches to optimal policy is proved in the end of the policy iteration section. Don't confuse it with policy evaluation. Infact that's what the girl asks at 58:20 .
@xiaohongdeng825 5 років тому
Himanshu Ladia Actually in the example I was concerned with optimal policy is not guaranteed. See Sutton&Barto Figure 4.1
@qqcat3678 3 роки тому
@@xiaohongdeng825 the lecturer does mention somewhere in this lecture that it doesn't happen often that the optimal policy is found only by looking at the value function update of the initial policy, which is the random policy here.
@edmonddantes4705 Рік тому
@@himanshuladia9099 Your reasoning is completely wrong. Let me give you a trivial counterexample. Say we have two states s_1 and s_2. Define two actions at every state: either stay or move, i.e. the MDP is actually deterministic given the choice of the action. Let me define the immediate rewards: staying in s_2 gives reward 1, moving from any state gives reward 0, and staying in s_1 gives reward 2. I am going to choose a particular policy pi, which consists in always moving (in both states s_1 and s_2). Obviously, the value of that policy is 0. There is no way of going from the value of that policy to the optimal policy, which is moving at s_2 and staying at s_1.
@edmonddantes4705 Рік тому
@@qqcat3678 Yeah, he is wrong.
@arshammostaani3114 7 років тому
There is an ambiguity for me here. Can anyone help?
Ok, I agree that when no better policy can be found, the current policy is the optimal one for the set of values we already have. But how it is guaranteed that no better policy can be found while other paths of finding a better policy and value function have been selected?
@sairengpuiasailo6198 2 роки тому
@50:00 How did q_pi(s, pi(s)) = v_pi(s) ?
@sonnymupfuni6547 2 роки тому
q_pi(s, pi(s)) : follow pi policy for one step (from s), and then follow it for ever after.
v_pi(s) : follow pi policy from the state s (for all along)
These two values are the same.
@edmonddantes4705 Рік тому
Because pi is deterministic. One action is chosen with probability one, so the value of the state is the value of the action that is chosen.
@yunlongsong7618 8 років тому ⁺²
Gute gemacht, vielen Danke!
@sng5192 8 років тому ⁺⁴
Thanks for nice lecture. It would have been more good if there was more time to finish last part.
@파기송 9 місяців тому
Only flaw we can find on this video is the sound volume... Other than that it's masterpiece
@AkashSharma02 4 роки тому
Since policy iteration in each iteration results in realizable policies, would it mean that policy iteration, in general, will require fewer iterations to converge?
@edmonddantes4705 Рік тому
Why? Vanilla policy iteration is way more expensive than value iteration and by the Banach fixed point theorem, value iteration converges at rate gamma^n, whereas every evaluation in the evaluation part of policy iteration converges at rate gamma^n. Every backup in policy evaluation is 0(n²m), where n is the number of states and m is the number of actions. You need to do many evaluations of 0(n²m) and then act greedily, for which you need to calculate q* from v*, which is O(n²m) as well. However, every backup in value iteration is O(n²m) and you only need to act greedily once at the end, which is 0(n²m).
@spitfirerulz 3 роки тому
How great are the audience questions??
@princehan920453024 3 роки тому
In terms of Policy Evaluation, the initial uniform random policy is changing at every iteration, how come that is evaluation instead of optimisation (of policy)?
@princehan920453024 3 роки тому
I kind of understand. The uniform random policy actually does not change for all iterations since all choosable actions (can vary at each iteration) at each state share the same probability, which makes the policy still uniform (random/stochastic), although the actual probability value may change. So the probability distribution of action-state-map changing does not necessarily mean the policy changes, thus policy iteration indeed leads to the final optimal state values guided by this policy. Just in this simple grid example, the final results of policy and value iteration seemly have the same action-state mapping (optimal policy), but I believe in more complex cases they don't. So policy iteration indeed solves policy evaluation (prediction/planning), and value iteration solves policy optimisation (control).
@karlezrapilario 3 роки тому
If I used value iteration to solve for MDP, I only get the optimal v*(s). How do we get the optimal policy pi* from v*(s)?
@edmonddantes4705 Рік тому ⁺²
That is a very good question! You need to compute q*(s,a) for all a in A and all s in S by using the transition probabilities and the values of v* (look at the formulas in the previous lecture) and then act greedily with respect to q*. I was bothered by this as well, he skipped it and definitely should have been mentioned. You cannot improve your policy only with v*, you need q* to act greedily.
@konn81 3 роки тому
Hei guys, does anyone jknow where can we find the appendices with the additional proofs?
@prabhakarshenoy24 6 років тому
www.cs.ubc.ca/∼poole/demos/mdp/vi.html is not available..it says page not found :(
@twoplustwo5 7 років тому ⁺¹
car example - didn't get at all. How to read those charts?
@SayanGHD 5 років тому
he explains it later in the video
@lastsupper2963 4 роки тому ⁺¹
Fellow Indians pause on 43:24 for few seconds :)
@do_it8459 5 років тому
Any anyone, tell me how much time you take to complete this course..??
@serenitybiking587 9 років тому ⁺¹²
Can someone enable the automatic CC?
@abhi220 7 років тому
how?
@011azr 6 років тому
Just learn proper English then. It's really not that hard and I'm not even a native speaker.
@deconvfft 6 років тому ⁺³
Sometimes the audio is too low. CC would help in such cases.
@KaranSingh-ty5xr 6 років тому ⁺¹⁵
you dont have to be a dick about it
@@011azr
@SuperBiggestking 2 роки тому
The amount of times that I hear coughing in this video is alarming duing these COVID times HAHA
@billykotsos4642 5 років тому ⁺¹
188 slides
*has slight heart attack*
@woncoh1 5 років тому
27:50
1:28:27
@stefanherbek2025 6 років тому
Is the link at 1:23:52 still working?
@kalpantumdi898 6 років тому
Nope
@MinhTran-ew3on 5 років тому
is planning by dynamic programming is a model-based algorithm?
@Fireblazer41 5 років тому ⁺²
Yes as the entire environment dynamics is known and given by the MDP.
@FelixCrazzolara 3 роки тому ⁺⁷
Impossible to image a lecture with so many people coughing all the time in 2021...
@umountable 4 роки тому
But where are the cars going? The problem basically describes that on each day, 2 cars are lost ?!?
@edmonddantes4705 Рік тому
HAHAHAHAHA I love this comment. But basically yes, the assumptions are stupid. It also assumes that you only get 10 dollars independently of how many days the car you rented is gone 😂. It could be gone forever for 10 dollars :)))
@libphy 6 років тому ⁺¹
I wish there was less white noise
@chorinaelapin 6 років тому
Where can I find the algorithms/pseudocodes? Thanks.
@NNOTM 6 років тому ⁺²
They are in the book he mentioned, Reinforcement Learning - An Introduction by Richard Sutton and Andrew Barto. It's freely available online.
@donrosenthal5864 4 роки тому
@Nnotm and is THE standard textbook for just about any RL course you will take.
@scottoctave1353 5 років тому ⁺⁶
The guy sitting near the recording system is producing unnecessary noise.
@lindalinsefors6578 8 років тому ⁺¹
Grate lecture, but I have a question.
You show us that a optimal policy must satisfy Bellman’s Optimality Equation. And you show us this iterative method that can only terminate at a policy that satisfies Bellman’s Optimality Equation.
However:
policy is optimal => policy satisfies Bellman’s Optimality Equation (1)
is not the same as
policy satisfies Bellman’s Optimality Equation => policy is optimal (2)
Has some one proved (2), or do we just hope that it is true?
@lindalinsefors6578 8 років тому
Basically, has someone proved that Bellman's Optimality Equation has a unique solution?
@tomwei7077 8 років тому ⁺¹
I think this can be easily proofed by contradiction with Bellman's Optimality Equation.
@edmonddantes4705 Рік тому
@@lindalinsefors6578 Banach fixed point theorem gives you uniqueness when \gamma < 1. Just download the slides and read the extra material at the end. Also, he does not prove (1), he says it is true and carries out a hand-wavy argument, but it is slightly tricky to prove it rigorously. I did this in my notes and it is pretty interesting. Basically you prove the two inequalities.
@user-or7ji5hv8y 5 років тому
I wish the audio was better
@sawomirkapka89 2 роки тому
Did anybody manage to run the Value Iteration simulator from the link?
@shilashm5691 2 роки тому
First of all value iteration algo seems to be wrong.Bcoz in each iteration agent will do a single sweep(calculate value for single state)
@oblivious3862 3 роки тому
01:28:20
@jiansenxmu 7 років тому
mooi, dank u wel!
@WhishingRaven 4 роки тому
question volume is too low ㅠㅠ
@aartisingh5871 3 роки тому ⁺¹
If possible don't watch this with headphones on... GOD! Whosoever is coughing right next to the mic!!!

Наступне

Автоматичне відтворення

RL Course by David Silver - Lecture 4: Model-Free Prediction