RL Course by David Silver - Lecture 6: Value Function Approximation

  • Published 12 May 2015
  • Reinforcement Learning Course by David Silver, Lecture 6: Value Function Approximation
    Slides and more info about the course: goo.gl/vUiyjq

COMMENTS • 154

  • @TheAIEpiphany
    @TheAIEpiphany 3 years ago +134

    I'm grateful that David was successfully sampled in this iteration of the universe.

  • @Chr0nalis
    @Chr0nalis 7 years ago +306

    "Life is one big training set"
    D. Silver, 2015

    • @feebdaed
      @feebdaed 7 years ago +52

      "Follow the gradient, Neo"

    • @luoryan3355
      @luoryan3355 7 years ago +5

      Predict values of your future

    • @vikrantsaini4243
      @vikrantsaini4243 6 years ago +3

      "You are the running mean of your friend circle." Keep updating it towards your goal.

    • @kotamori1907
      @kotamori1907 5 years ago +14

      This motivates off-policy learning pretty well. We should learn from someone else's life.

    • @ProfessionalTycoons
      @ProfessionalTycoons 5 years ago +2

      Well my recall is so small.

  • @proximo08
    @proximo08 6 years ago +91

    this guy is definitely the best to explain RL

    • @crazyfrog2817
      @crazyfrog2817 5 years ago +9

      he is not a random guy...

    • @andrewgrebenisan6141
      @andrewgrebenisan6141 4 years ago +6

      @@crazyfrog2817 Yeah, it's interesting how random professors that know their stuff may not be good at teaching, but the best of the best always know their stuff and how to explain it

    • @ishansrt
      @ishansrt 4 years ago +6

      @@crazyfrog2817 he is a legend, people don't realize this

  • @yasseraziz1287
    @yasseraziz1287 3 years ago +52

    0:00 Motivations
    0:35 Outline
    0:35 Large-Scale Reinforcement Learning
    3:55 Value Function Approximation
    8:40 Types of Value Function Approximation
    11:55 Which Function Approximator?
    15:38 Incremental Methods
    15:43 Gradient Descent
    17:35 Value Function Approx. by Gradient Descent
    21:35 Feature Vectors
    23:23 Linear Value Function Approximation
    28:43 Table Lookup Features
    30:50 Incremental Prediction Algorithms
    31:10 Monte-Carlo with Value Function Approximation
    37:33 TD Learning with Value Function Approximation
    41:56 TD(lambda) with Value Function Approximation
    49:00 Control with Value Function Approximation
    52:30 Action-Value Function Approximation
    53:50 Linear Action-Value Function Approximation
    55:20 Incremental Control Algorithms
    56:20 Linear Sarsa with Coarse Coding in Mountain Car
    1:04:30 Study of lambda: Should we Bootstrap?
    1:06:10 Baird's Counterexample
    1:06:30 Parameter Divergence in Baird's Counterexample
    1:06:50 Convergence of Prediction Algorithms
    1:08:00 Gradient Temporal-Difference Learning
    1:09:00 Convergence of Control Algorithms
    1:10:19 Batch Methods
    1:12:30 Batch Reinforcement Learning
    1:13:30 Least Square Prediction
    1:15:25 Stochastic Gradient Descent with Experience Replay
    1:17:25 Experience Replay in Deep Q-Networks (DQN)
    1:24:46 DQN in Atari
    1:26:00 How much does DQN help?
    1:27:35 Linear Least Square Prediction (2)
    1:32:29 Convergence of Linear Least Squares Prediction Algorithms
    1:32:50 Least Squares Policy Iteration
    1:34:15 Chain Walk Example
    1:35:00 LSPI in Chain Walk: Action-Value Function

  • @MinhVu-fo6hd
    @MinhVu-fo6hd 7 years ago +127

    It's going to be pretty confusing the first time you hear about it, but give it time, try to understand some of it, and then come back to watch it again; you will see how much more you comprehend.

    • @nirajabcd
      @nirajabcd 3 years ago +2

      ditto

    • @mathavraj9378
      @mathavraj9378 3 years ago

      I feel this

    • @Chdy120
      @Chdy120 3 years ago

      @@mathavraj9378 j

    • @chamangupta4624
      @chamangupta4624 2 years ago

      done that

    • @lex6709
      @lex6709 1 year ago +1

      yeah iteration would be good if not perfect oracle policy at first sight

  • @florentinrieger5306
    @florentinrieger5306 9 months ago +7

    This video is yet another example that the complexity of a subject really depends on the one who is teaching it. Thank you David, for making RL so much more accessible and understandable! It is a real pleasure listening to those lectures of yours ❤

  • @LunahLiu
    @LunahLiu 4 years ago +36

    2020 and still watching this to refresh the knowledge 🙋‍♀️

    • @KayzeeFPS
      @KayzeeFPS 3 years ago +6

      Actually true, I was watching some other courses but gave up half way to come back to this goldmine of information and articulation.

  • @NganVu
    @NganVu 4 years ago +33

    1:38 Introduction
    15:28 Incremental Methods
    1:12:17 Batch Methods

  • @junzhu107
    @junzhu107 3 years ago +16

    This is by far the best course on DRL I have ever watched. I kind of got lost in the different notations in different papers, although I can code many of them up. But this video summarized the root of the problems and solutions concisely, which lets me make sense of what I have learned in the past. Thanks for the great work!

  • @shyom1d676
    @shyom1d676 1 year ago +2

    David Silver is an intelligent and humble person. He explains things very well; I'm glad that I came across his lectures.

  • @SadmanSakibEnan
    @SadmanSakibEnan 5 years ago +1

    Amazing lecture. Thanks for the upload.

  • @martagutierrez3858
    @martagutierrez3858 6 years ago +20

    holy cow, in the last 30 minutes this lecture goes completely off the rails. The number of concepts introduced and the number of slides shown increases exponentially towards the end :-)

    • @ChumpyZhang
      @ChumpyZhang 5 years ago +7

      Yeah, I assume that's also how TD(lambda) works. You start slow and then learn very fast. LOL

    • @alvinphantomhive3794
      @alvinphantomhive3794 4 years ago +2

      ​@@ChumpyZhang Yes, it's more efficient, but I guess it also carries some bias along the way.

    • @rotviepe
      @rotviepe 1 year ago

      If you can learn slow, you can learn fast

  • @robosergTV
    @robosergTV 6 years ago +19

    golden information for free. Thanks DeepMind! We need a more up-to-date course though :/

  • @trantandat2699
    @trantandat2699 4 years ago +1

    Very easy to understand! Outstanding lecture! Outstanding teacher!

  • @abeaumont10
    @abeaumont10 5 years ago

    This is a great lecture!

  • @aidankennedy6973
    @aidankennedy6973 3 years ago +2

    It's amazing how intuitive these concepts become after watching these lectures. David Silver makes the math and theory behind RL that seems so hard to grasp impossible to forget.

    • @florentinrieger5306
      @florentinrieger5306 9 months ago

      It is yet another example that the complexity of a subject really depends on the one who is teaching it.

  • @willlogs
    @willlogs 3 years ago +2

    This is a loooot of knowledge that probably took over 20 years to get this complicated and genius. The question is why would someone share it for free like this! It's a GIFT!

  • @ProfessionalTycoons
    @ProfessionalTycoons 5 years ago +1

    Very cool idea, experience replay with fixed Q-targets; this seems to be related to double Q-learning.

  • @existenence3305
    @existenence3305 4 years ago +10

    This reminds me of the quote by Lex Fridman that goes, "All machine learning is supervised learning. What differs is how that supervision is done."

  • @karthik-ex4dm
    @karthik-ex4dm 5 years ago +3

    It cannot get any better than this 😘

    • @jg9193
      @jg9193 5 years ago

      that's unfortunate

  • @ThuongPham-wg4bc
    @ThuongPham-wg4bc 7 months ago

    I don't really care about the math in the lecture; I just enjoy the moments listening to David talk about state-of-the-art RL and suddenly catching his idea.

  • @rilyoire
    @rilyoire 1 year ago

    This helped a lot

  • @TheKovosh
    @TheKovosh 5 years ago +18

    It would be great if this video had subtitles.

  • @kaspar8292
    @kaspar8292 3 years ago

    1:24:00 Why doesn't using old parameters have the same effect as propagating backwards in TD-learning would? Why doesn't it reverse the temporal flow of information and therefore make the Q-values diverge?

  • @focusonlife3242
    @focusonlife3242 3 years ago

    Tnx a lot 👍

  • @KarfontaAlec
    @KarfontaAlec 7 years ago +2

    Any ideas on how to adapt this to take multiple actions per time-step? I am working on a robot control problem, the simple crawling crate program described in the Berkeley RL lectures. The robot has two joints each with three actions; forward, hold, reverse. I can get it to work fairly well but when I add an extra limb it gets stuck in a local minimum of using just one arm and holding the other still. Ideally I would like the robot to control the torque on each motor but I do not know how I would set up the training examples for the NN. Right now the input is the angle and velocity of each joint and the output is the expected reward of each action. To get the actual target for a training example I use the Q-Value for the current step, the immediate reward plus the max expected reward from any action in the new state. So if there are four actions this Q-Value is used as the expected reward of the previous action. The other three action values in the sample are set to the current expected reward, so that only the value of the previous action is changed. I cannot think of how to adapt this to take multiple actions or output motor torques, any suggestions?

    • @sanjanavijayshankar5508
      @sanjanavijayshankar5508 3 years ago

      You have to discretize your action space. Use something like mgrid. Then use the index as a "scalar" control input

    • @sanjanavijayshankar5508
      @sanjanavijayshankar5508 3 years ago

      I would recommend DDPG for your application though

  • @timzhang6962
    @timzhang6962 7 years ago +2

    43:10, do we need to explicitly store the eligibility trace of features for all states in a look-up table?

    • @notthispunk
      @notthispunk 7 years ago +7

      If you learn each Q(S,A) directly you need to store the trace for each state. When learning the parameter vector w you only need to store the trace for each feature.
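
      In symbols (a sketch in the notation of the slides), the backward-view TD(lambda) update with linear function approximation keeps one trace component per weight, i.e. per feature, not per state:

      \delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})
      E_t = \gamma \lambda E_{t-1} + \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}) = \gamma \lambda E_{t-1} + \mathbf{x}(S_t)
      \Delta \mathbf{w} = \alpha \, \delta_t \, E_t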

  • @rohanbanerjee1924
    @rohanbanerjee1924 2 years ago

    Does anyone have a reference to the "Gradient Q-Learning" approach that was mentioned around 1:11:27? I couldn't seem to find it online. In particular, I am curious how this approach actually computes the gradient of the TD target (especially how it handles the max term). Perhaps it is related to Gradient TD somehow?

  • @adityagupta120
    @adityagupta120 5 years ago

    Great Lectures.
    Slides link isn't working.

  • @WahranRai
    @WahranRai 8 years ago +5

    16:39
    - gradient is the direction of descent !

    • @ArulSelvam4
      @ArulSelvam4 8 years ago +16

      +WahranRai No, Gradient is the direction of ascent and in most of gradient descent optimization we follow the direction of the negative gradient (for finding minima) math.stackexchange.com/questions/223252/why-is-gradient-the-direction-of-steepest-ascent

    • @kiuhnmmnhuik2627
      @kiuhnmmnhuik2627 6 years ago +4

      Note that @WahranRai said "- gradient is the direction of descent !". Note the "-".
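
      For reference, the update both replies are describing (a sketch in the slides' notation): the gradient of the cost J(w) points in the direction of steepest ascent, so gradient descent steps along its negative:

      \Delta \mathbf{w} = -\tfrac{1}{2} \alpha \, \nabla_{\mathbf{w}} J(\mathbf{w})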

  • @zijianxu8595
    @zijianxu8595 5 years ago

    In DQN, in order to avoid chasing a non-stationary target, we fix (R + y*maxQ(S', A', w)) for a while so that we chase a stable target. My question is: could we do the same thing and fix R + y*maxQ(S', A') in Q-learning?

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      In DQN you have weights, so it is the fact that you are changing your weights all the time due to gradient descent that motivates fixing those weights in the objective for a bit. You stabilise the objective. However, in normal Q-learning, the objective is not changing much every step, so intuitively to me it would not make sense.
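
      A minimal PyTorch sketch of the frozen-target idea discussed here (not DeepMind's code; the network shape, learning rate, and update period C are illustrative assumptions):

      import copy
      import torch
      import torch.nn as nn

      # Hypothetical Q-network; the architecture is only for illustration.
      q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
      target_net = copy.deepcopy(q_net)            # frozen parameters w-
      optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
      gamma, C = 0.99, 1000                        # discount, target-update period

      def dqn_step(step, s, a, r, s_next, done):
          """One update on a batch of transitions (tensors: s, a, r, s_next, done)."""
          # TD target is computed with the frozen w-, so the objective stays stable.
          with torch.no_grad():
              target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
          q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
          loss = nn.functional.mse_loss(q_sa, target)
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()
          if step % C == 0:                        # copy w- <- w every C steps
              target_net.load_state_dict(q_net.state_dict())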

  • @judenchd
    @judenchd 5 years ago

    Why is there an E_pi in the objective function?

  • @akshayrana9351
    @akshayrana9351 5 years ago +2

    I can never comprehend the backward views and eligibility traces, although the forward view still makes sense.

    • @muratcan__22
      @muratcan__22 4 years ago +1

      Backward view is doing online updates. You do that for every state in each time step. If you do that for every state and only apply those updates at the end of the episode, it will be equal to the updates you would do in the forward view. In the forward view we were updating the values of the states when they are seen. We were collecting all the n-step returns weighted by the appropriate lambda and applying them to the state that is seen. The return is the return from that state onwards in this case. In the backward view, we immediately update the value of the state, actually of all the states, at each step. It is easy to see that they are equal when lambda is 0. The eligibility trace loses its recursive part and it is only 1 when the actual state is the state we are going to update. And the sum of all those updates that would happen throughout the whole episode is equal to the forward 0-lambda return, which is actually the 1-step return, because we only care about the 1-step return in that case (all the other n-step returns are 0 because they have lambda in front of their weights).

  • @rohansinghvi3061
    @rohansinghvi3061 3 years ago

    For TD learning with value function approximation, using the linear value estimator as the TD target, won't that damage the algorithm?

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      Depends on how well you choose those features right? Feature selection is an art. Also, the look-up table case corresponds with linear function approximation.

  • @MinhVu-fo6hd
    @MinhVu-fo6hd 7 years ago +1

    At 43:28, he talked about the size of the eligibility traces. I don't understand why its size is the same as the parameters' size and not the size of the state space. Does anyone know the answer to this?

    • @kiuhnmmnhuik2627
      @kiuhnmmnhuik2627 6 years ago +3

      Let's assume we're using a linear approximation, that is: v(s,w) = f1(s) w1 + f2(s) w2 + ... fn(s) wn, where fk(s) is the k-th feature of the state s.
      In this case, the gradient of v in s is just the feature vector f(s). This will make what follows more intuitive. It's clear that if, say, f1(s) is much bigger than the other features, then we'll want to focus on w1 to increase/decrease v(s,w) as much as possible.
      Instead of using eligibility traces to assign credit to the states, we use eligibility traces to assign credit to the features which "summarize" the states.
      Let's say we have 4 states:
      1) it's raining and it's cold
      2) it's raining and it's hot
      3) it's sunny and it's cold
      4) it's sunny and it's hot
      The features are:
      r = it's raining
      c = it's cold
      which can take values 0 (false) or 1 (true).
      If we go through states 2 and 3, we will just remember the features (r,c) instead of the states themselves:
      e

  • @notthispunk
    @notthispunk 7 years ago +3

    How would you calculate the Q feature vector based on the action? Would the action considered be part of the feature vector? Otherwise how could my features (e.g. distance to the wall) depend on the considered action without explicit coding of how the environment works?

    • @KarfontaAlec
      @KarfontaAlec 7 years ago +2

      I am struggling with this same question. Maybe we can help each other think through this. It is confusing because in the gridworld or pacman examples you know the next state by the current state and action you take, but for my problem, robot control, you don't know how an action will affect your state.
      One solution like you said was to add a feature for the action. I may be wrong on this, haven't yet tried, but I think that there should be one feature for each action we are considering, with corresponding weights. So when we are in a state and looking for an action to take we loop over each possible action where the Q(s,a) for a particular action will have the feature value for that action as 1, then each other action feature is 0 (ignoring them).
      Another approach I am looking at is to have a different set of weights for each action. This may be less efficient but could be more expressive. Will need to do some tests once both approaches are working.
      Finally I think the best approach, and the one I have the least understanding of, would be to have a neural network that produces an output for each action rather than a single value. This would seem to be the most efficient in terms of memory and CPU time. And could be a simpler implementation with a black box NN.
      What are your thoughts on these?

    • @CarrKnight
      @CarrKnight 7 years ago

      I think perhaps that's why Powell defines the Q-value as a "Post-Decision State". For some implementations you do not need to add action as a feature. For example if you are playing chess you know immediately what the results are of taking your action (even though you can't predict the enemy movement so you don't have access to the transition function). So if your feature is "the enemy queen is in the game" YES/NO you can judge your possible actions without them appearing explicitly as features but that's because you can tell what your action is doing.

    • @neverfox
      @neverfox 7 years ago

      +Alec Karfonta Both of your last two suggestions are what I've seen in practice. Silver even mentioned the latter in a previous lecture. Adding actions to the feature vector would violate the standard distinction made between s and a, so while it might work, you'd be cutting against the grain of all of the literature.

    • @KarfontaAlec
      @KarfontaAlec 7 years ago

      I was finally able to get the value-function approximation working in practice on a crawling robot task. It works better and faster than the Q-Table implementation. I thought it would work better but take longer to train, it surpassed all my expectations. I have never before seen the robots move the way they are now.
      First I tried the method I suggested above to have a feature in the state for each action, it would be a 1 if the action was taken in the previous time step. This almost seemed to work, they would chase the immediate reward but would not take the extra step to take actions with no immediate reward to reach a state with an immediate reward.
      Then I tried the second method, where each action has its own set of weights for the state features and this worked like a dream. Within an hour of training they had mastered the task, where the Q-Table would take overnight. So at a time step the state features are combined with each action's set of weights to find the max expected reward action. The next time step the weights for only the previous action are updated using the delta from expected and actual reward (immediate + future from the new state).
      Now I am experimenting with more complex robots. With the old Q-Table implementation adding a limb would cause the size of the table to explode. With the value-function I have the computational capacity for pretty much any number of actions and state features. So I am trying to train the same crawling robot with just one extra identical limb. Though I seem to have a new problem. Ideally the robot should be able to control both limbs (four motors) at the same time. How do you guys think I could adapt the Q update to take four actions per time step?
      Maybe instead of one update that considers all actions at once there are four separate updates for each motor. The weights would be a three-dimensional array like [motor][action][state_feature]. The motor dimension would be of size four, the action dimension three (forward, stop, reverse). At each time-step four actions would be taken, the max action for each motor. The next time-step there would be four updates based on the action taken for each motor and the same previous state values. Though I'm worried this might exacerbate the credit assignment problem.

    • @Isaac668
      @Isaac668 6 years ago

      The feature vector is built off of a set of observations. Take a self driving car for example, you would have sensors that detect your distance to the next object, your accelerometer, your pressure gauge, etc. and that will essentially be a generalization of your state. It doesn't matter if you're in your garage at home or in the garage at your office, if your observations are identical, your state would be the same (although your endgoal, and hence where you will find your reward, will be different).
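
      A minimal numpy sketch of the two linear options debated in this thread (the feature sizes and names are illustrative assumptions): keep one weight vector per action, or fold a one-hot encoding of the action into the feature vector. Either way Q(s, a) is computed without a model of the environment.

      import numpy as np

      n_features, n_actions = 8, 3
      rng = np.random.default_rng(0)

      # Option 1: one weight vector per action -> Q(s, a) = w[a] . x(s)
      W = np.zeros((n_actions, n_features))

      def q_values_per_action_weights(x_s):
          return W @ x_s                      # vector of Q(s, a) for all actions

      # Option 2: action folded into the features -> Q(s, a) = w . x(s, a)
      w = np.zeros(n_features + n_actions)

      def q_value_action_in_features(x_s, a):
          one_hot = np.eye(n_actions)[a]      # indicator feature for the action
          return w @ np.concatenate([x_s, one_hot])

      x_s = rng.normal(size=n_features)       # hypothetical state features
      greedy_a = int(np.argmax(q_values_per_action_weights(x_s)))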

  • @dantealexis7835
    @dantealexis7835 5 years ago

    So is it that we don't store states explicitly unlike before where we had them in a look-up table? Or is it that we still have to store them, but now we can quickly estimate the values of similar states and add them to our table even without having to visit them?
    I'm trying to understand how not storing them will work when we have to take an action using action-values. If we don't have them explicitly stored, then we evaluate all actions possible in the current state in real-time using function approximation right?

    • @ajaybala8967
      @ajaybala8967 5 years ago

      So, the point of value function approximation is that you don't have to store the Q-value lookup table for all states and actions (it may be extremely large for many problems). The approximation of the Q-value can be used to "extrapolate" values of unvisited states (that is what is called generalization). Think of it like this: you have a bunch of x-values and y-values and you want to predict approximately the value of y given a new and unseen value of x. So, we try to fit a function to the known x and y values. This is exactly what supervised learning in ML does, but here we get the "training set" by interacting with the environment. Hope this helped.

    • @dantealexis7835
      @dantealexis7835 5 years ago

      @@ajaybala8967 Thanks for the reply. Truth be told, I figured it out and am halfway through my skl project using that exact same thing lol. Only problem now is that it is so computationally heavy that it is the first time I actually have to consider code optimisation haha. But thanks again for the help!

    • @ajaybala8967
      @ajaybala8967 5 years ago

      No problem!
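
      A minimal sketch of the supervised-learning analogy described above, assuming hand-chosen features x(s): Monte-Carlo value-function approximation is just incremental regression of the observed returns G_t onto the features.

      import numpy as np

      def mc_update(w, episode, features, alpha=0.01, gamma=1.0):
          """episode: list of (state, reward); features(state) -> np.array like w."""
          G = 0.0
          returns = []
          for state, reward in reversed(episode):    # accumulate returns backwards
              G = reward + gamma * G
              returns.append((state, G))
          for state, G in reversed(returns):         # back to chronological order
              x = features(state)
              v_hat = w @ x                          # linear value estimate
              w = w + alpha * (G - v_hat) * x        # SGD step toward the return
          return w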

  • @danieltjones01
    @danieltjones01 4 years ago

    Does the "action out" approximation around 11 minutes give us a dictionary as output?
    I.e. {Action 1: 7, Action 2: 6.5,...}

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      If you fixed the state, yes, you can view it like that. That's the computational way of interpreting it. The mathematical way would be a vector, each component containing the action-value of the fixed state given one of the possible actions.

  • @danieltjones01
    @danieltjones01 4 years ago

    Bottom of the slide at 30:20ish. To take the dot product of those vectors, shouldn't the state vector be transposed?
    Currently we have an nx1 state vector and an nx1 weight vector, giving us nx1 times nx1, which won't multiply. We need 1xn times nx1 to give us a 1x1 output.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      Not true. First of all, you are confusing scalar product with matrix multiplication. You are right in the sense that you obviously cannot multiply an nx1 matrix by an nx1 matrix, but that dot does not mean matrix multiplication. In mathematics, those vectors you are seeing are interpreted as vectors in the Euclidean space R^n (\mathbb{R}^n), and he is simply expressing two vectors of the Euclidean space as column vectors and writing a dot in-between to express scalar product. This is sometimes done in mathematics, physics, etc. Note that if I have two vectors x and y in R^n and I express vector x like a 1xn vector and y like an nx1 vector, the scalar product coincides with matrix multiplication. This is because in the Euclidean case, the dot product can be rewritten in terms of matrix multiplication of suitable matrices.
      I will add that in general, you just need an inner product space in order to define a scalar product, and a scalar product is defined BETWEEN ELEMENTS OF THE SAME SPACE. Your space could be R^n, a space of functions like L^2, A space of sequences, etc.
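
      In symbols, the point being made above: for x(S) and w in R^n, the dot on the slide is the Euclidean inner product, which agrees with matrix multiplication once one vector is written as a row (a notational sketch):

      \hat{v}(S, \mathbf{w}) = \mathbf{x}(S) \cdot \mathbf{w} = \mathbf{x}(S)^{\top} \mathbf{w} = \sum_{j=1}^{n} x_j(S) \, w_j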

  • @mohammadkhodadad1067
    @mohammadkhodadad1067 4 years ago

    43:40 How can we apply eligibility traces to a model when some of its weights come from something black-box?

    • @alvinphantomhive3794
      @alvinphantomhive3794 4 years ago

      Every time a state transition occurs, you immediately decay the eligibility traces for all the states you have visited, but at the same time you also increase the eligibility trace, but only for the current state. With function approximation, each state is represented as a feature vector.

  • @QDownGames
    @QDownGames 6 years ago

    I am trying to find more information about the "Coarse Coding" function approximation method online (at approx. 1:01:00). Anyone have some useful links? It appears to be quite elusive.

  • @silwate
    @silwate 2 years ago +1

    please add captions

  • @guptamber
    @guptamber 1 year ago

    I am puzzled by some comments. I felt the lectures in general become rushed and hand-wavy towards the end. The first 4-5 lectures are really good, but after that it feels more fragmented. I realized this more when I saw the DQN slides; that is a familiar topic to me, and the way it was explained was very upside down and cryptic.

  • @shaz7163
    @shaz7163 6 years ago +1

    I don't understand how they train with experience replay at 1:17. Is it that at each time step you take some state and target pairs that the model has seen before and apply stochastic gradient descent? Also, how do they improve the policy at each time step? When using the greedy policy, which function approximator do they use: the frozen one or the updating one?

    • @jeromeblt1432
      @jeromeblt1432 6 years ago

      I think it's each timestep. 40:30

    • @jeromeblt1432
      @jeromeblt1432 6 years ago

      "Every step, what we do, we sample some mini-batch from our dataset ... optimise MSE between Q-network and Q-Learning targets" 1:18:34. Also, I suppose the 'i' in the loss function equation means "iteration". So calculate the loss at each iteration (timestep).

    • @jeromeblt1432
      @jeromeblt1432 6 years ago

      1:22:15 "any iterations 'i' " , "optimize a loss function"
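
      A minimal sketch of the loop quoted above (the buffer capacity, batch size, and the q_update function are assumptions, not the lecture's code): transitions go into a bounded replay memory, and every step a random mini-batch is drawn for one SGD update.

      import random
      from collections import deque

      replay = deque(maxlen=100_000)      # bounded cyclic buffer of transitions

      def store(s, a, r, s_next, done):
          replay.append((s, a, r, s_next, done))

      def sample_minibatch(batch_size=32):
          # Random sampling breaks the correlation between consecutive transitions.
          return random.sample(replay, batch_size)

      # At every environment step (once the buffer is warm enough):
      #   store(s, a, r, s_next, done)
      #   if len(replay) >= 32:
      #       batch = sample_minibatch()
      #       q_update(batch)             # hypothetical SGD step on the DQN loss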

  • @renney77
    @renney77 6 years ago

    Does anybody here know if the Q estimates or weights have to stay in a range like 0.0-1.0, or can they just be any real value?

    • @alvinphantomhive3794
      @alvinphantomhive3794 4 years ago

      I think they can be any value, since
      the weight update (delta w) is taken from the difference between the estimated value and the real value, multiplied by some epsilon (step-size) along the gradient, over and over until the value converges to the global optimum.

  • @shivinsaluja6085
    @shivinsaluja6085 6 years ago

    We update the values of the weights in Monte Carlo with value function approximation (incremental methods) at the end of multiple episodes, right? And how does it solve the problem of space? Can someone please clarify?

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      The problem of space is called the curse of dimensionality. A function approximator (linear features, neural net, wavelets, etc.) uses way fewer trainable parameters than the total number of states (which could even be continuous). Note that by means of a neural network, you could compute a good estimate of the value of infinitely many states with only a small number of trainable parameters. Why? Because a neural network is capable of interpolating and generalising if the architecture is chosen correctly. That's the beauty!

  • @sampathkumar1010
    @sampathkumar1010 6 years ago

    At 47:15, why can't the second part, gamma*v(S',w), be left in terms of w, and a new gradient be found with respect to the full function for optimization, instead of just taking the gradient part from v(S,w)?

    • @sam41619
      @sam41619 6 years ago

      I guess the gradient will not behave properly because we haven't seen the *actual effect* of being in state S'. In other words, we can't know (ahead of time) how the predicted value of the next state v(S',w) will change as a function of state S' and w... since we don't yet know if being in state S' will be useful or not, OR what the direction of the gradient of v(S',w) will be, or whether it will increase or decrease with respect to S'.
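
      For concreteness, this is the "semi-gradient" point under discussion: the TD target is treated as a fixed number, so only v-hat(S_t, w) is differentiated (a sketch in the slides' notation):

      \Delta \mathbf{w} = \alpha \big( R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \big) \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})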

  • @simonhaxby830
    @simonhaxby830 7 years ago

    Does the experience set D consist of just one episode? Or multiple episodes/runs?

    • @NithinVasisth
      @NithinVasisth 7 years ago

      Multiple episodes/runs.
      (Also, He said in the lecture that we randomize them while sampling, so it shouldn't depend on which run it is)

  • @yatint9665
    @yatint9665 4 years ago

    There is probably an extension available for your browser to increase the audio volume of the video....

  • @chengjing9029
    @chengjing9029 3 years ago +1

    Sometimes I think that I learn the truth from the gesture

  • @davideabati1903
    @davideabati1903 6 years ago

    I'm struggling with the objective function for DQNs. We're trying to learn w, and we update it towards action-value functions estimated with an older parameter vector w-. That feels wrong to me; what am I missing? I can see that we bootstrap towards the future (so we observe a reward and estimate the value at s', a'), but using w- from the past, which is (hopefully) a worse estimate of the optimal parameter vector w* than the estimate I have now.

    • @sam41619
      @sam41619 6 years ago

      I felt the same when I heard this idea.

    • @robosergTV
      @robosergTV 6 years ago +1

      Did you watch the lectures on regular tabular Q-learning? We did the same thing there. While it may feel wrong: 1. you can come up with a better method and publish a paper; 2. we still inject information from the real reward, thus making our predictions more accurate; that's why it all works.

    • @davideabati1903
      @davideabati1903 5 years ago

      @@robosergTV Thank you for your answer. Still, to this day, I feel like I'm missing something.
      I studied and understood table-lookup Q-learning, and I want to highlight that it is not the TD target that bothers me (on the contrary, I am amazed by how TD learns about the return without ever observing it).
      What I felt odd about is the bootstrap towards *old* Q-targets, parametrized by w-.
      After all, I'm trying to improve my estimate of w, so I want to improve over w-.
      Anyway, I can grasp that, as you suggest, the observation of the unbiased reward R ensures an improvement.
      Cheers

    • @amnagrawal
      @amnagrawal 4 years ago

      @@davideabati1903 I am facing the same dilemma. Here the prof says that we fix the old target for, say, 1000 iterations. How do we ensure that the model trained during these iterations improves when the error gradients themselves point towards an older, worse estimate of Q*? In tabular Q-learning he said we take an alternate action as the target, which I interpreted from an exploration perspective. Please tell me if I'm going in the wrong direction.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago +2

      This is one of the most interesting questions in this lecture's comments. My first observation is that in the look-up table case, the Q-learning updates are the same as for the DQN objective (with corresponding indicator function features, obviously) as long as WE DON'T FREEZE THE OLD PARAMETERS. Therefore the question of why freeze the old parameters at all is very natural. My second observation is that in the look-up table case, we are not moving the objective much at every step, in the sense that we only modify one pair of state-actions, so gradient descent won't go crazy by following a constantly moving objective. Note that although only one pair is moved, the change is clearly more violent in the table look-up case. Another reason why gradient descent will not go crazy in this case is the obvious one: it converges because it coincides with Q-learning and Q-learning converges under certain conditions.
      Now, given the two previous observations, I will summarise the intuition that I have. The look-up table Q-learning updates follow the Bellman optimality equation and converge under certain conditions, so in a sense, it is a proper way of making updates, of going towards the truth. However, when substituting those greedy updates by a gradient descent step, who guarantees that convergence will be retained? What is the theoretical basis? After all, a gradient descent step is smooth, it is all but greedy and violent in general (contrarily to the real Q-learning objective). My claim is that in order for convergence to be retained, gradient descent must update somehow similarly to look-up table Q-learning. However, whereas look-up table Q-learning updates one pair violently in a greedy way, a gradient descent step updates all the weights very slightly, and hence, if you want to converge towards your real objective, you probably need several of those updates in the full weight space. One update is not enough for you to know where the hell those weights are smoothly trying to go towards. Basically you need to approach that violent greedy action a bit more, since a single step might not be representative of that greedy improvement. Also, since after every gradient descent step, all the weights are slightly changing, this is probably not to good for convergence of gradient descent. Gradient descent is normally used when objectives are stable, not when objectives are literally changing every step. So it is good to wait for some steps and see where that cumulative gradient update is actually really wanting to go, and once you are convinced, you can change the weights :P.

  • @Zeeblog
    @Zeeblog 2 years ago

    I like the convention of referring to the information about the value of an action being given to us by an "oracle". It makes me think of the old woman from the Matrix saying "Hmmmm, I would've made pacman turn left there honey".

  • @willbrenton8482
    @willbrenton8482 2 years ago

    Anyone have some good resources for learning about sourcing feature vectors (23:30)?

  • @SSS.8320
    @SSS.8320 5 years ago

    I can't get the intuition for why the eligibility trace becomes Et = .... + gradient of ^v(s,w) or gradient of ^q(s,a,w) w.r.t. w. Can someone explain? Thank you very much in advance.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      Either (1) think of it as a generalisation of the look-up table case, in which case it is clear that they coincide, since the gradients are delta functions (i.e. x(S_t) is all zeros except at the component corresponding with state S_t, where it is 1), or (2) compute it by following the same ideas as when showing that the backward view of TD(lambda) is equivalent to the forward view (this is in the extra material in the slides for lecture 4). I think the equivalence is only true in the linear case :). If you understand the proof in lecture 4, you should understand this.

  • @timdockhorn5834
    @timdockhorn5834 5 years ago

    Any idea what kind of features are used for the mountain car example? In my understanding, the user has to choose the features?

    • @exitudio
      @exitudio 5 years ago

      The feature vector is the output of the network (e.g. linear regression, neural network). You don't do it manually. But you may set the size of the network output and let the network optimize the feature vector by itself.

    • @alvinphantomhive3794
      @alvinphantomhive3794 4 years ago +1

      As he said, the features consist of the current position and velocity of the car, e.g. when the car is at position 32 and accelerating at 80% of the maximum speed, then the car ends up at position x with state-value function = 8,
      and that value function is used to improve the new policy and optimize future actions.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      The features are coarse coding in the first pictures (i.e. there are many circles and, for each circle, a feature is defined as 1 for states inside the circle and 0 otherwise), and radial basis functions (probably Gaussian densities with different means) for the other image where there is handwritten text. In both cases, the function approximators are linear.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      ​@@alvinphantomhive3794 You are confusing features with states.
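
      A minimal numpy sketch of coarse coding as described above (the circle centres, radius, and scaling are arbitrary illustrative assumptions): each binary feature fires when the continuous (position, velocity) state falls inside its circle, and the value estimate is linear in those features.

      import numpy as np

      # Hypothetical bounds for mountain car (position, velocity) and circle layout.
      lo = np.array([-1.2, -0.07])
      hi = np.array([0.6, 0.07])
      centres = np.array([(p, v) for p in np.linspace(0.0, 1.0, 5)
                                  for v in np.linspace(0.0, 1.0, 5)])  # normalised units
      radius = 0.3                                   # assumed circle radius (normalised)

      def coarse_features(state):
          """One binary feature per circle: 1 if the normalised state lies inside it."""
          z = (np.asarray(state, dtype=float) - lo) / (hi - lo)   # scale to [0, 1]^2
          return (np.linalg.norm(centres - z, axis=1) < radius).astype(float)

      w = np.zeros(len(centres))                     # one weight per circle
      v_hat = w @ coarse_features((-0.5, 0.0))       # linear value estimate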

  • @sanjanavijayshankar5508
    @sanjanavijayshankar5508 3 years ago +2

    46:49 this explanation though.

  • @manolislykos
    @manolislykos 3 years ago

    Can someone explain the difference between gradient TD and the algorithms that were shown earlier in the lecture?

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      I am guessing gradient TD applies compatible function approximation in order to make updates in the real descent direction, but I would love to see a reference of what he means by gradient TD or gradient Q-learning. Again, I strongly suspect that they come from some improvements due to the policy gradient theorem by Sutton.

  • @akshayrana9351
    @akshayrana9351 4 years ago

    Another tutorial that explains Experience Replay: ua-cam.com/video/AohtWrJwvgg/v-deo.html

  • @feishawnshanks775
    @feishawnshanks775 5 years ago

    No subtitles?

  • @ImtithalSaeed
    @ImtithalSaeed 6 years ago

    Where can I find the true value function to compare against to compute the error?

    • @sam41619
      @sam41619 6 years ago +1

      The true value comes as the trajectories are unrolled; make the *function mapping* (approximation) of the last sample/episode the target for the next. It's just like supervised learning, except now the data is correlated (it is kind of a sequence).

  • @a90440654
    @a90440654 4 years ago

    Hi, guys. I'm wondering if feature vectors need exactly the same number of columns as the number of states.

    • @luisbanegassaybe6685
      @luisbanegassaybe6685 4 years ago +2

      No, definitely not. The number of states is arbitrary, and a feature vector is canonically a 1-column vector with many rows. For example, the feature vector for chess could have 1 row for every square on a chess board, and the entry in that row could be a number representing what kind of chess piece is occupying that square. Opponent's rook, my queen, empty space, my opponent's knight, et cetera.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      Of course not. Typically, it is ill-advised to do that, since that would imply that your number of trainable parameters is equal to the number of states, and the purpose of function approximation is precisely avoiding the curse of dimensionality of huge state spaces by using a much smaller number of trainable parameters. It's like you are back in the look-up table case. What a torture 😂

  • @riccardoandreetta9520
    @riccardoandreetta9520 7 years ago +13

    I guess you can't really understand the details unless you do some exercises on it. I still haven't found good videos about this.

    • @AbhishekSharma0318
      @AbhishekSharma0318 6 years ago +11

      Check this repository github.com/dennybritz/reinforcement-learning if you want to learn how to implement these ideas in Python.

    • @luecu
      @luecu 5 years ago

      Abhishek Sharma can not find it :(

    • @IgorAherne
      @IgorAherne 5 years ago +1

      To understand temporal difference, have a look at this video ua-cam.com/video/w33Lplx49_A/v-deo.html
      in particular 56:40

    • @piyushjaininventor
      @piyushjaininventor 5 years ago

      Link is broken

    • @matsapo1
      @matsapo1 5 years ago

      @@piyushjaininventor Remove the ending parenthesis and the link will work.

  • @yota07
    @yota07 5 years ago +2

    Great course! The only negative is that each lecture is too long, so I tried to watch it at a higher speed. You can still watch at 1.5x speed and each lecture becomes around an hour.

    • @manishandu
      @manishandu 5 years ago +1

      Graduate level classes are usually 2 hours long.

  • @brandomiranda6703
    @brandomiranda6703 5 years ago

    What is E_{pi}[ (v_pi(S;w) - v_hat(S;w))^2 ]? Is it the true expectation over trajectories of the stochastic process given by the MDP, i.e. s1,a1,r1, s2,a2,r2, ... ~ MDP?

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      v_pi does not depend on w, but it is the true expectation, yeah. Trajectories are sampled from the MDP. If you want to be more rigorous, the density is the discounted state distribution
      rho^pi (\rho^\pi) defined in the Preliminaries, page 2 of
      proceedings.mlr.press/v32/silver14.pdf
      (Deterministic policy gradient algorithms)
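
      Written out (a sketch; rho^pi is the state distribution under pi that the reply refers to), the objective in question is

      J(\mathbf{w}) = \mathbb{E}_{\pi}\big[ (v_{\pi}(S) - \hat{v}(S, \mathbf{w}))^2 \big] = \int_{\mathcal{S}} \rho^{\pi}(s) \, \big( v_{\pi}(s) - \hat{v}(s, \mathbf{w}) \big)^2 \, ds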

  • @karolinadrabent2088
    @karolinadrabent2088 1 year ago

    wow

  • @eason2694
    @eason2694 7 years ago +1

    Why does Monte Carlo evaluation converge to a local optimum?

    • @williamtan3857
      @williamtan3857 6 years ago

      Because the target in MC learning is the real return (Gt).

    • @siddhartha3457
      @siddhartha3457 6 years ago +3

      This is solely due to function approximation. The value function may not be a linear combination of the features, and so, MC with linear function approximation cannot converge to the global optimum in this case. "An analysis of temporal-difference learning with function approximation" proves that MC TD methods converge to the least squares solution when this is the case.

  • @levin6975
    @levin6975 4 years ago

    37:00

  • @emrahe468
    @emrahe468 6 years ago

    The question at 1:22:00 was like gold.

  • @chane1950
    @chane1950 4 years ago

    Someone needs to run an independent component analysis algorithm and filter out the irritating background noise.

  • @TheMikiwar
    @TheMikiwar 4 years ago +2

    1:11:01 didn't know Boris Johnson was interested in RL

    • @toanoopie34
      @toanoopie34 3 years ago

      Oh wow, it really does sound like him.

  • @ErwinDSouza
    @ErwinDSouza 5 years ago

    For Experience Replay, how long do we hold on to old data? Do we flush it away after every update of the network? If so, then we lose valuable information from the early explorations and the network might unlearn that data. On the other hand, it's not easy to reuse the old episodes, because those 'target' values were computed based on the output of old networks that may now be outdated.

    • @AdamCajf
      @AdamCajf 3 years ago

      We don't flush it. It depends on what the size of the Experience Replay Memory is. It can be implemented as a cyclic buffer of a bounded size that holds the transitions observed recently: pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

  • @yuxinzhang9403
    @yuxinzhang9403 2 years ago

    What if we wrap the input of the neural network in a function and treat each such function as a state? That way it would be an MDP again, and the lookup table method still works for choosing the right function to call; a program can program itself by solving this MDP.

  • @johnhefele5432
    @johnhefele5432 6 years ago +1

    I would have liked this video, but then it wouldn't be 420

  • @ishansrt
    @ishansrt 4 years ago

    19:17 I died! Best ever physical interpretation of a mathematical equation

  • @woncoh1
    @woncoh1 5 years ago

    37:09