33:40 Could someone explain this? x5 is the probability of taking the action to go right to receive the reward (as they're the variables for the action probabilities at 27:52), so why do we do .2 * x5? As .2 and x5 are both the probabilities of going right, why are we multiplying them by each other? Why do we multiply the probability of going right by itself? It doesn't make sense. :/
Great question, this is a confusing part, and you are correct, we should not be multiplying by both .2 and x5. I probably should have made it clearer at that part of the video that that example is what should not be done. The reason I started off showing the wrong way to do it is that at 33:40, where I am trying to figure out how to estimate that Y value, I wanted to show how someone might go about figuring out how to estimate that value if they didn't already know the correct method, so I pretend I don't actually know how to estimate it and I guess at a solution (which at first is the wrong solution, but I figure out how to correct it later). My first attempt at a solution is to sample episodes, and each time I take an action, I add to my sum that action probability (as a variable) multiplied by the reward received after taking that action. The sum I actually estimate with this method is the Ŷ value at 33:40, which as you noticed is an incorrect estimate of Y. I have an extra multiple of n, and I am multiplying by the action probabilities twice, once as a constant value and once as a variable. So to correct for this, in my next attempt at estimating the sum, I divide the whole thing by n, and also, each time I take an action, I add to the sum that action probability (as a variable) divided by that action probability (as a constant value) multiplied by the reward received after taking that action. By dividing by that action probability as a constant, I cancel out the implicit multiplication by that action probability that comes from the fact that I only add this new value to the sum after having already taken that action. This corrected estimate can be seen as the Ŷ value at 35:40, and this method is how to correctly estimate Y. I hope this helps. Let me know if there is anything you are still confused about.
Thanks for the amazing lecture. I have a question. At 32:41 I can't understand the green term. I had thought that 0.7 is the value of the parameter x1 and that we change x1 to get the best policy, but at that point in the video Ŷ has n * 0.7 * x1 * [avg return after x0]. Could you elaborate on why we have to multiply by 0.7?
At that point in the video (32:41), Ŷ is not the correct estimation of Y. I probably should have clarified that by renaming that Ŷ to be something like "an initial guess of Ŷ". I was trying to show how you could go about trying to figure out a way to estimate Y by coming up with an initial guess for Ŷ, and then modifying that initial guess as needed to make it correct. For the initial guess, I intentionally use the wrong algorithm of "sample episodes, and then for each action, add to a sum that action's probability (as a variable) times the reward received after taking that action, and that sum will be our initial guess of Ŷ". After this initial guess, I see that the sum will be incorrect because it has the extra "n" multiple and the extra "0.7" multiple. So to correct for this, I update the algorithm by adding the step: "also, divide each item in the sum by n (the total number of episodes sampled) and by the constant value of the action probability (which would be 0.7 when sampling the x0 action)". After these changes, Ŷ (our sampled estimate of Y) is correct, meaning that it will tend to converge towards Y as we sample more episodes. And we have to divide by 0.7 (the value of x0) instead of by the variable x0 to get the correct estimate of Y. This is because if we divided by x0, it would cancel out the x0 variable instead of the 0.7 value, and then Ŷ would not look like the Y equation above it that we are trying to estimate. Does that help? Let me know if you have any follow-up questions.
Thanks, glad you liked it. I haven't been making videos lately, but I appreciate the suggestion of making more RL videos, I may make more in the future. Also, I currently don't have the images I used for this video in slide format.
@@elliotwaite I am watching the video for the second time. At 7:05, how did you calculate the expected return? Based on my calculation, for the given policy, the expected return for state 0 is (1 * 3) + (1 * 0.1 * -10) = 2.9, but you obtained -0.7.
@@Torabi_ahmad18973, there are different ways to break up the sum of the expected return. I was summing over all the paths, and for each path, adding the probability of taking that path times the total reward received along that path. There are two possible paths. Path A has transition probabilities (1, 1, 0.1) giving a total probability of 0.1, and the total reward on that path is (3 - 10 = -7). And path B has transition probabilities (1, 1, 0.9, 1) and a total reward of (3 + 0 + 10 = 13), so to get the expected return we sum over both paths, which is ((0.1 * -7) + (0.9 * 13)), which equals (-0.7 + 11.7 = 11). If you want, you can also break it up by each reward and the probability of getting that reward. So there is a 1.0 chance of getting the reward of +3, a 0.1 chance of getting the -10, and a 0.9 chance of getting the +10. So ((1.0 * 3) + (0.1 * -10) + (0.9 * 10)) = 11. You get the same result, just in a different way. Let me know if this helped clarify or if there is anything that is still confusing.
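For anyone who wants to check that calculation themselves, here's a quick Python sketch of the path-based sum described above (the transition probabilities and rewards are the ones listed in the reply; the list structure is just my own way of writing them down):

```python
import math

# Each path contributes (product of its transition probabilities) * (sum of its rewards).
paths = [
    ([1.0, 1.0, 0.1], [3, -10]),         # path A: probability 0.1, total reward -7
    ([1.0, 1.0, 0.9, 1.0], [3, 0, 10]),  # path B: probability 0.9, total reward 13
]

expected_return = sum(math.prod(probs) * sum(rewards) for probs, rewards in paths)
print(expected_return)  # (0.1 * -7) + (0.9 * 13) = -0.7 + 11.7 = 11.0
```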
I have a doubt. At 21:00, shouldn't the expected reward of each path be prob1*reward_1 + prob2*reward_2, as compared to the one used, (prob1*prob2) * (reward_1 + reward_2)?
If the example is that prob2 comes after prob1, it should be prob1*reward1 + (prob1*prob2)*reward2. Said another way, only prob1 of the time will you ever receive reward1, and only (prob1*prob2) of the time will you ever receive reward2. This is because the probability of receiving each reward is the product of all the probabilities along the path to that reward. Does that make sense? Let me know if you still have any doubts.
@@elliotwaite Yeah, my bad. So assuming we get the rewards at each particular state as well, and not all the net rewards at the end upon reaching the terminal state, the estimated reward would be (prob1*reward_1) + (prob1*prob2*reward_2). For the example that you are using, the total rewards along the path are only received after termination is reached (like chess, and unlike an MDP where at each state you get some reward), hence (prob1*prob2) * (net rewards along the way) works. I am assuming that if the first case were used, the "expected rewards" would have been different for each path, as well as the policy gradients regarding it. Would love it if you could elaborate on whether it would also work with case 1, the above example.
@@AmanSharma-bj2yc, I think I'm confused. Can you let me know if we agree on the following things? In the example I am using, the rewards are received along the way, not just at the end. It can be modeled as an MDP with a start state and terminal states. However, the MDP does not allow any cycles, so it's like a one way graph. The expected reward of the policy is the expected reward from the start state. The expected reward from any of the other states could be different. Do we agree on these points or are we thinking about these things differently?
Great video. The only thing I can't quite grasp after multiple watchings is ... what are the targets to the neural network prior to backpropagating? Can you break it down a little more for each time step? I've got the forward propagation. Just need the per-time-step targets that are required in order to calculate the partials. I'm also asking you a question about 45:00. Tell me more about the 'loss': what dimension is this? Is it d=2 because you have two actions? Doesn't the loss function require two arguments: the predicted output and the target? Thanks a lot.
Ah, I can see why this is confusing. In this algorithm we actually don't use target values. If we knew which actions we were supposed to take, we could use those actions as target values, but we don't have that information. We could also try to say that our targets are to receive infinite reward for each episode, and the farther away we are from that target, the greater our loss will be, but then all our losses would be infinity, so that doesn't work so well either. So what we do instead is just come up with a loss value directly, without using target values. And this loss just needs to be a value such that when we update the parameters of our model in the direction that minimizes that loss, it results in our policy receiving more total reward on average. The reasoning behind why we came up with the loss that we did is explained in the video, but once we have that loss, we just call loss.backward() on it directly, without using any target values, and we update the parameters of our policy using that gradient. Let me know if you're still confused about anything and I can try to explain it further.
@@elliotwaite ok this is new! We come up with a loss value directly he says . . . Is this the same as the difference between output and target value? In order to backpropagate I'm used to calculating the errors at the output nodes. What role does the loss play in the backpropagation algorithm. Thank you for your reply, very kind of you to write.
@@hoddy2001 the loss we use here is not the difference between an output and a target since we don't have a target value. This loss is just something that produces the correct gradient when we try to minimize it. Unlike a loss that is the difference between the output and a target, the loss we use here can be negative, which is impossible with an output-and-target loss that represents an error amount, because you can't have a negative error amount. But for the loss we use, a negative loss is fine, we just try to make it lower. So if it's negative, we just try to make it more negative. The main insight is that our loss isn't an error amount, it's just a value that generates the correct gradient when we backpropagate from it. Let me know if anything is still confusing.
@@hoddy2001 ... The role the loss plays in the backpropagation algorithm is that the loss is what is getting minimized. The gradient that we get from calling loss.backward() is the gradient that moves the parameters in the direction that minimizes the loss value the fastest. For example, if our model only has 3 parameters, we can think of the parameter space of that model as a 3D space, and calculating the gradient of the loss figures out which direction in that 3D space from the current parameter values will decrease the loss the fastest. Then we update the parameters by taking a small step in the direction of that gradient. If instead we have a million parameters, it's the same thing but the parameter space is a million dimensional space (which is hard to visualize, but the concept is the same).
@@elliotwaite thank you for your patience, my good man. I am in a context where I don't have access to loss.backward() and need to know computationally how to compute the partial derivatives using the loss value. I understand now that the loss is a scalar value representing what would otherwise be the value of the error function. It is a value that, when the partials are calculated with respect to it, points us in the direction of the gradient. Can you refer me to the definition of loss.backward()? How does it compute partials from this scalar value?
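As an illustration of the point above (a minimal sketch with made-up numbers, not the video's code): the loss here has no target and its value isn't an error amount, and calling loss.backward() just fills in the partial derivatives of that scalar with respect to the parameters.

```python
import torch

theta = torch.tensor([0.0], requires_grad=True)  # a single policy parameter
prob_right = torch.sigmoid(theta)                # probability of taking the "right" action
sampled_return = 11.0                            # pretend return observed after taking "right"

# No target value anywhere: the loss is just a scalar built so that minimizing it
# increases the log probability of the action that was followed by a high return.
loss = -torch.log(prob_right) * sampled_return
loss.backward()                                  # computes d(loss)/d(theta)

print(loss.item())   # about 7.62 (just a number, not an error amount)
print(theta.grad)    # tensor([-5.5000]); a gradient descent step would increase prob_right
```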
Great speech, man. But I just wished to see how to optimize the policy probabilities by hand. If we update the policy by the expected rewards, we wouldn't be going for the 13 reward soon.
I'm not sure what you mean. Can you clarify what you mean by "optimize policy probabilities by hand"? I thought the process I outlined was one that could be done by hand. Also, can you explain what you mean by "going for 13 reward soon"? The policy I described would converge towards the optimal policy of getting an average reward of 11, but it would never actually get all the way to the optimal policy since the probabilities would never be 1.0/0.0, they would always be something like 0.999/0.001, but that is just how the algorithm of policy gradients works. If you are looking for a way to jump straight to the optimal policy of 11, then you would want to use a different algorithm. However, that is usually only possible in simple environments. Once the environment gets more complicated, it's computationally more efficient to improve by doing something like the policy gradients algorithm rather than trying to find the perfectly optimal policy.
@@gorkemvids4839 Okay, sounds good. Thanks for your question, although I'm not sure if I answered it well. But feel free to ask me more questions if you have any in the future.
Thank you for the amazing video, I have a question. In the implementation part you have defined the loss as the sum of all the partial derivatives of the return at the starting state. How does this relate to the neural network weight update? Concisely, how does maximizing these partial derivatives maximize the return at the start state? Please reply to the question if possible. I have now watched this video multiple times and still cannot understand this...
The loss actually isn't the sum of the partial derivatives, it's the sum of the log probs times their associated partial returns, divided by the number of episodes. The loss does not include the partial derivatives; the partial derivatives are calculated using this loss. The reason why minimizing this loss is equivalent to maximizing the expected return per episode is what I try to explain in the video, so I feel like I would just be repeating myself if I tried to explain it here in the comments as well. Was there a part of the video that was unclear that I could help clarify?
@@elliotwaite ohhhh so what we are doing is maximizing a quantity whose partial derivatives turn out to be the same as those of the actual quantity (the return) we want to maximize. Apologies for the misunderstanding.
@@rishipatel8010 Yeah, it's a nice trick we can use when working with neural networks. Since we only actually use the gradient to update the network, it only matters that the gradient is correct, so our loss value doesn't actually have to be the true value we are trying to minimize, it can be any value that has the same gradient as the true value we are trying to minimize.
@@NeuralGamerAI yes, that is an interesting point, that vector will be a vector of ones, but we don't just use the values of that vector (which would be somewhat useless since they are all just ones, as you mentioned), we instead use the gradient of the loss with respect to that vector. Another way to think about it is that when we are backpropagating the gradient values through that operation, whatever the gradient for x is at that point will get multiplied by whatever the value of 1/x was in the forward pass, and then that new gradient will continue being backpropagated.
I wasn't expecting to see such a nice, graphic explanation. It's timely, thank you.
Thanks! Glad you liked it.
@@elliotwaite This is the best explanation, you have to create many more AI explanation videos, thank you🔥
@@ManusiaSetengahChiKuadrat thanks! I hope to make more videos at some point in the future.
This is the most intuitive derivation of Policy Gradient Theorem on the internet. Thank you for being my teacher! Every RL intro class should begin with this video.
Thank you. Much appreciated.
@@elliotwaite seriously, the best. Andrej Karpathy has got some tough competition :) Please do a follow up relevant to RLHF as used in LLMs, ideally focusing on the newer/more approachable algos like DPO. Pleeease :)
@@lmfaow00t thanks! I may start making videos again soon, but I'm not sure if I'll make videos similar to this one. However, I'll keep your suggestion in mind. Thanks for the support.
True dude, the intuitive derivation is always the best for most of the people
I tried to implement the algorithm by hand. I created a 3-layer network with one input layer (2 units), one hidden layer (32 units), and one output layer (2 units). Before each training step, I collect 100 episodes, then train the network, and I repeat this 10,000 times. Finally, the probabilities for x0, x3, x5 converge to (0.999877, 0.97339, 0.99900). This is my first reinforcement learning program. Thank you for your help, great tutorial!
Awesome! That's one of my favorite ways to better understand an algorithm, seeing if I can implement it myself. I'm glad to hear you got it working.
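For anyone curious, here's a rough PyTorch sketch of the kind of small policy network described in the comment above (the 2-32-2 layer sizes are from the comment; the ReLU activation, the Adam optimizer, and the data-collection details are my own assumptions):

```python
import torch
import torch.nn as nn

policy = nn.Sequential(
    nn.Linear(2, 32),    # 2 observation inputs -> 32 hidden units
    nn.ReLU(),
    nn.Linear(32, 2),    # 2 action outputs
    nn.Softmax(dim=-1),  # action probabilities
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def update(observations, actions, future_returns, num_episodes):
    # observations: (N, 2) floats, actions: (N,) action indices,
    # future_returns: (N,) returns received after each action,
    # all gathered by sampling a batch of episodes (e.g. 100) from the environment.
    probs = policy(observations).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -((probs / probs.detach()) * future_returns).sum() / num_episodes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```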
I can't believe I get this for free. Such an amazing visualization. I had a vague understanding of backpropagation and expected return; however, I never imagined the relationship could be this straightforward and clear.
Thank you for the kind comment 😊. I'm glad to hear it helped you develop a more clear understanding of those concepts.
Great Video! Using graphics and combining the logical explanation with pseudocode was really helpful to me. Most of the time you only see one or the other.
Thanks for the feedback. I've been wondering if adding all those different ways of explaining it was excessive, so I'm glad to hear you found it helpful.
Damn, this is like a gift to the learners of RL. You should do more videos on RL if you can, YouTube only has a handful of good RL videos.
Thanks, I'm glad you liked it. I may make more videos in the future, thanks for the encouragement.
This is the only video you need to learn the intuition and basics of Reinforcement Learning. Amazingly done! Thanks!
Thanks, Pablo!
One of the best! I'll have to watch the video a few times, but it's already helped me figure things out more intuitively. The example and animations are excellent. Great work.
Thanks!
This is a unique and awesome channel. Seeing these cool animations really helps to build an intuition. Thank you for uploading.
Watching such a quality educational video for free is insane, thank you so much.
:) I'm glad you liked it.
This is pure gold! Thank you so much for time, hard work, and energy you put into this video. It's highly appreciated.
😀 I'm glad you liked it. Thanks for the kind comment.
A perfect video, I have never seen anything so good 👏👏 Thanks from Brazil 😁
Thanks, Vinicius!
Best video on the subject. Thank you, Elliot, for creating such thorough content.
Thanks, Kanai!
Hey Elliot, Loved this explanation so much, man! Keep up the awesome work. I'm a beginner and I feel RL is often looked upon as a difficult paradigm due to the heavy math in there, but people like you are a blessing, for putting the ideas in such an outstanding fashion :)
Thanks!
Best RL learning video for beginners
Thanks!
Oh my god this is so clear, thank you! You make great visualizations!
Thanks!
This was a very helpful video for me, thank you Elliot! The toy problem with the robot was so helpful to visualize everything, especially showing the the state-action sequences with the probabilities
Thanks! I'm glad you found it helpful.
it's so entertaining and intriguing, thanks for your work, Elliot!
Thank you for a simplified great explanation. You took a lot of trouble to create this material and it shows in your presentation.
Thanks! I'm glad you found it helpful.
The best way for a beginner to understand the policy method! Thanks for your hard work.
Thanks! I'm glad you liked it.
Thank you! It is extremely helpful to see a detailed worked example. Your video is very much appreciated.
I'm glad to hear you found it helpful.
Really great video! I enjoyed the graphical representation of the gradient. Can't wait to see more in-depth policy gradient related videos.
Thanks, Mohsen!
Thank you for the super nice and intuitive explanation!
Wow, thanks YaGuang for the Super Thanks! It's much appreciated. I'm glad you found the explanation intuitive. I saw you are working on the Google DeepMind team up in Mountain View. I worked at Google in Mountain View from 2015 to 2017 and have fond memories, lots of great people. I hope you are also enjoying your time there. I would imagine it's an exciting time to be working with that team.
Thank you Elliot! Glad to hear this that we were colleagues. It is indeed an exciting time and I wish the best of luck to your startup.
@@YaguangLi Thanks!
The best policy gradient explanation
Thanks!
One of the best videos I have ever watched on RL/AI! Great work and thanks a lot.
Thanks! Glad you liked it.
Well, having watched it a few times, I can say that there's no concept from the statistics, probability, math, NN, and DL domains left untouched in this video. I'll list as many as possible:
reinforcement learning (of course), neural networks, backtracking, gradients/derivatives, chain rule, partial derivatives, probability trees, multiple random events, sampling, probability sampling, Monte Carlo, mean, standard deviation, PyTorch, and some more I can't count. Thanks to the author again.
I'm glad you liked the video. That's one of the things I love about reinforcement learning, it involves so many different areas of interesting math.
Love it. As a visual person, it's never been this clear, thank you. And we need a lot, a lot more. Keep it up, subscribed 100%
Thanks!
Keep up the good work man! This was amazing :)
Thanks!
Superb bro, this is something which is missing in most of the other videos... This really helps build intuition for beginners. Keep making such videos, thanks a ton.
Thanks, Neeraj! Glad you found it helpful.
Truly logical, clear from theory to implementation, bravo. One question I still have is "is it really true that the mathematical derivations at the start of the video tell so much?" Anyway, they didn't speak to me.
Thanks! For me, those mathematical derivations shown at the beginning of the video are hard to interpret, but after working through the logic I explained in the video, I can make sense of them. One trick that helps me is removing the logs in those equations by using the rule that "the gradient of the log of x" equals "1/x times the gradient of x". So wherever it says "the gradient of the log of P(...)" I just think of it as "1 / P(...) times the gradient of P(...)". Which can be read as "the gradient of the probability of that trajectory given the current neural network weights, divided by the probability of that trajectory given the current neural network weights." I'm not sure if that helps others, but it makes it easier for me to understand. I think they just write out the math in the log form to make the equation more concise.
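In symbols, the rule being used there is the log-derivative identity (writing θ for the network weights and τ for a trajectory):

```latex
\nabla_\theta \log P(\tau \mid \theta) = \frac{\nabla_\theta P(\tau \mid \theta)}{P(\tau \mid \theta)}
```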
This has to be the best explanation ever!
Thanks!
Excellent! This approach of developing the intuition works best for me. I often wonder how anyone ever came up with the end product of something like this? A bunch of dense math as shown in the beginning? I suspect it starts with intuition like this and then is formalized into equations. I think that to improve something or invent new things, one has to go logically and intuitively first, highly formalized math second. When things are too abstract from the start, it's too easy to get lost in the weeds. Thank you for these videos!
I agree. However, sometimes I figure out new things (or at least new to me) by working out the math first, and then after I see that the math leads to a simple result, I try to figure out an intuitive reason for why it does. So I think both modes, working intuitively and just working with the math symbols, can be useful for exploring new ideas. But I agree that just working with the math symbols feels a bit like exploring in the dark, and it's hard to know where to go next without turning on the lights of intuition.
This is simply the best video I have seen on this topic this year!
Thanks!
Very nice video. I searched for a long time for an explanation and finally found it.
Thanks, kolo!
It's a shame I can only like this once. Subscribed. Looking forward to your future content.
Thanks, Shunz!
This explanation is so insightful.
@@user-wr4yl7tx3w Thanks. I'm glad you liked it.
BEST ONE SO FAR I HAVE SEEN. SO INTUITIVE.
Thanks!
Thanks for this video
The best explanation anyone can ask for
Really amazing
Thanks!
Thanks a lot for making me understand the policy gradient theorem. Super intuitive
@@sidhiquepynkulam Thanks! I'm glad to hear you found it intuitive.
This video is a life saver. Thanks a lot for the great explanation
Glad to hear you liked it. Thanks for the comment.
This is really awesome... the way you explained it was just perfect.
Thanks, Ruchin! Glad you found it helpful.
Amazing video! Thank you!!
Wow... this video was simply amazing! Thanks a lot... like others have mentioned, I wasn't expecting this! :)
Thanks! I'm glad you liked it.
This was a phenomenal video and should be required viewing for any class on machine learning. Any chance you are considering a video about policy gradients with continuous action spaces?
Thanks, Andrew! I don't have any plans for making new videos in the near future, but I may eventually, and that would be a good potential topic to cover. I'll add that to my list of future video ideas. Thanks for the suggestion.
Just awesome!! Thanks for all your efforts.
This is awesome! Thank you for this detailed explanation!
Really like your explanation!!! Great video and thanks for sharing!!!
Amazing video, clear and very detailed explanation 👍. Please keep doing such a beautiful educational job.
Thanks!
Amazing video and good explanation.
Thanks!
At 12:44, if we want to calculate dR/dx, you changed 0.7 to x, but you still need to replace 0.3 with 1 - x, otherwise you cannot differentiate.
This is one of the counterintuitive parts. I start explaining it at 13:49. This dR/dx is only calculating the first step in the backward pass, as in we are only calculating the partial derivatives of the expected return with respect to the outputs from our neural network (as in the outputs from the softmax function at the end of our neural network). So in this step, we ignore the fact that the probability of going right (prob(right)) has to equal 1 - prob(left) because we don't make any assumptions about what the outputs of our neural network are; we don't assume they are probabilities that have to add up to one. It's only after this step, when we continue backpropagating these gradients through the softmax function of our neural network, that this restriction is taken into account, that prob(left) has to equal 1 - prob(right). This is because it is the softmax function that normalizes the probabilities to require them to add up to 1. So after the gradients get backpropagated through the softmax, they will be correct, even though they start off looking a bit strange, since we start off not assuming anything about our neural network outputs, meaning we start off ignoring the fact that they have to sum to 1.
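Here's a small PyTorch sketch of that two-step picture (the 0.3/0.7 probabilities match the video's example, but the logits and the toy expected-return expression are my own stand-ins): the gradient with respect to the softmax outputs treats the two probabilities as independent, and only the gradient with respect to the logits, after backpropagating through the softmax, reflects the sum-to-1 coupling.

```python
import torch

logits = torch.tensor([0.0, 0.8473], requires_grad=True)  # softmax gives roughly [0.3, 0.7]
probs = torch.softmax(logits, dim=0)
probs.retain_grad()                                        # keep the gradient at this intermediate step

expected_return = probs[0] * (-1.0) + probs[1] * 2.0       # toy expected-return expression
expected_return.backward()

print(probs.grad)    # tensor([-1., 2.]): each output treated independently, no sum-to-1 constraint
print(logits.grad)   # roughly tensor([-0.63, 0.63]): after the softmax, the coupling shows up (it sums to 0)
```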
Wonderful Explanation of Policy Gradient. Please do Actor Critic methods also!
@@inforoundup9826 Thanks. I haven’t been making videos lately, but I may again eventually. Thanks for the recommendation.
Thank you! Such a clear explanation. I finally understand this!
Very nice and intuitive video, thank you!
I wish you visualized several iterations of the optimization procedure to show us the convergence of the algorithm to the desired policy. Maybe we could get a better understanding of exploration/exploitation from your amazing example.
Thanks for the suggestion. That's some good feedback for future videos.
I would pay good money for this video
Thanks! I'm glad you found it valuable.
Thank you so much for this video! You are a great teacher :)
Thanks, Till! Glad you liked it.
Great explanation, everything makes sense now 😃
@@kashifzaheer7804 thanks, that's great to hear.
Loved this - Please do a video on contextual bandits too!
Thanks for the comment and suggestion. I do like bandit problems. I haven't been making videos recently, but maybe I'll get back into it eventually. I was first introduced to bandit problems in David Silver's RL course lecture here, which you might also like: ua-cam.com/video/sGuiWX07sKw/v-deo.htmlsi=4MtbQCR4kO4FL-DD
@@elliotwaite Thank you! I actually finished that series up - it's amazing :) But even that covers contextual bandits only very briefly (video #9 in the series). And if I liked anything on RL after that lecture series, it's your video - kudos to you on it!
@@pk_1320 Thanks!
Phenomenally explained, many thanks.
Thanks, Björn!
Yesss! Awesome video
Thanks!
Great explanation! so intuitive
Thanks!
Marvelous explanation!
Thank you for this elaborate explanation.
I'm glad you liked it.
Awesome explanation!
This is brilliant
Thanks!
This was incredible, thank you
Thanks, Matt!
THIS IS GOLD!
Thanks!
what an excellent lecture thanks
Thanks!
This video really helps! Thx Elliot!
Thank you, John Cena.
Great video! Can you make another one like this (with animations instead of maths) that demonstrates how LLMs work (including fine-tuning)? I don't understand math at all but am very interested in learning more about LLMs. I just cannot understand the math and statistics behind it.
Thanks! I haven't been making videos lately, so I probably won't make an LLM video soon. But I appreciate the suggestion.
Great work! Thank you!
🙂 I'm glad you liked it.
Thank you very much! I finally understand gradient!
Hi Elliot, do you have sample code, say using PG to play games on Gym? I can't find anything that explains it as cleanly as you do!!
Great video!
Thanks!
TY very much! It was really good!
@@eduardtsuranov712 thanks!
This is really helpful thank you.
😄 I'm glad you found it helpful.
This is just amazing. One of the best explanations out there for vanilla PG. I do have a question though. At 48:20, you write probs/probs.detach() to detach probs from backpropagation. However, wouldn't that just return 1 since we're dividing it by itself? Shouldn't it be just probs.detach()?
That part of the code does look odd, and you are right, dividing the value by itself will equal one. But if we replaced `probs / probs.detach()` with just `probs.detach()`, then that line would be:
loss = -(probs.detach() * future_returns).sum() / num_episodes
...but when you detach a tensor, it becomes equivalent to a constant (and by that I mean the tensor will no longer backpropagate any gradients through it). And since `future_returns` came from the environment, it's also a value that we can't backpropagate through, and `num_episodes` is a value that didn't depend on our policy, so it's also a constant that we can't backpropagate through. So if we called `loss.backward()` using the above `loss`, it wouldn't do anything because all the values that were used to calculate that loss were treated as constants.
It's like saying if `loss = -(3 * 5) / 7`, how will changing the parameters of our policy change the loss, and the answer is that it won't since all the values used to calculate the loss are constants. But when we keep the attached `probs` in the equation, it looks more like this: `loss = -((policy_output / 3) * 5) / 7`, and now we can see how changing the output of our policy will change the loss, and we can calculate a gradient for the parameters of our model.
So then the question is why do we also need the detached probs, why not just use:
loss = -(probs * future_returns).sum() / num_episodes
...and the reason is that dividing by the detached probs balances out the fact that we are expected to only sample those specific actions with a probability of `probs`. For example, let's imagine we only have two actions, A and B, and that our current policy samples A 99/100 times and B only 1/100 times. And that action A returns a future reward of 1.01 and action B returns a future reward of 1.02. So even though B is sampled much less often, it's actually the slightly better action to take, and our gradient should favor taking action B more often. So now let's imagine we sample 100 actions and as expected we sample action A 99 times and B only 1 time. Then if we used the loss directly above, our loss would look like this:
loss = -(.99 * 1.01 + .99 * 1.01 ... (that repeated a total of 99 times) ... + .01 * 1.02) / 100
But the `.99` and the `.01` are actually the outputs of our model, which are the prob for A and the prob for B, so we can think of it more like this:
loss = -(prob_A * 1.01 + prob_A * 1.01 ... (that repeated a total of 99 times) ... + prob_B * 1.02) / 100
So now if we ask how we can minimize this loss, the answer will be to increase prob_A, because that will make this sum more negative, but this isn't what we want. We actually want to increase prob_B since it gives a better future return.
So let's instead try it again but this time also with the `probs.detach()` values, it will look like this:
loss = -((prob_A / .99) * 1.01 + (prob_A / .99) * 1.01 ... (that repeated a total of 99 times) ... + (prob_B / .01) * 1.02) / 100
And now if we ask how can we minimize that loss, the answer will be to increase prob_B (it has a slight edge now over increasing prob_A). We can see this more clearly if we simplify the sum:
loss = -(99 * (prob_A / .99) * 1.01 + 1 * (prob_B / .01) * 1.02) / 100
And now if we replace dividing by .99 with multiplying by 100 / 99, and also replace dividing by .01 with multiplying by 100 / 1, we get:
loss = -(99 * (prob_A * 100 / 99) * 1.01 + 1 * (prob_B * 100 / 1) * 1.02) / 100
And the 99s cancel out, and the 1s cancel out, and the 100s cancel out, and we are left with:
loss = -(prob_A * 1.01 + prob_B * 1.02)
And now we can clearly see that increasing prob_B rather than prob_A will make the loss more negative.
So we need to divide our probabilities by the detached versions of themselves to compensate for which probabilities will be sampled more often or less often. Dividing by `probs.detach()` is like saying, "we are only going to sample this action with a probability of `prob`, so we need to boost the weight of this sample in our loss by the inverse of that probability (if it's a low probability, boost it by a lot, and if it's a high probability, only boost it by a little)."
So in `probs / probs.detach()`, the numerator is there so that we can propagate gradients, and the denominator is there to boost those calculated gradients by the inverse of those probabilities.
I hope this helps.
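For anyone who wants to see this numerically, here's a small PyTorch sketch of the example above (the 99/1 sample counts and the 1.01/1.02 returns are from the explanation; the two-logit softmax policy is just my own stand-in for a model that outputs those probabilities):

```python
import math
import torch

logits = torch.tensor([math.log(99.0), 0.0], requires_grad=True)  # softmax gives [0.99, 0.01]
probs_all = torch.softmax(logits, dim=0)

actions = torch.tensor([0] * 99 + [1])                  # action A sampled 99 times, B once
future_returns = torch.tensor([1.01] * 99 + [1.02])
probs = probs_all[actions]                              # probability of each sampled action
num_episodes = 100

loss_without_detach = -(probs * future_returns).sum() / num_episodes
loss_with_detach = -((probs / probs.detach()) * future_returns).sum() / num_episodes

grad_without = torch.autograd.grad(loss_without_detach, logits, retain_graph=True)[0]
grad_with = torch.autograd.grad(loss_with_detach, logits)[0]

print(grad_without)  # descending this gradient pushes prob_A up (the wrong direction)
print(grad_with)     # descending this gradient pushes prob_B up (the slightly better action)
```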
@@elliotwaite Hey, thank you SOOO much for this detailed explanation. You are just too awesome, and to properly explain like this is really commendable. So we can say that prob/prob.detach() is something like a normalizing term? Because the equation after prob/prob.detach() looks similar to a supervised learning technique where we are finding the loss on the basis of how good the prediction was. Did I get that right?
@@sarvagyagupta1744 yeah, the 1/probs.detach() part can be thought of as a kind of normalizing term. However, I usually think of normalizing as making sure a sum adds up to 1, which I don't think that term does, so I think of it more as a re-weighting term, or a balancing term. But yeah, it's very similar to the supervised learning equation, but because we aren't given the dataset, but instead are generating the dataset by sampling actions using the probabilities output from our current policy, we have to add that 1/probs.detach() term to compensate for that difference.
@@elliotwaite yeah. I kinda felt that I was using the term very loosely. I was thinking of it as clipping instead of normalising. So is that a better comparison? Like how we do in PPO?
@@sarvagyagupta1744 I think of clipping as something that is conditionally applied. For example, in PPO-clip, the clipping only kicks in if the new policy starts to get too far away from the old policy. Whereas the 1/probs.detach() term is always applied. So for me, clipping means something different. But that might just be how I think of that word.
Good work bro.
Thanks!
excellent tutorial !!! thanks a ton
Thanks. Glad you liked it.
Thanks for the great video and detailed explanation! I'm curious about the reason for multiplying by e.g. 1/0.7 at 36:30. Is it just the strategy we apply during sampling so that later the sum and the calculation of the gradient will equal the derivation we got theoretically? Thanks!
Yep. You can also think of it as balancing out the fact that we are sampling actions. Because when we add a value to our total sum, it is kind of like a nudge to our policy. As in, if we get a positive reward after an action, we nudge our policy to take that action again, and if we get a negative reward, we nudge our policy to not take that action again. But we want our nudges to be balanced, so that if we are getting the same positive reward for going left as we do for going right, we want the nudges to go left more to equal the nudges to go right more, so that the policy doesn't actually change (because when there is no benefit to going left or right, we don't want the policy to change). So, for example, if our policy was left 0.1 and right 0.9, we'd sample going left 1 out of 10 times and right 9 out of 10 times. So when we add our going-left sample to our total sum, we multiply the value by 1/0.1 (or we multiply that nudge by 10), and when we add our going-right samples to the total sum, we multiply the value we add by 1/0.9 (or multiply those nudges by 10/9), so that the effects of going left and going right are considered equally (the going-left nudge is multiplied by a bigger number to account for the fact that it is sampled less often).
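A tiny PyTorch sketch of that balancing effect (the 0.1/0.9 policy is from the reply above; the equal +1 rewards and the softmax parameterization are my own toy choices): with the 1/prob weighting, an "average" batch of samples (1 left, 9 right) produces a zero gradient, so the policy isn't nudged in either direction.

```python
import math
import torch

logits = torch.tensor([0.0, math.log(9.0)], requires_grad=True)  # softmax gives [0.1, 0.9]
probs_all = torch.softmax(logits, dim=0)

actions = torch.tensor([0] * 1 + [1] * 9)   # left sampled 1 out of 10 times, right 9 out of 10
rewards = torch.ones(10)                    # the same reward either way
probs = probs_all[actions]

loss = -((probs / probs.detach()) * rewards).sum() / 10
loss.backward()
print(logits.grad)  # approximately tensor([0., 0.]): the weighted nudges cancel out
```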
it helps me a lot, thank you
Dude you're awesome!
😊
Fantastic video, brilliantly explained. One doubt: the gradient formula derived in the video is the product of two parts. The first part is the product of the action probabilities along the path to a node, and the second part is the expected reward we get by taking an action at that node. But in the program we only take the log of the action probability at that node; it doesn't depend on the parent probabilities along the path. Is my understanding correct?
Yes, that is correct. When we are sampling episodes during training, we don't have to multiply the action probabilities for the current node by the action probabilities of the parent nodes. This is because those probabilities will already be factored in by the fact that we are sampling episodes. For example, if from the root node our action probabilities are 10% left and 90% right, we will only be entering that left node 1 out of 10 times on average, and so we will only be adding the data for the next action we take from that left node 1 out of 10 times. This means the 10% probability from the parent node is already factored in, since the data for the current node will only show up in 1 out of 10 episodes and we are averaging over all that data to calculate our gradient. Basically, the probabilities of the parent nodes get factored in by the fact that we will only end up in a node proportionally to the product of all the probabilities of its parent nodes.
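As a quick sanity check of that idea (a made-up simulation, not code from the video), the left node's data naturally shows up in roughly 10% of episodes, which is exactly the weight the parent's probability would have contributed:

```python
import random

# Root node: 10% chance of going left, 90% chance of going right. We only
# collect data for the left node on episodes where we actually ended up there.
episodes = 1_000_000
left_visits = sum(1 for _ in range(episodes) if random.random() < 0.1)

# The left node's data appears in roughly 10% of episodes, so when we average
# over all episodes, the parent's 0.1 probability is already baked in.
print(left_visits / episodes)  # ~0.1
```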
I hope that makes sense. Let me know if you have any further questions.
Nice one!!!
Thanks!
Thanks for the video! But I'm confused about why the formula used in the log-probability code at the end, at 37:22, is different from the formula at 38:05 (the one with the log). These formulas have different gradients for the same variables.
Are you talking about the difference between the one at 37:22 that starts with Ŷ = (1 / 4) * (log(x0) * [3 + 0 + 0] + ...), and the one at 38:05 that starts with Ŷ = (0.7) * log(x0) * [avg return after x0] + ... ? If so, the first one is what we got from sampling 4 random episodes from the environment, and the second version is what we would expect to get on average if we kept sampling more and more episodes, meaning, the value would converge to that equation. So those equations do have different gradients. The second equation gives us the gradient we actually want, and the first equation just gives us an estimate of that gradient, using only 4 random episodes.
The first equation is an example of what we will actually get in practice when we sample episodes, and the second equation is the rationale for why using the first equation works, because on average, it will point us in the direction we are trying to go.
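Here's a tiny made-up example of that idea (not code from the video): a small number of sampled episodes gives a noisy estimate, but the estimate converges to the expected value as we sample more.

```python
import random

# Suppose the return is +13 with probability 0.9 and -7 with probability 0.1,
# so the true expected value is 0.9 * 13 + 0.1 * -7 = 11.
def sampled_estimate(num_episodes):
    total = sum(13 if random.random() < 0.9 else -7 for _ in range(num_episodes))
    return total / num_episodes

print(sampled_estimate(4))          # noisy, like the 4-episode estimate
print(sampled_estimate(1_000_000))  # converges toward 11
```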
Let me know if this didn't answer your question, or if you have any other questions.
@@elliotwaite Thanks a lot!
Hey Elliot, for the observation of each state, we define "diamond exists" as 1. What if we have 2 diamonds? Would we define it as 2, and each time the robot gets a diamond, reduce the observation by 1? Another thing: how do we know the probability of going left or right? Do we have to define it ourselves as well? Anyway, this video is a really great explanation, I like it.
Deciding how to represent your observation data seems to be a bit of an art at the moment. I just decided to use [1 = exists, 0 = doesn't exist] to keep the example simple, but I could have instead made the observation data be the distance to the closest diamond, or something else. If there were 2 diamonds, you could have the data be the number of diamonds that exist, or you could add another input value that only represents whether that second diamond exists, or you could come up with any other way to represent the observation data. For me, reading machine learning papers and seeing how others have represented input data for their problems has given me a sense of what types of representations seem to work well, but the best strategy at the moment seems to be a bit of guess-and-check combined with some logical thinking about what types of data representations would allow a neural network to distinguish between good and bad actions.
To get the action probabilities for going left or right (the numbers in the blue boxes in the video), we run the observation data through our neural network and it outputs these probabilities. So we don't decide what they are, we just use whatever the neural network outputs. Initially these probabilities will be somewhat random and near 50/50 (they depend on how we initialized the weights of our neural network, which is usually done in a somewhat random way). And then each time we update the network, that changes how it generates these probabilities.
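For reference, a minimal sketch of that kind of policy network in PyTorch (the layer sizes and observation format here are just an example, not the exact network from the video):

```python
import torch
import torch.nn as nn

# A small policy network: observation in, action probabilities out.
policy = nn.Sequential(
    nn.Linear(2, 32),    # example observation size of 2 (e.g. position, diamond exists)
    nn.ReLU(),
    nn.Linear(32, 2),    # two outputs: one per action (left, right)
    nn.Softmax(dim=-1),  # turn the outputs into probabilities that sum to 1
)

obs = torch.tensor([0.0, 1.0])                           # example observation
action_probs = policy(obs)                               # roughly 50/50 at initialization
action = torch.multinomial(action_probs, num_samples=1)  # sample left or right
```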
Thanks for the questions. I'm glad you liked the video.
33:40 Could someone explain this? x5 is the probability of taking the action to go right to receive the reward (as they're the variables for the action probabilities at 27:52), so why do we do .2 * x5? As .2 and x5 are both the probability of going right, why are we multiplying them by each other? Why do we multiply the probability of going right by itself? It doesn't make sense. :/
Great question, this is a confusing part, and you are correct, we should not be multiplying by both .2 and x5. I probably should have made it clearer at that part of the video that that example is what should not be done. The reason I started off showing the wrong way to do it is that at 33:40, where I am trying to figure out how to estimate that Y value, I wanted to show how someone might figure out how to estimate that value if they didn't already know the correct method. So I pretend I don't actually know how to estimate it and I guess at a solution (which at first is the wrong solution, but I figure out how to correct it later). My first attempt at a solution is to sample episodes, and each time I take an action, I add to my sum that action probability (as a variable) multiplied by the reward received after taking that action. The sum I actually estimate with this method is the Ŷ value at 33:40, which, as you noticed, is an incorrect estimate of Y: I have an extra multiple of n, and I am multiplying by the action probabilities twice, once as a constant value and once as a variable. So to correct for this, in my next attempt at estimating the sum, I divide the whole thing by n, and also, each time I take an action, I add to the sum that action probability (as a variable) divided by that action probability (as a constant value), multiplied by the reward received after taking that action. By dividing by that action probability as a constant, I cancel out the implicit multiplication by that action probability that comes from the fact that I only add this new value to the sum after having already taken that action. This corrected estimate is the Ŷ value at 35:40, and this method is how to correctly estimate Y.
I hope this helps. Let me know if there is anything you are still confused about.
Thank you for posting this question. I got stuck there as well
Great one
tnx
Thanks for the amazing lecture. I have a question: at 32:41, I can't understand the green term.
I thought that 0.7 is one of the parameters of x1, and that we change this x1 to get the best policy, but at that point in the video Ŷ has n * 0.7 * x1 * [avg return after x0]. Could you elaborate on why we have to multiply by 0.7?
At that point in the video (32:41), Ŷ is not the correct estimate of Y. I probably should have clarified that by renaming that Ŷ to something like "an initial guess of Ŷ". I was trying to show how you could figure out a way to estimate Y by coming up with an initial guess for Ŷ, and then modifying that initial guess as needed to make it correct. For the initial guess, I intentionally use the wrong algorithm of "sample episodes, and then for each action, add that action probability (as a variable) times the reward received after taking that action to a sum, and that sum will be our initial guess of Ŷ". After this initial guess, I see that the sum is incorrect because it has the extra "n" multiple and the extra "0.7" multiple. So to correct for this, I update the algorithm by adding the step: "also, divide each item in the sum by n (the total number of episodes sampled) and by the constant value of the action probability (which would be 0.7 when sampling the x0 action)". After these changes, Ŷ (our sampled estimate of Y) is correct, meaning that it will tend to converge towards Y as we sample more episodes. And we have to divide by 0.7 (the value of x0) instead of by the variable x0 to get the correct estimate of Y, because if we divided by x0, it would cancel out the x0 variable instead of the 0.7 value, and then Ŷ would not look like the Y equation above it that we are trying to estimate.
Does that help? Let me know if you have any follow-up questions.
Wow, very intuitive explanation. I hope you can create such videos for all RL approaches. Btw, may I be provided with the slides of this lecture?
Thanks, glad you liked it. I haven't been making videos lately, but I appreciate the suggestion of making more RL videos, I may make more in the future. Also, I currently don't have the images I used for this video in slide format.
@@elliotwaite I am watching the video for the second time. At 7:05, how did you calculate the expected return? Based on my calculation, for the given policy the expected return for state 0 is (1*3) + 1*0.1*-10 = 2.9, but you obtained -0.7.
@@Torabi_ahmad18973, there are different ways to break up the sum of the expected return. I was summing over all the paths, and for each path, adding the probability of taking that path times the total reward received along that path. There are two possible paths. Path A has transition probabilities (1, 1, 0.1) giving a total probability of 0.1, and the total reward on that path is (3 - 10 = -7). And path B has transition probabilities (1, 1, 0.9, 1) and a total reward of (3 + 0 + 10 = 13), so to get the expected return we sum over both paths, which is ((0.1 * -7) + (0.9 * 13)), which equals (-0.7 + 11.7 = 11).
If you want, you can also break it up by each reward and the probability of getting that reward. So there is a 1.0 chance of getting the reward of +3, a 0.1 chance of getting the -10, and a 0.9 chance of getting the +10. So ((1.0 * 3) + (0.1 * -10) + (0.9 * 10)) = 11. You get the same result, just in a different way.
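Just to spell out that arithmetic (same numbers as above):

```python
# Grouped by path:
#   Path A: probability 1 * 1 * 0.1 = 0.1, total reward 3 - 10 = -7
#   Path B: probability 1 * 1 * 0.9 * 1 = 0.9, total reward 3 + 0 + 10 = 13
expected_return_by_path = 0.1 * -7 + 0.9 * 13               # -0.7 + 11.7 = 11.0

# Grouped by individual reward:
expected_return_by_reward = 1.0 * 3 + 0.1 * -10 + 0.9 * 10  # 3 - 1 + 9 = 11.0
print(expected_return_by_path, expected_return_by_reward)
```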
Let me know if this helped clarify or if there is anything that is still confusing.
yay! I found a gem :) thanks!
Thanks, Ufuk!
I have a doubt. At 21:00, shouldn't the expected reward of each path be prob1*reward_1 + prob2*reward_2, as compared to the one used, (prob1*prob2) * (reward_1 + reward_2)?
If the example is that prob2 comes after prob1, it should be prob1*reward1 + (prob1*prob2)*reward2. Said another way, only prob1 of the time will you ever receive reward1, and only (prob1*prob2) of the time will you ever receive reward2. This is because the probability of receiving each reward is the product of all the probabilities along the path to that reward. Does that make sense? Let me know if you still have any doubts.
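To put numbers on that formula (made-up values, just for illustration, not from the video):

```python
# Hypothetical probabilities and rewards, just to illustrate the formula.
prob1, prob2 = 0.5, 0.8
reward1, reward2 = 2.0, 10.0

# reward1 is received prob1 of the time; reward2 is only received when both
# steps happen, which is (prob1 * prob2) of the time.
expected_reward = prob1 * reward1 + (prob1 * prob2) * reward2
print(expected_reward)  # 0.5 * 2 + 0.4 * 10 = 5.0
```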
@@elliotwaite Yeah, my bad. So assuming we get rewards at each particular state as well, and not all of the net reward at the end upon reaching the terminal state, the expected reward would be (prob1*reward_1) + (prob1*prob2*reward2). For the example that you are using, the total rewards along the path are only received after termination is achieved (like chess, and unlike an MDP where you get some reward at each state), hence (prob1*prob2) * (net rewards along the way) works. I am assuming that if the first case were used, the "expected rewards" would have been different for each path, as well as the policy gradients based on them. Would love it if you could elaborate on whether it would also work with case 1, the example above.
@@AmanSharma-bj2yc, I think I'm confused. Can you let me know if we agree on the following things? In the example I am using, the rewards are received along the way, not just at the end. It can be modeled as an MDP with a start state and terminal states. However, the MDP does not allow any cycles, so it's like a one way graph. The expected reward of the policy is the expected reward from the start state. The expected reward from any of the other states could be different. Do we agree on these points or are we thinking about these things differently?
Very awesome! But what website or software do you use to prepare the slides? Because they look so clean.
Thanks! I made them with Figma.
thank you so much! god bless you
This is amazing
Thanks!
Eureka! Thank You!
Great video. The only thing I can't quite grasp after multiple watchings is... what are the targets for the neural network prior to backpropagating? Can you break it down a little more for each time step? I've got the forward propagation, I just need the per-time-step targets that are required in order to calculate the partials. I keep coming back to 45:00 with this question. Tell me more about the 'loss': what dimension is it? Is it d=2 because you have two actions? Doesn't the loss function require two arguments, the predicted output and the target? Thanks a lot.
Ah, I can see why this is confusing. In this algorithm we actually don't use target values.
If we knew which actions we were supposed to take, we could use those actions as target values, but we don't have that information.
We could also try to say that our target is to receive infinite reward in each episode, and that the farther away we are from that target, the greater our loss will be, but then all our losses would be infinite, so that doesn't work so well either.
So what we do instead is come up with a loss value directly, without using target values. This loss just needs to be a value such that when we update the parameters of our model in the direction that minimizes it, our policy ends up receiving more total reward on average. The reasoning behind why we came up with the loss that we did is explained in the video, but once we have that loss, we just call loss.backward() on it directly, without using any target values, and we update the parameters of our policy using that gradient.
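To make that concrete, here's a rough sketch of what that kind of loss can look like in PyTorch (made-up tensor values, not the exact code from the video); note that there's no target anywhere:

```python
import torch

# Made-up values for illustration: the log probabilities of the actions we
# sampled, and the return received after each of those actions.
log_probs = torch.tensor([-0.36, -1.61, -0.11], requires_grad=True)
returns = torch.tensor([3.0, -7.0, 13.0])

# Negative so that minimizing the loss increases the probability of actions
# that were followed by high returns. No target values are involved.
loss = -(log_probs * returns).mean()
loss.backward()  # the gradient is all we use to update the policy
```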
Let me know if you're still confused about anything and I can try to explain it further.
@@elliotwaite Ok, this is new! We come up with a loss value directly, he says... Is this the same as the difference between the output and the target value? In order to backpropagate, I'm used to calculating the errors at the output nodes. What role does the loss play in the backpropagation algorithm? Thank you for your reply, very kind of you to write.
@@hoddy2001 The loss we use here is not the difference between an output and a target, since we don't have a target value. This loss is just something that produces the correct gradient when we try to minimize it. Unlike a loss that is the difference between an output and a target, the loss we use here can be negative, which is impossible for an output-and-target loss that represents an error amount, because you can't have a negative error amount. But for the loss we use, a negative loss is fine, we just try to make it lower. So if it's negative, we just try to make it more negative.
The main insight is that our loss isn't an error amount, it's just a value that generates the correct gradient when we back propagate from it.
Let me know if anything is still confusing.
@@hoddy2001 ... The role the loss plays in the backpropagation algorithm is that the loss is what is getting minimized. The gradient that we get from calling loss.backward() tells us which way to move the parameters to decrease the loss value the fastest. For example, if our model only has 3 parameters, we can think of the parameter space of that model as a 3D space, and calculating the gradient of the loss figures out which direction in that 3D space from the current parameter values will decrease the loss the fastest. Then we update the parameters by taking a small step in that direction. If instead we have a million parameters, it's the same thing, but the parameter space is a million-dimensional space (which is hard to visualize, but the concept is the same).
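A tiny sketch of that 3-parameter picture (made-up loss, just to show the mechanics):

```python
import torch

# A hypothetical "model" with only 3 parameters, and a made-up scalar loss.
params = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
loss = (params ** 2).sum()

loss.backward()  # fills params.grad with the partial derivatives of the loss

with torch.no_grad():
    params -= 0.01 * params.grad  # small step in the direction that decreases the loss
    params.grad.zero_()
```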
@@elliotwaite Thank you for your patience, my good man. I am in a context where I don't have access to loss.backward() and need to know computationally how to compute the partial derivatives using the loss value. I understand now that the loss is a scalar value representing what would otherwise be the value of the error function. It is a value whose partial derivatives with respect to the parameters point us in the direction of the gradient. Can you refer me to the definition of loss.backward()? How does it compute partials from this scalar value?
Great speech, man. But I just wished to see how to optimize the policy probabilities by hand. If we update the policy by the expected rewards, we wouldn't be going for the 13 reward soon.
I'm not sure what you mean. Can you clarify what you mean by "optimize policy probabilities by hand"? I thought the process I outlined was one that could be done by hand.
Also, can you explain what you mean by "going for 13 reward soon"? The policy I described would converge towards the optimal policy of getting an average reward of 11, but it would never actually get all the way to the optimal policy, since the probabilities would never be 1.0/0.0; they would always be something like 0.999/0.001. But that is just how the policy gradients algorithm works.
If you are looking for a way to jump straight to the optimal policy (the one with an expected return of 11), then you would want to use a different algorithm. However, that is usually only possible in simple environments. Once the environment gets more complicated, it's computationally more efficient to improve gradually with something like the policy gradients algorithm rather than trying to find the perfectly optimal policy directly.
Hey, thanks for another explanation. I'm still trying to wrap my head around this topic and I might be missing a few points. I'll keep studying. @@elliotwaite
@@gorkemvids4839 Okay, sounds good. Thanks for your question, although I'm not sure I answered it well. Feel free to ask me more questions if you have any in the future.
Thank you for the amazing video, I have a question. In the implementation part, you defined the loss as the sum of all the partial derivatives of the return at the starting state. How does this relate to the neural network weight update? Concisely, how does maximizing these partial derivatives maximize the return at the start state?
please reply to the question if possible. I have now watched this video multiple times and still cannot understand this.......
The loss actually isn't the sum of the partial derivatives; it's the sum of the log probs times their associated partial returns, divided by the number of episodes. The loss does not include the partial derivatives; the partial derivatives are calculated using this loss. The reason why minimizing this loss is equivalent to maximizing the expected return per episode is what I try to explain in the video, so I feel like I would just be repeating myself if I tried to explain it here in the comments as well. Was there a part of the video that was unclear that I could help clarify?
@@elliotwaite Ohhhh, so what we are doing is maximizing a quantity whose partial derivatives turn out to be the same as the partial derivatives of the actual quantity (the return) we want to maximize.
Apologies for the misunderstanding.
@@rishipatel8010 Yeah, it's a nice trick we can use when working with neural networks. Since we only actually use the gradient to update the network, it only matters that the gradient is correct, so our loss value doesn't actually have to be the true value we are trying to minimize, it can be any value that has the same gradient as the true value we are trying to minimize.
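A tiny made-up example of that trick (not from the video): two losses with different values but the same gradient produce the same parameter update, so only the gradient matters.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Two losses with different values but identical gradients with respect to x.
loss_a = x ** 2           # value 4.0
loss_b = x ** 2 - 100.0   # value -96.0 (a negative loss is fine here)

loss_b.backward()
print(x.grad)  # tensor(4.) — the same gradient loss_a would have produced
```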
But probs/probs.detach wouldn't give a vector of ones?
@@NeuralGamerAI Yes, that is an interesting point: that vector will be a vector of ones. But we don't just use the values of that vector (which would be somewhat useless since they are all just ones, as you mentioned), we instead use the gradients that flow back through that vector when we backpropagate from the loss.
Another way to think about it is that when we are back propagating the gradient values through that operation, whatever the gradient for x is at that point will get multiplied by whatever the value of 1/x was in the forward pass, and then that new gradient will continue being back propagated.
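You can check this directly with a quick made-up snippet (not from the video): the forward value is all ones, but the gradient that flows back through it is 1 / probs.

```python
import torch

probs = torch.tensor([0.1, 0.9], requires_grad=True)
ratio = probs / probs.detach()   # forward value: tensor([1., 1.])

ratio.sum().backward()
print(probs.grad)                # tensor([10.0000, 1.1111]) == 1 / probs
```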
@@elliotwaite Thank you very much, very useful explanation.
Beautifully done.