This content is sponsored by my Udemy courses. Level up your skills by learning to turn papers into code. See the links in the description.
What mattered was the explanation of those little details that everyone ignores because they simply don't understand them like you do, so thanks a lot.
Thank you for your videos, Phil. They are very informative and help me understand more and more about this content!
Glad to be of service Gabriel
Thank you, Dr Phil.
Really well made video, both from the theoretical standpoint and, coding-wise, super clear to understand.
One small error you made: in your theoretical section you mixed up two different notations for the reward R_t.
The most commonly used notation (also used by Sutton/Barto in the book you mention) indexes the reward by the next state, i.e. the reward that arrives after taking an action:
Notation 1: S0, A0, R1, S1, A1, R2...
However, other literature may write it as notation 2: S0, A0, R0, S1, A1, R1...
At 7:15 you used notation 1 (the sum is also slightly off: it needs to run from t to T-1, not from 0 to T-1, but you fixed that in the discounted version of the formula).
At 12:24 and 13:24 you used notation 2 for the delta equation (it needs to be R_t+1 instead).
I really loved the video, and I'm leaving this comment to help clear up some of the confusion I had myself when studying these topics :)
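For reference, written out in notation 1 (Sutton/Barto), the two formulas the comment refers to would read, in LaTeX:

G_t = \sum_{k=t}^{T-1} \gamma^{k-t} R_{k+1}, \qquad \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)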
Thanks for the clarification Benjamin
thanks dr. phil
I think it would be a good idea, in addition to showing results in the command line, to also show environment renders after the model learns.
Your videos are good. I'm trying to implement an actor-critic algorithm for modelling a process. My process has flow rate and concentration of species as inputs, and the output is pH. I'm struggling to implement it since I'm a beginner. Kindly make a video tutorial on how to implement actor-critic for process modelling. It would be helpful for students like us to follow and learn.
Your videos have defogged all these concepts for me. Thank you so much!!!
Thanks for the content Dr. Phil. :-)
Thanks for watching
Thanks! Saved and will watch later.
Very clear and nice explanation, thank you!
Hey Phil.
For some reason, when I use this actor-critic method (or REINFORCE) in a poker environment (Texas Hold'em), it always learns to fold with 100% probability. If I use a dueling DQN approach, it works correctly and plays the stronger hands and folds the weaker ones. It seems that I am running into a local optimum (since rewards are negative when you bet and are only positive at the end of the episode if you win), where folding always has the maximum reward on the first timestep (0 instead of some negative number). I am using a gamma of 0.999.
Would you have any idea what's going on here?
You need a better exploration strategy. PG methods are on-policy, which means there is a higher tendency to get stuck in local optima.
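One common remedy, shown here only as a sketch under assumptions (the names probs, action and delta are placeholders, not from the video), is to add an entropy bonus to the actor loss so the policy is penalized for collapsing onto a single action such as always-fold:

import tensorflow_probability as tfp

def actor_loss_with_entropy(probs, action, delta, entropy_coef=0.01):
    # Standard actor-critic actor loss, plus an entropy bonus that penalizes
    # a policy which collapses onto a single action (e.g. always fold).
    dist = tfp.distributions.Categorical(probs=probs)
    log_prob = dist.log_prob(action)
    entropy = dist.entropy()  # large while the policy is still spread out
    return -log_prob * delta - entropy_coef * entropy

Annealing entropy_coef down over training is a common follow-up so the bonus doesn't prevent convergence later on.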
From the RL book by Sutton/Barto, the one-step actor-critic uses the semi-gradient method to update the critic network. Which means
state_value_, _ = self.actor_critic(state_) should not be included inside the GradientTape.
This is confirmed by the pseudocode given in Sutton/Barto, where w is updated as w = w + alpha*delta*grad(V(S, w)) (here V and w represent the critic network and its parameters, respectively).
But if we include state_value_, _ = self.actor_critic(state_) inside the GradientTape, the update would have an additional grad(V(s', w)) term! (Here s' is the next state, i.e. state_ in the code.)
Page 274. Delta term is proportional to the difference in value function of successive states. Both gradients (actor and critic) have a delta term in them.
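For anyone wondering what that change would look like, here's a hedged sketch of the semi-gradient critic update, assuming it sits inside the agent's learn() method and that the variable names follow the video's style (they are not copied from it). The next-state value is treated as a constant by wrapping it in tf.stop_gradient; computing it outside the tape would have the same effect:

import tensorflow as tf

with tf.GradientTape() as tape:
    state_value, probs = self.actor_critic(state)
    state_value_, _ = self.actor_critic(state_)
    # Semi-gradient: treat the bootstrap target V(s', w) as a constant.
    state_value_ = tf.stop_gradient(state_value_)
    delta = reward + self.gamma * state_value_ * (1 - int(done)) - state_value
    critic_loss = delta**2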
Would it be possible to label these precious lectures with some kind of sequential index (per topic) as you enrich them, so someone just coming to them has an idea of where it would be best to start and how to follow along? Many thanks for sharing your exceptional skills.
Can't watch now, but leaving a comment to get this video going :D
Video time 6:04: for two flips, don't we need to multiply by 2? E(2 flips) will still be 0, since 0 x 2 = 0.
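For a concrete check (assuming the fair-coin example pays +1 for heads and -1 for tails):

E[R] = \tfrac{1}{2}(+1) + \tfrac{1}{2}(-1) = 0, \qquad E[R_1 + R_2] = E[R_1] + E[R_2] = 2 \cdot 0 = 0

so multiplying by 2 is fine; by linearity of expectation the total is just 2 x 0, which is still 0.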
You are so so great! Saving up to buy your courses, your videos have been so helpful :)
Thank you Maria.
Hey, I have a question. Do you have a source or some literature where the concept of the value function and the policy both coming from the same network is explained, and why this is possible?
Ty
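For what it's worth, the shared-network idea is usually just a common trunk with two output heads; a minimal sketch (not Phil's exact architecture, and the layer sizes are made up):

from tensorflow import keras

class ActorCriticNetwork(keras.Model):
    def __init__(self, n_actions, fc_dims=256):
        super().__init__()
        self.fc1 = keras.layers.Dense(fc_dims, activation='relu')      # shared features
        self.v = keras.layers.Dense(1, activation=None)                 # critic head V(s)
        self.pi = keras.layers.Dense(n_actions, activation='softmax')   # actor head pi(a|s)

    def call(self, state):
        x = self.fc1(state)
        return self.v(x), self.pi(x)

The usual argument is that features useful for predicting V(s) are also useful for choosing actions, so sharing the trunk saves parameters; the A3C paper (Mnih et al., 2016) is a commonly cited reference for this design.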
Your video helps!
Hi Phil
Thanks for the video. Can you please explain how the score improves as the iterations progress, even though we are sampling the actions randomly?
Comment for the algorithm! :)
Thanks Jousef!
Thank you !
Thanks for watching
Hahaha, I feel like I'm in my AI/ML class. Every week's lecture discussion starts with everyone saying thank you 😀 it's awesome. I love this video so far, still watching, but it is amazingly clear. So, I totally agree: thank you!
How would you use this method in the context of reinforcement learning from human preferences?
Hi, can we use this method for decision making too?
Hey @Phil, I have been following along, loving the content. Now I'm wondering, on a scale of 0-1, what is the probability you will do a video on implementing CURL: Contrastive Unsupervised Representations for RL?
Hi Phil.
I thought prob_ratio must equal one if we replay the same action, since the actor is updated after the replay. Am I right?
Very informative
Can we adjust the actor-critic functionality to override the output (the result of the softmax) and update the gradients accordingly?
Since RL starts learning from scratch, I would like to use a heuristic's output as the final softmax output to speed up learning!
Is that possible?
Why do we pass the softmax probabilities to the tfp Categorical distribution? Can we not just select the highest-probability action from the softmax output? I'm not really good at the math, so I'm having a hard time figuring it out.
I am wondering the same thing. It looks like it also works if you just take the action with the highest probability.
I think it is there to implement the exploration part for the agent.
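To make the difference concrete, a small sketch (the probability values are made up, not from the video):

import tensorflow as tf
import tensorflow_probability as tfp

probs = tf.constant([[0.7, 0.2, 0.1]])

greedy = tf.argmax(probs, axis=1)                  # always action 0: no exploration
dist = tfp.distributions.Categorical(probs=probs)
sampled = dist.sample()                            # 0 about 70% of the time, 1 about 20%, 2 about 10%
log_prob = dist.log_prob(sampled)                  # the log pi(a|s) term the update needs

Sampling keeps some exploration and matches the policy-gradient math, which is an expectation over actions drawn from pi(a|s). Taking the argmax is fine for evaluation after training, but during training it tends to stop exploring.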
Hi Phil, I am a beginner. Can you tell me whether the critic is needed after training is completed, or not? That is, is the actor alone enough after training?
Thanks.
only actor!
@@papersandchill thank you so much.
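As a rough illustration of the "only actor" point (a sketch, assuming a combined network that returns (value, probs) as in the video; actor_critic and observation are placeholders):

import tensorflow as tf

def choose_action_greedy(actor_critic, observation):
    # Only the policy head is needed at evaluation time; the critic output is ignored.
    state = tf.convert_to_tensor([observation], dtype=tf.float32)
    _, probs = actor_critic(state)
    return int(tf.argmax(probs, axis=1)[0])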
Thank you sir
Which tensorflow version is good for this?
Thanks very much
Hi @Phil, thank you for this amazing video. Just one question about your critic loss: is it possible that it explodes when using delta**2? Because the gradient after that gives me all NaN values. Any advice?
Strange. What environment? Make sure the ln term isn't exploding
@@MachineLearningwithPhil I have just noticed that the NaN values appear when one of the probabilities in our probs tensor goes to 0. Can we just add a small quantity to prevent this? And is that the reason for the NaN in the gradient, because we are taking a derivative at 0?
Ln of 0 is undefined. You can just add some small value, yes.
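A hedged sketch of that fix (argument names are placeholders, not taken from the video's code):

import tensorflow as tf
import tensorflow_probability as tfp

def safe_log_prob(probs, action, eps=1e-8):
    # Clip probabilities away from 0 so log(prob) stays finite.
    probs = tf.clip_by_value(probs, eps, 1.0)
    dist = tfp.distributions.Categorical(probs=probs)
    return dist.log_prob(action)

Clipping slightly denormalizes the probabilities, but for a small eps the effect is negligible; you can renormalize afterwards if it matters.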
@@MachineLearningwithPhil Another question: in a problem where the values predicted by the actor are in a completely different range from the critic's (actor --> (0, 1), while critic --> (-80, 160)), is it really difficult to find the optimal combination with just one network?
@@davideaureli6971 HI! I have the same problem. But I can't get it fixed. How did you do it? Thanks!
Thank you for the tutorial. One question: in your application the agent learns after every step it takes in the environment. How about learning in a batch after each episode?
Generally that's not the way it's done with actor-critic. It's a temporal difference method, so it learns every time step. Policy gradient is based on Monte Carlo methods and does what you described.
@@MachineLearningwithPhil Thank you!
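Roughly, the two schedules look like this (a sketch assuming the classic gym step/reset API and an agent with choose_action/learn methods; none of this is Phil's exact code):

def train_one_episode_td(env, agent):
    # Temporal-difference schedule: the agent learns after every single step.
    observation = env.reset()
    done = False
    score = 0
    while not done:
        action = agent.choose_action(observation)
        observation_, reward, done, info = env.step(action)
        agent.learn(observation, reward, observation_, done)
        observation = observation_
        score += reward
    return score

# A Monte Carlo method (e.g. REINFORCE) would instead store the whole episode
# and call agent.learn(...) once, after the episode ends.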
AttributeError: module 'tensorflow' has no attribute 'contrib'
Can anybody help me to solve this error?
I really appreciate your explanation.
I tried to run it on FrozenLake, and NChain, it didn't work although I changed the input_dims from 8 to 1. Any hints or help how I can alter the code to work on FrozenLake?
Frozen lake isn't an appropriate environment for the algorithm. FL is for tabular methods, not approximate ones. In other words, neural nets won't really work.
Great! Keep it up
I don't see action_space being used anywhere in the code; don't we need it when sampling the action?
Thank you! Just wondering where the learning rates alpha and beta are implemented?
21:35
Learning rates come into play when we compile the models with an optimizer. I didn't specify a learning rate so it uses the default values.
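For example, a hedged sketch of where an explicit learning rate would go (network is a placeholder for the combined actor-critic model, and 3e-4 is just an illustrative value):

from tensorflow.keras.optimizers import Adam

# Adam() defaults to learning_rate=0.001; pass one explicitly to change it.
network.compile(optimizer=Adam(learning_rate=3e-4))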
@@MachineLearningwithPhil I see. Thanks again
Thanks for the great tutorial. Does one game mean one episode?
usually yes
An episode typically ends when the environment is reset. (This never happens in the real world, unless the "real world" itself is a simulator, like a game, for example chess.)
ImportError: This version of TensorFlow Probability requires TensorFlow version >= 2.9; Detected an installation of version 2.8.0. Please upgrade TensorFlow to proceed. I am getting this error, can anybody help me solve it? I also upgraded TensorFlow but got the same error again. @Machine Learning with Phil
Please make another tutorial on deep Q-learning with TensorFlow 2.
he already made it, check his channel
"Probability for getting head multiplied by the reward of getting head" - In my case is most likely zero
Every time people show one of those math formulas on YouTube, a baby panda dies somewhere in the world.
Call the WWF!
@@MachineLearningwithPhil I am looking to set up render() for an RL environment; do you have any videos related to this, or to env.render()?
What is the game?
Edit: OK, CartPole...
Good!
hi