Everything You Need To Master Actor Critic Methods | Tensorflow 2 Tutorial

  • Published 4 Nov 2024

COMMENTS • 78

  • @MachineLearningwithPhil
    @MachineLearningwithPhil  4 years ago +6

    This content is sponsored by my Udemy courses. Level up your skills by learning to turn papers into code. See the links in the description.

  • @youssefmaghrebi6963
    @youssefmaghrebi6963 6 months ago +1

    What mattered was the explanation of those little details that everyone else ignores because they simply don't understand them the way you do, so thanks a lot.

  • @gabrielvalentim197
    @gabrielvalentim197 1 year ago +1

    Thank you for your videos, Phil. They are very informative and help me understand more and more about this content!

  • @Falconoo7383
    @Falconoo7383 2 years ago

    Thank you, Dr Phil.

  • @fastestwaydown
    @fastestwaydown 2 years ago +4

    Really well-made video, super clear to understand both from a theoretical standpoint and coding-wise.
    One small error: in your theoretical section you mixed up two different notations for the reward R:
    the most commonly used notation (also used by Sutton/Barto in the book you mention) indexes the reward by the next state, i.e. the reward that arrives after taking an action:
    Notation 1: S0, A0, R1, S1, A1, R2...
    However, other literature writes it as notation 2: S0, A0, R0, S1, A1, R1...
    At 7:15 you used notation 1 (and the sum notation is also slightly off: it needs to run from t to T-1, not from 0 to T-1, but you fixed this in the discounted version of the formula).
    At 12:24 and 13:24 you used notation 2 for the delta equation (it should be R_t+1 instead).
    I really loved the video, and I'm leaving this comment to help clear up some of the confusion I had myself when studying these topics :)
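
    For reference, in the Sutton/Barto convention (notation 1), the return from time t and the one-step TD error can be written in LaTeX as follows (my own restatement of the standard formulas, not a quote from the video):

      G_t = \sum_{k=t}^{T-1} \gamma^{k-t} R_{k+1}

      \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)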

  • @hameddamirchi
    @hameddamirchi 3 years ago +1

    Thanks, Dr. Phil.
    I think it would be a good idea, in addition to showing results in the command line, to show environment renders after the model learns.

  • @anus4618
    @anus4618 4 months ago

    Your videos are good. I'm trying to implement an actor-critic algorithm for modelling a process. My process has flow rate and species concentration as inputs, and the output is pH. I'm struggling to implement it since I'm a beginner. Kindly make a video tutorial on how to implement actor-critic for process modelling. It would be helpful for students like us to follow and learn.

  • @georgesantiago4871
    @georgesantiago4871 3 years ago

    Your videos have defogged all these concepts for me. Thank you so much!!!

  • @ronmedina429
    @ronmedina429 4 years ago +2

    Thanks for the content Dr. Phil. :-)

  • @portiseremacunix
    @portiseremacunix 4 years ago

    Thanks! Saved and will watch later.

  • @softerseltzer
    @softerseltzer 4 years ago

    Very clear and nice explanation, thank you!

  • @Corpsecreate
    @Corpsecreate 4 years ago +3

    Hey Phil.
    For some reason, when I use this actor-critic method (or REINFORCE) in a poker environment (Texas hold'em), it always learns to fold with 100% probability. If I use a dueling DQN approach, it works correctly and plays the stronger hands and folds the weaker ones. It seems that I am running into a local optimum (since rewards are negative when you bet, and are only positive at the end of the episode if you win) where folding always has the maximum reward on the first timestep (0 instead of some negative number). I am using a gamma of 0.999.
    Would you have any idea what's going on here?

    • @papersandchill
      @papersandchill 2 years ago

      You need a better exploration strategy. PG methods are on-policy, which means there is a higher tendency to get stuck in local minima.
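
      One common way to add that exploration pressure in on-policy methods is an entropy bonus on the policy loss. A minimal TF2 sketch (the names log_prob, delta and probs are assumptions, not taken from the video's code):

        import tensorflow as tf

        ENTROPY_COEF = 0.01  # strength of the exploration bonus (tunable)

        def actor_loss_with_entropy(log_prob, delta, probs):
            # entropy of the categorical policy; higher entropy = more exploration
            entropy = -tf.reduce_sum(probs * tf.math.log(probs + 1e-8), axis=-1)
            policy_loss = -log_prob * delta              # standard actor-critic policy loss
            return policy_loss - ENTROPY_COEF * entropy  # subtracting entropy rewards exploration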

  • @sriharihumbarwadi5981
    @sriharihumbarwadi5981 2 years ago +2

    In the RL book by Sutton/Barto, the one-step actor-critic uses a semi-gradient method to update the critic network, which means
    state_value_, _ = self.actor_critic(state_) should not be included inside the GradientTape.
    This is confirmed by the pseudocode in Sutton/Barto, where w is updated as w = w + alpha*delta*grad(V(s, w)) (here V and w are the critic network and its parameters, respectively).
    But if we include state_value_, _ = self.actor_critic(state_) inside the GradientTape, the update gets an additional grad(V(s', w)) term! (Here s' is the next state, i.e. state_ in the code.)

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  2 years ago +2

      Page 274. The delta term is proportional to the difference in the value function of successive states. Both gradients (actor and critic) have a delta term in them.
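
      For anyone who wants to try the semi-gradient variant described above, one option is to block the gradient through the next-state value with tf.stop_gradient. A rough sketch under assumed names (actor_critic is assumed to return (value, probs); this would run inside the GradientTape):

        import tensorflow as tf

        def semi_gradient_delta(actor_critic, state, state_, reward, done, gamma=0.99):
            # V(s) keeps its gradient; V(s') is treated as a fixed target,
            # matching the critic update in the Sutton/Barto pseudocode.
            state_value, _ = actor_critic(state)
            next_value, _ = actor_critic(state_)
            next_value = tf.stop_gradient(next_value)
            delta = reward + gamma * next_value * (1.0 - float(done)) - state_value
            return delta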

  • @hamidrezamirtaheri5414
    @hamidrezamirtaheri5414 4 years ago

    Would it be possible to label these precious lectures with a kind of sequential index (per topic) as you add to them, so someone just coming to them has an idea of where it would be best to start and how to follow along? Many thanks for sharing your exceptional skills.

  • @SogaMaplestory
    @SogaMaplestory 4 years ago

    Can't watch now, but leaving a comment to get this video going :D

  • @MrArv83
    @MrArv83 1 year ago

    Video time 6:04: for two flips, do we need to multiply by 2? E(2 flips) will still be 0, since 0 x 2 = 0.

  • @fernandadelatorre7724
    @fernandadelatorre7724 3 years ago

    You are so so great! Saving up to buy your courses, your videos have been so helpful :)

  • @DjThrill3r
    @DjThrill3r 2 years ago

    Hey, I have a question. Do you have a source or literature where the concept that the value function and the policy both come from the same network is explained, and why this is possible?
    Ty

  • @林政達-c5k
    @林政達-c5k 3 years ago

    Your video helps!

  • @rahulrahul7966
    @rahulrahul7966 2 years ago

    Hi Phil,
    Thanks for the video. Can you please explain how the score improves as the iterations progress even though we are sampling the actions randomly?

  • @JousefM
    @JousefM 4 years ago +1

    Comment for the algorithm! :)

  • @oussama6253
    @oussama6253 4 years ago +1

    Thank you !

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  4 years ago +1

      Thanks for watching

    • @robertotomas
      @robertotomas 4 years ago +1

      Hahaha, I feel like I'm in my AI/ML class. Every week's lecture discussion starts with everyone saying thank you 😀 It's awesome. I love this video so far, still watching, but it is amazingly clear. So, I totally agree: thank you!

  • @elhouarizohier3824
    @elhouarizohier3824 2 years ago

    How would you use this method in the context of reinforcement learning from human preferences?

  • @lichking1362
    @lichking1362 1 year ago

    Hi, can we use this method for decision making too?

  • @jahcane3711
    @jahcane3711 4 years ago +2

    Hey @Phil, I have been following along, loving the content. Now I'm wondering, on a scale of 0-1, what is the probability you will do a video on implementing CURL: Contrastive Unsupervised Representations for RL?

  • @tarifcemay3823
    @tarifcemay3823 2 years ago

    Hi Phil.
    I thought prob_ratio must equal one if we replay the same action, since the actor is updated after the replay. Am I right?

  • @ahmadalhilal9118
    @ahmadalhilal9118 3 years ago

    Very informative.
    Can we adjust the actor-critic to decide the output (the result of the softmax) and update the gradients accordingly?
    Since RL starts learning from scratch, I would like to use a heuristic's output as the final softmax output to speed up learning!
    Is that possible?

  • @fawadnizamani761
    @fawadnizamani761 4 years ago +1

    Why do we pass the softmax probabilities to the tfp Categorical distribution? Can we not just select the highest-probability action from the softmax output? I'm not really good at understanding the math, so I'm having a hard time figuring it out.

    • @Jnaaify
      @Jnaaify 3 years ago

      I am wondering the same thing. It looks like it also works if you just take the action with the highest probability.

    • @davideaureli6971
      @davideaureli6971 3 years ago

      I think it is there to implement the exploration part for the agent.
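
      A small illustration of the difference (the probs tensor here is made up for the example):

        import tensorflow as tf
        import tensorflow_probability as tfp

        probs = tf.constant([[0.1, 0.7, 0.2]])        # softmax output for one state

        dist = tfp.distributions.Categorical(probs=probs)
        sampled_action = dist.sample()                # stochastic: sometimes picks the 0.1/0.2 actions
        greedy_action = tf.argmax(probs, axis=1)      # deterministic: always the 0.7 action

        # Sampling also gives the log-probability that the policy-gradient loss needs;
        # a pure argmax provides no such term.
        log_prob = dist.log_prob(sampled_action)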

  • @selcukkara82
    @selcukkara82 2 years ago

    Hi Phil, I am a beginner. Can you tell me whether the critic is still needed after training is completed? That is, is the actor alone enough after training?
    Thanks.

  • @herbertk9266
    @herbertk9266 4 years ago

    Thank you sir

  • @Falconoo7383
    @Falconoo7383 2 years ago

    Which TensorFlow version is good for this?

  • @alinouruzi5371
    @alinouruzi5371 3 years ago

    Thanks very much.

  • @davideaureli6971
    @davideaureli6971 3 years ago

    Hi @Phil, thank you for this amazing video. Just one question about your loss (the critic loss): is it possible that it explodes because of the delta**2? After that, the gradient gives me all NaN values. Any advice?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago

      Strange. What environment? Make sure the ln term isn't exploding.

    • @davideaureli6971
      @davideaureli6971 3 years ago

      @@MachineLearningwithPhil I have just noticed that the NaN values appear when one probability in our probs tensor goes to 0. Can we just add a small quantity to prevent this? And is that the reason for the NaN in the gradient, because we are taking the log of 0?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago

      Ln of 0 is undefined. You can just add some small value, yes.
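
      A minimal sketch of that fix (the epsilon value and the example probs tensor are my own, not from the video):

        import tensorflow as tf

        EPS = 1e-8
        probs = tf.constant([[0.0, 0.3, 0.7]])                     # one probability has collapsed to 0
        safe_log = tf.math.log(tf.clip_by_value(probs, EPS, 1.0))  # no -inf, so no NaN gradients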

    • @davideaureli6971
      @davideaureli6971 3 years ago

      @@MachineLearningwithPhil Another question: in a problem where the values predicted by the actor are in a completely different range from the critic's (actor --> (0,1) while critic --> (-80,160)), is it really difficult to find the optimal combination with just one network?

    • @Jnaaify
      @Jnaaify 3 years ago

      @@davideaureli6971 Hi! I have the same problem, but I can't get it fixed. How did you do it? Thanks!

  • @ellenamori1549
    @ellenamori1549 3 years ago

    Thank you for the tutorial. One question: in your application the agent learns after every step it takes in the environment. How about learning in a batch after each episode?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago

      Generally that's not the way it's done with actor-critic. It's a temporal difference method, so it learns at each time step. Policy gradient is based on Monte Carlo methods and does what you described.
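
      To make the contrast concrete, a Monte Carlo style update would wait until the episode ends and compute discounted returns, roughly like this (a sketch, not code from the video):

        def discounted_returns(rewards, gamma=0.99):
            # work backwards through the episode: G_t = r_t + gamma * G_{t+1}
            G, returns = 0.0, []
            for r in reversed(rewards):
                G = r + gamma * G
                returns.append(G)
            return list(reversed(returns))

        # Actor-critic instead updates every step from the one-step target r + gamma * V(s').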

    • @ellenamori1549
      @ellenamori1549 3 years ago +1

      @@MachineLearningwithPhil Thank you!

  • @Falconoo7383
    @Falconoo7383 2 years ago

    AttributeError: module 'tensorflow' has no attribute 'contrib'
    Can anybody help me to solve this error?

  • @nawaraghi
    @nawaraghi 3 years ago

    I really appreciate your explanation.
    I tried to run it on FrozenLake and NChain, and it didn't work, although I changed input_dims from 8 to 1. Any hints or help on how I can alter the code to work on FrozenLake?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago +1

      Frozen lake isn't an appropriate environment for the algorithm. FL is for tabular methods, not approximate ones. In other words, neural nets won't really work.

  • @fantashio
    @fantashio 4 years ago

    Great! Keep it up

  • @SaurabDulal
    @SaurabDulal 3 years ago

    I don't see action_space being used anywhere in the code; don't we need it when sampling the action?

  • @KrimmStudios
    @KrimmStudios 3 years ago

    Thank you! Just wondering, where are the learning rates alpha and beta implemented?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago +1

      21:35
      Learning rates come into play when we compile the models with an optimizer. I didn't specify a learning rate so it uses the default values.
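
      If you do want explicit learning rates, they can be passed to the optimizer at compile time. A toy sketch (the stand-in network and the 5e-4 value are placeholders, not from the video):

        import tensorflow as tf
        from tensorflow.keras.optimizers import Adam

        model = tf.keras.Sequential([tf.keras.layers.Dense(2)])   # stand-in network
        # Keras Adam defaults to a learning rate of 1e-3; this overrides it.
        model.compile(optimizer=Adam(learning_rate=5e-4), loss="mse")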

    • @KrimmStudios
      @KrimmStudios 3 years ago

      @@MachineLearningwithPhil I see. Thanks again

  • @LidoList
    @LidoList 3 years ago

    Thanks for the great tutorial. Does one game mean one episode?

    • @raffaeledelgaudio2724
      @raffaeledelgaudio2724 3 years ago

      Usually, yes.

    • @papersandchill
      @papersandchill 2 years ago

      An episode typically ends when the environment is reset. (This never happens in the real world, unless the real world itself is a simulator, like a game, for example chess.)

  • @Falconoo7383
    @Falconoo7383 2 years ago

    ImportError: This version of TensorFlow Probability requires TensorFlow version >= 2.9; Detected an installation of version 2.8.0. Please upgrade TensorFlow to proceed. I am getting this error; can anybody help me solve it? I also upgraded TensorFlow but got the same error again. @Machine Learning with Phil

  • @ashishsarkar3998
    @ashishsarkar3998 4 years ago

    Please make another tutorial on deep Q-learning with TensorFlow 2.

    • @ShortVine
      @ShortVine 4 years ago

      He already made it; check his channel.

  • @filipesa1038
    @filipesa1038 2 years ago

    "Probability for getting head multiplied by the reward of getting head" - In my case is most likely zero

  • @tunestar
    @tunestar 4 years ago +1

    Every time people show one of those math formulas on YouTube, a baby panda dies somewhere in the world.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  4 years ago

      Call the WWF!

    • @alexandralicht1023
      @alexandralicht1023 4 years ago

      @@MachineLearningwithPhil I am looking to set up render() (i.e. env.render()) for an RL environment. Do you have any videos related to this?

  • @birinhos
    @birinhos 3 years ago

    What is the game?
    Edit: ok, CartPole...

  • @alinouruzi5371
    @alinouruzi5371 3 years ago

    Good!

  • @SpringerJen
    @SpringerJen 7 months ago

    hi