Policies and Value Functions - Good Actions for a Reinforcement Learning Agent

  • Published 22 Dec 2024

COMMENTS • 94

  • @deeplizard
    @deeplizard  6 years ago +13

    Check out the corresponding blog and other resources for this video at:
    deeplizard.com/learn/video/eMxOGwbdqKY

  • @DanielWeikert
    @DanielWeikert 6 years ago +49

    I have said it before but I feel obliged to say it again with each new video. Your videos are awesome. I really like the explanations starting from scratch and then continuously building up. Really helpful and highly appreciated

    • @deeplizard
      @deeplizard  6 years ago +2

      We really appreciate that, Daniel! Always happy to see your comments :)

  • @iqbalagistany
    @iqbalagistany 2 years ago +7

    "If you can't tell it simply, you don't understand it enough"
    It is the simplest explanation I have found on YouTube.
    Thanks a lot

  • @sashamuller9743
    @sashamuller9743 4 years ago +7

    I love the little snippets of real-life reinforcement learning at the end of the video; it keeps me inspired to continue!

  • @arjunkashyap8896
    @arjunkashyap8896 4 years ago +4

    Your channel is a gem. I have watched so many tutorials on ML, but yours is the one I fully understand.
    I'm gonna comment on every video I watch on your channel.
    Thanks a lot deeplizard, you're shaping future ML engineers.

  • @jy2592
    @jy2592 3 years ago +1

    This channel deserves more viewers for sure.

  • @sahanakaweerarathna9398
    @sahanakaweerarathna9398 6 years ago +23

    Hey deeplizard, we need a series about RNNs too. Please choose it as your next series.

  • @pepe6666
    @pepe6666 5 years ago +7

    I love how this video is so calm & soothing at the end, then BAM WACKY RUNNING CRAZY STICK MAN

  • @happyduck70
    @happyduck70 3 years ago +8

    I love the series, really clear explanations. Although it would be nice to have examples in between; it stays very abstract to me now, but if it were visualized during the explanation it would land better, in my opinion. But a big thanks for making this series on such complex material!

  • @hadiphonix3352
    @hadiphonix3352 4 years ago

    By a huge margin, you make the best tutorials in this field. Thanks a lot.

  • @joaopedrofelixamorim2534
    @joaopedrofelixamorim2534 3 years ago +1

    This channel saved my life! Many thanks!

  • @adityanjsg99
    @adityanjsg99 1 year ago +1

    I learned about RL from this series, which I couldn't from MIT and Stanford OpenCourseWare.

  • @techpy5730
    @techpy5730 4 years ago +2

    This series is golden! Thank you so much for creating it!

  • @slowonskor
    @slowonskor 4 years ago +11

    Regarding V(s) and Q(s,a):
    I want to point out what wasn't clear to me and what, in my opinion, needs to be emphasized to understand the whole thing:
    Vπ(s) expresses the expected value of following policy π forever when the agent starts following it from state s.
    Qπ(s,a) expresses the expected value of first taking action a from state s and then following policy π forever.
    The main difference, then, is that the Q-value lets you play out the hypothetical of taking a different action in the first time step than what the policy might prescribe, and then following the policy from the state the agent winds up in. (A small sketch of this distinction follows this thread.)

    • @yuhanyao2857
      @yuhanyao2857 4 years ago

      Hi! RL newbie here. I love your explanation. The "following policy pi forever" part is especially enlightening, but I am not sure what it means. Rather, I don't think I ever quite understood the difference between an action and a policy in an intuitive way. I would guess that policy pi, or the probabilities, changes after every time step, but does following policy pi forever mean the probabilities never change? Or does following policy pi mean always choosing the same action? I would love to hear more of your insight :)

    • @sender1496
      @sender1496 1 year ago +1

      @@yuhanyao2857 From what I understand, the policy gives a complete probability distribution from each state. You can think of it as first plugging in the state of the system, then receiving a probability distribution of actions. So what happens is this: after your model has taken a step, you get a new state, but if you plug this state into the policy function, then you get a probability distribution of the different possible actions. Just sample an action from this distribution and let your model take that action. This gives a new state, which you then plug into the policy again. In this sense, the policy completely describes how your model should act going forward. It doesn't matter that the state changes.

    • @yuhanyao2857
      @yuhanyao2857 1 year ago +1

      @@sender1496 Wow can't believe it's been 2 years since I posted this question. Thank you for your explanation!
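
  Note: below is a minimal Python sketch (not from the video) of the v_pi(s) vs q_pi(s,a) distinction discussed in the thread above, estimating both by Monte Carlo rollouts. The 5-cell corridor environment, the 70/30 "mostly go right" policy, and every name and constant here are invented purely for illustration.

      import random

      GAMMA = 0.9  # discount rate

      def step(state, action):
          # Deterministic corridor: states 0..4, move one cell left or right.
          # Entering state 4 gives reward +1; states 0 and 4 end the episode.
          next_state = state - 1 if action == "left" else state + 1
          reward = 1.0 if next_state == 4 else 0.0
          done = next_state in (0, 4)
          return next_state, reward, done

      def sample_policy_action(state):
          # A fixed stochastic policy pi: go right 70% of the time, left 30%.
          return "right" if random.random() < 0.7 else "left"

      def rollout_return(state, first_action=None):
          # One episode's discounted return G_t from `state`; if `first_action`
          # is given, take it first, then follow pi thereafter.
          g, discount, done = 0.0, 1.0, False
          action = first_action if first_action is not None else sample_policy_action(state)
          while not done:
              state, reward, done = step(state, action)
              g += discount * reward
              discount *= GAMMA
              action = sample_policy_action(state)
          return g

      def monte_carlo(start, first_action=None, episodes=20_000):
          # Average sampled returns to estimate the expected discounted return.
          return sum(rollout_return(start, first_action) for _ in range(episodes)) / episodes

      print("v_pi(2)        ~", round(monte_carlo(2), 3))           # follow pi from state 2 onward
      print("q_pi(2, right) ~", round(monte_carlo(2, "right"), 3))  # force 'right' once, then follow pi
      print("q_pi(2, left)  ~", round(monte_carlo(2, "left"), 3))   # force 'left' once, then follow pi

  As expected, v_pi(2) lands between the two q-values, since v_pi(s) is just the policy-weighted average of q_pi(s,a) over the actions.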

  • @rahulnawale8310
    @rahulnawale8310 1 year ago +1

    Very helpful videos. Very well explained in simple words. Thank you for creating such videos.

  • @hello3141
    @hello3141 6 years ago +3

    Really like your discussions: you get right to the heart of the matter. By the way, great video production and graphics! The cyclist is nifty!

  • @rashikakhandelwal702
    @rashikakhandelwal702 3 years ago +1

    Thank you for these series. I am so grateful to you . Its simply awesome! :)

  • @jamesbotwina8744
    @jamesbotwina8744 6 years ago +2

    Very nice summary! I’m taking the Udacity DRL course and this video helped me understand the distinction between value function components

    • @deeplizard
      @deeplizard  6 years ago

      Glad to hear that, James! Thanks for letting me know!

  • @sumeetdeshpande4825
    @sumeetdeshpande4825 8 months ago +1

    Nice editing!! Lovely!!

  • @tingnews7273
    @tingnews7273 6 years ago +3

    I watched the video and read the post.
    What I learned:
    1. Policy and value functions are two different things.
    2. Value functions come in two forms: one tells the agent how good a state is; the other tells the agent how good an action is in a certain state.
    3. A policy is the probability of the agent choosing a certain action in a certain state.
    4. The two value functions just confused me at first. After I read the post, it became clear. In short: one is for the state, one is for the action taken in that state.
    5. What is the Q-function? The state-action pair value function. I finally get it... thanks to the state-value function.

    • @deeplizard
      @deeplizard  6 years ago

      I'm happy to hear that the blog post is acting as a supplemental learning tool to the video!

  • @light-qn2jb
    @light-qn2jb 11 months ago +1

    I was done trying to learn this topic till I saw this series. The simplification, the structure, wow.

  • @tinyentropy
    @tinyentropy 5 years ago +4

    Maybe I missed it, but I think you haven't described what it means "to follow a policy". In other words, how do you make use of the probability distribution over actions in any given state? You could sample from it to determine your next action. But if you do so, how does that relate to the optimality criterion, since you won't be able to reach the global optimum then?
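
  Note: the following is a minimal Python sketch (not from the video) of what "following a policy" means mechanically: the policy is a fixed mapping from each state to a probability distribution over actions, and the agent simply samples from that distribution at every time step. The two states, two actions, and the probabilities are made up for illustration only.

      import random

      # pi(a | s): for each state, the probability of each action.
      policy = {
          "low_battery":  {"recharge": 0.9, "search": 0.1},
          "high_battery": {"recharge": 0.1, "search": 0.9},
      }

      def sample_action(state):
          # Draw one action a ~ pi(. | state).
          actions, probs = zip(*policy[state].items())
          return random.choices(actions, weights=probs, k=1)[0]

      # "Following pi" just means repeating this at every step; the mapping itself
      # (the table above) does not change while we evaluate the policy.
      state = "high_battery"
      for t in range(5):
          action = sample_action(state)
          print(f"t={t}: state={state}, action={action}")
          # A real environment would return the next state; here we just alternate for the demo.
          state = "low_battery" if state == "high_battery" else "high_battery"

  A deterministic policy is the special case where one action has probability 1 in each state, so sampling is not in conflict with optimality: an optimal policy can still be written in this form.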

  • @ramasamyseenivasagan4174
    @ramasamyseenivasagan4174 4 years ago +1

    Excellent work!!! Well defined concepts...

  • @xentox5016
    @xentox5016 3 years ago

    At 4:00, what exactly is G_t? Is it the value of the state? And what does E_pi[ ] return?

  • @mariaioannatzortzi
    @mariaioannatzortzi 4 years ago +1

    {
    "question": "With respect to what the value function is defined?",
    "choices": [
    "Value function is defined with all of the choices.",
    "Value function is defined with respect to the expected return.",
    "Value function is defined with respect to specific ways of acting.",
    "Value function is defined with respect to policy."
    ],
    "answer": "Value function is defined with all of the choices.",
    "creator": "marianna tzortzi",
    "creationDate": "2020-11-18T14:57:58.033Z"
    }

    • @deeplizard
      @deeplizard  4 years ago +1

      Thanks, Marianna! Just added your question to deeplizard.com/learn/video/eMxOGwbdqKY :)

    • @JJJ-ee5dc
      @JJJ-ee5dc 1 year ago

      @@deeplizard I am in a situation where I need to learn deep Q-learning in just 3 to 4 days. I have a deep learning background but don't know anything about RL. Your videos gave me so much information and intuition.

  • @priyambasu5529
    @priyambasu5529 4 years ago +2

    {
    "question": "What is the difference between a state value function and an action value function?",
    "choices": [
    "State value function tells us the correct state whereas action value function tells us the correct action to be taken in the state, both for a particular policy pie.",
    "State value function tell us the policy whereas action value function tells us the action and state.",
    "Both are the same.",
    "State value function tells us the state as well as action whereas action value function tell us only action for any state."
    ],
    "answer": "State value function tells us the correct state whereas action value function tells us the correct action to be taken in the state, both for a particular policy pie.",
    "creator": "whopriyam",
    "creationDate": "2020-04-09T11:32:01.613Z"
    }

    • @deeplizard
      @deeplizard  4 years ago

      Thanks, priyam! I changed the wording just a bit, but I've now just added your question to deeplizard.com/learn/video/eMxOGwbdqKY :)

  • @rominashams7280
    @rominashams7280 2 months ago +1

    These videos are perfect.

  • @PenAndSpecs007
    @PenAndSpecs007 2 years ago +1

    Amazing content!

  • @Bjarkediedrage
    @Bjarkediedrage 4 years ago +6

    I wish someone would explain all of this in a non-formal way, not using any notation, just rudimentary math and visualizations. I had the same experience when people tried to explain backpropagation and the chain rule. I now understand both concepts and how to implement them, but damn it was hard. I had to reverse engineer a simple neural network and really look at the code and step through it in order to understand what it actually does and how it works. It's frustrating that it's faster to learn it that way vs. listening to people trying to explain it in an abstract way. Why do we have to introduce so many terms and abstractions you have to remember? In the end, what the algorithm does is all multiplication, addition, and data manipulation. My programmer brain is just wired differently. I'm sure that once I understand all of this, I can throw away 90% of the terms and math introduced here and understand how it works intuitively, without remembering what MDP stands for... Same thing with NNs: I'm inventing my own neurons, playing with different ways of making them recurrent, giving them memory and crazy features, and I know very little about math and its notations.

  • @kareemmohamed4064
    @kareemmohamed4064 3 years ago +1

    Really Great, Thank You

  • @matthewchung74
    @matthewchung74 4 years ago +2

    What is the term E at 5:30 in the video? I can't find an explanation.

    • @deeplizard
      @deeplizard  4 years ago +3

      E used in this way means "expected value." In our specific case, the value we're referring to is the return, so we're looking at the expected return.
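
  Note: written out in the standard (Sutton & Barto-style) notation this series follows, the E in the two value-function definitions is a conditional expectation, where E_pi[ ] means "the expected value, given that the agent follows policy pi" and the bar | means "given that":

      v_\pi(s)   = \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right]
      q_\pi(s,a) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s,\ A_t = a \,\right]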

  • @kaierliang
    @kaierliang 4 years ago +1

    You are a lifesaver.

  • @lingchen8849
    @lingchen8849 3 years ago

    Thanks for your video. I watched it twice but still don't understand why there should be a big "E" for expected value @4:41. I am confused because I found other references that do not have that.

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 6 years ago +2

    Awesome video! Really clear!

  • @evolvingit
    @evolvingit 5 years ago +2

    Super!! Going awesome!!!

  • @fatemehoseini7614
    @fatemehoseini7614 8 months ago

    Thanks for the great explanation, but it would be good to discuss the probability distribution of the policy a bit more; I just did not understand that concept. The rest was great 🙏

  • @carchang4843
    @carchang4843 1 year ago

    What does the big E mean at 4:04, since the expected return is Gt?

  • @asdfasdfuhf
    @asdfasdfuhf 4 years ago +1

    3:55 I found it confusing/weird that *the expected return starting from s* is equal to *the discounted return starting from s*, since the two quantities obviously aren't always equal.
    Is this an error? Or perhaps just a way of saying that $v_\pi (s)$ is equal to either *expected return starting from s* or *discounted return starting from s*?

    • @deeplizard
      @deeplizard  4 years ago +6

      Hey Sebastian - Maybe it would've been clearer if I had said "expected discounted return" instead of just "expected return." From episode 3 onward, as long as we don't explicitly state otherwise, "return" means "discounted return." So, the value of state s under policy pi is equivalent to the expected discounted return from starting at state s and following pi. Hope that helps!
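
  Note: spelled out with the discounting made explicit (gamma is the discount rate), the G_t inside that expectation is the discounted return, so "expected return" here means the expected discounted return:

      G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
          = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}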

  • @dmitriys4279
    @dmitriys4279 5 years ago +4

    What is E in the value function formulas?

    • @deeplizard
      @deeplizard  5 years ago +3

      Expected value

    • @SLR_96
      @SLR_96 4 years ago

      @@deeplizard So it's the expected value of the expected return right?

  • @hugaexpl0it
    @hugaexpl0it 1 year ago +1

    Are you using those videos at the end to give us rewards for completing each video?
    If so, that's pretty meta and impressive.

  • @prathampandey9898
    @prathampandey9898 2 years ago

    What is "E" in the formula, and what does the "|" symbol (i.e., the bar) mean in the formula?
    I would suggest providing the meaning of each symbol below each formula.

  • @justinunion7586
    @justinunion7586 5 years ago +1

    "If an agent follows policy 'pi' at time t, then pi(a|s) is the probability that At = a if St = s. This means that, at time t, under policy 'pi', the probability of taking action a in state s is 'pi'(a|s)"

  • @GauravSharma-ui4yd
    @GauravSharma-ui4yd 4 years ago +1

    Great job and an awesome explanation. Can you please give the link to the DeepMind video snippet at the end?

    • @deeplizard
      @deeplizard  4 years ago

      ua-cam.com/video/t1A3NTttvBA/v-deo.html

  • @ProfessionalTycoons
    @ProfessionalTycoons 6 years ago +2

    Thank you for this!

  • @Hyuna11112
    @Hyuna11112 2 years ago

    Are there any reinforcement learning problems not solved with Markov decision processes?

  • @hanserj169
    @hanserj169 5 years ago +1

    I am a complete beginner! Could someone please explain why we need both a state-value function and an action-value function? It seems to me that the latter is enough, since it can map the state, the action, and the reward for that particular pair. Could I select one of them?

    • @deeplizard
      @deeplizard  5 years ago

      The state-value function doesn't account for a given action, while the action-value function does. Going forward, we stick mostly with the action-value function.

    • @hanserj169
      @hanserj169 5 years ago +1

      @@deeplizard thanks! Great content!

  • @mateusbalotin7247
    @mateusbalotin7247 3 years ago

    Thank you!

  • @pepe6666
    @pepe6666 5 years ago +1

    I am somewhat confused about rewards, though. Say we're doing something complicated like playing an Atari game. How do you program in rewards? Winning and losing a game is kinda all there is, right?

    • @paulgarcia2887
      @paulgarcia2887 5 years ago

      If you are doing something like Atari Breakout, then every time the agent hits a block it can get a +1 reward. It doesn't necessarily need to beat the level before it gets a reward.

    • @pepe6666
      @pepe6666 5 years ago +1

      @@paulgarcia2887 The answer I was looking for was that when the agent gets its +100 reward for winning, it stores the state before the winning state, what to do there, and the potential reward for that state, which creates new data for that state, and it propagates backwards that way (see the sketch after this thread). Cheers though; this was a difficult thing for me to learn.

    • @43_damodarbanaulikar71
      @43_damodarbanaulikar71 5 years ago +2

      You should check out Siraj Raval; he has an explanation as well as a practical video where he is using the StarCraft game.

    • @pepe6666
      @pepe6666 5 years ago

      @@43_damodarbanaulikar71 Cool man, I will. Cheers.
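
  Note: the "store the value and propagate it backwards" idea described in this thread is essentially what a tabular Q-learning update does (one common approach; it is not something shown in this particular video). Below is a minimal Python sketch: the 6-state chain, the +100 terminal reward, and all constants and names are made up for illustration.

      import random
      from collections import defaultdict

      GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.2
      ACTIONS = [0, 1]              # 0 = step left, 1 = step right
      Q = defaultdict(float)        # Q[(state, action)] estimates, all start at 0.0

      def step(state, action):
          # Made-up chain of states 0..5; entering state 5 ends the episode with reward +100.
          nxt = max(0, state - 1) if action == 0 else state + 1
          if nxt == 5:
              return nxt, 100.0, True
          return nxt, 0.0, False

      def choose(state):
          # Epsilon-greedy selection from the current Q estimates, ties broken randomly.
          if random.random() < EPSILON:
              return random.choice(ACTIONS)
          best = max(Q[(state, a)] for a in ACTIONS)
          return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

      for episode in range(300):
          state, done = 0, False
          while not done:
              action = choose(state)
              nxt, reward, done = step(state, action)
              # The next state's current value estimate pulls part of the terminal
              # reward one step further back toward the start on every visit.
              target = reward if done else reward + GAMMA * max(Q[(nxt, a)] for a in ACTIONS)
              Q[(state, action)] += ALPHA * (target - Q[(state, action)])
              state = nxt

      # Roughly 100 * GAMMA**(steps to the goal) once the values have propagated back.
      print({s: round(max(Q[(s, a)] for a in ACTIONS), 1) for s in range(5)})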

  • @wtfhej
    @wtfhej 5 years ago +1

    Thanks!

  • @tallwaters9708
    @tallwaters9708 5 years ago

    Thanks for the videos. What do you mean by "following policy pi thereafter"? If I am in a state, take an action with a given policy in that state, and then transition to a new state, am I still following the same policy in that state? Shouldn't the policy change in each state? This really confuses me; sorry if my question is not clear. Are you saying that when calculating the value functions, we always use the same policy in each state?
    Thanks again,

    • @tallwaters9708
      @tallwaters9708 5 years ago +1

      Wait, I think I get it: the policy never changes during a value calculation. The policy is essentially like a Q-table and sums up how an agent will act in an environment. This stays the same while the values are calculated...

  • @sametozabaci4633
    @sametozabaci4633 5 years ago

    Thanks for the explanatory content.
    It's a minor thing, but you may want to change it: on the blog page, the word 'following' is written twice under the action-value function topic, in the second line of the first paragraph.

    • @deeplizard
      @deeplizard  5 years ago +2

      Thanks, samet! Appreciate you spotting this. I will get this fix out in the next website update!

  • @ravishankar2180
    @ravishankar2180 6 years ago +1

    I thought a policy was "a collection of all the actions taken together in all the states in a complete lifetime to achieve some overall reward", and that policies were "many lifetimes with many different overall rewards". I thought the optimal policy was "the policy among all policies with the highest reward".
    I didn't know that a policy was only for "a single state with different action choices".

    • @deeplizard
      @deeplizard  6 years ago +3

      Hey ravi - Yes, a policy is a function that maps each state in the state space to the probabilities of taking each possible action.

    • @ravishankar2180
      @ravishankar2180 6 years ago +1

      Thank you, miss deep-lizard.

  • @haneulkim4902
    @haneulkim4902 3 years ago

    Amazing tutorial, I appreciate it very much. Would it be possible to lower the lizard sound in the intro? It seems louder than your voice and therefore hurts my ears... :(

    • @deeplizard
      @deeplizard  3 years ago

      Thanks, Haneul! The intro has been modified in later episodes.

  • @debayanganguly838
    @debayanganguly838 5 years ago

    The videos which I watched are awesome... but most of the videos are not loading and keep buffering... I have no idea why, because all other videos on YouTube are working well.

    • @deeplizard
      @deeplizard  5 years ago

      Thanks, Debayan. Aside from an issue with internet connection/speed, I'm not sure what else could cause the issue. The videos are all uploaded in standard 1080p HD quality. You could try lowering the quality to 720p to see if that helps load the videos. Also, you can use the corresponding blogs to the videos as well:
      deeplizard.com/learn/playlist/PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv

    • @debayanganguly838
      @debayanganguly838 5 years ago

      I know this channel has very good quality content, but my issue is only with your channel's videos; all other channels and videos are working and loading well... Is there a copy of this video anywhere else other than YouTube?

    • @debayanganguly838
      @debayanganguly838 5 years ago

      This issue also occurred on some of your videos in the Deep Learning Neural Networks series... not all, only some videos are having problems... It may be due to some YouTube policies; there is no issue with your 1080p quality videos.

  • @ogedaykhan9909
    @ogedaykhan9909 5 years ago +1

    awesome

  • @louerleseigneur4532
    @louerleseigneur4532 4 years ago

    Thanks

  • @abhishek_raghav
    @abhishek_raghav 1 year ago

    Great, only missing examples, especially about return, policies, and value.

  • @raminbakhtiyari5429
    @raminbakhtiyari5429 3 years ago

    just traffic.

  • @aminezitoun3725
    @aminezitoun3725 6 years ago +1

    This is scary, dude... xD

    • @deeplizard
      @deeplizard  6 years ago

      Haha in what way?

    • @aminezitoun3725
      @aminezitoun3725 6 years ago

      @@deeplizard I like the series and all (still waiting for a new vid xD), but knowing what AI can do, and the fact that I just saw a vid about how 4 AIs killed 29 humans in Japan, is just scary tbh xD

  • @Sickkkkiddddd
    @Sickkkkiddddd 1 year ago

    You gotta use examples, man. You gotta use examples. A bunch of notation without practical examples to tie it to makes folks tune out. That's the only way to intuitively understand the ideas. This is the fourth video and you've barely touched on any 'game'. Aren't games the 'example' applications for RL? These concepts aren't difficult. They just need to be explained by teachers who are passionate about the student experience and want them to learn. All the dedication on the student's part is absolutely worthless if the teacher or material isn't coming across. I have downloaded and deleted a bunch of textbooks because they were all filled with abstract nonsense that left me frustrated with migraines. Please find a way to break this stuff down with game or real-world examples in a language a dummy can understand. I'd be very pissed if I paid for your course and got a bunch of abstract notation thrown at my face. PLEASE PROVIDE EXAMPLES. PLEASE!!!
    Save the crap notation for later and provide examples. You started with a raccoon agent in the first video. Where did he go? Why isn't he utilised in subsequent videos? Sigh