What do Reinforcement Learning Algorithms Learn - Optimal Policies

  • Published Dec 22, 2024

COMMENTS • 47

  • @deeplizard  6 years ago +10

    Check out the corresponding blog and other resources for this video at:
    deeplizard.com/learn/video/rP4oEpQbDm4

  • @tomashorych394  3 years ago +17

    I am glad that this series exposes some theory instead of cheesy metaphors. Well done.

  • @tusharnalawade8990  3 years ago +9

    This is just the series I needed to watch. It is so good! Well done.

  • @raminbakhtiyari5429  3 years ago +4

    Just a few people can understand the value of this series.
    Thank you.

  • @tingnews7273  6 years ago +12

    I must admit that at first, seeing so many stars confused me. Watching the video confused me even more (maybe too many concepts), so I decided to read the post.
    What I learned:
    1. Reinforcement learning learns optimal policies.
    2. What an optimal policy is.
    3. What the optimal state-value function is.
    4. What the optimal action-value function is.
    5. What a Q-value is: calculated with the Bellman optimality equation.
    My questions:
    1. I think this will become clearer later, but it comes up front, so I should ask or note it. When we bring the policies and value functions together: policies are probabilities, while value functions return values. How can we put them together? A sequence makes it even harder (choosing an action changes the state, so the value changes too; I feel like my brain is melting down...).
    2. The optimal Vpi(s) must hold for all states s. Is that possible, or is it good enough for most s? The same question applies to q*(s,a).
    3. Reading the post, I found the description "In other words, q∗ gives the largest expected return achievable by any policy π for each possible state-action pair." This confuses me. All we did was find the optimal policy, so why does it say any policy?
    4. The Bellman optimality equation. So many questions:
    4.1. Does the Bellman equation calculate the Q-value?
    4.2. Do s' and a' mean the next state and action?
    4.3. Does it mean the expected return is the reward for the action at time t in the state at time t, plus the max value of the next step? It seems kind of recursive; my brain is overloading....
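
    For reference, the Bellman optimality equation these questions (4.1-4.3) refer to, as presented in the video and blog, is:

        q_*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(s', a') \right]

    Here \mathbb{E} denotes expected value, \gamma is the discount rate, and s' and a' are the next state and action. The equation is recursive because q_* appears on both sides, which is the "recurrent" feeling described in 4.3.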

  • @ghazalakhalid63  1 year ago +1

    It's an excellent video, making something so complex so easy to understand.

  • @dmitriys4279  5 years ago +15

    What does E mean in the formula? [E]xpectation?

    • @deeplizard  5 years ago +10

      Correct! Expected value.

  • @s25412  3 years ago

    I think there's a typo at 3:30 where "E[R_(t+1)...]" should be "E[R_t...]." This is the reward you get for being in the current state (e.g., a robot wakes up in current state S and is given a reward, if any, simply for waking up in that state).

  • @user-or7ji5hv8y  6 years ago +4

    Can an optimal policy be defined not only in terms of expected return but also in terms of the variance of returns? That is, lower variance is preferred over higher variance. Not sure if that has been studied.

    • @deeplizard  6 years ago +3

      Hm... In the traditional sense of MDPs, I'm not sure. That's an interesting thought though.

    • @chrisfreiling5089  6 years ago +3

      Ahh! So you seem to be suggesting that rewards are not necessarily additive! Very interesting! Your second million dollars is not as valuable as your first!

  • @frommarkham424  18 days ago +1

    Stuff's getting real from here on 😁😁😁😁😁

  • @tallwaters9708  5 years ago +3

    If I understand this correctly, the Bellman equation only considers the current timestep and one more timestep ahead?

    • @deeplizard  5 years ago +5

      Looking at the equation directly, yes. However, indirectly, notice that to calculate q*(s,a), we need q*(s',a') where s' and a' are the _next_ state and action. Therefore, to calculate q*(s',a'), we'll need q*(s'',a''), where s'' and a'' are the _next next_ state and action, and so on.

    • @tallwaters9708  5 years ago

      @deeplizard Thanks! OK, so it's a recursive function, I guess? It would be hard to stumble upon just one reward state if the state space were huge, I think?

    • @vishalpoddar  4 years ago

      @deeplizard So is it a recursive function? Or does it learn to predict the Q-value without recursion?

    • @stydras3380  4 years ago +1

      @vishalpoddar Yes, it's recursive. Exactly this allows us to update the Q-function over and over: calculate the right-hand side for our current approximation of the value function, and then update q according to that. If you iterate that process, it will converge to the optimal function (a sketch of this iteration appears after this thread).

    • @Tony-Man  2 years ago

      @deeplizard Can you elaborate more on the rationale behind always looking one step forward instead of some other number of steps? I understand that the computational demand goes up by a lot, even if you look just 2 steps forward. What happens to the learning performance, all other parameters being equal?
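
      Following up on the recursion discussed in this thread, here is a minimal sketch of tabular Q-value iteration on a made-up two-state MDP (the MDP, its numbers, and the variable names are illustrative assumptions, not taken from the video). It shows how the one-step Bellman backup, applied repeatedly, effectively propagates value information from arbitrarily many steps ahead:

          import numpy as np

          # Made-up 2-state, 2-action MDP (illustrative only, not from the video).
          # transitions[s, a, s2] = P(s2 | s, a); rewards[s, a] = expected immediate reward.
          transitions = np.array([
              [[0.8, 0.2], [0.1, 0.9]],   # from state 0, actions 0 and 1
              [[0.5, 0.5], [0.0, 1.0]],   # from state 1, actions 0 and 1
          ])
          rewards = np.array([
              [1.0, 0.0],
              [0.0, 2.0],
          ])
          gamma = 0.9

          # Q-value iteration: repeatedly apply the one-step Bellman optimality backup
          #   q(s, a) <- r(s, a) + gamma * sum_{s'} P(s' | s, a) * max_{a'} q(s', a')
          q = np.zeros((2, 2))
          for _ in range(1000):
              q_new = rewards + gamma * transitions @ q.max(axis=1)
              if np.abs(q_new - q).max() < 1e-8:   # converged (approximately) to q*
                  break
              q = q_new

          print("q* estimate:\n", q)
          print("greedy (optimal) action per state:", q.argmax(axis=1))

      Looking explicitly more than one step ahead is possible, but the one-step backup above already accumulates multi-step information across iterations, which is one reason it is the standard starting point.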

  • @sahibsingh1563  5 years ago +1

    Your videos are short and
    very informative
    ABSOLUTE GEM :)

  • @Hero4ever97  2 years ago

    Could someone help me understand the difference between the optimal policy and the optimal Q-function? The former should be the optimal mapping which, given a state, tells me the best action to take in order to get the maximum return; the latter should be the function that, given any state and action, returns the best expected return? I am very confused.
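
    One standard way to relate the two (general MDP facts, consistent with the definitions in the video): the optimal Q-function q_* scores every state-action pair, and an optimal policy can be recovered from it by acting greedily, i.e. choosing in each state an action with the highest optimal Q-value:

        \pi_*(s) \in \arg\max_{a} q_*(s, a), \qquad v_*(s) = \max_{a} q_*(s, a)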

  • @priyankrajsharma  5 years ago +1

    Great job... very easy to understand.

  • @lion795  2 years ago +1

    Perfect, señorita

  • @MrCmon113  5 years ago +2

    Why does the blog speak of "the" optimal policy? According to the definition there might be many, or none.
    According to the definition it is not even clear whether there exists a policy that is strictly better than every other policy.

    • @stydras3380  4 years ago

      I think there is a theorem that for Markov decision processes the optimal value function exists and is comparable to all other value functions (and of course bigger; that's why we are doing it). This directly implies that two such optimal value functions (let's call them v and v*) have to be equal: for all states s we have v(s) ≤ v*(s) since v* is optimal, but also v*(s) ≤ v(s) since v is optimal, so v(s) = v*(s) for all states s, and thus v = v*.
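
      In symbols, a sketch of the argument above (using the ordering on policies from the video, where \pi \ge \pi' means v_\pi(s) \ge v_{\pi'}(s) for every state s):

          v_*(s) = \max_{\pi} v_{\pi}(s) \ \text{for all } s \in S; \quad \text{if } v \text{ and } v' \text{ are both optimal, then } v(s) \le v'(s) \le v(s) \text{ for every } s, \text{ hence } v = v'.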

  • @asdfasdfuhf  4 years ago +1

    Excellent content once again

  • @Rotem0  2 years ago

    I think there is a mistake when the narrator says that s' is the best possible next state. If I understand correctly, the expectation is over the distribution of possible next states s' (since they depend on the randomness of the environment when it is provided with s and a), and the "max" expression is over the best possible a'.

  • @a_sobah  5 years ago +1

    Love these videos, keep it up.

  • @sbixpolo  2 years ago

    Vpi(s) >= Vpi'(s) for all s in S
    I expected this instead: sum(Vpi(s)) >= sum(Vpi'(s)) ...
    Thanks for your videos :)

  • @chrisfreiling5089  6 years ago +1

    Ok, not trying to be difficult, just trying to understand. So correct me if I'm wrong. But you paraphrase the Bellman equation as saying that the expected return is the reward.... plus the "maximum expected discounted return that can be achieved from any possible next state-action pair (s',a')". I don't believe this. Shouldn't it be the "expected value over the next state s' of the maximum value over the next action a'"? The point being that "expected" and "maximum" are in the wrong order and the probability distribution is over the next state.

    • @deeplizard  6 years ago +5

      I don’t think you’re being difficult. Your comments show that you’re actually putting your own thoughts into this stuff :)
      I could’ve been more precise in my paraphrasing. Let me clarify. First, from an earlier video/blog on MDPs where we touch on transition probabilities (more in the blog than the video), we talk about how the action a that is selected from a given state s is from a set of actions A(s) that can be taken from s. A(s) is a subset of all possible actions that can be taken in the environment and has a probability distribution over s.
      deeplizard.com/learn/video/my207WNoeyA
      Next, from the max term used in the Bellman equation, we can see from the subscript that the maximization is occurring over all the next possible actions a’ that can be taken from s’. In other words, given that we end up transitioning to state s’, which action a’ from the set of actions that can be taken from s’ is going to yield the max return?
      Your phrasing of "plus the expected value over the next state s' of the maximum value over the next action a'" works to describe this. I'm trying to think of a way to say this a bit more intuitively and update the blog with it. Maybe "plus the maximum discounted return that can be achieved from the next state s’ over all possible actions a’ in A(s’)."
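
      For readers following this exchange: writing the expectation out over the environment's dynamics p(s', r \mid s, a) (standard MDP notation, not used explicitly in the video) makes the ordering clear. The expectation is taken over the next state s' and reward r, while the max over the next action a' sits inside it:

          q_*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \;\middle|\; S_t = s, A_t = a \right]
                    = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]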

  • @ProfessionalTycoons  6 years ago +1

    amazing

  • @louerleseigneur4532  4 years ago +1

    Thank you, thank you

  • @MrGenbu  5 years ago

    My head starts to burn, I mean it, whenever you start talking about functions and those terms 3:28

    • @deeplizard  5 years ago +1

      Make sure that you've studied the episodes in the series that come before this one, as they build up to the math that we're using in this episode. Also, to get a firmer grasp on the math, you can go at a slower pace by spending time on the written blog format of each episode here: deeplizard.com/learn/playlist/PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv

  • @1414tyty  3 years ago +1

    {
      "question": "How can we determine an action-value function is optimal?",
      "choices": [
        "For any state action pair, our function produces the expected reward for taking that action plus the maximum discounted return thereafter.",
        "For any state action pair, our function yields the maximum future rewards.",
        "For any state action pair, our function produces the reward for taking that action.",
        "For any state action pair, our function yields the discounted rewards of following the optimal policy."
      ],
      "answer": "For any state action pair, our function produces the expected reward for taking that action plus the maximum discounted return thereafter.",
      "creator": "Tyler",
      "creationDate": "2021-03-03T23:16:50.005Z"
    }

    • @deeplizard  3 years ago +1

      Thanks, Tyler! Just added your question to deeplizard.com/learn/video/rP4oEpQbDm4 :)

  • @frommarkham424  18 days ago +1

    Damn these videos are delicious 😋 thank you

  • @chrisfreiling5089  6 years ago +2

    Suggestion: Seems to me that there is some burden to show an optimal policy exists and has the properties that you claim. A "proof" would be simple enough and would demonstrate the purpose for assuming that the set of states and the set of actions are both finite. It would also use some recursive formula like the Bellman equation, so it would be a preview of that equation as well.

    • @deeplizard  6 years ago +2

      Thanks for the suggestion. I’ll think on this further and consider writing up a proof and adding it to the blog.

  • @tinyentropy  2 years ago

    I am not convinced ;) The argument mentioned is that "since the agent follows the optimal policy, the next state s' satisfies the condition that the best next action (w.r.t. the expected reward) can be taken."
    However, we are conditioning on the action a here, which effectively means we are not following the optimal policy. The optimal policy samples a given s, meaning that not every possible a will lie on the optimality path.
    In general, I don't understand how we get away from greedy behavior here. I'm thinking of local instead of global optimization maxima.

  • @hdluktv3593  3 years ago

    We progress the following: WHAT THE FUCK DOES E MEAN?!

  • @fahdsaad2012  3 years ago +2

    This is very confusing... not clear at all.

  • @Throwingness  3 years ago

    Intros are too long. 24 seconds is too long...