CS885 Module 2: Maximum Entropy Reinforcement Learning

  • Published 11 Dec 2024

COMMENTS • 15

  • @Miaumiau3333
    @Miaumiau3333 1 year ago

    This is SO GOOD, very clear and straight to the point

  • @viraatchandra8498
    @viraatchandra8498 3 years ago

    best video out there for the SQL (soft Q-learning) math, particularly some of the derivations that aren't properly explained in the paper

  • @datascience_with_yetty
    @datascience_with_yetty 4 years ago +1

    You’re a great teacher. Please do more videos. Thanks

  • @InquilineKea
    @InquilineKea 2 years ago

    I AM THE LIVING EMBODIMENT OF THIS

  • @mrbeancanman
    @mrbeancanman 3 years ago +1

    at 36:36 (SAC), why do we need to use a network to approximate softmax(Q_w / λ) for the policy? (Can we not just use it directly?)
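
    (A sketch of the usual SAC answer, using the SAC paper's notation rather than anything stated in this video — α for the temperature the comment writes as λ, π_φ for the policy network, Q_w for the critic. For continuous actions the softmax policy

        \pi(a|s) \propto \exp( Q_w(s,a) / \alpha ), \qquad Z(s) = \int \exp( Q_w(s,a) / \alpha ) \, da

    has an intractable normalizer Z(s), and we also need a policy we can cheaply sample from and differentiate through. SAC therefore fits a tractable network π_φ by minimizing the KL projection

        \pi_{\text{new}} = \arg\min_{\pi' \in \Pi} D_{KL}\!\left( \pi'(\cdot|s) \,\middle\|\, \exp( Q_w(s,\cdot)/\alpha ) / Z(s) \right),

    which in practice becomes the objective J_\pi(\phi) = E_{s \sim D,\, a \sim \pi_\phi} [ \alpha \log \pi_\phi(a|s) - Q_w(s,a) ]. With a small discrete action set one can indeed use the softmax of Q directly, which is essentially soft Q-learning.)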

  • @phanindraparashar8930
    @phanindraparashar8930 3 years ago

    Wow!!!! Amazing. It really helped me a lot. Can you do a video on Option Critic and Hierarchical Reinforcement Learning?

  • @miroslavkosanic2917
    @miroslavkosanic2917 4 years ago

    Hi, I wanted to ask about 17:31: if the sum in the denominator over all actions in any state is equal to 1 (which would follow from the definition of a stochastic policy), wouldn't that mean we are actually just dividing the numerator by 1, so that the softmax isn't needed in theory and is only there for practical/implementation reasons?

    • @miroslavkosanic2917
      @miroslavkosanic2917 4 years ago

      And also, shouldn't we have used that same condition, that the policy's probabilities over the different actions in a state sum to 1, in the objective function?

    • @astaragmohapatra9
      @astaragmohapatra9 3 years ago +1

      @@miroslavkosanic2917 The softmax values over all actions combined sum to 1, not the denominator. Not entirely sure of this

    • @yueying9083
      @yueying9083 2 years ago +1

      The sum of π over actions should be 1, so it is a constrained optimization and the derivation should use a Lagrangian (worked out below).
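
      Worked out, assuming the per-state maximum-entropy objective from the lecture with temperature α (notation chosen here, not copied from the slides):

          \max_\pi \; \sum_a \pi(a|s) \, [\, Q(s,a) - \alpha \log \pi(a|s) \,] \quad \text{s.t.} \quad \sum_a \pi(a|s) = 1

          \mathcal{L} = \sum_a \pi(a|s) [ Q(s,a) - \alpha \log \pi(a|s) ] + \lambda \Big( 1 - \sum_a \pi(a|s) \Big)

          \frac{\partial \mathcal{L}}{\partial \pi(a|s)} = Q(s,a) - \alpha \log \pi(a|s) - \alpha - \lambda = 0
          \;\Rightarrow\; \pi(a|s) = \frac{ \exp( Q(s,a)/\alpha ) }{ \sum_{a'} \exp( Q(s,a')/\alpha ) }

      So the denominator is a sum of exponentiated Q-values, not a sum of probabilities: it is generally not 1, and it is exactly the normalizer that forces the resulting π(·|s) to sum to 1.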

  • @thomashirtz
    @thomashirtz 3 years ago

    This course is amazing! Very nice job
    (I was struggling with soft policies; now I think I understand much more :) )
    I have some quick questions though:
    - at 16:15 you say that the objective is concave, but why is that?
    - at 25:30 we try to prove that Q grows for any pair (a,s). I can understand that the sum of Q(a,s) over actions will grow, but I am confused about the "any pair" part: if at one state the Q value was overestimated and we then adjust it as we learn, the Q value of that action will decrease and the Q values of the other actions will increase, no? I am just confused about how the Q values can monotonically increase, even for bad actions
    - at 31:00 I am very confused about how repeatedly applying the derivation can give Q^{pi+1}
    - at 31:39, where does the epsilon come from? (I didn't see it anywhere before)
    Thanks again for posting this class :)

    • @astaragmohapatra9
      @astaragmohapatra9 3 years ago +1

      1) Because you can see the derivative has the form dJ/dpi = y − (mx + 1), which decreases as pi grows, so the objective integrates to something like −x², hence concave (a fuller derivation is sketched below)
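
      Spelled out, assuming the same per-state objective with temperature α (the reply's y, m, and x correspond roughly to Q, α, and log π):

          J(\pi) = \sum_a \pi(a|s) [ Q(s,a) - \alpha \log \pi(a|s) ]

          \frac{\partial J}{\partial \pi(a|s)} = Q(s,a) - \alpha \log \pi(a|s) - \alpha, \qquad
          \frac{\partial^2 J}{\partial \pi(a|s)^2} = -\frac{\alpha}{\pi(a|s)} < 0

      The Hessian is diagonal with strictly negative entries for π(a|s) > 0, so J is concave in π, which is the claim at 16:15.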

  • @wunkewldewd
    @wunkewldewd 3 years ago +1

    Thanks for the video! But I think slide 3 is slightly wrong, or at least phrased confusingly: in some cases the optimal policy is not deterministic. For example, in rock-paper-scissors the optimal policy is stochastic.

    • @DhruvaKartik
      @DhruvaKartik 2 years ago

      For single-agent problems where we are trying to maximize the expectation of a reward, there will always exist a deterministic optimal policy. Rock-paper-scissors is a two-player zero-sum game, so it does not fall under the standard reinforcement learning setup.
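
      A quick way to see the first sentence, assuming Q* denotes the optimal action-value function of the MDP: for any stochastic policy π and any state s,

          \sum_a \pi(a|s) \, Q^*(s,a) \;\le\; \max_a Q^*(s,a),

      so the deterministic policy that picks \arg\max_a Q^*(s,a) is never worse. Randomization starts to pay off when an opponent adapts to your policy (as in rock-paper-scissors) or when an entropy bonus is added to the objective, which is the modification this lecture studies.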

  • @AKNiloy
    @AKNiloy 2 years ago

    Sir, is this an undergraduate course or a graduate course?