Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 4 - Model Free Control

  • Published 2 Oct 2024
  • For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: stanford.io/ai
    Professor Emma Brunskill, Stanford University
    onlinehub.stanf...
    Professor Emma Brunskill
    Assistant Professor, Computer Science
    Stanford AI for Human Impact Lab
    Stanford Artificial Intelligence Lab
    Statistical Machine Learning Group
    To follow along with the course schedule and syllabus, visit: web.stanford.ed...

COMMENTS • 9

  • @odycaptain • 2 years ago +8

    Thank you for the course

  • @RUBOROBOT • 1 year ago +6

    In the monotonic ε-greedy policy improvement theorem, why do we introduce the (1-ε)/(1-ε) factor instead of just using the (1-ε) that is already there? This step seems unnecessary and confusing, since the original (1-ε) cancels with the new (1-ε) denominator and is therefore never used.

    • @zonghaoli4529 • 1 year ago +2

      The proof around 33:03 is, to be honest, rather cumbersome; the intermediate transformations do not really take you anywhere. What matters is that V^{pi_{i+1}} takes a max of Q over actions, which is always greater than or equal to the average of Q over actions under the previous policy, which is what V^{pi_{i}} gives you. It is the greedy action that ensures the monotonic improvement.

    • @michaelbondarenko4650 • 11 months ago +1

      Interestingly, they didn't fix the proof in the 2023 class either

    • @ZinzinsIA • 10 months ago

      The (1 - eps) / (1 - eps) factor is there to show that we did not change the value of the sum: multiplying by it lets us rewrite 1 - eps another way, namely as sum_a pi(a|s) - eps. Indeed, sum_a pi(a|s) - eps = (1 - eps + eps/|A| + (|A| - 1)/|A| * eps) - eps = 1 - eps, where the expression in parentheses is the sum of the action probabilities by definition of the epsilon-soft policy. Be careful that the slide puts the epsilon inside the sum over a, which is not correct; we do not sum epsilon over all possible actions. With that simplification and the fact that max_a Q_pi(s, a) is at least any weighted average of Q_pi(s, a) over actions, you get the result. What is interesting about this "cumbersome" writing is that the simplified expression is exactly V_pi(s), so you have shown a policy improvement. You can also check the Sutton & Barto RL book, where the proof is phrased a little differently but uses the same idea (p. 101 of the 2018 edition); a worked version of the step is sketched right after this thread.
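
To make the step concrete, here is a worked version of the inequality in LaTeX, following the Sutton & Barto presentation cited above; pi is any epsilon-soft policy and pi' is the epsilon-greedy policy with respect to Q^pi (a sketch of the standard argument, not a transcription of the slide):

\begin{align*}
Q^{\pi}(s, \pi'(s))
  &= \sum_{a} \pi'(a \mid s)\, Q^{\pi}(s, a) \\
  &= \frac{\epsilon}{|\mathcal{A}|} \sum_{a} Q^{\pi}(s, a)
     + (1 - \epsilon) \max_{a} Q^{\pi}(s, a) \\
  &\ge \frac{\epsilon}{|\mathcal{A}|} \sum_{a} Q^{\pi}(s, a)
     + (1 - \epsilon) \sum_{a} \frac{\pi(a \mid s) - \epsilon/|\mathcal{A}|}{1 - \epsilon}\, Q^{\pi}(s, a) \\
  &= \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a) \;=\; V^{\pi}(s).
\end{align*}

The weights (pi(a|s) - eps/|A|) / (1 - eps) are nonnegative and sum to 1 for any epsilon-soft pi, so the max dominates their weighted average; that is where the inequality comes from. The (1 - eps) factor itself cancels, but introducing it exposes the expression as a convex combination of Q-values, which answers the question above.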

  • @zonghaoli4529 • 1 year ago

    26:23 I think the reason people got a bit confused and obtained different answers is that they forgot the essence of MC policy evaluation: it starts only once a full episode has completed. In this case, therefore, G_{i,t} for all visited state-action pairs (s3,a1), (s2,a2), and (s1,a1) is 1, as gamma is zero. Then just follow the pseudocode for MC and you get the right answer (a minimal sketch of the procedure is included after this thread). If you are doing TD, where policy evaluation happens immediately without waiting for the entire episode to finish, I think the first student's answer was correct.

    • @takihasan8310 • 1 month ago

      No, look at the example carefully: the gamma there is 1.
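
For reference, here is a minimal sketch of first-visit Monte Carlo Q-evaluation in Python along the lines described above. The function name and the episode layout (a list of (state, action, reward) tuples) are illustrative assumptions, not taken from the lecture:

from collections import defaultdict

def first_visit_mc_q_evaluation(episodes, gamma=1.0):
    """Estimate Q(s, a) from complete episodes with first-visit Monte Carlo.

    Each episode is a list of (state, action, reward) tuples. Updates are
    applied only after an episode has terminated, which is the point made
    in the comment above about why MC and TD can give different answers.
    """
    returns_sum = defaultdict(float)   # cumulative first-visit returns per (s, a)
    returns_count = defaultdict(int)   # number of first visits per (s, a)
    q = defaultdict(float)             # current Q estimates

    for episode in episodes:
        # Compute the return G_t at every time step by walking backwards.
        g = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, _, reward = episode[t]
            g = reward + gamma * g
            returns[t] = g

        # Record the first time step at which each (state, action) appears.
        first_visit = {}
        for t, (state, action, _) in enumerate(episode):
            first_visit.setdefault((state, action), t)

        # Average the first-visit returns to update Q.
        for (state, action), t in first_visit.items():
            returns_sum[(state, action)] += returns[t]
            returns_count[(state, action)] += 1
            q[(state, action)] = returns_sum[(state, action)] / returns_count[(state, action)]

    return q

# Hypothetical episode in the spirit of the example discussed above:
# episode = [("s3", "a1", 0.0), ("s2", "a2", 0.0), ("s1", "a1", 1.0)]
# q = first_visit_mc_q_evaluation([episode], gamma=1.0)  # every G_t here equals 1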
