CS 285: Lecture 16, Part 1: Offline Reinforcement Learning 2

  • Published 29 Jan 2025

COMMENTS • 4

  • @pjhae1445
    1 year ago

    31:39 Why do we need an additional policy extraction step? Can't we just take the argmax of Q(s, a) obtained from the IQL iteration?

    • @jpiabrantes
      1 year ago +1

      Computing argmax_a Q(s, a) is slow and hard when the state-action space is large and continuous. For a tabular Q(s, a) that would work well.

    • @binyuwang6563
      3 months ago

      Because we need to handle continuous action outputs (a sketch of the extraction step follows this thread).
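
[Editor's note] A minimal sketch of the point made in the replies above: with continuous actions, IQL extracts a policy by advantage-weighted regression (weighted behavioral cloning on dataset actions) instead of an argmax over Q. This assumes a PyTorch setup with already-trained IQL networks; the names q_net, v_net, and policy.log_prob are illustrative, not from the lecture.

```python
import torch

def awr_policy_extraction_loss(policy, q_net, v_net, states, actions, beta=3.0):
    """Advantage-weighted regression: clone dataset actions, weighted by
    exp(beta * A(s, a)), so no argmax over a continuous action space is needed."""
    with torch.no_grad():
        advantage = q_net(states, actions) - v_net(states)       # A(s, a) = Q(s, a) - V(s)
        weights = torch.exp(beta * advantage).clamp(max=100.0)   # exponential weights, clipped for stability
    log_prob = policy.log_prob(states, actions)                  # log pi(a | s) on dataset actions
    return -(weights * log_prob).mean()

def greedy_action_discrete(q_net, state, all_actions):
    """The argmax the question asks about: fine for a small discrete action set,
    impractical when actions are continuous and high-dimensional."""
    with torch.no_grad():
        q_values = torch.stack([q_net(state, a) for a in all_actions])
    return all_actions[int(torch.argmax(q_values))]
```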

  • @erzhu419
    1 year ago

    18:25 I suggest putting the graph from page 5 here to explain the intuition for why π* ∝ π_β · exp(A^π): π* can only be large in the intersection where both π_β and A^π are large, which is just the blue curve times the orange curve (see the relation written out below).
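
[Editor's note] For reference, the relation the comment alludes to is the advantage-weighted policy from this part of the lecture (the AWAC / advantage-weighted regression derivation), written up to a per-state normalizer Z(s); λ is the temperature of the KL constraint.

```latex
% Optimal policy of the KL-constrained objective, up to a normalizer Z(s):
\pi^*(a \mid s) \;\propto\; \pi_\beta(a \mid s)\,
  \exp\!\left(\frac{1}{\lambda}\, A^{\pi}(s, a)\right)
```

Since both factors are nonnegative, π* is large only where π_β and the advantage are both large, which is exactly the "intersection" intuition the comment describes.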