31:39 Why do we need an additional policy extraction process? Can't we just argmax Q(s, a), which comes from the IQL iteration?
It's slow and hard to do argmax Q(s, a) when the state-action space is large and continuous. For tabular Q(s, a) that would work well.
Because we need to handle continuous action outputs.
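Not from the video; a minimal sketch of the point above, assuming a toy tabular Q and an AWR-style (advantage-weighted) extraction step. Function names, the temperature, and the clipping value are illustrative, not the lecture's exact method:

```python
import numpy as np

# Tabular case: argmax over Q(s, a) is trivial because actions are enumerable.
Q = np.random.rand(10, 4)          # 10 states, 4 discrete actions (toy values)
greedy_policy = Q.argmax(axis=1)   # exact argmax per state

# Continuous case: argmax_a Q(s, a) has no closed form, so IQL-style methods
# instead extract a policy by weighting dataset actions with their advantage:
#   maximize E[ exp(beta * A(s, a)) * log pi_theta(a | s) ]  over dataset actions
def awr_weights(q_values, v_values, beta=3.0, clip=100.0):
    """Advantage-weighted regression weights used as log-likelihood weights."""
    advantages = q_values - v_values
    return np.minimum(np.exp(beta * advantages), clip)

# Example: weights for a batch of dataset transitions (toy numbers)
q_batch = np.array([1.2, 0.4, 0.9])
v_batch = np.array([1.0, 0.5, 0.7])
print(awr_weights(q_batch, v_batch))  # higher-advantage actions get larger weight
```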
18:25 I suggest putting the graph from page 5 here to explain the intuition behind π* = π_β * A_π: π* can only appear in the intersection of the large-π_β and large-A_π regions, which is just the blue line times the orange line.
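For reference, a hedged sketch of the weighted form this intuition usually corresponds to in IQL-style policy extraction; the temperature β and exact notation are assumptions, not taken from the video:

```latex
% The extracted policy re-weights the behavior policy by the exponentiated
% advantage, so it is large only where both \pi_\beta and A^{\pi} are large.
\[
  \pi^*(a \mid s) \;\propto\; \pi_\beta(a \mid s)\,
  \exp\!\big(\beta\, A^{\pi}(s, a)\big)
\]
```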