DeepMind x UCL RL Lecture Series - Model-free Control [6/13]

  • Published 20 May 2024
  • Research Scientist Hado van Hasselt covers prediction algorithms for policy improvement, leading to algorithms that can learn good behaviour policies from sampled experience.
    Slides: dpmd.ai/modelfreecontrol
    Full video lecture series: dpmd.ai/DeepMindxUCL21
  • Science & Technology

COMMENTS • 13

  • @i0dollas
    @i0dollas 2 years ago +13

    I really appreciate Hado's teaching style in the past two videos. There were some things I struggled to understand about the core concepts and math formulations of reinforcement learning, some of which were introduced in the first lectures.
    I realize that without working through a lot of examples myself, or really thinking through all the implications, I'd never build a strong intuition for the material, and that would be an impediment to learning what comes next.
    But Hado's consistent review, and his focus on explaining important nuances that aren't self-evident from the formulations alone, really helped me build a robust, useful, clear intuition for these concepts: the kind of intuition you need to play with them in novel ways.

  • @Shubham_Chaudhary
    @Shubham_Chaudhary 2 years ago +7

    What an amazing lecture from the inventor of Double Q-learning!

    • @unknown3.3.34
      @unknown3.3.34 2 years ago

      Can I take this course, bro?
      I'm planning to start this. I would like to learn RL, but I'm afraid I might pick the wrong course.
      Can you suggest some good courses on RL?

  • @lucabommarito5922
    @lucabommarito5922 4 months ago +1

    59:50
    Classical Q learning: 99% of gamblers quit right before they win big

  • @marcin.sobocinski
    @marcin.sobocinski 1 year ago +2

    I wish there were a practical part to these lectures, with some coding in Python 😀

  • @chentao4686
    @chentao4686 2 years ago +2

    Thanks for such a wonderful lecture! 6/13

  • @jmybll
    @jmybll 1 year ago

    Does anyone know if there are different versions of the double DQN algorithm? From my point of view, there could be a couple of different ways to implement the idea of breaking the overestimation. For example, for updating q you could choose either of the following update rules:
    r + \gamma * q'(s', \argmax_{a'} q(s', a')), as in the video
    r + \gamma * q(s', \argmax_{a'} q'(s', a')), choosing the action according to q' but evaluating it with q
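
    A minimal tabular sketch of that symmetric pair of updates (the tabular double Q-learning form; the names q_a, q_b and the hyperparameter values are illustrative, not from the lecture):

    import numpy as np

    def double_q_step(q_a, q_b, s, a, r, s_next, alpha=0.1, gamma=0.99, rng=np.random):
        # q_a, q_b: 2-D arrays indexed as q[state, action].
        # With probability 1/2 update q_a, selecting the greedy action with q_a
        # but evaluating it with q_b; otherwise do the symmetric update on q_b.
        if rng.random() < 0.5:
            a_star = int(np.argmax(q_a[s_next]))        # selection by q_a
            target = r + gamma * q_b[s_next, a_star]    # evaluation by q_b
            q_a[s, a] += alpha * (target - q_a[s, a])
        else:
            b_star = int(np.argmax(q_b[s_next]))        # selection by q_b
            target = r + gamma * q_a[s_next, b_star]    # evaluation by q_a
            q_b[s, a] += alpha * (target - q_b[s, a])

    Both targets from the comment appear here, one per branch. In the DQN version, the online network selects the action and the target network evaluates it, i.e. the first form with q as the online network and q' as the target network.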

  • @graham8316
    @graham8316 2 years ago

    The cancellation at 1:25:00 seems strange to me. If our policies \mu and \pi choose different actions (because they're different policies), wouldn't we end up with different A_t's and then be unable to cancel them?
    Edit: if anyone else is confused by this, we're literally computing the probability of the same trajectory \tau under both \mu and \pi, so A_t is just A_t, a fixed value whose probability we evaluate under each policy, which is why all the cancellation goes through.
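
    Written out (a sketch in the lecture's notation, with p(S_{t+1}|S_t, A_t) denoting the environment's transition dynamics): the trajectory \tau was sampled under \mu, and both numerator and denominator evaluate the probability of that same fixed trajectory, so the transition terms are identical and cancel:

    \frac{p(\tau|\pi)}{p(\tau|\mu)}
      = \frac{\prod_t \pi(A_t|S_t)\, p(S_{t+1}|S_t, A_t)}{\prod_t \mu(A_t|S_t)\, p(S_{t+1}|S_t, A_t)}
      = \prod_t \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}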

  • @mofra
    @mofra 2 years ago

    Can someone please explain again the requirement on alpha at 39:38, and why 1/t is OK and satisfies that requirement?
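
    This presumably refers to the standard Robbins-Monro step-size conditions (a sketch; check the slide at that timestamp):

    \sum_{t=1}^{\infty} \alpha_t = \infty  \quad\text{and}\quad  \sum_{t=1}^{\infty} \alpha_t^2 < \infty

    With \alpha_t = 1/t the first sum is the harmonic series, which diverges, while the second is \sum_t 1/t^2 = \pi^2/6 < \infty, so both conditions hold. A constant step size satisfies the first condition but not the second.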

  • @raunakkbanerjee9016
    @raunakkbanerjee9016 2 years ago

    This one is serious stuff..

  • @mofra
    @mofra 2 years ago +1

    WHAT DOES THE I NOTATION MEAN? I() like at 16:16

    • @viktorkm9
      @viktorkm9 2 years ago +2

      I(TRUE) = 1
      I(FALSE) = 0
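
      For example (an illustrative use, not necessarily the one on that slide): I(A_t = a) is 1 if the action taken at time t is a and 0 otherwise, so \sum_t I(A_t = a) counts how many times action a was chosen.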