DeepMind x UCL RL Lecture Series - Introduction to Reinforcement Learning [1/13]

  • Published 1 May 2024
  • Research Scientist Hado van Hasselt introduces the reinforcement learning course and explains how reinforcement learning relates to AI.
    Slides: dpmd.ai/introslides
    Full video lecture series: dpmd.ai/DeepMindxUCL21
  • Science & Technology

COMMENTS • 110

  • @hasuchObe
    @hasuchObe 2 years ago +508

    A full lesson on reinforcement learning from a DeepMind researcher. For free! What a time to be alive.

    • @mawkuri5496
      @mawkuri5496 2 years ago +41

      lol.. Two Minute Papers

    • @321conquer
      @321conquer 2 years ago +1

      You might be dead and it's your AI clone that is typing this... Hope this helps.

    • @alejandrorodriguezdomingue174
      @alejandrorodriguezdomingue174 2 years ago +3

      Yes, apart from sharing knowledge (don't get me wrong), they also target a marketplace and teach people so they can use their future products.

    • @masternobody1896
      @masternobody1896 2 years ago +2

      yes

    • @jakelionlight3936
      @jakelionlight3936 2 years ago

      @@mawkuri5496 lol

  • @prantikdeb3937
    @prantikdeb3937 2 years ago +36

    Thank you for releasing those awesome tutorials 🙏

  • @JamesNorthrup
    @JamesNorthrup 1 year ago +26

    TOC in-anger
    0:00 class details
    5:00 intro-ish
    6:50 turing
    8:20 define goals of class
    9:00 what is RL?
    12:14 interaction loop
    17:20 reward
    25:49 atari game example
    28:18 formalization
    29:40 reward
    30:10 the return
    34:00 policy; action values denoted with q
    35:00 goto lectures 3,4,5,6
    43:00 markov is maybe not the most important property
    44:00 partial observability
    46:10 the update function
    53:48 Policy -> mapping of agent state to action.
    54:20 stochastic policy
    56:00 discount factor weights near-term rewards more heavily (or not)
    59:00 pi does not mean 3.14; it denotes the policy, a probability distribution over actions. Bellman equation named here
    1:02:00 optional Model
    1:04:00 model projects next state+reward, or possibly any state and any reward, because reasons
    1:07:00 Agent Categories
    1:10:00 common terminology

  • @chevalier5691
    @chevalier5691 2 years ago +14

    Thanks for the amazing lecture! Honestly, I prefer this online format to an actual lecture, not only because the audio and presentation are clearer, but also because the concepts are explained more thoroughly, without any interruptions from students.

    • @luisleal4169
      @luisleal4169 9 months ago

      And also you can go back to sections you missed or didn't fully understand.

  • @loelie01
    @loelie01 2 years ago +6

    Great course, thank you for sharing Hado! Particularly enjoyed the clear explanation of Markov Decision Processes and how they relate to Reinforcement Learning.

  • @matthewfeeley6226
    @matthewfeeley6226 2 years ago

    Thank you very much for this lesson, and for taking the time to deliver the content.

  • @abanoubyounan9331
    @abanoubyounan9331 7 months ago +1

    Thank you, DeepMind, for sharing these resources publicly.

  • @Adri209001
    @Adri209001 2 years ago +9

    Thanks so much for this. We love you from Africa

  • @TexasBUSHMAN
    @TexasBUSHMAN 2 years ago +3

    Great video! Thank you! 💪🏾

  • @kejianshi299
    @kejianshi299 2 years ago +3

    Thanks so much for this lesson!

  • @cennywenner516
    @cennywenner516 7 months ago +1

    For those viewing this lecture, note that it is a bit non-standard in reinforcement learning to denote the state S_t as the "agent's state"; it usually refers to the environment's state. This matters when reading other literature. The closest standard notion for the agent's state is perhaps the "belief state" b_t. Both are relevant depending on what is being done, and some of the formalization might not work when the two are mixed. Notably, most environments of interest are Markovian in the (possibly hidden) environment state but not in the observations, or in whatever the agent can derive about the state, which also means it is often insufficient to condition only on "S_t = s" as defined here, rather than on the full history H_t.
    Considering that much of the RL formalism concerns a state that is often not fully observable to the agent, maybe this approach is useful.
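
    A minimal sketch of that distinction, with made-up names (HiddenStateEnv and update_agent_state are illustrative only, not from the lecture): the environment state is Markov but hidden, the agent only receives observations and rewards, and it maintains its own agent state by applying an update function to what it has seen.

        import random

        class HiddenStateEnv:
            """Hypothetical environment: the true (Markov) state is hidden;
            the agent only receives a partial observation and a reward."""
            def __init__(self):
                self.env_state = 0  # environment state, never shown to the agent

            def step(self, action):
                self.env_state = (self.env_state + action + 1) % 4  # Markov transition
                observation = self.env_state % 2                    # partial observation O_{t+1}
                reward = 1.0 if self.env_state == 3 else 0.0
                return observation, reward

        def update_agent_state(agent_state, action, observation, max_len=4):
            """Agent-state update u(s, a, o): here just a window of recent
            (action, observation) pairs; this is not the environment state."""
            return (agent_state + [(action, observation)])[-max_len:]

        env = HiddenStateEnv()
        agent_state = []  # the agent's own state, built from its history
        for t in range(10):
            action = random.choice([0, 1])
            observation, reward = env.step(action)
            agent_state = update_agent_state(agent_state, action, observation)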

  • @Fordance100
    @Fordance100 2 years ago +2

    Amazing introduction to RL.

  • @alisheheryar1770
    @alisheheryar1770 2 years ago +2

    The type of learning in which your AI agent learns and tunes itself by interacting with its environment is called reinforcement learning. It offers more generalization power than a plain neural network and is better able to cater to unforeseen situations that were not considered when the system was designed.
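
    A minimal sketch of that interaction loop in Python (the Agent class and the environment function are made up for illustration; this is a stateless, bandit-style simplification, not the lecture's code): the agent picks an action, the environment returns a reward, and the agent tunes its estimates from that experience.

        import random

        class Agent:
            """Keeps a running value estimate per action and acts greedily most of the time."""
            def __init__(self, actions, epsilon=0.1, step_size=0.1):
                self.values = {a: 0.0 for a in actions}
                self.epsilon, self.step_size = epsilon, step_size

            def act(self):
                if random.random() < self.epsilon:            # occasionally explore
                    return random.choice(list(self.values))
                return max(self.values, key=self.values.get)  # otherwise exploit

            def learn(self, action, reward):
                # incremental update of the estimate towards the observed reward
                self.values[action] += self.step_size * (reward - self.values[action])

        def environment(action):
            """Hypothetical environment: action 'b' pays more on average."""
            return random.gauss(2.0 if action == "b" else 1.0, 1.0)

        agent = Agent(actions=["a", "b"])
        for t in range(1000):
            action = agent.act()
            reward = environment(action)
            agent.learn(action, reward)
        print(agent.values)  # the estimate for 'b' should end up higher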

  • @billykotsos4642
    @billykotsos4642 2 years ago +8

    The man, the myth, the legend.
    OMG! I’m in!

  • @anhtientran3158
    @anhtientran3158 2 years ago

    Thank you for your informative lecture

  • @kiet-onlook
    @kiet-onlook 2 years ago +10

    Does anyone know how this course compares to the 2015 or 2018 courses offered by Deepmind and UCL? I’m looking to start with one but not sure which one to take.

  • @adwaitnaik4003
    @adwaitnaik4003 2 years ago +1

    Thanks for this course.

  • @robertocordovacastillo3035
    @robertocordovacastillo3035 2 years ago +1

    That is awesome! thank you from Ecuador

  • @ianhailey
    @ianhailey 2 years ago +6

    Are the code and simulation environments for these examples available somewhere?

  • @randalllionelkharkrang4047
    @randalllionelkharkrang4047 1 year ago +8

    Please, can you link the assignments for this course for non-UCL students?

  • @lqpchen
    @lqpchen 2 years ago +7

    Thank you! Are there any assignment PDF files?

  • @goutamgarai6632
    @goutamgarai6632 2 years ago +3

    thanks DeepMind

  • @chadmcintire4128
    @chadmcintire4128 2 years ago +4

    Why the downvotes on free education? Thanks, I am comparing this to CS 285 from Berkeley; so far it has been good, with a different focus.

  • @0Tsutsumi0
    @0Tsutsumi0 7 months ago

    "Any goal can be formalized as the outcome of maximizing a cumulative reward." A broader question would be "Can all possible goals be turned into a mathematical formula?" It starts getting trickier whenever you deal with subjective human concepts such as love.

  • @sumanthnandamuri2168
    @sumanthnandamuri2168 2 years ago +8

    @DeepMind Can you share assignments?

  • @DilekCelik
    @DilekCelik 9 months ago +1

    Some diamond lectures from top researchers are public. Amazing. Take advantage, guys. You will not get lectures of this quality from most universities.

  • @charliestinson8088
    @charliestinson8088 6 months ago +1

    At 59:06, does the Bellman Equation only apply to MDPs? If it depends on earlier states I don't see how we can write it in terms of only v_pi(S_{t+1})
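
    For what it's worth, the standard derivation (textbook material, not a quote from the lecture) shows exactly where the Markov assumption enters:

        \begin{aligned}
        v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \\
                 &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
                 &= \mathbb{E}_\pi[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s].
        \end{aligned}

    The last step uses the Markov property of the state: given S_{t+1}, the distribution of G_{t+1} does not depend on earlier states or actions. Without that property (for example with a non-Markov agent state), the recursion would have to condition on the full history H_t instead.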

  • @abdul6974
    @abdul6974 2 years ago

    Is there any practical course in Python on RL, to apply the theory of RL?

  • @extendedclips
    @extendedclips 2 years ago +1

    ✨👏🏽

  • @robinkhlng8728
    @robinkhlng8728 2 years ago

    Could you further explain what v(S_{t+1}) formally is?
    Because v(s) is defined with a lowercase s as input. From what you said, I would say it is \sum_{s'} p(s' | S_t = s) v(s'), i.e. the expected value over all possible next states s' for S_{t+1}.

    • @a4anandr
      @a4anandr 2 years ago

      That seems right to me. Probably, it is conditioned on the policy \pi as well.
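
      For reference, writing the expectation out under a policy \pi (standard notation, consistent with the question and reply above):

          \mathbb{E}_\pi\left[ v_\pi(S_{t+1}) \mid S_t = s \right] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s').

      So v_\pi(S_{t+1}) is the ordinary (lowercase-argument) value function evaluated at the random next state S_{t+1}, and the expectation averages over both the policy's action choice and the state-transition probabilities.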

  • @matiassandacz9145
    @matiassandacz9145 1 year ago

    Does anyone know where can I find assignments for this course? Thank you in advance!

  • @AtrejuTauschinsky
    @AtrejuTauschinsky 2 years ago

    I'm a bit confused by models... In particular, value functions map states to rewards, but so do (some) models -- what's the difference? You seem to have the same equation (S -> R) for both on the slide visible at 1:16:30

    • @hadovanhasselt7357
      @hadovanhasselt7357 2 years ago +6

      Some models indeed use explicit reward models, that try to learn the expected *immediate reward* following a state or action. Typically, a separate transition model is also learnt, that predicts the next *state*.
      So a reward model maps a state to a number, but the semantics of that number is not the same as the semantics of what we call a *value*. Values, in reinforcement learning, are defined as the expected sum of future rewards, rather than just the immediate subsequent reward.
      So while a reward model and a value function have the same functional form (they both map a state to a number), the meaning of that number is different.
      Hope that helps!
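
      A toy numerical illustration of this answer, using a hypothetical deterministic three-state chain (the states and numbers are made up): the reward model predicts only the next reward, while the value sums all discounted future rewards.

          gamma = 0.9

          # Hypothetical deterministic chain: state 0 -> 1 -> 2 (terminal).
          # Immediate reward for stepping forward from each non-terminal state:
          immediate_reward = {0: 0.0, 1: 10.0}

          def reward_model(s):
              """Reward model: predicts only the *immediate* next reward from state s."""
              return immediate_reward.get(s, 0.0)

          def value(s):
              """Value: the *discounted sum* of all future rewards from state s."""
              total, discount = 0.0, 1.0
              while s in immediate_reward:   # walk the chain until the terminal state
                  total += discount * immediate_reward[s]
                  discount *= gamma
                  s += 1
              return total

          print(reward_model(0), value(0))  # 0.0 vs 9.0 -- same form (state -> number), different meaning
          print(reward_model(1), value(1))  # 10.0 vs 10.0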

  • @umarsaboor6881
    @umarsaboor6881 8 months ago +1

    amazing

  • @theminesweeper1
    @theminesweeper1 1 year ago

    Is the reward hypothesis generally regarded as true among computer scientists and other smart people?

  • @rudigerhoffman3541
    @rudigerhoffman3541 2 years ago

    Around 30:00 it was said that we can't hope to always optimize the return itself, and therefore we need to optimize the expected return. Why? Is this because we don't know the return yet and can only calculate an expected return based on inference from known returns? Or is it only because of the need for discounted returns in possibly infinite Markov decision processes? If so, why wouldn't it work in finite MDPs?

    • @MikeLambert1
      @MikeLambert1 1 year ago

      My attempt at an answer, on the same journey as you: If you have built/trained/learned a model, it is merely an approximation of the actual environment behavior (based on how we've seen the world evolve thus far). If there's any unknowns (ie, you don't know what other players will do, you don't know what you will see when you look behind the door, etc) then you need to optimize E[R], based on our model's best understanding of what our action will do. Optimizing E[R] will still push us to open the doors because we believe there _might_ be gold behind them. But if we open a particular door without any gold, it doesn't help R (in fact, I believe it lowers R, because any gold we find is now "one additional step" out in the future), even though it maximized E[R].
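
      A small sketch of the randomness point with a made-up one-step example: for a fixed action the realized return G is still random, so "the return itself" is not a well-defined objective, whereas its expectation is, even in a small finite MDP.

          import random

          def play(action):
              """Hypothetical one-step problem: the realized return is random for a fixed action."""
              if action == "safe":
                  return 1.0                                 # always 1
              return 10.0 if random.random() < 0.2 else 0.0  # "risky": 10 w.p. 0.2, else 0

          def expected_return(action, n=100_000):
              """Monte Carlo estimate of E[G | action] -- a deterministic target we can maximize."""
              return sum(play(action) for _ in range(n)) / n

          print(expected_return("safe"))   # ~1.0
          print(expected_return("risky"))  # ~2.0 -> maximizing the expectation prefers "risky",
                                           # even though any single risky return may be 0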

  • @patrickliu7179
    @patrickliu7179 1 year ago

    16:22
    For a task that fails when trying to maximize a cumulative reward, would casino games with turns of independent probability, such as roulette, break the model? This hinges on the reward accumulation period extending beyond one turn, resulting in a misapplication of the model. While it is more of a human error than a machine error, it's a common human misconception about the game and thus liable to be programmed in that way.
    Another example may be games with black swan events, where the reward accumulation period is too short to have witnessed a black swan event.

  • @boriskabak
    @boriskabak 1 year ago

    Where can we see a coding introduction on how to code reinforcement learning models?

  • @anshitbansal8294
    @anshitbansal8294 2 years ago

    "If we are observing the full environment then we do not need to worry about keeping the history of previous actions." Why would this be the case? What would the agent learn from, then?

  • @mabbasiazad
    @mabbasiazad 2 years ago

    Can we have access to the assignments?

  • @bhoomeendra
    @bhoomeendra 4 months ago

    37:28 What is meant by prediction? Is it different from the actions?

  • @swazza9999
    @swazza9999 2 years ago +4

    Thanks Hado, this has been very well explained. I've been through similar lectures/ intro papers before but here I learned more of the finer points / subtleties of the RL formalism - things that a teacher might take for granted and not mention explicitly.
    Question: 1:03:23 anyone know why the second expression is an expectation value and the first is a probability distribution? Typo or a clue to something much more meaningful?

    • @TheArrowShooter
      @TheArrowShooter 2 years ago

      Given that for a pair (s, a) there is one "true" reward signal in the model to be learnt, the expected value should suffice. I.e. if you modelled this with a distribution, it would in the limit be a Dirac delta function at value r. In the alternative case where there are two (or more) possible reward values for a state-action pair, a probability distribution that you sample from could make more sense.
      You can ask yourself whether it even makes sense to have multiple possible rewards for an (s, a) pair. I think it could be useful to model your reward function as a distribution when your observed state is only a subset of the environment, for example. E.g. assume you can't sense whether it is raining or not, and this determines whether the reward of your (s, a) pair is 5 or 10. Modelling the reward as an expected value (7.5, given that it rains 50 percent of the time) would ignore some subtleties of your model here, I suppose.
      I'm no RL specialist, so don't take my word for it!

    • @swazza9999
      @swazza9999 2 years ago

      ​@@TheArrowShooter hmm is it really right that there is one "true" reward signal for a given pair (s, a)? If a robot makes a step in a direction it may or may not slip on a rock so despite the action and state being determined as a prior, the consequences can vary.
      I was thinking about this more and I realised the first expression is asking about a state, which is a member of a set of states, so it makes sense to ask for the probability that the next state is s'. But in the second expression we are dealing with a scalar variable, so it makes more sense to ask for an expectation value. But don't take my word for it :)

    • @TheArrowShooter
      @TheArrowShooter 2 years ago +1

      @@swazza9999 I agree that there are multiple possible reward signals for a given state action pair. I tend to work with deterministic environments (no slipping, ending up in different states, ..), hence our misunderstanding :)!
      My main point was that you could model it as a probability distribution as well. The resulting learnt model would be more faithful to the underlying "true" model as it could return rewards by sampling (i.e. 5 or 10 in my example).

    • @willrazen
      @willrazen 2 years ago +1

      It's a design choice, you could choose whatever formulation that is suitable for your problem. For example, if you have a small and finite set of possible states, you can build/learn a table with all state transition probabilities, i.e. the transition matrix. As mentioned in the same slide, you could also use a generative model, instead of working with probabilities directly.
      In Sutton & Barto (2018) they say:
      "In the first edition we used special notations, P_{ss'}^a and R_{ss'}^a, for the transition probabilities and expected rewards. One weakness of that notation is that it still did not fully characterize the dynamics of the rewards, giving only their expectations, which is sufficient for dynamic programming but not for reinforcement learning. Another weakness is the excess of subscripts and superscripts. In this edition we use the explicit notation of p(s',r | s,a) for the joint probability for the next state and reward given the current state and action."
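
      A tiny sketch of that design choice using the hypothetical rain example from earlier in this thread (the numbers are purely illustrative): a generative or distributional reward model keeps both possible outcomes, while an expectation model collapses them to a single number.

          import random

          # Hypothetical reward distribution for one (s, a) pair: 5 or 10 depending on unseen rain.
          reward_outcomes = [(0.5, 5.0), (0.5, 10.0)]  # (probability, reward)

          def sample_reward():
              """Generative model: draws 5.0 or 10.0, faithful to p(r | s, a)."""
              u, acc = random.random(), 0.0
              for p, r in reward_outcomes:
                  acc += p
                  if u < acc:
                      return r
              return reward_outcomes[-1][1]

          def expected_reward():
              """Expectation model: collapses the distribution to E[R | s, a] = 7.5."""
              return sum(p * r for p, r in reward_outcomes)

          print(expected_reward())                             # 7.5
          print(sorted({sample_reward() for _ in range(50)}))  # [5.0, 10.0] almost surely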

    • @Cinephile..
      @Cinephile.. 2 years ago

      Hi, I want to learn data science, machine learning, and AI.
      I am unable to find the right approach and study material; there are numerous courses as well, but I'm still struggling to find the right one.

  • @gokublack4832
    @gokublack4832 2 years ago +2

    At 49:40 how about just storing the number of steps the agent has taken? Would that make it Markov?

    • @thedingodile5699
      @thedingodile5699 2 years ago

      No, you would still be able to stand in the two places he highlighted with the same number of steps taken, so you can't tell the difference. Whereas if you knew the entire history of the states you visited, you would be able to tell the difference.

    • @gokublack4832
      @gokublack4832 2 years ago

      @@thedingodile5699 Maze games like this usually have an initial state (i.e, position in the grid) where the game starts, so I'm not sure why if you stored the number of steps taken you wouldn't be able to tell the difference. You'd just look at the steps taken and notice that although the two observations are the same, they are very far away from each other and they're likely different. I'd agree if the game could start anywhere on the grid, but that's usually not the case.

    • @thedingodile5699
      @thedingodile5699 2 years ago

      @@gokublack4832 even if you start in the same place, you can most likely reach the two squares at the same time step (unless there is a constraint like only being able to reach one of them in an even number of steps, or something like that)

    • @gokublack4832
      @gokublack4832 2 years ago

      ​@@thedingodile5699 True, yeah I guess it's theoretically possible to construct a maze that starts at the same place, but then comes to a fork in the road later where the mazes are identical on both sides except only one contains the reward. In that case, counting steps wouldn't help you distinguish between two observations on either side... 🤔tricky problem

    • @hadovanhasselt7357
      @hadovanhasselt7357 2 years ago +4

      @@gokublack4832 It's a great question. In some cases adding something simple as counting steps could make the state Markovian, in other cases it wouldn't. But even if this does help disentangle things (and make the resulting inputs Markov), adding such information to the state would also result in there being more states, which could make it harder to learn accurate predictions or good behaviour. In general, this is a tricky problem: we want the state to be informative, but also for it to be easy to generalise from past situations to new ones. If each situation is represented completely separately, the latter can be hard.
      In later lectures we go more into these kind of questions, including how to use deep learning and neural networks to learn good representations, that hopefully can result in a good mixture between expressiveness on the one hand, and ease of generalisation on the other.
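
      A tiny illustration of the aliasing issue discussed in this thread (the maze and observations are made up): two different trajectories can produce the same observation at the same step count, so an (observation, step-count) state may still be ambiguous where the full history is not.

          # Two hypothetical trajectories that end in different parts of a maze
          # but yield the same local observation at the same time step.
          traj_a = [("corridor", "up"), ("corridor", "up"), ("dead-end", None)]
          traj_b = [("corridor", "right"), ("corridor", "right"), ("dead-end", None)]

          t = len(traj_a) - 1
          obs_a, obs_b = traj_a[-1][0], traj_b[-1][0]

          print((obs_a, t) == (obs_b, t))  # True  -> observation + step count does not disambiguate
          print(traj_a == traj_b)          # False -> the full history does, at the cost of more states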

  • @tamimyousefi
    @tamimyousefi 2 years ago

    15:45
    Goal: Prosper in all societies.
    Env.: A world comprised of two societies, killers and pacifists.
    These two groups despise the actions of the other. You will find reward from one and penalty from the other for any given action.

    • @gkirgizov_ai
      @gkirgizov_ai 2 years ago

      just kill all the pacifists and the goal becomes trivial

    • @MikeLambert1
      @MikeLambert1 1 year ago

      I think you're still maximizing a reward in your scenario, but it's just the reward is not static, and is instead a function of your state (ie, which society you are physically in).

  • @bobaktadjalli6516
    @bobaktadjalli6516 2 years ago

    Hi, at 59:50 I couldn't understand the meaning of argument "a" under "max". It would be appreciated if anyone could explain this to me.

    • @SmartMihir
      @SmartMihir 2 years ago

      I think the regular value function gives the value of a state when we pick actions by following pi.
      The optimal value function, however, picks the action such that the value is maximal (over all further time steps).
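
      In symbols (standard definitions, consistent with the reply above): the "a" under the max is the action being optimized over. The value under a policy averages over the policy's action choice, while the optimal value takes the best action in every state:

          v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a),
          \qquad
          v_*(s) = \max_a q_*(s, a) = \max_a \mathbb{E}\left[ R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a \right].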

  • @JuanMoreno-tj9xh
    @JuanMoreno-tj9xh 2 years ago +5

    "Any goal can be formalized as the outcome of maximizing a cumulative reward." What about the goal being to know if a program will halt?

    • @judewells1
      @judewells1 2 years ago +1

      My goal is to find a counter example that disproves the reward hypothesis.

    • @hadovanhasselt7357
      @hadovanhasselt7357 2 years ago +13

      Great question, Juan! I would say you can still represent this goal with a reward. E.g., give +1 reward when you know the program will halt.
      So in this case the problem perhaps isn't so much to formulate the goal. Rather, the problem is that we cannot find a policy that optimises it. This is, obviously, a very important question, but it's a different one.
      One could argue that the halting problem gives an example that some problems can have well-formalised goals, but still do not allow us to find feasible solutions (in at least some cases, or in finite time). This itself doesn't invalidate the reward hypothesis. In fact, this example remains pertinent if you try to formalise this goal in any other way, right?
      Of course, there is an interesting question of which kinds of goals we can or cannot hope to achieve in practice, with concrete algorithms. We go into that a bit in subsequent lectures, for instance talking about when optimal policies can be guaranteed to be found, discussing concrete algorithms that can find these, and discussing the required conditions for these algorithms to succeed.

    • @nocomments_s
      @nocomments_s 2 years ago +1

      @@hadovanhasselt7357 thank you very much for such an elaborate answer!

    • @JuanMoreno-tj9xh
      @JuanMoreno-tj9xh 2 years ago +1

      @@hadovanhasselt7357 True. I didn't think about it that way. I just thought that if you couldn't find a framework, a reward to give to your agent, such that you could solve your problem by finding the right policy then you could say that the reward hypothesis was false. Since there is no way to get around it.
      But you are right. It's a different question. But then it's still a hypothesis. Thanks for your time. :)

    • @JuanMoreno-tj9xh
      @JuanMoreno-tj9xh 2 years ago +2

      @@judewells1 Nice one!

  • @may8049
    @may8049 2 years ago

    When will we be able to download AlphaGo and play with it?

  • @AyushSingh-vj6he
    @AyushSingh-vj6he 2 years ago

    Thanks, I am marking 49:21

  • @rakshithv5073
    @rakshithv5073 2 years ago

    Why do we need to maximize the expectation of the return?
    What will happen if I maximize the return alone, without the expectation?

    • @ckhalifa_
      @ckhalifa_ 2 years ago

      The expected return (as opposed to a single reward) already includes the relevant discount factor for each future reward.
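
      For concreteness, the standard definitions being discussed: the discounted return is the random sum

          G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},

      and the objective is its expectation, v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]. The discount factor lives inside G_t itself; the expectation is needed because the individual rewards, and hence G_t, are random variables rather than fixed numbers one could maximize directly.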

  • @Saurabhsingh-cl7px
    @Saurabhsingh-cl7px 2 years ago

    So do I have to watch the previous years' videos on RL by DeepMind to understand this?

    • @los4776
      @los4776 2 years ago +1

      No it would not be a requirement

  • @comradestinger
    @comradestinger 2 years ago +1

    ow right in the inbox

  • @yulinchao7837
    @yulinchao7837 1 year ago

    15:45 Let's say my goal is to live forever, I can take 1 pill per day, and that guarantees my survival the next day. If I don't take it, I die. How do I formalize this goal via cumulative rewards? My goal would be getting infinite rewards. However, the outcomes of taking the pill or not on some day in the future are both infinite. In other words, I can't distinguish whether I live forever by maximizing the cumulative reward. Does this count as a successful break of the hypothesis?
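
    One hedged observation on this example (a standard resolution, not an official answer): with a discount factor \gamma < 1 and, say, reward 1 for each day survived, the two cumulative rewards are no longer both "infinite" and can be compared:

        \text{always take the pill: } \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma},
        \qquad
        \text{stop after day } n: \ \sum_{k=0}^{n} \gamma^k = \frac{1-\gamma^{n+1}}{1-\gamma} < \frac{1}{1-\gamma}.

    So "take the pill every day" is still the unique maximizer of the discounted cumulative reward. With no discounting and a truly infinite horizon, one would instead compare policies with an average-reward or similar criterion.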

  • @philippededeken4881
    @philippededeken4881 1 year ago

    Lovely

  • @chanpreetsingh007
    @chanpreetsingh007 1 year ago

    Could you please share assignments?

  • @mohammadhaadiakhter2869
    @mohammadhaadiakhter2869 1 month ago

    At 1:05:49, how did we approximate the policy?

  • @malcolm7436
    @malcolm7436 10 months ago

    If your goal is to win the lottery, you incur a weekly debt for each attempt, and the chance is the same with no guarantee of achieving the goal. If the reward is your profit over time, then the cumulative reward could even be negative and decreasing with each attempt.

  • @WhishingRaven
    @WhishingRaven 2 years ago +1

    This is showing up again.

  • @cuylerbrehaut9813
    @cuylerbrehaut9813 5 months ago +1

    Suppose the reward hypothesis is true. Then the goal “keep this goal’s reward function beneath its maximum” has a corresponding reward function (rendering the goal itself meaningful) whose maximization is equivalent to the achievement of the goal. If the reward function were maximized, the goal would be achieved, but then the function must not be maximized. This is a contradiction. Therefore the reward function cannot be maximized. Therefore the goal is always achieved, and therefore the reward function is always maximized. This is a contradiction. Therefore the reward hypothesis is false.

    • @cuylerbrehaut9813
      @cuylerbrehaut9813 5 months ago +1

      This assumes that the goal described exists. But any counter-example would require such an assumption. To decide if this argument disproves the reward hypothesis, we would need some formal way of figuring out which goals exist and which don’t.

  • @KayzeeFPS
    @KayzeeFPS 2 years ago +4

    I miss David Silver

  • @garrymaemahinay3046
    @garrymaemahinay3046 2 years ago

    I have a solution, but I need a team.

  • @mattsmith6509
    @mattsmith6509 2 years ago +2

    Can it tell us why people bought toilet paper during the pandemic?

  • @AineOfficial
    @AineOfficial 2 years ago +1

    Day 1 of asking him when AlphaZero is coming back to chess again.

  • @madhurivuddaraju3123
    @madhurivuddaraju3123 2 years ago +1

    Pro tip: always switch off the vacuum cleaner when recording lectures.

    • @spectator5144
      @spectator5144 2 years ago

      he is most probably not using an Apple M1 computer

  • @robensonlarokulu4963
    @robensonlarokulu4963 1 year ago

    DULL presentation! Go with Balaraman Ravindran and see the difference.

  • @jonathansum9084
    @jonathansum9084 2 years ago

    Many great people have said RL has been replaced by DL.
    If so, I think we should focus more on newer topics like perceive.IO. I think those are much more important and practical than this older material.
    I hope you do not mind what I said.

    • @felipemaldonado8028
      @felipemaldonado8028 2 years ago

      Do you mind providing evidence about those "many great people"?

  • @MsJoChannel
    @MsJoChannel 2 years ago

    slides like it was 1995 :)