Reinforcement Learning Upside Down: Don't Predict Rewards -- Just Map Them to Actions

  • Published 28 Nov 2024

COMMENTS • 31

  • @whatsinthepapers6112
    @whatsinthepapers6112 5 years ago +19

    Not going to lie - was fooled up until the magnetic chess board! Can't put anything past Schmidhuber

  • @herp_derpingson
    @herp_derpingson 5 years ago +28

    Academics now have to use meme knowledge and tactics to get their papers noticed. What a time to be alive.

  • @michael-nef
    @michael-nef 5 years ago +15

    starting strong, upside-down characters in an academic paper. high tier memer

    • @michael-nef
      @michael-nef 5 years ago +4

      @Dmitry Akimov Lighten up a bit; these people just want recognition for their work, and using catchy titles and more light-hearted introductions draws attention. It's not really their fault when it's what they're incentivized to do, something something reward-action.

    • @herp_derpingson
      @herp_derpingson 5 years ago

      @dmitry I don't think it's going to happen. There are so many research papers that if you want to get noticed, you need to stand out.

    • @michael-nef
      @michael-nef 4 years ago +3

      @Dmitry Akimov ok boomer

  • @ronen300
    @ronen300 3 years ago +3

    One of the funniest 3 minutes in the field! I was seriously laughing out loud 😂

  • @CosmiaNebula
    @CosmiaNebula 4 years ago +6

    skip to 4:08 if you don't want memes

  • @foobar1231
    @foobar1231 5 years ago +2

    Sorry if something is wrong, I'm not a specialist in RL.
    It is a kind of dynamic programming: the agent remembers its previous experience (commands) and acts according to observation and experience. Experience comes from episodes (positive and negative; they are like feelers). The longer an episode (more steps), the bigger the horizon. So, calculate the mean reward from the episodes and demand a little bit more (one standard deviation more). What does it mean to demand more? As I understood it, keep and develop only the successful episodes further and cut the negative ones (feelers).

    • @quickdudley
      @quickdudley 4 years ago +1

      Let's call the agent f, the observations s, the reward r, the demand d, and the actions a. At each step of experience generation, a = f(s, d). Then later, once the reward is known, f is updated such that f(s, r) is pulled towards a.
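
The two comments in this thread together describe the whole UDRL loop: act with a = f(s, d), regress f(s, r) towards whatever action actually earned return r, and then demand a bit more than past episodes achieved. Below is a minimal PyTorch sketch of that loop under toy assumptions (single-step episodes, an illustrative BehaviorFn class; none of these names come from the paper or a released implementation).

```python
import torch
import torch.nn as nn

class BehaviorFn(nn.Module):
    """f(observation, demand) -> action logits; demand is the commanded return."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs, demand):
        return self.net(torch.cat([obs, demand], dim=-1))

obs_dim, n_actions = 4, 2
f = BehaviorFn(obs_dim, n_actions)
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

# Experience generation: act with a = f(s, d), where d is a guessed demand.
replay = []
for _ in range(64):
    s = torch.randn(1, obs_dim)
    d = torch.ones(1, 1)                              # commanded return for exploration
    a = torch.distributions.Categorical(logits=f(s, d)).sample().item()
    r = float(a == 0) + 0.1 * torch.randn(()).item()  # toy reward actually obtained
    replay.append((s, a, r))

# foobar1231's point: next round, demand a bit more than past episodes achieved.
returns = torch.tensor([r for _, _, r in replay])
next_demand = returns.mean() + returns.std()          # would seed the next exploration round

# quickdudley's point: pull f(s, r) towards the action a that actually earned r.
ce = nn.CrossEntropyLoss()
for s, a, r in replay:
    loss = ce(f(s, torch.tensor([[r]])), torch.tensor([a]))
    opt.zero_grad(); loss.backward(); opt.step()
```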

  • @CyberneticOrganism01
    @CyberneticOrganism01 2 years ago +1

    interesting new perspective on how to do RL ☺️

  • @justinlloyd3
    @justinlloyd3 1 year ago

    during the first few minutes I am like "hmm I don't think that's gonna work" LOL

  • @softerseltzer
    @softerseltzer 3 years ago

    Thank you for the video!
    One thing I don't understand, though, is why the first paper says that you must use RNNs for non-deterministic environments, yet in the experiments paper they just stack a few frames for the VizDoom example without any RNNs.

  • @scottmiller2591
    @scottmiller2591 5 years ago +7

    My cursor, hovering, hovering over the downvote icon - "This guy totally neither read nor understood the paper..." Finally, he says "Just kidding!" and actually reviews the paper.

  • @richardwebb797
    @richardwebb797 4 years ago +1

    If you have 2 actions A and B, and you explore / train an input of desired reward 0 to produce action A, how does that help you do the right thing with an input desired reward 1 (select action B)?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      I guess ideally you would learn both, or at least recognize that you now want a different reward, so you should probably do a different action.

    • @richardwebb797
      @richardwebb797 4 years ago

      @YannicKilcher possible to explain in more concrete terms? The idea is to sample actions better than randomly, but it seems hand-wavy to say optimizing a probability distribution given one input will make the output distribution for another input good. Then again, I guess that's exactly what a neural net tries to do.

  • @robosergTV
    @robosergTV 4 years ago +1

    what a great video, thanks!

  • @NanachiinAbyss
    @NanachiinAbyss 5 years ago +1

    Can't you do the same by simply adding some logic to the function where the actions are chosen?
    If you have a network that outputs expected values, you can just choose the actions whose expected value matches what you want.

    • @YannicKilcher
      @YannicKilcher  5 years ago

      The value function has a hard-coded horizon (until the end of the episode), whereas UDRL can deal with any horizon.
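
The horizon point is easiest to see in code: a value network answers "what return do I expect from here until the episode ends?", with the horizon baked into its definition, while the UDRL command simply feeds a desired return and a desired horizon as extra inputs, so the same network can be queried for different horizons. Below is a minimal sketch under those assumptions (illustrative names, untrained network).

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
policy = nn.Sequential(                  # f(observation, command) -> action logits
    nn.Linear(obs_dim + 2, 64), nn.ReLU(),
    nn.Linear(64, n_actions),
)

obs = torch.randn(1, obs_dim)

# The command is a pair (desired_return, desired_horizon), so the horizon is a
# plain input that can be varied at query time instead of being fixed by the
# definition of a value function.
for desired_return, desired_horizon in [(5.0, 10.0), (5.0, 100.0), (20.0, 100.0)]:
    command = torch.tensor([[desired_return, desired_horizon]])
    logits = policy(torch.cat([obs, command], dim=-1))
    print(desired_return, desired_horizon, "-> action", logits.argmax(dim=-1).item())
```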

  • @snippletrap
    @snippletrap 4 years ago +3

    Negative 5 billion billion trillion is a pretty bad reward.

  • @DeepGamingAI
    @DeepGamingAI 5 years ago +4

    Pronounced "Lara"?

  • @jonathanballoch
    @jonathanballoch 3 years ago

    This is just a generalization of goal-conditioned imitation learning, no?

    • @patf9770
      @patf9770 3 years ago

      Or maybe that's just a special case of ⅂ꓤ ;)

  • @ambujmittal6824
    @ambujmittal6824 5 years ago

    Hi, can you do a video on Capsule networks also? Thank you :)
    Btw, I love your videos.

    • @DanieleMarchei
      @DanieleMarchei 5 years ago +2

      he already did it ^^
      ua-cam.com/video/nXGHJTtFYRU/v-deo.html