Not going to lie - I was fooled right up until the magnetic chess board! Can't put anything past Schmidhuber
Academics now have to use meme knowledge and tactics to get their papers noticed. What a time to be alive.
starting strong, upside-down characters in an academic paper. high tier memer
@Dmitry Akimov Lighten up a bit, these people just want recognition for their work and using catchy titles and more light-hearted introductions draws attention. It's not really their fault when it's what they're incentivized to do, something something reward-action.
@dmitry I don't think it's going to happen. There are so many research papers that if you want to get noticed, you need to stand out.
@Dmitry Akimov ok boomer
One of the funniest 3 minutes in the field! I was seriously laughing out loud 😂
skip to 4:08 if you don't want memes
Sorry if something is wrong, I'm not a specialist in RL.
It is a kind of dynamic programming: the agent remembers its previous experience (commands) and acts according to its observations and experience. Experience comes from episodes (positive and negative; they are like palps, i.e. feelers). The longer an episode (the more steps), the bigger the horizon. So, compute the mean reward over episodes and demand a little bit more (one standard deviation more). What does it mean to demand more? As I understood it, keep and develop further only the successful episodes, and cut off the negative ones (palps).
Let's call the agent f, the observations s, the reward r, the demand d, and the actions a. At each step of experience generation, a = f(s, d). Then later, once the reward is known, f is updated such that f(s, r) is pulled towards a.
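A minimal sketch of that hindsight update in Python, assuming a hypothetical PyTorch policy f(observation, command) -> action logits and a two-component command (return and horizon), roughly in the spirit of the experiments paper; the names and signature are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def hindsight_update(f, optimizer, episode):
    """episode: list of (obs, action, reward) tuples from one rollout."""
    observations, actions, rewards = zip(*episode)
    loss = 0.0
    for t in range(len(episode)):
        # In hindsight, the "demand" at step t is what was actually achieved:
        # the return obtained from here on, over the remaining horizon.
        achieved_return = float(sum(rewards[t:]))
        remaining_horizon = float(len(episode) - t)
        command = torch.tensor([achieved_return, remaining_horizon])
        logits = f(torch.as_tensor(observations[t], dtype=torch.float32), command)
        # Pull f(s, r) towards the action a that was actually taken at this step.
        loss = loss + F.cross_entropy(logits.unsqueeze(0), torch.tensor([actions[t]]))
    optimizer.zero_grad()
    (loss / len(episode)).backward()
    optimizer.step()
```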
interesting new perspective on how to do RL ☺️
during the first few minutes I was like "hmm, I don't think that's gonna work" LOL
Thank you for the video!
One thing I don't understand, though, is why the first paper says you must use RNNs for non-deterministic environments, yet in the experiments paper they just stack a few frames for the VizDoom example without any RNNs.
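For what it's worth, frame stacking is a common substitute for recurrence. Here's a minimal, generic sketch (not the paper's code) of keeping the last k observations and concatenating them, so a feed-forward policy sees a little history without an RNN:

```python
from collections import deque
import numpy as np

class FrameStacker:
    """Keep the last k frames and concatenate them along the channel axis."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # At episode start, fill the buffer with copies of the first frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return np.concatenate(self.frames, axis=-1)

    def step(self, frame):
        self.frames.append(frame)
        return np.concatenate(self.frames, axis=-1)
```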
My cursor, hovering, hovering over the downvote icon - "This guy totally neither read nor understood the paper..." Finally, he says "Just kidding!" and actually reviews the paper.
Gotcha 😉
If you have 2 actions, A and B, and you explore/train so that an input of desired reward 0 produces action A, how does that help you do the right thing with an input of desired reward 1 (i.e. select action B)?
I guess ideally you would learn both, or at least recognize that you now want a different reward, so you should probably do a different action.
@YannicKilcher Is it possible to explain in more concrete terms? The idea is to sample actions better than randomly, but it seems hand-wavy to say that optimizing the probability distribution for one input will make the output distribution for another input good. Then again, I guess that's exactly what a neural net tries to do.
what a great video, thanks!
Can't you do the same by simply adding some logic to the function where the actions are chosen?
If you have a network that outputs expected values, you can just choose the actions whose expected value matches what you want.
The value function has a hard-coded horizon (until the end of the episode), whereas UDRL can deal with any horizon.
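To make that concrete, a hedged sketch (assumed names and shapes, same hypothetical policy interface as in the sketch above): because the command carries both a desired return and a desired horizon, the same trained network can be queried for different horizons at evaluation time, while a value function bakes the episode-end horizon in.

```python
import torch

def act(policy, obs, desired_return, desired_horizon):
    # The command says both how much return we want and in how many steps.
    command = torch.tensor([desired_return, desired_horizon], dtype=torch.float32)
    logits = policy(torch.as_tensor(obs, dtype=torch.float32), command)
    return int(torch.argmax(logits))

# e.g. ask the same trained network for a return of 10 within 20 steps,
# or 50 within 200 steps, without retraining.
```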
Negative 5 billion billion trillion is a pretty bad reward.
Pronounced "Lara"?
This is just a generalization of goal-conditioned imitation learning, no?
Or maybe that's just a special case of ⅂ꓤ ;)
Hi, can you do a video on Capsule networks also? Thank you :)
Btw, I love your videos.
he already did it ^^
ua-cam.com/video/nXGHJTtFYRU/v-deo.html