Policy Gradient Methods | Reinforcement Learning Part 6

  • Published Jun 14, 2024
  • The machine learning consultancy: truetheta.io
    Want to work together? See here: truetheta.io/about/#want-to-w...
    Policy Gradient Methods are among the most effective techniques in Reinforcement Learning. In this video, we'll motivate their design, observe their behavior and understand their background theory.
    SOCIAL MEDIA
    LinkedIn : / dj-rich-90b91753
    Twitter : / duanejrich
    Github: github.com/Duane321
    Enjoy learning this way? Want me to make more videos? Consider supporting me on Patreon: / mutualinformation
    SOURCES FOR THE FULL SERIES
    [1] R. Sutton and A. Barto. Reinforcement learning: An Introduction (2nd Ed). MIT Press, 2018.
    [2] H. van Hasselt et al. RL Lecture Series, DeepMind and UCL, 2021, • DeepMind x UCL RL Lect...
    [3] J. Achiam. Spinning Up in Deep Reinforcement Learning, OpenAI, 2018
    ADDITIONAL SOURCES FOR THIS VIDEO
    [4] J. Achiam, Spinning Up in Deep Reinforcement Learning: Intro to Policy Optimization, OpenAI, 2018, spinningup.openai.com/en/late...
    [5] D. Silver, Lecture 7: Policy Gradient Methods, DeepMind, 2015, • RL Course by David Sil...
    TIMESTAMPS
    0:00 Introduction
    0:50 Basic Idea of Policy Gradient Methods
    2:30 A Familiar Shape
    4:23 Motivating the Update Rule
    10:51 Fixing the Update Rule
    12:55 Example: Windy Highway
    16:47 A Problem with Naive PGMs
    19:43 Reinforce with Baseline
    21:42 The Policy Gradient Theorem
    25:20 General Comments
    28:02 Thanking The Sources
    LINKS
    Windy Highway: github.com/Duane321/mutual_in...
    NOTES
    [1] When motivating the update rule with an animation of protopoints and theta bars, I don't specify alpha. That's because the lengths of the gradient arrows can only be interpreted on a relative basis. Their absolute numeric values can't be deduced from the animation because there was some unmentioned scaling done to make the animation look natural. Mentioning alpha would have made this calculation possible to attempt, so I avoided it.
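    For reference, the update rule being motivated is essentially REINFORCE's; a standard form, following [1] and ignoring discounting, is

    \theta_{t+1} = \theta_t + \alpha \, G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta_t)

    where \alpha is the step size referred to above and G_t is the return from time t.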

COMMENTS • 73

  • @rajatkumar.j
    @rajatkumar.j 1 month ago +1

    Finally, after watching this 3 times, I got the intuition behind this method. Thank you for uploading a great series!

  • @maximilianpowers9785
    @maximilianpowers9785 1 year ago +25

    My RL exam is in 2 weeks, and you're a lifesaver. I'm studying at UCL and the lectures lack a bit of that much-needed visual intuition!

    • @Mutual_Information
      @Mutual_Information  1 year ago +8

      Exactly what I'm going for. When I read the update rule, it didn't click until I ultimately landed on a visual like this. Happy it clicked for you too

  • @shadowdragon2484
    @shadowdragon2484 1 month ago +2

    Genuinely such an amazing series; it's changed the way I look at optimization problems as a whole moving forward.

    • @Mutual_Information
      @Mutual_Information  1 month ago

      Thank you for appreciating this one too. It's much less viewed

  • @stemfolk
    @stemfolk 1 year ago +13

    This is one of the best channels on the site by a long way. Very grateful that you never compromise on the quality of the videos. Excellent work!

  • @derickd6150
    @derickd6150 1 year ago +5

    Wow I'm so lucky. You only upload every few months and I just watched your previous videos yesterday. Love the series!! Thank you so much!

    • @Mutual_Information
      @Mutual_Information  1 year ago

      I intend to upload more frequently in fact. Just need to shorten the videos a bit. This one was a monster.

  • @rolandbertin-johannet5270
    @rolandbertin-johannet5270 1 year ago +9

    So grateful for this channel; any topic you cover is understood 10x faster than through other media.

    • @Mutual_Information
      @Mutual_Information  1 year ago +2

      That's what I'm going for - I'm here for the quick learners ;)

    • @chiwaiwan2484
      @chiwaiwan2484 6 months ago

      and 10x quicker than my fucking lectures

  • @dasyud
    @dasyud 3 months ago +1

    I was struggling to get into reading RL material because of the lack of intuition and this was exactly what I needed. Thanks a ton! I can now easily build upon the fundamentals you've taught me.
    I'm gonna binge watch every video on your channel since they are all on topics I find very interesting and want to learn about. I hope you put out more videos! Cheers! 🎉

    • @Mutual_Information
      @Mutual_Information  3 months ago

      Glad they're working for you. And yea I'm cooking a big one as we speak.

  • @joshithmurthy6209
    @joshithmurthy6209 1 year ago +2

    This video literally came a day before my test. Thanks for uploading before my test; even if you had uploaded after it, I would still have watched. They are so good.

  • @siddharthbisht1287
    @siddharthbisht1287 1 year ago +2

    You are going to heaven, sir, with the kind of work you are producing. The interesting thing is that one can watch these with popcorn, with a notebook, or listen to them while working on another task, and it just works. Your explanations are clear, simple and straightforward, which reflects your understanding. Also, thanks for sharing the sources; it genuinely helps a lot. Keep up the great work.

  • @wenkanglee9596
    @wenkanglee9596 6 months ago +1

    I just want to express my gratitude to you. As a total newbie to AI and ML, IMO this series might be the best-explained set of videos on RL. Please keep up the good work. :)

    • @Mutual_Information
      @Mutual_Information  6 months ago +1

      Thank you, especially when it's said on this video - part 6 of the RL series. This series took me a really long time and it's only appreciated by a small, studious bunch - so it's great to hear from them. Thanks again!

  • @johnshim6727
    @johnshim6727 9 months ago +1

    This was a great video as a beginner in RL to grasp concepts. Appreciate your effort and time for making this!!

  • @DoGyKG
    @DoGyKG 11 months ago +1

    Damn, thank you for making this video.
    The visualization of the algorithm is beyond any other explanation.

  • @asdf56790
    @asdf56790 7 months ago +1

    A huuuge thank you for making this series!
    It was extremely well explained, and it would've taken me many, many more hours to learn this from a book or other resources.
    It is a very dense course, so I had to spend quite some time rewatching, but it's absolutely worth it and I like the great coverage.
    Amazing!

    • @Mutual_Information
      @Mutual_Information  7 months ago +2

      And you are in exactly the circumstance I was aiming for. After I read the book, I thought.. damn that just takes way too long to learn. If I could lower the cost of learning, people would appreciate it, just like I would have. So I'm glad it worked for you!

  • @mberoakoko24
    @mberoakoko24 1 year ago +2

    Aye, you are back. I'll come back to watch this with a notebook.

  • @marcegger7411
    @marcegger7411 1 year ago +2

    My favorite YouTube channel is back!!

  • @awaisahmad5908
    @awaisahmad5908 3 months ago

    Thank you so much. I wish we had teachers like you in our universities.

  • @quachthetruong
    @quachthetruong 1 year ago

    Your channel makes me love statistical probability and its applications more. You fully explain math without turning it into a tough academic lecture!
    Thank you so much!

  • @TallMonkey
    @TallMonkey 4 months ago +1

    Thank you, man. You're the best. Really helped me study for my upcoming exam. Most importantly, it helped me understand the intuition behind it all.

  • @bonettimauricio
    @bonettimauricio 7 months ago +1

    Really excited to go on this RL journey with you. I was reading the book and watching these lessons; now it is the end (hopefully just the beginning). Thank you so much for this!

  • @user-fh7hj7du2f
    @user-fh7hj7du2f 5 months ago

    Thank you for explaining this complicated topic so well.

  • @user-bj8wg8vq8h
    @user-bj8wg8vq8h 10 months ago +1

    Very well done video, thank you

  • @dhinas9444
    @dhinas9444 1 month ago +1

    Thank you man for maxing out our mutual information!

    • @Mutual_Information
      @Mutual_Information  1 month ago

      YES! Someone finally said it! lol that's honestly exactly what I had in my mind when naming this silly channel

  • @mCoding
    @mCoding 1 year ago +1

    Great series! I have a question about the weighting of proto points. Do you do a simple distance-weighted average over all proto points, and if so, is there a variation that only weights the K nearest neighbors? I ask because if the landscape is very non-monotonic, weights from far points might tug in the wrong direction even if the nearby proto points give good information locally, e.g. imagine a gradient landscape that is like a maze.

    • @Mutual_Information
      @Mutual_Information  1 year ago +1

      Thank you!
      To answer your Q: in this very simple case, I'm doing a distance-weighted average over all protopoints, but there is a hyperparameter I didn't mention in the video, which scales distances prior to the conversion into normalized weights.
      The effect is, if the scale is very small, then the model effectively becomes identical to K=1 nearest neighbors, and so any state only uses the single closest protopoint to determine its action probabilities. In that case, the problematic tugging you mention doesn't exist. If the scale is very large, then the model sees all protopoints as almost equally far away, and for any state the action probabilities will essentially be a simple average of all proto-action-probabilities. So, somewhere between these extremes, we balance the tugging problem against the benefit of averaging data from nearby states.
      In larger, high-dimensional problems, the way I've done things doesn't work well. We can't tile the space without blowing up the parameter count. So there are a variety of approaches: selecting protopoints that nicely partition the regions where the data is observed, or reducing the dimensionality of the original space into a more manageable latent space. Or abandon protopoints entirely and just use deep nets! In fact, I don't believe I've seen any large RL model that is nearest-neighbors based.
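      A minimal Python sketch of the weighting scheme described above, assuming an exponential conversion of scaled distances into normalized weights (the exact conversion used in the video isn't specified, and the names length_scale, proto_points and proto_action_probs are hypothetical):

      import numpy as np

      def action_probs(state, proto_points, proto_action_probs, length_scale=1.0):
          # proto_points:       (P, D) array of protopoint locations
          # proto_action_probs: (P, A) array, each row a distribution over actions
          # length_scale:       the distance-scaling hyperparameter mentioned above
          dists = np.linalg.norm(proto_points - state, axis=1)   # distance from state to each protopoint
          weights = np.exp(-dists / length_scale)                # small scale -> closest protopoint dominates
          weights /= weights.sum()                               # normalize into weights that sum to 1
          return weights @ proto_action_probs                    # weighted average: a distribution over actions

      # Tiny example: 3 protopoints in 1-D, 2 actions.
      protos = np.array([[0.0], [1.0], [2.0]])
      probs = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
      print(action_probs(np.array([0.2]), protos, probs, length_scale=0.1))    # ~ nearest protopoint's probs (K=1 behavior)
      print(action_probs(np.array([0.2]), protos, probs, length_scale=100.0))  # ~ simple average of all rows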

  • @akritiupreti6974
    @akritiupreti6974 6 months ago

    Work of art!

  • @timothytyree5211
    @timothytyree5211 1 year ago +1

    Encore! Encore! Thou art the mac daddy of RL! I will stay tuned!
    Couldst thou pretty please consider developing a follow up video on more sophisticated PPO methods?

    • @Mutual_Information
      @Mutual_Information  1 year ago +1

      PPO would be the next topic. I can't say that's the next thing on the menu, but if this series gathers some attention, I can come back with an encore in due time. Thanks for the love!

  • @MathVisualProofs
    @MathVisualProofs 1 year ago +1

    👍So nicely done.

  • @kimchi_taco
    @kimchi_taco 10 months ago +3

    salute!

    • @Mutual_Information
      @Mutual_Information  10 months ago

      Thanks for watching these more intense videos. This one took a long ass time!

  • @maximechopin2600
    @maximechopin2600 6 months ago

    I wanted to ask how you make your animations; they are very clear and concise. Thanks for the great content!

  • @dhlee8594
    @dhlee8594 1 year ago +1

    What is your background / current job? Your videos are of really high quality and cover advanced topics.

    • @Mutual_Information
      @Mutual_Information  1 year ago +3

      I used to be in quantitative finance. Now I'm a data scientist at Lyft.
      And thank you - the advanced topics are where the action is!

  • @DRich222
    @DRich222 1 year ago +1

    Hah- Nice candid clip at the end.

  • @zenchiassassin283
    @zenchiassassin283 1 year ago +1

    Hi, thanks a lot for your videos! Do you plan to make some videos on other reinforcement learning policy gradient methods?

    • @Mutual_Information
      @Mutual_Information  1 year ago

      RL videos aren't next in the queue. I'll be exploring some new categories. But eventually, I'd like to touch on PPO more directly, just because of its usefulness. But that probably won't happen this year.

  • @5_inchc594
    @5_inchc594 1 year ago +1

    Thanks for the clear explanation. Could you please make a video on the PPO algorithm?

    • @Mutual_Information
      @Mutual_Information  1 year ago

      I don't currently have plans for it, but it would be my next follow-up in the RL series.

    • @5_inchc594
      @5_inchc594 1 year ago

      @Mutual_Information Thanks! That would be great.

  • @akhilezai
    @akhilezai 1 year ago +1

    Yaayyyy you're back!

    • @Mutual_Information
      @Mutual_Information  1 year ago +1

      Indeed. This one was 30 minutes long, hence it took forever to create. Next videos will be shorter/uploaded more frequently.

  • @add-mt5xc
    @add-mt5xc 7 months ago +1

    How does one see that the objective (average reward) used in the policy gradient theorem is independent of the initial state? I think this is true as on the right-hand side, you are summing over states s. Is it the Markov assumption that lets you write the average reward in that manner such that it is independent of the initial state?

    • @Mutual_Information
      @Mutual_Information  7 months ago

      It's not that it's independent of the starting state. It's that it's true for any starting state. You'll still be able to improve the expected return even if the starting state is randomly set at the start of each episode.
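      For reference, the episodic policy gradient theorem as stated in [1] is, up to a constant of proportionality,

      \nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a) \, \nabla_\theta \pi(a \mid s, \theta)

      where \mu is the on-policy state distribution under \pi, which is the sum over states s mentioned in the question.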

  • @actualBIAS
    @actualBIAS 1 year ago +1

    Would love to add this to my playlists. Is there any chance to do it? Why is it even disabled?
    Great vids btw

    • @Mutual_Information
      @Mutual_Information  1 year ago

      Wait.. what's disabled?? I don't think I've disabled any such thing on my end.

  • @GusTheWolfgang
    @GusTheWolfgang 11 months ago

    Why didn't you upload before my dissertation last year D:

  • @moisesbessalle
    @moisesbessalle 2 months ago

    I think at @6:20 you meant "since we are creating 3 values out of 2 constraints" right?

    • @wilhem7206
      @wilhem7206 2 months ago

      The one constraint is theta1 + theta2 + theta3 = 0, so if you know two of the thetas, the third one is determined.

    • @moisesbessalle
      @moisesbessalle 2 months ago

      @wilhem7206 But there is another constraint, which is that each p >= 0.

  • @wrjog23
    @wrjog23 1 year ago +1

    too much informatiooooon!!!

  • @zerotwo7319
    @zerotwo7319 23 days ago

    Man, I hate that this has nothing to do with neurons or anything biologically inspired. Great explanation to see what is really going on, but this has nothing to do with intelligence.

  • @TimL_
    @TimL_ 1 year ago +2

    Thank you.

  • @gravkint8376
    @gravkint8376 5 months ago +1

    Damn, this video is helpful. So far I was only able to get a vague understanding of the topic with lots of time and effort, but this gives a whole new level of intuition. Thank you so much!