Proximal Policy Optimization | ChatGPT uses this

CodeEmporium

Published 20 May 2024
Let's talk about a Reinforcement Learning Algorithm that ChatGPT uses to learn: Proximal Policy Optimization (PPO)
ABOUT ME
⭕ Subscribe: ua-cam.com/users/CodeEmporiu...
📚 Medium Blog: / dataemporium
💻 Github: github.com/ajhalthor
👔 LinkedIn: / ajay-halthor-477974bb
PLAYLISTS FROM MY CHANNEL
⭕ Reinforcement Learning: • Reinforcement Learning...
⭕ Natural Language Processing: • Natural Language Proce...
⭕ Transformers from Scratch: • Natural Language Proce...
⭕ ChatGPT Playlist: • ChatGPT
⭕ Convolutional Neural Networks: • Convolution Neural Net...
⭕ The Math You Should Know : • The Math You Should Know
⭕ Probability Theory for Machine Learning: • Probability Theory for...
⭕ Coding Machine Learning: • Code Machine Learning
MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: imp.i384100.net/MathML
📕 Calculus: imp.i384100.net/Calculus
📕 Statistics for Data Science: imp.i384100.net/AdvancedStati...
📕 Bayesian Statistics: imp.i384100.net/BayesianStati...
📕 Linear Algebra: imp.i384100.net/LinearAlgebra
📕 Probability: imp.i384100.net/Probability
OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: imp.i384100.net/Deep-Learning
📕 Python for Everybody: imp.i384100.net/python
📕 MLOps Course: imp.i384100.net/MLOps
📕 Natural Language Processing (NLP): imp.i384100.net/NLP
📕 Machine Learning in Production: imp.i384100.net/MLProduction
📕 Data Science Specialization: imp.i384100.net/DataScience
📕 Tensorflow: imp.i384100.net/Tensorflow

COMMENTS • 31

@CodeEmporium 5 months ago ⁺⁴
Thanks for watching! If you think I deserve it, please consider hitting that like button as it will help spread this channel. More breakdowns to come!
@user-mx9eu5bb7i 5 months ago ⁺²
I like the clarity that your video provides. Thanks for this primer. A couple things, though, that were a bit unclear and perhaps you could elaborate on here in the comments.
- It wasn't obvious to me how/why you would submit all of the states at once (to either network) and update with an average loss as opposed to training on each state independently. I get that we have an episode of related/dependent states here -- maybe that's why we use the average instead of the directly associated discounted future reward?
- Secondly, in your initial data sampling stage you collected outputs from the policy network. During the training phase of the network it looks like you're sampling again, but your values are different. How is this possible unless your network has changed somehow? Maybe you're using dropout or something like that?
Forgive the questions -- I'm just learning about this methodology for the first time.
@user-ir1pm2pd1k 2 months ago ⁺²
Hi! Great video! Could you answer my question about training the policy? This happens at 10:00. Why are the action probabilities obtained here different from the probs recorded while gathering data? I think we haven't changed the policy network before this step. So, if we haven't changed the network yet, at 10:08 we would have received ratio == 1 on every step(
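On the ratio question raised above: in the standard PPO formulation, the ratio compares the current policy's probability of a sampled action with the probability recorded when the data was gathered. On the very first update the two policies are identical, so every ratio is 1; the ratios move away from 1 only after the network has been updated and the same batch is reused for further passes. The sketch below is a minimal pure-Python illustration under those assumptions, not the video's code. The function name `clipped_surrogate`, the clip range `eps=0.2`, and all numbers are hypothetical. It also shows the objective being averaged over the whole batch.

```python
import math

def clipped_surrogate(new_logps, old_logps, advantages, eps=0.2):
    """Mean PPO clipped surrogate objective over a batch of samples."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Hypothetical log-probabilities recorded while gathering data.
old_logps = [math.log(0.25), math.log(0.5)]
advantages = [1.0, -2.0]

# First update pass: the policy has not changed, so every ratio is 1
# and the objective is just the mean advantage.
first = clipped_surrogate(old_logps, old_logps, advantages)  # -0.5

# A later pass over the same batch, after the policy has been updated
# (made-up new probabilities): the ratios are now 2.0 and 0.5.
new_logps = [math.log(0.5), math.log(0.25)]
later = clipped_surrogate(new_logps, old_logps, advantages)  # about -0.2
```

In the later pass the first sample's ratio of 2.0 is clipped to 1.2, which limits how much credit a single large policy change can earn.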
@srivatsa1193 5 months ago
I've really enjoyed this series so far. Great work! The world needs more passionate teachers like yourself. Cheers!
@CodeEmporium 5 months ago
Thanks so much for the kind words I really appreciate it :)
@ashishbhong5901 4 months ago
Good presentation and break down of concepts. Liked your video.
@ZhechengLi-wk8gy 5 months ago
Like your channel very much, looking forward to the coding part of RL.😀
@swagatochakraborty2583 1 month ago
Great presentation. One question: why is the policy network a separate network from the value network? It seems like the probability of the actions should be based on estimating the expected reward values. I think in my Coursera course on Reinforcement Learning, I saw they were using the same network and simply copying over the weights from one to another. So they were essentially time-shifted versions of the same network, trained just once.
@obieda_ananbeh 5 months ago
Thank you!
@vastabyss6496 5 months ago ⁺⁶
What's the purpose of having a separate policy network and value network? Wouldn't the value network already give you the best move in a given state, since we can simply select the action the value network predicts will have the highest future reward?
@yeeehees2973 1 month ago
It has more to do with balancing exploration and exploitation, as simply picking the maximum Q-value from the value network yields suboptimal results due to limited exploration. Alternatively, using only a policy network would yield updates that are too noisy, resulting in unstable training.
@sudiptasarkar4438 1 month ago
@yeeehees2973 I feel that this video is misleading at 02:06. Previously I thought the value function's objective was to estimate the max reward value of the current state, but this guy is saying otherwise.
@yeeehees2973 1 month ago
@sudiptasarkar4438 the Q-values inherently try to maximize the future rewards, so the Q-value of being in a certain state can be interpreted as the maximum future reward given this state.
@patrickmann4122 15 days ago
It helps with something called “baselining” which is a variance reduction technique to improve policy gradients
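The baselining idea mentioned above can be sketched in a few lines: the value estimate for each state is subtracted from the discounted return actually observed, and the policy is then updated with that difference (the advantage) rather than the raw return. This is a minimal illustration, not the video's code. The function names, the discount factor, and the `values` list (standing in for hypothetical value-network outputs) are all assumptions.

```python
def discounted_returns(rewards, gamma=0.9):
    """Discounted sum of future rewards for each step of one episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def advantages(returns, values):
    """Advantage = observed return minus the baseline (value estimate)."""
    return [g - v for g, v in zip(returns, values)]

rewards = [0.0, 0.0, 1.0]   # made-up episode: reward only at the end
values = [0.5, 0.5, 0.5]    # hypothetical value-network predictions

rets = discounted_returns(rewards, gamma=0.5)  # [0.25, 0.5, 1.0]
advs = advantages(rets, values)                # [-0.25, 0.0, 0.5]
```

A positive advantage means the action did better than the value network expected, and a negative one means it did worse, so the policy update depends on how surprising the outcome was rather than on the absolute size of the return.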
@inderjeetsingh2367 5 months ago
Thanks for sharing 🙏
@CodeEmporium 5 months ago
My pleasure! Thank you for watching
@markusdegen6036 1 month ago
Why doesn't the value network take state and action as input and produce a Q-value as output? Is it just to get the same type of output as the policy network, or is there a different reason? Great video 🙂
@ericgonzales5057 3 months ago ⁺¹
WHERE DID YOU LEARN THIS?!??! PLEASE ANSWER
@victoruzondu6625 1 month ago
What are vf updates, and how do we get the value for our clipped ratio?
You didn't seem to explain them.
I could only tell the last quiz answer is B because the other options complement the policy network, not the value network.
@0xabaki 3 months ago
haha finally no one has done quiz time yet!
I propose the following answers:
0) seeing the opportunity cost of an action is low
1) A
2) B
3) D
@footube3 4 months ago
Could you please explain what up, down, left and right signify. In which data structure are we going up, down, left or right?
@CodeEmporium 4 months ago
Up down left and right are individual actions that an agent can possibly take. You could store these data types in an “enum” and sample a random action from this
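The enum suggestion in the reply above might look like the following in Python. This is only a sketch: the class name `Action`, the helper `sample_random_action`, and the string values are illustrative choices, not taken from the video.

```python
import random
from enum import Enum

class Action(Enum):
    """The four moves available to the agent."""
    UP = "up"
    DOWN = "down"
    LEFT = "left"
    RIGHT = "right"

def sample_random_action(rng=random):
    """Pick one of the four actions uniformly at random."""
    return rng.choice(list(Action))

# A seeded generator keeps the example reproducible.
action = sample_random_action(random.Random(0))
```

Uniform sampling like this corresponds to a completely untrained agent; a policy network would instead assign each of the four actions its own probability.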
@OPASNIY_KIRPI4 4 months ago
Please explain how you can apply back propagation over the network simply by using a single loss number? As far as I understand, an input vector and a target vector are needed to train a neural network. I will be very grateful for an explanation.
@CodeEmporium 4 months ago
The single loss is “back propagated” through the network to compute the gradient of the loss with respect to each parameter of the network. This gradient is later used by an optimizer algorithm (like gradient descent) to update the neural network parameters, effectively “learning”. I have a video coming out on this tomorrow explaining back propagation in my new playlist “Deep Learning 101”. So do keep an eye out for this
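To make the reply above concrete: the inputs and targets are already folded into the single loss number, and the chain rule turns that one number into a separate gradient for every parameter. The sketch below swaps the neural network for a two-parameter linear model with hand-written gradients, which is an assumption made to keep the example short; deep learning frameworks compute the same kind of gradients automatically. All names and numbers are illustrative.

```python
def mse_loss_and_grads(w, b, xs, ys):
    """Scalar mean-squared-error loss and its gradient for each parameter."""
    n = len(xs)
    loss = grad_w = grad_b = 0.0
    for x, y in zip(xs, ys):
        err = (w * x + b) - y          # prediction minus target
        loss += err ** 2 / n           # one number for the whole batch
        grad_w += 2 * err * x / n      # d(loss)/dw via the chain rule
        grad_b += 2 * err / n          # d(loss)/db via the chain rule
    return loss, grad_w, grad_b

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # made-up data following y = 2x
w, b, lr = 0.0, 0.0, 0.05                    # initial parameters, step size

initial_loss, _, _ = mse_loss_and_grads(w, b, xs, ys)
for _ in range(200):
    _, grad_w, grad_b = mse_loss_and_grads(w, b, xs, ys)
    w -= lr * grad_w                         # gradient descent update
    b -= lr * grad_b
final_loss, _, _ = mse_loss_and_grads(w, b, xs, ys)
```

Each pass produces one scalar loss, yet both `w` and `b` receive their own update, which is the point made in the reply.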
@OPASNIY_KIRPI4 4 months ago ⁺¹
Thanks for the answer! I'm waiting for a video on this topic.
@BboyDschafar 5 months ago
FEEDBACK.
Either from experts/teachers, or from the environment.
@paull923 5 months ago
Great video! Especially, the quizzes are a good idea. B B B I'd say
@CodeEmporium 5 months ago ⁺¹
Thanks so much! It’s fun making them too. I thought it would be a good way to engage. And yep the 3 Bs sound right to me too 😊
@zakariaabderrahmanesadelao3048 5 months ago ⁺¹
The answer is B.
@CodeEmporium 5 months ago
Ding ding ding for Quiz 1!
@id104442304 4 months ago
bbb
