Reinforcement Learning through Human Feedback - EXPLAINED! | RLHF

  • Published Dec 10, 2023
  • We talk about reinforcement learning through human feedback. ChatGPT, among other applications, makes use of this.
    ABOUT ME
    ⭕ Subscribe: ua-cam.com/users/CodeEmporiu...
    📚 Medium Blog: / dataemporium
    💻 Github: github.com/ajhalthor
    👔 LinkedIn: / ajay-halthor-477974bb
    PLAYLISTS FROM MY CHANNEL
    ⭕ Reinforcement Learning: • Reinforcement Learning...
    ⭕ Natural Language Processing: • Natural Language Proce...
    ⭕ Transformers from Scratch: • Natural Language Proce...
    ⭕ ChatGPT Playlist: • ChatGPT
    ⭕ Convolutional Neural Networks: • Convolution Neural Net...
    ⭕ The Math You Should Know : • The Math You Should Know
    ⭕ Probability Theory for Machine Learning: • Probability Theory for...
    ⭕ Coding Machine Learning: • Code Machine Learning
    MATH COURSES (7 day free trial)
    📕 Mathematics for Machine Learning: imp.i384100.net/MathML
    📕 Calculus: imp.i384100.net/Calculus
    📕 Statistics for Data Science: imp.i384100.net/AdvancedStati...
    📕 Bayesian Statistics: imp.i384100.net/BayesianStati...
    📕 Linear Algebra: imp.i384100.net/LinearAlgebra
    📕 Probability: imp.i384100.net/Probability
    OTHER RELATED COURSES (7 day free trial)
    📕 ⭐ Deep Learning Specialization: imp.i384100.net/Deep-Learning
    📕 Python for Everybody: imp.i384100.net/python
    📕 MLOps Course: imp.i384100.net/MLOps
    📕 Natural Language Processing (NLP): imp.i384100.net/NLP
    📕 Machine Learning in Production: imp.i384100.net/MLProduction
    📕 Data Science Specialization: imp.i384100.net/DataScience
    📕 Tensorflow: imp.i384100.net/Tensorflow

COMMENTS • 11

  • @RameshKumar-ng3nf
    @RameshKumar-ng3nf 6 days ago +1

    Brilliant, bro 👌. Excellent explanation. I never understood RLHF from reading so many books and notes, but your examples are GREAT & simple to understand 👌
    I am new to your channel and have subscribed.

  • @manigoyal4872
    @manigoyal4872 5 months ago

    What about the generation of rewards? Will there be another model to check the relevance of the answer and the precision of the answer, since we have a lot of data?

  • @theartofwar1750
    @theartofwar1750 2 months ago +1

    At 6:58, you have an error: PPO is not used to build the reward model.
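
    For context on this correction: in the standard RLHF recipe, the reward model is trained on human preference comparisons with a pairwise ranking loss, and PPO is applied only afterwards to fine-tune the policy against that frozen reward model. A minimal sketch of the reward-model objective, assuming PyTorch and a hypothetical reward_model that maps a (prompt, response) pair to a scalar score:

    import torch.nn.functional as F

    def reward_model_loss(reward_model, prompt, chosen, rejected):
        # Score both candidate responses; the human-preferred one should score higher.
        r_chosen = reward_model(prompt, chosen)      # scalar reward tensor
        r_rejected = reward_model(prompt, rejected)  # scalar reward tensor
        # Bradley-Terry style pairwise ranking loss; PPO plays no role in this step.
        return -F.logsigmoid(r_chosen - r_rejected).mean()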

  • @neetpride5919
    @neetpride5919 5 months ago +4

    Great video! I have a few questions:
    1) Why do we need to manually train the reward model with human feedback if the point is to evaluate the responses of another pretrained model? Can't we cut out the reward model altogether, rate the responses directly using human feedback to generate a loss value for each response, and then backpropagate on that? Does it require less human input to train the reward model than to train the GPT model directly?
    2) When backpropagating the loss, do you need to do recurrent backpropagation for a number of steps equal to the length of the token output?
    3) Does the loss value apply equally to every token in the output? It seems like this would overly punish some words, e.g. if the question starts with "why", the response is likely going to start with "because" regardless of what comes after. Does RLHF only work with sentence embeddings rather than word embeddings?

    • @0xabaki
      @0xabaki 3 months ago

      1) I think the point is to minimize the volume of human feedback: humans give just enough responses to train a model that handles all future feedback. This way humans don't always have to give feedback; instead they lay the basis, and will probably come back to re-evaluate what the reward model is doing so it still acts human.
      (2) and (3) seem more specific to the architecture of ChatGPT than to PPO or RLHF. I would look into the other GPT-specific videos he made.
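
      To make point (1) concrete: a finite, human-labelled set of comparisons trains the reward model once, and that model then scores an unlimited number of new policy samples during the PPO stage with no further human input. A toy sketch of that two-phase idea, assuming PyTorch; ToyRewardModel and the random tensors are hypothetical stand-ins, not the actual ChatGPT setup:

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class ToyRewardModel(nn.Module):
          # Stand-in for a transformer with a scalar reward head.
          def __init__(self, dim=16):
              super().__init__()
              self.score = nn.Linear(dim, 1)
          def forward(self, x):  # x: embedded (prompt, response) pair
              return self.score(x).squeeze(-1)

      rm = ToyRewardModel()
      opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

      # Phase 1: a finite batch of human-labelled (chosen, rejected) comparisons.
      chosen, rejected = torch.randn(64, 16), torch.randn(64, 16)
      loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()  # pairwise ranking loss
      opt.zero_grad(); loss.backward(); opt.step()

      # Phase 2: the trained reward model scores any number of fresh policy samples
      # during PPO, with no additional human labels in the loop.
      with torch.no_grad():
          rewards = rm(torch.randn(1000, 16))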

  • @manigoyal4872
    @manigoyal4872 5 months ago

    Acts as a randomizing factor depending on whom you are getting feedback from

  • @sangeethashowrya0318
    @sangeethashowrya0318 1 month ago

    Sir, please make a video on function approximation in RL.

  • @ayeshariaz3382
    @ayeshariaz3382 26 days ago

    Where can we get your slides?

  • @0xabaki
    @0xabaki 3 months ago

    haha quiz time again:
    0) when the person knows me well
    1) D
    2) B if proper human feedback
    3) C

  • @manigoyal4872
    @manigoyal4872 5 months ago +1

    Aren't we users the humans in the feedback loop for OpenAI?

    • @akzytr
      @akzytr 5 months ago +2

      Yeah, however OpenAI has the final say on what feedback goes through.