Your tutorial is very helpful for us. Please make a video on chatbots with reinforcement learning from human feedback (RLHF).
Thanks for the comment. Sure, I will, as soon as possible.
Great video
Glad you enjoyed it
I think this is RLAIF instead of RLHF, because the feedback that forms the reward model is generated by a BERT model instead of a human.
You are partly right and partly wrong. In most setups we still need to train the reward model on human-labelled data so that it can give feedback, and in that case it is RLHF. Really happy that you pointed out something so interesting. ❤️
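To make the distinction concrete, here is a minimal sketch of the reward-model step, assuming a pairwise preference loss trained on human-labelled comparisons; the model name, example data, and hyperparameters are illustrative placeholders, not the exact setup from the video:

```python
# Minimal, illustrative sketch of the RLHF reward-model step:
# the reward model is trained on human preference labels (chosen vs. rejected),
# so the feedback ultimately comes from humans even though a model scores responses.
# Backbone, data, and hyperparameters below are placeholders (assumptions).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

backbone = "bert-base-uncased"  # assumption: any small encoder can serve as the reward backbone
tokenizer = AutoTokenizer.from_pretrained(backbone)
reward_model = AutoModelForSequenceClassification.from_pretrained(backbone, num_labels=1)
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

# One human-labelled comparison: annotators preferred "chosen" over "rejected".
prompt = "How do I sort a list in Python?"
chosen = prompt + " Use sorted(my_list) or my_list.sort()."
rejected = prompt + " I have no idea."

chosen_batch = tokenizer(chosen, return_tensors="pt", truncation=True)
rejected_batch = tokenizer(rejected, return_tensors="pt", truncation=True)

# Scalar reward for each response.
r_chosen = reward_model(**chosen_batch).logits
r_rejected = reward_model(**rejected_batch).logits

# Pairwise (Bradley-Terry) loss: push the chosen reward above the rejected one.
loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
optimizer.step()
print(f"pairwise loss: {loss.item():.4f}")
```

The trained reward model then supplies the scalar reward used to fine-tune the policy model (e.g. with PPO) in the final RLHF step.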
Please make a video covering all 3 steps from scratch with a smaller-parameter LLM, pleaseeee!
Sure, but I have already created videos where we fine-tuned tinystarcoder, which is a 164M-parameter model. You can check them here:
1. ua-cam.com/video/G3RZoxPIpXw/v-deo.html
2. ua-cam.com/video/R2paulc3P2M/v-deo.html
@@WhisperingAI I am getting errors while implementing that notebook.
Sorry for that; let me revisit the notebook and make the necessary changes. I will update it in a couple of hours.
@@WhisperingAI yes please 🥺
@@WhisperingAI Please also clarify the paths of the models and tokenizers for the SFT, reward, and policy models.