We have a study investigating text classification performance at different levels of reasoning. We do find a positive effect of 0-shot-CoT in these cases, but it is more apparent when 0-shot-CoT is accompanied by n-shot examples. Level-k reasoning is more mechanistic, due to its game-theoretic nature, than, say, sentiment analysis. This may be why CoT is observed to help in our case.
Clear my confusion, I am a newbie in this field: the goal is to maximize reward, and for self-correction we are providing a bonus. With this approach, aren't we encouraging the model to make more mistakes on the first attempt and then self-correct on the second attempt to get the maximum reward?
Great video! Could you share the link to the ChatGPT chat you demonstrated (the share link at the top right, as seen in the video)? It's a little difficult to follow towards the end.
Have they actually achieved any positive results? The most successful RL methods usually limit how much the policy can change per update to avoid instabilities. Anyone can claim anything if there are no results to prove it.
This is awesome. Imagine what a musician who really understands music could do with this tool.
What kind of stuff are you imagining? Musician here
I’ve been thoroughly impressed with o1 and the quality/insights of its responses.
Strawberry is so exhaustive that people will be afraid to ask it questions lol. :)
I love the part where you say "I am sorry, but I am going to ask you something personal, strawberry".
Hope your account is still active 😅
I feel watched...