Good informative video. A suggestion: a chart at the end with pass or fail for the models.
good suggestion thanks for that.
For the Chinese models try swapping the word Unicorn with Qilin or Kirin.
They somewhat resemble a Unicorn - Horned Horse.
Hmm. Should try this.
Thanks for the info bro.
welcome bro :)
Great content!
Thank you very much.
great video!
Sick video man
thanks man : )
LOL that's the funniest thing. The actual "strawberry" model can perfectly count how many r's are in "strawberry", but if you make it just a tiny bit more complicated, it fails as badly as before. @Chollet would laugh at this so much xD
😂
Gemini 1121 got all the questions right except for the earnings problem and the unicorn SVG.
A video on it is coming up :). I had actually planned that, but then this reasoning mode dropped.
Did you write the prompts yourself or did you get them from someplace?
I picked them up from various exams. The earnings problem I made myself. When o1 was released and I tested it personally, it shattered my questions, so I came up with that one.
Thanks for noticing.
So which is best?
It's o1, but we also see that you might get away with o1-mini.
1. o1 (good overall)
2. o1-mini (good when you have a very specific issue)
3. DeepSeek r1 (could be cheaper than the two, but the API release will tell)
4. Qwen QwQ (the cheapest; DeepSeek r1's API pricing will tell if it retains that. Brings reasoning abilities to actually usable prices.)
I hope it was helpful.
:)
When OpenAI makes a breakthrough, other companies soon follow.
But OpenAI is not actually open! All models depend on Google's research on transformers, even ChatGPT.
They have no moat!
Windsurf taking over
I built my first Android application with it yesterday. Tears 🥹
Pretty much meaningless. Via the web interface, you never know what model version you get, and OpenAI especially is known for running A/B tests. So you have to use the API.
And a temperature above 0 makes no sense for these kinds of tests.
I get you bro, but here is the point.
API pricing is off the charts (o1-preview).
And in most cases people will be using the ChatGPT version.
Yes, there could be internal system prompt changes or hidden A/B tests (yeah, that is a downside, but it happens rarely).
Known A/B tests are visible, so we know when they come.
All in all, I get your point. I have thought about this and other things like the factuality of models (you can watch my "Can you trust LLMs" video).
I have some plans to take these into account, but if I am being honest, I am a little busy with something related to family. I will try to get it implemented ASAP.
@YJxAI Makes sense! 😉 You could just use the example Python implementations for your tests. Just an idea.
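For anyone who wants to try the API-based approach discussed above, here is a rough sketch of what such a test script could look like, including the pass/fail chart suggested earlier in the thread. It is only an illustration, not the setup used in the video: `ask` is a hypothetical callable you would wire to a real API client (with a pinned model version and, where the API exposes it, temperature 0), and `model-a` / `model-b` are placeholder names. Each question is asked several times and only counts as a pass if every run is correct.

```python
from typing import Callable, Dict, List, Tuple

# (name, prompt, checker): checker returns True if the answer is acceptable.
Question = Tuple[str, str, Callable[[str], bool]]


def run_suite(
    ask: Callable[[str, str], str],
    models: List[str],
    questions: List[Question],
    runs: int = 3,
) -> Dict[str, Dict[str, bool]]:
    """Ask every question `runs` times per model; pass only if all runs pass."""
    results: Dict[str, Dict[str, bool]] = {}
    for model in models:
        results[model] = {
            name: all(check(ask(model, prompt)) for _ in range(runs))
            for name, prompt, check in questions
        }
    return results


def format_chart(results: Dict[str, Dict[str, bool]]) -> str:
    """Render a plain-text pass/fail chart, one row per question."""
    models = list(results)
    names = list(next(iter(results.values()))) if results else []
    width = max([len(n) for n in names] + [len("question")])
    lines = ["question".ljust(width) + " | " + " | ".join(models)]
    for name in names:
        cells = [
            ("pass" if results[m][name] else "FAIL").ljust(len(m))
            for m in models
        ]
        lines.append(name.ljust(width) + " | " + " | ".join(cells))
    return "\n".join(lines)


def fake_ask(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a real API call, so the sketch runs offline.
    return "3" if model == "model-a" else "2"


# The expected answer is computed in code instead of being typed by hand.
QUESTIONS: List[Question] = [
    (
        "strawberry",
        "How many r's are in 'strawberry'? Answer with a number only.",
        lambda answer: answer.strip() == str("strawberry".count("r")),
    ),
]

if __name__ == "__main__":
    print(format_chart(run_suite(fake_ask, ["model-a", "model-b"], QUESTIONS)))
```

Swapping `fake_ask` for a function that calls a real API keeps the rest unchanged, and the repeated runs give at least a rough check on how stable each answer is.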