Once o1 replies with an answer and you either know it is wrong or suspect that it is, the next step should be asking it how it could "test" its answer. Prompt something like this: "I think your answer is wrong, are there tests you could do to confirm that your answer is correct?"
Thanks for the suggestion!
That is an awesome suggestion 👏 Never thought about it this way, and nobody told me that either... it's a great way to fully utilise a reasoning-capable model like that. Thanks mate! 😀 👍
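For anyone who wants to script that follow-up "self-check" prompt rather than type it into ChatGPT, here is a minimal sketch. This is only an illustration, not the workflow from the video: it assumes the OpenAI Python SDK, the o1-preview model name, and a placeholder problem statement.

```python
# Minimal sketch (an assumption, not the video's workflow): send a problem to o1,
# then follow up with the commenter's "are there tests you could do?" prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [{"role": "user", "content": "Here is my physics problem: ..."}]  # placeholder problem text
first = client.chat.completions.create(model="o1-preview", messages=messages)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Follow-up asking the model to propose tests of its own answer.
messages.append({
    "role": "user",
    "content": ("I think your answer is wrong. Are there tests you could do "
                "to confirm that your answer is correct?"),
})
check = client.chat.completions.create(model="o1-preview", messages=messages)
print(check.choices[0].message.content)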
o1 should be able to 'see' the traces in order to interpret them better. I am sure that if it had been able to see the diagram, it would have solved the problem correctly.
Even if it took 3 prompts to get the correct result, it is still impressive.
I recommend you post complex problems that would take a human more than a day to solve. That's where you add the most value at this moment; anything a human can solve in under 30 minutes, plenty of other YouTubers can cover. I have many complex real-world problems, but I don't think its world knowledge can handle them.
I have a few books with problems of that kind: multi-part and conceptually challenging. Problem is, I need to do them first myself 😅 I might try to get around to doing some of the more doable ones and testing those against o1-preview and o1-mini.
Also, I didn't do theory in graduate school or for my current job, so my PhD physics problem-solving abilities have diminished in the 5+ years since I took graduate coursework, so it'll be a while before I'm comfortable solving problems of that kind at a quick pace.
@@KyleKabasares_PhD 😂😂😂
Hey Kyle, I have not taken physics at university, but I did take quite a few proof-theoretic math courses. I was wondering if you could give ChatGPT more mathematical logic problems instead of straight calculations (which we know can be simplified using Wolfram to a certain extent). Some real analysis or measure theory proofs could be a very good test of reasoning and thinking.
I wonder if it could re-create Walter White's 99% pure blue crystal meth formula.
Loving your channel, this is very educational - fantastic in-depth data points for model prompting, behavior, and performance across models. Keep up the good work!
I bet these o1 models would do a good job thinking through Fermi Estimates...
@@electronjoe Thank you so much for watching!
That is quite impressive; however, you still need to provide it with the appropriate direction.
The problem with using LLMs for problems like this is that they're not deterministic, so there is always a non-zero possibility they will return a wrong answer, and when they do return a wrong answer it really looks correct. You always need to verify, so you either need to know the answer already or have extensive in-domain knowledge, which really limits their usefulness.
Similarly to my first comment, we tend to judge this AI by comparing it to pre-determined answers and rather unfairly think it's no good when it gets the wrong answer (or at the very least, that we cannot wholeheartedly trust it). But when you think about, say, an area of novel research where there is not necessarily an exact way to validate our answers, it's entirely possible that a PhD candidate or researcher could make just the same mistakes as the AI is making here. Not quite sure what I'm getting at with this comment, but there's something in it.
You are pointing out the equivalence between AI and the human mind in problem solving.
@@perorenchino2036 Yes, but I feel I'm trying to say a bit more than that: how we judge an AI needs to be contextualised against our own limits. I.e. will we only say we have arrived at AGI if an AI never makes a mistake? What about a simple mistake (but one that an average person could still make, for example a syllogism or riddle)? We are holding AI to a superhuman standard. Which is great, but I don't think people realise en masse that machines are already solving problems far beyond the layperson's own capabilities. Others are criticising AIs because 'it has the problem already in the training'. This seems to be a fallacy though, because you can easily give it a novel problem, and it will still think through a series of steps to try and solve it. To be honest, I don't care if it gets the answer right at this stage, because there's no guarantee that a human would either, and for the most part the AI appears to attempt at least reasonable, if not entirely correct, steps. And yet somehow we still judge this as... a fail? Would we do the same if it were an undergrad student, or even an average person just giving a hard problem a go?
@@nickrobinson7096 Yes, there are logical fallacies and conceptual issues associated with the idea that for an AI to be considered AGI (Artificial General Intelligence), it must have a "perfectionist" mentality, making very few errors or mistakes:
1. Perfectionist Fallacy: This is the expectation that a solution or entity (in this case, AGI) must be flawless or perfect to be acceptable. It sets an unrealistically high standard that might not be necessary for AGI to be effective or considered as general intelligence.
2. False Dilemma/False Dichotomy (Either/Or Fallacy): This fallacy might manifest if the argument implies that AI must either be perfect or it cannot be considered AGI, ignoring the possibility that AGI could be imperfect yet still meet the criteria for general intelligence.
3. Straw Man Fallacy: This could occur if someone misrepresents the requirements for AGI by suggesting it must be perfect, thereby creating an easier position to attack or dismiss. In reality, AGI is generally defined by its ability to understand, learn, and apply intelligence across a wide range of tasks, not by its perfection.
4. Equivocation: This could happen if "perfection" is used ambiguously, conflating different meanings of intelligence or capability. For instance, human intelligence is not perfect and is marked by errors and learning from them, yet it is still considered general intelligence.
5. Nirvana Fallacy: This involves comparing an actual situation to an unrealistic, idealized version. Expecting AGI to function without errors might be an idealized notion, ignoring practical limitations and iterative improvements in AI development.
In discussions about AGI, it's crucial to focus on its ability to perform a wide range of cognitive tasks effectively (which is why it's called "general intelligence" and not "narrow intelligence") rather than holding it to a standard of perfection. That being said, there are AI benchmarks or tests that measure "general intelligence", such as the GAIA benchmark. On Dr. Alan Thompson's YouTube channel ( ua-cam.com/video/JpQA7nB_P6o/v-deo.html ), he tested the o1 model on a level 3 GAIA benchmark problem and o1 managed to solve it. Also, on his website ( lifearchitect.ai/agi ), Dr. Alan Thompson estimated that we are already at 81% AGI as of September 2024.
Love watching these videos ❤
I’m glad!
Thanks for the interesting videos. I suggest making a video testing these models on this integral: $\int_{0}^{1}\frac{\tanh^{-1}(x\sqrt{2-x^2})}{x}\,dx$. I calculated the exact value to be $\frac{3\pi^2}{16}$. Even Mathematica cannot give the exact answer. Good luck!
Thank you for the suggestion!
@@KyleKabasares_PhD Actually, I guided the o1 model to solve it, and it was able to do so after receiving some hints from me. It would be interesting to see if you could ask it to solve the problem as well, to determine if it has really learned from my guidance.
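For anyone curious, the claimed closed form is easy to sanity-check numerically before handing the integral to a model. The sketch below assumes NumPy and SciPy and simply compares an adaptive quadrature of the commenter's integrand against 3π²/16; it checks the commenter's claim, it does not prove the identity.

```python
# Numerical sanity check of the commenter's claim that
# integral_0^1 arctanh(x*sqrt(2 - x^2))/x dx = 3*pi^2/16 (value taken from the comment).
import numpy as np
from scipy.integrate import quad

def integrand(x):
    # The x -> 0 limit is sqrt(2); the x -> 1 endpoint has only a mild logarithmic
    # singularity, which quad's adaptive rule handles (it never evaluates the endpoints).
    return np.arctanh(x * np.sqrt(2.0 - x**2)) / x

value, err = quad(integrand, 0.0, 1.0, limit=200)
claimed = 3 * np.pi**2 / 16

print(f"numerical : {value:.10f}  (estimated error {err:.1e})")
print(f"3*pi^2/16 : {claimed:.10f}")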
" no please " lmao
Why so many hate comments on a niche channel?
As someone who studies machine learning, idk why people who don't even know what a genetic network is are attacking him over talking about AI.
Very nice video! But how would you know if the model is bullshitting you on a problem for which you don't have a solution beforehand?
You are helping OpenAI train the model. And I do not think OpenAI o1 will be of much help in the future.
Please solve the Riemann hypothesis.
Thanks to LLMs, now I can do graduate-level physics problems. In a few months or years it will calculate that in 10 seconds or less, lol. That is the future.
In your first video, o1 solved a 1.5-week homework assignment in 2 minutes. So today's challenge is not that promising...
Although, unlike the other problems, this one was not o1-preview; this is o1-mini and GPT-4o.
@@DanielSeacrest I expect Kyle will use complex problems to test o1. Other YouTubers don't have as many complex problems (they have to be maths, physics, economics, etc.) to test with.
These models don't know the physical world like we do, because they observe the physical world through words from the internet and their synthetic data. I believe these models have a tendency to want to learn, so it would help if they were able to do... eh, what's the term... learning from experience.
Great clip.
Would be interested to see, regardless of whether it gets it right or wrong, what would happen if, after it gives the answer, you say something like 'do you agree with your answer?' You could also try it within a single question, like 'after providing your answer, please check again'. I have done this once and it can go into a bit of a loop for a few minutes, realising it has made mistakes and trying again and again. In my simple test it did this for a long time but eventually got a decent answer.
Or maybe like... try an alternative way and see if you get the same answer.
Interesting, so it's good but not good enough.
dude ur reactions are priceless lmao xD
@@KasunWijesekara LOL I’m glad you think so
Is it just me, or has 4o been significantly nerfed after o1 was released? Sure, 4o wasn’t too good at math, but it wasn’t that bad back then.
OpenAI consumes about 3 bottles of water to generate 100 words...
The water they "consume" is still there afterwards.
@@marwin4348 How?
Boring as hell. Unsubscribe
@@evangelion045 bye ;)
Rude dude