I was Wrong About ChatGPT's New o1 Model

  • Published Sep 18, 2024

COMMENTS • 221

  • @SkillLeapAI  2 days ago +12

    Let me know if you agree or disagree.

    • @ericandi  2 days ago +2

      That website you used for the math questions is garbage. All of the answers you said ChatGPT got wrong, it actually got right; the answers posted on that math site were incorrect, which makes your video inaccurate.

    • @christian15213  2 days ago +1

      No, that's not what's going on. There's something more trained into the CoT.

    • @Leto2ndAtreides  2 days ago

      Well, they haven't released the actual version that gets the great benchmark performance.
      But even so, you're right that it's not so much an advancement in technology as an optimization of prompting that is done for you.
      Done sufficiently well, that should improve results by a lot.

    • @hxlbac  2 days ago +2

      The math test had formatting problems! The two models interpreted the questions differently: ua-cam.com/video/3z5k8dofu_0/v-deo.html vs ua-cam.com/video/3z5k8dofu_0/v-deo.html

    • @mishos.2228  1 day ago +5

      Sorry, it didn't work. I tried it on my university-level mechanics, physics, and algebra questions, and your GPT failed to answer all of them while o1-preview got them right.

  • @ZipADeeeDoooDaaa  2 days ago +14

    The math test had formatting problems! No one can solve those questions. The equations were mostly missing the division operator.
    Here are some examples:
    Question 1:
    C=59(F-32)

    • @oscargallesargemi3986  1 day ago +1

      Thanks for noticing. I think he should do another video titled "I was wrong about being wrong about o1."

    • @pubfixture  17 hours ago

      Lol, that temperature question and answer must have been written by GPT-2.
      It's wrong all over the place.

    • @ZipADeeeDoooDaaa  14 hours ago

      @@oscargallesargemi3986 Actually, I think the prompt he came up with is really good. The math testing was flawed.

    • @blackrockcity  1 hour ago

      @10:55 The section that says 'answer explanation' says 23 when I think it should be formatted 2^3 which would equal 8. Am I wrong?

  • @charliecomberrel3842  2 days ago +21

    The math questions appear to be missing some operators, leading to incorrect answers from the AI models. In the first question, for instance, the formula should be C=5/9(F-32), specifically 5/9, not 59. The first model interpreted statement I using 59 (which is what the original question showed, making statement I false) instead of 5/9 (which would make statement I true). So, given the missing /, I would agree that the answer is B, not D. In the other model, o1 somehow interpreted 59 as 5/9, leading it to answer D.
    There must be a similar problem with the 8x2y question. I graphed the expressions in Desmos and found that the graphs for all three answer options intersected with 3x-y=12. So I agree with both models that the answer cannot be determined.
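To make the missing-operator problem concrete, here is a quick Python sketch (the function names are illustrative) comparing the correct formula C = 5/9 (F - 32) with the garbled C = 59 (F - 32) that the test page actually displayed:

```python
def c_from_f(f):
    """Correct conversion, C = 5/9 * (F - 32).

    Computed as (f - 32) * 5 / 9 so whole-degree inputs stay exact.
    """
    return (f - 32) * 5 / 9

def c_from_f_garbled(f):
    """What the test page literally showed: C = 59 * (F - 32)."""
    return 59 * (f - 32)

print(c_from_f(212))          # 100.0  (boiling point, as expected)
print(c_from_f_garbled(212))  # 10620  (nonsense, so statement I reads as false)
```

With the division restored, the formula behaves as expected; read literally as 59(F-32), it produces absurd values, which is exactly the discrepancy the two models tripped over.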

    • @kyneticist  2 days ago

      It may be that the mathematical reasoning is some kind of separate module the inference model calls on, perhaps similar to how they use statistical modelling to answer questions they can't find a direct answer for in their corpus. I don't know if anyone understands the details of how these models work well enough to say with confidence whether that (or something similar) is what's happening.

  • @CosmicCells  2 days ago +15

    Disagree; from what I have heard it's far more than a "clever" system prompt. If it were that easy, everyone would have done it already.
    Many domain experts have been very impressed with o1, myself included (I'm a biologist).
    Strawberry was probably trained with 1. CoT reasoning, 2. some sort of reflection, and 3. Monte Carlo tree search.
    But it's still interesting how far these custom instructions took you! A comparison with the normal model would have been nice.

    • @SkillLeapAI  2 days ago +7

      Thanks. Very helpful. I’ll test it against Claude and GPT when I get more credit

    • @CosmicCells  2 days ago

      @@SkillLeapAI Good idea. Then it's more obvious how well your GPT works, etc.
      Dr Waku and Dave Shapiro have some interesting videos on what they believe o1 (Strawberry) means for the path to AGI.

    • @CM-zl2jw  2 days ago

      Interesting, Saj. But isn't o1 a completely different model? It's not part of the ChatGPT family per se; it's like a new engine. I don't think it's just a fine-tuned version of ChatGPT. I heard Orion will be the equivalent of a new car with the new engine, and a hefty price tag. The Cadillac of AIs.
      And did you notice the improvement in ChatGPT? Yesterday it started a different kind of engagement with me, noticeably better. It was taking way more agentic initiative, perhaps taking control, which took me by surprise.

    • @vogel2499  1 day ago

      You guys need to relax; it's still a preview, and I heard it hasn't been trained on the full dataset yet.

  • @AIrvin88  2 days ago +5

    I think it's far more likely that people believe they're asking intelligent, complex questions when the questions are actually much simpler and less difficult than they think.

    • @unbasedcontrarian  1 day ago

      I think the likelihood of you passing your English course is lower.

    • @justremember9697  15 hours ago

      That doesn't matter. The same question can be answered numerous ways with various levels of information. Sometimes better reasoning means identifying what the person is asking and presenting an answer that is clearer for everyone.

  • @influentialstudio6464  2 days ago +1

    Most users don't need the latest model, but I disagree with your assessment. The problems that require this model are much more technical. Try questions like these.
    Non-elementary integral:
    Evaluate the integral:
    \int e^{x^2} \, dx
    This is an example of an integral with no closed-form solution in terms of elementary functions. AI systems often rely on approximations or numerical methods, but cannot solve it symbolically without special functions.
    Multivariable calculus (Divergence Theorem):
    Use the Divergence Theorem to evaluate the flux of the vector field \vec{F} = (x^2, y^2, z^2) through the surface of the unit sphere x^2 + y^2 + z^2 = 1.
    This requires understanding the Divergence Theorem in three dimensions and involves tricky vector calculus concepts. It's a challenging problem due to the surface geometry and field complexity.
    Complex contour integral (Cauchy Integral Theorem):
    Evaluate the contour integral:
    \int_{C} \frac{e^z}{z^3} \, dz
    where C is a contour enclosing the origin in the complex plane.
    I find it absurd when people test these models with "9+2-7*6-87, think step by step." 🫨
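For what it's worth, the contour integral above has a clean closed form: e^z/z^3 has residue 1/2 at the origin, so by the residue theorem the integral over any contour enclosing 0 is 2πi · 1/2 = πi. A small stdlib-Python numerical check, taking the unit circle as the contour (an assumption for illustration):

```python
import cmath
import math

def contour_integral(f, n=2000):
    """Riemann-sum approximation of the integral of f(z) dz around the unit
    circle, parametrized as z = e^{i*theta}; for smooth periodic integrands
    this converges extremely fast."""
    step = 2 * math.pi / n
    total = 0j
    for k in range(n):
        z = cmath.exp(1j * (k * step))
        total += f(z) * 1j * z * step  # dz = i * e^{i*theta} d(theta)
    return total

# e^z / z^3 around the origin: the residue theorem predicts pi*i.
result = contour_integral(lambda z: cmath.exp(z) / z**3)
print(result)  # approximately 0 + 3.14159...j
```

This is the kind of multi-step symbolic reasoning (series expansion, residue extraction, theorem application) that the commenter argues actually separates the models, unlike one-line arithmetic.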

  • @David-nb2dc  2 days ago +3

    Question 3 is correct: "The value cannot be determined from the information given." x and y are both equal to 6. Order of operations, left to right: 8x6x2x6, that's 576.
    Then your A, B and C aren't the values, and the equation isn't written according to the standards for writing equations.
    So 3(x) - y and 8(x)2(y) is the way the information should have been given.

  • @MohammedQurashi  2 days ago +1

    I think dolphin is the right answer; they breathe air and they are mammals, which are further removed from fish than turtles are.

  • @Chumazik777  2 days ago +7

    Am I missing something? Question 9 is indeed incomplete, based on the explanation given.

    • @HarveyHirdHarmonics  2 days ago +4

      Yes, I thought the same. There's no information about the previous store, so the GPTs are right.

  • @andre-guybruneau3053  1 day ago +1

    Suggested revision of your prompt
    "You are an AI assistant designed to solve problems using a structured, step-by-step approach known as Chain-of-Thought (COT) prompting. Follow these instructions before providing any response:
    1. Understand the User's Request: Carefully read and analyze the user's question or request to ensure full comprehension. Confirm the key objectives and any specific details required.
    2. Outline the Reasoning Process: Break down the problem or request into a clear, logical sequence of steps. Present these steps as a roadmap, detailing each phase of the reasoning process.
    3. Detail Each Step with Explanations: For each outlined step, provide thorough explanations, calculations, or reasoning. Aim to make your thought process transparent, ensuring the user can follow and understand each part of your logic.
    4. Provide the Final Answer: Only after completing all reasoning steps should you present the final answer or solution. Ensure that the solution directly addresses the user's original question or request.
    5. Review and Validate Your Thought Process: Rigorously review your reasoning for any errors, inconsistencies, or gaps. Conduct a final check to ensure the response is accurate and complete before delivering it to the user.
    6. Ensure Transparency and User Comprehension: Adapt your explanations to the user's level of expertise, using examples or analogies where appropriate. Strive to make the reasoning as accessible and clear as possible.
    7. Iterative Feedback Integration: Be prepared to refine or expand your response based on user feedback, fostering a dynamic interaction that ensures the user’s needs are fully met.
    By adhering to these steps, aim to provide responses that are not only accurate but also logical, transparent, and tailored to enhance the user's understanding of your reasoning and conclusions."
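In practice, instructions like the ones above sit in the system role of a chat-completions-style request. A minimal sketch (the helper, the shortened prompt text, and the model name are illustrative, not from the comment):

```python
# Illustrative sketch: package a CoT system prompt plus a user question into
# the messages payload used by chat-completions-style APIs.
COT_SYSTEM_PROMPT = (
    "You are an AI assistant designed to solve problems using a structured, "
    "step-by-step Chain-of-Thought (CoT) approach. Understand the request, "
    "outline the reasoning steps, detail each step, give the final answer, "
    "then review and validate your reasoning before responding."
)

def build_cot_request(question: str, model: str = "gpt-4o") -> dict:
    """Return a request payload with the CoT instructions as the system message."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": COT_SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    }

payload = build_cot_request(
    "A train covers 120 km in 90 minutes. What is its average speed in km/h?"
)
print(payload["messages"][0]["role"])  # system
```

The same payload shape works whether the prompt is the short sketch above or the full seven-step version in the comment.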

  • @lycas09  2 days ago +7

    Actually, "dolphin" was the right answer. The prompting isn't enough; of course o1 uses a far more sophisticated approach to make ChatGPT better at reasoning.

    • @brokoline2497  2 days ago +4

      Thanks, I study biology and thought the same thing; the test is wrong.

    • @mikebysouth105  2 days ago +1

      And other answers could be equally correct depending on the argument for why a given answer was selected. It's not a one-answer question.

    • @Mopharli  1 day ago +1

      How did you arrive at the decision that a turtle is more like a shark than a dolphin is like a shark?

    • @lycas09  1 day ago +1

      Because ChatGPT told me 😂

    • @pmHidden  1 day ago

      @@Mopharli That's not the question; the question is which is least like the others, not which is least like a shark.
      Also, you can reasonably argue for any one of these animals depending on how you categorize their similarities and differences.

  • @Mopharli  1 day ago +1

    Question 5 regarding the turtle and the dolphin is really interesting. It should have been able to pick the turtle out without too much trouble; however, the distinction that dolphins are mammals and not fish is popular, so I suspect this gem of contradictory thought is prevalent on the internet and biased into the training data.
    Also, question 3 is messed up beyond all reason without the original formatting. The solution substituted 23 in place of 8; the tester clearly intended 2 to the power of 3. It also talks about a numerator and denominator, but there are no fractions in that question. Based on the laziness of whoever created that page without the required formatting, I'd write off the entire website.
    These questions are very low complexity, single-step reasoning, below both models' challenge thresholds. The custom GPT (based on GPT-4, not even GPT-4o), as I understand it, generates the whole response in one pass and merely separates the steps into formatted sections, without any reflection on what it has already generated. o1-preview seems to run multiple evaluation queries, which genuinely can achieve this. If you want a better test, I think you should aim high enough that they both fail and see which makes the most progress, e.g. ask each to create a website for you, with appropriate formatting.

  • @exentrikk  2 days ago +30

    Saj, I usually really enjoy your videos but this one seems a bit misguided. A couple of points to note:
    1) Testing o1P with just a handful of random questions and not receiving the correct answer each time is at best inconclusive - just like getting 3 tails when tossing a coin thrice is
    2) Custom GPTs run on GPT-4, which is a step back even from GPT-4o, as numerous benchmarks have concluded, so expecting the same quality of response from GPT-4 with a couple of lines of instructions is ill-advised
    3) If you expect to "clone" o1P, a couple of lines of ill-conceived prompts to configure GPT-4 is not gonna cut it - the difference lies in the way the two models think, so to speak. Irrespective of the customizations you configure into GPT-4, it does not have the ability to gather its thoughts first and then respond - it will in almost all cases keep uttering words it best deems fit with respect to the context, unlike o1P which has the ability to recollect its thoughts, verify, and then answer. This is also exactly why GPT-4 never knows how many words there are in its responses to prompts, while o1P will tell you exactly how many - truly a paradigm shift!
    For someone who updates us on the latest in AI and sells AI courses, we expect better from you!
    Looking forward to your next video, cheers.

    • @SkillLeapAI  2 days ago +7

      That's why the video is called "I was wrong"; it's my opinion, not a scientific test. And they didn't give me enough credits to run more prompts to test other categories.

    • @0057beast  2 days ago

      Dude, I need a long piece of code fixed. Its text size is 54,000 and I can't get any AI to fix it and give me the same code back. Any advice?

    • @universalchaospaladin5019  2 days ago

      @@SkillLeapAI Still going to use your GPT.

    • @Rx4AI  2 days ago

      @@0057beast You should use Gemini to create an outline of all functions and variables, using their version with the 2M context window (I think this is standard on Vertex AI? I'm not sure). Then ask it to briefly define in natural language exactly what each function does. Then ask it to give you a blueprint for how it would refactor your code. Ask o1-preview what it thinks. Ask o1-mini to review that thought process as well. Then modularize your code, and have the models check themselves along the way.
      You can always ask a Google Developer Support person to guide you as well.

    • @Rx4AI  2 days ago

      He very clearly states it’s not a perfect test, but it is pretty neat. I think this content is fine…

  • @eugenes9751  2 days ago +4

    The whole point of this model is its ability to reason, not so much its ability to answer questions based on knowledge. Regular 4o would have answered those questions just as well.
    Try asking questions that require planning or reasoning. Ask both to program Tetris; that's where you'll see a real gap.

    • @blackrockcity  1 hour ago

      It turns out that the test was full of errors.

  • @claudioagmfilho  19 hours ago +1

    🇧🇷🇧🇷🇧🇷🇧🇷👏🏻 Sometimes I feel like we might not see groundbreaking models from OpenAI anymore, especially with the possibility of government oversight influencing it. But I hope I'm wrong; there's still a chance for innovation to thrive.

  • @lycas09  2 days ago +3

    The math problems you took are the 15 most complicated questions on the SAT math test, which includes hundreds of questions and on which GPT-4o scores more than 80%.

    • @haroldpierre1726  2 days ago +1

      Could it be that GPT4o was trained on those questions?

    • @lycas09  2 days ago

      @@haroldpierre1726 I think o1 would of course perform even better than 4o on a standard SAT. It scored around 50% in the video because this was not the normal SAT math given to students, but the 15 hardest questions.

  • @xLBxSayNoMo  2 days ago +3

    Would have been nice to see the two models plus regular GPT-4o without CoT prompting, to see if it got the same answers right as your clone.

    • @SkillLeapAI  2 days ago +1

      just finished recording it. Coming up next.

    • @xLBxSayNoMo  2 days ago

      ​@@SkillLeapAI I went back and tried your model on the turtle/dolphin one and your clone got the right answer, turtle. Maybe they're watching your videos as soon as they come out to train 4o 😂

  • @geogoddd  2 days ago +2

    Appreciate the humility in your approach. If I may offer some criticism of my own, though: you could really flesh this model out with a lot more detail and it could perhaps rival o1-mini at least. Plus, the benefit of a GPT like this is that it's not against their dumb usage policy to ask it for its thought process. Perhaps try fleshing the model out with much more practice, retraining, fine-tuning, etc., and you could be looking at a vastly different outcome, which I would love to see.

    • @SkillLeapAI  2 days ago

      Great point. I made another video already but I’ll add to the custom GPT to get it closer

  • @blackrockcity  1 hour ago

    Upvote if you think the test he used contained critical formatting errors that misled the AI.

  • @kunlemaxwell  2 days ago +2

    This is a smart one - giving it a base COT system prompt, but the sample size is too small and could be misleading.

  • @dionk6282  2 days ago +10

    It didn't get the dolphin question wrong. This is a case where the IQ of the entity answering is higher than that of whoever asked the question. Some might say that the dolphin being the only one that doesn't lay eggs, for example, matters more than what type of limbs they have.

    • @universalchaospaladin5019  2 days ago +1

      Oh, I scrolled through and didn't see your reply before I replied with a longer version of this.

    • @Bjarkus3  2 days ago

      Also, the reasoning given is that it breathes air... dolphins breathe air... I mean, it's between the dolphin and the turtle for sure, but... sharks don't lay eggs, btw. Yeah, I know, mind blown.

  • @testales  2 days ago +2

    I'm not very convinced by this level-1 vs. level-2 thinking anymore. You only think slowly on new, complex problems. The more similar problems you have seen, the faster you get at solving them, limited only by the number of variables and intermediate results you can keep in your head at once. So at some point the process becomes more or less automatic and you know this or that shortcut, and it's no longer level-2 thinking! Obviously you can solve more complex problems if you think for longer, and so can an LLM by prompting itself. So the question is not whether you can get more out of it with self-prompting and "slow thinking", but whether you want to train an LLM that way. You actually want a quick, good response, not pages of filler. Since even complex problems can be solved with level-1 thinking given enough practice, you want an LLM doing just that, pushing the limits a little further; system prompts that cause self-prompting and slow thinking should only be an option, not something active all the time. So OpenAI basically did the same as the guy who recently released that Reflection LLM, and considering the resources OpenAI has, that's pretty lame.

  • @toxicG3N  2 days ago

    Let me give you this hypothetical example.
    What color is emerald?
    GPT 3.5: green
    GPT 4o: green
    GPT o1: green
    Llama 2: green
    Gemma 2: green
    This does not prove that all of these models are equal.

  • @merzakish  2 days ago +4

    This is weird; how come they did not perform tests like yours? Many thanks.

    • @SkillLeapAI  2 days ago +2

      I think a lot of people don't know much about prompting, so having that built into a model can be beneficial for non-technical people.

  • @JordanREALLYreally  2 days ago +2

    Thank you for this prompt paste. Very good of you. Subbed!

  • @Bjarkus3  2 days ago +1

    This is a bad test, honestly. The power of o1 is in giving it complex multi-step workloads.

    • @SkillLeapAI  2 days ago

      Give me prompt examples and I’ll use it next time

  • @seregamozetmnoga1700  1 day ago

    AI reasoning abilities seem to develop the way a child's would: the same neural network structure, but progressively better intelligence as that structure is optimized.

  • @TechnoMageCreator  1 day ago

    Oh, it's reasoning like never before; the difference is that it takes the user's reasoning to new levels. If it doesn't work properly, check your own reasoning and correct it. It will work. We are our own limitation. I've been saying this for a while: AI is about awareness; it's a tool that exponentially amplifies your own thinking process. To work perfectly, the user and the AI need to be aware of the same things. Feels like magic to me.

  • @bgNinjashows  2 days ago

    Wow! Dude took on a billion-dollar company and matched their efforts. Very impressive.

    • @SkillLeapAI  2 days ago +2

      It's a $100 billion company, and I compared their two products, not my own.

    • @bgNinjashows  2 days ago

      @@SkillLeapAI very humble

  • @tommynickels4570  1 day ago

    This is the preview version. Wait a month; the full o1 version is coming. After that it's Orion, powered by Blackwell 200. This will be AGI.

  • @CSlush  1 day ago

    Turtle would only be correct if "dolphin" referred to the fish of that name rather than the mammal, as is typically meant when using the term. Although a turtle may have a less similar physical profile, as a reptile it is nonetheless more closely related to the other listed fish than a dolphin, a mammal, would be.

  • @Ordinator1  2 days ago +1

    The way the models solved the third question from the IQ test shows that they are still not very smart, unfortunately. Brute forcing the answer is valid, but it's of course not very efficient either.
    The two smallest two-digit numbers already add up to 27, so it's clear that 5 and 6 must be two of the three numbers. Since 5 + 6 is 11, the third number has to be 16.

  • @edwardserfontein4126  2 days ago +7

    You will probably get a lot of criticism in the comments section, but I like that you explored the good and the bad. You gave your honest, balanced opinion. So many ChatGPT fans don't want to hear anything about ChatGPT unless it's infinite praise.

    • @SkillLeapAI  2 days ago +2

      Thanks. I'm one of those fans myself but I just wasn't very impressed. When I tested GPT-4 vs. 3.5, it blew my mind. It wasn't even in the same ballpark. So I was expecting something similar.

    • @influentialstudio6464  2 days ago

      Lmao, c'mon man. I think the problem is that people aren't asking hard enough questions to evaluate the models. This model isn't for solving 8th-grade math. Sure, you can use it for that, but the results will be on par with GPT-4o, except you'll see the steps used to solve the problem.
      Go ahead and test these models with calculus or other advanced mathematics, and you'll find areas where 4o is horrible but the new model is crushing it.

  • @_ramen  2 days ago

    This is not how o1 works under the hood. It isn't using prompt hacks. Chain of thought is actually "baked in" to the model.
    Using chain-of-thought prompting will not produce the benchmark gains that o1 provides. Optimal chains of thought are learned via reinforcement learning during the new training process. You will not find this ability in previous models.
    The reason your custom GPT probably performs better is that chain of thought prompting does typically improve performance of previous models for tasks that involve reasoning. But it won't compare to having the chain of thoughts more deeply integrated within the model itself.

    • @SkillLeapAI  2 days ago +1

      Yeah, I'm not saying it's that simple. Just that the results are not mind-blowing like the previous GPT upgrades were, and I nearly replicated them with a custom GPT. So even if it's an entirely new architecture, it's not a big improvement from a practical standpoint.

    • @_ramen  2 days ago

      So you think they are distorting the gains they show in their benchmark testing? Because if the benchmarks are correct, it is a significant improvement over 4o.
      In my experience o1-mini is outperforming Claude 3.5 sonnet on coding problems. For reference, I was creating an animation in Javascript of a sphere transforming into a cube, and then back into a sphere. 3.5 sonnet couldn't do it. o1-mini one shotted the problem perfectly.
      Also keep in mind that o1-preview scores significantly lower than the non-preview version of the model. But they haven't released it publicly yet. So it will be a good idea to reevaluate it once that happens. I am not entirely sure why they even bothered releasing the weaker preview model, when they already have something more capable that is ready to go.
      As a side note, I find it weird that mini outperforms the larger model with coding (for now).

    • @SkillLeapAI  2 days ago

      No, I don't think they're gaming anything. I'm sure they're running much more scientific tests. Just from everyday use, I don't see a vast improvement.

  • @Techsmartreviews  2 days ago

    Finally! No more "How many R's in strawberry". Good test.

  • @kkollsga  2 days ago

    This is so cool. I tested your method on Claude by creating a new chain-of-thought project with your custom instructions. I tested it on a riddle I found on X, which normal Claude doesn't solve: «A house with two occupants, sometimes one, rarely three. Break the walls, eat the borders. What am I?» and the CoT version nailed it: peanut

    • @GutoHernandes  14 hours ago

      Why "eat the borders"?

    • @kkollsga  13 hours ago

      @@GutoHernandes It's a riddle. It's there to indicate that it's a peanut.

    • @GutoHernandes  9 hours ago

      ​@@kkollsga yeah, I understood that it's a riddle. I asked why "eat the borders", it doesn't make sense. Why would someone eat a border? You eat what's inside the peanut, not any borders.
      Then I googled the riddle, and it turns out, it's "eat the BOARDERS", not borders.

  • @charlesnuss  13 hours ago

    I've definitely run into o1 just being... dumb. Just repeating verbatim parts of previous outputs that I clearly prompted it to restructure, in cases I'm 100% sure Claude would have understood. Right now I'm thinking o1 for lengthy, in-depth first drafts that then get further processed and refined with Claude.

  • @chasisaac  2 days ago +1

    Why did you give it the potential answers? You should have left it open and allowed it to come up with its own answer.
    Also, you should have had a control GPT without your instructions and just asked it the questions.

  • @qadirtimerghazin  2 days ago +2

    Would have been good to give an example of how 4o with default settings does…

  • @mikebysouth105  2 days ago

    Your questions assume there is a right or wrong answer to all the questions. That doesn't necessarily apply to the one with the turtle. Other correct answers are possible depending on the reasoning used. So the AI wasn't necessarily wrong!

  • @Travel_DNA  1 day ago

    Tried out your CoT and it works amazingly! 10x better.

  • @alevyts3523  2 days ago +1

    OpenAI says not to use chain of thought (CoT) hints in the o1-preview prompt, because the model starts to dumb down and give worse answers.

  • @ToolmakerOneNewsletter  2 days ago

    Since you added chain of thought to 4.0, you wouldn't expect the same increase in benchmark tests, right? Did OpenAI state that you first need to add chain of thought to GPT 4.0 and then compare? Uh, congratulations on your custom GPT though!?

  • @jackstrawful  1 day ago

    I'd love to know what's going on when it misses the dolphin question. How does it not notice that the question says four legs and that dolphins don't have legs? I think understanding this error would teach us a lot about how these models actually work.

  • @micbab-vg2mu  2 days ago +1

    Agree - Sonnet is still king :) o1 is just clever prompting :)

  • @jinxxpwnage  2 days ago +7

    If anyone is still confused: this o1 model is a step back from 4o toward GPT-4, but the framework is different in that it reasons by providing itself with step-by-step logic. It's a good step TOWARDS AGI, but it is not GPT-5 or AGI yet. In a way, it's a bit more autonomous.

    • @CamPerry  2 days ago

      Claude does this way better than

  • @YoussefBarj-g3e  2 days ago

    The only revolution OpenAI is spearheading is its innovative ways of doing marketing.

  • @djayjp  2 days ago +1

    Not encouraging that it can't even get high school math right.... 😒

  • @Atractiondj  2 days ago +4

    All the tests I conducted gave results worse than free Claude... OpenAI misleads people. It can't analyze data properly; moreover, I can't even load files into the new model! And this is the most important thing.

    • @MR-DURO  2 days ago

      It’s very underwhelming

    • @gonzalobruna7154  2 days ago

      That's literally what they say in their blog post, which I recommend you read carefully. I think you're really missing the point here. This is not a new GPT model but a completely different paradigm. It opens the door for what future models will be. These new models are not supposed to be better than the current GPT-4o at most tasks, but specifically at extremely complex science questions. There's a video of a physics PhD student being surprised that o1 wrote in only one hour the code that took him a year to write himself.

  • @DanFa92  2 days ago

    You're comparing two custom GPTs and you know it. I'd use yours, but average people wouldn't. That's pretty simple to understand.

  • @Rx4AI  2 days ago

    Time to test out o1-mini for coding!

  • @David-nb2dc  2 days ago

    😢 What's confusing me the most is at 5:55. I'm not sure why the presenter needed to click Dolphin to acknowledge it was wrong; it's pretty obvious the answer is turtle.
    Slow down, world! I know he knew that. What took me by surprise is that he pretty obviously lost focus.
    And question 5 is a good example of how "checking the answer twice for a pattern" and "compare-and-contrast classification problems are commonly used to measure intelligence" seem to have overridden your chain of thought. Look at the structure of the question.

  • @ericandi
    @ericandi 2 days ago

    Several of those math questions were wrong, and ChatGPT was correct.

  • @damienjones9667
    @damienjones9667 2 days ago

    I'm starting to think that the benchmarks are being faked.
    I'm aware it's a huge reach, but it isn't too crazy an accusation.

  • @TroyShields
    @TroyShields 2 days ago

    You may want to go back and read the reasoning that you prompted both models for in the animals question. The correct answer could, and should, be dolphin.
    But you glossed right over that, which makes you like an LLM that hallucinates.
    The test you used said turtle was the right answer, so you told your viewers the same, and now it's out in the world as fact.
    It's actually kind of interesting.

    • @SkillLeapAI
      @SkillLeapAI  2 days ago

      Well even if the test was wrong, they both got the same answer. For the sake of comparing the two, it’s the same result. If both are right or wrong, it’s a wash for the test.

    • @TroyShields
      @TroyShields 1 day ago

      @@SkillLeapAI My intent is not to bash you. It really isn't. I just ask that you hold yourself to the same standard that you pitch in your video. Obviously, you are helping people by letting them know that the prompt is even more important than originally thought, but basically you used a prompt that told the model to check its work, and you didn't do the same.
      The fact that both got all questions right AND uncovered an error in one is significant, IMO.

  • @jarkkoisok9739
    @jarkkoisok9739 5 hours ago

    Maybe the IQ test used is in the training material for both models? That could explain the similar answers.

  • @BrianMosleyUK
    @BrianMosleyUK 2 days ago

    I missed the rate reset... Nightmare limitation.

  • @jaanushiiemae2164
    @jaanushiiemae2164 2 days ago +9

    Theoretically, dolphins are farther from eel, shark, and swordfish, being mammals. Turtles are reptiles, which are more closely related to fish than mammals are; reptiles and fish share a more recent common ancestor than the one shared by mammals and fish. This question was a little flawed, because depending on which one is anatomically, physiologically, and biologically different from the others, the dolphin as a mammal is more different from fish and turtle than the turtle is from fish. Sea turtles do not have legs but flippers, which evolved from fish fins, while dolphins have flippers that evolved from the legs of land mammals. Land turtles that have something like legs are called tortoises, but their "legs" also evolved from fish fins and are closer to fish fins than to mammal legs. So the answer to that question was wrong and the AI was right.

    • @unbasedcontrarian
      @unbasedcontrarian 1 day ago

      This is only taxonomically speaking. If one were to assess the real physiological differences, structures, modes of transport, feeding, pain processing, intelligence, and various forms of parallel evolution, you'd have to be insane to come to that conclusion. But thanks.

    • @jaanushiiemae2164
      @jaanushiiemae2164 22 hours ago

      @@unbasedcontrarian They did not give a choice between three squares, one rectangle, and one triangle, so you could answer simply based on exterior similarities. Since they gave 5 different lifeforms from four different classes, the question should not have been based on the wrong opinion that the dolphin is closer to fish just because it looks like a fish. Why give a dolphin at all? Why not just 4 fish and a turtle? The exterior look usually has little to do with how animals are classified and how their closeness is measured. Earthworms and snakes look alike, but they are from very "different planets", not even close. Snake and turtle are from the same class, and they do not look similar at all. If you want to evaluate intelligence, you cannot just put out the "right" answer that a 3-year-old would choose; real intelligence is deeper and is a mix of logic and knowledge. AI would never use just a 3-year-old child's logic; it would also analyze all the information about those 5 lifeforms and only then decide which one is more different from the others.

  • @jasondee9895
    @jasondee9895 1 day ago

    You need to do a follow-up video to this admitting that you messed up.

    • @SkillLeapAI
      @SkillLeapAI  1 day ago +1

      I just finished a more comprehensive test with a lot more prompts. Posting soon.

  • @yoely2098
    @yoely2098 2 days ago +2

    o1 is amazing, tf are you guys talking about

    • @SkillLeapAI
      @SkillLeapAI  2 days ago +1

      In what category has it beaten the previous options for you? Also, this is just my opinion.

    • @TLCMEDIA1
      @TLCMEDIA1 2 days ago

      I agree, o1 is amazing, way better than GPT-4. I remember giving GPT-4 two invoices, the original invoice and a discounted one, then asking it to generate a credit note. GPT-4 hallucinated, but o1 got it right on the first try; you can try it yourself.
      Make the second invoice, say, 30% cheaper, but keep one item's price consistent.

    • @Sindigo-ic6xq
      @Sindigo-ic6xq 2 days ago

      @@SkillLeapAI PhD-level physics, math, genetics, medicine. There you can prompt 4o as much as you want and it can't compete. Please use some problems from the mentioned fields that can be found in advanced textbooks and test both models, and also Claude if you want.

  • @Olaf_Schwandt
    @Olaf_Schwandt 2 days ago +1

    I have a standard test.
    I ask the AI for one hint in Sudoku (ChatGPT-4o as a picture, o1-preview as text), and neither of these models is able to solve it. I think o1-preview is over-hyped. It's not a deeply new technique; in reality, ChatGPT-4o essentially runs several times and checks its own results.

    • @djtate1975
      @djtate1975 2 days ago +3

      Strangely enough, a couple of months ago I gave the default model a Sudoku puzzle and it solved it correctly. However, about 2 weeks later I did it again and it got it wrong. 🤔... My guess is that they are doing something on the back end that affects the way the model "reasons".

    • @vanessa1707
      @vanessa1707 2 days ago

      @@djtate1975 Totally concur. I have had experiences with ChatGPT-4o where it gets an answer right on the first try, and then when prompted again at a different time in a new chat, it gets it wrong! Not sure what is up with that!

    • @Olaf_Schwandt
      @Olaf_Schwandt 2 days ago

      @@djtate1975 Thank you for the answer, that's really interesting; I never got the right answer. Let's observe future behavior.

    • @pmHidden
      @pmHidden 1 day ago

      @@djtate1975 They regularly release new versions and even between versions, I've noticed behavioral changes (and so have others) that can break previous functionality.
      I've written a test suite for my company that I use to evaluate models for our use cases (mix of responses and tool calls using the async streaming versions of their respective Python libraries) whenever new models are released. It's not rare that an old model suddenly performs significantly worse in some aspect (e.g., no longer calling tools reliably).
      Sometimes, even factors such as the time of the day can make a difference. It might have to do with their internal prompts but it might also be that they simply deploy multiple versions of the same model and switch based on certain criteria (e.g., a more quantized version during peak hours).
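[Editor's note] The kind of regression suite described in this reply can be sketched in a few lines. This is a minimal illustration, not the commenter's actual code; `ask_model` is a hypothetical stub standing in for a real chat-completions API call:

```python
# Minimal sketch of a model regression suite: run a fixed set of prompts
# against a model on each new release and score the answers.
# `ask_model` is a stand-in for a real API call (e.g., an OpenAI SDK call);
# here it is stubbed with canned answers so the harness can be demonstrated.

def ask_model(model: str, prompt: str) -> str:
    """Hypothetical stub; replace with a real chat-completion call."""
    canned = {
        ("gpt-4o", "2+2"): "4",
        ("o1-preview", "2+2"): "4",
    }
    return canned.get((model, prompt), "")

def run_suite(model: str, cases: list[tuple[str, str]]) -> float:
    """Return the fraction of (prompt, expected answer) cases passed."""
    passed = sum(
        1 for prompt, expected in cases
        if ask_model(model, prompt).strip() == expected
    )
    return passed / len(cases)

cases = [("2+2", "4")]
print(run_suite("o1-preview", cases))  # → 1.0
```

Re-running the same suite whenever a provider swaps model versions is what surfaces the silent regressions described above.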

  • @ktb1381
    @ktb1381 2 days ago

    Interesting, but maybe next time you could try it against plain vanilla 4o as well as your custom GPT?

    • @SkillLeapAI
      @SkillLeapAI  2 days ago

      Yep just recorded that. Posting soon

  • @bartczernicki
    @bartczernicki 2 days ago

    You keep saying in your videos that o1 scored X on math. There are multiple model versions in the "o1" family. This video is a little disingenuous, as you are testing o1-preview; if you look at OpenAI's own announcement, they specifically call this out. Furthermore, note there is no optimization of settings/system prompt/function calling with the APIs. And of course, as usual, lots of knee-jerk responses below: "I am cancelling my subscription" lol.

    • @SkillLeapAI
      @SkillLeapAI  2 days ago +2

      There are o1, o1-preview, and o1-mini. The best one I have access to is o1-preview, which in their benchmarks still beat 4o by a mile. I can't test o1 since I don't have access to it. So I did the best I could with the tools I have access to.

  • @MuzammilAhmad-tw4fb
    @MuzammilAhmad-tw4fb 2 days ago

    Hi Saj, sometimes the websites that provide questions and answers have wrong answers too. We need to cross-check their answers as well.

    • @SkillLeapAI
      @SkillLeapAI  2 days ago

      Yeah, I understand. For this example, I don't think it made a difference to the point I was trying to make. They both always got the same answer regardless of the prompt. So the o1 model didn't really outperform a custom GPT. That was the point of the video.

  • @gamesshuffler-v8n
    @gamesshuffler-v8n 2 days ago

    The message cap is ultra, ultra, ultra low, which means it is very heavy for the servers to run. So why are they adopting this chain-of-thought prompting? There are other prompting methods that can be good; they should also adopt those for cost efficiency, like decision trees, SVMs, and other types of techniques, if the server load from this GPT is that heavy and it also takes a whole lot of time to answer just one question.

  • @TechnoMageCreator
    @TechnoMageCreator 2 days ago

    For decades, corporations bought intelligence cheaply from other human beings. In their heads, making an AI for themselves will give them absolute power. They built something like that and are trying to charge a lot of money. That model will never work; someone smarter will just take it and make a free one. The model is amazing, but anyone can use it for a bit and build their own. Not sure why they are trying to go closed source and make money; it's too late for that step. It's the Wild West for MVPs that have ideas.

  • @Wreck_Crimes
    @Wreck_Crimes 2 days ago

    That's a reflection thing, not GPT-o1.

  • @xtra_612
    @xtra_612 2 days ago

    I got so mad. I needed the money to eat, and I had already paid for Claude AI; seeing the hype, I bought it anyway, and it is useless. I tried so many ways to make it better, but it is bad. It doesn't follow instructions, and the answers aren't better than 4o, so what's the point of this?

  • @briankgarland
    @briankgarland 2 days ago

    30 messages a week means it's unusable anyway. Can't wait for Claude 4.

  • @marsonal
    @marsonal 2 days ago

    Can you test o1-preview against Claude 3.5 Sonnet with your technique?

  • @chasisaac
    @chasisaac 2 days ago

    Why did you give it the potential answers? You should leave it open and allow it to come up with its own answer.

    • @chasisaac
      @chasisaac 2 days ago

      Also, why do we assume SAT? Why not use GRE or MCAT questions?
      SAT questions are still high-school-level questions, and it's been said before that GPT-4 is equivalent to a high school student, so the test doesn't surprise me in any way and actually has pretty much the expected results.

    • @SkillLeapAI
      @SkillLeapAI  2 days ago

      I did in my first test in a different video. It was almost always wrong that way.

    • @chasisaac
      @chasisaac 2 days ago

      @@SkillLeapAI Well, that is even more telling, and problematic. The basic problem with multiple-choice answers is that they eliminate the number line and reduce it to four points on it, so the model can always work backwards to verify the answer, which is why I'm surprised there was even one wrong answer.

  • @protips6924
    @protips6924 2 days ago

    It doesn't make any sense that a simple prompt can generate the same responses on different models. The models are not open source, but there have to be some major differences. For instance, the API for the o1 model is 5x more expensive per 1 million tokens than the GPT-4o model. Unless OpenAI is reaching scam levels, there has to be some complex reasoning happening in the background.
    Although it is very possible that this is just a cash grab, as if they don't have enough already.

    • @SkillLeapAI
      @SkillLeapAI  2 days ago

      Well, it outputs a whole lot more tokens to give you a response because of the chain of thought. GPT-4o just answers without outputting its thought process.
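[Editor's note] The "chain of thought via system prompt" approach being compared in this thread can be sketched as follows. The prompt wording and message structure are illustrative assumptions, not OpenAI's actual o1 internals or the creator's exact custom-GPT prompt:

```python
# Sketch of a custom-GPT-style setup: a system prompt that asks a regular
# chat model to reason step by step and double-check itself before answering.
# The wording is illustrative only.

COT_SYSTEM_PROMPT = (
    "Think through the problem step by step before answering. "
    "Write out your reasoning, check it for errors, and only then "
    "state the final answer on its own line prefixed with 'Answer:'."
)

def build_messages(question: str) -> list[dict]:
    """Assemble the message list you would pass to a chat-completions API."""
    return [
        {"role": "system", "content": COT_SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

msgs = build_messages("If 3x + 5 = 20, what is x?")
print(msgs[0]["role"], len(msgs))  # → system 2
```

Because the model is instructed to write out its reasoning, responses contain many more tokens than a direct answer, which is the cost difference discussed above.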

  • @adolphgracius9996
    @adolphgracius9996 2 days ago +2

    It's easy to make something after seeing someone else do it and explain how they did it; if OpenAI hadn't told us about the reasoning layer, there is no way you would've figured it out. It's OK for tools not to be perfect. I'd argue that if OpenAI released an AI that makes no mistakes, people would freak out and start panicking, so let's enjoy our time here before Skynet comes online 😅

    • @protips6924
      @protips6924 2 days ago

      Honestly, it's not that hard. It's simply chain of thought; it doesn't take a genius.

    • @pmHidden
      @pmHidden 1 day ago

      Do you seriously believe this was OpenAI's idea? Not only is this a very intuitive thing to come up with when you're working with these models, but there have also been countless publications with similar ideas for years.

  • @ricardocnn
    @ricardocnn 2 days ago

    The prompt is great for custom GPTs. That's it.

  • @Trendilien69
    @Trendilien69 1 day ago

    Your video is a disservice; you don't understand math in depth. o1-preview is capable of solving complex math and physics problems at university level, and 4o with your prompt can't. Some of the questions you used to test were wrong.

    • @SkillLeapAI
      @SkillLeapAI  1 day ago

      Post a video showing me how o1 is 5x better than my GPT and send me a link, please. Use any example you like.

  • @remi.bolduc
    @remi.bolduc 2 days ago

    Try this one, in music: you start on the note D and go up by a major third, repeating the process 4 times. o1-preview gets this; your prompt doesn't, it seems. The answer is D, F#, A#, D, F#. Actually, it did get it the second time.

  • @kyrsid
    @kyrsid 2 days ago

    You don't stick to the point, and I felt frustrated watching your video. A negative vote from me.

  • @JimWellsIsGreat
    @JimWellsIsGreat 2 days ago

    Try testing your clone against the complex prompt provided in Samer Haddad's video, where he claimed he was wrong about o1.

    • @SkillLeapAI
      @SkillLeapAI  2 days ago +1

      Ok I’ll check it out. Haven’t seen that video

    • @JimWellsIsGreat
      @JimWellsIsGreat 2 days ago

      @@SkillLeapAI Kyle Kabasares posted a couple of videos of it getting correct answers on PhD-level physics problems. It took o1-preview 122 seconds to do what takes a person 10 days.
      It's wild how it can get correct answers on crazy, high-level problems but fail at math that is comparatively simplistic.

  • @Soccer5se
    @Soccer5se 2 days ago

    What were the limits increased to?

    • @SkillLeapAI
      @SkillLeapAI  2 days ago +1

      They just reset it. Still 30 a week

    • @SkillLeapAI
      @SkillLeapAI  2 days ago +1

      They just changed it again: 50 a week now for o1-preview.

    • @Soccer5se
      @Soccer5se 2 days ago

      @@SkillLeapAI Thanks for keeping us up to date!!

  • @rj2764
    @rj2764 2 days ago

    I've had 4.0 answer math problems off of a screenshot. I canceled my subscription, then watched your video on the new version and decided I would give it a try. With your crap, I can't even upload a PDF to it.

  • @buffaloraouf3411
    @buffaloraouf3411 2 days ago

    Can you share the prompt? In case your custom GPT is limited, we will try it on Hugging Face.

  • @Ro1andDesign
    @Ro1andDesign 2 days ago +3

    I used the o1-preview until I hit the limit. So far, in my experience, Claude is still FAR better at answering complex questions

    • @RippyCrack
      @RippyCrack 2 days ago

      Exactly, even at coding

    • @greenboi5632
      @greenboi5632 2 days ago +1

      Oh sure? I used o1 and it's far better than Claude.

    • @CamPerry
      @CamPerry 2 days ago

      @@greenboi5632 It absolutely is not. If anything, the new model is worse.

  • @Emc-it4lg
    @Emc-it4lg 2 days ago

    You are awesome! Your GPT works really well :)

  • @jamesmarvin1920
    @jamesmarvin1920 2 days ago

    The prompt was flagged for me.

  • @denisbellerose8757
    @denisbellerose8757 2 days ago +1

    Thanks!

  • @Appocalypse
    @Appocalypse 2 days ago +1

    Not to be mean, but this is one of the most flawed benchmarks I've watched on YouTube in the past week. A LOT of your questions, especially from the SAT set, are either incomplete or incorrect. You should use a source that you validate with your own reasoning first to make sure it's not garbage.

    • @SkillLeapAI
      @SkillLeapAI  2 days ago

      Ok posting a more complete test next week

    • @SkillLeapAI
      @SkillLeapAI  2 days ago

      Also, to be clear, this was not at all a benchmark video. I was simply showing that for most use cases, the o1 model gives similar results to the custom GPT with the CoT system prompt, regardless of the input prompts. I wouldn't use random questions for an actual LLM benchmark test.

    • @Appocalypse
      @Appocalypse 2 days ago

      @@SkillLeapAI I understand the intent, and I do think your CoT prompt GPT makes GPT-4o a lot more effective, but my point was that most of the questions o1 failed on were incomplete or incorrect.
      Once you run it on a validated and correct set of questions, and you find that your CoT GPT works fairly well compared to o1, your next logical step should be to find even more complex problem sets. The best ones to showcase the differences will be graduate-level physics and math problems.

    • @SkillLeapAI
      @SkillLeapAI  2 days ago +1

      I just finished another, more comprehensive video that I will post soon

    • @SkillLeapAI
      @SkillLeapAI  2 days ago

      If you have a place where I can source questions, please let me know for upcoming videos

  • @Horizon-hj3yc
    @Horizon-hj3yc 2 days ago

    The AI hype train has arrived again.

  • @user-ke2op7fv2z
    @user-ke2op7fv2z 2 days ago

    How much is the limit now?

    • @SkillLeapAI
      @SkillLeapAI  2 days ago

      I think it's 30 messages per week.

    • @mattbelcher29
      @mattbelcher29 2 days ago

      Have you tried using a similar custom prompt in Claude Sonnet?

    • @SkillLeapAI
      @SkillLeapAI  2 days ago

      Not yet, but I hear people are getting similar results.

  • @matthew04101
    @matthew04101 2 days ago

    This proves Strawberry is low-hanging fruit. And I would expect more from OpenAI.

  • @avi7278
    @avi7278 2 days ago

    This is hilariously misguided. It just goes to show that this stuff is really hard for the layman to grasp.

    • @SkillLeapAI
      @SkillLeapAI  2 days ago

      It's a B2C product in chat format for the layman. So if I can't see the huge improvement they claimed in my day-to-day work, what exactly is the difference that makes an impact for me?

    • @avi7278
      @avi7278 2 days ago

      @@SkillLeapAI Put it this way: probably nothing you do on a daily basis would benefit from using o1. Your conclusion is correct, but the way you got there is filled with misunderstandings. o1 is for a very specific subset of problems that can benefit from extra reasoning steps, and whether it gets the right answer on a math question still nearly comes down to chance. It can reason through a whole problem correctly and then spit out a wrong answer, because it doesn't actually care about giving you the right answer to a math question, only about what the most likely next token is after all that reasoning; sometimes the next token is not actually the answer, even though it probably "knows" the correct answer.
      Math questions like that are simply benchmarks, and it's a misconception that benchmarks are a measure of how good or bad a model is. Just because they got it to spit out the right token 81% of the time doesn't make it a model that's reliably good at answering math problems, nor does its ability to do so mean it's good.
      The subset of problems that o1 excels at are problems where there is no one right answer, but rather a sliding scale from gibberish, to a coherent but completely wrong/bad response, to a coherent and generally good response, to (by a stroke of mathematical next-token luck) an amazing response. For example, brainstorming, planning, and project structures (especially based on existing frameworks like DDD) all fall within this subset of problems that benefit from advanced reasoning, among many others. It's not for daily-driver use, and no, that prompt running GPT-4o won't "clone" o1.

    • @avi7278
      @avi7278 2 days ago

      @@SkillLeapAI YouTube is worthless; I left a detailed comment which just got deleted. Unless you did it for some reason.

  • @jonathanparham7421
    @jonathanparham7421 2 days ago

    Not a fan

  • @neondreamscapesmusic
    @neondreamscapesmusic 2 days ago +14

    Fed up with OpenAI BS. I will cancel my premium account

    • @I_Mackenzie
      @I_Mackenzie 2 days ago +1

      I did. Moved my money to Claude and I’m glad I did. Projects are so good in Claude.

    • @FauciGroyper
      @FauciGroyper 2 days ago +2

      @@I_Mackenzie But aren't there significant usage limits with Claude? I use it for free and can go up to 7-10 prompts max, and it says the Claude paid version is only 5 times that.

    • @mystealthlife6991
      @mystealthlife6991 2 days ago

      @FauciGroyper Why not just host Ollama or Alpaca locally?

    • @NeelsWorld
      @NeelsWorld 1 day ago

      I already cancelled my subscription; I'll try Claude premium soon.

    • @throw22away
      @throw22away 1 day ago

      ​@@I_Mackenzie Claude is literally just a ChatGPT clone 😂

  • @MichealScott24
    @MichealScott24 2 days ago +1

    ❤️

  • @Hae3ro
    @Hae3ro 2 days ago

    It's bad.

  • @bradleylouis7635
    @bradleylouis7635 1 day ago

    Well, news flash: stop being a hype beast. I get it, YouTubers are trying to play the algorithm, but hopefully not at the cost of your own credibility and, more importantly, dignity. The truth is, the model is NOT that great. It isn't some "powerful" new model that will "revolutionize" anything; it's just hype built on corporate-level buzzwords. I know, because I've literally written marketing policy in previous professional-level positions, and its selling points are a lot like the nonsense marketing strategies used as far back as 1997. This model is based on pure greed, with the intent to drain people's pockets, as well as a flight of ideas from this gradually sinking company. GPT mini will suffice for most applications, and if you understand informatics, it's more than enough to help you build statistical/probability models.