Is DeepSeek R1 really better than OpenAI o1? Let's find out

  • Published Jan 29, 2025

COMMENTS • 53

  • @TheFeatureCrew
    @TheFeatureCrew  1 day ago +2

    Since there seems to be some confusion in the comments:
    - these are the tests we run on every model (check our previous videos)
    - what is presented is, as always, each model’s first attempt at the problem, no cherry-picking

  • @Mario_Pintaric
    @Mario_Pintaric 2 days ago +24

    Your Tower of London problem is incorrectly solved based on the image of the rods initially shown. Given that the rods are of different lengths, the ChatGPT solution is incorrect. I was able to create a prompt for DeepSeek that solved this problem correctly with a much smaller version of the model (70B distilled, 8-bit). It would be helpful if you provided the details of the problems and the prompts you used to solve them, so your intentions were clearer and others could try/verify your approach.

    • @shonnspencer1162
      @shonnspencer1162 1 day ago +4

      You got that too. I think they are too biased toward OpenAI. They need to stop looking at it from the perspective of a model they have been working with for years. This was not scientific on any level; it is just biased inference toward OpenAI. I thought they would do better. I like to always be positive, but this was just them trying to prove that OpenAI had the better model, and if you work that way you will come to your own conclusion. Work in a neutral way so that you can see what actually happens.

    • @JiasenLiu
      @JiasenLiu 18 hours ago +1

      09:11 yeah, the rule that the bottom disk can't be moved first isn't even mentioned in the prompt.

    • @Mario_Pintaric
      @Mario_Pintaric 4 hours ago

      @@JiasenLiu Agreed. Without a capacity limit for the pegs and additional rules, this test is meaningless. The only way an AI would get this right is if it made a bunch of assumptions, which in itself would be invalid. One of the issues with current AI implementations is the extent to which oversampling is used to pre-determine certain results. For industrial applications, that could be fatal. We're a long, long way from truly practical AI. But I'm not going to say that out loud. My "correct" prompt solves the problem they present using the DeepSeek 32B (8-bit) local model. The full-blown ChatGPT flubs the same prompt!!! I'll be releasing my findings on LinkedIn tomorrow.

  • @perelmanych
    @perelmanych 2 days ago +27

    No surprise that a $5 million model you can run locally was outperformed by a model from an organization that burns trillions of dollars. What is surprising is that they are not so far off.

    • @MARKXHWANG
      @MARKXHWANG 1 day ago

      Tell that to the stupid traders at the biggest hedge funds.

  • @Nuk1945
    @Nuk1945 1 day ago +6

    Try this prompt: "Solve the following problem. The rule is the following: there are three pegs; colored disks are placed on the pegs, one on top of the other; only the top disk can be moved; one disk can be moved at each step. Initial configuration is: Peg A: empty; Peg B: green disk at bottom, red disk at top; Peg C: blue disk. Target configuration: Peg A: green disk at bottom, blue disk at top; Peg B: red; Peg C: empty." DeepSeek R1 solved it correctly in 4 steps. I also tried the 32B R1 model on my PC; it solved it in 6 steps.
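
    As stated, the prompt puts no peg-capacity or disk-size constraints on moves, so any top disk may go to any peg. A small brute-force search over that rule set (plain Python, peg contents taken from the prompt above) confirms that 4 moves is indeed the shortest solution:

        from collections import deque

        # Pegs listed bottom -> top, exactly as in the prompt above (A, B, C).
        START = ((), ("green", "red"), ("blue",))
        GOAL  = (("green", "blue"), ("red",), ())

        def moves(state):
            # Only the top disk of a peg may move; one disk per step.
            for src in range(3):
                if not state[src]:
                    continue
                for dst in range(3):
                    if dst == src:
                        continue
                    pegs = [list(p) for p in state]
                    disk = pegs[src].pop()
                    pegs[dst].append(disk)
                    yield (disk, src, dst), tuple(tuple(p) for p in pegs)

        def solve():
            # Breadth-first search returns a shortest move sequence.
            queue, seen = deque([(START, [])]), {START}
            while queue:
                state, path = queue.popleft()
                if state == GOAL:
                    return path
                for move, nxt in moves(state):
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, path + [move]))

        for disk, src, dst in solve():
            print(f"Move {disk}: {'ABC'[src]} -> {'ABC'[dst]}")

    This prints the same 4-move sequence the commenter reports: red B→C, green B→A, red C→B, blue C→A.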

  • @tantzer6113
    @tantzer6113 1 day ago +5

    We have benchmarks for a reason: so we don’t cherry pick successes and failures.

  • @carkawalakhatulistiwa
    @carkawalakhatulistiwa 1 day ago +5

    The most important comparison is price.

  • @sabyasachighosh6252
    @sabyasachighosh6252 2 days ago +7

    Wouldn't Tower of Hanoi type problems already be in the train set of o1?

    • @TheFeatureCrew
      @TheFeatureCrew  2 days ago +1

      Exactly - the type of problem but not the instance. With such a simple test there is a chance that the instance for this ep was in training data. More complex instances take more time so we only push it when needed for a conclusion. See our o1 pro review for an instance that certainly is not in training data. This is similar to giving a model a specific math or physics question that is not in training data, even though plenty of technically similar problems are.
      - Jakob

  • @jnevercast
    @jnevercast 1 day ago +2

    Thanks for doing this comparison.
    I want to add, the OrbitControls issue has been a bane for Claude 3.5 Sonnet too!

  • @IvarDaigon
    @IvarDaigon 1 day ago +2

    You can't assume that DeepSeek R1 and o1-mini are in the same order of magnitude in parameter count just by looking at how fast they produce tokens via their respective APIs.
    We have no idea what kind of hardware they are running on, how much VRAM they are using, how many concurrent users there are, etc.
    Even if R1 is not as reason-able as o1, the benefit of R1 should be obvious: you can run it at home 24/7 and all you need to pay for is the electricity (a minimal local-serving sketch follows after this thread). This is going to lead to some very serious grassroots innovation.

    • @TheFeatureCrew
      @TheFeatureCrew  1 day ago +1

      You are correct; I cut out the part where I mentioned this as I was kinda babbling. We are also basing this assumption on the prevalent rumors that o1-mini could be as small as 12B params. Perhaps I should have worded it as: R1 is certainly within an OOM of o1-mini; two OOMs seems impossible.
      Agree that r1 is the next in a long line of OS/OW models bringing somewhat-near-frontier capabilities to edge devices. That said, what should also be obvious is there will be an arms race for frontier inference time compute AND distillation trends will continue bringing legitimately frontier capabilities to edge hardware.
      - Jakob
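
    For the "run it at home" route mentioned above, one common setup is serving a distilled R1 checkpoint behind an OpenAI-compatible endpoint (vLLM, llama.cpp's server and Ollama all expose one) and querying it like any hosted model. A minimal sketch; the URL, port and model name below are placeholders, not something from the video:

        # Assumes a locally served DeepSeek-R1 distill behind an OpenAI-compatible
        # endpoint; adjust base_url and model to whatever your local server reports.
        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

        reply = client.chat.completions.create(
            model="deepseek-r1-distill-qwen-32b",  # placeholder local model name
            messages=[{"role": "user", "content": "Solve the three-peg disk puzzle from the video."}],
        )
        print(reply.choices[0].message.content)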

  • @Laptevwalrus
    @Laptevwalrus 2 days ago +6

    Agree, R1 is not as powerful as o1. However, I noticed it's much more stable across multiple prompts (conversation). For example, it solves my puzzle problem with some hints, while o1 struggles, being stubborn. Also, analysing the thought process of R1, it feels like it learned to be critical of itself, which is really interesting in my view.

    • @AdamB1_23
      @AdamB1_23 2 days ago +2

      From what I’ve seen, the reason this happens is that OpenAI does not pass the previous reasoning tokens through to the next response, so it never knows how it worked out previous points, causing it to refuse to backtrack. R1 however, despite using a lot more tokens, passes its reasoning tokens through so it can understand what it did before (see the sketch after this thread). Basically how humans think.

    • @KashifKhan-iw2ns
      @KashifKhan-iw2ns 2 days ago +1

      @@AdamB1_23 Exactly, I never cared about context until I used DeepSeek, and its analysis of previous answers is just so good. Although it is not as powerful as o1, it is still very comparable.

    • @Raylz-h9c
      @Raylz-h9c 2 days ago

      Everything I've done with R1 for days results in "sorry, maintenance" or "too much traffic".
      Ty for the info xD
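
    To make the distinction discussed above concrete: the difference is simply whether the model's reasoning trace from turn N is still visible when turn N+1 is generated. A toy sketch of the two conventions; this illustrates the commenters' description, not the documented behaviour of either API:

        def build_history(turns, keep_reasoning):
            """turns: [{'user': ..., 'reasoning': ..., 'answer': ...}, ...]"""
            messages = []
            for t in turns:
                messages.append({"role": "user", "content": t["user"]})
                if keep_reasoning:
                    # The working travels with the answer, so later turns can
                    # see *why* the earlier conclusion was reached.
                    content = f"<think>{t['reasoning']}</think>\n{t['answer']}"
                else:
                    # Only the final answer survives; the model must defend or
                    # revise its earlier conclusion without its own working.
                    content = t["answer"]
                messages.append({"role": "assistant", "content": content})
            return messages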

  • @akim5030
    @akim5030 2 days ago +3

    Does deepseek have any limitations like number of prompts per day?

    • @揪芭比母捏牛
      @揪芭比母捏牛 2 days ago +2

      Currently no limits, apart from some sensitive topics.

    • @Maisonier
      @Maisonier 2 days ago

      @@揪芭比母捏牛 Like chatgpt ...

    • @globurim
      @globurim 12 hours ago

      So far no, but there might be if it becomes too popular. You can also try it locally so you have no limits.

  • @saltybaguette7683
    @saltybaguette7683 2 days ago +3

    I'm curious, why do you say o1 and R1 sizes are in the same order of magnitude? o1 runs on a server you don't know anything about (I think). The size might be within an order of magnitude, but there could also be an order of magnitude more/fewer resources available to you on the OpenAI server.

    • @TheFeatureCrew
      @TheFeatureCrew  2 days ago

      We were comparing o1 mini to r1 in param count, not o1

  • @nahlene1973
    @nahlene1973 2 days ago +3

    Try philosophical questions. I found DeepSeek's thinking & literary level goes above almost any writer I've read, to the point that I kinda think it could write Nobel-literature-level stuff; there were a few times I felt I was talking to J.P. Sartre😂

  • @3koozy
    @3koozy 2 days ago +4

    If it were that easy to tell which model is better with a few examples, all the benchmarks would be obsolete.

  • @tonycosentino
    @tonycosentino 1 day ago

    Thank you for demonstrating exactly what I have experienced in my own daily tests: I use o1 and o1 pro but put the same questions into R1, and most of the time it is missing decent reasoning.

  • @damien2198
    @damien2198 2 days ago +6

    I find DS far more useful/powerful than O1 (and soon O3) as I can upload tons of docs to it (pdf/docs/xls)

  • @frank6687
    @frank6687 2 days ago +3

    Interesting, I just put question 3 into ChatGPT o1 and got the same answer as R1:
    Here’s a concise, step-by-step solution for reaching the goal configuration:
    Initial setup: Peg A: empty; Peg B: Green (bottom), Red (top); Peg C: Blue
    Goal configuration: Peg A: Green (bottom), Blue (top); Peg B: Red; Peg C: empty
    Steps to solve:
    1. Move Red from Peg B to Peg C. (A: empty; B: Green; C: Blue (bottom), Red (top))
    2. Move Green from Peg B to Peg A. (A: Green; B: empty; C: Blue (bottom), Red (top))
    3. Move Blue from Peg C to Peg A (on top of Green). (A: Green (bottom), Blue (top); B: empty; C: Red)
    4. Move Red from Peg C to Peg B. (A: Green (bottom), Blue (top); B: Red; C: empty)
    Final configuration: Peg A: Green (bottom), Blue (top); Peg B: Red; Peg C: empty
    This completes the puzzle in 4 moves.
    But after giving o1 and R1 a hint that the first answer was wrong, both of them corrected it and gave me the right answer.
    R1 (4 steps): Move Red (B → C); Move Green (B → A); Move Red (C → B); Move Blue (C → A)
    o1 (5 steps): Move Red (B → A); Move Green (B → C); Move Red (A → B); Move Green (C → A); Move Blue (C → A)
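
    Both corrected answers are easy to check mechanically. A small replay of the move lists quoted in the comment above (plain Python, starting state taken from the prompt) confirms that R1's 4-step and o1's 5-step sequences both reach the goal legally:

        START = {"A": [], "B": ["green", "red"], "C": ["blue"]}   # bottom -> top
        GOAL  = {"A": ["green", "blue"], "B": ["red"], "C": []}

        def replay(move_list):
            pegs = {k: list(v) for k, v in START.items()}
            for disk, src, dst in move_list:
                # A move is legal only if the disk is currently on top of src.
                assert pegs[src] and pegs[src][-1] == disk, f"{disk} is not on top of {src}"
                pegs[dst].append(pegs[src].pop())
            return pegs

        r1 = [("red", "B", "C"), ("green", "B", "A"), ("red", "C", "B"), ("blue", "C", "A")]
        o1 = [("red", "B", "A"), ("green", "B", "C"), ("red", "A", "B"),
              ("green", "C", "A"), ("blue", "C", "A")]

        print(replay(r1) == GOAL, replay(o1) == GOAL)  # True True (4 vs. 5 moves)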

  • @TechyMage
    @TechyMage 1 day ago +2

    Can someone repeat the test and also use other tests? Because I smell a lot of bias here. Based on my experience, DeepSeek is performing better on close-to-olympiad-level maths and physics problems, including physics problems that need a good amount of reasoning to solve. Either the tests are cherry-picked, or they are somehow fucking with the results, or maybe it's a coincidence.

  • @DaveHoskinsCG
    @DaveHoskinsCG 19 hours ago

    The idea is that you teach it where it went wrong and right. It will reason over the previous result and think about a better answer. A reasoning model based on incentives.

  • @MARKXHWANG
    @MARKXHWANG 1 day ago +1

    Great content. Out-of-sample testing is the real test; in-sample is just gaming the benchmark.

  • @MrRandomnumbergenerator
    @MrRandomnumbergenerator 1 day ago

    Amazing test, would you make more, guys?

  • @ominoussage
    @ominoussage 1 day ago +1

    Even if it were true that o1 is still better, that won't stop companies and enthusiasts from using R1 for themselves, simply because they can host it / it's cheaper to use while not being too far off of o1's performance. Just one or two more prompts and you can achieve a similar level of response to o1. I'd rather spend a little more time to get the response I want than spend a lot more money.

  • @sagumekishin5748
    @sagumekishin5748 2 days ago +3

    You can try prompting the same questions in Chinese to see if R1 is better at reasoning in Chinese than English

  • @DanFrederiksen
    @DanFrederiksen 2 days ago

    Quite impressive that they can code some 3D from scratch; however, I find that any level of real-world interconnectivity of code, with constraints from various directions, immediately makes them fail. So instead of a neat 3D ball from scratch, maybe try a relatively straightforward subtask inside some existing code of even moderate complexity.

  • @MultiBraner
    @MultiBraner 2 days ago

    Subscribed ... amazing stuff

  • @i4aneye618
    @i4aneye618 2 days ago

    Great work!

  • @Metarig
    @Metarig 18 hours ago

    For coding, you have to keep going back and forth, letting the AI fix issues. With only 50 prompts per week on o1, it's ridiculous: you can't get anything done unless you pay $200 a month. At that point, you might as well pay someone to do it for you instead of stressing over AI limitations. I’m not sure about the technical side, but these guys seem biased as hell.

  • @lizadonrex
    @lizadonrex 12 hours ago

    You need to ask DeepSeek who the leader of China is and what happened in Tiananmen Square in 1989.

  • @shonnspencer1162
    @shonnspencer1162 1 day ago

    Guys, try to be as neutral as you can. When you work with something for years, you tend to get a comfort level that will leave out a better way of working with something. This, again, is the reason why technology slows down: you work with something and don't see a different perspective. If I've worked with Windows all my life and then I work with a Mac, I tend to hate the Mac even though my Mac gives me fewer issues than a Windows computer. Just food for thought.

  • @briankgarland
    @briankgarland 18 hours ago

    DeepSeek will always be the Temu version, because IP theft can't really innovate.

  • @oonaonoff4878
    @oonaonoff4878 1 day ago

    The opposite of life is not death, it's the machine.

  • @greymatter-TRTH
    @greymatter-TRTH 2 days ago +3

    Get out of here with this garbage. You want to tell me you were not paid to do this video? 😂😂😂

    • @torch_boy
      @torch_boy 1 day ago

      Huh, you think ChatGPT pays these guys to find prompts that are worse for R1?

  • @thewarplayer2398
    @thewarplayer2398 1 day ago

    Holy bias, Batman