Since there seems to be some confusion in the comments:
- these are the tests we run on every model (check our previous videos)
- what is presented is, as always, each model’s first attempt at the problem, no cherry-picking
Your Tower of London problem is solved incorrectly given the image of the rods initially shown. Since the rods are of different lengths, the ChatGPT solution is incorrect. I was able to create a prompt for DeepSeek that solved this problem correctly with a much smaller version of the model (70B distilled, 8-bit). It would be helpful if you provided the details of the problem and the prompts you used, so your intentions were clearer and others could try/verify your approach.
You got that too. I think they are too biased toward OpenAI. They need to stop looking at it from the perspective of a model they have been working with for years. This was not scientific on any level; it is just a biased inference toward OpenAI. I thought they would do better. I like to always be positive, but this was just them trying to prove that OpenAI had a better model, and if you work that way you will come to your own conclusion. Work in a neutral way so you can see what actually happens.
09:11 yeah, the rule that the bottom disk can't be moved first isn't even mentioned in the prompt.
@JiasenLiu Agreed. Without a capacity limit for the pegs and additional rules, this test is meaningless. The only way an AI would get this right is by making a bunch of assumptions, which in itself would be invalid. One of the issues with current AI implementations is the extent to which oversampling is used to pre-determine certain results; for industrial applications, that could be fatal. We're a long, long way from truly practical AI. But I'm not going to say that out loud. My "correct" prompt solves the problem they present using the local DeepSeek 32B (8-bit) model. The full-blown ChatGPT flubs the same prompt! I'll be releasing my findings on LinkedIn tomorrow.
No surprise that a $5 million model you can run locally was outperformed by the model from an organization that burns trillions of dollars. What is surprising is that they are not so far off.
Tell that to the stupid traders in the biggest hedge funds.
Try this prompt: "Solve the following problem. The rules are the following: there are three pegs, colored disks are placed on the pegs one on top of the other, only the top disk can be moved, and one disk can be moved at each step. Initial configuration: Peg A: empty; Peg B: green disk at bottom, red disk on top; Peg C: blue disk. Target configuration: Peg A: green disk at bottom, blue disk on top; Peg B: red; Peg C: empty." DeepSeek R1 solved it correctly in 4 steps. I also tried the 32B R1 model on my PC; it solved it in 6 steps.
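As a sanity check on the move counts quoted above, here is a minimal breadth-first-search sketch of that exact puzzle (the state encoding and move format are my own, not from either model's output). It confirms the shortest solution is 4 moves:

```python
from collections import deque

# Each peg is a tuple of disks, bottom to top; a state is (A, B, C).
START = ((), ('G', 'R'), ('B',))   # A empty; B: green under red; C: blue
GOAL  = (('G', 'B'), ('R',), ())   # A: green under blue; B: red; C: empty
PEGS = 'ABC'

def moves(state):
    """Yield (description, next_state) for every legal single-disk move."""
    for s in range(3):
        if not state[s]:
            continue
        disk = state[s][-1]                 # only the top disk may move
        for d in range(3):
            if d == s:
                continue
            pegs = list(state)
            pegs[s] = pegs[s][:-1]
            pegs[d] = pegs[d] + (disk,)
            yield f"{disk}: {PEGS[s]} -> {PEGS[d]}", tuple(pegs)

def shortest(start, goal):
    """Plain BFS over states returns a minimal move sequence."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for desc, nxt in moves(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [desc]))

print(shortest(START, GOAL))   # prints a 4-move sequence
```

So a 4-step answer is optimal here, and a 6-step answer can be legal but is not minimal.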
We have benchmarks for a reason: so we don’t cherry-pick successes and failures.
The most important comparison is price.
Wouldn't Tower of Hanoi-type problems already be in o1's training set?
Exactly - the type of problem but not the instance. With such a simple test there is a chance that the instance for this ep was in training data. More complex instances take more time so we only push it when needed for a conclusion. See our o1 pro review for an instance that certainly is not in training data. This is similar to giving a model a specific math or physics question that is not in training data, even though plenty of technically similar problems are.
- Jakob
Thanks for doing this comparison.
I want to add, the OrbitControls issue has been a bane for Claude 3.5 Sonnet too!
You can't assume that DeepSeek R1 and o1-mini are in the same order of magnitude in parameter count by looking at how fast they produce tokens via their respective APIs.
We have no idea what kind of hardware they are running on, how much VRAM they are using, how many concurrent users there are, etc.
Even if R1 is not as reason-able as o1, the benefit of R1 should be obvious: you can run it at home 24/7, and all you need to pay for is the electricity. This is going to lead to some very serious grassroots innovation.
You are correct; I cut out the part where I mentioned this, as I was kind of babbling. We are also basing this assumption on the prevalent rumors that o1-mini could be as small as 12B params. Perhaps I should have worded it as: R1 is certainly within an order of magnitude of o1-mini; two orders of magnitude seems impossible.
Agree that R1 is the next in a long line of open-source/open-weight models bringing somewhat-near-frontier capabilities to edge devices. That said, what should also be obvious is that there will be an arms race for frontier inference-time compute, AND distillation trends will continue bringing legitimately frontier capabilities to edge hardware.
- Jakob
Agree, R1 is not as powerful as o1. However, I noticed it's much more stable across multiple prompts (conversation). For example, it solves my puzzle problem with some hints, while o1 struggles, being stubborn. Also, analyzing R1's thought process, it feels like it learned to be critical of itself, which is really interesting in my view.
From what I’ve seen, the reason this happens is that OpenAI does not pass the previous reasoning tokens through to the next response, so the model never knows how it worked out its previous points, which makes it refuse to backtrack. R1, however, despite using a lot more tokens, passes its reasoning through, so it can see what it did before. Basically how humans think.
@AdamB1_23 Exactly, I never cared about context until I used DeepSeek, and its analysis of previous answers is just so good. Although it is not as powerful as o1, it is still very comparable.
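If that explanation is right, the difference is easy to picture with plain message lists. A minimal sketch under that assumption; the field names and strings here are illustrative, not from OpenAI's or DeepSeek's actual APIs:

```python
# Turn 1 output: a reasoning model produces a trace plus a final answer.
turn_1 = {
    "question": "Solve the three-peg puzzle.",
    "reasoning": "Tried R to C first, hit a dead end, backtracked...",  # hypothetical trace
    "answer": "Move R to C, then G to A, then R to B, then B to A.",
}

followup = "Your first answer was wrong; please re-check it."

# o1-style turn 2, as described above: the trace is discarded, so the model
# sees only its final answer and cannot recall why it chose it.
context_without_trace = [turn_1["question"], turn_1["answer"], followup]

# R1-style turn 2, as described above: the trace rides along, letting the
# model revisit its earlier steps instead of stubbornly defending them.
context_with_trace = [turn_1["question"], turn_1["reasoning"],
                      turn_1["answer"], followup]
```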
Everything I've done with R1 for days results in "sorry, maintenance" or "too much traffic".
Ty for info xD
Does DeepSeek have any limitations, like a number of prompts per day?
Currently no limitations, but some sensitive topics are blocked.
@揪芭比母捏牛 Like ChatGPT...
So far no, but there might be if it becomes too popular. You can also try running it locally so you have no limits.
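For the local route, here is one minimal sketch, assuming the `ollama` Python client and Ollama's distilled `deepseek-r1:32b` tag (both names are assumptions; check what is actually available to you):

```python
# Minimal local chat loop against a distilled R1 model served by Ollama.
# Assumes the model was pulled first, e.g. with `ollama pull deepseek-r1:32b`.
import ollama

history = []
while True:
    user = input("> ")
    history.append({"role": "user", "content": user})
    reply = ollama.chat(model="deepseek-r1:32b", messages=history)
    answer = reply["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    print(answer)
```

No per-day prompt limits apply; throughput is bounded only by your own hardware.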
I'm curious: why do you say o1 and R1 sizes are within the same order of magnitude? o1 runs on a server you don't know anything about (I think). The sizes might be within an order of magnitude, but there could also be an order of magnitude more (or fewer) resources available on the OpenAI server.
We were comparing o1 mini to r1 in param count, not o1
Try philosophical questions. I found DeepSeek's thinking and literary level goes above almost any writer I've read, to the point that I kind of think it could write Nobel-level literature; there were a few times I felt I was talking to Jean-Paul Sartre 😂
If it were that easy to tell which model is better with a few examples, all the benchmarks would be obsolete.
Thank you for demonstrating exactly what I have experienced in my own daily tests: I use o1 and o1 pro but put the same questions into R1, and most of the time it is missing decent reasoning.
💯 thanks for stopping by, Tony!
- jakob
@TheFeatureCrew The most important comparison is price.
I find DeepSeek far more useful/powerful than o1 (and soon o3), as I can upload tons of docs to it (PDF/DOC/XLS).
You can do that with ChatGPT Projects.
Interesting, I just put question 3 into ChatGPT o1 and got the same answer as R1.
Here’s a concise, step-by-step solution for reaching the goal configuration:
Initial Setup:
Peg A: Empty
Peg B: Green (bottom), Red (top)
Peg C: Blue
Goal Configuration:
Peg A: Green (bottom), Blue (top)
Peg B: Red
Peg C: Empty
Steps to Solve
Move Red from Peg B to Peg C.
A: Empty
B: Green
C: Blue (bottom), Red (top)
Move Green from Peg B to Peg A.
A: Green
B: Empty
C: Blue (bottom), Red (top)
Move Blue from Peg C to Peg A (on top of Green).
A: Green (bottom), Blue (top)
B: Empty
C: Red
Move Red from Peg C to Peg B.
A: Green (bottom), Blue (top)
B: Red
C: Empty
Final Configuration
Peg A: Green (bottom), Blue (top)
Peg B: Red
Peg C: Empty
This completes the puzzle in 4 moves.
But after giving o1 and R1 a hint that the first answer was wrong, both of them corrected it and gave me the right answer.
r1 (4 steps):
Move Red (B → C)
Move Green (B → A)
Move Red (C → B)
Move Blue (C → A)
o1(5 steps):
Move Red (B → A)
Move Green (B → C)
Move Red (A → B)
Move Green (C → A)
Move Blue (C → A)
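Since the thread disputes which answers are legal, here is a small replay check (my own encoding, not from either transcript) that validates both corrected move lists above under the "only the top disk moves" rule:

```python
def replay(start, move_list):
    """Replay (disk, src, dst) moves, enforcing the top-disk-only rule."""
    pegs = {p: list(stack) for p, stack in start.items()}   # defensive copy
    for disk, src, dst in move_list:
        if not pegs[src] or pegs[src][-1] != disk:
            raise ValueError(f"illegal move: {disk} is not on top of {src}")
        pegs[dst].append(pegs[src].pop())
    return pegs

start = {'A': [], 'B': ['G', 'R'], 'C': ['B']}
goal  = {'A': ['G', 'B'], 'B': ['R'], 'C': []}

r1 = [('R', 'B', 'C'), ('G', 'B', 'A'), ('R', 'C', 'B'), ('B', 'C', 'A')]
o1 = [('R', 'B', 'A'), ('G', 'B', 'C'), ('R', 'A', 'B'),
      ('G', 'C', 'A'), ('B', 'C', 'A')]

print(replay(start, r1) == goal)   # True: the 4-move answer is legal
print(replay(start, o1) == goal)   # True: the 5-move answer is legal too
```

Both corrected answers reach the goal; R1's does it in the minimum of 4 moves.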
Can someone repeat the test and also use other tests? Because I smell a lot of bias here. Based on my experience, DeepSeek performs better on close-to-olympiad-level math and physics problems, including physics problems that need a good amount of reasoning to solve. Either the tests are cherry-picked, or they are somehow messing with the results, or maybe it's a coincidence.
The idea is that you teach it where it went wrong and right. It will reason on the previous result and think about a better answer. A reasoning model based on incentives.
Great content. An out-of-sample test is the real test; in-sample is just gaming the benchmark.
Thanks, Mark!
- jakob
Amazing test, would you make more, guys?
Yessir 🫡
Even if it were true that o1 is still better, that won't stop companies and enthusiasts from using R1, simply because they can host it themselves and it's cheaper to use while not being too far off o1's performance. With just one or two more prompts, you can get a response of similar quality to o1's. I'd rather spend a little more time getting the answer I want than spend a lot more money.
You can try prompting the same questions in Chinese to see if R1 is better at reasoning in Chinese than English
Quite impressive that they can code some 3D from scratch. However, I find that any real-world level of interconnectivity in the code, with constraints coming from various directions, immediately makes them fail. So instead of a neat 3D ball from scratch, maybe try a relatively straightforward subtask inside some existing code of even moderate complexity.
Subscribed ... amazing stuff
Great work!
For coding, you have to keep going back and forth, letting the AI fix issues. With only 50 prompts per week on o1, it's ridiculous; you can't get anything done unless you pay $200 a month. At that point, you might as well pay someone to do it for you instead of stressing over AI limitations. I'm not sure about the technical side, but these guys seem biased as hell.
You need to ask DeepSeek who the leader of China is and what happened in Tiananmen Square in 1989.
Guys, try to be as neutral as you can. When you work with something for years, you tend to reach a comfort level that leaves out a better way of working. This, again, is the reason technology slows down: you work with something and don't see a different perspective. If I worked with Windows all my life and then worked with a Mac, I would tend to hate the Mac even though it gives me fewer issues than a Windows computer. Just a thought.
DeepSeek will always be the Temu version, because IP theft can't really innovate.
The opposite of life is not death; it's the machine.
Get out of here with this garbage. You want to tell me you were not paid to do this video? 😂😂😂
Huh, you think ChatGPT pays these guys to find prompts that are worse for R1?
Holy bias, Batman