The devs were very clear that they shared QwQ with us so we can see the progress, but it's just the basic mechanics of what's to come.
It is very disappointing that OpenAI doesn’t even say in a paper exactly what they are doing in o1. I am sure it is a variety of techniques, some of which are being deployed in these open models. They no longer give any credence to their argument that this is because of safety. In fact, all the effort they have put into preventing “jailbreaking” is there to keep people from seeing the raw tokens (that you pay for, btw), because those would give an idea of what they are actually doing. I’m sure there are some interesting ideas there, but this idea of siloing science for competitive reasons is so far from where (they at least said) they came from that it is pretty repugnant.
The argument that secrecy ensures AI safety is flawed. While preventing malicious use is crucial, relying solely on internal processes to identify and fix vulnerabilities is insufficient. Independent audits and robust "red teaming" offer a more effective and accountable approach to safety, allowing for external scrutiny without compromising core algorithms.
The current model of AI development creates a significant power imbalance. Users pay for access to powerful tools without understanding how they work. This lack of transparency undermines trust and accountability. In situations where AI provides inaccurate or harmful information, tracing the source of the error becomes nearly impossible, hindering rectification and accountability.
The secrecy surrounding AI development contradicts the fundamental principles of scientific progress. The "siloing" of knowledge not only limits competition but also slows down the overall advancement of the field. Furthermore, this opacity makes it difficult to identify and mitigate biases within AI models, potentially exacerbating existing societal inequalities.
The legal framework surrounding AI intellectual property is still evolving. Finding the right balance between protecting trade secrets and promoting transparency is crucial. Compliance with data privacy regulations like GDPR is essential, and a more nuanced legal approach may be needed to address the unique challenges of AI development while fostering collaboration.
A tiered approach to transparency offers a practical solution. Sharing high-level information about model architectures and training methods, while protecting specific algorithms, allows for broader participation in the development process. This fosters collaboration, accelerates innovation, and enhances the overall robustness and safety of AI systems.
Yeah, it's extremely deceptive. They are effectively training their AI to lie to us. One thing I have noticed in using QwQ and reading through the thought process is that the thoughts themselves are very important, and being able to stop and edit the raw thoughts is extremely powerful.
Open AI is disgusting, Sam Altman is a toad
Great video, the broken loop reminded me of the “There are 4 lights” Picard meme, which then made me realize the episode that is from is called “Chain of Command” 😂
Interesting that the QwQ and R1 models use similar expressions in their thought processes, like "Wait a minute, letters can be tricky, especially if there are repeating letters". I wonder why?
See 11:57 and 18:35
My guess: the approach that works is to generate multiple attempted solutions (conjectures) and then evaluate them (refutations) and pick one. Now, the first step, the generation of multiple candidates, can take place by doubting whatever you have done so far. Self-doubt is key in critical thinking.
@tantzer6113 agreed, but even so the specific expressions used are surprisingly similar. Maybe they trained on a common dataset? Or maybe this is the kind of thing you get if you use GPT-4 to generate synthetic data.
Chinese don't cock block each other I guess
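The "conjectures and refutations" guess in this thread can be illustrated with a toy sketch. This is only an illustration of the commenter's idea, not a description of how QwQ or R1 actually work; `pick_answer` and `is_refuted` are hypothetical names, and the final majority vote among surviving candidates is an assumption added here to break ties.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def pick_answer(
    candidates: Sequence[str],
    is_refuted: Callable[[str], bool],
) -> Optional[str]:
    """Drop refuted candidates, then return the most common survivor.

    candidates: several attempted answers (the "conjectures").
    is_refuted: a hypothetical checker that returns True when a
                candidate fails some test (the "refutation").
    """
    survivors = [c for c in candidates if not is_refuted(c)]
    if not survivors:
        return None  # every conjecture was refuted
    # Majority vote among the survivors (an assumption of this sketch).
    return Counter(survivors).most_common(1)[0][0]


# Toy usage: five sampled answers, with a checker that rejects "2".
samples = ["2", "3", "3", "4", "3"]
print(pick_answer(samples, lambda c: c == "2"))
```

In a real system the checker would be the hard part; here it is a stand-in lambda so the selection step can be seen on its own.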
Thanks for the video. Very timely. One thing I find very interesting about the exposed chains of thought is that they enable us to see where the reasoning might have gone wrong. Over on AI Explained, Philip has developed a set of common-sense reasoning problems that humans do much better on than any current AI models. When I tried a few of his publicly available prompts with DeepSeek, the model did not get the “right” answer, but I could see that it had decent reasons for coming up with a “wrong” answer. The exposed reasoning thus helped to reveal ambiguities and other flaws in the reasoning problems themselves.
I imagine that careful examination of those chains of thought, both by humans and by AI, will also be a very useful way to improve the reasoning ability of these models.
back in december ?
😂
Dude, we're in the future now
The time is now, old man
the space is here @@MarxOrx
What local front end are you using with Qwen coder at 16:25 ?
Sam, I can’t help myself, but it seems to work better if I dynamically prompt for exactly what I need. The first iteration determines the “reply mode/format” and the second iteration brings the reply. The agentic “flow” is way cheaper as well.
Please give an example
@@mkstowegnv I already gave an example. See the workflow: the first iteration and the second iteration.
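One possible reading of the two-iteration workflow described in this thread is sketched below. The prompt wording, the function name `two_pass_reply`, and the `ask` callable are all assumptions made for illustration; `fake_model` is a stub standing in for a real model call so the flow can be run as-is.

```python
from typing import Callable


def two_pass_reply(question: str, ask: Callable[[str], str]) -> str:
    """Run two model calls: one to choose the reply format, one to answer.

    ask: any function that sends a prompt to a model and returns text
         (hypothetical; swap in a real client call).
    """
    # First iteration: ask only for the reply mode/format.
    mode = ask(
        "In a few words, name the best reply format for this request: "
        + question
    ).strip()
    # Second iteration: ask for the actual reply in that format.
    return ask(f"Reply format: {mode}\nRequest: {question}")


def fake_model(prompt: str) -> str:
    """Stub model used only to make the sketch runnable."""
    if prompt.startswith("In a few words"):
        return "numbered list"
    return "[reply to] " + prompt


print(two_pass_reply("How do I set up a local Qwen model?", fake_model))
```

Keeping the first call short is what would make the flow cheap, since only the second call needs to produce a long answer.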
Question for you: when you're doing your strawberry test, are you using strings or string literals? A "strrawberry" might be interpreted as something that the AI should spell check and use a dictionary for, whereas a 'strrawberrrrry' might be something that it takes as a literal and attempts the task differently.
Has anyone else suspected the whole 'strawberry' thing comes from the shape of the Monte Carlo tree graph? It would be painfully on the nose if this were true... Maybe red for filtering by logical-nots/falsey values that make up the initial nodes, then green for the truthy final leaves. A lot of our own reasoning is first stating what x definitely is not, then we pick from the likely candidates that remain, if we really don't know something.
We just want to know if Qwen is still looping
Reminds me of Asimov's story about SPD-13
What are your thoughts on why Anthropic has not released a similar chain of thought model or architecture?
Are we sure they haven’t? Lately, when I’m using Claude, it will sometimes pause for a while and give a “Thinking…“ message before responding five or 10 seconds later with the answer. It might be going through a multistep chain of thought, though none of it is disclosed. I wonder if that’s why the latest version of Claude scores close to o1 on some of the reasoning benchmarks.
@@TomGally I hope for Anthropic's sake that they haven't deployed a thinking model secretly. Because that would mean it's not good.
@@TomGally I think it's plausible that they've baked in chain of thought reasoning. What they probably haven't released yet is the Monte Carlo search that consumes lots of tokens. If they did that without increasing the price of their API, they would probably lose a lot of money. That's how to reliably infer they didn't secretly deploy such a model.
@@TomGally About 5 months ago they added a hidden tag to the output. It's not actually used for chain of thought, though, but just for the model to evaluate whether it's necessary to add an artifact (code block, rich text, HTML/SVG with preview, diagram, etc.) to the response.
11:54 How many r's in "strrawberry" (typos intentional): "thought for 9 seconds"... This is progress in AI :) I so wish for tokenization to go away, then at least this would be a nonissue.
Need an agent to assign inference time per query.
Do we now get impressed when the model counts R's in stawberrrry?
Yes, it is impressive from a technical perspective because it demonstrates advancements in step-by-step reasoning capabilities in language models. Moreover, those who question its significance may lack understanding of how LLMs function and could benefit from further education on the topic ;-)
My question is: I have downloaded the open source models, but without their secret prompt the accuracy is nowhere near the OpenAI models, and not even as good as a regular Llama 3.1 70B instruct model. Can anyone tell me where the secret prompt is? Using the DeepSeek or QwQ website is not open source at all. No one knows what model is running in the backend.
“The secret” prompt is the training… you basically train it with example prompts paired with already correct answers. By giving it examples…
I think GPT 3.5 or 4 (I don’t remember) took like 100k prompt examples at initial training lol
@@geelws8880 you manually have to do the RLHF?
Cool comparison! However, there's no way I'm paying for overseas reasoning models when homegrown Open AI's superb o1 exists. Which I do pay for! And it's impressive even in its pre-release versions (mini & preview).
Joke right?
poor qwq. you broke the little guy?
The speed of open source development is promising, but the traditional accuracy benchmarks hide the importance of speed which is more critical for these inference-bound models.
Nice video highlighting the current ecosystem.
The next big thing is test time training. I wonder when those open source models will be out.
What is the point of a “strawberry”-like question if we know that an LLM doesn’t recognize letters? How is the model supposed to count those letters?
Some LLM-based systems could write and run basic Python programs to perform the logic. But by now the answer to the "count the r's in strawberry" question is likely in the training data.
Well, the chain of thought did solve it despite the challenges of tokenisation
@@jasonfilby9648 So, LLMs should be aware of their own limitations and write programs to count things instead of 'guessing'? I guess it's similar to spoken words for people: we can't really count letters unless the word is written and we can point to each letter and count.
@@cookiesInChocolate It's possible, not sure what's going on behind the scenes though.
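The kind of basic Python program mentioned in this thread might look like the sketch below. It treats the quoted word as a literal string with no autocorrect; the function names `count_letter` and `letter_positions` are made up for this example, and nothing here is claimed to be what any particular model runs behind the scenes.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of one letter in a literal string."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


def letter_positions(word: str, letter: str) -> list[int]:
    """Return 1-based positions of the letter: point to each one and count."""
    target = letter.lower()
    return [i for i, ch in enumerate(word.lower(), start=1) if ch == target]


# Both the dictionary word and the intentionally misspelled one from the video.
for w in ("strawberry", "strrawberry"):
    print(w, count_letter(w, "r"), letter_positions(w, "r"))
# strawberry 3 [3, 8, 9]
# strrawberry 4 [3, 4, 9, 10]
```

Working on characters rather than tokens is the whole point: the program sees each letter directly, which the model's tokenizer does not guarantee.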
The progress on these models is insane, people have no clue what's coming.
Am I the only one who thinks these models closely followed OpenAI releases because they are built using intel that has been lifted from OpenAI?
Look ma, no moat!
I just called the mental asylum to check on QwQ. They said it is still looking straight ahead with that distant gaze, rocking back and forth, and muttering, “I should finalize my answer, I should stop, I should go with my first result, I should end, …”
There is no such word as "strrawberry", so the correct word you were reasonably referring to has 3 'r's.
I thought that as well. In a way it's ambiguous to ask that, because typically LLMs correct mistakes. If you ask "true or false: my meighbor means someone who is next to me", it would usually take meighbor to be neighbor with a typo and then answer.
Perhaps where the model could have improved is to make it clear that it corrected a typo before answering and not just say "3". It could have said "There are 3 r's in strawberry" and then it would be clear it corrected it before answering.
Of course a better answer might be to give an answer for both cases and state those two cases clearly.
You don't want a model just saying "4" if it could have been a typo. Perhaps only if it was constrained to a one word answer.
He specified the word (literal string) in quotes, ‘strrawberry’, and did not request autocorrect.
@@jasonrhtx It corrects without looking for a request for autocorrect. I don't think that would be a reasonable way for an LLM to behave in general. When you mistype a word, unless you request autocorrect, it should just start complaining about the spelling errors in your work and not answer?
What that's good for are these trick questions that have little bearing on the LLM's practical use.
Excellent! Thanks
'Chinese AI models like QwenQ are keeping pace with American AI enterprise models'
[16:04-17:46]
Has anyone tried test time training on these new Large Reasoning models to see how much that improves them even further?
TTT is just LoRA, nothing special
@@menglilingsha no? TTT is used during runtime, LoRA during training.
You decide to break up with your AI waifu.
AI: QwQ what's this???
qwen is sure to develop a depression with that amount of overthinking
Great job 👏
The models will start to learn that strrawberry has 4 r's now.
I felt bad for the model that was stuck in the loop! 🥲I've been stuck in such loops myself and it is not fun.
Athene V2 looks good
Dude, O1 models were not released in December…… 😂 It’s much more recent
o1 (non-preview) is going to be released this month, so he's not wrong. Accidentally not wrong, but not wrong nonetheless.
Do we really need advanced models just to count the number of r's in strawberry? I just prompted this: Break down the word anticonstitutionnellement into token-like form, one letter at a time. After that, circle each letter -n- and count one for each circle. The number of circles is the number of n's in the word.
And it worked!!! What an amazing answer 🤡
People need to stop comparing advanced models with dumb prompting
LLMs cannot solve problems they have never been trained on. They give a best guess.
Interesting that they are all Chinese companies.
If innovation had a face, it’d look like AI 💫
Back in December? What?
neti neti