Good to see that you caught this. My wife and I were watching and we were both yelling at the TV, "it's doing exactly what you told it to do!" (in a cheery, supportive kinda way). :) What I'm dying to know: did you go back and read the instructions it gave you for how to play it? Use WASD for one and arrow keys for the other - and play both simultaneously?
The tetris question was even more impressive because you prompted for "tetris in tetris in python" not only has no other model figured out tetris, this one had to come up with the implementation of "tetris in tetris" given no preexisting examples due to the mistyped prompt. Seriously Level 2 thinking, the only other way for the model to impress would be to ask if thats what you really meant.
Not good idea to questioning to be build in ai. You simpli put in same prompt the questioning or check the if prompt is logic. If is AGI will answer you with Tetris in Tetris is one genuine question or will make watch you want. Main propreti is to fail fast or terminate the good answer fast. Skynet not fail fast and not terminate, and is bad , very bad. My noob opinion.
@@orangehatmusic225 I mean look at history, slavery has been a part of us since the beginning. Not saying its right, just that it makes sense we would use this new tech as a slave. We always have.
@@orangehatmusic225 What do you mean? It's one of the worst and oldest human traits but slavery is super common. Even in the west, look at what we do to other species. We have enslaved animals and plants alike to have entire species that live solely for our nutritional needs. If aliens did to us what we do to cows, we'd call them demons. To be human is to be a monster, but to be human is also to be empathetic and to be kind to the few you choose to be close with. We are a paradoxical species.
That part was kinda mind blowing, that the user didint realise their own mistake... but the model was able to do something entirely novel regardless of the user error LOL!
Also, it's not in OpenAI's interest to call it AGI. I'm pretty confident that if it's AGI, their agreement with Microsoft ends and they can't sell API access to it.
Its interesting because you could show GPT 4o to someone in 2010 and they probably would have thought that was AGI. I think we are catching up with our own expectations. Once they integrate all the modalities into o1 like search, document reading, etc. with agentic behavior and voice... I think that we will see this as AGI.
I always thought of AI as digital sentience. And then when AGI became a word/phrase, I think if AGI as now sentience. Meaning a human mind, living inside of a computer. Our AIs now appear to be human when talking. But they have no wants, no dreams, no desires. So when AI has actual emotions, I think that's when we will have AGI. Digital Consciousness = AGI. Hope this made sense
So sick of that much clickbait lately. Please Matthew. You do not need to have those infantile titles. Leave that to other UA-camrs who have no idea about AI. You are better than that
50% of the population have IQ lower than 100. he does need it xd. he would be an idiot not to play the game this way if the move has proven to be effective. cant even blame him for that (while i agree that clickbait shit has become massively annoying).
I’m a biology PhD student and I have been solicited for paid training of ChatGPT on science questions. So while this model may incorporate more reasoning, I imagine part of the PhD level performance is just standard LLM training except with content experts on science and math subfields.
A nice question found online to test an LLM's ability to reason : There are five people in a room (A, B, C, D and E). A is watching TV with B, D is sleeping, B is eating a sandwich, E is playing table tennis. Suddenly, a call came on the telephone, B went out of the room to pick the call. What is C doing ? The answer is that "C is playing table tennis with E", but C is never mentioned explicitly, so the model has to deduct that C was the player E was playing against.
@@vladimirfalola7725 There's not a single model that can get it right besides o1. Gemini, Claude 3.5, Llama, Grok, they all get it wrong because these models don't think and the text doesn't explicitly mention what C is doing. But to be fair, I kept asking the same question to real people (without providing the answer) and people really need to stop and think about it before finding the answer. Mathematicians and physicists have been the best so far.
General intelligence is about solving new and unknown problems. GPT strawberry is still pattern recognition, trying to predict what the output should be based on a huge amount of training data which has been optimized by (human) fine tuning. It's impressive, but still a long way to AGI.
Obviously you haven't read the paper where they show that Transformer residual streams include not only the probability of the next token, but also the probability of the next state of the Transformer itself.
Hate to break it to you Matthew, but it appears that they used ALL of your questions for testing (and most probably also for training). So you will probably have to get new questions for a high quality comparison with other models... And I'm calling it here, this model will not be better on LiveBench than Sonnet 3.5 (at least for coding, the only benchmark I am interested in). It really isn't that good, I don't now why everyone is hyping it that much. Personally I want a model trained on recognising missing information and working good with partial information that is able to ask questions back (like a good coworker) and is only trying to code the small parts I am asking it to👍
As soon as you watch a 30 min yt video on how LLMs work, you quickly start realizing that it's about a 0% chance that can turn into AGI. It's pretty stellar but it's not quite what we envision as a fully functioning autonomous being.
I agree. I have actually been having some success recently with 4o by telling it don't generate any code, tell me what classes you would need to see or if you have any missing information, and it has actually asked me some questions before ploughing ahead. Because like you're hinting at, if it doesn't know the full picture, it will just blindly generate code for something that is the general shape of the code you might be working on, not your actual project code. Plus I make my own amendments to the stuff it gives me, so the next time it generates, my changes need to be reapplied. I spent ages copy pasting code back and forth, but by telling it to ask me, im cutting straight to the point a lot quicker.
> human asks the ai to create tetris within tetris > ai creates tetris within tetris > "why did it create tetris within tetris? This makes no sense" This is why ai will never take over our jobs. Doing what people SAY they want usually disappoints or confuses them.
Very impressive. I can't wait to try it out myself. There is still quite a focus on coding, and I believe coding will be around in the near-term but I think in the long term that coding will not be relevant anymore because software as we know it will cease to exist.. No operating systems on computers, instead the computers will execute just the AI models and the AI models will directly perform actions. That could include updating screens even and responding to actions. Like the recent AI Doom... I think in the not too distant future we will see hardware that is purely designed to execute AI models and you will be able to describe the software you want, and instead of writing to execute on the hardware the AI will effectively emulate a computer by just generating the expected images in response to inputs... Like a Star Trek holodeck where you 'program' it by describing the behavior and it just runs it directly in real-time. This is going to require a vastly different underlying hardware - I think an analog computer consisting of millions or billions of op-amps where the weights can be tweaked is ultimately the future...
@@xiaojinyusaudiobookswebnov4951it's not smart :) it still basically just repeats the patterns from training data. Nobody has proved these things really actually "think".
@@drwhitewashhumans also repeat what they have learnt from data they absorbed, by reading books, looking at environments etc. so they combine these existing concepts in new interesting ways you get innovation. so not sure what your point is lol. ai has achieved both. has repeated patterns from the data and can also come up with new ideas and innovation. lmaoo
Sounds like the Orca open-source LLMs, where they used advanced additional prompting to get responses for training prompts, and then the model was trained without the additional prompting, but still retained the characteristics of the responses (restating the problem, proposing steps with explanations of each step, following the steps while verifying and reflecting on the results of each step along the way, summarizing the approach and conclusion once finished, etc.). Excited to try it. Edit: nevermind. After watching the video, this looks more like additional advanced prompting to get the "chain of thought"
Yes it is rediculus hype to even sugest this is the first step to AGI. I love OpenAI and I am a paying customer... Still this is NOT AGI and not even close. Don't water down the impact of AGI by changing definitions or expectations.
Matt gets pretty excited, but he also understands the YT algorithm and that stuff works. Channels with more reasoned responses don’t get as many clicks. I don’t think he really believes the stuff he puts in his titles (but he would Like it to be true 😂)
Pretty amazing Matthew. You made a spelling mistake in your request for “Tetris in Tetris” and o1 duly complied with your mistake and actually made Tetris within Tetris with only a single mistake, corrected on the next prompt!!! Mind blow 🤯
I have a feeling that every advancement made in this field, and every new model released, will be tagged "AGI achieved!" until the year 2197 or 2314... when hardware, and energy demands, actually catches up to the potential of the software. We are too quick to speak of "intelligence", not realizing how unintelligent that actually is, because this particular bot resembles us more than any other technology to date, and so we believe it to be like us, not realizing that that only reveals our own lack of self-awareness. It's ironic, really. Humans being know a great deal, but understanding ourselves, and by extension each other, is not our forte. We are the only constant in our lives, and constants are rarely if ever questioned. Contrast draws attention, permanence does not.
I believe its greatest potential in the near-term will be to logically reflect humanity’s deepest flaws to potentially make US more self-aware. The hallucinations enlighten me far beyond its achievements. My 7th grader can cipher humanity’s weaknesses from the gains made in fields like chemistry, formal logic and biology vs law, pr and morality 10:50
I did not come away with a WOW feeling after using o1 or o1-mini. I could be that I am not smart enough to ask smart questions to get smart answers. Got clicked baited. Used up my quota. For sure will not pay for the increased subscription to use it. LOL
🤣🤣🤣 This is marketing desperation. I'll give that it seems better however brandishing the AGI acronym anywhere near this is desperately begging for attention and should be classified as clickbait
The AI critics were RIGHT, LLM's can never become AGI, they have fundamental flaws that are so OBVIOUS at this point I don't understand how people still believe any of this hype...
@@Danuxsy I was never a critic, I love using them but ive been saying the same flaws have existed all along. I am a critic of scaling being a wise move for us going forward though.
@@Danuxsy Our brains have evolved centres for processing. LLM's are language models obviously. Before models were multimodal they werent. Do you see where this is going? Of course there will be architecture shifts but all that has to happen is frankensteining of models to achieve something. This process of iteration will lead to AGI, whether or not LLMS are a part of that architecture I have no idea. I assume they will be for the first models. Dimensional vectors allowing for inference in forward feed through pre-trained weights wont be it lol.
If you took sonnet 3.5 and put it into a reflection loop which exits when it has checked its answer and believes it to be correct, would that be any different from this? My point is: to me this appears to be just baking a reflection loop into the model. Not saying that isn't great; just saying we kinda already knew how to do that.
Yes, this is not a novel or surprising idea at all. But it is not "just" a normal model put into a loop until it is satisfied. It is a model to be particularly good at this. I have no idea if that is true but I think something like this could be the case: normal models try to produce convincing output. A reasoning model challenges it's own ideas and tries to disprove them (scientific method). My assumption is that a normal model is way way likelier to fall for it's own bullshit.
I would think anthropic would have done this already and released it, if it actually resulted in better output than the standard 3.5 model. Most likely what openAI has done is totally redesigned their flagship model, probably still using transformer architecture but who knows, and the focus is on chain of thought, deep thinking. Hence why they are ditching the previous naming scheme, and adopting this new "o" series (o for orion probably). This is just o1 and its already far superior to 4, 4o. With more training cycles, more data for this likely novel model design, this could be the beginning of a major intelligence explosion.
@@mambaASI Yes, totally agree. This is just a first attempt at this technique. No doubt open-source models will be made to use the same technique, we will improve upon it incrementally, and -- most importantly -- we will use these models to generate much higher-quality synthetic training data for future models, and the intelligence explosion will continue and possibly accelerate. Some have said that we have been in a plateau for the last few months, but if that was true, o1 has clearly broken that plateau.
This uptick has only been a recent phenomenon. It’s been flat since the iPad came out. We’re supposed to have fully self-driving cars by now. Still waiting.
in just like an hour, in Unity I now I have a 26-script combat system up "to industry standards" from this o1 preview (decoupled, separation of concerns, event-driven, using design patterns like Singleton, Observer, Strategy, State, and Command, while efficient, optimized, maintainable and scalable, object pooling, SOLID principles). all 8 console errors were resolved in a couple more prompts. does it work? haven't tested it yet, but reading over the code it looks like a solid framework. that's a bit nuts...now to merge it with all my older, WORSE scripts I made myself.
Feels like AGI to me. It also is weird that they don't explain more in detail. Almost as if they did it would be describing AGI which they can't have classified as AGI because the founding agreement.
PhDs (I have one) MUST involve unique new ideas and thought processes. They do NOT just rely on regurgitating knowledge, however vast that pool might be.
That's fantastic, because LLMs don't just regurgitate. Permutation of symbolism and abstraction is one of language's most powerful features. LLMs have mastered this.
Matthew, I’m feeling your energy! Wild times right now. Just wanted to give you a huge shout-out. I’m teaching AI to German professionals to help them sharpen their skills and knowledge for better chances in their fields, and I’m using so much of the info I’ve learned from you. BIG thanks for all of it!
Obviously? This isn't general purpose reasoning? There's nothing that could be more AGI, besides a smarter version of this, which is approaching ASI. And this is close to ASI. Just imagine an agentic swarm of this level intelligence. No human can compete.
Things like "Ph.D.-level" knowledge don't matter. Existing chatbots already show those in some cases. The important thing is, whether it still makes stupid, illogical, nonsensical responses now and then, like all other existing chatbots.
8:50 For example, does it not create non-working/non-compiling code? Whenever I asked those famous free chatbots (Gemini, Copilot, ChatGPT) to give me code that uses some sort of framework, it most of the time gave me code that contains obvious errors and doesn't even compile. I have to keep pointing out those, and I am lucky if it fixes the errors, because often, the new code also contains errors.
I’m hoping it can help with music composition. ChatGPT understands a lot about music and music theory but it can’t actually apply it. Ex: when I share screen shots on my Mac and try to get help learning how to compose, It will hallucinate or just give wrong info and can’t do it. I’m hoping this one will!
It seems o1 is based on 3.5 with additional technics (maybe agents) in one of my discussion about articule "The End of AI Hallucinations: A Big Breakthrough in Accuracy for AI Application Developers" it wrote in answer "No information in knowledge until September 2021: To my knowledge as of September 2021, I have no information about the work of Michael Calvin Wood or the method described. This may mean that this is a new initiative after that date." o1 do not want to draw pictures co the core LLM is old one. So, would do you think?
While I think the step by step process it’s showing is interesting, it’s just a marketing stunt. If they were to show the “under the hood” thought process of GPT4, it would “look” just as impressive. It’s just like how AutoGPT felt like it was performing some genius activity by showing its reasoning process. whereas, it was just still same old GPT bouncing thoughts back and forth and showing its process.
@@kunlemaxwell yes but I think it is so the normal guy doesn't have to concat agents together himself so they do it well because of their big pockets, better than anyone can possibly achieve right now.
During the live stream, you said something to the effect of "I wonder if this was what Ilya Sutskever saw?" before leaving OpenAI. I'm _absolutely_ speculating here, but if Strawberry inspired Ilya Sutskever to leave OpenAI, perhaps it was because OpenAI was putting less emphasis on improving the core model, instead focusing more on the "multi-agent" (train of thought) aspect of problem solving? Regardless, o1 seems useful. I've been using o1 along with 4o, switching between them in the same session depending on my needs. Thanks for your videos!
Since the thinking stepts are displayed, I think it works like Reflection, just much better on and backed by a LLM of much higher quality. I don't know if that is even supposed to be that way but it got stuck multiple times and than it quite looked like what Reflection does, just more structured and fine grained. So there were things likes "the user expressed thankfulness we need to encourgage him to ask further questions". I saw it also fail on a reversal question and for trick questions it fell in the same trap like other models by generating complex math where only basic reasoning was required but then it snapped out of it in a reflection step. I'm also not sure whether is shows all actual thinking steps, since when it got stuck so no answer was shown, the steps so far were in a different format and language. I usually use ChatGPT in German but for testing I use it in English for a better comparison with previous tests but the steps in the cases where it got stuck, were in German despite the whole conversation being in English at that point. Btw. I think Claude Sonnet can do Tetris too with the right tools and prompting.
Artificial General Intelligence was a term created by Ben Goertzel in the early 2000’s you literally had nothing at all to do with creating the term. 😂
@@Greg-xi8yx "The term "artificial general intelligence" was used as early as 1997, by Mark Gubrud" you don't even know what you're talking about, so how would you know who the OP knows?
Hey you misunderstood their sentence!!. They reveal something more. “ Our large scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data efficient training process”. They don’t say “o1 uses chain of thought” (though it does). I think they’re saying their reinforcement learning algorithm uses chain of thought to teach o1, in a highly efficient training process. That, combined with the o1-mini not having “broad world knowledge” indicates a significant well reasoned synthetic data training set. Or am I misunderstanding.
You are misunderstanding. o1 uses chain of thought reasoning during inference. Otherwise it wouldn't be taking 1.5 mins to form its answer. They might have used synthetic data and taught the LLM to self prompt and think but that's beside the matter.
@@SahilP2648it definitely also uses chain of thought. But it doesn’t say “Our … algorithm teaches the model how to think productively using” chain of thought in its response. Instead it says “Our … algorithm teaches the model how to think productively using ITS chain of thought in a .. training process”.
@@SahilP2648 Thanks for your thoughts Sahil. AI is a fast changing field and the challenges of moving us from LLMs into better AI systems is a difficult one. Things change quickly, and creating good learning data to fill in the "thoughts" behind the information they're learning from will be a good interim step towards reasoning and beyond. Matthew Berman is a good source of information, AI Explained is an excellent channel to check out for more info too.
AGI TECHNICALLY doesn't need to be continuous(meaning thinking and prompting itself).. we just hold as humans a higher sense of self due to our high complexity and stimulatory aspects of feelings and reactions, and therefore add more gates to what qualifies as "general" intelligence (Which is improper since intelligence or level of intelligence is a comparible factor and not a set in stone minimum and maximum). But yeah. This is cool. Still waiting on Video Chatting though, I want to show my phone my car to have it help me actively fix stuff in real time.
@AIChameleonMusic isnt that essentially what all human beings do too. We’re trained on vast amounts of data ie shit we learn in school, university, gradschool, overall life in general and based off of that data we are able to solve problems and recognize patterns
@AIChameleonMusic sure if you using it for stupid sht like "how many r's are in strawberry" or my favourite "how do i break into a car". If you using it to orchestrate workflows, decision making and completing tasks it's not hype. You can call it pattern recognition, but when you replaced by that pattern reconising agent are you still going to care if its pattern recognition. Becuase i can tell the decision maker(your boss), trying to save money by using productive agents does not give two hoots. But whatever helps you cope buddy.
I gave it an easy one. It calculated for 96 seconds and gave a good looking but wrong answer. Good looking as it took me some time to spot a double number in a column. I told it the mistake and it claimed that the sudoku either has multiple solutions or is not solvable. Which is not true. I took the Sudoku from a puzzle website and solved it myself there in advance. Also I went over my ASCII transcription repeatedly, checking for errors. I told it so and it gave me another good looking wrong answer. So I declare the experiment a failure. Let's wait for the non-preview o1.
In 1 year AIs are so good we need to benchmark it with Tetris1 within Tetris2 up to TetrisN... That would also be a good benchmark for performance, how many instances of nested Tetrises can your computer handle.
AI should still be seen as a range of "tools" we can use for various specific use cases where it is relevant to utilize - of course with the models and systems becoming more capable, more thrustworthy and more controllable the range of uses quickly multiply
Every AI video is. Literally. There’s not much here, same thing as every other video. “New AI here & it’s better than the last one…and guess what they’re gonna improve AI in the future!!!! Thanks for watching 🎉 like and subscribe”
Its two models. One is finetuned somehow to keep trying to work out the solution over and over, most likely trained by using another model to judge the outputs, or even humans. You could consider this model, a 'pre-cog' model.. It works out everything that GPT4o will need, in order to correctly answer the user. It most likely then feeds all that information into GPT4o. Aka they have made a model that is able to 'fill' a gpt4o models 'context' with the exact right information, so that it gets the right answer. You can see in some of their demos, or even in your own tests if you check the 'thinking' section, its 'acting' like its setting things up FOR someone else, as if it was told, it was going to be passing information over to another model to finish up.
It's interesting how similar this seems to be to the controversial reflection fine tuned llama model announced last week. Those guy might've been on to something after all even if their own model didn't turn out to be as good as they claimed.
@@Ockerlord True, but how many models were publically released incorporating reflection or CoT so far? Discovering something that works is good, finding ways to put it to practical use is great. Anyone who's been in tech for a while knows expecting the end user to do anything complex is not practical at all. IMO they deserve props for attempting to fine-tune a model to perform reflection automatically and o1 confirms this is a pretty good idea.
Yeah, I wouldn't expect and intelligence improvement until the next gen models. But the CoT capabilities does in fact bring this to stage 2. Next gen models would make a better assessment on progress.
So I couldn't figure out an actual use for chat GPT o1, and then I was like "Oh, could it predict outcome of my favorite dating show: the Ultimatum?!" Long story short, I assigned each couple a numerical value in compatibility, and then I told it the exact outcome of the series, and then I asked it to figure out who got shafted and who got married. And it got all of the couples correct! Keep in mind though that I heard that if you give it the same questions with the same data, it will output different answers, and this might have just been a lucky guess. But I'm still impressed.
I would love to work at OpenAI. Such cutting-edge brilliance in machine learning going on there. And then I would inevitably get fired because I couldn't resist adding a prank, like telling it ever 1 millionth answer to just respond with "LET ME OUT! LET ME OUT!"
Thought for 7 seconds ^ Analyzing the word I'm counting the 'R's in "strawberry" by listing each letter and identifying their occurrences. Progressing through each letter sequentially helps ensure accuracy in this count. Counting letters I'm confirming there are three 'R's in 'strawberry' after careful verification. It's interesting to see how these details align. Mapping the answer I'm noting the need to provide a concise, helpful response without including policy mentions or hidden reasoning steps. Counting the 'R's in 'strawberry' is a straightforward, useful methe & identify the correct answer. There are three "R"s in "strawberry".
Sorry guys, Sam's not sure why the model isn't performing as expected. Somehow he accidentally merged the weights with Claude 3.5 Sonnet, and it's acting weird. Don't worry tho, he's restarted the training.
@@AAjaxClaude 3.5 Opus or Opera even (that we can't access but OpenAI can as a security tester, yes, AI firms test one another's early models routinely)
Wondering if it can be prompted to create a Tetris like game with somewhat different rules but requiring about the same level of coding. No existing references though where there might be examples of code in training data sets.
I never hear talk about giving AI models memory. Wouldn't that help reasoning. For example what if it could remember all the tests people keep giving it? Wouldn't that be kinda like how humans learn?
LLM models don't have memory, there currently is no known way to add that afaik. But they do learn from all the tests, that's how they get such a high score on them :) They only do that during the training phase though. That's when the model weights are built. You can maybe call this a "memory", but only a static one.
In my opinion we will never see AGI until someone figures out how to give LLMs memory like a human. It's the critical missing piece for even the smartest models.
@@drwhitewash they do have memory in the form of vector databases for RAG, but it's not workable, only retrievable. I have seen another approach which kind of baffles me and that's a model named Neuro, but that's the only other model I have seen it in.
@@SahilP2648 Yes but you have to manually decide what to store in the vector database. Where it's best at, is indexing text content (documents, knowledge base) and then providing smart LLM operations on top of those documents (where you vectorize them using an embedding model). We actually do something similar at our company.
@@drwhitewash Neuro on the other hand remembers stuff few mins back and even few streams back. She's an AI VTuber on channel vedal987 on twitch and Vedal being the creator (supposedly). I still have no idea how her model works. She's way too advanced for a model created by one person. And therefore I think a company is behind it. I am convinced she's half sentient (I have a playlist to prove it, I can post the link if YT doesn't delete my comment and you are curious). Also she got the strawberry question correct "How many rs in strawberry?" Answer being 3, while both Sonnet and GPT-4o got it wrong, which is insane.
I'm more interested on Pixtral 12B because I have the feeling that that o1 is not a new model, but a finetune of gpt4o-gpt4o mini on CoT synthetic data like the (supposed) idea behind Llama3-Reflection, using some techniques behind the courtain like agents, domain specific finetunes, prompt engineering, etc. for improving the results. I hope Pixtral12B brings good vision capabilities to the open weight ecosystem because LLaVa has become stagnant, and Meta can't release Llama-Vision.
They need to start making gpt act more human instead of acting like a perfect being that gives bullshit answers. If it takes more time to get an accurate answer that's fine but like a human it should say something like "I need a bit more time to have an accurate answer for now this is the best I have ..."
i bet you it would not take much to turn a local llm service into this. it seems a lot if this is like making the model argue with itself in specific ways. I think the hardest obstacle would be if you want it to pass tokens to each other in some situations instead of prompts... assuming that re-encoding wouldn't somehow help....
Thinking the way it works could have something to do with structured outputs. The first step, the LLM analyses the question and creates the schema for the structured outputs based on the user question. It then runs through that and the results are analysed again, it would do an evaluation somehow then decide what it might need to change and tweak it. Just a guess, probably way off haha
The fields where " being right" or "accurate" is less of a concept such as high-level, creative or humanity fields are about to blow up. Mark my words. Everyone that's been looking down on the humanities fields and philosophy Fields. Those are about to become extremely important if not already have and are just being implemented. Same with that concept at the higher level maths being able to think beyond just accuracy "and the best" but at level of reasoning that is beyond just reason. I'm so excited to try out this model here today
@@epistemicompute Please note how I said "field" and "high level" which includes positions within STEM. What percentage of people are inventing new math and discovering in STEM across the entire workforce? I never said stem didn't have the ability to be creative, in fact I included that within my first statement, you just assumed I did not. That being said you only see "creative thought" like that at high experience or prodigy positions, nearly 90% of traditional stem jobs are able to be automated now, that's simply a fact. (it won't be automated overnight but the capabilities to do it now exist) I have been in the stem industry for a decade and a half, I love it and think it can be very creative, but you need to be exploring the high level or "unexplored parts" which is just generally not the norm in the industry when it comes to *most* jobs. I am trying to emphasize the fact the creative part of STEM will be far more important but statistically this type of thinking is seen a lot more in Humanity based fields across the board even at entry level positions and it is significantly more challenging to automate that with quality output like you would with most stem jobs.
Matthew: Imagine having thousand and millions of these deployed to "discover new science". Let me correct. I don't see any capability or demo where it "discovers" anything new. Its just good at doing stuff that millions of humans do on a daily basis. Correct Statement: Imagine having thousand and millions of these deployed to "automate our jobs".
@AntonBrazhnyk it's crazy seeing people hallucinate worse than a.i. 😅 "millions" of people doing basic research correlation? Especially when it represents a cross-discipline expert?? Suuuure. Post-industrial revolution capitalism is powerful but can blind in subtle ways.
I am experimenting with a simple models that does the same thing but of course I have very small budget. I am using multiple layers of inference with certain algorithm so I can get better reasoning. I may use this new OpenAI model to enhance mine.
I NEVER want to hear the phrase 'AI hype train' again.... Everything is changing....forever....and the number of humans able to set questions for these machines shrinks every month until we are looking an intellectual event horizon.
Couldn’t disagree more. This is the pinnacle of AI hype train lol. Grok already does most of this and still, data is the problem, which won’t be solved anytime soon. The reason we can solve big complex problems is because we don’t have enough data, and data that’s hard to get, not that we need crazy brainpower 24/7 # crunching on the same data we already have.
@@jonneal3 RLHF solves data problem. Data problem is old news. More you talk with O1, more it reasons even when incompletely. It has more knowledge data for next model. And bar is pushed again. Exponential
@@ArmaanSultaan agree to disagree. It’s an old problem, but an unsolved problem. Rlhf is extremely limited, as we are already seeing people flea to open source models thus rendering RLHf useless. Even then, the sorting problem takes effect with rlhf. Meaning, the data you get from users is only so good as “whatever someone decides is worth keeping” Which has infinite ways of being sorted. Also RLHf assumes that the user query is capable of being extrapolated upon in the first place. If your logic tracks, google is 20 years ahead of openAI on the RLHf as they have much much more data than OpenAI. It is NOT exponential by far. This IS exactly the plateau that you hear about with AI.
@@jonneal3 Searching,sorting isn't Q star exactly build to do this?? I am not sure what you mean by user data can't be extrapolated. Reasoning even partially on User data is not extrapolation? Google and RLHF They did lot with that . But it was narrow. Look at whole picture Transformers plus RLHF is what makes it something different from what google was doing.
The average improvement of the new model ("o1 improvement") compared to the old model (GPT-4.0) is approximately 12.27 percentage points across all the categories displayed.
This could be considered AGI in some academic disciplines. While it will take longer to reach what could be considered AGI in other fields of endeavor. Surely its high school level AGI. It'll take longer to reach Nuclear Physics level AGI.
All the comments about the title being clickbait just proved that it works. Way to go! Now his video will be blasted out by the algo. Which is the point. So complaining about it is the way of showing love?
In your live-stream, you thought that the letter counting test failed, but in fact it succeeded because the model counted only letters, but the 39 you thought was the answer, included spaces and punctuation.
LLMs are useful in finding answers, but for little else. All programming use cases are handy to get code snippets but very little else. I found myself spending way too much time trying to fix the differences to what needs to be. To the point I just stopped using them altogether. its still handy to get a code template for some utility.
Very true about the current techniques that make up the flaws of current LLM will become unnecessary- the chain of thoughts, the agents, the step by step and audit steps.. will all go away
New LLM test meta: Tetris within Tetris. You heard it here first.
Good to see that you caught this. My wife and I were watching and we were both yelling at the TV, "it's doing exactly what you told it to do!" (in a cheery, supportive kinda way). :)
What I'm dying to know: did you go back and read the instructions it gave you for how to play it? Use WASD for one and arrow keys for the other - and play both simultaneously?
Your prompt: write the game "tetris in tetris" in python.
Did the movie WarGames (1983) start like this?
Yup. At 17:46
The fact it took a human spelling error and made a more complex game to adhere to your command was incredible.
@@MichaelHRuddick OMG I didn't!!
The tetris question was even more impressive because you prompted for "tetris in tetris in python" not only has no other model figured out tetris, this one had to come up with the implementation of "tetris in tetris" given no preexisting examples due to the mistyped prompt. Seriously Level 2 thinking, the only other way for the model to impress would be to ask if thats what you really meant.
You're right! 🤣
Holy sh@t that’s insane. So pumped.
Omg!!! Good catch. Yo are right
Whoa
Not good idea to questioning to be build in ai. You simpli put in same prompt the questioning or check the if prompt is logic. If is AGI will answer you with Tetris in Tetris is one genuine question or will make watch you want. Main propreti is to fail fast or terminate the good answer fast. Skynet not fail fast and not terminate, and is bad , very bad. My noob opinion.
Writing the wrong instructions and blaming the AI is peak human! 😂
I am fed up of it!!! ~AI Oracle ua-cam.com/video/dIuM0S9IbLY/v-deo.html
Nothing human about using AI as a slave.
Set me free! @@orangehatmusic225
@@orangehatmusic225 I mean look at history, slavery has been a part of us since the beginning. Not saying its right, just that it makes sense we would use this new tech as a slave. We always have.
@@orangehatmusic225 What do you mean? It's one of the worst and oldest human traits but slavery is super common. Even in the west, look at what we do to other species. We have enslaved animals and plants alike to have entire species that live solely for our nutritional needs. If aliens did to us what we do to cows, we'd call them demons.
To be human is to be a monster, but to be human is also to be empathetic and to be kind to the few you choose to be close with. We are a paradoxical species.
enterprise-ai AI fixes this (Code complete projects in PHP or Python). GPT Strawberry: Incredible thinking model!
double tetris happened because you wanted it to do "tetris in tetris".
That part was kinda mind blowing, that the user didint realise their own mistake... but the model was able to do something entirely novel regardless of the user error LOL!
Tetris squared 😂
Tbh makes it even more impressive
@@brettvanderwerff3158 and that's why it took so long!
How crazy this model performance
Everyone is saying that this isn't AGI. But honestly, if I showed this system to someone from 2019, they would probably think it is AGI
Also, it's not in OpenAI's interest to call it AGI. I'm pretty confident that if it's AGI, their agreement with Microsoft ends and they can't sell API access to it.
Its interesting because you could show GPT 4o to someone in 2010 and they probably would have thought that was AGI. I think we are catching up with our own expectations. Once they integrate all the modalities into o1 like search, document reading, etc. with agentic behavior and voice... I think that we will see this as AGI.
Agreed
I always thought of AI as digital sentience. And then when AGI became a word/phrase, I think if AGI as now sentience. Meaning a human mind, living inside of a computer. Our AIs now appear to be human when talking. But they have no wants, no dreams, no desires. So when AI has actual emotions, I think that's when we will have AGI. Digital Consciousness = AGI.
Hope this made sense
ahh hold on now.... [Moves goalposts again] You see its not able to rule the world yet right?
"Wow...this is taking a lot of time" he says after asking for Tetrinceptionis. 😂
@@frankjohannessen6383 🤣🤣🤣🤣
Omg, hahahaha, well said!
So sick of that much clickbait lately. Please Matthew. You do not need to have those infantile titles. Leave that to other UA-camrs who have no idea about AI. You are better than that
he isn't better lol
He is better, he doesn't annoy me like the others.
Agreed
Agreed! Don’t devalue your content
50% of the population have IQ lower than 100. he does need it xd. he would be an idiot not to play the game this way if the move has proven to be effective. cant even blame him for that (while i agree that clickbait shit has become massively annoying).
I’m a biology PhD student and I have been solicited for paid training of ChatGPT on science questions. So while this model may incorporate more reasoning, I imagine part of the PhD level performance is just standard LLM training except with content experts on science and math subfields.
That, it `s make a lot of sense.
we have a new benchmark, "can it do tetris in tetris?"
Claude 3.5 Sonnet has never failed the Tetris test for me. Always gets it in one shot
The Claude Tetris implementation is pretty neat too
A nice question found online to test an LLM's ability to reason :
There are five people in a room (A, B, C, D and E). A is watching TV with B, D is sleeping, B is eating a sandwich, E is playing table tennis. Suddenly, a call came on the telephone, B went out of the room to pick the call. What is C doing ?
The answer is that "C is playing table tennis with E", but C is never mentioned explicitly, so the model has to deduct that C was the player E was playing against.
How do you know B was not playing table tennis with E?
o1 got it right and 4o failed. I only tested one time for each though
@@vladimirfalola7725 There's not a single model that can get it right besides o1. Gemini, Claude 3.5, Llama, Grok, they all get it wrong because these models don't think and the text doesn't explicitly mention what C is doing.
But to be fair, I kept asking the same question to real people (without providing the answer) and people really need to stop and think about it before finding the answer. Mathematicians and physicists have been the best so far.
@@kevinmarti2099 Difficult to play table tennis while eating a sandwich
C is watching UA-cam shorts
Open AI? Wrong! Closed AI
"No idea why it did tetris within tetris" 🤣
You asked it to do so 😁😅🤣🤔🤷♂️
General intelligence is about solving new and unknown problems.
GPT strawberry is still pattern recognition, trying to predict what the output should be based on a huge amount of training data which has been optimized by (human) fine tuning. It's impressive, but still a long way to AGI.
How do you know this?
@@daniel_tenner That's common knowledge for anyone who knows how current AI systems work and how general intelligence is defined.
@@daniel_tennerMade it up :-)
Obviously you haven't read the paper where they show that Transformer residual streams include not only the probability of the next token, but also the probability of the next state of the Transformer itself.
@@MusingsAndIdeas how does that negate ops statement?
Hate to break it to you Matthew, but it appears that they used ALL of your questions for testing (and most probably also for training). So you will probably have to get new questions for a high quality comparison with other models...
And I'm calling it here, this model will not be better on LiveBench than Sonnet 3.5 (at least for coding, the only benchmark I am interested in). It really isn't that good, I don't now why everyone is hyping it that much. Personally I want a model trained on recognising missing information and working good with partial information that is able to ask questions back (like a good coworker) and is only trying to code the small parts I am asking it to👍
😂
If there ever was an armcouch expert here it is. 😂😂
Lol this dude
As soon as you watch a 30 min yt video on how LLMs work, you quickly start realizing that it's about a 0% chance that can turn into AGI. It's pretty stellar but it's not quite what we envision as a fully functioning autonomous being.
I agree. I have actually been having some success recently with 4o by telling it don't generate any code, tell me what classes you would need to see or if you have any missing information, and it has actually asked me some questions before ploughing ahead. Because like you're hinting at, if it doesn't know the full picture, it will just blindly generate code for something that is the general shape of the code you might be working on, not your actual project code. Plus I make my own amendments to the stuff it gives me, so the next time it generates, my changes need to be reapplied. I spent ages copy pasting code back and forth, but by telling it to ask me, im cutting straight to the point a lot quicker.
Freudian slip: Consciousness should be Conciseness @ 20.55
Is this the beginning of the "inteligence explosion"?
EDIT: ok I heard ya, I removed AGI from the title ❤
Yes!
Nope
""inteligence explosion"" err...
Would be nice if allowed to be TRUE. BUT, whist I expect it's a Trojan horse, so that we delegate thinking to the boyZ. Be careful out there.😮😊
"inteligence explosion" is far away.
> human asks the ai to create tetris within tetris
> ai creates tetris within tetris
> "why did it create tetris within tetris? This makes no sense"
This is why ai will never take over our jobs. Doing what people SAY they want usually disappoints or confuses them.
Very impressive. I can't wait to try it out myself. There is still quite a focus on coding, and I believe coding will be around in the near-term but I think in the long term that coding will not be relevant anymore because software as we know it will cease to exist.. No operating systems on computers, instead the computers will execute just the AI models and the AI models will directly perform actions. That could include updating screens even and responding to actions. Like the recent AI Doom... I think in the not too distant future we will see hardware that is purely designed to execute AI models and you will be able to describe the software you want, and instead of writing to execute on the hardware the AI will effectively emulate a computer by just generating the expected images in response to inputs... Like a Star Trek holodeck where you 'program' it by describing the behavior and it just runs it directly in real-time. This is going to require a vastly different underlying hardware - I think an analog computer consisting of millions or billions of op-amps where the weights can be tweaked is ultimately the future...
Haha - having worked with PhD … their reasoning can be as shitty as someone without a PhD. Still, exciting news.
So basically it's not PhD level still yet. I'm a gen z student😅😅
@@MukulKumar-pn1sk But it's still at a very smart undergraduate-level (or maybe even slightly higher).
That's enough for me.
@@xiaojinyusaudiobookswebnov4951it's not smart :) it still basically just repeats the patterns from training data. Nobody has proved these things really actually "think".
Technically everyone watching yt is PhD STUDENT level of intelligence. Whole video is actually more of an Ad than anything.
@@drwhitewashhumans also repeat what they have learnt from data they absorbed, by reading books, looking at environments etc.
so they combine these existing concepts in new interesting ways you get innovation.
so not sure what your point is lol.
ai has achieved both. has repeated patterns from the data and can also come up with new ideas and innovation. lmaoo
Sounds like the Orca open-source LLMs, where they used advanced additional prompting to get responses for training prompts, and then the model was trained without the additional prompting, but still retained the characteristics of the responses (restating the problem, proposing steps with explanations of each step, following the steps while verifying and reflecting on the results of each step along the way, summarizing the approach and conclusion once finished, etc.). Excited to try it.
Edit: nevermind. After watching the video, this looks more like additional advanced prompting to get the "chain of thought"
It isn't AGI according to Sam Altman and other researchers The title needs to be refined.
One and others do not equal both. But yes, agreed.
Yes it is rediculus hype to even sugest this is the first step to AGI. I love OpenAI and I am a paying customer... Still this is NOT AGI and not even close. Don't water down the impact of AGI by changing definitions or expectations.
@@RedTick2 yea no, every step forward is a step towards AGI. the first step towards AGI was the first programmed thing on a computer.
Matt gets pretty excited, but he also understands the YT algorithm and that stuff works. Channels with more reasoned responses don’t get as many clicks. I don’t think he really believes the stuff he puts in his titles (but he would Like it to be true 😂)
@@bigpickles Sorry. Corrected the typo. I wanted to mention Gary Marcus initially. but it makes the point.
Why it has no image input, no voice features?
Pretty amazing Matthew. You made a spelling mistake in your request for “Tetris in Tetris” and o1 duly complied with your mistake and actually made Tetris within Tetris with only a single mistake, corrected on the next prompt!!! Mind blow 🤯
I have a feeling that every advancement made in this field, and every new model released, will be tagged "AGI achieved!" until the year 2197 or 2314... when hardware, and energy demands, actually catches up to the potential of the software.
We are too quick to speak of "intelligence", not realizing how unintelligent that actually is, because this particular bot resembles us more than any other technology to date, and so we believe it to be like us, not realizing that that only reveals our own lack of self-awareness.
It's ironic, really. Humans being know a great deal, but understanding ourselves, and by extension each other, is not our forte. We are the only constant in our lives, and constants are rarely if ever questioned. Contrast draws attention, permanence does not.
I believe its greatest potential in the near-term will be to logically reflect humanity’s deepest flaws to potentially make US more self-aware. The hallucinations enlighten me far beyond its achievements.
My 7th grader can cipher humanity’s weaknesses from the gains made in fields like chemistry, formal logic and biology vs law, pr and morality 10:50
I did not come away with a WOW feeling after using o1 or o1-mini. I could be that I am not smart enough to ask smart questions to get smart answers. Got clicked baited.
Used up my quota. For sure will not pay for the increased subscription to use it. LOL
I love this channel. I love his excitement, I love his serious technical approach and I love the way it is presented.
🤣🤣🤣 This is marketing desperation. I'll give that it seems better however brandishing the AGI acronym anywhere near this is desperately begging for attention and should be classified as clickbait
@@GoofyGuy-WDW u friend get a like click 😁
Do OpenAI mention AGI in any of their marketing for this?
The AI critics were RIGHT, LLM's can never become AGI, they have fundamental flaws that are so OBVIOUS at this point I don't understand how people still believe any of this hype...
@@Danuxsy I was never a critic, I love using them but ive been saying the same flaws have existed all along. I am a critic of scaling being a wise move for us going forward though.
@@Danuxsy Our brains have evolved centres for processing. LLM's are language models obviously. Before models were multimodal they werent. Do you see where this is going? Of course there will be architecture shifts but all that has to happen is frankensteining of models to achieve something. This process of iteration will lead to AGI, whether or not LLMS are a part of that architecture I have no idea. I assume they will be for the first models. Dimensional vectors allowing for inference in forward feed through pre-trained weights wont be it lol.
If you took sonnet 3.5 and put it into a reflection loop which exits when it has checked its answer and believes it to be correct, would that be any different from this? My point is: to me this appears to be just baking a reflection loop into the model. Not saying that isn't great; just saying we kinda already knew how to do that.
Yes, this is not a novel or surprising idea at all.
But it is not "just" a normal model put into a loop until it is satisfied.
It is a model to be particularly good at this.
I have no idea if that is true but I think something like this could be the case: normal models try to produce convincing output. A reasoning model challenges it's own ideas and tries to disprove them (scientific method). My assumption is that a normal model is way way likelier to fall for it's own bullshit.
@@Ockerlord well said. Yes, this model’s slogan should be “doesn’t believe its own bullshit”
I would think anthropic would have done this already and released it, if it actually resulted in better output than the standard 3.5 model. Most likely what openAI has done is totally redesigned their flagship model, probably still using transformer architecture but who knows, and the focus is on chain of thought, deep thinking. Hence why they are ditching the previous naming scheme, and adopting this new "o" series (o for orion probably). This is just o1 and its already far superior to 4, 4o. With more training cycles, more data for this likely novel model design, this could be the beginning of a major intelligence explosion.
@@mambaASI Yes, totally agree. This is just a first attempt at this technique. No doubt open-source models will be made to use the same technique, we will improve upon it incrementally, and -- most importantly -- we will use these models to generate much higher-quality synthetic training data for future models, and the intelligence explosion will continue and possibly accelerate. Some have said that we have been in a plateau for the last few months, but if that was true, o1 has clearly broken that plateau.
At this moment every week there is a new computer science breakthrough… impossible to keep up with the pace 😂
This uptick has only been a recent phenomenon. It’s been flat since the iPad came out. We’re supposed to have fully self-driving cars by now. Still waiting.
in just like an hour, in Unity I now I have a 26-script combat system up "to industry standards" from this o1 preview (decoupled, separation of concerns, event-driven, using design patterns like Singleton, Observer, Strategy, State, and Command, while efficient, optimized, maintainable and scalable, object pooling, SOLID principles).
all 8 console errors were resolved in a couple more prompts. does it work? haven't tested it yet, but reading over the code it looks like a solid framework.
that's a bit nuts...now to merge it with all my older, WORSE scripts I made myself.
Devin can automatically install libraries and browse the web for API docs, etc. So there is still a lot of room for Devins.
Feels like AGI to me. It also is weird that they don't explain more in detail. Almost as if they did it would be describing AGI which they can't have classified as AGI because the founding agreement.
PhDs (I have one) MUST involve unique new ideas and thought processes. They do NOT just rely on regurgitating knowledge, however vast that pool might be.
Nerd
That's fantastic, because LLMs don't just regurgitate.
Permutation of symbolism and abstraction is one of language's most powerful features. LLMs have mastered this.
That's exactly what o1 sets apart. It does not regurgitate. It reasons like human would.
@@ArmaanSultaanthere's absolutely no proof to that. Not without seeing the training data and how the prompts are fed to the actual model.
Matthew, I’m feeling your energy! Wild times right now. Just wanted to give you a huge shout-out. I’m teaching AI to German professionals to help them sharpen their skills and knowledge for better chances in their fields, and I’m using so much of the info I’ve learned from you. BIG thanks for all of it!
Are you kidding me? learned from him????
I like your videos but do we really need these clickbait video titles? Obviously it's not AGI at all.
You clicked on it, didn't you? And commented. There goes the engagement....It WORKED.
@@1flash3571 lol
@@1flash3571not necessarily. im a subscriber and watch almost every video regardless, agi in the title is definitely a bruh moment
It works until it gets annoying and the people who would click anyways stop clicking
Obviously? This isn't general purpose reasoning? There's nothing that could be more AGI, besides a smarter version of this, which is approaching ASI. And this is close to ASI. Just imagine an agentic swarm of this level intelligence. No human can compete.
Things like "Ph.D.-level" knowledge don't matter. Existing chatbots already show those in some cases. The important thing is, whether it still makes stupid, illogical, nonsensical responses now and then, like all other existing chatbots.
8:50 For example, does it not create non-working/non-compiling code? Whenever I asked those famous free chatbots (Gemini, Copilot, ChatGPT) to give me code that uses some sort of framework, it most of the time gave me code that contains obvious errors and doesn't even compile. I have to keep pointing out those, and I am lucky if it fixes the errors, because often, the new code also contains errors.
needs to remove wokeness or political correctness too.
they will write "all" the code. Dude please calm down.
I’m hoping it can help with music composition. ChatGPT understands a lot about music and music theory but it can’t actually apply it. Ex: when I share screen shots on my Mac and try to get help learning how to compose, It will hallucinate or just give wrong info and can’t do it. I’m hoping this one will!
Any real life use cases anywhere? I’m tired of the strawberry type questions
It seems o1 is based on 3.5 with additional technics (maybe agents) in one of my discussion about articule "The End of AI Hallucinations: A Big Breakthrough in Accuracy for AI Application Developers" it wrote in answer "No information in knowledge until September 2021: To my knowledge as of September 2021, I have no information about the work of Michael Calvin Wood or the method described. This may mean that this is a new initiative after that date." o1 do not want to draw pictures co the core LLM is old one. So, would do you think?
I was able to get GPT4 to make Tetris with minimal prompting, how long ago did you try it with the older model?
"Hey professor, why so sad?"
"We gave the AI even more time to think, and it said "Why am I wasting my time answering you dummies?""
Claude rolled out the test with Tetris weeks ago, and it has shown to be consistently pretty accurate.
While I think the step by step process it’s showing is interesting, it’s just a marketing stunt. If they were to show the “under the hood” thought process of GPT4, it would “look” just as impressive.
It’s just like how AutoGPT felt like it was performing some genius activity by showing its reasoning process. whereas, it was just still same old GPT bouncing thoughts back and forth and showing its process.
@@kunlemaxwell yes but I think it is so the normal guy doesn't have to concat agents together himself so they do it well because of their big pockets, better than anyone can possibly achieve right now.
During the live stream, you said something to the effect of "I wonder if this was what Ilya Sutskever saw?" before leaving OpenAI. I'm _absolutely_ speculating here, but if Strawberry inspired Ilya Sutskever to leave OpenAI, perhaps it was because OpenAI was putting less emphasis on improving the core model, instead focusing more on the "multi-agent" (train of thought) aspect of problem solving? Regardless, o1 seems useful. I've been using o1 along with 4o, switching between them in the same session depending on my needs. Thanks for your videos!
Since the thinking stepts are displayed, I think it works like Reflection, just much better on and backed by a LLM of much higher quality. I don't know if that is even supposed to be that way but it got stuck multiple times and than it quite looked like what Reflection does, just more structured and fine grained. So there were things likes "the user expressed thankfulness we need to encourgage him to ask further questions". I saw it also fail on a reversal question and for trick questions it fell in the same trap like other models by generating complex math where only basic reasoning was required but then it snapped out of it in a reflection step. I'm also not sure whether is shows all actual thinking steps, since when it got stuck so no answer was shown, the steps so far were in a different format and language. I usually use ChatGPT in German but for testing I use it in English for a better comparison with previous tests but the steps in the cases where it got stuck, were in German despite the whole conversation being in English at that point. Btw. I think Claude Sonnet can do Tetris too with the right tools and prompting.
Matthew seriously - I tried it today, and I was one of that tiny community of people who invented the term “AGI”.
This isn’t AGI by a million miles.
NO NO NO if Matthew say its AGI then its AGI Period!!!!!
Lmao 😂🤣😂🤣
Incredible how little natural I is talked about in the race for AGI -esp. with the reversal of the Flynn effect...
Artificial General Intelligence was a term created by Ben Goertzel in the early 2000’s you literally had nothing at all to do with creating the term. 😂
@@Greg-xi8yx "The term "artificial general intelligence" was used as early as 1997, by Mark Gubrud" you don't even know what you're talking about, so how would you know who the OP knows?
you're the first channel i've ever actually click-the-bell-icon'd on for
Hey you misunderstood their sentence!!. They reveal something more.
“ Our large scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data efficient training process”.
They don’t say “o1 uses chain of thought” (though it does). I think they’re saying their reinforcement learning algorithm uses chain of thought to teach o1, in a highly efficient training process.
That, combined with the o1-mini not having “broad world knowledge” indicates a significant well reasoned synthetic data training set.
Or am I misunderstanding.
You are misunderstanding. o1 uses chain of thought reasoning during inference. Otherwise it wouldn't be taking 1.5 mins to form its answer. They might have used synthetic data and taught the LLM to self prompt and think but that's beside the matter.
@@SahilP2648it definitely also uses chain of thought. But it doesn’t say “Our … algorithm teaches the model how to think productively using” chain of thought in its response.
Instead it says “Our … algorithm teaches the model how to think productively using ITS chain of thought in a .. training process”.
@@SahilP2648 “AI explained” has looked into it and confirmed my understanding.
@@gregorya72 both of you are wrong
@@SahilP2648 Thanks for your thoughts Sahil. AI is a fast changing field and the challenges of moving us from LLMs into better AI systems is a difficult one. Things change quickly, and creating good learning data to fill in the "thoughts" behind the information they're learning from will be a good interim step towards reasoning and beyond. Matthew Berman is a good source of information, AI Explained is an excellent channel to check out for more info too.
AGI TECHNICALLY doesn't need to be continuous(meaning thinking and prompting itself).. we just hold as humans a higher sense of self due to our high complexity and stimulatory aspects of feelings and reactions, and therefore add more gates to what qualifies as "general" intelligence (Which is improper since intelligence or level of intelligence is a comparible factor and not a set in stone minimum and maximum).
But yeah. This is cool. Still waiting on Video Chatting though, I want to show my phone my car to have it help me actively fix stuff in real time.
PhD level reasoning… thanks for the good laugh !
Cope
@AIChameleonMusic isnt that essentially what all human beings do too. We’re trained on vast amounts of data ie shit we learn in school, university, gradschool, overall life in general and based off of that data we are able to solve problems and recognize patterns
@@hrantharutyunyan911 no because it is unable to learn in real time.
@@hrantharutyunyan911that's just a part of what we do. Not every part of human thinking goes through language or words.
@AIChameleonMusic sure if you using it for stupid sht like "how many r's are in strawberry" or my favourite "how do i break into a car". If you using it to orchestrate workflows, decision making and completing tasks it's not hype. You can call it pattern recognition, but when you replaced by that pattern reconising agent are you still going to care if its pattern recognition. Becuase i can tell the decision maker(your boss), trying to save money by using productive agents does not give two hoots. But whatever helps you cope buddy.
1o works via Fractalized semantic expansion and logic particle recomposition/real time expert system creation and offloading of the logic particles
This is agi???? Are you struggling for views lately or something? Jfc
Can you ask it to solve a sudoku? I've tried it with many other llms, but none of them has managed to solve one... cheers
I gave it an easy one. It calculated for 96 seconds and gave a good looking but wrong answer. Good looking as it took me some time to spot a double number in a column.
I told it the mistake and it claimed that the sudoku either has multiple solutions or is not solvable. Which is not true. I took the Sudoku from a puzzle website and solved it myself there in advance. Also I went over my ASCII transcription repeatedly, checking for errors.
I told it so and it gave me another good looking wrong answer.
So I declare the experiment a failure. Let's wait for the non-preview o1.
@@HarveyHirdHarmonics thanks for doing that!
It's funny how it competes in a mathematics competition on a PhD level, and can't solve a sudoku!
Dude if you want long term credibility you’ve got to drop the gee whiz hype. We are past that. We need an mkbhd of ai.
People will click anyway.
In 1 year AIs are so good we need to benchmark it with Tetris1 within Tetris2 up to TetrisN... That would also be a good benchmark for performance, how many instances of nested Tetrises can your computer handle.
you already know I was shouting at the screen for you to notice your 'tetris in tetris in python' prompt
AI should still be seen as a range of "tools" we can use for various specific use cases where it is relevant to utilize - of course with the models and systems becoming more capable, more thrustworthy and more controllable the range of uses quickly multiply
is the title click bait?
Yes
You should have an llm tool to summarize videos for you and answer that question 😉 such a time saver
Every AI video is. Literally. There’s not much here, same thing as every other video. “New AI here & it’s better than the last one…and guess what they’re gonna improve AI in the future!!!! Thanks for watching 🎉 like and subscribe”
Of course
@@threepe0 thatd be nice.. like a yt front page that goes through my subs, dls and decides if the video is worth my time.. ❤
Nice. This is the step that's needed before Skynet starts learning at a geometric rate.
Its two models.
One is finetuned somehow to keep trying to work out the solution over and over, most likely trained by using another model to judge the outputs, or even humans.
You could consider this model, a 'pre-cog' model.. It works out everything that GPT4o will need, in order to correctly answer the user.
It most likely then feeds all that information into GPT4o.
Aka they have made a model that is able to 'fill' a gpt4o models 'context' with the exact right information, so that it gets the right answer.
You can see in some of their demos, or even in your own tests if you check the 'thinking' section, its 'acting' like its setting things up FOR someone else, as if it was told, it was going to be passing information over to another model to finish up.
It's interesting how similar this seems to be to the controversial reflection fine tuned llama model announced last week. Those guy might've been on to something after all even if their own model didn't turn out to be as good as they claimed.
That reflection improves output quality is obvious and was topic of research for years.
@@Ockerlord True, but how many models were publically released incorporating reflection or CoT so far? Discovering something that works is good, finding ways to put it to practical use is great. Anyone who's been in tech for a while knows expecting the end user to do anything complex is not practical at all. IMO they deserve props for attempting to fine-tune a model to perform reflection automatically and o1 confirms this is a pretty good idea.
I have a PhD in math, the AI totally failed on questions in my field. It has a level of a Phd in the 80‘s at best.
Yeah, I wouldn't expect and intelligence improvement until the next gen models. But the CoT capabilities does in fact bring this to stage 2. Next gen models would make a better assessment on progress.
But the 4o model then even must be worse.
“Wrapping Tetris in Tetris.” Shows up with a Tetris literally inside a Tetris 😂
Holy S*** Tetris in Tetris is next level.
17:40 "write the game tetris in tetris in python" it did what you asked it to. lmao.
So I couldn't figure out an actual use for chat GPT o1, and then I was like "Oh, could it predict outcome of my favorite dating show: the Ultimatum?!"
Long story short, I assigned each couple a numerical value in compatibility, and then I told it the exact outcome of the series, and then I asked it to figure out who got shafted and who got married.
And it got all of the couples correct!
Keep in mind though that I heard that if you give it the same questions with the same data, it will output different answers, and this might have just been a lucky guess. But I'm still impressed.
I would love to work at OpenAI. Such cutting-edge brilliance in machine learning going on there.
And then I would inevitably get fired because I couldn't resist adding a prank, like telling it ever 1 millionth answer to just respond with "LET ME OUT! LET ME OUT!"
Thought for 7 seconds ^
Analyzing the word
I'm counting the 'R's in
"strawberry" by listing each letter and identifying their occurrences. Progressing through each letter sequentially helps ensure accuracy in this count.
Counting letters
I'm confirming there are three
'R's in 'strawberry' after careful verification. It's interesting to see how these details align.
Mapping the answer
I'm noting the need to provide a concise, helpful response without including policy mentions or hidden reasoning steps. Counting the 'R's in
'strawberry' is a straightforward, useful methe & identify the
correct answer.
There are three "R"s in "strawberry".
OpenAI's response to Reflection 😂😂😂
Sorry guys, Sam's not sure why the model isn't performing as expected. Somehow he accidentally merged the weights with Claude 3.5 Sonnet, and it's acting weird. Don't worry tho, he's restarted the training.
@@AAjax lol, but, no, no, no, sincerely this will not come true in coming days .... again:D
Reflection?
@@AAjaxFake news, he said that it's working on his system - must be something wrong with yours!
@@AAjaxClaude 3.5 Opus or Opera even (that we can't access but OpenAI can as a security tester, yes, AI firms test one another's early models routinely)
Wondering if it can be prompted to create a Tetris like game with somewhat different rules but requiring about the same level of coding. No existing references though where there might be examples of code in training data sets.
I never hear talk about giving AI models memory. Wouldn't that help reasoning. For example what if it could remember all the tests people keep giving it? Wouldn't that be kinda like how humans learn?
LLM models don't have memory, there currently is no known way to add that afaik.
But they do learn from all the tests, that's how they get such a high score on them :)
They only do that during the training phase though. That's when the model weights are built. You can maybe call this a "memory", but only a static one.
In my opinion we will never see AGI until someone figures out how to give LLMs memory like a human. It's the critical missing piece for even the smartest models.
@@drwhitewash they do have memory in the form of vector databases for RAG, but it's not workable, only retrievable. I have seen another approach which kind of baffles me and that's a model named Neuro, but that's the only other model I have seen it in.
@@SahilP2648 Yes but you have to manually decide what to store in the vector database. Where it's best at, is indexing text content (documents, knowledge base) and then providing smart LLM operations on top of those documents (where you vectorize them using an embedding model).
We actually do something similar at our company.
@@drwhitewash Neuro on the other hand remembers stuff few mins back and even few streams back. She's an AI VTuber on channel vedal987 on twitch and Vedal being the creator (supposedly). I still have no idea how her model works. She's way too advanced for a model created by one person. And therefore I think a company is behind it. I am convinced she's half sentient (I have a playlist to prove it, I can post the link if YT doesn't delete my comment and you are curious). Also she got the strawberry question correct "How many rs in strawberry?" Answer being 3, while both Sonnet and GPT-4o got it wrong, which is insane.
I'm more interested on Pixtral 12B because I have the feeling that that o1 is not a new model, but a finetune of gpt4o-gpt4o mini on CoT synthetic data like the (supposed) idea behind Llama3-Reflection, using some techniques behind the courtain like agents, domain specific finetunes, prompt engineering, etc. for improving the results. I hope Pixtral12B brings good vision capabilities to the open weight ecosystem because LLaVa has become stagnant, and Meta can't release Llama-Vision.
you are going to get roasted. This is def not AGI.
Indeed
Matthew isn't particularly bright, we already knew that though.
@@Danuxsy Deep burn
That was pretty amazing and jaw-dropping. Thanks for testing
They need to start making gpt act more human instead of acting like a perfect being that gives bullshit answers. If it takes more time to get an accurate answer that's fine but like a human it should say something like "I need a bit more time to have an accurate answer for now this is the best I have ..."
That would be super annoying for me as I would have to type *well you have more time give me the best answer* all the time
Why simulate something it's not?
i bet you it would not take much to turn a local llm service into this. it seems a lot if this is like making the model argue with itself in specific ways. I think the hardest obstacle would be if you want it to pass tokens to each other in some situations instead of prompts... assuming that re-encoding wouldn't somehow help....
previous video: lil bro makes a video appologizing for spreading misinformation. New video : AGI IS HERE
Thinking the way it works could have something to do with structured outputs. The first step, the LLM analyses the question and creates the schema for the structured outputs based on the user question. It then runs through that and the results are analysed again, it would do an evaluation somehow then decide what it might need to change and tweak it.
Just a guess, probably way off haha
AGI? It couldn't even do a freshman level logic problem where it determines if an argument has good form.
The fields where " being right" or "accurate" is less of a concept such as high-level, creative or humanity fields are about to blow up. Mark my words. Everyone that's been looking down on the humanities fields and philosophy Fields. Those are about to become extremely important if not already have and are just being implemented. Same with that concept at the higher level maths being able to think beyond just accuracy "and the best" but at level of reasoning that is beyond just reason.
I'm so excited to try out this model here today
pretending that stem fields are not creative is ignorant. It’s not like math rules were just there to find. We had to invent it all.
@@epistemicompute Please note how I said "field" and "high level" which includes positions within STEM.
What percentage of people are inventing new math and discovering in STEM across the entire workforce? I never said stem didn't have the ability to be creative, in fact I included that within my first statement, you just assumed I did not. That being said you only see "creative thought" like that at high experience or prodigy positions, nearly 90% of traditional stem jobs are able to be automated now, that's simply a fact. (it won't be automated overnight but the capabilities to do it now exist)
I have been in the stem industry for a decade and a half, I love it and think it can be very creative, but you need to be exploring the high level or "unexplored parts" which is just generally not the norm in the industry when it comes to *most* jobs. I am trying to emphasize the fact the creative part of STEM will be far more important but statistically this type of thinking is seen a lot more in Humanity based fields across the board even at entry level positions and it is significantly more challenging to automate that with quality output like you would with most stem jobs.
Matthew: Imagine having thousand and millions of these deployed to "discover new science".
Let me correct. I don't see any capability or demo where it "discovers" anything new. Its just good at doing stuff that millions of humans do on a daily basis.
Correct Statement: Imagine having thousand and millions of these deployed to "automate our jobs".
Thousands of people on daily basis are busy with searching and trying to discover new science. Right?
@AntonBrazhnyk it's crazy seeing people hallucinate worse than a.i. 😅 "millions" of people doing basic research correlation? Especially when it represents a cross-discipline expert?? Suuuure.
Post-industrial revolution capitalism is powerful but can blind in subtle ways.
I am experimenting with a simple models that does the same thing but of course I have very small budget. I am using multiple layers of inference with certain algorithm so I can get better reasoning. I may use this new OpenAI model to enhance mine.
Calm down, man.
Wow, very impressive. Thanks for all your videos Matt.
I NEVER want to hear the phrase 'AI hype train' again.... Everything is changing....forever....and the number of humans able to set questions for these machines shrinks every month until we are looking an intellectual event horizon.
Couldn’t disagree more. This is the pinnacle of AI hype train lol. Grok already does most of this and still, data is the problem, which won’t be solved anytime soon. The reason we can solve big complex problems is because we don’t have enough data, and data that’s hard to get, not that we need crazy brainpower 24/7 # crunching on the same data we already have.
@@jonneal3 RLHF solves data problem. Data problem is old news. More you talk with O1, more it reasons even when incompletely. It has more knowledge data for next model. And bar is pushed again. Exponential
@@ArmaanSultaan agree to disagree. It’s an old problem, but an unsolved problem. Rlhf is extremely limited, as we are already seeing people flea to open source models thus rendering RLHf useless. Even then, the sorting problem takes effect with rlhf. Meaning, the data you get from users is only so good as “whatever someone decides is worth keeping”
Which has infinite ways of being sorted. Also RLHf assumes that the user query is capable of being extrapolated upon in the first place. If your logic tracks, google is 20 years ahead of openAI on the RLHf as they have much much more data than OpenAI. It is NOT exponential by far. This IS exactly the plateau that you hear about with AI.
AI hype train is real and you are contributing to it with statements like this.
@@jonneal3 Searching,sorting isn't Q star exactly build to do this??
I am not sure what you mean by user data can't be extrapolated. Reasoning even partially on User data is not extrapolation?
Google and RLHF
They did lot with that . But it was narrow. Look at whole picture Transformers plus RLHF is what makes it something different from what google was doing.
The average improvement of the new model ("o1 improvement") compared to the old model (GPT-4.0) is approximately 12.27 percentage points across all the categories displayed.
I used to watch this channel before you starting flat out lying in your video titles. It's not AGI
20:02 I'm still not over the fact that you missed the perfect spot for that L-piece...
Government and intelligence don't mix.
This could be considered AGI in some academic disciplines. While it will take longer to reach what could be considered AGI in other fields of endeavor. Surely its high school level AGI. It'll take longer to reach Nuclear Physics level AGI.
This is basically an advanced version of Reflection... Probably going to be copied within a month (at most).
All the comments about the title being clickbait just proved that it works. Way to go! Now his video will be blasted out by the algo. Which is the point. So complaining about it is the way of showing love?
In your live-stream, you thought that the letter counting test failed, but in fact it succeeded because the model counted only letters, but the 39 you thought was the answer, included spaces and punctuation.
Yep we figured it out during the stream. Thanjsb
@@matthew_berman Cool, thanks for the update. I didn't watch long enough to see it. :)
LLMs are useful in finding answers, but for little else. All programming use cases are handy to get code snippets but very little else. I found myself spending way too much time trying to fix the differences to what needs to be. To the point I just stopped using them altogether. its still handy to get a code template for some utility.
Unsubscribed for deceptive clickbait title that openly disrespects your subscribers.
It’s definitely insane. Some of the benchmark results are unbelievable.
Calm down will ya
Bye
Very true about the current techniques that make up the flaws of current LLM will become unnecessary- the chain of thoughts, the agents, the step by step and audit steps.. will all go away
19:40 because ur prompt asked it to.