0:36 The Z key is in the wrong place on the typewriter. Also, the mechanism where the mechanical key strikes and the paper roller bar are too close together. Hands/fingers are still messed up.
For a couple years now, I've said there are three main obstacles between current GenAI and human-level GenAI: multimodality, size, and continual learning. The size of models, I expect, will continue to grow, especially as NVIDIA pumps out better hardware for them. Continual learning is tough on these massive models, but if I understand correctly, Google's "Infini-attention" paper introduces something very similar to -- if not an actual form of -- continual learning for massive Transformers. And as we see here, multimodality in the token space does *amazing* things for the capabilities of these models, and we're getting them, one new modality at a time. At this rate, I suspect we'll have all these three issues more or less solved within the next two or so years, and after that it's just about scale to hit human-level AGI. As culty as it sounds, I do, in fact, feel the AGI. (RIP to Ilya's tenure at OpenAI, by the way.)
The first major flaw I was able to spot: while GPT-4o can read long transcripts in a split second, it still fails to associate fragments with respective timestamps correctly.
In my tests it is good at summarizing and adapting text style. But it totally failed to reason about what it was writing, and about itself, in many ways. GPT-3.5 turned out to be better, or at the same level, in that respect. It might have more functionality, but it is not "more", sadly.
@@Uthael_Kileanea I did. 3.5 corrects itself properly. 4o kept rewriting the text instead of giving its interpretation of the text until I explicitly told it not to. And even then, it failed to do the right thing. Good for essay writing. Bad for a more interesting chat with it.
@@Uthael_Kileanea The transcript in question was a standard SRT (subtitle) file. When GPT-4o failed to provide the correct timestamp for a random quotation, I asked it to provide the turn index number instead - which should be easier because it's incremental. It failed that too.
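For context, this is roughly what an SRT file looks like and why the "turn index" really is strictly incremental; the sample cues and the parsing helper below are illustrative, not the actual file from that test:

```python
# Minimal illustration of the SRT structure referred to above: each cue has
# an incrementing index, a timestamp range, and the text. Generic sketch only.
import re

SRT_SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Hello and welcome back.

2
00:00:03,600 --> 00:00:06,200
Today we're looking at GPT-4o.
"""

def parse_srt(text):
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        index, timing, *lines = block.splitlines()
        start, end = timing.split(" --> ")
        cues.append({"index": int(index), "start": start, "end": end,
                     "text": " ".join(lines)})
    return cues

# A model that really "read" the file should be able to map a quoted line
# back to either its timestamp or its incremental index.
for cue in parse_srt(SRT_SAMPLE):
    print(cue["index"], cue["start"], cue["text"])
```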
26:34 So, GPT-4o wouldn’t be able to analyze a video, break it down, and create timestamps? I’m assuming it also won’t be able to find specific information that I ask it to find within a video. I hoped it would be useful for research purposes.
I know right, the first time we visited this planet everyone was dinosaurs. So we interfaced with you as dragons. Then there were people in togas playing lyras so we interfaced as Greek gods. Now there are little men everywhere playing video games, so we interact with you as AI deep fakes.
Thanks a lot for bringing into focus all these additional features that OpenAI chose to underplay during its demonstration session. The realisation that this could be an altogether new kind of LLM with far more advanced multimodal capabilities is a bit unsettling.
In the movie "Her", all you had to do was put your smart phone in your pocket with the camera sticking out and Samantha the AI could see everything you did. She talked to you through a wireless earpiece.
Just a guess, but the fact that it is so fast and responsive would imply to me that it is actually smaller and LESS computationally expensive than former models, yet performs better. Could be due to some combination of better training data, algorithmic breakthroughs, etc.
After trying Mixtral 8x7b and Mixtral 8x22b, which run at about the same speed as Llama 3-8b and Llama 70b, I'd guess that it uses a mixture-of-experts approach that allows most of the calculations for any query to run within the 80GB limit of a single H100 GPU, though a different query would run on a different H100 GPU. Maybe I'm wrong, and it's the same server rather than the same GPU, or a pair of GPUs, but some sort of sharding/mixture-of-experts approach. They probably also overtrained it like they did with Llama 3. Plus various other tricks, such as improving the embeddings, though I'm not sure this would make it faster/cheaper... this is my best guess.
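For anyone unfamiliar with the term, here is a minimal sketch of top-k mixture-of-experts routing in PyTorch; the sizes are made up and this says nothing about OpenAI's actual architecture, it just shows why only a fraction of the weights do work for any given token:

```python
# Purely illustrative top-k MoE layer. Requires PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        gate_logits = self.router(x)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only top_k of n_experts run per token, so the per-query compute
        # stays far below a dense model with the same total parameter count.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(TinyMoE()(tokens).shape)   # torch.Size([16, 64])
```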
I can confidently say this is the first real AGI. Like ik they don't want to say it because it's a big claim, but the amount of context it has allows it to solve so many diverse problems. This is not just natural language mimicry anymore, it can code, write, sing, understand human tone of voice, create images, etc. It's not superhuman yet, but it is clearly competitive with humans.
2:50 This demo scene signifies what developers can do to train ChatGPT to console and guide someone who is experiencing a panic attack, anxiety, or some other similar medical problem. 4:56 ChatGPT's speed and accuracy are phenomenal. 8:43 The fact that ChatGPT can respond effectively and accurately to multiple follow-up questions in a sequence is just phenomenal. 8:43 This signifies that ChatGPT will be able to guide people with instructional tutorials, lessons, or real-life remedies and solutions to problems. 26:22 This demo portrays a credible real-life situation of a student receiving tutoring from an actual professional tutor. In other words, it is hugely convincing to say the least.
I have used GPT-4o today. It doesn't work at all like the demo. It can't change inflection, sing a song, or hum a tune. It had no concept of my own inflection either. It also did not support real interruption. It spoke, then you spoke. And for everyone wondering, it was 4o, because I reached the rate limit. Tl;dr: it doesn't work anything like the demo. At least right now.
Yeah, that's because you're not using the complete version. I think it was a mistake on their part to allow GPT-4o in accounts without releasing all the technology, which apparently is happening in the next few weeks.
I'm not impressed by these demos until I get the product in my hands so I can test these features myself. Too much faking it until you make it these days...
@@allanshpeley4284 Ah, yeah I agree. I think it was a mistake too. I have GPT-4o in my account but had the same experience as @markjackson1989. I keep seeing all these videos about all the stuff GPT-4o can do but then it doesn't work for me. I think they should have called it something different to avoid the confusion.
One thing though: atm the native voice capabilities haven't been released. I read something that says that's still a few weeks away, and it'll be released to a small group first before going public…probably teams like Vercel that'll need to update their code bases to prep for the public release.
I don't know if that's fair. The token space is entirely different, as is the training data. I think the only reason they're not calling it GPT-5 is because they seem to be reserving numerical iteration for size increases. In other words, every GPT model they make, no matter how different, will be called a version of GPT-4 until they scale up the number of parameters significantly. But to say it's just "4.5" -- like it's fundamentally the same with minor upgrades -- is a bit reductive.
@@brexitgreens But what does *that* mean? They clearly have been training GPT-4o. Saying "they're not yet training GPT-5" just means "they haven't yet decided to call a model GPT-5", but as Shakespeare famously said, what's in a name?
Another great video. It's so strange that they didn't mention any of these breakthroughs during the demonstration. Can't wait for it to be fully rolled out.
I've been working on a personal project that uses Whisper v3 (hosted locally) and it CAN tell the difference between a human and a bird chirping or a dog barking. While I was testing it, my dog started barking and it output "[dog barking]". Any non-human sounds it hears go into [square brackets]. So I would be typing code while the project was running in the background and it would output [typing]. There are other issues, like it doesn't detect color and tone of voice, like you were saying (color and tone refer to emotional content).
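For reference, here's roughly how a local Whisper v3 setup like that looks with the open-source openai-whisper package; this is a generic sketch, not the commenter's actual project, and the bracketed tags like "[dog barking]" are something Whisper sometimes emits rather than a guaranteed feature:

```python
# Rough sketch of local Whisper v3 transcription (pip install openai-whisper).
import whisper

model = whisper.load_model("large-v3")   # Whisper v3 weights, runs locally
result = model.transcribe("clip.wav")    # "clip.wav" is a placeholder file

# The transcript text may contain bracketed non-speech annotations such as
# "[dog barking]" or "[typing]"; Whisper does not classify emotion or tone.
print(result["text"])
for seg in result["segments"]:
    print(f'{seg["start"]:7.2f}s  {seg["text"]}')
```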
People didn't write this model... it was mostly written by AI itself. That's the difference. So in Terminator, it was predicted that 2027 was when Kyle Reese was sent back in time... three years to go, baby...
Good breakdown, Matt. Like many, I was guilty of being impressed on a surface level but not really grasping the deeper meaning of the demonstrated abilities. Watch the recent TED talk by Fei-Fei Li, by the way. Worth it.
Still dislike this release strategy. A few days after the event we still have a near-equivalent of the GPT-4 text model, without any of the extra features they did like 50 demos on.
It seems like everyone else is "trying" to AI... OpenAI "is" AI. I think everyone should drop the act and funnel all resources to them to get this ball rolling.
I’ve used GPT-4o and that Vision feature where the AI has eyes is not available yet I’ve tried to figure out where can I use it but they’re not mentioning that it’s not available yet for users.
I have not been this excited about AI breakthroughs since the announcement of the original GPT 4. This is mind blowing indeed! Currently as a GPT Plus sub, I can access the 4o model but because the multimodal features are not yet enabled, I’m not noticing anything super different just yet. I’m so stoked for when they unlock the full potential!
Whoa! This is miles ahead of what I was expecting this year! I guess multimodality is the future because it leads to a deeper understanding of the world. I love it. We live in the future!
I'd say the other companies are not so far ahead. If I'm guessing right, OpenAI probably moved away from the transformer architecture in favour of some version of the Mamba architecture. For anyone who is not in the field, I'll explain: for many years now, images and audio can be processed into tokens so that LLMs can train on them along with text, and that leads to the abilities you see. The problem was that the transformer architecture cannot really scale to large context windows efficiently, and audio and images require super long context lengths (as there is a lot of content in each sample). But lately there is this new brilliant architecture called Mamba, which so far has not performed as well as transformers but scales very well with context length, meaning it can probably process millions of tokens without a problem. I'd guess, both because of the speed of generation and because of the multimodality, that OpenAI has developed a strong Mamba variant that rivals transformer capabilities, and then they just trained it on a lot of multimodal data to achieve this performance. That said, the implication of this true multimodality should be a real understanding of text descriptions (as the model is super smart because of the text) and written text comprehension (as the model can understand the written text in its data).
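To put rough numbers on the context-length argument above, here is a back-of-the-envelope sketch; the formulas are just the standard asymptotic costs (quadratic self-attention vs. a linear state-space scan) with schematic constants, not measurements of GPT-4o, Mamba, or any real model:

```python
# Schematic comparison of how compute grows with sequence length.
def attention_cost(n_tokens, d_model=4096):
    return n_tokens ** 2 * d_model          # pairwise token interactions

def ssm_cost(n_tokens, d_model=4096, d_state=16):
    return n_tokens * d_model * d_state     # one recurrent update per token

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} tokens  attention ~{attention_cost(n):.1e} ops"
          f"   ssm ~{ssm_cost(n):.1e} ops")
```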
I'm guessing this model gets feedback from the images it generates; that (along with multimodality, of course) would explain why it's so good. If it can see the images it generates (like we do when we draw something), it can then correct them properly.
Re your closing thoughts: If you had a tool that could go through all the training material and delete the junk, explicitly mark things as satire or jokes, etc., you could do a much better and faster job of training. That tool is the previous LLM. Clearly they can use each generation of model to power tools that are specifically useful in managing the training of the next iteration. I think of it as analogous to bootstrapping a self-hosted compiler -- you can write it _in_ the more powerful language you are creating, if you keep re-compiling with the previous working version. I also see GPT-4o as generative in this capacity, so it can prepare idealized training material that's far more efficient than just reading all the threads on Reddit.
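Here is a minimal sketch of that "previous model curates the next model's training data" loop using the public OpenAI Python client; the model name, labels, and prompt are my own illustrative assumptions, not anything OpenAI has described about its pipeline:

```python
# Sketch: use an existing model to label raw documents for data curation.
from openai import OpenAI

client = OpenAI()
LABELS = ["keep", "junk", "satire_or_joke"]

def label_document(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",   # stand-in for "the previous generation" model
        messages=[
            {"role": "system",
             "content": "Classify this document for training-data curation. "
                        f"Answer with exactly one of: {', '.join(LABELS)}."},
            {"role": "user", "content": text[:4000]},
        ],
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "keep"   # fail open on odd replies

corpus = ["Breaking: scientists confirm the moon is made of cheese!"]
curated = [(doc, label_document(doc)) for doc in corpus]
print(curated)
```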
It's way past time to upgrade Amazon Echo, Apple Siri, and Google Home smart assistants. I've used Echo Buds headphones for years. The sound isn't the greatest, but being able to ask for any song or ask any question hands-free has been great. Having a super smart assistant in there would be incredible.
There is another weird thing about the image at 14:26. The man is "writing" on the board, but the chalk in his hand is in the middle of the text on the chalkboard, not at the end. The image would be more natural if it were placed at the end of the sentence, as if he'd just finished writing it.
@@DeceptiveRealities I had considered that earlier, but people generally underline the most important part, and the word "big" doesn't necessarily seem noteworthy enough to underline. The only other thing I can think of is that the man isn't intended to be writing at this point, but rather pointing emphatically with the chalk back to the three most noteworthy words: "text," "pixels," and "sound."
Exciting capabilities for content creators and storytellers, especially with image generation. The BIG question is who owns the copyright: is it the idea in the text prompt and thought process, or the production?
I just asked GPT-4o to generate an image and asked if it used DALL-E; it said it still uses DALL-E. Could be because the full version of it isn't rolled out yet, I guess.
I think the image editing is one of THE most mind blowing pieces of this... What do you guys think?
I think its Amazing. Also, love your videos. Ive watched for almost 2 years!
I'm wondering how far off we are from a universal real-time translator between humans and some animals. O.O We might get an earful soon. X3
When are these image capabilities being released? I tried recreating the samples with ChatGPT-4o by copying the prompts and steps but could not generate consistent characters.
I think it's the latency of audio -> GPT-4o -> audio (around 200ms) vs audio -> Whisper -> GPT-4 Turbo -> ElevenLabs (around 800-1200ms).
@@The_MostHigh The 4o available to users currently only outputs text. They said they are going to release it step by step, and in the next step they will release audio output for Pro users in a couple of weeks. So we will have to wait for all that.
14:17 Matt, the multiple whiteboards/chalkboards at the top ARE realistic. This is actually how chalkboards in older classrooms used to work. They would have multiple chalkboards on sliders that you could pull up and down.
Note that it also inset the top one inside the bottom one, as one would expect.
Most chalkboards I've seen are still of this variety--several overlapping chalkboards that slide up or down depending on which one you want to write on in the moment.
Yes, these are still commonplace at universities.
I've never seen a multi-chalkboard like that… 🤔
It might be 'meant' to be a multi-board blackboard, but if you look at it, its structure isn't at all realistic. I wonder if current models such as GPT-4o use their understanding of basic physics, structure and mechanics when they create images, like a human who's used to living in this world would? They do display some understanding of those things in their text output. But unlike humans, they don't have tactile experience of the world to draw on. And does GPT-4o have 3D vision? Most of its training images will be 2D!
Timestamps for yall:
00:00 - Introduction and Initial Reactions
Introduction to the video.
Reaction to OpenAI's real-time AI companion.
00:36 - Overview of GPT-4o and Multimodal AI
Explanation of GPT-4o.
What does "multimodal" mean?
01:42 - Comparison with GPT-4 Turbo
Differences between GPT-4o and GPT-4 Turbo.
Audio capabilities of GPT-4o.
03:22 - Text Generation Capabilities
Speed and quality of GPT-4o's text generation.
Examples of high-speed text generation.
07:22 - Audio Generation Capabilities
Demonstration of GPT-4o's audio generation.
Examples of emotive and natural voice outputs.
12:22 - Image Generation Capabilities
Explanation of GPT-4o's image generation.
Examples of high-quality image outputs.
19:04 - Advanced Features
Image recognition and video understanding.
Examples of practical applications and scenarios.
23:27 - Video Understanding Capabilities
Discussion on GPT-4o's video capabilities.
Potential future developments and limitations.
27:34 - Conclusion
Final thoughts on GPT-4o's impact and potential.
Invitation to viewers to subscribe and join the community.
Can't wait till they crack their own 1M+ token context.
"yall" is not a word
@@ouroborostechnologies696 Yeah it's "Y'all", you fuckin' grammar Nazi, lol.
@@ouroborostechnologies696 neither is "gentleladies" but they now use that in Congress 😂
I am curious: what do you think about OpenAI getting rid of the Sky voice (the one that sounds like the voice from "Her") from their ChatGPT-4o model?
One of the things I'd like to try with GPT-4o is to take a photo of a page from a manga or comic book, or even a novel, and ask it to read back the text in the voices of the characters as they speak.
Nice
...and then to generate sidequests, and with Sora to convert them into Marvel-style video images while GPT reads it in an emotionally dramatic voice.
Bruh, with Sora you could have it animate its own anime.
Don't forget sound effects and background music
I'd like to see how Sora-level AI could re-imagine comics. Imagine if each panel was fully animated, so trees blow in the wind, characters breathe and of course speak what's in their bubbles. A running character would have the scenery fly by, and all the animation would be derived from the panels. I'm not even sure how you would read such a thing. As one long flowing video going from panel to panel? Or have panels play as video as you hover over them? Maybe something far more bizarre, where what a comic is melts away, replaced by some fusion of photorealism and motion translating the comic's intention into actual little movies. This kinda sounds crazy, but seeing what is coming, I don't think it's beyond Sora-level engines from Google and OpenAI.
I don't know about everyone else, but most of the people I come in contact with have no clue about the rapid developments in AI. Kind of eerie...
Yeah, I've been saying for a while now that a lot of people are going to be completely blindsided by how much things are going to change soon with how fast AI is advancing. Even as someone actively following it I find myself being blown away fairly often. The future is gonna be wild.
@@SignumEternis Oh yeah big time and if you follow it and have a somewhat tech savvy / biz mind then there are so many oh sh$! moments. On my end most are not paying attention and going on with business as usual. That is unless they are in an industry that is suddenly being directly impacted.
I tried showing family that GPT-4o vid and they didn't get it and turned it off halfway through.
I somewhat follow it, or try to, and even I feel blindsided by how far they've come. Then I imagine how far they have actually gone but haven't shown us yet.
@@jros4057 Yeah, and just one of the scenes from that video - the AI teaching the kid math - is a major paradigm shift. To think teachers could soon be replaced with a much smarter and more efficient system in AI. Not saying that's a good thing, but it is what it is and we have to deal with it. Just that piece alone is normalcy-shattering news. But yeah, most people don't seem that interested. It's wild.
Idk if I'm more impressed with the life-like sound of the voice, or with how human it feels to interact with (i.e. it understands our emotions).
It doesn’t actually work when you use it, the demo must be a better model
@@dmitryalexandersamoilov It's not fully out yet.
@@dmitryalexandersamoilov That hasn't been released yet. It's coming in the new app.
I hope it has a changeable voice and can have that over-the-top expression dialled down. To my non-American ears it sounds raucous and emotively fake.
@@dmitryalexandersamoilov It's being released over several weeks.
GPT-4o is also A LOT more reliable when it comes to long-form text processing. Not even comparable to either GPT-4 or Gemini. It follows the prompt much better, doesn't get lazy so easily, and doesn't start to hallucinate so quickly. I tried four hours to get GPT-4 and Gemini to do what I wanted, and they failed miserably. GPT-4o completed the whole damn task in 40 minutes without so much as a hiccup.
How come? I got kicked back to 3.5 after 4 messages. I can hardly do anything with that limit. And having to wait 4 hours to continue the chat is not convenient.
@@ronilevarez901 Good question. GPT4 threw me out after countless attempts to get it to do what I wanted, and GPT-4o just did it. I'm in Germany, maybe it's a time zone thing, less traffic at my CEST time, and therefore less bandwidth/token restrictions?
I gave it this prompt (originally in German, because I was working with German PDF documents; translated here):
Please read the attached PDF document in full and format the content according to the following instructions:
- Remove all hyphens (-) from the text.
- Fix the spacing of words written in letter-spaced type so that they appear normally (example: turn "R a u m s c h i f f" into "Raumschiff").
- Remove all superfluous section identifiers (e.g. "B-20" or "C-1").
- Avoid duplicate headings and make sure every section has a clear, unique title.
Do not change or invent any words or content. Please do not create summaries. Use only the original text.
Format the text as clean running text, paying attention to correct paragraphing and punctuation. Please do the whole edit in a single pass and present the complete result.
@@ronilevarez901 He probably has a Plus account. The rate limits are 5 times higher on Plus accounts.
@@ronilevarez901 Probably using the API, so different rate limits.
@@matiascoco1999 Or just a Plus subscriber.
Services like Audible should release AI that reads the books but also lets you talk about the topics, take quizzes, and more, making the entire book library an instant interactive homeschooling study resource for anyone wanting to level up in life. In contrast to just 'consuming' audiobooks as we do in today's passive, one-way relationship dynamic.
I have indeed been saying it’s the inhabitants of digital by spiritual beings jins to interact and communicate with human through “ technology “ tree frame ayyyy!! The final form set but rolling out gradually in order to be accepted normalise it.. collect consciousness
Pretty sure there's a PDF-reader ChatGPT bot; you don't even need Audible to do this, you just need your book as a PDF file.
@@Suhita-ys6hd Do you know the name of it?
That would be cool, but they have to get rid of the bias first, so if you read a book with a conservative point of view, the AI won't lecture you for engaging in political incorrectness! 😂
@@Suhita-ys6hd Nah, I'd like the low latency and choice of reading tone with GPT-4o. Other current apps still feel like talking to a robot, so to speak.
Chalkboards often have multiple boards that slide on top of each other.
My old middle school used to have ones that swing over to the side
my thought exactly
15:53 Actually no, the image generation didn't screw up. If you look, that's actually EXACTLY what was written, including capitalisation (or lack thereof). What's even more impressive is that it actually split the word "sound's" across multiple lines and did it completely correctly! Actually mind-blowing! 🤯🤯🤯
This
FR Mattvidpro failed English 101 Lollll
No, hyphenation happens between syllables of multisyllabic words, that's the rule.
I'd even say it's more impressive than it seems. They deliberately made a mistake with "sound's" and ChatGPT-4o didn't correct the mistake (which it should have done based on its training).
So ChatGPT-4o did exactly what the prompt said even though it goes against its training.
Or am I wrong here?
It got "everything" wrong
Honestly, regarding images: what we really need IS multi-modality. The images produced by common models like SD are good enough. The problem is that they don't really understand what they are doing. If they can keep the quality of current models and just add a deep understanding to it, that multiplies the actual quality of the outcome by orders of magnitude, in the sense that you get what you actually want AND can change specific things, instead of getting images that only loosely follow the prompt and then inpainting and hoping for the best.
No other image AIs have access to language models that good.
Yes, I've been saying this all along.
The human brain isn't separate modules, trained separately then cobbled together. It does have specialized regions but it learns together, as one. In doing so, it makes many associations. Most of our knowledge/memory is formed through multiple associations.
For any AI to have truly general intelligence, it must be able to do the same. This is how we are able to transfer one set of knowledge/skills to a new area or novel task.
Other image generating AIs often screw up the hands because they don't understand what fingers are, let alone that we have eight fingers and two thumbs.
If you watch AI-generated videos, you'll see similar strange things happening, like people walking into walls and then disappearing. They can generate photo-realistic videos but don't understand what the images represent. A truly multi-modal model solves these problems.
These aren't really LLMs anymore. @@jaredf6205
In order for it to have true "understanding" it would have to become conscious... which, in the field of AI, will inevitably happen someday. Hopefully later rather than sooner, lol.
It seems that when learning multiple modalities, they reinforce each other and interact in a way that increases intelligence in a non-linear way.
The most mind-blowing thing is the speed. With that speed and variety of natural voices you can make a real RPG game with AI NPCs.
Can't wait
Even an entire game made by it. I've already been trying to get it to make me a JS RPG; the visuals are stunning.
@@JaBigKneeGap If you have a video of this running as an RPG, I'd love to see it.
Man, the image understanding of GPT-4o is crazy
Yes, I asked it to transcribe scanned handwritten birth certificates from the 1800s, in Portuguese, where I can't read most of the words. It works - some errors, but it's mind-blowing.
At this level of functionality, hooked up to a global database like the internet, it would be able to do 80% or more of human jobs.
@dot1298 Yes, see the ISA (Invention Secrecy Act) of 1952 under Eisenhower. It's unlikely though, as the public already knows about it.
An odd thing about GPT-4o is that it's better at poetry than it used to be. It has a better idea of the meter of a limerick or a sonnet than it did before it had a multimodal understanding of what words sounded like. Words like "love" and "prove" don't rhyme any more. You can see this by asking GPT-4 turbo and GPT-4o to produce poems using the existing text interface. It's also the first time I found a model that can reliably produce a Petrarchan/Italian sonnet instead of a Shakespearean/Elizabethan sonnet--previous models always used the much-more-common Elizabethan rhyming scheme.
There's only a handful that can do poetry properly. GPT-4o is one of them.
I've experimented with having non-rhyming poems, mixed meters, and a focus on a variety of poetic techniques. It is absolutely capable of creating a poem using metaphor at a distance to talk about something apparently unrelated to what it seems on the surface.
@@Rantarian That's incredible. But I can believe it. I think maybe these models have more understanding than a lot of people think. People often saying they don't understand things the way humans do. I don't get it. To me a thing is either understood or it is not. The mode or mechanism of understanding of ML models vs humans may be very different; but to me that's irrelevant! Understanding is an abstract capability that has nothing to do with physical process or mechanism. I'm sure it is in AI companies' interests to downplay the intelligence / understanding / power of these models, so that they can get on with developing, releasing and in some cases commercializing them, without too much pushback or regulations!
@@82NeXus I agree with that. The statement that AI models don't “really” understand is absurd. Understanding cannot be simulated. It is there, or it is not.
It makes sense, since rhyme is basically sound. If a model has no comprehension of what sound is at all, it can't generate poetry. It can only roughly mimic the writing style of real poets. It's the added sound modality that made it better at rhyming.
@@alexmin4752 Precisely.
About the chalkboard. I think the dual chalkboards are not unrealistic. We had those a lot when I was studying. You could move them up and down to have more space.
Our lecture halls had high ceilings and triple chalkboards
I remember reading Nick Bostrom's book "Superintelligence: Paths, Dangers, Strategies," and in one of the chapters something stuck with me that goes somewhat like this: "I can see a scenario where any one entity being six months ahead of everybody else is enough to win the game."
Less than 6 months ahead is probably more than sufficient.
Yeah, but the game of money is soon coming to an end. Once you make AGI, ASI is a step away. How long can the current system function when nobody is necessary? They just released a Chinese robot that costs $16K and can do most anything. Add in this GPT-4o and that BTFOs all low-skill wagies.
@@1x93cm Ah...did China tell you that they did that?
@1x93cm I think you misunderstand how AGI and ASI will actually change how necessary humans are. Even with the most advanced AI and robotics, humans will always be necessary. Resources and work are needed, and if anything human intelligence will become even more of a commodity. Machines can't replace our creativity no matter how smart they might get.
Getting rid of human labor as we think of it now would be beneficial, but removing human power from the equation entirely would be foolish. Don't forget the greedy people who will not allow the machines to take their resources and money away from them to begin with. What do you think all the regulations are for? They're to protect them from AI, not us.
@@14supersonic If there is an economic incentive for something, it happens. If there is an economic incentive to replace most if not all human labor, it'll happen and nobody will care about the consequences. After seeing drone videos from Ukraine, it would be very easy to put down any uprisings that result from mass unemployment or unlivable conditions.
The solution will be the creation of a sideways economy similar to the localized economies of favelas.
12:27 Unless it's an app specific feature, GPT-4o in the ChatGPT interface explicitly states that it generates images using DALL-E 3.
14:10
Many university blackboards like this come in sets of three at different depths in front of the wall. You can slide them up and down to access the other boards. It allows the lecturer to keep writing on a new board while allowing students to still see previous steps in the lesson if they need to look back, and it also means the professor doesn't have to waste time erasing the whole board every 5-10 minutes.
This is the first AI model that I feel the urge to use. The capabilities are incredible.
It’s understanding of the world is next level. That understanding translates to, what Open AI even said is, abilities still being realized… They don’t shy away from saying AGI is imminent, I think if you give it video and indefinite memory that WILL be AGI.
Yupp, pretty much. Just add in memory and video and it's AGI. However, I'd love it to have, say, a 160 IQ as well.
@@fynnjackson2298 Could Einstein speak 50 languages? IQ cannot capture what an intelligence like GPT-4o really can do. No, it's not perfect, but perfection isn't required for AGI.
@@fynnjackson2298 I think It’s already at child level now and will shoot past 160 pretty fast after AGI level to SGI, depending on the guardrails.
@@TrueTake. "at child level"? I think it's light years ahead of average human capabilities in most areas.
The 4.0 supposedly had a calculated IQ of 155
The moniker "omni" implies to me something bigger also, though I doubt it's true:
"omni" meaning "all" suggests that the AI is capable of using literally any modality, and working with all modalities together.
Since this is clearly not the case, it may instead be that it actually means it is in some way modular, or easy to retrain to add extra modalities that it is currently not able to use without hindering its ability to work with previously learned modalities.
Again, very much doubt it, but that's what the name should suggest. OpenAI probably just thought it sounded cool.
Mixture of experts with some experts having additional modalities perhaps?
"Omni" instead of "multi" because seamless and arbitrarily generalisable to any modality. A prelude to embodied GPT.
Maybe there are some modalities it trained on that are not yet exposed. I can imagine robot joint angles, torques, velocities, accelerations being important for their robotics partners using end to end learning
I believe it is true.
They even give a strong hint of this on their website.
"Since this is clearly not the case," - Can you explain this for me? I must have missed something.
I think Omni simply means “all” as in “all commonly used modalities.” I don’t think it’s much deeper than that.
Just the fact that we have to rethink the trajectory of our lives and how we operate because of all this new tech is so awesome. AI plus humanoid robots at mass scale, plus robotaxis, plus compounding technical advancements in all areas. The future is coming and it's coming faster and faster. What a trip!
You sound like a tech slave to me.
The Android app no longer has audio in either 4 or 4o. I was hoping the website version had audio, but nope.
Wow, you are right. After I read your comment I can no longer access it on my Android phone either.
Sometimes it has the old audio system, at other times - just basic voice typing.
I created a new, free account and it works there. It doesn't work on the paid subscription. The optimist in me is hoping that's because they're updating it to the new version? Though I know they're probably just fixing something.
My app still has the old "voice mode"... I've seen a lot of people saying it disappeared for them, but still there for me 🤷♂️
@@johnshepard5121 On my free account it's not there either.
Cracked me up at "I wouldn't even be able to tell you this was a missile in the first place! This thing's a professional!" 😂
Has anyone verified that it got the missile picture right? Coz ChatGPT 3 could've convinced you that that missile came from anywhere 😂
I don't even see a missile.
What's scary is that Sora AI video generation is this good now. Imagine AI video in 1, 2, or even 3 years. It's going to be crazy.
Film remakes on demand. OK, but so good they all get a 9.0 IMDb rating.
Matt, you ponder the question a few times whether the answer to these new capabilities is really just the multimodal aspect. I absolutely think that this is the case. The key, as we now all understand, is context and memory. With a greater diversity of context clues (modalities), it makes sense that the contextual understanding of the model becomes more complex. And we now know greater complexity = greater intelligence.
We now have the following levers for increasing intelligence in AI:
1. Neural connections
2. Context, memory, attention
3. Input training data
4. Diversity in modalities
Would love to see what happens when these models really start getting placed into robotics and gain additional modalities (temperature, EM, proprioception, touch, spatial awareness, balance, etc.).
Dude, you are killing the other YouTubers with your reviews. Keep it up, brother, and thanks for keeping us super informed.
Huh? I get something out of all of them I follow.
@24:29
Did I really see what I just saw?
The capability of scanning old books at such speed is mind-blowing!
Matt, Ideogram opened up a new world for me! It's so dope, and so is the new GPT-4o. Thank you for your work ✨✨😎✨✨
I showed it a pic of my product to help me with an Etsy listing and it perfectly identified the item, all the materials used, who would use it, and for what purpose. I was truly speechless.
That multiple blackboard was intentional. Lots of lecturers use rolling multiple blackboards, like that one depicted.
I see that Bold and Brash painting in the back. You cultured man
I gave it items and rates from our video production company. I asked it what the prices were for certain items, and it still gave me the wrong rates. I asked it to create a budget for a 1-day production and the prices it used were not what I gave it. I still think we've got a long way to go.
The image editing capabilities are truly mind-blowing. With music, video, and audio generation advancements on the horizon, the creative possibilities are endless. Many thanks.
They didn't showcase these features because it's 100x SENSORY OVERLOAD. Their 4o demo was strategic to get the world's attention but mysterious enough for inquisitive minds to dig deeper, and you did. That said, I've been a dev for two decades, and after watching this, I'M TRIPPING BALLS RIGHT NOW.
I've had a nerd boner for 4 days now. Payday can't come soon enough!
Reminds me of when Milo was rumoured.
Another great video. There is definitely something going on at OpenAI, the way they manage to be ahead of the curve. I think they are using an inhouse GPT-5 to help run R&D, possibly even sit in board meetings and help run the business. They seem to have something no one else has.
OpenAI = Cyberdyne Systems lol
Wow, I saw this on my GPT tab and didn’t really use it, but now I know it’s THIS powerful! I’ll definitely use this from now on!
GPT-4o is the checkpoint 0 of GPT-5 🤯
They totally already have GPT-5. I firmly believe a lot of the work at these companies is just packaging up a small increase in ability when they feel like it. Like Google, which always goes too far with their lobotomies; OpenAI also has a history of this. When ChatGPT 3.5 came out, it was much better than what 3.5 turned into. As soon as they first updated ChatGPT it was a downgrade, and then when GPT-4 came out it was like some of that increase in ability was just getting back what they'd taken away. They've taken away some of these abilities in GPT-4o over just the last three days! It can't understand sound now, like birds, dogs, heavy breathing, or emotional expression, and it tells me in multiple new sessions that it can't sing. So we know they can easily just turn some of this off. Sora is also WAY WAY WAY too good, and I think that's because they have an EXTREMELY good model behind the scenes.
We are so fucked...
I don't agree at all, GPT-5 would need to be much much smarter, which is a much greater challenge to achieve than creating a multimodal model which is about efficiency. The scientific research of these two domains is very different.
@@Edbrad Yes, they do have GPT-5, this has been confirmed. Also, you aren't using the new audio and you never did, that's not even released yet, you're still using whisper. Also, the new image generation is better than Sora in some cases.
Yes I'm thinking this is just an early version of what was intended to be GPT-5, but strategically they needed to pre-empt some other developers' releases. Which raises the question, if this is only GPT-5 beta, how good will GPT-5 be?
20:20 Something you missed is that GPT-4o cleaned up the coaster in the generated image, removing only the stains but leaving the coaster the same. Basically sprucing up the product unprompted. Just a little thing, but cool if you understand what that means for reasoning about the images it sees. It shows rudimentary understanding of, and reasoning about, the physical world.
12:04 how could a deaf person hear GPT 4o say "hey you have to get out of here" 😂
LITERALLY, WHAT THE HECK
Light strobing and vibrations, which it definitely can do.
21:52 Can it only generate 3D images, or can it also generate 3D models?
Great video Matt! Thank you for all the helpful information 🔥
15:53 It didn't screw up. "Sound's" is short for "sound is": every sound is like a secret.
How do you guys access it for free? I've explored the app and also the OpenAI platform, playground, everything. There is no free option unless I subscribe to the paid "Plus" plan.
13:16 How many blackboards do you typically stack on top of each other on the wall? Is the 5 degree tilt essential for optimal writing? Is it normal to fade random letters when writing with chalk? And is the white film covering two thirds of the blackboard actually needed? 🤔
Matt, are you doing a live stream again this weekend? Or?
probably not :( I will try and schedule one next week
@@MattVidPro all good. Thank you bro 🙏🏾❤️
0:36 The Z key is in the wrong place on the typewriter. Also, the mechanism where the mechanical key strikes and the paper roller bar are too close together. Hands/fingers are still messed up.
Hopefully we get access to the real-time stuff soon. I can't wait for that.
We have it
The voice stuff? I don’t see it on mine, I’m on gpt+
@@sportscommentaries4396 We don't have it yet, the old voice mode has confused a lot of people to thinking it is the new one.
@@sportscommentaries4396 update your app
@@alexatedw no we don't
For a couple years now, I've said there are three main obstacles between current GenAI and human-level GenAI: multimodality, size, and continual learning. The size of models, I expect, will continue to grow, especially as NVIDIA pumps out better hardware for them. Continual learning is tough on these massive models, but if I understand correctly, Google's "Infini-attention" paper introduces something very similar to -- if not an actual form of -- continual learning for massive Transformers. And as we see here, multimodality in the token space does *amazing* things for the capabilities of these models, and we're getting them, one new modality at a time.
At this rate, I suspect we'll have all these three issues more or less solved within the next two or so years, and after that it's just about scale to hit human-level AGI.
As culty as it sounds, I do, in fact, feel the AGI. (RIP to Ilya's tenure at OpenAI, by the way.)
The first major flaw I was able to spot: while GPT-4o can read long transcripts in a split second, it still fails to associate fragments with respective timestamps correctly.
In my tests it is good at summarizing and adapting text style. But it totally failed to reason about what it was writing, and about itself, in many ways. GPT-3.5 turned out to be better, or at least at the same level, in that respect. It might have more functionality, but it is not "more", sadly.
Tell it that and it will correct itself.
@@Uthael_Kileanea I did. 3.5 corrects itself properly. 4o kept rewriting the text instead of giving its interpretation of the text until I explicitly told it not to. And even then, it failed to do the right thing. Good for essay writing; bad for a more interesting chat with it.
@@Uthael_Kileanea The transcript in question was a standard SRT (subtitle) file. When GPT-4o failed to provide the correct timestamp for a random quotation, I asked it to provide the turn index number instead - which should be easier because it's incremental. It failed that too.
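For what it's worth, that lookup is trivial to do deterministically outside the model. A rough Python sketch (the sample file, quote, and helper name here are made up for illustration, not anything from the test):

```python
import re

def find_srt_entry(srt_text: str, quote: str):
    """Return (index, timestamp) of the first SRT block whose text contains the quote."""
    # SRT blocks are separated by blank lines: index line, timestamp line, then text lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        index, timestamp, text = lines[0], lines[1], " ".join(lines[2:])
        if quote.lower() in text.lower():
            return index, timestamp
    return None

sample = """1
00:00:01,000 --> 00:00:03,500
Every sound is like a secret.

2
00:00:04,000 --> 00:00:06,200
The model reads the whole page at once."""

print(find_srt_entry(sample, "whole page"))  # ('2', '00:00:04,000 --> 00:00:06,200')
```

The fact that a ten-line parser gets this right every time is exactly why it's disappointing when the model doesn't.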
26:34 So, GPT-4o wouldn’t be able to analyze a video, break it down, and create timestamps? I’m assuming it also won’t be able to find specific information that I ask it to find within a video. I hoped it would be useful for research purposes.
Last time I was this early there weren't even animals on land.
I know that feeling. 🥇🏆
I know right, the first time we visited this planet everyone was dinosaurs. So we interfaced with you as dragons. Then there were people in togas playing lyres, so we interfaced as Greek gods. Now there are little men everywhere playing video games, so we interact with you as AI deepfakes.
🤣😂😆
You were not. We know it's bullshit.
@@Paranormal_Gaming_ Impudent worm food! You'll be disintegrating particles in just a mere 50 years.
Thanks a lot for bringing all these additional features into focus that OpenAI chose to underplay during its demonstration session. The realisation that this could be a new kind of LLM altogether with way advanced multimodal capabilities is a bit unsettling.
I want wearable glasses built on top of this.
Give it time, it's coming :)
In the movie "Her", all you had to do was put your smart phone in your pocket with the camera sticking out and Samantha the AI could see everything you did. She talked to you through a wireless earpiece.
14:25 the multiple blackboards is pretty standard in universities, they rotate around and you can have multiple "pages" of blackboards
Maybe GPT-4o can solve the mystery of the Voynich manuscript.
I am blown away. This is making my head spin. I’m 66 years old and never thought I’d see something like AI in my lifetime.
So why is it cheaper if it’s the most powerful version of ChatGPT? Will the other models be even cheaper than 4o now?
Because something… else is coming..
Just a guess, but the fact that it is so fast and responsive would imply to me that it is actually smaller and LESS computationally expensive than former models, yet performs better. Could be due to some combination of better training data, algorithmic breakthroughs, etc.
After trying Mixtral 8x7B and Mixtral 8x22B, which run at about the same speed as Llama 3 8B and Llama 3 70B respectively, I'd guess that it uses a mixture-of-experts type approach that allows most of the calculations for any query to run within the 80GB limit of a single H100 GPU, though a different query would run on a different H100. Maybe I'm wrong, and it's the same server rather than the same GPU, or a pair of GPUs, but some sort of sharding/mixture-of-experts approach. They probably also overtrained it like they did with Llama 3. Plus various other tricks, such as improving the embeddings, though I'm not sure that would make it faster or cheaper... this is my best guess.
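To be clear, this is pure speculation; OpenAI hasn't said anything about the architecture. But a toy sketch of top-k mixture-of-experts routing shows why only a fraction of the total parameters would need to run per token (all sizes and names here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

# Toy "experts": simple linear layers. In a real MoE only the routed experts run,
# so per-token compute scales with top_k, not with n_experts.
experts = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router                              # routing scores, shape (n_experts,)
    top = np.argsort(logits)[-top_k:]                # pick the top_k experts for this token
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen experts
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (64,) -- only 2 of the 8 experts were evaluated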
the 3D generation really blew my mind. jaw dropped literally. thank you for the update
I can confidently say this is the first real AGI. I know they don't want to say it because it's a big claim, but the amount of context it has allows it to solve so many diverse problems. This is not just natural language mimicry anymore: it can code, write, sing, understand human tone of voice, create images, etc. It's not superhuman yet, but it is clearly competitive with humans.
Imagine when the base model gets updated to gpt-5
2:50 This demo scene signifies what developers could do to train ChatGPT to console and guide someone who is experiencing a panic attack, anxiety, or some other similar medical problem. 4:56 ChatGPT's speed and accuracy are phenomenal. 8:43 The fact that ChatGPT can respond effectively and accurately to multiple follow-up questions in a sequence is just phenomenal. 8:43 This signifies that ChatGPT will be able to guide people through instructional tutorials, lessons, or real-life remedies and solutions to problems. 26:22 This demo portrays a credible real-life situation of a student receiving tutoring from an actual professional tutor. In other words, it is hugely convincing, to say the least.
I have used GPT-4o today. It doesn't work at all like the demo. It can't change inflection, sing a song, or hum a tune. It had no concept of my own inflection either. It also did not support real interruption. It spoke, then you spoke. And for everyone wondering, it was 4o, because I reached the rate limit.
Tl;dr: it doesn't work anything like the demo. At least right now.
Yeah, that's because you're not using the complete version. I think it was a mistake on their part to allow GPT-4o in accounts without releasing all the technology, which apparently is happening in the next few weeks.
I'm not impressed by these demos until I get the product in my hands so I can test these features myself. Too much "fake it till you make it" these days...
Because it isn't exactly out yet. They only introduced some people to the text version, not the voice. The voice you used was likely just GPT 3.5.
@@verb0ze Open AI faking a demo would be a horrible business move. Who's faking it till they make it?
@@allanshpeley4284 Ah, yeah I agree. I think it was a mistake too. I have GPT-4o in my account but had the same experience as @markjackson1989. I keep seeing all these videos about all the stuff GPT-4o can do but then it doesn't work for me. I think they should have called it something different to avoid the confusion.
One thing though, atm the native voice capabilities haven’t been released. I read something that says that’s still a few weeks away and it’ll be released to a small group first before publicly…probably teams like Vercel that’ll need to update their code bases to prep for the public release
I told you, Matt, that we were going to have GPT-4.5 before GPT-5 - but you didn't believe. Turns out GPT-4.5 is named GPT-4 Omni.
I don't know if that's fair. The token space is entirely different, as is the training data. I think the only reason they're not calling it GPT-5 is because they seem to be reserving numerical iteration for size increases. In other words, every GPT model they make, no matter how different, will be called a version of GPT-4 until they scale up the number of parameters significantly. But to say it's just "4.5" -- like it's fundamentally the same with minor upgrades -- is a bit reductive.
@@IceMetalPunk OpenAI have declared from the outset that GPT-5 will/would be embodied.
@@brexitgreens Did they? I missed that. Interesting... so they won't call new models GPT-5 until they're in the Figure 0x.
@@IceMetalPunk Also a recent US Ministry Of Defence report states that OpenAI have not even begun training of GPT-5.
@@brexitgreens But what does *that* mean? They clearly have been training GPT-4o. Saying "they're not yet training GPT-5" just means "they haven't yet decided to call a model GPT-5", but as Shakespeare famously said, what's in a name?
15:09 The capitalization is different, and it also replaced "now" with "how"?
Understanding video has uses in robotics and CCTV monitoring.
Another great video. It's so strange that they didn't mention any of these breakthroughs during the demonstration. Can't wait for it to be fully rolled out.
And just think. This is just what they are showing you.
I've been working on a personal project that uses Whisper v3 (hosted locally) and it CAN tell the difference between a human and a bird chirping or a dog barking. While I was testing it my dog started barking and it output "[dog barking]". Any non-human sounds it hears go into [square brackets]. So I would be typing code while the project is running in the background and it would output [typing].
There are other issues, like it doesn't detect color and tone of voice, as you were saying (color and tone referring to emotional content).
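If anyone wants to try the same thing, here's roughly what that local setup looks like with the open-source openai-whisper package (the audio file name is just an example; the bracketed non-speech tags like [dog barking] are something I've observed in its output, not a documented guarantee):

```python
# pip install openai-whisper
import whisper

model = whisper.load_model("large-v3")        # loads local weights, no API call
result = model.transcribe("living_room.wav")  # hypothetical recording with speech + background noise
print(result["text"])                          # non-speech sounds sometimes show up in [brackets]
```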
People didn't write this model... it was mostly written by AI itself. That's the difference. So in Terminator it was predicted that 2027 was the year Kyle Reese was sent back in time... three years to go, baby...
Just letting you know, this is not a good thing.
This channel should have a lot more followers! Excellent work!
bru we’re so cooked
Good breakdown, Matt. Like many, I was guilty of being impressed on a surface level but not really grasping the deeper meaning of the demonstrated abilities.
Watch the recent TED talk by Fei-Fei Li, by the way. Worth it.
SO HYPED
You're insane
@@angrygary91298 username checks out xD
@@angrygary91298 You're not hyped?
I'm so happy about this progress. OpenAI is really doing an amazing job of staying ahead of everyone else.
Still dislike this release strategy... A few days after the event, we still have a near-equivalent of the GPT-4 text model without any of the extra features they did like 50 demos on.
Better than Google. All pre-made demos. No live demo. Promises promises. 😄
Can we take a second to appreciate the GPT blocks on the table? Not only are the blocks great, it threw in shadows. Seriously, damn impressive
Excellent combination of features. The persistence of AI models and renderings means that it can generate quality videos now.
It seems like everyone else is "trying" to AI... OpenAI "is" AI. I think everyone should drop the act and funnel all resources to them to get this ball rolling.
I've used GPT-4o, and that vision feature where the AI has eyes is not available yet. I've tried to figure out where I can use it, but they don't mention that it's not available to users yet.
$5 per million tokens?! This is ridiculous! Gemini's million-token context is unlimited and free. You can put huge books and videos into it.
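For scale, here's a quick back-of-the-envelope at that quoted input price (the token count for a novel is a rough assumption):

```python
# At the quoted $5 per million input tokens, a ~120,000-word novel
# (~160,000 tokens at roughly 0.75 words per token) costs well under a dollar to feed in.
price_per_million = 5.00      # USD per million input tokens, as quoted above
book_tokens = 160_000         # rough estimate for a long novel
print(f"${price_per_million * book_tokens / 1_000_000:.2f}")  # -> $0.80
```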
I have not been this excited about AI breakthroughs since the announcement of the original GPT 4. This is mind blowing indeed! Currently as a GPT Plus sub, I can access the 4o model but because the multimodal features are not yet enabled, I’m not noticing anything super different just yet. I’m so stoked for when they unlock the full potential!
Present
Did you upgrade your camera setup? Video is looking crisp!
Gpt4-o did it for him
and yet it still writes crap code
Until it doesn’t
Small mercies
Fantastic analysis of the capabilities of GPT-4o. I can't wait to see what they are going to show us next this year!!!
Whoa! This is miles ahead of what I was expecting this year! I guess multimodality is the future because it leads to a deeper understanding of the world. I love it. We live in the future!
Great video, thank you for keeping us all informed on the latest AI!
15:55 it did not screw up, the "-" is to note the continuation of the word "sound"
I'd say the other companies are not so far behind, if I'm guessing right. OpenAI probably set aside the transformer architecture in order to go for some version of the Mamba architecture. For anyone not in the field, I'll explain: for many years now, images and audio could be processed into tokens so that LLMs can train on them along with text, and that would lead to the abilities you see. The problem was that the transformer architecture cannot really scale to large context windows efficiently, and audio and images require super long context lengths (there is a lot of content in each sample). But recently there was a brilliant new architecture called Mamba, which so far has not performed quite as well as transformers but scales very well with context length, meaning it can probably process millions of tokens without a problem. I'd guess, both because of the speed of generation and because of the multimodality, that OpenAI has developed a strong Mamba variant that rivals transformer capabilities, and then they just trained it on a lot of multimodal data to achieve this performance. That said, the implication of this true multimodality should be a real understanding of text descriptions (the model is super smart because of the text) and comprehension of written text in images (the model can understand the written text in its data).
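Again, this is all guesswork about what's under the hood, but the scaling argument itself is easy to see with a little arithmetic comparing quadratic self-attention to a linear recurrent scan:

```python
# Rough compute comparison: self-attention grows quadratically with context length,
# while a state-space / Mamba-style scan grows linearly.
for n in (1_000, 10_000, 100_000, 1_000_000):
    attention_pairs = n * n      # every token attends to every other token
    linear_scan_steps = n        # one recurrent update per token
    print(f"n={n:>9,}  attention ~{attention_pairs:>16,}  linear scan ~{linear_scan_steps:>9,}")
```

At a million tokens the quadratic term is a trillion pairwise interactions versus a million scan steps, which is why long audio/video context is so painful for vanilla transformers.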
Chatsy is my fave. He's so fast now, and glib, fun to interact with, and playful. I'm stoked for his rollout.
I'm guessing this model gets feedback from the images it generates; that (along with the multimodality, of course) would explain why it's so good. If it can see the images it generates (like we do when we draw something), it can then correct them properly.
Re your closing thoughts:
If you had a tool that could go through all the training material and delete the junk, explicitly mark things as satire or jokes, etc. you could do a much better and faster job of training. That tool is the previous LLM. Clearly they can use each generation of model to power tools that are specifically useful in managing the training of the next iteration.
I think of it as analogous to bootstrapping a self-hosted compiler: you can write it _in_ the more powerful language you are creating, if you keep re-compiling with the previous working version.
I also see the GPT-4o as generative in this capacity, so it can prepare idealized training material that's far more efficient than just reading all the threads on Reddit.
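As a sketch of that idea (not OpenAI's actual pipeline; the labels, prompt, and function are hypothetical), here's how you might use the current model via the standard OpenAI Python SDK to triage candidate training text for the next one:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ("keep", "satire_or_joke", "junk")

def label_training_snippet(snippet: str) -> str:
    """Ask an existing model to triage a candidate training snippet for the next model."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"Classify the user's text as exactly one of: {', '.join(LABELS)}."},
            {"role": "user", "content": snippet},
        ],
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "keep"  # fall back conservatively

print(label_training_snippet("The moon is made of cheese, obviously. /s"))
```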
It's way past time to upgrade Amazon Echo, Apple Siri, and the Google Home smart assistant. I've used Echo Buds headphones for years. The sound isn't the greatest, but being able to ask for any song or ask any question hands-free has been great. Having a super smart assistant in there would be incredible.
Where does one go to peruse all the examples discussed in this walkthrough? Where does one go to experiment with gpt-4o now?
There is another weird thing about the image at 14:26. The man is "writing" on the board, but the chalk in his hand is in the middle of the text on the chalkboard, not at the end. The image would look more natural if it were placed at the end of the sentence, as if he'd just finished writing it.
I thought it odd too, but you could argue he was going to underline that section. But these are very minor quibbles.
@@DeceptiveRealities I had considered that earlier, but people generally underline the most important part, and the word "big" doesn't necessarily seem noteworthy enough to underline. The only other thing I can think of is that the man isn't intended to be writing at this point, but rather pointing emphatically with the chalk back to the three most noteworthy words: "text," "pixels," and "sound."
Exciting capabilities for content creators and storytellers, especially with image generation. The BIG question is who owns the copyright: is it in the idea in the text prompt and thought process, or in the production?
I just asked GPT-4o to generate an image and asked if it used DALL·E; it said it still uses DALL·E. Could be because the full version of it isn't rolled out yet, I guess.
Image generation is not even included in GPT-4o API yet.
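That matches how the API is split right now. A minimal sketch, assuming the OpenAI Python SDK v1.x: gpt-4o handles chat (including image input), while image generation still goes through the separate images endpoint backed by DALL·E 3.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Text (and image *input*) go through chat completions with gpt-4o...
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Describe a cozy reading nook in one sentence."}],
)
print(chat.choices[0].message.content)

# ...but image *generation* is still a separate endpoint backed by DALL·E 3.
image = client.images.generate(
    model="dall-e-3",
    prompt="A cozy reading nook with warm lighting, watercolor style",
    size="1024x1024",
)
print(image.data[0].url)
```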
@25:22 I agree with you here
Would do some fact checking that the answer is correct though!
I wouldn't necessarily say they were "hiding" these features from us. They made a detailed blog post about them at the top of their page 😅