If a child was shown only red balls and blue squares and then one day he was shown a red square, chances are that his brain/sight would correct the shape or the color to fit his expectations. There are a lot of adjustments going on when we see something. Obviously the technology behind the human brain is way more sophisticated than that of our generative AI but the data poured into the human brain, through our five senses, is insanely more important than a few billion parameters.
This is missing a big part of the picture. In dreams you can fly, objects can jump around, etc. Your human mental model doesn't have to conform to cause and effect or strictly to physics *unless* you are using it to interact with the real phenomenal world, and then your body uses feedback mechanisms to correct the model to conform to what is actually happening in external physical reality. It doesn't matter if AI-generated videos hallucinate impossible things. It's a feature, not a bug, in a world model. Your world model has to be close enough for government work, not perfect.
You are getting it wrong. The models have world-shattering capabilities, even without being "really" intelligent. The failures of "doing it from memory" instead of "doing it by logic" are evident in mathematical problems. However, the whole infrastructure changes with systems like AlphaProof, which works in a symbolic language and won a silver medal at the Math Olympiad.
Yann LeCun has been saying this for 10+ years already. They have tried training on video to build a model of the world for many years. It didn't work, and it certainly won't work with synthetic video.
Unbelievable. This is SO basic if you've ever really worked with Neural Networks or even just simple regression stats! 1. INTERPOLATION can be reasonably dependable if you cover the input data space well. 2. EXTRAPOLATION is much harder. Rule of thumb: the further away from covered input, and the more complex your models, the worse things get. How on earth would an AI model know that form is more important than color? WE know that, because we have context. The model can simply minimize its error by complying with the known input. That's why LeCun et al think they can 'correct' this by adding context. The wall they'll hit is... what IS this context that you need?
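A toy illustration of that interpolation/extrapolation gap (my own sketch, not from the paper or video): fit a small network on a smooth function inside a covered input range, then query it far outside that range. The setup and numbers below are purely hypothetical.

```python
# Minimal sketch: a small MLP fits y = sin(x) well where the input space is covered,
# and degrades badly once you leave that region (extrapolation).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=(2000, 1))        # well-covered input region
y_train = np.sin(x_train).ravel()

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=0)
model.fit(x_train, y_train)

x_in = np.linspace(-3, 3, 200).reshape(-1, 1)        # interpolation: inside coverage
x_out = np.linspace(6, 9, 200).reshape(-1, 1)        # extrapolation: far outside coverage
mse = lambda x: float(np.mean((model.predict(x) - np.sin(x).ravel()) ** 2))
print("interpolation MSE:", mse(x_in))               # small
print("extrapolation MSE:", mse(x_out))              # typically orders of magnitude larger
```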
I've been playing around with layers of abstraction using LLMs, where the LLM makes notes (commentary linked to the original text or other abstractions) and then uses them to make changes to other abstractions or the original text. This helps things stay on track when handling complex / long agentic tasks. I wonder if there's an equivalent for video.
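For readers curious what that loop can look like, here is a rough sketch under my own assumptions; `call_llm` is a hypothetical placeholder, not a real API.

```python
# Rough sketch of "notes as a layer of abstraction" over a long task.
# call_llm is a stand-in stub; swap in any real chat-completion client.
def call_llm(prompt: str) -> str:
    return prompt[-200:]  # echo stub so the sketch runs end-to-end

def process_with_notes(chunks):
    notes = ""                    # running commentary linked to the original text
    outputs = []
    for chunk in chunks:
        # update the abstraction layer first...
        notes = call_llm(f"Existing notes:\n{notes}\n\nRevise them for this passage:\n{chunk}")
        # ...then let the notes constrain the actual edit, keeping long tasks on track
        outputs.append(call_llm(f"Notes:\n{notes}\n\nRewrite this passage accordingly:\n{chunk}"))
    return outputs, notes

revised, final_notes = process_with_notes(["chapter one ...", "chapter two ..."])
```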
I was an AI product manager for GE Software and I post videos on how AI actually works-including Gen AI and its shortcomings. These are not surprising results. You are absolutely right - it’s the core architecture that is the problem. Gen AI works by using statistical distributions in its training data to do guided semi-random retrieval of data and does extrapolation from there. This cannot avoid hallucinations and you have a great explanation of why hallucinations happen. Gen AI needs a lot of double-checking and fact-checking, including and especially text Gen AI programs like ChatGPT.
But human brains have the exact same problem, including hallucinating. If you truly are an AI engineer, I humbly direct you to some cognitive science / cognitive psychology subjects. From my perspective, the fact that we encounter problems of this nature with current architectures shows (not disproves) that we are on the right track. It seems to me many AI researchers need to catch up on what cognitivists have gathered in the last 30 years, and also on some modern neuroscience. The fact that LLMs lost the computer's perfection and make mistakes is proof we are going in the right direction - there is no real creativity without hallucinations and mistakes - Henning Beck has a nice book on this (flawed, but it shows the problem in a simple way).
There's a massive difference between an LLM and a visual generation AI. Within language, humans have already encoded and implicitly shown the relationships between the representative units (words). There is almost zero such relational data in visual data units (like moving squares). Do not conflate LLM-style AI with video generation AI. Max Tegmark recently put out a paper that highlights the relational geometric nature of the information in an LLM, which represents implicit aspects of the world, but it will miss the things learned in the childhood phase that humans never need to discuss at length because of shared human experience. There are explicit geometric relationships between concepts in an LLM that are similar: "Queen" is close to woman and "King" is close to man, but those two sets are far away from each other in the actual dimensional information encoded into neural networks trained on language. Those relationships are in language, not in physical objects viewed through a camera.
All this says is they overfit an ill-suited model. The whole point of "good ML" is making a model generalize OOD, which isn't only common, but was shown for LLMs on this very channel.
LLMs are not “Language Models”, that is a misnomer. Scaling laws, scaling laws, scaling laws. We don’t know where the end is here, especially now that we have confirmation that we’re now training on both ends (pre / inference).
This could be similar to something noticed with text to image models. They can have difficulty imagining two very different things in the same scene. Our test of choice is a combination of say Henry VIII and an alien visitation. Images based on the Tudor era have a style, as do images with extra terrestrials. Ask the model to combine the two elements without any specific direction and the model gets confused
It was clear what those models are and what the limitations are. It was clear that the recent advancements mainly came from improved computation power; the theory itself is already more than 50 years old. We likely won't see AGI emerging from those models alone.
Humans can't go out of distribution either, but thinking creatively is still tough - it's hard to break free from the usual patterns, or maybe we'd be on a whole other planet, even in another galaxy.
Yep. Just look at our art, or at the aliens humans have invented. All the aliens/monsters we invented are just variations of earthly creatures. Then you look at some of the petrified remains of prehistoric oceanic lifeforms and you see how, even on Earth, life was at some point more alien than whatever alien we can dream up. Also, creativity is making errors, hence why creative people have many problems maintaining a "normal life". They just tend to do everything differently from others, which from time to time is a useful variation - then you have a successful artist. Most creative people are just considered "weirdos" though, and live lives of poverty, chaos and wrong decisions.
For like 15 years I've told people that putting a 2-year-old in a room with 1000 screens and pumping them with info doesn't give you a teenager that knows how to tie its own shoes. They might know every episode of Bold and the Beautiful by heart, but they won't understand how to dress themselves, etc.
The reason AI has hit a roadblock in games is that the more complex models are not efficient in real time. You can simulate lots of stuff in theory, but when it comes to bringing it to the real world, the infrastructure is the key problem. AI can potentially do anything IMO, but as long as we fail to create the appropriate infrastructure for the tech, it will always be a novel trick without real implications for the world. Humanity needs to focus on real economics that have a tangible purpose, rather than beating the value-maximisation drum. It's the same as green tech: nobody seems to care that our targets are not aligned with what is actually achievable in terms of production. Selling the illusion of green tech is enough to generate value and maximise revenue. Marketing over product. We need to return to a path of creating useful products that bring meaningful innovation, instead of just generating wealth for a small percentage of us. As long as that doesn't change, AI will always remain a bubble. There is no point in inventing and building trains if we are not maintaining the tracks they drive on. This study makes some points, I guess, but is really just marketing by the look of it.
Why can't they just do what elementary school kids do: teach it left from right, make sure it understands a 3-D world, and remembers forward, backward, up, and down? A kid doesn't have to know Newtonian physics to know that when they push a ball, it does one of four things: forward, backward, left, right. Video data is useless unless it understands a 3-D universe; with video learning alone it will still see from a 2-D perspective. It made extra hands and fingers because it looked at the world in 2-D and was trained to see spontaneous creation.
there is a 5th characteristic required for intelligent behaviour: the ability to feel. Without the discomfort of pain or the pleasure of joy, AI will never reach the levels of human achievements.
Doesn't seem relevant here at all. Feelings are mostly evolved affects which are mostly involuntary and automated. If anything, they would cause unnecessary errors. We want brilliant tools, not frustrated creatures. Other than that, there is also the qualia problem. Those are indescribable, and even we humans, despite mirror neurons and all the empathic circuitry, cannot know what others feel when they feel, for instance, warmth or cold or wet or love or regret.
@@tomburnell8453 Yes and no. Feelings are part of a simulation of reality your brain presents to the observer process running on your brain's hardware. So feelings already happen inside the simulation we call our personal perceived reality. The problem is that, just like we have trouble looking into an AI's internals, we cannot decode one's internal thoughts/feelings/perceptions, which means we have no idea what it actually is and how it works, and even whether the qualia part is the same in all humans or even similar (the "feeling" (quale) of seeing red can for one person be exactly the same feeling as for another seeing red). There is no way to know until we have a 100% simulation of a human brain.
@@romanhrobot9347 Maybe good art comes from suffering and brilliance arises from frustration? Those "unnecessary" errors may be required to find the solution.
I find it particularly interesting that the model has not learnt that objects never change shape, which should also be retrievable. I'm not convinced that this isn't an artifact of an insufficiently large neural network. But it will be interesting to see what others do to confirm or contradict the conclusions.
Look at image and music generation. Right now it is so good that it barely makes mistakes and creates new things that never existed before, and they look and sound really good. It does not need to be perfect, it just needs to look realistic.
21:00 "Take a video of piss womb." Now that's thinking outside the box. Or maybe in the box...if you know what I mean. Anyway, crazy creativity is definitely what's needed. Can't deny that.
Personally, it obviously was never about the model understanding the physical world; it's mostly about an illusion of such. For me, as an artist, AI has to make something I envisioned, not something real. When you get results close to reality, or surreal but close to what you expected, you don't need more for a long time to impress people and occupy their attention. Add some control instruments to it and the narrative is set, for deeper immersive engagement with what you are creating. In defence of AI's visionary advance, even if AI right now can't simulate perfect physically dependent worlds, we will reach the deeper goal by the "fake it till you make it" principle and compute optimisation. "AGI" does not need a perfect understanding of the physical world; people don't have ideal vision of things, spatial attention, or all the data for prediction. The first "AGI" models will have an "illusion of perfection", just because people are actually way less capable in a lot of their sensory and predictive abilities than they think, and because it will "do the job" as it is. For me, physical understanding of the world will come exactly as an emergent illusion, rather than a perfect representation of reality, just as any other industry appears to develop in the human world. And after that, "AGI" will find a way to reimagine itself, by creating a physically accurate structure for the next iterations of itself using a simple process of exclusion. Right now, I think, there is not enough compute, and optimisation of it, to reach ideal physically based computation while running a real-time generative model.
Funny. It has been known for more than 100 years that you can't learn causal relationships (i.e. laws of physics) just by observing; you have to do experiments. Statistics only give you correlations. It stuns me that OpenAI assumed neural networks would magically achieve something that has been proven mathematically impossible.
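A toy illustration of that observation-vs-intervention point (my own numbers, nothing from the paper): a hidden common cause makes two variables correlated in purely observational data, and only an intervention reveals that one does not cause the other.

```python
# Confounding sketch: Z drives both X and Y, so X and Y correlate when merely
# observed, even though X has no causal effect on Y. Randomizing X (an
# intervention / experiment) breaks the correlation.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)                      # hidden common cause
x_obs = z + rng.normal(scale=0.1, size=z.size)    # X caused by Z
y = z + rng.normal(scale=0.1, size=z.size)        # Y caused by Z, not by X
x_do = rng.normal(size=z.size)                    # do(X): set X at random

print("observational corr(X, Y):", np.corrcoef(x_obs, y)[0, 1])   # ~0.99
print("interventional corr(X, Y):", np.corrcoef(x_do, y)[0, 1])   # ~0.0
```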
Large models can make mistakes - sometimes even huge ones - but that’s how OpenAI and companies like it improve them. Without that process, we wouldn’t have seen innovations like AlphaGo, AlphaZero, and others.
Ok, I have an idea. What if one were to feed a model a dataset comprised of basketball footage, but include only footage of missed shots and turnovers? Next, include an additional data packet with the statistics from various NBA games. Finally, have the AI simulate entire basketball games based on the data inputs. This would, in theory, test these model structures' capability to simulate physical reality, would it not?
Also, even if it takes 20 years to get to AGI, just the AI we already have is poised to revolutionize the world and will likely have deep effects on employment. Even if the AI can't do it all, it can still reduce a team of 10 to half that size.
Argh, that is sort of logical at this stage. Assuming Sora uses "only" a transformer-based model, it just interpolates over training data with a certain minimal margin of variance. If the model were sort of agentic, things could change. Let's say the model would generate a prompt per frame of video, including agentic input about the physics; then the transformer could generate video outside of the training data correctly. It is one thing to generate text that follows an overall simple structure via next-word prediction. It's a miracle that transformers can generate images, let alone video, at this level.
Boh, there could be worlds where red stuff becomes circular by moving. If there was no example disproving it in the dataset, then we cannot expect the system to learn it. They should make a dataset where something in the test has to go against what the dataset alone could teach.
It's funny how so many people say "gotcha, LLMs won't reach AGI." Who ever said that? OpenAI? Never seen it. They have said: we have LLMs, look at all the cool things they can do, and we will try to use them to make an AGI. I'm pretty sure today's LLMs could explain the difference.
I don't like the way they tested OOD. For example, if they had used several different colors *except* red, instead of a single color, and then tried red, their finding would be more valid.
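A sketch of the kind of split the commenter is describing (my own assumption of how it could be set up): hold out one colour/shape combination so the model has seen the colour and the shape separately, just never together.

```python
# Leave-one-combination-out split: train on every (colour, shape) pair except
# ("red", "square"), then evaluate only on the held-out pair.
from itertools import product

colors = ["red", "blue", "green", "purple"]
shapes = ["circle", "square", "triangle"]
held_out = ("red", "square")

train_combos = [combo for combo in product(colors, shapes) if combo != held_out]
test_combos = [held_out]

print(len(train_combos), "training combinations; held out:", test_combos)
# The model has seen red things and square things, just never a red square,
# which is a fairer probe of whether it learned that colour and shape are independent.
```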
That's kinda expected. We need a different AI approach to train the world model first and only after that we can dump the massive amounts of text and videos onto them.
We are going to see Unreal Engine-like models that allow some text-to-animation features, stylized with AI image generators frame by frame, that can be turned into a video. Bottom line, I think this would have to start with physics built into the system, given that a multimodal model will always have scale limitations. It is not that far-fetched for Unreal to learn to take in an image, identify the objects, use an object library to place the objects, and then run the model I mentioned. Image-to-video and even text-to-video are still possible with this concept… I'm sure it's coming eventually.
AI is only competing against human intelligence. Ask most adults to draw a stick figure cartoon strip of ten images and not only won't they get the human anatomy right, they won't have a chance in hell of getting even basic physics correct.
This is a Trojan horse from China. ByteDance is the company that owns TikTok. This is how they ask their questions, so they can find out more of OpenAI's inner secrets. They put OpenAI under pressure: "you're a liar". Then some employee will leak a little in their comments, Johnny Apples leaks another little, and from these bits ByteDance will put together the puzzle. Then they'll be closer to knowing how Sora works and how to make a competitor model.
This and a few other papers have moved the needle a little for me. A little. But they have holes. This is a very small model, comparatively, and it has never been taught that it is possible and advantageous for it to make guesses about OOD data. Plato's cave.
This makes all the sense to me. For us it may seem obvious what is expected in the next frame, but for the AI there is no information, explicit or implicit, that would clearly define what it should predict. So when it predicts something else that fits the input data, it's still a valid solution to the problem from the AI's perspective. We as humans just cannot grasp that these things that come naturally to us require knowing something we don't realize. So to me this doesn't prove that AI is not 'intelligent' or whatever you want to call it. It's just that things are more complicated.
lol OpenAI will probably come out with a diffusion chain-of-thought inference-based video model with insane inference scaling laws, so the model thinks about every video patch in relation to all the others like an o1 gold-medal maths olympiad contestant...
I remember a lot of research coming out about a year ago showing that models which are trained on general information sets actually outperform specialized models in their own specialization. What they built here is a HIGHLY specialized model. I wonder what would happen if they used a wide variety of complex shape and color patterns moving in a wide variety of ways, with significant variation, to allow the model to learn the abstract principles of categories and to determine which categories impact others and which don't. In this example, the category of color was always paired with the category of shape, so OF COURSE it could not generalize a deviation from that. But if you give it a distribution that makes it evident that those categories are fundamentally unrelated, then the model should be able to predict more accurately. IMO this is just lazy research.
The best generators in the industry may grasp concepts, but they can't consistently create images or videos without putting a floating hand in the backdrop or morphing one being into another. You can ask the model, "is it normal to see hands float in mid-air" or "is it possible for a cow to instantaneously shift into a horse", and the model will be intelligent enough to tell you "no, that defies the laws of physics." Yet, despite being intelligent enough to know better, it still has a good chance of creating an eldritch horror abomination when you simply ask it for an image of a puppy, or of morphing two people's faces together when you ask it to generate a video of two people kissing. AI, AGI, and most top LLMs on the market today have intelligence for sure. What they lack is a world understanding where they can see and understand in detail this endless amount of data they know. They may know something is correct or know something is supposed to be a certain way, but they lack the understanding as to why.
I wanted to provide some feedback on your videos. I watch almost all of them and I just wanted to share my thought. I watched a video the other day talking about Intentional Breaks when speaking. These breaks give people a chance to process what you're saying by giving them a quick break. I think if you can practice intentionally taking a half breath, and being comfortable with a short pause, your videos will improve a lot.
I do like your videos; however, one major issue I have is that you really need to normalize the audio. You go from super excited to flat (or using AI) back to super excited, and your volume goes up and down with it as well. You REALLLLLLY need to get an editor. Other than that, I enjoyed the video. Thanks for the info.
It is known already. Please be advised that real time multi modal multi sensory data input is required for real world training. Only extrapolated perplexity can come from static interpolated data. It is 100 percent incorrect to suggest that symbolic and probabilistic LLM reasoning engines correlate with what is suggested here in any way. Real world extrapolation requires real world sensors for a time. 🎉 Just like humans.
Nope nope nope - not there yet. This is a step. But what is needed is a rich sensory environment. It needs to feel gravity dragging on it. It needs to feel friction. It needs to feel texture. It needs to hear chalk on a blackboard. It needs to taste bitter apples and sweet pears. It needs all this and more before it can make more than abstract sense of what is shown in pictures. Until its environment for growing is much richer, I'd never trust it as a left fielder in a baseball game. Can it catch that little ball? Can it learn the clues to catching it reliably? All we have in LLMs is a fancy relational dictionary. It could easily spit out the equations of flight for that baseball. But it has no idea what the equations mean, or what "keep your eye on the ball" means or how and why it works. {^_^}
If an AI has only seen red circles going right and blue squares going left, where is it supposed to get the authority to assume that red circles can also go left? If you have seen, your whole life, only apples going left and oranges going right, where do you get the authority to assume that apples can also go right? Well, easy: 1. you can take an apple and throw it to the right; 2. you have seen a billion other objects going in all sorts of directions, and since an apple is an object, you can safely assume that apples can also go right. None of this is true for an AI. It cannot do experiments in the world, and it has not seen enough examples to feel safe generalizing. Even Yann LeCun says that AI can in principle do this generalization, it just needs way too many examples. So what we have to do is simply speed up this process somehow.

A baby that sees an apple going right for the first time is surprised and delighted, because learning something new about the real world is joyful; it feels good. Maybe that is what we have to model in AIs. If there is a measure of new and surprising information, the learning process should focus on that, maybe with a kind of attention mechanism on newness. A baby would then take that apple and throw it to the right multiple more times just because it is so fun. This reassures and strengthens the theory and observation that, oh, apples can also go to the right. The baby does that until it feels boring; that's when it has learned the new rule. And that's what we should also model with an AI that should learn about the real world.

We should see this video generation model generating a red square that turns into a red circle not as hallucinating, but as doing an experiment. The model doesn't know the rules. Maybe there is a rule that objects can change shape? Why should this rule be any less probable than circles only going in one direction? There is not a hierarchy of reasonability of possible rules. Maybe such a hierarchy can be formed, but only with many more examples. So the way to react to such an experiment is just to tell the model: no, that's not how it works. Then it can learn from that. Of course an AI embodied in a robot in the real world, that can do these experiments and have the real world as a teacher, would be much better, since the real world gives these answers for free, every time correct and consistent.

We should research what happens in AI models when they see new stuff like the red square going left. Are they surprised? How is this new rule formed? How is a generalization like "all objects of all colors and shapes can go in all directions" formed during the process of seeing all the examples? And then maybe we can find a way to speed up that process with a sort of attention mechanism. A biological organism with a neural net in the real world of course only retrieves data that it has learned in the real world, because anything else would quickly lead to its death. A frog that wants to catch a fly retrieves data about the flight paths of flies it has seen in the past in the real world. If it hallucinates, it does not catch the fly and dies of hunger. And since this is how the human brain evolved, the human brain also mostly retrieves data from its huge database. Original ideas are very rare. With science we have created a safe space, a playground, for this. But for most organisms with neural nets this doesn't work. Of course the frog is not predicting every pixel on its retina; it is only predicting the path of the fly.

Well, in a way it is predicting everything, but it is predicting it to stay the same! If the frog suddenly predicted the stone turning into a snake and the grass turning into fire, it would never be able to catch the fly. So it is predicting the surroundings to stay the same, just like you are predicting the wall to hold up and not suddenly crumble with a horde of zombies behind it attacking you. So of course real-world prediction is possible; it just has to include that most of the stuff stays the same all the time, and only some interesting stuff changes, which we need to focus our attention on. And what we also need is constant learning. Let me tell you something you have probably never thought about: we don't even know how it feels to not constantly learn! In fact, if the constant learning were switched off somehow, we would probably start hallucinating immediately, just like LLMs do. And in fact we do, namely when we are sleep deprived. People who haven't slept for a long time start to hallucinate. This is because their constant learning doesn't work anymore, because we need sleep to learn.

So in summary, the TL;DR: I don't think we need a completely different architecture for AIs to be able to form generalizations. Just maybe another addition, like a sort of attention mechanism on new stuff, to speed up the process that is already possible but currently takes way too many examples. We need a way for AIs to constantly learn, maybe with a sort of sleep mechanism. It has to be clear that there is a real world, where hallucinations are not helpful, and a dream world, where it can happily hallucinate all it wants, make experiments in its mind without bad consequences, and rearrange stuff while doing it. We need embodied AIs in the real world that can do experiments and feel pain when they hit their head (and so discover that objects are actually solid) or touch the hot plate, and joy when they discover that apples can not only go to the left but also to the right. The main point of this video, that generative AIs only retrieve stuff from their database, is also true for most organisms with a neural net in the real world, including humans, because anything else is not helpful in most cases.
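One crude way to model that "attention on newness" idea (entirely my own sketch under my own assumptions, not something from the paper or video): weight each training example's contribution by how surprising it currently is to the model, similar in spirit to prioritized experience replay.

```python
# Sketch: per-sample prediction error acts as a "surprise" signal, and surprising
# examples get proportionally more weight in the update.
import torch

def surprise_weighted_loss(model, x, target, loss_fn=torch.nn.functional.mse_loss):
    pred = model(x)
    # per-sample error = surprise (mean over all non-batch dimensions)
    per_sample = loss_fn(pred, target, reduction="none").reshape(pred.size(0), -1).mean(dim=1)
    # normalize to mean 1 and detach, so the weights themselves are not optimized away
    weights = (per_sample / (per_sample.mean() + 1e-8)).detach()
    return (weights * per_sample).mean()

# Tiny usage example on a throwaway linear model.
model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = surprise_weighted_loss(model, x, y)
loss.backward()
```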
There are other papers which say, essentially, "AI doesn't generalize at all; all it can do is repeat patterns it has seen before". One could interpret this as "we are going down a path that will never result in AGI". But I see it as: "this is an alien intelligence that learns by example, not by generalization; it thinks in a different way that is far more powerful in some ways and far less effective in others". Did you really think an inorganic brain that learns by reading every book plus the entire internet, in less time than it takes you to learn how to play guitar, would be just like you?
A simple criticism of the article: you cannot study what are supposed to be emergent properties if you drastically restrict the dimension of the training space. If I train a model on 1 billion examples of red circles that fall at 100 mph, that's the only world it thinks exists; the laws of nature for the model are just "going 100 mph", so why should it forecast 10 mph for a new situation? How dumb is that? I wonder how a human could correctly forecast a phenomenon that defies all previously known laws of nature 😅
I’ve been saying this from the start of the hype. There is no sign of intelligence; they hijacked the word. LLMs are at best ‘go fetch’ and ‘Simon says.’
9:50 Given the result they got using a red square turning into a circle, I wonder what would happen if they used a green triangle. Would it not have moved? Would it change the color to either blue or red or anything else? A human would think shape and color have no meaning, as the data showed two different shapes with two different colors moving to the right; the similarity is a moving part, so a green triangle would move at a constant speed towards the right. Very curious what an AI would do with that.
I have not read the paper, but if training starts from an initialized model and simply uses these narrow examples, I would expect nothing more than what we have seen. Effectively you are testing the world model of a 2-month-old child, so of course you see these effects. A lot of the model's understanding does not sit in the parameters themselves, but in the gradient (and its derivatives) over the parameter distribution. To create a model of the world, you have to train the model with all sorts of shapes and directions. Also, the dynamics you use have to be logically/predictively consistent. Two shapes traveling in different directions based on shape or color might not result in a predictable pattern. Also, we have no idea what 'out of distribution' means for humans, since every teenage person has seen petabytes of data.
AI knows what it learned. AI doesn't know what it wasn't taught correctly. Obviously obvious. Humans can understand, or they can memorize. Memorizing is easier, but not right.
Haven't developers been saying this for years? It's skewed training data. I've noticed with Gemini AI that it has real problems when it hits a template answer (many of which are based on highly politicized issues with a strong left-leaning bias) that does not match the training data (which is not uncommon for left-wing political ideologies that are not founded in science, but suffer from tunnel vision); it is then required to hold two separate, contradictory ideas in its mind simultaneously. This causes a lot of problems too.
The real achievement here was making a 2 minute explanation into a 24 minute video
xD
You must be new here
I can shrink this video into one word : Duh!
"OMG it's just advanced retrieval and not really smart!"
... Welcome to College.
Undergrad for sure. But then again... a lot of things can just be seen as "advanced retrieval." You could even argue anything other than completely novel ideas, or independent re-discovery of an idea is a level of retrieval.
"emergent properties" could not be further from "advanced retrieval" if it tried.
@@2299momo That's the craziest part of all this: threatened humans with great memories and no creativity somehow keep inventing new relative metrics, when two years ago ALL OF THIS WAS SCIENCE FICTION. I'm pretty sure all the people with the most hubris about AI are gonna be like dogs to the rest of us soon. I'm a superhuman now. I'm a lawyer. I've made doctors double-check something because they didn't think of it this year. People like me have the most agency now. Why? I know I'm not capable of advanced retrieval. I know what good looks like, though.
you could argue the brain is partially just that.
@Vurt72 it's worse. We have high RAM. Our information retrieval is a complete enigma with no commonality across families, communities, or the human race.
It’s funny when you think about it: many people argue that language is what has enabled us to become “intelligent.” Yet, some of these same people argue that language models themselves can’t be the key to creating intelligence. They might have valid points, and they might be saying things that are technically true, but I think they’re missing the larger point. We ourselves don’t fully understand what intelligence actually is, and we don’t all agree on a precise definition. And, really, who cares? The question isn’t whether we’re creating a true world using exact physics; it’s whether we’re creating a world that behaves in a certain way.
Some critics seem to be trying to judge these models by whether they take the same paths we do, like relying on physics engines in games. For example, if you have a VR game that’s incredibly realistic with a physics engine, it doesn’t “understand” anything. But it achieves the effect. So, I think these critics are missing the point: many of their arguments, even if technically correct, can be dismissed with a simple question, namely, does it accomplish what we intended? It’s almost that simple.
Consider this: I don’t fully understand the physics of walking, yet I still walk. So, what exactly would their argument be there?
Exactly... The problem here is that math-oriented scientists tend to have a very, very bad understanding of human psychology and internal thought; also, for many it's a philosophical fight that has been ongoing since at least Ancient Greece, which is dualism vs monism. I remember watching Joscha Bach being so excited about his findings about human perception based on his work with AI... Basically he arrived at a model that psychology has had for decades now and that is available in any cognitive science 101 book.
You both are the two smartest people on the internet, you guys get it. ChatGPT accomplishes everything I ever wished for, hence to me it's intelligent lol, as it does the tasks I give it intelligently, with almost no error, at a human level. That's enough for me. Whether they're sentient, that's another question. But calling LLMs ONLY statistics is diminishing them for their own agenda, I feel...
OK, look. If an AI model only sees red circles and blue squares in its training data, you just taught it that red objects are always circles and blue objects are always squares, and it learned that, and it thinks that that is a physical law. It is not unable to learn physical laws. It is just learning exactly what it is taught, and if you teach it things that are wrong, it will believe them, and that is the fault of the teacher, not the student.
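To make that point concrete, here is a toy sketch (my own illustration, not the paper's actual setup; the direction labels are just illustrative) of a training distribution where colour and shape are perfectly confounded, so nothing in the data can teach the model which attribute the rule really depends on.

```python
# Every red object is a circle and every blue object is a square, so "red square"
# has literally zero probability under this training distribution - the model was
# effectively taught that colour determines shape.
import random

def sample_training_example():
    if random.random() < 0.5:
        return {"color": "red", "shape": "circle", "direction": "right"}
    return {"color": "blue", "shape": "square", "direction": "left"}

dataset = [sample_training_example() for _ in range(10_000)]
print(any(d["color"] == "red" and d["shape"] == "square" for d in dataset))  # False
```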
This is such an underrated comment!
Together, the Teacher and the Taught create the Teaching.
I can see humans getting it wrong as well 😂😂
It's not that the things are wrong "as such", it's just that AI is not able to distinguish abstract from specific. It sticks to specific facts and treats them as the ultimate truth and cannot deduce that there might exist something else. Also, it cannot imagine things. When children learn about gravity, they can imagine and even draw what might happen without gravity before they actually learn about space. An AI neural network cannot imagine "anti rules" even when prompted for it.
No, it doesn't think anything; it's just simple pattern matching against a database of known patterns. It's not learning anything, nor is anyone teaching it. I would call that analog computing instead of "artificial intelligence". It's a machine.
An actual human would be able to infer, almost instinctively, that the correlation of the two axes of information might produce more variants. A red apple and a purple grape are concepts on their own; a human wouldn't mix things statistically and create a red grape or a purple apple out of nowhere in their mental model, and if you ask about a purple fruit, the human is probably going to answer grape, not apple.
We know that because it's literally what babies do when they learn such things. Also, single-shot learning is enough for a human baby; they don't have to see pictures of every apple in the entire world.
Nothing new here, this is what plenty of people have been saying about LLMs forever.
Also, they overfitted a small model; this basically says nothing and is exactly what we would expect.
people are wrong though. LLMs have significant potential beyond that
Yeah, I'm really surprised here that people are taking more from this than is possible. The entire issue with understanding Neural Networks is that once they've been trained on a large variety of things, it's no longer possible to discern what they're doing.
If you use small models like this with very simple data, then it definitely looks like it's just retrieval. The question is "With enough data, does a new property emerge?" which will not be answered by these small sorts of examples.
Correct; see the ARC-AGI challenge.
Thanks for saving me 24 minutes. I was gonna watch the whole video and suspecting the paper's full of shit. lol
Yeah, even Apple published the same results two months ago, but using only the text part from the LLM. Even then we already knew this 😂
It's not retrieval. Training a tiny cartoon model and overfitting it on solid colours that only move left and right doesn't prove that it's incapable of learning the physical laws, it proves that you're incapable of creating the dataset and hyperparameters that would allow it to do so.
This is nothing new either, this epiphany is something every beginner learned 20 years ago.
Obviously we know that the neural network in a security camera has learned the general features of faces, because otherwise, when you have an intruder, it simply wouldn't be able to detect that novel face that wasn't in its training data, and because we've looked inside those models to literally visualize those features it learned.
However, you can do what they did here and train a face detector to overfit on specific faces, and then it won't learn the general features of faces, it will only be able to detect when 1 particular face is present.
You can also overfit on certain face shapes. If all the faces in your dataset are cartoonishly round, then it's going to overfit on round faces and assume things about other face types that are only relevant to round faces. This doesn't prove that the properly trained model isn't learning how faces work and doesn't say anything about the architecture in general. You need a diverse dataset that represents real world physics for these complex abstractions to emerge.
It's like only ever teaching a child about addition and then when they fail a multiplication test you say "aha! see! they're incapable of learning math because they can't generalize".
I think this is already understood widely. If you train a model on video, it will “understand” how pixels change in videos. Video is just a visual representation of the physical world. Would you train a baby to walk, talk and function in the world by sitting it down and getting it to watch videos? Multimodality is what we “train on” as humans, why expect anything less from an a.i. system?
are we not currently in the training simulator for humans? it would seem the most obvious conclusion to reach based on what's happening currently and what that will look like in 10-20 years
Because the biological way to learn is not the only way, nor do the same laws or physical constraints apply; otherwise, why didn't biological beings evolve with wheels?
@@ADRIFTHIPHOP - This would explain a lot of things. Hehe.
We train babies on input from their eyes. We train AI on the input from their cameras. There's no difference. Apart from: you can train one AI model, and all the copies of it have the knowledge.
@@davew9615 - ah, yes. The sensory deprivation tank with only a video screen for the baby… nothing could possibly go wrong with this youngster! 😉
You need to stop the okay bring back the pretty pretty
Truly truly REALLY!
Yeah. I honestly preferred the couple of weeks when the voice was generated over the Okay epoch.
I got pretty drunk once with it
hahaha
Essentially
In the example where the red square turns into a circle at 09:45, what question did the researchers ask the AI beforehand? Did they ask anything specific, or did they just present it with a short video with no guided prompting? If no prompt was given, why can't the AI's response be interpreted as "A red square must be a mistake. I've seen how the world works and red squares should not exist. So first things first, I need to correct that and make it a circle."? If that is the only thing it has "experienced", then that is how it 'believes' the world should be, so it fixed it. Not unlike Plato's Allegory of the Cave. If the model was trained on more diverse data, would it make the same choice, or would it do as the researchers expected and keep the square a square?
I'd love to see that follow up - if the model had still never seen a red square, but had seen green and purple ones, would it have learned that color isn't necessarily predictive of shape and accepted that a red square is possible even though it had never seen one?
Sora understands physical law as it pertains to the input it has been given. You first have to define understanding properly. Why is this news? This is how our own brain works. Why do you think that Aristotle, who was definitely intelligent, believed that the speed of a falling object was proportional to its weight, i.e. heavier objects would fall faster than lighter ones? He had a limited set of training data, and for him this was the sensible conclusion. That doesn't mean he hadn't inferred a general rule, just that he inferred the wrong rule.
Would SORA infer the Schrödinger equation if it was trained on video of the double slit experiment?
Maybe not wrong, and instead less nuanced? With that in mind I suppose we'll never truly know what it means to be "right"😅
No, it doesn't understand physics. It's just a cheap magic trick.
Sora takes the prompt, expands it to describe the scene in greater detail, and compiles a video from training data.
Your deep philosophical interpretation of a simple programming process is out of place here. Sam Altman wants you to think that a simple, primitive magic trick is "real magic".
@@meandego I also believe that it is "just a cheap magic trick". Keep the investors investing. ✌ ✌
@@meandego Understanding is recognizing patterns and applying those patterns. You have to stop looking at AI as some kind of threat or competition and understand that It is a replication of our own brains. Can you do everything that a human could possibly do? I don't think so. Why is it that we have such high standards for artificial intelligence? People like you will die inside when we finally do create AGI because your ego won't allow you to accept that one day you may be beyond inferior to a machine and that scares you because you don't know what the outcome may be. You are an organic machine with programming that gives you the illusion of autonomy but ultimately how did you become the person that you are? what external influences shaped your mind? When you think something or make a decision, how did you create that thought? It had to come from somewhere, right? What is consciousness?
This seems pretty dumb... if all the training data you have are red circles and blue squares, it's not surprising that the model does shit when it sees a red square. It wouldn't happen as the training data becomes larger, and the model extracts general essential patterns from apparently unrelated situations. Scale is the key.
Which is one of the 3 main takeaways from the Chinese authors, who said scaling still works
Basically would be the equivalent to the allegory of the cave by Plato. If the only thing you saw was shadows cast by yourself then that's the only reality you know and accept. You won't create any concepts outside of what you know. Now could we create better ways to make training data? Sure but I think some of it will be coming naturally by having robots that interact with the actual world and learn.
What's the idea?
@@MrWizardGG China is the home of retractions, though. 1/3 of papers.
Exactly, it's not that the model can't do inference. It's that they don't grasp the underlying principle. You could also say their imagination runs wild and falls back to their training data. I am curious: if you trained them on physics first and then did the same, would the outcomes be different?
What is AGI?
To understand it better, try asking yourself: at what age could a child perform this task? This is the mindset you should adopt when considering AGI’s potential. And if you think AI is a deception or a con, it may help to consider the principles of evolution.
Is anyone thinking about what we genuinely need from a superintelligent AI, beyond just what humans might prefer? Do we truly want an AI to be the ultimate arbiter of truth? And if so, who decides what truth even is? Would it be defined by the victors in humanity’s long and complex history? Do we, as a species, need an Oz-like figure pulling the strings behind the curtain before we “awaken” to reality?
How would you say we're doing so far, without AI?
As I read this paper it addresses a classic problem in statistical prediction: any machine learning or predictive model will always be biased towards predicting outcomes within its trained distribution. In simpler terms, models like LLMs are designed to predict the next letter, while models like Sora focus on generating the next frame in an image or video, based on patterns they’ve seen before.
The paper argues that these models struggle with ‘out-of-distribution’ predictions, meaning they’re not effective at identifying what doesn’t fit their learned patterns.
The problem is not just the inverse of predicting the next likely outcome. It's a harder problem, because the model would need to account for an effectively infinite, multi-dimensional space of possibilities outside its training scope. To solve it even halfway you would need something like a "Sora multiple-scenario outcome model", which has not been built yet. That requires either extreme scaling of the existing architecture to cover all scenarios, with the scenario model(s) then folded into the "final Sora" alongside the rest of the data, or a completely different architecture.
I must say I initially tend to agree with the paper's perspective.
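To make the in-distribution bias described above concrete, here is a minimal toy sketch (my own construction, not the paper's actual experiment): a classifier trained on data where color and shape are perfectly correlated has no principled basis for handling a red square.

```python
# Toy sketch: when two features are perfectly correlated in training,
# a model has no principled way to resolve a test point where they disagree.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Encode color: 0 = red, 1 = blue; shape: 0 = circle, 1 = square.
# Training data contains only red circles (move right, label 0)
# and blue squares (move left, label 1).
X_train = np.array([[0, 0]] * 500 + [[1, 1]] * 500)
y_train = np.array([0] * 500 + [1] * 500)

model = LogisticRegression().fit(X_train, y_train)

# Out-of-distribution probe: a red square, a combination never seen in training.
red_square = np.array([[0, 1]])
print(model.predict_proba(red_square))  # roughly 50/50 -- the model is guessing
```

The point of the sketch is only that nothing in the training data tells the model whether color or shape should win when they conflict; a video model in the same situation has to invent an answer.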
any machine learning or predictive model will always be biased towards predicting outcomes within its trained distribution
which is also a sentence that is true for humans 100%
I hate when people use the word "understand" in reference to AI, because these models do not understand in the same paradigm in which we do. When you ask the question "does it understand" and then restrict the model to our basis of contextual and experiential understanding, the answer is always no at this point.
If a child was shown only red balls and blue squares and then one day he was shown a red square, chances are that his brain/sight would correct the shape or the color to fit his expectations. There are a lot of adjustments going on when we see something. Obviously the technology behind the human brain is way more sophisticated than that of our generative AI but the data poured into the human brain, through our five senses, is insanely more important than a few billion parameters.
This is missing a big part of the picture. In dreams you can fly, objects can jump around etc. Your human mental model doesn't have to conform to cause and effect or strictly to physics *unless* you are using it to interact with the real phenomenal world and then your body uses feedback mechanisms to correct the model to conform to what is actually happening in external physical reality. It doesn't matter if AI generated videos hallucinate impossible things. It's a feature not a bug in a world model. Your world model has to be close enough for government work not perfect.
You are getting it wrong. The models have world-shattering capabilities even without being "really" intelligent. The failures of "doing it from memory" instead of "doing it by logic" are evident in mathematical problems. However, they changed the whole infrastructure with AlphaProof and symbolic language, and won a silver medal at the math olympiad.
Who is Lis?
I believe this is the heart of the matter
Yann LeCun has already been saying this for 10+ years. They tried training on video to build a model of the world for many years. It didn't work, and it certainly won't work with synthetic video.
LeCun talking about ~that: ua-cam.com/video/EGDG3hgPNp8/v-deo.html (more specific about intuitive physics and video @ 33:10)
Unbelievable. This is SO basic if you've ever really worked with Neural Networks or even just simple regressive stats!
1. INTERPOLATION can be reasonably dependable if you cover the input data space well
2. EXTRAPOLATION is much harder. Rule of thumb: The further away from covered input, and the more complex your models, the worse things get.
How on earth would an AI model know that form is more important than color? WE know that, because we have context. The model can simply minimize its error by complying with the known input. That's why LeCun et al think they can 'correct' this by adding context. The wall they'll hit is ... what IS this context that you need?
well put
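A quick NumPy illustration of the interpolation-vs-extrapolation rule of thumb above (a toy example of mine, not anything from the paper): fit a flexible model on a narrow input range, then query it far outside that range.

```python
# Interpolation vs extrapolation: a polynomial fit to a sine wave on [0, 3]
# tracks it well inside that range and typically diverges badly outside it.
import numpy as np

rng = np.random.default_rng(0)

x_train = rng.uniform(0, 3, 200)
y_train = np.sin(x_train) + rng.normal(0, 0.05, x_train.size)

coeffs = np.polyfit(x_train, y_train, deg=7)  # flexible model, narrow coverage

for x in [1.5, 2.9, 4.0, 6.0]:  # the last two are outside the training range
    pred = np.polyval(coeffs, x)
    print(f"x={x:4.1f}  true={np.sin(x):+.2f}  predicted={pred:+.2f}")
```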
I've been playing around with layers of abstraction using LLMs: the LLM makes notes (commentary linked to the original text or to other abstractions) and then uses them to make changes to other abstractions or to the original text. This helps things stay on track when handling complex / long agentic tasks.
I wonder if there's an equivalent for video.
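A rough sketch of what such an abstraction-notes loop could look like. The `llm()` call here is a stand-in for whatever chat-completion API you actually use, and the prompts are my own guesses at the workflow, not a real library or the commenter's exact setup.

```python
# Hypothetical sketch of a "layers of abstraction" loop for long LLM tasks.
def llm(prompt: str) -> str:
    # Stand-in for a real model call; plug in your own client here.
    raise NotImplementedError("hypothetical model call")

def run_with_abstractions(task: str, source_text: str, rounds: int = 3) -> str:
    # Layer 1: condensed notes linked to the original text.
    notes = llm(f"Summarize the key facts and constraints in:\n{source_text}")
    draft = llm(f"Task: {task}\nNotes:\n{notes}\nSource:\n{source_text}")
    for _ in range(rounds):
        # Higher layers: commentary on the draft, then edits guided by it.
        commentary = llm(f"Critique this draft against the notes:\n{notes}\n\nDraft:\n{draft}")
        draft = llm(f"Revise the draft using this commentary:\n{commentary}\n\nDraft:\n{draft}")
    return draft
```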
I was an AI product manager for GE Software and I post videos on how AI actually works-including Gen AI and its shortcomings. These are not surprising results. You are absolutely right - it’s the core architecture that is the problem.
Gen AI works by using statistical distributions in its training data to do guided semi-random retrieval of data and does extrapolation from there.
This cannot avoid hallucinations and you have a great explanation of why hallucinations happen. Gen AI needs a lot of double-checking and fact-checking, including and especially text Gen AI programs like ChatGPT.
But human brains have the exact same problem, including hallucinating. If you truly are an AI engineer, I humbly direct you to some cognitive science / cognitive psychology subjects.
From my perspective, the fact that we encounter problems of this nature with current architectures shows (not disproves) that we are on the right track.
It seems to me many AI researchers need to catch up on what cognitive scientists have gathered over the last 30 years, and on some modern neuroscience too. The fact that LLMs lost the computer's perfection and make mistakes is proof we are going in the right direction: there is no real creativity without hallucinations and mistakes. Henning Beck has a nice book on this (flawed, but it shows the problem in a simple way).
"Bytedance"
Truly the bastion of tech innovation we all turn to...
There's a massive difference between an LLM and a visual generation AI.
Within language, humans have already encoded and implicitly shown the relationships between the representative units (words). There is almost zero such relational data in visual data units (like moving squares).
Do not conflate LLM-style AI with video generation AI. Max Tegmark recently put out a paper that highlights the relational, geometric nature of the information in an LLM, which represents implicit aspects of the world; but the model will miss the things learned in the childhood phase that humans never need to discuss at length because of shared human experience.
There are explicit geometric relationships between concepts in an LLM that are similar. "Queen" is close to woman and "King" is close to man but those two sets are far away from each other in the actual dimensional information encoded into neural networks trained on language. Those relationships are in language and not in physical objects viewed through a camera.
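The king/queen geometry described above can be checked directly with off-the-shelf word vectors, assuming gensim and its downloader are available (the snippet pulls down a small GloVe model the first time it runs).

```python
# Check the "king/queen" embedding geometry with pretrained GloVe vectors.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # downloads a small GloVe model

print(wv.similarity("king", "man"), wv.similarity("queen", "woman"))
# The classic analogy: king - man + woman should land near "queen".
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```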
All this says is they overfit an ill-suited model.
The whole point of "good ML" is making a model generalize OOD, which isn't only common, but was shown for LLMs on this very channel.
I feel like only you didn't know this:
"OMG it's just advanced retrieval and not really smart"
LLMs are not “Language Models”, that is a misnomer.
Scaling laws, scaling laws, scaling laws. We don’t know where the end is here, especially now that we have confirmation that we’re now training on both ends (pre / inference).
If something works better than before, it's not "doing it wrong"; it just means there's room for improvement.
This could be similar to something noticed with text to image models. They can have difficulty imagining two very different things in the same scene. Our test of choice is a combination of say Henry VIII and an alien visitation. Images based on the Tudor era have a style, as do images with extra terrestrials. Ask the model to combine the two elements without any specific direction and the model gets confused
It was clear what those models are and what the limitations are. It was clear that the recent advancements mainly came from improved computation power, the theory itself is already > 50 years old. We likely won't see AGI emerging from those models alone.
Humans aren't out-of-distribution either, but thinking creatively is still tough - it's hard to break free from the usual patterns, or maybe we'd be on a whole other planet, even in another galaxy.
Yep. Just look at our art, or at the aliens humans have invented. All the aliens/monsters we invent are just variations on earthly creatures.
Then you look at some of the petrified remains of prehistoric oceanic lifeforms and you see how, even on Earth, life was at some point more alien than any alien we can dream up.
Also, creativity is making errors, which is why creative people have so many problems maintaining a "normal life". They just tend to do everything differently from others, which from time to time is a useful variation; then you have a successful artist. Most creative people are just considered "weirdos", though, and live lives of poverty, chaos and wrong decisions.
For like 15 years I've told people that putting a 2-year-old in a room with 1000 screens and pumping them with info doesn't give you a teenager that knows how to tie its own shoes. They might know every episode of Bold and the Beautiful by heart, but they won't understand how to dress themselves, etc.
The reason AI has hit a roadblock in games is that the more complex models are not efficient in real time. You can simulate lots of stuff in theory, but when it comes to bringing it to the real world, the infrastructure is the key problem. AI can potentially do anything, IMO, but as long as we fail to create the appropriate infrastructure for the tech, it will always be a novel trick without real implications for the world. Humanity needs to focus on real economics with a tangible purpose, rather than beating the value-maximisation drum. It's the same as green tech: nobody seems to care that our targets are not aligned with what is actually achievable in terms of production. Selling the illusion of green tech is enough to generate value and maximise revenue. Marketing over product. We need to return to a path of creating useful products that bring meaningful innovation, instead of just generating wealth for a small percentage of us. As long as that doesn't change, AI will remain a bubble. There is no point in inventing and building trains if we are not maintaining the tracks they run on. This study makes some points, I guess, but it really just looks like marketing.
Why can't they just do what elementary school kids do: teach it left from right, make sure it understands a 3-D world, and make sure it remembers forward, backward, up, and down? A kid doesn't have to know Newtonian physics to know that when they push a ball, it does one of four things: forward, backward, left, or right. Video data is useless unless the model understands a 3-D universe; with video learning alone it will still see from a 2-D perspective. It made extra hands and fingers because it looked at the world in 2-D and was trained to see spontaneous creation.
You get so excited when you understand something
there is a 5th characteristic required for intelligent behaviour: the ability to feel. Without the discomfort of pain or the pleasure of joy, AI will never reach the levels of human achievements.
can it be simulated?
Doesn't seem relevant here at all. Feelings are mostly evolved affects, which are mostly involuntary and automated. If anything they would cause unnecessary errors.
We want brilliant tools, not frustrated creatures.
Other than that, there is also the qualia problem. Qualia are indescribable, and even we humans, despite mirror neurons and all the empathic circuitry, cannot know what others feel when they feel, for instance, warmth or cold or wetness or love or regret.
@@tomburnell8453 Yes and no. Feelings are part of a simulation of reality your brain presents to the observer process running on your brain's hardware. So feelings already happen inside the simulation we call our personal perceived reality. The problem is that, just like we have trouble looking into an AI's internals, we cannot decode one's internal thoughts/feelings/perceptions, which means we have no idea what a quale actually is and how it works, or even whether the qualia part is the same in all humans, or even similar (the "feeling" (quale) of seeing red could, for one person, be exactly the same feeling as for another seeing red). There is no way to know until we have a 100% simulation of a human brain.
@@romanhrobot9347 Maybe good art comes from suffering and brilliance arises from frustration? Those "unnecessary" errors may be required to find the solution.
So what are the implications for Tesla ?
I find particularly interesting that the model has not learnt that objects never change shape, which should also be retrievable.
I'm not convinced that this isn't an artifact of an insufficiently large neural network. But it will be interesting to see what others do to confirm or contradict the conclusions.
It seems like AI decides how it orders its thoughts. Quite impressive! They deserve respect!
Look at image and music generation. Right now it is so good that it barely makes mistakes and creates new things that never existed before, and they look and sound really good. It does not need to be perfect; it just needs to look realistic.
21:00 "Take a video of piss womb." Now that's thinking outside the box. Or maybe in the box...if you know what I mean. Anyway, crazy creativity is definitely what's needed. Can't deny that.
Was that video's text done in PowerPoint?!
Personally, it obviously was not about the model understanding the physical world; it's mostly about an illusion of such.
For me, as an artist, AI has to make something I envisioned, not something real.
When you get results close to reality, or surreal but close to what you expected, you don't need more for a long time to impress people and occupy their attention.
Add some control instruments to it and the narrative is set, for deeper immersive engagement with what you're creating.
In defence of AI's visionary advance: even if AI right now can't simulate perfectly physics-dependent worlds, we will reach that deeper goal through the "fake it till you make it" principle and compute optimisation.
"AGI" does not need a perfect understanding of the physical world; people don't have ideal vision, spatial attention, or all the data needed for prediction.
The first "AGI" models will have an "illusion of perfection", partly because people are actually far less capable in a lot of their sensory and predictive abilities than they think, and partly because it will "do the job" as it is.
For me, physical understanding of the world will arrive exactly as an emergent illusion, rather than a perfect representation of reality, just as any other industry seems to develop in the human world.
And after that, "AGI" will find a way to reimagine itself, by creating a physically accurate structure for the next iterations of itself using a simple exception method.
Right now, I think, there is not enough compute, and not enough optimisation of it, to reach ideal physically based computation while running a real-time generative model.
Funny. It has been known for more than 100 years that you can't learn causal relationships (i.e. laws of physics) by just observing; you have to do experiments. Statistics only gives you correlations. It stuns me that OpenAI assumed neural networks would magically achieve something that has been proven mathematically impossible.
Large models can make mistakes - sometimes even huge ones - but that’s how OpenAI and companies like it improve them. Without that process, we wouldn’t have seen innovations like AlphaGo, AlphaZero, and others.
Ok. I have an idea.
What if one were to feed a model a data set comprised of basketball footage, but solely footage of missed shots being taken and turnovers?
Next, include an additional data packet with the statistics from various NBA games.
Finally, have the AI simulate entire basketball games based on the data inputs.
This would, in theory, determine these model structures' capability to simulate physical reality, would it not?
Also, even if it takes 20 years to get to AGI, just the AI we already have is poised to revolutionize the world and will likely have deep effects on employment. Even if the AI can't do it all, it can still reduce a team of 10 to half that size.
There is a 'robot brain' being developed that is modular and thus can go into many systems. That robot brain needs to be plugged in here too.
It needs a deeper model that is trained on physics and 3D shapes/objects/fluids.
Then feed it visuals on top. Somewhat like a game engine.
Argh, that is sort of logical at this stage. Assuming Sora uses "only" a transformer-based model, it just interpolates over training data with a certain minimal margin of variance.
If the model were somewhat agentic, things could change. Let's say the model generated a prompt per frame of video, including agentic input on the physics; then the transformer could generate video outside of its training data correctly.
It is one thing to generate text that follows an overall simple structure and next-word prediction. It's a miracle that transformers can generate images, let alone video, at this level.
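A rough sketch of that per-frame, physics-in-the-loop idea. The physics step and the prompt format are toy choices of mine, and `generate_frame` is a hypothetical stand-in for whatever video-model API would actually be called; nothing here is a real library function.

```python
# Sketch: let a physics step decide *what* happens, and a generative model
# decide *how it looks*, one prompt per frame.
def simulate_physics(state: dict, dt: float) -> dict:
    # Trivial Euler step: a ball under gravity, bouncing on the floor.
    x, y, vx, vy = state["x"], state["y"], state["vx"], state["vy"]
    vy -= 9.81 * dt
    x, y = x + vx * dt, y + vy * dt
    if y < 0:
        y, vy = 0.0, -0.8 * vy  # lossy bounce
    return {"x": x, "y": y, "vx": vx, "vy": vy}

def describe_scene(state: dict) -> str:
    return f"A red ball at position ({state['x']:.2f}, {state['y']:.2f}) metres above the floor."

def generate_frame(prompt: str, previous_frame):
    raise NotImplementedError("hypothetical call into a video/image model")

def agentic_video(state: dict, n_frames: int, dt: float = 1 / 24):
    frames, frame = [], None
    for _ in range(n_frames):
        state = simulate_physics(state, dt)                   # physics decides what happens
        frame = generate_frame(describe_scene(state), frame)  # the model decides how it looks
        frames.append(frame)
    return frames
```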
AI's solutions feel more and more like the Wizard of Oz
Boh, there could be worlds where red stuff becomes circular by moving.
If there was no example disproving it in the data set, then we can not expect the system to learn it.
They should build a dataset in which something contradicts what the rest of the dataset would teach.
It's funny how so many people say gotcha, LLMs won't reach AGI, who ever said that? OpenAI? Never seen it. They have said, we have LLMs, look at all the cool things they can do, and we will try to use them to make an AGI. I'm pretty sure today's LLMs could explain the difference.
Maybe I’m dumb but this is what I assumed the case was - video models can only show what they have been trained on.
I don't like the way they tested OOD. For example, if they had trained on several different colors *except* red, instead of a single color, and then tried red, their finding would be more valid.
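For what it's worth, the evaluation being asked for above is easy to express as a hold-one-out split; this is just a sketch of the train/test design (toy color/shape combos, not the paper's actual data).

```python
# Leave-one-color-out split: train on every color/shape combination except
# the held-out color, then test only on that never-seen color.
import itertools
import random

colors = ["red", "green", "purple", "blue", "yellow"]
shapes = ["circle", "square", "triangle"]
held_out_color = "red"

dataset = list(itertools.product(colors, shapes))
train = [(c, s) for c, s in dataset if c != held_out_color]
test = [(c, s) for c, s in dataset if c == held_out_color]

random.shuffle(train)
print(f"train combos: {len(train)}, test combos (never-seen color): {len(test)}")
```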
That's kinda expected. We need a different AI approach to train the world model first and only after that we can dump the massive amounts of text and videos onto them.
We are going to see Unreal Engine-like models that allow some text-to-animation features that can be stylized with AI image generators frame by frame and then turned into a video. Bottom line: I think this would have to start with physics built into the system, given that a multimodal model will always have scale limitations. It is not that far-fetched for Unreal to learn to take in an image, identify the objects, use an object library to place them, and then run the model I mentioned. Image-to-video and even text-to-video are still possible with this concept. I'm sure it's coming eventually.
A return to a balanced perspective on AI!
AI is only competing against human intelligence. Ask most adults to draw a stick figure cartoon strip of ten images and not only won't they get the human anatomy right, they won't have a chance in hell of getting even basic physics correct.
Didn’t Apple have some research like this a couple of months ago?
This is a Trojan horse from China. ByteDance is the company that owns TikTok. This is how they ask their questions, so they can find out more of OpenAI's inner secrets. They put OpenAI under pressure: "you're a liar." Then some employee will leak a little in the comments, Johnny Apples leaks another little bit, and from these bits ByteDance will put the puzzle together. Then they'll be closer to knowing how Sora works and how to make a competitor model.
I think you probably need to combine several models. Models have their use cases.
This and a few other papers have moved the needle a little for me.
A little.
But they have holes. This is a very small model, comparatively.
And has never been taught that it is possible and advantageous for it to make guesses about OOD data.
Plato's cave.
This makes all the sense to me. For us it may seem obvious what is expected in the next frame, but for the AI there is no information, explicit or implicit, that would clearly define what it should predict. So when it predicts something else that fits the input data, it's still a valid solution to the problem from the AI's perspective. We as humans just cannot grasp that these things that come naturally to us require knowing something we don't realize. So to me this doesn't prove that AI is not 'intelligent' or whatever you want to call it. It's just that things are more complicated.
Your videos are CRAZY!
lol openAI will probably come out with diffusion chain of thought inference based video model with insane inference scaling laws so the model thinks about every video patch in relation to all the others like an o1 gold medal maths olympiad contestant...
I remember a lot of research coming out about a year ago showing that models which are trained on general information sets actually outperform specialized models in their own specialization.
The research here used a HIGHLY specialized model. I wonder what would happen if they used a wide variety of complex shape and color patterns moving in a wide variety of ways, with significant variation, to allow the model to learn the abstract principles of categories and to determine which categories impact others and which don't.
In this example, the category of color was always paired with the category of shape, so OF COURSE it could not generalize a deviation from that. But if you give a distribution that makes it evident that those categories are fundamentally unrelated, then the model should be able to predict more accurately.
IMO this is just lazy research.
The best generators in the industry may grasp concepts, but can't consistently create images or videos without putting a floating hand in the backdrop or morphing from one being into another.
You can ask the model, "is it normal to see hands float in mid-air" or "is it possible for a cow to instantaneously shift into a horse" and the model will be intelligent enough to tell you "no, that defies the laws of physics."
Yet, despite being intelligent enough to know better, there is still a good chance it will create an eldritch-horror abomination when you simply ask it for an image of a puppy, or morph two people's faces together when you ask it to generate a video of two people kissing.
AI, AGI, and most top LLMs on the market today have intelligence for sure. What they lack is a world understanding where they can see and understand in detail this endless amount of data they know. They may know something is correct or know something is supposed to be a certain way, but they lack the understanding as to why.
I wanted to provide some feedback on your videos. I watch almost all of them and I just wanted to share my thought.
I watched a video the other day talking about Intentional Breaks when speaking. These breaks give people a chance to process what you're saying by giving them a quick break.
I think if you can practice intentionally taking a half breath, and being comfortable with a short pause, your videos will improve a lot.
really cool vid, also imagine this method in 3d space
I've had this theory for some time but didn't want to be told I was dumb lol
Where do you get your blow from... I want some 😂
Will we collectively get smarter as AI improves and gets smarter? If so, maybe it's best to grow together.
it's a very complex system of video game character creation sliders
This is not surprising; OpenAI is trying to look better than they are.
What about real-life physics training data coming through robotic sensors? What would happen?
I do like your videos; however, one major issue I have is that you really need to normalize the audio. You go from super excited to flat (or using AI) and back to super excited, and your volume goes up and down with it as well. You REALLLLLLY need to get an editor. Other than that, I enjoyed the video. Thanks for the info.
I'm always very nice to you in the comments, but I feel like we all already knew this.
It is known already. Please be advised that real time multi modal multi sensory data input is required for real world training. Only extrapolated perplexity can come from static interpolated data. It is 100 percent incorrect to suggest that symbolic and probabilistic LLM reasoning engines correlate with what is suggested here in any way. Real world extrapolation requires real world sensors for a time. 🎉 Just like humans.
humans also suffer from this. The only difference is that we always predict what happens next and correct our internal world model
Nope nope nope - not there yet. This is a step. But, what is needed is a rich sensory environment. It needs to feel gravity dragging on it. It needs to feel friction. It needs to feel texture. It needs to hear chalk on a blackboard. It needs to taste bitter apples and sweet pears. It needs all this and more before it can make more than abstract sense of what is shown in pictures. Until it's environment for growing is much richer I'd never trust it as a left fielder in a baseball game. Can it catch that little ball? Can it learn the clues to catching it reliably?
All we have in LLMs is a fancy relational dictionary. It could easily spit out the equations for flight for that baseball. But, it has no idea what the equations mean. Or what "Keep your eye on the ball" means or how and why it works.
{^_^}
If an AI has only seen red circles going right and blue squares going left, where is it supposed to get the authority to assume that red circles can also go left? If you had seen, your whole life, only apples going left and oranges going right, where would you get the authority to assume that apples can also go right? Well, easy: 1. you can take an apple and throw it to the right; 2. you have seen a billion other objects going in all sorts of directions, and since an apple is an object, you can safely assume that apples can also go right. None of this is true for an AI. It cannot do experiments in the world, and it has not seen enough examples to feel safe generalizing.
Even Yann LeCun says that AI can in principle do this generalization, it just needs way too many examples. So what we have to do is simply speed up this process somehow. A baby that sees an apple going right for the first time is surprised and delighted, because learning something new about the real world is joyful, it feels good. Maybe that is what we have to model in AIs. If there is a measure of new and surprising information, the learning process should focus on that, maybe with a kind of attention mechanism on newness. A baby would then take that apple and throw it to the right multiple more times just because it is so fun. This reassures and strengthens the theory and the observation that, oh, apples can also go to the right. The baby does that until it feels boring; that's when it has learned the new rule. And that's what we should also model in an AI that is supposed to learn about the real world.
We should see this video generation model generating a red square that turns into a red circle not as hallucinating, but as doing an experiment. The model doesn't know the rules. Maybe there is a rule that objects can change shape? Why should that rule be any less probable than circles only going in one direction? There is no hierarchy of reasonability among possible rules. Maybe such a hierarchy can be formed, but only with many more examples. So the way to react to such an experiment is just to tell the model: no, that's not how it works. Then it can learn from that. Of course an AI embodied in a robot in the real world, one that can do these experiments and have the real world as a teacher, would be much better, since the real world gives these answers for free, every time correct and consistent.
We should research what happens in AI models when they see new stuff like the red square going left. Are they surprised? How is this new rule formed? How is a generalization like "all objects of all colors and shapes can go in all directions" formed while seeing all the examples? And then maybe we can find a way to speed up that process with a sort of attention mechanism.
A biological organism with a neural net in the real world of course only retrieves data that it has learned in the real world, because anything else would quickly lead to its death. A frog that wants to catch a fly retrieves data about the flight paths of flies it has seen in the past in the real world. If it hallucinates, it does not catch the fly and dies of hunger. And since this is how the human brain evolved, the human brain also mostly retrieves data from its huge database. Original ideas are very rare. With science we have created a safe space, a playground for this. But for most organisms with neural nets this doesn't work.
Of course the frog is not predicting every pixel on its retina; it is only predicting the path of the fly. Well, in a way it is predicting everything, but it is predicting it to stay the same! If the frog suddenly predicted the stone turning into a snake and the grass turning into fire, it would never be able to catch the fly. So it is predicting the surroundings to stay the same, just like you are predicting the wall to hold up and not suddenly crumble with a horde of zombies behind it attacking you. So of course real-world prediction is possible, it just has to include that most of the stuff stays the same all the time, and only some interesting stuff changes, which is what we need to focus our attention on.
And what we also need is constant learning. Let me tell you something you have probably never thought about: we don't even know how it feels to not constantly learn! In fact, if constant learning were somehow switched off, we would probably start hallucinating immediately, just like LLMs do. And in fact we do, namely when we are sleep deprived. People who haven't slept for a long time start to hallucinate. This is because their constant learning doesn't work anymore, because we need sleep to learn.
So in summary, the tl;dr:
I don't think we need a completely different architecture for AIs to be able to form generalizations. Maybe just another addition, like a sort of attention mechanism on new stuff, to speed up a process that is already possible but currently takes way too many examples.
We need a way for AIs to constantly learn, maybe with a sort of sleep mechanism. It has to be clear that there is a real world where hallucinations are not helpful, and a dream world where it can happily hallucinate all it wants and sort of make experiments in its mind without bad consequences, rearranging stuff while doing it.
We need embodied AIs in the real world that can do experiments and feel pain when they hit their head (and so discover that objects are actually solid) or touch the hot plate, and joy when they discover that apples can not only go to the left, but also to the right.
The main point of this video, that generative AIs only retrieve stuff from their database, is also true for most organisms with a neural net in the real world, including humans, because anything else is not helpful in most cases.
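A loose sketch of the "attention on newness" idea from the summary above, under my own assumptions: weight each example's contribution to the gradient by how surprising it currently is (its prediction error), so novel cases dominate each update. This is only a toy linear-regression demonstration, not a claim about how a video model would implement it.

```python
# Surprise-weighted gradient descent: examples with larger current error
# ("more surprising") get proportionally more weight in every update.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + rng.normal(0, 0.1, 1000)

w = np.zeros(4)
lr = 0.01
for _ in range(500):
    errors = X @ w - y                 # per-example "surprise"
    weights = np.abs(errors)           # bigger surprise -> bigger weight
    weights /= weights.sum() + 1e-12   # normalize so the step size stays stable
    w -= lr * (X.T @ (weights * errors))

print(np.round(w, 2))  # heads toward [1, -2, 0.5, 3]
```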
What's so difficult about spelling the word "is" right?
"My competitors stuff isn't as good and it isn't actually ai!" Grain of salt people.
There are other papers which say, essentially, "AI doesn't generalize at all; all it can do is repeat patterns it has seen before." One could interpret this as "we are going down a path that will never result in AGI." But I see it as: "this is an alien intelligence that learns by example, not by generalization; it thinks in a different way that is far more powerful in some ways and far less effective in others." Did you really think an inorganic brain that learns by reading every book plus the entire internet, in less time than it takes you to learn how to play guitar, would be just like you?
I never understood why you all thought otherwise. Sora and similar are just predicting pixels. Of course they don’t figure out physics.
A simple criticism of the article: you cannot study what are supposed to be emergent properties if you drastically restrict the dimension of the training space. If I train a model on 1 billion examples of red circles that fall at 100 mph, that's the only world it thinks exists; the laws of nature for the model are just "going 100 mph". Why should it forecast 10 mph for a new situation? How dumb is that? I wonder how a human could correctly forecast a phenomenon that defies all previously known laws of nature 😅
Diary of CEO thumbnail 😂
I’ve been saying this from the start of the hype. There is no sign of intelligence; they hijacked the word. LLMs are at best ‘go fetch’ and ‘Simon says.’
9:50 Given the result they got with the red square turning into a circle, I wonder what would happen if they used a green triangle.
Would it not move? Would it change color to blue or red or anything else?
A human would think shape and color have no meaning, as the data showed two different shapes with two different colors moving to the right.
The similarity is the moving part, so a green triangle would move at a constant speed towards the right.
Very curious what an AI would do with that.
Can't AGI be achieved with a shocking amount of virtual data instead of just relying on physical (real) data? Like 1000x the amount of real data
The human brain is split into parts that handle different tasks; a combined model is probably the solution.
Artificial Spatial Intelligence
That is all.
I have not read the paper, but if training starts from an initialized model using only these narrow examples, I would expect nothing more than what we have seen. Effectively you are testing the world model of a 2-month-old child, so of course you see these effects. A lot of the model's understanding does not sit in the parameters themselves, but in the gradient (and its derivatives) over the parameter distribution. To create a model of the world, you have to train the model on all sorts of shapes and directions. Also, the dynamics you use have to be logically/predictively consistent. Two shapes traveling in different directions based on shape or color might not result in a predictable pattern.
Also, we have no idea what 'out of distribution ' means for humans, since every teenage person has seen petabytes of data.
AI knows what it learned. AI doesn't know what it wasn't taught correctly. Obviously obvious. Humans can understand, or they can memorize. Memorizing is easier, but not right.
Haven't developers been saying this for years?
It's skewed training data.
I've noticed with Gemini AI that it has real problems when it hits a template answer (many of which are based on highly politicized issues with a strong left-leaning bias) that does not match the training data (which is not uncommon for left-wing political ideologies that are not founded in science but suffer from tunnel vision); it is then required to hold two contradictory ideas in its mind simultaneously.
This causes a lot of problems too.