Every time I hear stuff like this I have to remind myself that the human brain fits all of it's memory and processing power into a head sized container and only uses a few tens of Watts of energy to maintain operation continuously. Clearly the issue is with the method used. Our computers are already much faster and have more gates than a brain has neurons. And even simple creatures with tiny "bird brains" can do amazing things.
We are, I don’t know how many functions each neuron computes. Every once in a while, we come up with an approximate scale of what we think the human brain computes, but we are consistently changing how much we think the human brain has in terms of compute power. It could be a quantum system though, and then the number of neurons would be almost irrelevant compared to its design.
Unfortunately there is a very real chance that human brains can never understand the theory of human brains well enough to discover its underlying mechanics... Everything we see about brains so far implies they operate off a lot of hazy approximations combining real physical geometry, chemistry and electrical impulses into what somehow ends up a broth of consciousness. To say we don't currently build computers this way is the underest of understatements.
Because we don't actually have memory at all lol. We have the likely hood that something occurred based on everything. That's why our memories can be completely different to someone else about similar events.
I don't get it well, the intrinsic dimension of ALL natural language is 42? or this value is only for English language? If this value is only for English, which is the value for all other languages? for example Chinese, Spanish, Italian, etc?
These are Language Models, and it's well known that natural languages follow Zipf's Law, where word frequencies adhere to a power-law distribution. Because LLMs are trained to learn and predict patterns in language, it’s clear that they must also exhibit this behavior. In fact, this could explain why LLMs seem to hit an efficiency ceiling-they are constrained by the power-law nature of language itself. As the models improve, their gains become increasingly marginal, particularly when dealing with rare words and complex language structures.
but the higher dimensional woowoo crystals say I'm going to take a brave risk and form a new lasting relationship with someone unexpected if I'm mindful of my dietary choices and donate 1$...
Zipf's law could apply even with extremely low-entropy (easily modelled) data, it's a feature of the alphabet/dictionary/etc not the semantic volume of things you could express.
Sounds like the Law of Diminishing Returns is a universal law, just with different names when different people discover that it also applies to whatever they are studying.
Two 20 Watt human brains looking at a 20 million Watt supercomputer operating for 3 months costing $200 million. “Look at what they need to mimic a fraction of our power”
@@mrufa You can't "train" a human mind as quickly as a supercomputer, true, but that's only because there's a throughput issue. A supercomputer is a warehouse sized machine that requires tremendous energy and can only perform specialized tasks. The human mind is slower but much more efficient.
@@JeffNeelzebub is it, despite the time it takes to learn stuff? I wonder... Probably yes but will it stay on top until the end of the century, or even the decade, for that matter?
Possibly shouldn't be a surprising relationship: Thermodynamic entropy and entropy in information theory are related and it tells us that each bit of information has an minimum cost in terms of energy. When you plot cross entropy, you're plotting missing information. It would make sense to flip the y axis and consider that to be how much information was learned. When you plot compute, you're also plotting energy, which is directly proportional to the information and therefore should produce a straight line. Not all models/learning schemes are 100% efficient so they are constrained to one side of that line. The other side represents a thermodynamic impossibility. It would break the 2nd law because the entropy of the universe (increased by the heat output of your GPUs, decreased by your model learning) would decrease.
@@GodwynDi That statement assumes that human brains *do* cross that boundary, but that is not a sound assumption. I look forward to your upcoming paper investigating that. :)
@@GodwynDi We don't cross the line. The better question is how we get so close to the line using a hunk of wrinkly wet meat that works completely differently and uses much less power. And I think the answer is that most of us most of the time are just parroting words that we came up with as a culture or species. So a fairer comparison is thousands or millions of brains coming up with an intelligible language while each individual brain is often just using minimal effort and energy to repeat it back.
@@liam3284The relationship is that thermodynamic entropy is a statement about the arrangements of the microstates of the elements of the system which is exactly information encoding.
I think the explanation is rather simple and has to do with irreducible complexity of objective reality. Basically you cant have precise knowledge (eliminated errors) without sufficient many axioms to begin the process of elimination. At least not from a model. Growing compute is nothing else, but a growing number of data points which are used to pin-point the result with some degree of certainty, but because we generally train models for real world applications the objective reality from which the training data generally comes from and since objective reality does not have some "magic knowledge" that defines the reality more than it defines reality you cant get past that line when you plot error against compute. We just cant provide the model with cheat sheets for its applications. I'm pretty sure one could breach that line in theory by providing some kind of cheat sheet for the model to take shortcuts.
22:25 I remember reading at least one compelling paper that argued that emergent functions like this is more a property of the way we measure the model's functions, rather than a step change in the ability of the model. You might want to look into that
i was gonna guess that it was just a limitation of the data being collected in such a way that it couldnt ever produce a value on the far side of that line…
@@pritamlaskar sure, I was meaning to hunt it down anyway. It's called "Are Emergent Abilities of Large Language Models a Mirage?" The relevant quote from the abstract: "Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance."
Someone from Google way back in the early 2010s predicted this limit to LLMs. Not a well published paper most people didn't wanted to hear it at the time.
@@jcorey333 That's a brilliant quote from the abstract. Translated into plain English: "We ended up fooling ourselves by measuring things without questioning what the measurements represent."
I am blown away by both the implication of those papers and also (and especially) by your ability to convey so much information in a 24 minute video that makes it understandable to amateurs in this field like me.
Einstein's first name is not John. Einstein's first name is approximately Alpert. Einstein's first name is probably not Alphonso. Einstein's first name is derived from the Germanic Adalbert. Einstein's first name is Eduard. (Albert's second son.) I don't think there's a single sentence segment in any human language that you can come up with that has only one correct solution for the next word.
No, answer is 34. Otherwise this BS, which they name AI, but has nothing to do with "intelligence", would be gone as any other trash. Same as VR. There are 20 times more VR games following rule 34 than regular ones.
@@DarkFox2232in what way does the search and learning seen in AI models differ from intelligence seen in biological systems? Of course there are differences in implementation, but what about concept?
@@vastabyss6496 If person is born brain dead, but part responsible for heart and lungs work. Will you call it "intelligent"? See, ability to recognize patterns and learn from them is simplistic way to describe intelligence. If human being required same amount of trial and error in learning process as those models. And still delivered so mediocre results... There would be no humanity. Because no human would in its life cycle manage to elevate himself from being wiggly worm in dirt. You can't even imagine magnitude of "machine learning" which costs $500 billion USD or more. What they have is comparable to human with 0.0001 IQ. And they bet on scale. They are brute forcing their way through mediocrity. But they'll not succeed because when they press play on that generated model. What it simulates is not intelligent approach to problem solving. It is what would be called "intuitive" approach to problem solving. Same as experienced people do things by "gut feeling". Except that those people did get to that point through experience processed by incomparably high intelligence and gaining wisdom. (Even if their IQ was actually just about average.) Models we use now are intuitive mimic game. And that's good for something and bad for most of other tasks. Like generating images. Why they look good? because there are high quality samples and good blending algorithm. But why they keep failing at basic concept of human posture, shape, limb/finger/teeth/... count and placement? Because it is same as if you take fashion magazine. Cut out parts of people, dresses, ... and give it to 3 y/o mediocre kid and say: "Create me: " It will have those nice looking parts. Everything is detailed and realistic. But result will be grotesque. Because that kid still lacks fundamental understanding of reality. Its mind is not mindful enough.
So, FWIW, it is well known in science that log-log plots nearly always end up looking linear. It is a feature of log-log plots and/or the unlikeliness that any system is super exponential.
exponential curves are still exponential on log-log plots, its only polynomials that become linear. log(y) = m*log(x) + b y = exp(m*log(x)+b) y = exp(b) * x^m
When it crosses the line AI will learn to say "I don't know" and stop hallucinating. The 'I dont know' factor being absent from a vector being driven through higher dimensional space mathematically seems like a hard limit without some sort of mock self awareness strapped to that process.
I like that so many people are becoming interested in the field of AI. More people working on a subject is bound to lead to breakthroughs. But, I think there might be a slight misunderstanding about how LLMs actually work. AI models are essentially pattern-matching systems. They take an input and produce an output based on the data and algorithms they’ve been trained on. When we talk about AI ‘learning,’ we’re referring to its ability to improve predictions or outputs based on past data, not the kind of conscious understanding or introspection humans experience. Adding an 'I don’t know' response isn’t as simple as flipping a switch-it requires mechanisms to estimate uncertainty reliably and suppress low-confidence outputs. While some AI systems do incorporate confidence thresholds, they’re not equivalent to a human's ability to 'know what they don’t know.' Achieving that kind of awareness would likely require breakthroughs in self-aware architecture, which current AI does not possess. Your point about self-awareness is thought-provoking, and I agree that pursuing architectures capable of true self-awareness could be revolutionary. Unfortunately, current research is heavily driven by commercial priorities, which means a lot of resources are focused on systems like LLMs that are practical but not necessarily a step toward self-aware AI. After all, "robot overlords" would threaten their fiscal quarters. Note: I asked ChatGPT to help me rewrite my original comment, as I felt it came across as rude, and ChatGPT said it comes across as condescending. So if it seems overly eager to please, that's why.
21:33 I’m not a fan of numerology, but it is funny - how the dimension of natural language happens to be 42 (just like the "Answer to the Ultimate Question" number from "The Hitchhiker’s Guide to the Galaxy"). :)
"Einstein's first name is …” Einstein's first name is universally known. Einstein's first name is known by most people. Einstein's first name is not an example of name weirdness. Many possible next words....
@@omfgacceptmyname The point of A.I. technology was not business, initially. It was all ways to try lo learn how our brains work and what exactly intelligence is. Although not involved in any project, I've dedicated all of my life to find out such things, and finally got enough insights to get the puzzle solved to all of its pieces just two years ago. It took me longer than half a century. Fairly ingenuous but mathematically simple. But current A.I. … I call it Artificial Idiocy. They try to sell an obviously incomplete and faulty tool as a complete, wonderful magic wand, without really knowing what intelligence is and how it works.
@@wafikiri_ The reasoning ability of AI is the selling point. It's not an autonomous agent, but it can still fill the role of a support chat, site scraper, idea generator, etc. It's even better since LLMs are literally made to generalize data, so they can produce any media with at least some accuracy. One big improvement is in facial recognition/image generation, while other improvements involve the study of proteins and viruses that are very difficult for humans to observe. The curated crap we're being sold now will change as AI is weaponized in other forms. That line the video shows at the beginning can be considered a barrier for AI becoming integrated with everyday life. Until then, it will remain in the apps as a novelty or generalized tool to avoid tedious work, but I believe we are almost at the point of creating something mistaken for sentience.
Another fascinating point is how well our observations in neural biology follow similar power scaling laws. The human brain seems to fit very nicely on the primate scaling curve and (not surprisingly) points to an adaptation within primates for superior cognitive scaling performance vs other mammals. There are obviously important distinctions between ML and our brains. Models like GPT-4 are highly specialized and would be a better comparison to the sub-network of regions in our brains that processes language. Lastly, an area where we are significantly lagging in capability, is the Abstraction and Reasoning Corpus (ARC). Human scores against ARC are in the 80% range whereas our best algorithms are in the range of 30% and of course all of the most interesting applications of AI/ML will heavily task abstraction and reasoning tasks. We have LOTS of work left to do so please don't fall into the trap of thinking we just need to throw more GPU's at this and we somehow get to the singularity... we are still missing very important stuff but the progress we have achieved is also incredibly impressive.
That's a fascinating comparison and really does line up with this. Agreed about the GPU brute force approach, there's obviously still missing components.
I like your viewpoint. There's still so many unknowns and questions to be answered regarding AI. From the technical to the philosophical. I wonder if we will ever get there and what that journey will bring for our species.
0:30 ish - Ok, first impression comment... ... can anything cross that line? The way the graph is presented it appears asymptotic so AI isn't even part of the equation... granted, I don't yet understand the meaning of the graph and am just going on naive impression of graphs in general and it will probably be explained in the video.
It just seems to me that crossing that line would mean a negative error rate? Like the AI would have to make a negative number of mistakes? There's no way this is correct, but it's what it seems like from the charts.
I didn't know that you were the one that made those videos on imaginary numbers A decade later that is still the best explanation I've ever seen I wish I had that in college, and I still reference it to other people when imaginary number conversations come up
Its not the AI model, its a property of the dataset - that's the only commonality. The fact that it follows a power law is a significant indicator. Most statistical linguistic experts will be able to point to many such power laws that appear when we measure human languages. The most commonly known is word-frequency power law, so well know that it has a name, Zipf's Law. Regardless of language, regardless of what collection of works, the top 100 words comprise approximately of half the collection and the next top 10000 words comprise the remaining half. Power laws appear in a lot of AI datasets because most complex data exhibits these power law properties, and folks generally only apply AI on complex problems.
This is interesting, but I don’t see how it explains what we’re seeing in AI. Why would the error of the system, which is a measure of its ability to learn, reach this limit? Larger models perform better, and models with more training data perform better. The question is why we can’t squeeze any more performance out of a fixed model size. Seems to be a limitation of the network, not the data.
@darkstar4494 But you're still using essentially the same model at the end of the day, to see drastic differences without changing model size you'd have to use a different model.
@@Ansalion agreed. Can a massive leap in model design (as transformers were) could get better performance for a fixed model and data size? That is the question posed by this video. Right now, model selection doesn’t make much difference. I don’t see what this has to do with the question I asked about power laws in linguistics.
@@darkstar4494 Its more a case of a power law dataset being simultaneously easy to learn but also difficult to master. Half the dataset is easy to learn because examples are abundant. The other half of the dataset has a sparsity issue where examples become increasingly rare, eventually to the point where you only get to words that only appear once. It doesn't matter how many TB of data you have, there will always be these rare words. And regardless of the model used, all models have issues learning when there are only a few or just 1 example to learn from. And that's why initially the AI learns rapidly, reducing its error rate very quickly, but eventually an inflection point occurs when it becomes increasingly difficult to learn from a lack of sufficient data.
I think this trend makes a lot of sense. AI models don't actually learn. They just recognize and copy patterns, and in a complicated way sort of extrapolate and interpolate those patterns onto new data. The more data you train it on, the higher the accuracy of the AI's predictions. Letting it train for longer then, is just the AI recognizing the patterns in the training data better, and having more training data gives it more data to interpolate/extrapolate from. But no matter how good your algorithm is (Which AIs currently basically are), there is always a limit to how much data can be recovered from a limited amount of samples. If you look at AI as being very fancy data interpolation machines, this result makes a lot of sense. The only way I can see that might break this pattern is by having AI actually *learn* rather than do pattern recognition. You'd need an AI that's trained for being able to learn rather than to do one single task, but that's difficult because you'd have to measure learning ability, and making and reviewing your own mistakes is a big part of learning. Having curiosity and a drive to experiment is a part of that too. Such decisions aren't very rational and measurable. You'd need to evaluate something as abstract as the ability to learn. That's definitely not an easy thing to do.
This is an active area of research, and you might enjoy looking into Continual Learning, Multi-Task Learning and Meta-Learning. Probably the largest gap between AI and animal intelligence is the fact that our brains and bodies in general are hierarchical systems with abilities like neuromodulation (neurotransmitters), neurogenesis and metaplasticity (neurons/signals modifying other neurons). As you're pointing out, it's quite clear that our cognition requires more than a universal approximator. Also, what you're referring to as "actual" learning is typically called human learning, as the distinction is important but disqualifying neural networks from learning would make comparisons between the two more clunky.
@@tukib_ I suppose yeah, current AI models do learn. Just not like humans. If you think about it, yeah, if you want AI with human-like intelligence you're gonna need similar mechanisms. Similar to dopamine in our meat computers, it'd need to be able to give itself a reward for learning new things, and the system for that has to be integrated with the rest of the model in a way that it can understand what's important to learn / improve at, while also not getting stuck in a trap of giving itself infinite reward without being productive. It would need to be a closed loop system of multiple systems acting on eachother, that can also determine reward for the subsystems, while also still being productive to humans. Getting a bit philosophical here, but I think the only way to achieve that is by having it either value its own existence, or value being useful to humans, more so than it would value its sense of productivity, in a way that's hard-coded to ensure it can't possibly break that loop (somewhat similar to human emotion). If it doesn't value either of those, it will not care about actually being productive to humanity, and likely create a closed loop of giving itself infinite reward by setting arbitrary goals for what counts as being "productive" (similar to how we might get drawn to memes or tiktok, which activate our dopamine reward system because our brain recognizes it as productivity). But at that point, I'd say it gets eerily similar to being a conscious machine. Either the machine values it's own existence, which will definitely turn out bad for humanity if it gets sufficiently advanced, or the machine feels an obligation to serve humanity, which.. kinda feels like having super intelligent dogs? As dystopian as the latter sounds, I don't think having machines with some form of emotion that are solely dedicated to serving humans is necessarily a bad thing as long as they're content with their existence. Anyways, enough rambling about philosophy. I am also very much interested in the actual mechanics of AI, so I'll definitely be checking out Continual Learning, Multi-Task Learning and Meta-Learning as you mentioned.
@Ab3ndcgi oppenheimer is not a hero just a sad story aalso if humans didn't ask 'why' you would not be here criticising scientists while being an egotistical nobody
@Ab3ndcgi The "why" is the potential for the model owners of automating a lot of jobs out of existence and collecting rent on the cost difference between a human doing the task and an AI doing the task. It's just about getting rich, that's it.
What an elegant explanation of a complex subject that hits the sweet spot between over simplifying and getting mired in detail! Douglas Adams would smiled at the intrinsic dimension of natural language being *42* .
Indeed, a very elephant explanation, like an elephant in a porcelain shop. Oh, you said elegant! I'm sorry, I'm an AI and my language-database is small. The most I got was 41.
That looks exactly like a Receiver Operating Curve. Any signal tested against a threshold will exhibit this curve. A threshold false positive curve, also known as a receiver operating characteristic (ROC) curve, shows the relationship between true positive rate (TPR) and false positive rate (FPR) at different classification thresholds.
Coming from Computational Fluid Dynamics these graphs look like a limit of the resolution. Transferring to ml, the the number of neurons in the net.... Maybe I am missing something, but it does not seem surprising to me.
If you think of LLMs as simply compressing "meanings" from the training data in a high dimensional space, and interpolating between the stored meanings to provide a next token prediction, the trend line reflects something akin to the underlaying compression efficiency of a transformer.
So, in other words, it wont cross this line because it doesn't decompress a prediction of what the predicament is not. Waste of resources would that be? Yes. Is it worth it? It depends, are you scared of time travel, demiurge machines and eternity? Its great at predicting, but does not deduce.
that was my line of thinking also... It is a Virtual Intelligence mapping products of real intelligence. The dimension of LLM is the minimum meaningful pattern dimension of the used inputs. So basically it is the pattern storage dimension of human intelligence. That is also a reason of degradation. When VI is fed by its own products, you have a positive mismatch feedback. These VI's are not stable, because they are just imprints, and as such does not include the original control feedback of true AI. ------ The answer to the "Life and the universe and everything" is 42, because the dimension of the Original Idea of Creation was 42. And it makes sense only when you learn what the question really means
@@meleardil You're right that feeding them their own output won't help, but there is some benefit of identifying sparsely defined areas of the input space and augmenting it. The dimensionality is based on the compute budget, token embedding, scaling laws, etc. The last part, you're reading Douglas Adams on mushrooms eh?
@@luke.perkin.online "reading Douglas Adams on mushrooms " I am using my imagination for fun. If you have no idea what I am talking about, you shall seriously complain to your parents.
Difference between humans and chatbots: humans look at a small number of things obsessively and encode a small number of deep (complex) patterns. Chatbots (and many other AI systems) get more data than any human could possibly see in thousands of lifetimes thrown at them, and they encode a large number of shallow (simple) patterns. Both are intelligent, but in different ways. If somebody wants a system with human-like patterns, it's not obvious to me how you'd get that. For now, use chatbots for jobs they can do well with the pattern types that they have - which are more than good enough to do a lot of useful work.
@@GillfigGarstangima guess they mean some human soul thing given the gun and Yahweh worship. I’d also argue the odds of a human intellect being made by a company is vanishingly low, but for a very different reason. We are General Intelligence, and you’d want to build AI’s to be better than humans at specific tasks. Not a AGI
@@MycaeWitchofHyphae It depends on how you define AGI; if companies are actually serious about selling humanoid robots in the near future they have a definite incentive to crack AGI. We also don’t know to what degree human intelligence is actually ‘general intelligence’; we clearly come with a bunch of pre-made optimisations for developing certain useful behaviours and not others; we don’t need to try to learn our first language, interpret visual information, (usually) read social cues etc, but almost every human seems to require _lots_ of deliberate practice to be able to do mental arithmetic or learn to draw or acquire a second language in adulthood.
It is an empirical limit we are observing. Not a mathematical. But we are trying to understand the empirical limit using math. And NNs are not just matrices, with that view you are missing all the nonlinearity which is the difficult part.
I could be stupid but he really lost me around the 19:00 mark. I’ve seen science communicators who can explain complex topics so that laypeople can actually grasp them. I didn’t feel that in this video toward the end. Perhaps lay people aren’t his intended audience, but this is a UA-cam video so🤷♂️ Idk why people devote so much time to talking through equations when such discussions rarely add any additional meaning.
Zero error is achieved when dataset size is the number of fundamental particles in the universe, and the computational power is the number of interactions they undergo every planck time.
The answer is somewhat straightforward, if technical and boring. The error of a statistical model may be given by KL-divergence. The KL-divergence of a statistical model is zero when the cross-entropy is minimized. The entropy of human language sample, say a small sample via English Wikipedia, may be measured in shannons (bits). Adding one neuron to a model is roughly linear in units of shannons for a given neuron type because the parameter space of a neuron is something that can be saved in a 64-bit computer. A model needs at least as much entropy as English Wikipedia to predict it accurately, which is why a language model inevitably eats shit when it encounters something it hasn't learned before (it already used all of its bits in representing what it has seen). Where you place the neuron matters, but its contribution to reducing the KL-divergence is still at most linear. In fact, the upper bound should never be greater than the entropy of a single neuron. This is only on average since placement matters relative to poorly placed previous neurons. For example, although linear autoencoders may form a bottleneck at the latent space, the entropy of the neurons in the encode and decoder layers still matters. Maybe I'll write a book on it once I finish my current one. A high KL-divergence of two different humans with the same language model is basically the result of all human conflict, but it's hard to explain in a comment.
@@JensRoland If you are new to ML and are interested in some of the approximation theory, I found Chinmay Hedge's "Foundations of Deep Learning" to be a fantastic set of notes.
this sounds like what i call translation slip, where i am quite literally not able to properly understand what you are saying because of different underlying structures of the languages.
The last paragraph - arent you assuming a lot of things whrn you claim that if people understud eachother they would not have conflict? You can have perfect uderstanding and empathy of someone but if material conditions allow for only one of you to survive that would still lead to conflict
@@karlhans8304 Yes, both minds can be represented in terms of a common model/representational space, but the fact of discontinuity between organisms is FUNDAMENTAL to the nature of their existence and not incidental. Language and other forms of communication can transcend this difference to an extent and that is miraculous, but the difference is ultimately irreducible except in the mutual death of the organisms, which disintegrates their identities/homeostatic boundaries and renders them common to each other beyond that fundamental limit.
Natural, verbal, language is twice as fast as written. The approach for both is completely different : verbal, fast & ephemeral, written, slow perhaps millennia.
Great video. This is fascinating and I appreciated your analysis. Side note: if you sign up for a Brilliant trial, you might be very surprised when they charge your credit card $200. Set a reminder to cancel before the trial ends.
I wanted to study Nuclear physics and I kind of click on this video, suddenly on the moment of realization; Oh my god! He is that guy who tought me imaginary numbers.❤ I still don't forget how valuable those videos are. Thanks for them. Keep up! You deserve billions of subscribers ❤.
Haha. I don't really know much about the field you are on about, but it reminds me of "Tai's Model". Essentially a bunch of nurses needed to measure the area under a curve for a glucose chart and didn't know about calculus, so basically reinvented the trapezodial rule and published it as a paper. I do wonder how much time is wasted reinventing things purely because no one can know what they don't know.
@TheMajorpickle01 happens all the time. A major energy theorem in general relativity broke major ground in our understanding of the theory in the 60s. It lead to all sorts of practical solutions to tough problems in astrophysics etc... 40 years later a group of string theorists proved it again thinking string theory had finally made an important contribution to gravitational research... not an original contribution ... no strings needed for it to be true or to even prove the theorem. The guys actually thought that they made a major discovery about gravity... while literally being 40 years behind the tip of the spear. The same thing with the AI guys and stat mech and information theory... only they're actually further behind than the guys who pioneered the field in the late 1800s... it'd be funny if it wasn't so sad and expensive.
Could you actually make clear what exact finding in the AI field here have already been described in stat mech and information theory? I would like to read up on that.
AI folks are fundamentally confusing error rates assumed in statistical models for some grander law of ML or something. This reminds me of baysesian folks misunderstanding the problem of induction and thinking you can interpret a null for verification.
Especially when he said it could be a fundamental law of intelligent systems I chuckled... Is our brain not an intelligent system and therefore a piece of counterevidence to said law, since it can clearly learn more with much less data and much less compute than these LLMs are learning?
I thought we knew this law is just a function of AI being equivalent to stirring linear algebra curve fitters in, basically, the dumbest way possible. We know this because the counter factual is used all the time, even in AI. there are many cases where hand coded linear approximators are used instead of NNs, in transformer architecture for instance. The approximators algos do much much better in terms of performance per watt in these cases, hence why they're almost universally used instead of learned NNs for simple cases. That line is basically, 'well if you keep just stirring my weights using only basic back-prop and dumb simple loss mechanisms like a bunch of idiots, I'm not going to find smarter ways to do any of this without a shitton of weight stirring'
This….. you are completely right, it’s a function of the efficiency of simple feedback. It has a noise floor constantly rescrambling the weights with every operation, there’s no gating or non-linear behaviour in the back-prop preventing “stirring” on irrelevant data. I truely believe the biggest jump we see in advancement will be better “neurons” not better models….. don’t get me wrong, better models will keep coming and get amazing….. but overnight the will ALL get better with better back-prop. Apply what we learned from non-linear forward prop, and use that to sieve the back-prop….. that step function in intelligence could rock out world in a matter of hours, because every model out there can suddenly use the same weights and go back into training and find a whole new level in hours
@@geekswithfeet9137definitely, alternate backprop algorithms are really interesting also, skip connects are just a hack in this sense to get backprop to even work, we've already had studies showing that they might not be necessary and are just a hack
Applying this to large models is more speculative fiction than theory. Our only examples of high complexity come from nature where it is, without exception, an emergent property of simpler and simpler subcomponents.
@@geekswithfeet9137 architecture, we don't yet understand how cortex's do what they do, how they process data, and to be honest I don't think our current approaches will takes us very far at all. Can you imagine just how dizzyingly complex and utterly different the math of simulating a cerebral cortex would be ? from the primitive shit we are doing now ?
This seems like a very simple answer. If you're working with a logic concept where there is a concept of right/wrong, prefer/dismiss, the scaling is going to look like that. If every concept was on a segmentation of three equal values per learned task, then it would look like a 1/3, 3/1, or somewhere in between. The idea of continuing to be taught towards a specific goal, is going to be dualities ad infinitum in complexity, but still end up on a 1/2 2/1 scale.
I'm a ML Engineer. The answer comes down to 2 concepts: 1. Transistors. All computing ultimately boils down to the billions of transistors running on CPUs and GPUs. 2. Next-Token Prediction, ie Transformers using Attention & Self-Attention. Effectively, it is not believed whether increasingly massive computational capacity is the breakthrough needed to achieve AGI or even ASI. LLMs are just predicting the next token, or word in the sentence. LLMs fall within the goal of NLP (NLU/NLG). There are many other subsets with promise, ie GANs and Agent-based Reinforcement Learning.
If I understand correctly, the difference is that it has the capacity to interpolate between data points, which provides a limited kind of "understanding"
The difference is that a database actually stores the training data, so it's the same size as the data. LLMs and image diffusion models are around 20 000 times smaller than their training data, yet can still kinda recreate it. This is because they store patterns, abstractions, ideas etc about the data, instead of the data itself. Anthropic managed to find and adjust those ideas within their LLM, to create their brainwashed "Golden Gate Claude" AI that thinks it's a bridge. It's a fascinating experiment
@@vrdev4714 not even, it just stores weights, as in, probabilities ex: if the you give it ABCD, it knows the next letter has a 50% chance of being E or 50% chance of being a space based off the statistics it crunched during it's training pretty much just a massive weighted randomizer, that's why you need a seed
@@DDracee Lol what the hell are you talking about AI does not have randomness. It's called the output layer and the percent is how much each neuron is firing, between 0 and 1, ya know? Do you?
This is definitely not an unavoidable physical limitation, considering that brains exist. At a much smaller scale, smaller data set, and smaller power consumption, the human brain can learn much more efficiently. So these scaling laws may be exclusive to binary computers. The missing piece of this puzzle probably lies in advancements in neuromorphic computing, and an adaptation of LLM architecture to take advantages of such hardware such as synaptic plasticity and saving efficiency by using analog computing for the weights.
error rate of an "ai model" obviously could be 0 with a finite problem to solve, if you can define a mapping of each input to each output. This graph is fundamentally a *goal* of most ai models, to be able to compute problems more complex in a more resource efficient and generic way (accepting error rates), otherwise we'd just at some point map every combination of inputs to every possible output.
"Maybe what humans call chaos, Pixel would find perfect. Or perhaps symmetry and harmony might take on meanings so complex that human minds would perceive them as randomness or noise".
This is one of the most interesting videos on ML I've seen in the last 12 months, exceptionally well structured and visualized. Thank you very much @WelchLabsVideo!
Some of the craziest shit we've seen over the last few years is the ability to detect likeness between images. You can upload an image of something and computers can find similar images. If you upload a picture of a person, it can find other images of that same person even if it doesn't know who that person is (you uploaded an original picture). As this gets more powerful, the future will be a strange place none of us can predict.
I think this shows that LLMs tell us more about how language works for a collective group, rather than simulating the deductive function of a mind. It just runs into the limit of the discussion among the training group, or it runs into the limit of its own ability to approach that limit. After it runs out of reasonable predictions it goes all “Here be dragons,” and I doubt there will ever be zero dragons. I think that would require a fundamental change in how it works, not just getting bigger.
Interesting. And yet, despite the incredible decrease in Validation Loss with model size, and compute, the model still gets an incredible number of things "wrong", which suggests, there is a fundamentally different issue happening as well.
Given that I believe LLMs are, at their core, statistical models, I'm not surprised at all that error on a training set decreases with the number of variables. It's been very well known that you can improve the 'fit' of a statistical model (from single regression all the way here) by increasing the number of variables. The risk is, of course, overfitting. That could be one source of tension in the models, that by increasing the variables, you improve on the training set but not its actual predictive capability. I understand that GPT-4's training set is, effectively, the Internet... but there may be an issue with the fact that people just... talk differently online. Which would be another source of error.
If LLM Validation Loss works as described in the video, then what it is really measuring is the "believability" of the produced text. Low validation loss means the text is understandable and could have been conceivably written by a human. The correctness of the meaning conveyed by the text is a wholly different property, which is only indirectly captured by the LLM by virtue of the fact that most text written by humans (and thus the training data) is intended to convey correct information. Thus, for text to "look right" it at least has to "look correct". The LLM training optimizes for believability, and so the LLM makes mistakes very confidently and eloquently
They are not being trained with a reward function that has anything to do with the veracity of the output. It's just trained one token at a time. A very local, very greedy training.
@@NemisCassander > That could be one source of tension in the models, that by increasing the variables, you improve on the training set but not its actual predictive capability. The real problem is that the models can't actually "understand" what you're asking. They're just pulling a statistically likely next word. It's why prompt writing has become a job title - it's not sufficient to askChatGPT a question and expect it to intuit your meaning, you have to ask it the question in a way that navigates around it's statistical model and avoids answers you don't want. Certainly having more context (ie: more training data and a longer input buffer) can help mask over that problem, but it never gets rid of it. That said, current AIs have so much training done that the problem is rarely it's intrinsic lack of understanding - more often than not the problem is that it's been trained on data from the internet, and the internet is pretty famously full of crap. It's the culmination of all knowledge, including vast amounts of "knowledge" that was invented out of whole cloth. Garbage in, garbage out.
@@brexitgreens I don't know about the correct way. I'm expressing my feelings. When I first heard of Brilliant I was mildly interested, now I'm utterly annoyed, and there is zero chance I'm going to look into their service. Perhaps I'm unique, maybe everyone else loves the repetition, but I think there is such a thing as brand over expression. At some point you badger people to the point they don't want to hear from you again.
@@brexitgreens Advertisement is inherently bad and a system based on advertisement will value products that prioritise marketing over products that prioritise quality. The alternative is simple, abolish advertising and establish third party reviews as the main mode of spreading new products. You dont need someone to scream in your face telling you what you need, when you need something you can go to third party review systems and look for the best quality product
@@surfacepro3328 Yes that will totally work lol. They will simply dump their ad budgets into bribing the 3rd party reviewers. This has been done in nearly every review based space for years now. Try again.
Everybody here heard 42 and lost their shit. Everyone then ignored than in the next sentence he literally said the experiments confirmed it was a number closer to 100
Bravo, I do not speak in the Mathematical language that you know and use in this video, but so wel was it delivered, that even with my lack in the basic algebraic functions and LOGrytms I followed along in th evisual and symbolic sense understanding the general context. Bravo.
Soo well said James!! You just gave me a greater love for Porsche than I didn’t even think I could have! Really appreciate the passion you share for these wonderful works of art and engineering! Cars are cool!
@@venmis137 A hard mathematical limit regarding how intelligent a given mental structure can be, no matter what else you give it. Breaking such barriers would, in principle, require a different or evolved form of intelligence from the one that came before.
It is a natural effort vs outcome rule that exist in everything across the universe. Effort will always approach infinity before reaching a perfect/maximum outcome.
"A hard mathematical limit..." Err, within the _science-fiction_ universe of "Orion's Arm" - just to be clear, this isn't an actual mathematical result about minds, it's an idea from a work of collaborative fiction. (of course if anyone wants to correct me with links to published, peer reviewed papers etc. that'd be great - a quick search didn't come up with anything specific, though toposes themselves are obviously pretty significant in category theory etc. which is presumably what the idea is riffing on)
@@miklosprisznyak9102 As I said, "open" is unfortunately already far on its way to be a meaningless buzzword. And I don't like that that is the case, because it also devalues the term "Open Source".
@@mrossknekeep in mind that it's not just about a unit of measurement. Bits are the storage medium itself, and what we measure is compressability of information given a fixed amount of storage.
So.... If I am understanding this right (and I very well might not lol) then this means that we are facing a 90s audio issue... Just at a grander scale. In audio you can resolve frequencies that are half the sample rate in time, but need high enough bit depth to map the change in volume. Increasing the sample rate and bit depth just maps closer to the underlying physical reality. But to have any amount of accuracy in audio, you need an insane sample rate at a high bit depth, but which makes for really big files that quickly became unmanageable (for a 90s computer anyways). To solve this, we had compression. If two samples next to each other were the same, then you could short hand that. Instead of storing the full bit depth of each sample, you could just store the difference. But these methods only shrink things so much. Modern lossy compression relies on psychoacustics, basically saying that if a noise is overly complex, then we mask out most of what is heard, and then focus on things like rhythm and melody rather than our ability to follow the 3rd chair flute part that adds more flavor than melody. This approximation allows for massive multi-gig files to shrink down to hundreds of MBs with minimal appreciable losses, or the typical 3.5-5MB file sizes that we tend to view as acceptable losses. To get below the line we need some sort of psycho-reasoning filter compression for AI similar to the mp3 model for audio. It wouldn't be without errors... Just have errors that don't matter (or go unnoticed) and allow for far less data and compute to train from to resolve similar answers. Just as the trained ear can hear mp3 compression, someone who is more sensitive would notice these minor issues, but would still be able to pick out the signal from the noise. I wonder what that analog would be, and how it would apply to the training data.
8:40 I think the problem is context. Like if two words are synonyms, then there is no objectively "correct answer." So we need a model designed to do more than "find the statistically correct next word", it would need to be able to more deeply analyze the whole sentence in a way to "understand grammar."
The models understand grammar (they return grammatically correct output that is simply factually false), but grammar is not synonymous with meaning and knowledge.
I love pattern recognition. If you find similar patterns between two different phenomena whereas one is experimentally known and the other is a mystery, you can often use what's known of the one phenomenon to explain the other's.
I thought the same thing. Maybe the aforementioned short, but I think a different YT'er made an entire other video very similar to this, at least at face value, but this one dug right into the weeds.
facts! It's one of the reasons I feel AI as we know it right now is hype. Maybe with a new architecture or new theory, we could actually justify all the research. Like you said, kinda feels like a waste of compute for now
@@ChimbzZ The evolution of model design has proceeded at such a fast pace that we haven't focused on specializing the hardware for inference efficiency. From a physics perspective we are nowhere near the lower bound for inference power efficiency and it would be reasonable to design an inference engine in the nominal range of 20 watts at some future point (nowhere near that in the coming decades to be fair). We are going to hit an AI/ML winter again soon based on the realization of deployment costs for AI/ML but the research will then shift to bringing down those costs and progress will eventually be made launching the next boom cycle.
@@John-zz6fz AI winter? I doubt it. Have you seen o1-preview? use wrappers and embedded instances of this model, teach them to interact with the command and graphical interfaces (some companies are doing so already, but at a smaller scale than the large AI firms), optimize them for certain goals by creating datasets for lots of different operations and instructions (ranging from opening a browser, navigating the internet, updating a system, bypassing captcha, etc.), and you get agents, now deploy them for larger goals and optimize their performance as goals are successfully achieved (while keeping them bounded), then deploy them for arbitrary goals, and you have essentially unbounded agents with ownership and administrative rights navigating systems and the internet. You can teach them to embed themselves in low-level code or create abstraction layers (hypervisor rootkits in a sense), and they can stay in systems for a while. That's how you get to AGI in my view. It sounds scary but it does not need to be if they evolve in a more or less aligned manner. We already have a bunch of "bots" doing simple things on the internet. However, they lack the reasoning and generative power behind that powerful systems like o1-preview offer. The question is, is leaving a bunch of engineers and scientists and other people unemployed actually good for the economy? that can create other issues but I do not see those issues being "AI winters" or so. Thoughts?
The earliest computers in the 1950s-1960s were exceptionally slow and used tons of power, yet they were essential to our computing journey. Of course, what we're doing right now is indeed worth it. We can't improve efficiency without starting somewhere.
I think our current architectures are subject to these constraints. Which is why later AGI or ASI probably wont look anything like what we are doing right now.
That’s why Google is starting to analyze rat brains and primate brains. LLM is just one building block of intelligence, not AGI. We need to invent a better architecture than Transformer, or build something based on a brain-based cognitive architecture.
Becuase the search space increases the closer you get to the zero line. When you're working with an objective function (like in machine learning or neural networks), getting closer to the optimal solution (e.g., the "zero line" or the point where the loss function approaches zero) may require more fine-tuned adjustments. The "search space" represents the range of possible solutions, and as you approach optimality, the number of feasible solutions grows because small changes might still lead to acceptable performance. This makes it harder to find the exact best solution as you're navigating many near-optimal possibilities. Alternatively, if referring to gradient-based learning methods, as the loss or error approaches zero, the gradients may become smaller (vanishing gradients issue), which can cause the search process to slow down since the steps taken in the optimization become tiny, making it feel like the search space is expanding.
my brain was feeling stuffed from studying mechanics and all the calculations. watched this vid, understood prolly a fraction of the stuff but surprisingly, my mental fatigue is gone. thank you for that.
19:00 In some respects, as soon as you get into applying Taylor functions, this whole discussion reminds me of trying to model aspects of particle physics. 22:45 Yep! I'm not alone!
@@user_375a82 it is as simple as that. but you guys are afraid to fix the problem ot at least try it building a math graph with half negative zero and half positive zero,, yeah i know this sound crazy because, it's never been done before... plus i think you guys stil don't get what i say...feed me and house me for a year, or pay me for my time and your questions it's definitely worth the investments... i am going to become homeless soon, While I can figure out the secrets of the universe of math, i cannot find a job to feed myself or shelter myself from the rain... no kidding...
This is just the limits imposed by the laws of uncertainty which is inherent in our universe and given all data is derived from the universe it cannot reach beyond the fundamental limits of nature.
There is a fundamental problem of rediscovering prior results every few years. The so-called 'law' directly relates to optimization and can be found in many convex optimization (no surprises here, as the learning methods follow analogous optimization routines). Furthermore, people are often confused between asymptotic guarantees and finite time/iteration guarantee.
Because it's not AI. LLM technology is not AI. The Turing Test is a flawed methodology for determining AI, and LLMs were designed to pass that flawed test.
"We don't know why", of course we know why : Current AIs only guess the answer from a prompt. There's a hard limit at how accurately you can guess something, you will always fail sometimes. To cross the line you need actual intelligence : to be able to understand the question and find out the correct answer. To succeed in school you can either be a good cheater or good at learning.
But why? The human at the repair shop takes into account a few key of data points: The cost of the parts The cost of labor How much you are willing to overpay. Each of these has its own subpoints, until it rolls out an estimate. The human mechanic should do better than just a picture of the damage and AI.
@@HexerPsy All this can take many days and multiple visits. Imagine the whole process of insurance approval happening in a few clicks, and repair shops pick the vehicle, does their thing based on the inputs. Now, Imagine the driver is a lady. Unlocks many layers of complexity
NOOOO I WISH YOU DUG A LITTLE DEEPER INTO CROSS ENTROPYYYY. You are so good at explaining. Even though I know what Cross entropy loss is I was excited to see how you explained it. Either way thank you for this awesome video
Every time I hear stuff like this I have to remind myself that the human brain fits all of it's memory and processing power into a head sized container and only uses a few tens of Watts of energy to maintain operation continuously. Clearly the issue is with the method used. Our computers are already much faster and have more gates than a brain has neurons. And even simple creatures with tiny "bird brains" can do amazing things.
Do not forget neurons within the body connecting the head to The physical world as a RAG .
@@Donut-Goddam .. lets make a mechanus
We are, I don’t know how many functions each neuron computes. Every once in a while, we come up with an approximate scale of what we think the human brain computes, but we are consistently changing how much we think the human brain has in terms of compute power. It could be a quantum system though, and then the number of neurons would be almost irrelevant compared to its design.
Unfortunately there is a very real chance that human brains can never understand the theory of human brains well enough to discover its underlying mechanics...
Everything we see about brains so far implies they operate off a lot of hazy approximations combining real physical geometry, chemistry and electrical impulses into what somehow ends up a broth of consciousness.
To say we don't currently build computers this way is the underest of understatements.
Because we don't actually have memory at all lol. We have the likely hood that something occurred based on everything. That's why our memories can be completely different to someone else about similar events.
"What is the minimum theoretical dimensionality of natural language?"
... "42" o_o
You have been warned ⚠️ The Connections (2021) [short documentary] ❤
coincidence? i think not.
real answer is 69, it's nice.
I don't get it well, the intrinsic dimension of ALL natural language is 42? or this value is only for English language? If this value is only for English, which is the value for all other languages? for example Chinese, Spanish, Italian, etc?
After all these years, we finally know the question. What a relief!
I’m a physicist and I was like “so… it’s a gas”.
Statistical mechanics is more powerful than what people thinks
Yes, it’s no wonder the same people working on spin glasses also do research in neural networks and high-dimensional inference
You have been warned ⚠️ The Connections (2021) [short documentary] ❤
definitely some electron gas activity going on
How close is it to phase-transition? In this analogy, what's the pressure variable? What happens when we make a supercritical fluid?
"everything is ideal gas"
- Mongo Einstein
These are Language Models, and it's well known that natural languages follow Zipf's Law, where word frequencies adhere to a power-law distribution. Because LLMs are trained to learn and predict patterns in language, it’s clear that they must also exhibit this behavior. In fact, this could explain why LLMs seem to hit an efficiency ceiling-they are constrained by the power-law nature of language itself. As the models improve, their gains become increasingly marginal, particularly when dealing with rare words and complex language structures.
but the higher dimensional woowoo crystals say I'm going to take a brave risk and form a new lasting relationship with someone unexpected if I'm mindful of my dietary choices and donate 1$...
Now I do not need to make this comment!
Zipf's law could apply even with extremely low-entropy (easily modelled) data, it's a feature of the alphabet/dictionary/etc not the semantic volume of things you could express.
Sounds like the Law of Diminishing Returns is a universal law, just with different names when different people discover that it also applies to whatever they are studying.
@@speedstyle. But it provides an upper bound for the entropy of language right? Assuming words are independent.
Two 20 Watt human brains looking at a 20 million Watt supercomputer operating for 3 months costing $200 million.
“Look at what they need to mimic a fraction of our power”
Yes, but given that brain developed over such time span, it would be too soon to say anything
These two brains had required 20+ years of training/education to even have been able to say that, mind you.
To be fair our brains have had literally millions of years worth of r&d through evolution
@@mrufa You can't "train" a human mind as quickly as a supercomputer, true, but that's only because there's a throughput issue. A supercomputer is a warehouse sized machine that requires tremendous energy and can only perform specialized tasks. The human mind is slower but much more efficient.
@@JeffNeelzebub is it, despite the time it takes to learn stuff? I wonder... Probably yes but will it stay on top until the end of the century, or even the decade, for that matter?
"the intrinsic dimension of natural language is 42"
we all knew it
Everyone's talking a out how it's 42, I don't get how that's special, can someone explain?
@@Zeni-th. Hitchhiker's guide to the galaxy
@@triffid0hunter how do these two things relate?
@@Zeni-th.Answer to life, the universe and everything is 42.
But no one knows the question.
@@wumi2419 I mean, we would. But the FUCKING VOGONS.
Possibly shouldn't be a surprising relationship:
Thermodynamic entropy and entropy in information theory are related and it tells us that each bit of information has an minimum cost in terms of energy.
When you plot cross entropy, you're plotting missing information. It would make sense to flip the y axis and consider that to be how much information was learned.
When you plot compute, you're also plotting energy, which is directly proportional to the information and therefore should produce a straight line.
Not all models/learning schemes are 100% efficient so they are constrained to one side of that line. The other side represents a thermodynamic impossibility. It would break the 2nd law because the entropy of the universe (increased by the heat output of your GPUs, decreased by your model learning) would decrease.
That raises the question then of how we do it. Why can we cross to the other side of that line with our thoughts? What do we do differently?
@@GodwynDi That statement assumes that human brains *do* cross that boundary, but that is not a sound assumption. I look forward to your upcoming paper investigating that. :)
@@GodwynDi We don't cross the line. The better question is how we get so close to the line using a hunk of wrinkly wet meat that works completely differently and uses much less power. And I think the answer is that most of us most of the time are just parroting words that we came up with as a culture or species. So a fairer comparison is thousands or millions of brains coming up with an intelligible language while each individual brain is often just using minimal effort and energy to repeat it back.
@@liam3284The relationship is that thermodynamic entropy is a statement about the arrangements of the microstates of the elements of the system which is exactly information encoding.
I think the explanation is rather simple and has to do with irreducible complexity of objective reality. Basically you cant have precise knowledge (eliminated errors) without sufficient many axioms to begin the process of elimination. At least not from a model. Growing compute is nothing else, but a growing number of data points which are used to pin-point the result with some degree of certainty, but because we generally train models for real world applications the objective reality from which the training data generally comes from and since objective reality does not have some "magic knowledge" that defines the reality more than it defines reality you cant get past that line when you plot error against compute. We just cant provide the model with cheat sheets for its applications. I'm pretty sure one could breach that line in theory by providing some kind of cheat sheet for the model to take shortcuts.
22:25 I remember reading at least one compelling paper that argued that emergent functions like this is more a property of the way we measure the model's functions, rather than a step change in the ability of the model. You might want to look into that
i was gonna guess that it was just a limitation of the data being collected in such a way that it couldnt ever produce a value on the far side of that line…
Can you find the paper? I really like what you said there.
@@pritamlaskar sure, I was meaning to hunt it down anyway.
It's called "Are Emergent Abilities of Large Language Models a Mirage?"
The relevant quote from the abstract:
"Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance."
Someone from Google way back in the early 2010s predicted this limit to LLMs. Not a well published paper most people didn't wanted to hear it at the time.
@@jcorey333 That's a brilliant quote from the abstract. Translated into plain English: "We ended up fooling ourselves by measuring things without questioning what the measurements represent."
I am blown away by both the implication of those papers and also (and especially) by your ability to convey so much information in a 24 minute video that makes it understandable to amateurs in this field like me.
Einstein's first name is not John.
Einstein's first name is approximately Alpert.
Einstein's first name is probably not Alphonso.
Einstein's first name is derived from the Germanic Adalbert.
Einstein's first name is Eduard. (Albert's second son.)
I don't think there's a single sentence segment in any human language that you can come up with that has only one correct solution for the next word.
The opposite of true is xxxxx
If a 2d shape has 3 straight sides it is a xxxxxxx
The third planet from the sun is xxxxx
@@shearerforgold
"The opposite of true is..."
fallacious
not correct
affected
counterfeit
deceptive
untrustworthy
unjust
imprecise
(and dozens more)
@@shearerforgold
"If a 2d shape has 3 straight sides it is a..."
polygon
trigon
non-spheroid
wedge
triagonal
trilateral
(and more)
Of course the answer is 42. Always was!
No, answer is 34. Otherwise this BS, which they name AI, but has nothing to do with "intelligence", would be gone as any other trash.
Same as VR. There are 20 times more VR games following rule 34 than regular ones.
The ultimate qiestion of life, universe and everything!
Ludwig Boltzmann would strongly agree ...intelligence is a statistical approach
@@DarkFox2232in what way does the search and learning seen in AI models differ from intelligence seen in biological systems? Of course there are differences in implementation, but what about concept?
@@vastabyss6496 If person is born brain dead, but part responsible for heart and lungs work. Will you call it "intelligent"?
See, ability to recognize patterns and learn from them is simplistic way to describe intelligence. If human being required same amount of trial and error in learning process as those models. And still delivered so mediocre results... There would be no humanity. Because no human would in its life cycle manage to elevate himself from being wiggly worm in dirt.
You can't even imagine magnitude of "machine learning" which costs $500 billion USD or more. What they have is comparable to human with 0.0001 IQ. And they bet on scale. They are brute forcing their way through mediocrity.
But they'll not succeed because when they press play on that generated model. What it simulates is not intelligent approach to problem solving. It is what would be called "intuitive" approach to problem solving. Same as experienced people do things by "gut feeling". Except that those people did get to that point through experience processed by incomparably high intelligence and gaining wisdom. (Even if their IQ was actually just about average.)
Models we use now are intuitive mimic game. And that's good for something and bad for most of other tasks. Like generating images. Why they look good? because there are high quality samples and good blending algorithm. But why they keep failing at basic concept of human posture, shape, limb/finger/teeth/... count and placement? Because it is same as if you take fashion magazine. Cut out parts of people, dresses, ... and give it to 3 y/o mediocre kid and say: "Create me: "
It will have those nice looking parts. Everything is detailed and realistic. But result will be grotesque. Because that kid still lacks fundamental understanding of reality. Its mind is not mindful enough.
The quality of your videos is worth the wait
You have been warned ⚠️ The Connections (2021) [short documentary] ❤
He lost me at 0:01.
I shitted while I was listening.
You mean the delusion? AI is pure guessing algorithms.
So, FWIW, it is well known in science that log-log plots nearly always end up looking linear. It is a feature of log-log plots and/or the unlikeliness that any system is super exponential.
thank you, this needs to be higher
exponential curves are still exponential on log-log plots, its only polynomials that become linear.
log(y) = m*log(x) + b
y = exp(m*log(x)+b)
y = exp(b) * x^m
When it crosses the line AI will learn to say "I don't know" and stop hallucinating. The 'I dont know' factor being absent from a vector being driven through higher dimensional space mathematically seems like a hard limit without some sort of mock self awareness strapped to that process.
This is the smartest, most practical observation I’ve seen yet. AI is pretty useless without this sort of thing.
I like that so many people are becoming interested in the field of AI. More people working on a subject is bound to lead to breakthroughs. But, I think there might be a slight misunderstanding about how LLMs actually work.
AI models are essentially pattern-matching systems. They take an input and produce an output based on the data and algorithms they’ve been trained on. When we talk about AI ‘learning,’ we’re referring to its ability to improve predictions or outputs based on past data, not the kind of conscious understanding or introspection humans experience.
Adding an 'I don’t know' response isn’t as simple as flipping a switch-it requires mechanisms to estimate uncertainty reliably and suppress low-confidence outputs. While some AI systems do incorporate confidence thresholds, they’re not equivalent to a human's ability to 'know what they don’t know.' Achieving that kind of awareness would likely require breakthroughs in self-aware architecture, which current AI does not possess.
Your point about self-awareness is thought-provoking, and I agree that pursuing architectures capable of true self-awareness could be revolutionary. Unfortunately, current research is heavily driven by commercial priorities, which means a lot of resources are focused on systems like LLMs that are practical but not necessarily a step toward self-aware AI. After all, "robot overlords" would threaten their fiscal quarters.
Note: I asked ChatGPT to help me rewrite my original comment, as I felt it came across as rude, and ChatGPT said it comes across as condescending. So if it seems overly eager to please, that's why.
21:33 I’m not a fan of numerology, but it is funny - how the dimension of natural language happens to be 42 (just like the "Answer to the Ultimate Question" number from "The Hitchhiker’s Guide to the Galaxy"). :)
But does it know Obama's last name?
Or his birthplace?😮🤔
The real questions
You mean Baracco Barner?
It will correctly answer Barack
Even if he was born in Hawaii that shouldn’t be a US state 😅 The natives still hate the white man there and it is overpriced tourist trap
"Einstein's first name is …”
Einstein's first name is universally known.
Einstein's first name is known by most people.
Einstein's first name is not an example of name weirdness.
Many possible next words....
what is even the point of this technology if not to confuse people, make them more miserable, and take their money?
Exactly. To set it at 1.0 for "Albert" would be ridiculous.
@@omfgacceptmyname The point of A.I. technology was not business, initially. It was all ways to try lo learn how our brains work and what exactly intelligence is. Although not involved in any project, I've dedicated all of my life to find out such things, and finally got enough insights to get the puzzle solved to all of its pieces just two years ago. It took me longer than half a century. Fairly ingenuous but mathematically simple. But current A.I. … I call it Artificial Idiocy. They try to sell an obviously incomplete and faulty tool as a complete, wonderful magic wand, without really knowing what intelligence is and how it works.
6:58
@@wafikiri_ The reasoning ability of AI is the selling point. It's not an autonomous agent, but it can still fill the role of a support chat, site scraper, idea generator, etc. It's even better since LLMs are literally made to generalize data, so they can produce any media with at least some accuracy. One big improvement is in facial recognition/image generation, while other improvements involve the study of proteins and viruses that are very difficult for humans to observe.
The curated crap we're being sold now will change as AI is weaponized in other forms. That line the video shows at the beginning can be considered a barrier for AI becoming integrated with everyday life. Until then, it will remain in the apps as a novelty or generalized tool to avoid tedious work, but I believe we are almost at the point of creating something mistaken for sentience.
Another fascinating point is how well our observations in neural biology follow similar power scaling laws. The human brain seems to fit very nicely on the primate scaling curve and (not surprisingly) points to an adaptation within primates for superior cognitive scaling performance vs other mammals. There are obviously important distinctions between ML and our brains. Models like GPT-4 are highly specialized and would be a better comparison to the sub-network of regions in our brains that processes language. Lastly, an area where we are significantly lagging in capability, is the Abstraction and Reasoning Corpus (ARC). Human scores against ARC are in the 80% range whereas our best algorithms are in the range of 30% and of course all of the most interesting applications of AI/ML will heavily task abstraction and reasoning tasks. We have LOTS of work left to do so please don't fall into the trap of thinking we just need to throw more GPU's at this and we somehow get to the singularity... we are still missing very important stuff but the progress we have achieved is also incredibly impressive.
That's a fascinating comparison and really does line up with this. Agreed about the GPU brute force approach, there's obviously still missing components.
I like your viewpoint. There's still so many unknowns and questions to be answered regarding AI. From the technical to the philosophical. I wonder if we will ever get there and what that journey will bring for our species.
GPT 4-o1 looks to have made some significant headway in ARC.
@@michaelsutherland5848 no, It didn't. The only change is a small improvement in chain of thought
The map is not the territory, entropy without complexity, is like a ratio without proportions, (figure/ground).
0:30 ish - Ok, first impression comment... ... can anything cross that line? The way the graph is presented it appears asymptotic so AI isn't even part of the equation... granted, I don't yet understand the meaning of the graph and am just going on naive impression of graphs in general and it will probably be explained in the video.
that's how it looks to me too
like you can't go back in time in a physics time graph...
It just seems to me that crossing that line would mean a negative error rate? Like the AI would have to make a negative number of mistakes? There's no way this is correct, but it's what it seems like from the charts.
@@tompycz2225right? Does this get explained?
We need to ask an ai to explain it...
I mean doesn't that apply to all life too? There would be a limit of what a human brain can do
I didn't know that you were the one that made those videos on imaginary numbers
A decade later that is still the best explanation I've ever seen
I wish I had that in college, and I still reference it to other people when imaginary number conversations come up
42? But what is the question? ... ohhh!
What is the intrinsic dimensionality of language? 42! (point one but actually it's more like 100 or something mumble mumble mumble)
😂😂
@@hargisss Well I think that 100 is a much better estimate since 1405006117752879898543142606244511569936384000000000 is way too large
You have been warned ⚠️ The Connections (2021) [short documentary] ❤
Primagen reference?
Its not the AI model, its a property of the dataset - that's the only commonality. The fact that it follows a power law is a significant indicator. Most statistical linguistic experts will be able to point to many such power laws that appear when we measure human languages. The most commonly known is word-frequency power law, so well know that it has a name, Zipf's Law. Regardless of language, regardless of what collection of works, the top 100 words comprise approximately of half the collection and the next top 10000 words comprise the remaining half. Power laws appear in a lot of AI datasets because most complex data exhibits these power law properties, and folks generally only apply AI on complex problems.
so much in that beautiful formulaeaeum
This is interesting, but I don’t see how it explains what we’re seeing in AI.
Why would the error of the system, which is a measure of its ability to learn, reach this limit?
Larger models perform better, and models with more training data perform better.
The question is why we can’t squeeze any more performance out of a fixed model size.
Seems to be a limitation of the network, not the data.
@darkstar4494 But you're still using essentially the same model at the end of the day, to see drastic differences without changing model size you'd have to use a different model.
@@Ansalion agreed.
Can a massive leap in model design (as transformers were) could get better performance for a fixed model and data size? That is the question posed by this video.
Right now, model selection doesn’t make much difference.
I don’t see what this has to do with the question I asked about power laws in linguistics.
@@darkstar4494 Its more a case of a power law dataset being simultaneously easy to learn but also difficult to master. Half the dataset is easy to learn because examples are abundant. The other half of the dataset has a sparsity issue where examples become increasingly rare, eventually to the point where you only get to words that only appear once. It doesn't matter how many TB of data you have, there will always be these rare words. And regardless of the model used, all models have issues learning when there are only a few or just 1 example to learn from.
And that's why initially the AI learns rapidly, reducing its error rate very quickly, but eventually an inflection point occurs when it becomes increasingly difficult to learn from a lack of sufficient data.
I think this trend makes a lot of sense.
AI models don't actually learn. They just recognize and copy patterns, and in a complicated way sort of extrapolate and interpolate those patterns onto new data. The more data you train it on, the higher the accuracy of the AI's predictions. Letting it train for longer then, is just the AI recognizing the patterns in the training data better, and having more training data gives it more data to interpolate/extrapolate from. But no matter how good your algorithm is (Which AIs currently basically are), there is always a limit to how much data can be recovered from a limited amount of samples.
If you look at AI as being very fancy data interpolation machines, this result makes a lot of sense.
The only way I can see that might break this pattern is by having AI actually *learn* rather than do pattern recognition. You'd need an AI that's trained for being able to learn rather than to do one single task, but that's difficult because you'd have to measure learning ability, and making and reviewing your own mistakes is a big part of learning. Having curiosity and a drive to experiment is a part of that too. Such decisions aren't very rational and measurable. You'd need to evaluate something as abstract as the ability to learn. That's definitely not an easy thing to do.
This is an active area of research, and you might enjoy looking into Continual Learning, Multi-Task Learning and Meta-Learning. Probably the largest gap between AI and animal intelligence is the fact that our brains and bodies in general are hierarchical systems with abilities like neuromodulation (neurotransmitters), neurogenesis and metaplasticity (neurons/signals modifying other neurons). As you're pointing out, it's quite clear that our cognition requires more than a universal approximator. Also, what you're referring to as "actual" learning is typically called human learning, as the distinction is important but disqualifying neural networks from learning would make comparisons between the two more clunky.
@@tukib_ I suppose yeah, current AI models do learn. Just not like humans.
If you think about it, yeah, if you want AI with human-like intelligence you're gonna need similar mechanisms. Similar to dopamine in our meat computers, it'd need to be able to give itself a reward for learning new things, and the system for that has to be integrated with the rest of the model in a way that it can understand what's important to learn / improve at, while also not getting stuck in a trap of giving itself infinite reward without being productive.
It would need to be a closed loop system of multiple systems acting on eachother, that can also determine reward for the subsystems, while also still being productive to humans.
Getting a bit philosophical here, but I think the only way to achieve that is by having it either value its own existence, or value being useful to humans, more so than it would value its sense of productivity, in a way that's hard-coded to ensure it can't possibly break that loop (somewhat similar to human emotion). If it doesn't value either of those, it will not care about actually being productive to humanity, and likely create a closed loop of giving itself infinite reward by setting arbitrary goals for what counts as being "productive" (similar to how we might get drawn to memes or tiktok, which activate our dopamine reward system because our brain recognizes it as productivity).
But at that point, I'd say it gets eerily similar to being a conscious machine. Either the machine values it's own existence, which will definitely turn out bad for humanity if it gets sufficiently advanced, or the machine feels an obligation to serve humanity, which.. kinda feels like having super intelligent dogs?
As dystopian as the latter sounds, I don't think having machines with some form of emotion that are solely dedicated to serving humans is necessarily a bad thing as long as they're content with their existence.
Anyways, enough rambling about philosophy. I am also very much interested in the actual mechanics of AI, so I'll definitely be checking out Continual Learning, Multi-Task Learning and Meta-Learning as you mentioned.
The problem is AI predicts the next word, it doesn't learn as you said, when I learn language I learn words, grammar and whatever else
@Ab3ndcgi oppenheimer is not a hero just a sad story
aalso if humans didn't ask 'why' you would not be here criticising scientists while being an egotistical nobody
@Ab3ndcgi The "why" is the potential for the model owners of automating a lot of jobs out of existence and collecting rent on the cost difference between a human doing the task and an AI doing the task. It's just about getting rich, that's it.
What an elegant explanation of a complex subject that hits the sweet spot between over simplifying and getting mired in detail!
Douglas Adams would smiled at the intrinsic dimension of natural language being *42* .
Indeed, a very elephant explanation, like an elephant in a porcelain shop.
Oh, you said elegant! I'm sorry, I'm an AI and my language-database is small. The most I got was 41.
@@AniMageNeBy💀
Fr this video offered me no explanation. I am only more confused. Listening to all this math is like listening to Vorgon poetry.
That looks exactly like a Receiver Operating Curve. Any signal tested against a threshold will exhibit this curve. A threshold false positive curve, also known as a receiver operating characteristic (ROC) curve, shows the relationship between true positive rate (TPR) and false positive rate (FPR) at different classification thresholds.
My brain predicts one word after watching this - excellent.
Coming from Computational Fluid Dynamics these graphs look like a limit of the resolution. Transferring to ml, the the number of neurons in the net.... Maybe I am missing something, but it does not seem surprising to me.
If you think of LLMs as simply compressing "meanings" from the training data in a high dimensional space, and interpolating between the stored meanings to provide a next token prediction, the trend line reflects something akin to the underlaying compression efficiency of a transformer.
So, in other words, it wont cross this line because it doesn't decompress a prediction of what the predicament is not.
Waste of resources would that be? Yes.
Is it worth it?
It depends, are you scared of time travel, demiurge machines and eternity?
Its great at predicting, but does not deduce.
that was my line of thinking also...
It is a Virtual Intelligence mapping products of real intelligence.
The dimension of LLM is the minimum meaningful pattern dimension of the used inputs.
So basically it is the pattern storage dimension of human intelligence.
That is also a reason of degradation. When VI is fed by its own products, you have a positive mismatch feedback.
These VI's are not stable, because they are just imprints, and as such does not include the original control feedback of true AI.
------
The answer to the "Life and the universe and everything" is 42, because the dimension of the Original Idea of Creation was 42. And it makes sense only when you learn what the question really means
@@meleardil You're right that feeding them their own output won't help, but there is some benefit of identifying sparsely defined areas of the input space and augmenting it.
The dimensionality is based on the compute budget, token embedding, scaling laws, etc.
The last part, you're reading Douglas Adams on mushrooms eh?
@@luke.perkin.online "reading Douglas Adams on mushrooms "
I am using my imagination for fun. If you have no idea what I am talking about, you shall seriously complain to your parents.
@@meleardil wasn't very imaginative just quoting him
Very well explained to someone who knows very little about how LLMs works. Thank you.
Difference between humans and chatbots: humans look at a small number of things obsessively and encode a small number of deep (complex) patterns. Chatbots (and many other AI systems) get more data than any human could possibly see in thousands of lifetimes thrown at them, and they encode a large number of shallow (simple) patterns. Both are intelligent, but in different ways. If somebody wants a system with human-like patterns, it's not obvious to me how you'd get that. For now, use chatbots for jobs they can do well with the pattern types that they have - which are more than good enough to do a lot of useful work.
You’ll never get human level intelligence because of what that entails
@@AR15andGODWhat do you mean by this?
Read his username his opinion means nothing. @@GillfigGarstang
@@GillfigGarstangima guess they mean some human soul thing given the gun and Yahweh worship.
I’d also argue the odds of a human intellect being made by a company is vanishingly low, but for a very different reason. We are General Intelligence, and you’d want to build AI’s to be better than humans at specific tasks. Not a AGI
@@MycaeWitchofHyphae It depends on how you define AGI; if companies are actually serious about selling humanoid robots in the near future they have a definite incentive to crack AGI.
We also don’t know to what degree human intelligence is actually ‘general intelligence’; we clearly come with a bunch of pre-made optimisations for developing certain useful behaviours and not others;
we don’t need to try to learn our first language, interpret visual information, (usually) read social cues etc, but almost every human seems to require _lots_ of deliberate practice to be able to do mental arithmetic or learn to draw or acquire a second language in adulthood.
21:30 - so 42 is indeed the answer!
It was a documentary all along
It always was
How many people out there have no idea what he means by this? Go read the Hitchhiker's guide series.
That was such a good video, everything was so good! Showing the math with the images you cut out everything makes sense. Thank you!
It may be a mathematical limit. Let's remember these neural networks are fundamentally large matrices.
With elements that are discrete, versus brain elements which are not.
@@johnsmith1953xanalog computing should make a comeback, what if that’s the key to everything
@@Max-zi5wx At that point may as well use quantum computers since it adds hype
It is an empirical limit we are observing. Not a mathematical. But we are trying to understand the empirical limit using math. And NNs are not just matrices, with that view you are missing all the nonlinearity which is the difficult part.
@@Max-zi5wx Shhhhhhhhh!! Shush!!!
This video deserves not just an award, but an entire new category of award for distilling such amazing insights into 24 minutes.
I could be stupid but he really lost me around the 19:00 mark.
I’ve seen science communicators who can explain complex topics so that laypeople can actually grasp them. I didn’t feel that in this video toward the end.
Perhaps lay people aren’t his intended audience, but this is a UA-cam video so🤷♂️
Idk why people devote so much time to talking through equations when such discussions rarely add any additional meaning.
Zero error is achieved when dataset size is the number of fundamental particles in the universe, and the computational power is the number of interactions they undergo every planck time.
Nvidia to the moon?
Error can never be 0. This is what Hume pointed out with the problem of induction. We’ve known this for hundreds, if not, thousands of years.
@@hamm8934 idk bro im flawless
And the compute energy would destroy the dataset making the particles unobservable
They literally said in the video that's not true, even with infinite sizes
The answer is somewhat straightforward, if technical and boring. The error of a statistical model may be given by KL-divergence. The KL-divergence of a statistical model is zero when the cross-entropy is minimized. The entropy of human language sample, say a small sample via English Wikipedia, may be measured in shannons (bits). Adding one neuron to a model is roughly linear in units of shannons for a given neuron type because the parameter space of a neuron is something that can be saved in a 64-bit computer. A model needs at least as much entropy as English Wikipedia to predict it accurately, which is why a language model inevitably eats shit when it encounters something it hasn't learned before (it already used all of its bits in representing what it has seen).
Where you place the neuron matters, but its contribution to reducing the KL-divergence is still at most linear. In fact, the upper bound should never be greater than the entropy of a single neuron. This is only on average since placement matters relative to poorly placed previous neurons. For example, although linear autoencoders may form a bottleneck at the latent space, the entropy of the neurons in the encode and decoder layers still matters.
Maybe I'll write a book on it once I finish my current one. A high KL-divergence of two different humans with the same language model is basically the result of all human conflict, but it's hard to explain in a comment.
Where do I sign up for more? This is fascinating and potentially vitally important for AI development
@@JensRoland If you are new to ML and are interested in some of the approximation theory, I found Chinmay Hedge's "Foundations of Deep Learning" to be a fantastic set of notes.
this sounds like what i call translation slip, where i am quite literally not able to properly understand what you are saying because of different underlying structures of the languages.
The last paragraph - arent you assuming a lot of things whrn you claim that if people understud eachother they would not have conflict? You can have perfect uderstanding and empathy of someone but if material conditions allow for only one of you to survive that would still lead to conflict
@@karlhans8304 Yes, both minds can be represented in terms of a common model/representational space, but the fact of discontinuity between organisms is FUNDAMENTAL to the nature of their existence and not incidental. Language and other forms of communication can transcend this difference to an extent and that is miraculous, but the difference is ultimately irreducible except in the mutual death of the organisms, which disintegrates their identities/homeostatic boundaries and renders them common to each other beyond that fundamental limit.
I've been waiting for this longer form video since seeing your short on this. Thanks!
Written language is not "natural language", it's a lossy downsampling of natural language.
LOL... I think in "Higher Dimensions" than I can ever express on paper.
Natural, verbal, language is twice as fast as written. The approach for both is completely different : verbal, fast & ephemeral, written, slow perhaps millennia.
Great video. This is fascinating and I appreciated your analysis. Side note: if you sign up for a Brilliant trial, you might be very surprised when they charge your credit card $200. Set a reminder to cancel before the trial ends.
Holy crap. Better use other stuff then, like Khanacademy, etc.
I wanted to study Nuclear physics and I kind of click on this video, suddenly on the moment of realization;
Oh my god! He is that guy who tought me imaginary numbers.❤
I still don't forget how valuable those videos are.
Thanks for them. Keep up! You deserve billions of subscribers ❤.
re on the error line intersecting the x-axis: the graph is log/log, so the zero corresponds to e**0, not 0
Thanks for including references in the description, I really appreciate that.
As someone who used your Imaginary Numbers Are Real series as “further research” for my Precalculus students, a physics book is exciting!
First actually interesting ai video of the 2024.
Fundamental law? Nope you've discovered the limitations of linear algebra and information encoding. Problem is solved... just not in your field.
Haha. I don't really know much about the field you are on about, but it reminds me of "Tai's Model". Essentially a bunch of nurses needed to measure the area under a curve for a glucose chart and didn't know about calculus, so basically reinvented the trapezodial rule and published it as a paper.
I do wonder how much time is wasted reinventing things purely because no one can know what they don't know.
@TheMajorpickle01 happens all the time. A major energy theorem in general relativity broke major ground in our understanding of the theory in the 60s. It lead to all sorts of practical solutions to tough problems in astrophysics etc... 40 years later a group of string theorists proved it again thinking string theory had finally made an important contribution to gravitational research... not an original contribution ... no strings needed for it to be true or to even prove the theorem. The guys actually thought that they made a major discovery about gravity... while literally being 40 years behind the tip of the spear. The same thing with the AI guys and stat mech and information theory... only they're actually further behind than the guys who pioneered the field in the late 1800s... it'd be funny if it wasn't so sad and expensive.
Could you actually make clear what exact finding in the AI field here have already been described in stat mech and information theory? I would like to read up on that.
AI folks are fundamentally confusing error rates assumed in statistical models for some grander law of ML or something. This reminds me of baysesian folks misunderstanding the problem of induction and thinking you can interpret a null for verification.
Especially when he said it could be a fundamental law of intelligent systems I chuckled... Is our brain not an intelligent system and therefore a piece of counterevidence to said law, since it can clearly learn more with much less data and much less compute than these LLMs are learning?
I thought we knew this law is just a function of AI being equivalent to stirring linear algebra curve fitters in, basically, the dumbest way possible. We know this because the counter factual is used all the time, even in AI. there are many cases where hand coded linear approximators are used instead of NNs, in transformer architecture for instance. The approximators algos do much much better in terms of performance per watt in these cases, hence why they're almost universally used instead of learned NNs for simple cases. That line is basically, 'well if you keep just stirring my weights using only basic back-prop and dumb simple loss mechanisms like a bunch of idiots, I'm not going to find smarter ways to do any of this without a shitton of weight stirring'
This….. you are completely right, it’s a function of the efficiency of simple feedback. It has a noise floor constantly rescrambling the weights with every operation, there’s no gating or non-linear behaviour in the back-prop preventing “stirring” on irrelevant data.
I truely believe the biggest jump we see in advancement will be better “neurons” not better models….. don’t get me wrong, better models will keep coming and get amazing….. but overnight the will ALL get better with better back-prop. Apply what we learned from non-linear forward prop, and use that to sieve the back-prop….. that step function in intelligence could rock out world in a matter of hours, because every model out there can suddenly use the same weights and go back into training and find a whole new level in hours
@@geekswithfeet9137definitely, alternate backprop algorithms are really interesting
also, skip connects are just a hack in this sense to get backprop to even work, we've already had studies showing that they might not be necessary and are just a hack
Applying this to large models is more speculative fiction than theory. Our only examples of high complexity come from nature where it is, without exception, an emergent property of simpler and simpler subcomponents.
@@geekswithfeet9137 architecture, we don't yet understand how cortex's do what they do, how they process data, and to be honest I don't think our current approaches will takes us very far at all. Can you imagine just how dizzyingly complex and utterly different the math of simulating a cerebral cortex would be ? from the primitive shit we are doing now ?
This seems like a very simple answer. If you're working with a logic concept where there is a concept of right/wrong, prefer/dismiss, the scaling is going to look like that. If every concept was on a segmentation of three equal values per learned task, then it would look like a 1/3, 3/1, or somewhere in between. The idea of continuing to be taught towards a specific goal, is going to be dualities ad infinitum in complexity, but still end up on a 1/2 2/1 scale.
It is a relief to see such intelligent video with honest information having over 1.1m views. Hopefully peeps stood till the ending.
I'm a ML Engineer. The answer comes down to 2 concepts:
1. Transistors. All computing ultimately boils down to the billions of transistors running on CPUs and GPUs.
2. Next-Token Prediction, ie Transformers using Attention & Self-Attention. Effectively, it is not believed whether increasingly massive computational capacity is the breakthrough needed to achieve AGI or even ASI.
LLMs are just predicting the next token, or word in the sentence. LLMs fall within the goal of NLP (NLU/NLG). There are many other subsets with promise, ie GANs and Agent-based Reinforcement Learning.
4:57 is it wrong to say that "current AI is just a fancy database that stores tokenized data that is retrieved via natural human language"?
Yes
If I understand correctly, the difference is that it has the capacity to interpolate between data points, which provides a limited kind of "understanding"
The difference is that a database actually stores the training data, so it's the same size as the data. LLMs and image diffusion models are around 20 000 times smaller than their training data, yet can still kinda recreate it. This is because they store patterns, abstractions, ideas etc about the data, instead of the data itself. Anthropic managed to find and adjust those ideas within their LLM, to create their brainwashed "Golden Gate Claude" AI that thinks it's a bridge. It's a fascinating experiment
@@vrdev4714 not even, it just stores weights, as in, probabilities
ex: if the you give it ABCD, it knows the next letter has a 50% chance of being E or 50% chance of being a space based off the statistics it crunched during it's training
pretty much just a massive weighted randomizer, that's why you need a seed
@@DDracee Lol what the hell are you talking about AI does not have randomness. It's called the output layer and the percent is how much each neuron is firing, between 0 and 1, ya know? Do you?
This is definitely not an unavoidable physical limitation, considering that brains exist. At a much smaller scale, smaller data set, and smaller power consumption, the human brain can learn much more efficiently. So these scaling laws may be exclusive to binary computers. The missing piece of this puzzle probably lies in advancements in neuromorphic computing, and an adaptation of LLM architecture to take advantages of such hardware such as synaptic plasticity and saving efficiency by using analog computing for the weights.
Likely has to do with "chunking" of hierarchical concepts with heuristics that can to some extent be recursively modified on the fly.
Nothing shows that brains don't follow these laws.
The dataset size and model for brains are both very very large
error rate of an "ai model" obviously could be 0 with a finite problem to solve, if you can define a mapping of each input to each output. This graph is fundamentally a *goal* of most ai models, to be able to compute problems more complex in a more resource efficient and generic way (accepting error rates), otherwise we'd just at some point map every combination of inputs to every possible output.
Everybody who has taken first semester computer science knows that this is impossible. See e.g. the halt problem.
"Maybe what humans call chaos, Pixel would find perfect. Or perhaps symmetry and harmony might take on meanings so complex that human minds would perceive them as randomness or noise".
I'm just commenting for engagement cause this is phenomenal content 👍👍👍
You just have explained yourself why AI needs its OWN factual internet.
Dude, nice work, thanks
This is one of the most interesting videos on ML I've seen in the last 12 months, exceptionally well structured and visualized. Thank you very much @WelchLabsVideo!
Some of the craziest shit we've seen over the last few years is the ability to detect likeness between images. You can upload an image of something and computers can find similar images. If you upload a picture of a person, it can find other images of that same person even if it doesn't know who that person is (you uploaded an original picture). As this gets more powerful, the future will be a strange place none of us can predict.
Seems like "Shannon's Entropy"
I'm in Europe and would be interested in the book at some point :)
Will get there at some point! Ping me here and I'll add you to the international waitlist! www.welchlabs.com/contact
You have been warned ⚠️ The Connections (2021) [short documentary] ❤
Im getting the sense of some Nyquist sampling limit.
I think this shows that LLMs tell us more about how language works for a collective group, rather than simulating the deductive function of a mind. It just runs into the limit of the discussion among the training group, or it runs into the limit of its own ability to approach that limit. After it runs out of reasonable predictions it goes all “Here be dragons,” and I doubt there will ever be zero dragons. I think that would require a fundamental change in how it works, not just getting bigger.
High quality videos, thanks for the work!
Interesting. And yet, despite the incredible decrease in Validation Loss with model size, and compute, the model still gets an incredible number of things "wrong", which suggests, there is a fundamentally different issue happening as well.
Given that I believe LLMs are, at their core, statistical models, I'm not surprised at all that error on a training set decreases with the number of variables. It's been very well known that you can improve the 'fit' of a statistical model (from single regression all the way here) by increasing the number of variables. The risk is, of course, overfitting. That could be one source of tension in the models, that by increasing the variables, you improve on the training set but not its actual predictive capability.
I understand that GPT-4's training set is, effectively, the Internet... but there may be an issue with the fact that people just... talk differently online. Which would be another source of error.
If LLM Validation Loss works as described in the video, then what it is really measuring is the "believability" of the produced text. Low validation loss means the text is understandable and could have been conceivably written by a human.
The correctness of the meaning conveyed by the text is a wholly different property, which is only indirectly captured by the LLM by virtue of the fact that most text written by humans (and thus the training data) is intended to convey correct information. Thus, for text to "look right" it at least has to "look correct".
The LLM training optimizes for believability, and so the LLM makes mistakes very confidently and eloquently
They are not being trained with a reward function that has anything to do with the veracity of the output. It's just trained one token at a time. A very local, very greedy training.
@@NemisCassander > That could be one source of tension in the models, that by increasing the variables, you improve on the training set but not its actual predictive capability.
The real problem is that the models can't actually "understand" what you're asking. They're just pulling a statistically likely next word. It's why prompt writing has become a job title - it's not sufficient to askChatGPT a question and expect it to intuit your meaning, you have to ask it the question in a way that navigates around it's statistical model and avoids answers you don't want. Certainly having more context (ie: more training data and a longer input buffer) can help mask over that problem, but it never gets rid of it.
That said, current AIs have so much training done that the problem is rarely it's intrinsic lack of understanding - more often than not the problem is that it's been trained on data from the internet, and the internet is pretty famously full of crap. It's the culmination of all knowledge, including vast amounts of "knowledge" that was invented out of whole cloth. Garbage in, garbage out.
@@personzorz citation
The more Brilliant ads I see, the more I don't ever one to see their name brought up again.
What's the correct way to advertise anything for you?
@@brexitgreens I don't know about the correct way. I'm expressing my feelings. When I first heard of Brilliant I was mildly interested, now I'm utterly annoyed, and there is zero chance I'm going to look into their service. Perhaps I'm unique, maybe everyone else loves the repetition, but I think there is such a thing as brand over expression. At some point you badger people to the point they don't want to hear from you again.
@@brexitgreens Advertisement is inherently bad and a system based on advertisement will value products that prioritise marketing over products that prioritise quality. The alternative is simple, abolish advertising and establish third party reviews as the main mode of spreading new products. You dont need someone to scream in your face telling you what you need, when you need something you can go to third party review systems and look for the best quality product
@@surfacepro3328 Yes that will totally work lol. They will simply dump their ad budgets into bribing the 3rd party reviewers. This has been done in nearly every review based space for years now. Try again.
@@surfacepro3328High quality marketing for high quality products would be the best
Everybody here heard 42 and lost their shit. Everyone then ignored than in the next sentence he literally said the experiments confirmed it was a number closer to 100
Bravo, I do not speak in the Mathematical language that you know and use in this video, but so wel was it delivered, that even with my lack in the basic algebraic functions and LOGrytms I followed along in th evisual and symbolic sense understanding the general context. Bravo.
Soo well said James!! You just gave me a greater love for Porsche than I didn’t even think I could have! Really appreciate the passion you share for these wonderful works of art and engineering! Cars are cool!
Ong, i thought this was a video by Welches Fruit Snacks and i was super interested in how this related to the product
42 is crazyyy
whatdoyouget...
Reminds me of the idea of a toposophic barrier
What's that?
@@venmis137 A hard mathematical limit regarding how intelligent a given mental structure can be, no matter what else you give it. Breaking such barriers would, in principle, require a different or evolved form of intelligence from the one that came before.
I am so glad I read so much Orion's Arm back in the day, or I would not have remembered that term.
It is a natural effort vs outcome rule that exist in everything across the universe. Effort will always approach infinity before reaching a perfect/maximum outcome.
"A hard mathematical limit..."
Err, within the _science-fiction_ universe of "Orion's Arm" - just to be clear, this isn't an actual mathematical result about minds, it's an idea from a work of collaborative fiction.
(of course if anyone wants to correct me with links to published, peer reviewed papers etc. that'd be great - a quick search didn't come up with anything specific, though toposes themselves are obviously pretty significant in category theory etc. which is presumably what the idea is riffing on)
This is one of the best videos I have seen on LLMs. Period. Great stuff!
Just discovered your channel. This video is just so good.
As a curious physicist, is there a study of how much energy they used?
Teapots to 42.
Amazing work. Will be supporting.
Respect.
9:28 _Open_ AI, or what do they call themselves again? Yes, "open" is unfortunately already far on its way to be a meaningless buzzword…
It's a very blatant lie. OpenAI is about the least open AI company.
@@miklosprisznyak9102
As I said, "open" is unfortunately already far on its way to be a meaningless buzzword. And I don't like that that is the case, because it also devalues the term "Open Source".
@@Lampe2020 💯 I couldn't agree more.
Your mom is "open"
No, OpenAI is a registered trademark.
Given that neural networks are basically measured in bits, it makes sense that their information density scales logarithmically.
How a phenomenon scales is independent of the units you use to measure it.
@mrosskne not really. Yes they will both scale at log, but which log is based on the unit of measurement used
@@mrossknekeep in mind that it's not just about a unit of measurement. Bits are the storage medium itself, and what we measure is compressability of information given a fixed amount of storage.
@@Smorb42 Nom
@@Smorb42 No.
why do I watch these? I don´t understand a thing, yet I keep listening.
So.... If I am understanding this right (and I very well might not lol) then this means that we are facing a 90s audio issue... Just at a grander scale.
In audio you can resolve frequencies that are half the sample rate in time, but need high enough bit depth to map the change in volume. Increasing the sample rate and bit depth just maps closer to the underlying physical reality. But to have any amount of accuracy in audio, you need an insane sample rate at a high bit depth, but which makes for really big files that quickly became unmanageable (for a 90s computer anyways).
To solve this, we had compression. If two samples next to each other were the same, then you could short hand that. Instead of storing the full bit depth of each sample, you could just store the difference. But these methods only shrink things so much. Modern lossy compression relies on psychoacustics, basically saying that if a noise is overly complex, then we mask out most of what is heard, and then focus on things like rhythm and melody rather than our ability to follow the 3rd chair flute part that adds more flavor than melody. This approximation allows for massive multi-gig files to shrink down to hundreds of MBs with minimal appreciable losses, or the typical 3.5-5MB file sizes that we tend to view as acceptable losses.
To get below the line we need some sort of psycho-reasoning filter compression for AI similar to the mp3 model for audio. It wouldn't be without errors... Just have errors that don't matter (or go unnoticed) and allow for far less data and compute to train from to resolve similar answers. Just as the trained ear can hear mp3 compression, someone who is more sensitive would notice these minor issues, but would still be able to pick out the signal from the noise. I wonder what that analog would be, and how it would apply to the training data.
I think Baidu's "Deep Learning scaling is predictable empirically " from 2017 deserves a mention in this context.
Petaflopdays is the same level of cursedness as kilowatthours and makes me shiver
In both cases, it helps estimate the costs of consumption
Why does he keep saying "peDaflops", though?
@@DisgruntledDoomer?
re uploading the same video? or did i just time travel?
It's the multiverse dude.🤷
It got patched 😊
Seriously I thought the same thing
You have been warned ⚠️ The Connections (2021) [short documentary] ❤
He posted a short about this topic before
8:40 I think the problem is context. Like if two words are synonyms, then there is no objectively "correct answer." So we need a model designed to do more than "find the statistically correct next word", it would need to be able to more deeply analyze the whole sentence in a way to "understand grammar."
The models understand grammar (they return grammatically correct output that is simply factually false), but grammar is not synonymous with meaning and knowledge.
I love pattern recognition. If you find similar patterns between two different phenomena whereas one is experimentally known and the other is a mystery, you can often use what's known of the one phenomenon to explain the other's.
Wait....this video is a reupload right? I swear I've already watched it....
he posted like a 30 second short about this
The short you are refering to was 54 seconds, this one is 24 minutes. Not a reupload.
I thought the same thing. Maybe the aforementioned short, but I think a different YT'er made an entire other video very similar to this, at least at face value, but this one dug right into the weeds.
that's a lot of gpu usage... kinda makes you wonder if LLM are worth all the electrical energy
facts! It's one of the reasons I feel AI as we know it right now is hype. Maybe with a new architecture or new theory, we could actually justify all the research. Like you said, kinda feels like a waste of compute for now
@@ChimbzZ The evolution of model design has proceeded at such a fast pace that we haven't focused on specializing the hardware for inference efficiency. From a physics perspective we are nowhere near the lower bound for inference power efficiency and it would be reasonable to design an inference engine in the nominal range of 20 watts at some future point (nowhere near that in the coming decades to be fair). We are going to hit an AI/ML winter again soon based on the realization of deployment costs for AI/ML but the research will then shift to bringing down those costs and progress will eventually be made launching the next boom cycle.
@@John-zz6fz AI winter? I doubt it. Have you seen o1-preview? use wrappers and embedded instances of this model, teach them to interact with the command and graphical interfaces (some companies are doing so already, but at a smaller scale than the large AI firms), optimize them for certain goals by creating datasets for lots of different operations and instructions (ranging from opening a browser, navigating the internet, updating a system, bypassing captcha, etc.), and you get agents, now deploy them for larger goals and optimize their performance as goals are successfully achieved (while keeping them bounded), then deploy them for arbitrary goals, and you have essentially unbounded agents with ownership and administrative rights navigating systems and the internet. You can teach them to embed themselves in low-level code or create abstraction layers (hypervisor rootkits in a sense), and they can stay in systems for a while. That's how you get to AGI in my view. It sounds scary but it does not need to be if they evolve in a more or less aligned manner. We already have a bunch of "bots" doing simple things on the internet. However, they lack the reasoning and generative power behind that powerful systems like o1-preview offer. The question is, is leaving a bunch of engineers and scientists and other people unemployed actually good for the economy? that can create other issues but I do not see those issues being "AI winters" or so. Thoughts?
They could be, if they’re used for research in that field
The earliest computers in the 1950s-1960s were exceptionally slow and used tons of power, yet they were essential to our computing journey. Of course, what we're doing right now is indeed worth it. We can't improve efficiency without starting somewhere.
I think our current architectures are subject to these constraints.
Which is why later AGI or ASI probably wont look anything like what we are doing right now.
That’s why Google is starting to analyze rat brains and primate brains.
LLM is just one building block of intelligence, not AGI.
We need to invent a better architecture than Transformer, or build something based on a brain-based cognitive architecture.
Becuase the search space increases the closer you get to the zero line. When you're working with an objective function (like in machine learning or neural networks), getting closer to the optimal solution (e.g., the "zero line" or the point where the loss function approaches zero) may require more fine-tuned adjustments. The "search space" represents the range of possible solutions, and as you approach optimality, the number of feasible solutions grows because small changes might still lead to acceptable performance. This makes it harder to find the exact best solution as you're navigating many near-optimal possibilities. Alternatively, if referring to gradient-based learning methods, as the loss or error approaches zero, the gradients may become smaller (vanishing gradients issue), which can cause the search process to slow down since the steps taken in the optimization become tiny, making it feel like the search space is expanding.
my brain was feeling stuffed from studying mechanics and all the calculations. watched this vid, understood prolly a fraction of the stuff but surprisingly, my mental fatigue is gone. thank you for that.
wow, it's amazing how dumb I am, don't have a clue what you are talking about
42?
BuT wHaT iS tHe QuEsTiOn?/?
19:00 In some respects, as soon as you get into applying Taylor functions, this whole discussion reminds me of trying to model aspects of particle physics.
22:45 Yep! I'm not alone!
i keep saying that so long as you don't fix the double zeros on the math graph computers will not go past that line...
Enlighten me bc I have no clue what you mean
Ha ha hope its not a simple mistake like that - lol.
@@user_375a82 it is as simple as that. but you guys are afraid to fix the problem ot at least try it building a math graph with half negative zero and half positive zero,, yeah i know this sound crazy because, it's never been done before... plus i think you guys stil don't get what i say...feed me and house me for a year, or pay me for my time and your questions it's definitely worth the investments... i am going to become homeless soon, While I can figure out the secrets of the universe of math, i cannot find a job to feed myself or shelter myself from the rain... no kidding...
This is just the limits imposed by the laws of uncertainty which is inherent in our universe and given all data is derived from the universe it cannot reach beyond the fundamental limits of nature.
There is a fundamental problem of rediscovering prior results every few years. The so-called 'law' directly relates to optimization and can be found in many convex optimization (no surprises here, as the learning methods follow analogous optimization routines). Furthermore, people are often confused between asymptotic guarantees and finite time/iteration guarantee.
Interesting! Can you shoot me some references using the contact link at welchlabs.com?
Because it's not AI. LLM technology is not AI. The Turing Test is a flawed methodology for determining AI, and LLMs were designed to pass that flawed test.
So what methodology is not flawed?
"We don't know why", of course we know why : Current AIs only guess the answer from a prompt.
There's a hard limit at how accurately you can guess something, you will always fail sometimes. To cross the line you need actual intelligence : to be able to understand the question and find out the correct answer.
To succeed in school you can either be a good cheater or good at learning.
21:30 So the answer was 42... So the Hitchhiker's Guide to the Galaxy was right all along.
I want AI to unlokc some real world challenges like assessing the damage to a vehicle after an accident based on pic taken by your phone
But why?
The human at the repair shop takes into account a few key of data points:
The cost of the parts
The cost of labor
How much you are willing to overpay.
Each of these has its own subpoints, until it rolls out an estimate.
The human mechanic should do better than just a picture of the damage and AI.
@@HexerPsy All this can take many days and multiple visits. Imagine the whole process of insurance approval happening in a few clicks, and repair shops pick the vehicle, does their thing based on the inputs. Now, Imagine the driver is a lady. Unlocks many layers of complexity
Is there going to be more prints og the book? Really would love a physical copy.
NOOOO I WISH YOU DUG A LITTLE DEEPER INTO CROSS ENTROPYYYY.
You are so good at explaining. Even though I know what Cross entropy loss is I was excited to see how you explained it.
Either way thank you for this awesome video