Post-show reflections: I know I was pushing the "LLMs are databases" line quite hard, and the guests (and Ryan's article) were suggesting that they do some (small) kind of "patterned meta reasoning". This is quite a nuanced issue. While I still think LLMs are basically databases, *something* interesting happens with in-context learning. The "reasoning" prompt (or database query if you like) is parasitic on the human operator - but the LLM itself does seem to do some of the patterned completion/extension of the human reasoning prompt pattern in context, i.e. "above the database layer" there is some kind of primitive meta patterning going on which is creating novel combinations of retrieved skill programs in the LLM. It's a subtle point but I don't think I was able to express it in the show. - Tim
Hi Tim, LLM output is in the mathematical sense a sequential chaotic process. The ARC challenge uses 2D grids, which require more visual intelligence that is not purely sequential. It is entirely possible and even highly likely that multimodal models that have the ability to focus sequentially on parts of images, generate possible similarities between the examples in context, and test those during inference will lead to solving the ARC challenge without the need for coding. The main problem with the ARC challenge is that the most advanced models (which are closed source) are not allowed to participate. It is difficult to state those models don’t work if they cannot participate. The solution is to ensure the advanced models are used in a mode that prevents them from training future models on the interactions. A very easy prompt that can be used on multiple hard problems is “keep trying and testing ideas until you find one that works.” That is very much how humans solve such puzzles. LLMs are indeed not just databases if you see what they can do with sequential problem solving, e.g. coding and debugging entirely novel programs. Just as zooming in on the Mandelbrot set is not a database problem. Both can take you into directions that you can only discover by going there.
No need to apologize. As far as I remember, every guest you've had agrees with the LLMs are databases theory. So, it's pretty reasonable to say that if most of the experts in the field agree. I think the confusion comes from all the marketing hype that the companies put out to try to oversell their capabilities.
@@TheTEDfan You can use whatever you want on your own. I'm sure you would get some reward if you could solve all the challenges with an LLM (since many have tried and apparently the best is still only 50%). But the point of the ARC challenge is to discover new ideas, so using an LLM goes against the spirit of it no matter what.
Excellent content! I'm awaiting the Geoffrey Hinton interview.... Also, this avenue that you've been reporting on reminds me of this paper on the brain - as an aside to ML/AI, it's an interesting theory as to the organic functions we're trying to create digital counterparts to: [note: yt won't let me post links] google search "National Library of Medicine Top-down predictions in the cognitive brain"
ICL or RAG doesn't conflict with your view. They are basically a relatively trivial kind of program synthesis on top of the LLM, since they involve some heuristic or embedding-based discrete choices to guide the LLM. It's a rather weak form of system 2 on top of system 1 to help performance IMO.
I can't begin to tell you how much quality of life this channel has brought me over the years since my health issues have impeded my mobility. These videos are so stimulating and profound, I wish I could offer more. I so, so, so much appreciate your work Tim, and Yannic and Keith too. Thank you all so much.
I think the resurgence of the ARC challenge is one of the most interesting things to have happened this year in AI. Just the level of nuance and debate it has forced into the conversation can only be good for the community. Whether it is beaten or not, we’ll all be wiser for having gone through this exercise. Chollet really has devised an incredibly ingenious challenge.
I often share your insightful and well-explained videos with my children. I want to express my sincere gratitude to everyone involved in creating MLST's content. It's truly exceptional. I wish more content creators would prioritize clear, informative delivery over sensationalism, as you do so well. Thank you!
9:54 I don't think it's clockwise. The first example shows that white should be on top, but also that pink should be on top of brown (there would be more brown in the solution otherwise). So I'd guess: white > pink > brown > yellow (in terms of z-index)
It's overlaying 4x4 quadrants of the original on top of each other, treating black as transparent. 4 over 1, then 3 over 1, then 2 over 1. Solution is resulting quadrant 1.
@@benbridgwater6479 You can also think of it like layering colored papers on top of each other with the black part cut out. If the upper right is Quadrant 1, and numbering clockwise, lay down Quadrant 4 first (all yellows + black), then lay Quadrant 3 on top of it (all pink and black), then lay down Quadrant 2 (red + black), and finally lay the last layer, Quadrant 1 (white and black). The black represents nothingness or glass or pure alpha channel or transparency depending on your viewpoint, just as long as you let anything under black come through to the top. Once a color gets covered by anything other than black, it is superseded.
Your original comment is correct (white > pink > red/brown > yellow). The comment directly above confuses the order in which the quadrants are superimposed, ignoring that pink visibility trumps red/brown visibility.
@@simonahrendt9069 No - the only special color is black which acts as transparent when overlayed on top of something else. Otherwise what "trumps" what just comes from the overlay order - whatever ends up on top wins. Easy to verify in GIMP.
@@simonahrendt9069 The "order" is Layer 1 (Yellow = Q4), Layer 2(Pink=Q3), Layer 3 (Red=Q2), Layer 4(White=Q1). Red covers pink. Pink does NOT trump red. Where Q4 means quadrant 4, and you layer it down going counterclockwise starting at Q4. (But you can name the quadrants whatever you want... the order does not change).
Timestamps
00:00:00 Introduction
00:03:00 Francois Chollet's Intelligence Concept
00:08:00 Human Collaboration
00:15:00 ARC Tasks and Symbolic AI
00:27:00 Evaluation Techniques
00:35:23 (Main Interview) Competitors and Approaches
00:40:00 Meta Learning Challenges
00:48:00 System 1 vs System 2
01:00:00 Inductive Priors and Symbols
01:18:00 Methodologies Comparison
01:25:00 Training Data Size Impact
01:35:00 Generalization Issues
01:47:00 Techniques for AI Applications
01:56:00 Model Efficiency and Scalability
02:10:00 Task Specificity and Generalization
02:13:00 Summary
I'm not sure I see ARC tests as examples of "abstraction" and/or "reasoning." I see them as our capacity - at the perceptual level - to automatically categorize concrete things into like "kinds" of things due to their perceived similarity (or dissimilarity in the case of a missing similar piece). This is why young children (not yet operating at a very high level of verbal abstract reasoning) can solve these types of problems. The problems are resolved at the perceptual level - not at the higher (verbal) levels of abstract reasoning. If the images are flashed for a brief fraction of a second, you won't "perceive" the solution. Instead, you just stare at them over time, and your brain instantiates them through constructing neural pathways that are similar. And you see the "solution". This is why humans don't need large, labeled data sets to "get" what a cat is. A young child doesn't even need to be at the verbal stage to differentiate dogs and cats into different "kinds" of things.
That's kinda the best measure of core human intelligence we have; this test is not contaminated by knowledge. There's a blur between knowledge and intelligence, and that's the issue with current IQ tests.
The idea of the ARC test is that you have various simple perceptual tests, but the AI needs to instruct itself to solve them. That way it needs to reason, which can be defined as needing to instantiate less brute force perceptual solvers.
@@divineigbinoba4506 I wouldn't say this class of problems are independent from any prior knowledge. One has to come up with a function (series of steps) that maps some input space to some output space. Time is the limiting factor. The shape of the input and output will constrain the search space to some degree. Without any knowledge, it's still brute force. Time to solution goes down with a more accurate "initial guess" / more efficient "tools" to approach the problem with. A "tool" is either prior understanding of a general idea, or prior knowledge of / familiarity with an idea specific to this context. Definitely not a pure function devoid of knowledge.
The superposition example at 9:34 is not in the clockwise order: It's yellow -> red -> pink -> white with later ones on top. Left edge of first one shows pink is on top of red.
I think many people underestimate the complexity of our visual cortex. There has been interesting research based on people with brain defects, and every time one finds new insights. What looks simple in the ARC challenge is millions of years of evolution. Language is only a few thousand years old, reasoning likely even less. Amazing conversation. Thanks
@@Bd-ng1zv quick search, but there are many more. One notable example of new insights gained from studying someone with brain defects is the case of patient H.M. (Henry Molaison). H.M. underwent surgery in 1953 to treat severe epilepsy, which involved the removal of large portions of his medial temporal lobes, including the hippocampus. After the surgery, H.M. was unable to form new long-term memories, although his short-term memory and general cognitive abilities remained intact. This led to the groundbreaking insight that the hippocampus is crucial for the formation of new long-term declarative memories (memories of facts and events) but not for short-term memory or procedural memory (like learning new motor skills). This case provided a foundational understanding of how different types of memory are processed in the brain and how specific brain regions are involved in different aspects of memory, significantly influencing neuroscience and cognitive psychology.
It's not so much zero-shot learning that's needed as runtime incremental permanent learning. It doesn't seem gradient descent would work since you'd just be fine-tuning on new experience and would end up losing the pre-trained model's capabilities. Runtime learning might sometimes be one-shot, but other times generalization over repeated patterns, learning exceptions, etc. Really need to ditch gradient descent altogether and find new incremental method.
@@benbridgwater6479 No need to ditch it all together; gradient descent is how you learn to throw a ball; a little better each time. Gradient descent is not how you learn to reason a little better each time.
Also multi-context, so the AI can work on a text making use of a working context. That way there is no contamination between the working text and the meta-goals. This way you can ask the AI to be "didactic" in parts of the text and "critical" in other parts. I would think embedding slicing could solve that.
@@MrMichiel1983 Yes, but in our brain fine motor skills like ball throwing are learnt by the cerebellum, while cognitive pattern matching etc. is learnt by the cortex. What AGI needs is for "cortical learning" to be available all the time (preferably no training vs inference time distinction).
I don't think the LLM's image recognition capabilities are precise enough for the ARC challenge. It's not that the LLM doesn't know; it's more that it cannot see as clearly as you think.
Yeah, I think it’s unclear if simply solving the image recognition will be enough, it might not be, but the ARC test does feel a little pointless currently when the LLMs clearly can’t see clearly.
Yeah LLMs are not precision instruments. They are good at getting the gist of things in any domain. This is just an extra layer that makes that even more pronounced. I would argue if you made an LLM be able to solve these, they would cease to be useful in any other domain. Plus what the heck is wrong with people. It's not a language task, it's vision+reasoning task. I don't understand why people try to use a language algorithm to solve literally everything now.
@@InfiniteQuest86 Yeah, I did a quick experiment where I described the scene of the first question on the test in great detail, and the model was able to get it correct because it was now a text-based problem. I then refreshed the chat and asked the model to describe what it sees, and it was evident that the vision capabilities were the issue, though I was already suspicious of this being the issue based on my long experience with the model.
I tried this with GPT4o, and it does understand the color scheme when I paste it in. It can't grasp the conversion tho. I think without some hints this particular problem is very difficult to solve.
Congrats! We both hit 131k subs at the same time :) - What's everyone's take on few-shot prompting vs. test-time fine-tuning? My sense is in the limit, few-shot prompting would be all you need, and ultimately zero-shot (based on their point that as the foundation model gets bigger, you need less test-time tuning)
Ah, looks like we both noticed that, but actually it is not increasing and decreasing amplitude. It's increasing and decreasing variation in Amplitude.
Clarification on reasoning. "knowledge acquisition = reasoning" is a great heuristic but clearly isn't exactly correct. I think it helps folks understand what is meant in this context though. It might be more correct to say that reasoning is the thing we do when we "rearrange the variables" to construct models (in many cases composed of existing models we already have) to make sense of the world. My cohost Keith Duggar defines reasoning as **performing an effective computation to derive knowledge or achieve a goal.**
“We may have knowledge of the past but cannot control it; we may control the future but have no knowledge of it.” - Claude Elwood Shannon
Science leverages control to gain knowledge. Engineering leverages knowledge to gain control. Reasoning is the **effective** computation in both.
Effective method: A method is considered effective for a class of problems if it meets these criteria:
1. It consists of a finite number of precise instructions.
2. It terminates after a finite number of steps.
3. It always produces a correct answer for problems within its class.
4. It can be executed by a human using only writing materials.
5. It requires no ingenuity, only strict adherence to the instructions.
Reasoning doesn't require knowledge acquisition - it can just be reasoning over known facts, using known methods. What reasoning does need is multiple steps and combinatorial application of knowledge. Essentially Intelligence = prediction, and reasoning = multi-step what-if prediction.
@@benbridgwater6479 This is just semantics - I agree with you. Reasoning can be described as "creatively recombining knowledge you already have". There is still an infinite space of recombinations which needs to be performed efficiently.
Error functions have a natural tendency towards parsimony, reflecting the principle of Occam's razor. This principle is also observed in nature, where simple and efficient solutions often prevail, suggesting that parsimony is a fundamental aspect of both human and other natural systems.
Not always. There is that recent paper "Neural Redshift" claiming that this is not universal for all neural networks and is in fact related to specific architecture choices: "But unlike common wisdom, NNs do not have an inherent “simplicity bias”. This property depends on components such as ReLUs, residual connections, and layer normalizations." They show that randomly initialized networks (not trained yet) have different complexity properties based on metrics such as frequencies in Fourier decompositions, order in polynomial decompositions, and compressibility of the input-output mapping, and that this influences how hard it is to train networks and get good generalization. Their conclusions: "We examined inductive biases that NNs possess independently of their optimization. We found that the parameter space of popular architectures corresponds overwhelmingly to functions with three quantifiable properties: low frequency, low order, and compressibility. They correspond to the simplicity bias previously observed in trained models which we now explain without involving (S)GD. We also showed that the simplicity bias is not universal to all architectures."
I can back up everything he said with lower domain examples, especially the 32k context deg! I’ve written to OpenAI devs on this since the 120k context models came out. 4o was an improvement from turbo but still lacks the nuance of the OG gpt4 model. We still use this where we can. But the context window hurts us
@@pliniocastro1546 RAG, Recall, Search. grab any piece of text, after 32K you just get a significant drop in intelligence. give it 100K tokens ask it to identify or count the number of times a word is said. then use command F.
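The "command F" check described above can be scripted, so a model's answer is compared against an exact count. This is only a minimal sketch: it assumes whole-word, case-insensitive matching (the comment specifies neither), and `count_word` is a hypothetical helper name.

```python
import re

def count_word(text: str, word: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # \b keeps "cat" from matching inside "category".
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sample = "The cat sat. The other cat, a category expert, watched the Cat."
print(count_word(sample, "cat"))
```

Running the same function over a long document gives the ground truth to compare with whatever count the model reports.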
never touched machine learning, don't know what a tensor even is (just seen it as a class in some machine learning code on twitter), but 35 mins in the video and I dont feel lost. you bet im subbing.
Are we sure that there is only one correct answer for all the questions though? What if the test set is based on our understanding of the world and rules, when in fact ML can find hidden rules that are just as valid for the test set, but aren't rules that we have thought of as valid?
Yes. Remember when you were being tested and you had to guess what the questioner thinks is the right answer, even if it's not? Some of our tests to the ML algorithms follow that pattern, it has to guess not only the right answer but make sure it's what we think is right.
@@wwkk4964 LoL, well, if a student shows me that a question is formulated incorrectly, I give them credit, and then I fix the question. This question set has been out in the wild for HOW LONG now and still no one has fixed it?
@@marcfruchtman9473 negative numbers did not exist for European mathematicians until well over a millennium after Brahmagupta gave rules for computations with negative numbers and zero.
I guess I'm missing something? The first ARC example - the three pairs of six drawings, numbered 15, 43 and 84, with each of the 2 in any pair called either A - on the left, or B - on the right. So given those definitions and the three pairs of six drawings, it seems like he gets part of the first one (15) wrong when he says the six drawings on the left are 'not connected together'; he then says (correctly) that the six on the right 'have a gap'. For 43, he gives a different explanation than the verbiage below the drawings, and calls it monotonicity. He says on the right 'they're all going up or they're all going down' - I guess this could be considered correct depending on whether you approach each drawing from the left or the right side, even though they're described as all increasing. But then he says the drawings on the left are 'going up and down at the same time' which is not monotonicity (I just looked it up :-), is it? I don't know this stuff. What am I getting wrong? Thanks for an interesting video.
I'm not an AI researcher, I'm a philosophy scholar, but here are some thoughts: After about an hour in to the video, it's pointed out that intelligence can't exist as a blank slate, and it's implied that humans are drawing from a very basic set of reasoning skills that can be combined and employed in various ways. We probably don't learn those skills from our limited life experience, humans (and other species) have probably evolved them over millions of years. Evolution is just learning on a different scale. This suggests an ML approach to developing a more basic set of reasoning skills is probably the way to go. The challenge is figuring out what data to train on. Human created data is probably far too shallow. Spitballing: Is it possible to write a program that generates pairs of packets of randomly generated data, and that same data which has been manipulated in random ways (i.e. ways we may not have thought of) in order to train an ML system on how to identify patterns? On a similar note, predicate logic is used to relate symbols. It should be straightforward to build a symbolic predicate logic implication generator of varying degrees of complexity. That is, it starts with a symbolic output statement, and then applies randomly selected logical operations on it to produce a set of symbolic premises with the output statement as a conclusion. Would it be possible to use such a generator to train an ML system to do complex logical reasoning?
Your predicate logic idea exists, and was used to try to create proofs from scratch. It was questionably useful since it was the first attempt, but yeah it's a good idea. I'm not sure your pair of randomly manipulated packets idea would work how you think, since if you start from random noise and then manipulate it somehow, it will still look like random noise unless the manipulation is very specific and detectable.
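The implication generator described above can be sketched in a few lines. This is a toy under stated assumptions, not the existing system the reply refers to: it is propositional rather than full predicate logic, it uses only two backward rules (modus ponens and conjunction elimination), and the names `generate_premises` and `entails` are hypothetical.

```python
import random

# Formulas: an atom is a string; ("->", a, b) is implication; ("&", a, b) is conjunction.

def generate_premises(conclusion, steps, rng):
    """Work backwards from `conclusion`, returning premises that entail it."""
    premises = [conclusion]
    for i in range(steps):
        # Pick any current premise and replace it with something that implies it.
        target = premises.pop(rng.randrange(len(premises)))
        fresh = f"p{i}"  # a new atom that has not been used yet
        if rng.random() < 0.5:
            # Backward modus ponens: target follows from fresh and (fresh -> target).
            premises += [fresh, ("->", fresh, target)]
        else:
            # Backward conjunction elimination: target follows from (target & fresh).
            premises.append(("&", target, fresh))
    rng.shuffle(premises)
    return premises

def entails(premises, goal):
    """Forward-chain with modus ponens and conjunction elimination."""
    known = set(premises)
    changed = True
    while changed:
        changed = False
        for f in list(known):
            derived = []
            if isinstance(f, tuple) and f[0] == "&":
                derived = [f[1], f[2]]
            elif isinstance(f, tuple) and f[0] == "->" and f[1] in known:
                derived = [f[2]]
            for d in derived:
                if d not in known:
                    known.add(d)
                    changed = True
    return goal in known

premises = generate_premises("q", steps=5, rng=random.Random(0))
print(premises, entails(premises, "q"))
```

Each (premises, conclusion) pair could serve as a training example, with `steps` controlling the depth of reasoning required. Whether training on such data transfers to broader reasoning is an open question the thread does not settle.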
29:00 would mamba 2 be effective with the symmetry transformation combination? it seemed to be a bridging concept between ssm, and transformers at least, if i understood the paper correctly. Another paper on llm grokking, seems like an important step towards reasoning.
Grokking is training-time generalization (learnt via gradient descent). What these example-based ARC tests require isn't learnt generalizations over training samples, but rather the runtime ability to form new generalizations over each ARC problem's example before/after pairs. Not only is there no gradient descent available at runtime, but the generalization task is a bit different since we know a generalization exists for each problem, so it's really a matter of search to find it (i.e. find a composition-of-transformations description for example #1, expressed in a generalized form, that also works when applied to the other examples for the problem).
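The search described above, for a composition of transformations that fits every example pair, can be sketched as a brute-force enumeration. This is an illustration only: the four grid primitives and the names `PRIMITIVES`, `apply_program` and `search` are assumptions chosen for brevity, and real ARC solvers use far richer transformation vocabularies.

```python
from itertools import product

def rotate(g):     # 90 degrees clockwise
    return [list(r) for r in zip(*g[::-1])]

def flip_h(g):     # mirror left-right
    return [row[::-1] for row in g]

def flip_v(g):     # mirror top-bottom
    return g[::-1]

def transpose(g):
    return [list(r) for r in zip(*g)]

PRIMITIVES = {"rotate": rotate, "flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

def apply_program(names, grid):
    """Apply the named primitives to `grid`, left to right."""
    for name in names:
        grid = PRIMITIVES[name](grid)
    return grid

def search(examples, max_depth=3):
    """Return the shortest primitive sequence consistent with all (input, output) pairs."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            if all(apply_program(names, i) == o for i, o in examples):
                return names
    return None

examples = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[5, 0], [0, 6]], [[0, 5], [6, 0]]),
]
program = search(examples)
print(program, apply_program(program, [[7, 8], [9, 0]]))
```

No gradient step is involved: the "generalization" is whichever program survives all the example pairs, and the cost is the combinatorial growth of candidates with depth.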
It helps to have a programmer's mind ... Call the top two quadrants of the 8x8 pattern as "1 & 2", and bottom two quadrants as "3 & 4". Now, treating the black squares as transparent, copy quadrant 4 over 1, then 3 over 1, then 2 over 1. The output 4x4 pattern is the resulting quadrant 1.
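The recipe above translates almost directly into code. A minimal sketch, assuming black is encoded as `0` and that quadrants 1 and 2 are top-left and top-right (the comment only says "top two"); the layer order is the one stated in this comment, which other replies in the thread dispute, so it is left as a parameter.

```python
BLACK = 0  # assumed colour code for the "transparent" cells

def quadrants(grid):
    """Split a square grid into quadrants 1, 2 (top) and 3, 4 (bottom)."""
    h, w = len(grid) // 2, len(grid[0]) // 2
    q1 = [row[:w] for row in grid[:h]]
    q2 = [row[w:] for row in grid[:h]]
    q3 = [row[:w] for row in grid[h:]]
    q4 = [row[w:] for row in grid[h:]]
    return {1: q1, 2: q2, 3: q3, 4: q4}

def overlay(base, top):
    """Copy `top` over `base`, letting `base` show through wherever `top` is black."""
    return [[t if t != BLACK else b for b, t in zip(brow, trow)]
            for brow, trow in zip(base, top)]

def solve(grid, order=(4, 3, 2)):
    """Start from quadrant 1 and copy the quadrants in `order` over it, one by one."""
    q = quadrants(grid)
    result = q[1]
    for k in order:
        result = overlay(result, q[k])
    return result

grid = [
    [1, 0, 0, 2],
    [0, 0, 0, 0],
    [3, 3, 0, 4],
    [0, 3, 4, 0],
]
print(solve(grid))
```

Changing `order` is enough to try the alternative stacking orders proposed elsewhere in the thread against the example in the video.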
@@benbridgwater6479 Yes. Except the "answer" that they provide in the video is "incorrect", I don't understand why they have it wrong, but technically the 3rd row should be RED RED White Pink, not Pink Red White Pink.
I think an autoregressive self-supervised next token prediction multimodal training system that uses BERT embeddings concatenated with abstract representations of images of Chinese characters would teach a model enough abstract spatial reasoning to pass the ARC challenge. The trick here is that these images have semantic meaning, and the Chinese language has incorporated spatial ideas with semantic purpose into characters.
Thanks for this interesting video. Just getting started (first few minutes into the video), but it appears to me that Bongard's test at 4:44, number 43, regarding sets of amplitude, has an error. By definition, amplitude is the displacement vs the baseline (peak height or peak trough). Therefore this question seems to have some errors: the amplitude in some of the samples in set A is not increasing, and in some of set B is not decreasing. It appears to me that what they were looking for was actually meant to be the variation in amplitude, i.e. the answer should be Set A has "Increasing variation of Amplitude" and Set B has "Decreasing variation of Amplitude" (where variation would be the peak to trough of the wave vs the previous wave's peak to trough). To be more specific, not all examples in Set A are increasing in amplitude, but all are increasing in "variation". And not all examples in Set B are decreasing in amplitude, but they are all decreasing in variation. In that respect, it is important to note that the "quality" of the test question could reflect poorly on some AI that can't really "complain" about the answers not being present if it is forced to choose from a set of answers?
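The distinction drawn above, displacement from the baseline versus peak-to-trough variation, can be made concrete with a small numeric sketch. The sample values are invented for illustration and are not measured from the Bongard drawing; the baseline of 0 is also an assumption.

```python
def swings(extrema):
    """Peak-to-trough distance between each pair of consecutive extrema."""
    return [abs(b - a) for a, b in zip(extrema, extrema[1:])]

def is_increasing(values):
    """True if every value is strictly larger than the one before it."""
    return all(x < y for x, y in zip(values, values[1:]))

# Alternating peaks and troughs of a made-up wave that rides above its baseline.
extrema = [2.0, 1.0, 3.0, 0.5, 4.0]
baseline = 0.0  # assumed baseline

amplitudes = [abs(v - baseline) for v in extrema]
print(is_increasing(amplitudes))        # displacement from the baseline
print(is_increasing(swings(extrema)))   # peak-to-trough variation
```

For these values the two measures disagree, which is the situation the comment describes: the variation grows steadily while the displacement from the baseline does not.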
Additionally, at 10:00, the 1st 8x8 diagram converts to a 4x4 diagram. It is incorrectly converting. It looks to me like the top left quadrant is layer 1, then, going counter-clockwise, bottom left, then bottom right, and finally the top layer is top right. By stacking the layers in order, bottom to top, and considering black as absence of color - meaning black would appear like "glass" - if you stack the upper left quadrant first, you get pretty much all yellow, then stack bottom left, you cover with mostly pink, leaving some yellow showing through the black, then you stack the red (brown?) and that is where the "error" shows up. The 3rd quadrant of red should create 2 red dots at row 3 [R][R][P][P], then finally you add the last layer, which is why it "appears" to take precedence, because it is last. You get:
[Y][Y][W][B]
[P][P][P][R]
[R][R][W][P] (NOT [P][R][W][P])
[W][W][P][B]
In other words the "Answer" (at least as far as I can tell) accidentally has pink in the 3rd row, 1st column, and it should not.
Depends how formal you are about the definition of amplitude I guess? Decreasing / increasing (peak to peak) amplitude, maybe? I don't think that has much bearing on the test itself since the answer remains the same regardless of how nitpicky you want to get about the description you choose.
@@puneeification If a signal is increasing in amplitude, it gets greater from baseline. Therefore, question 43, incorrectly shows amplitudes both increasing and decreasing on the left and right sides. Even if you consider peak to peak, that is still an amplitude off of a baseline, therefore Question 43, right side, 2nd column, row 2, shows a rising amplitude from baseline. A question needs to be formulated properly in order for it to be considered a valid test of intelligence.
Hey Tim, could we ask Chollet about a broader measure of intelligence (physical, visual understanding of the world instead of abstract anthropocentric)? i.e - cats, dogs, gorillas, etc have immense intelligence about the world but they wouldn’t even rank on ARC.
I am nonplussed by the ARC-challenge. LLMs fail on it simply because they are out of domain. These are visuospatial tests. Better pre-training and scale on a multimodal model is the path forward. All these algorithmic and symbolic addons are just hacks. Forget about search space. Just make a bigger model and let it do what it does best - build abstractions and spot patterns. I also think Chollet is massively underestimating the amount of visuospatial training humans receive in their lifetime. Which is why I think it is fair to simply suggest multimodal scale as the ultimate and best solution to this challenge. It is perfectly fine to challenge Chollet's take on this. Beware reflexive deference to authority.
Yes I kept saying this, The LLM visual recognition capability isn’t precise enough for this. It’s not that the LLM doesn’t know, it’s just that it can’t see it as clearly as people think.
Agreed! If you've ever really tested the best multimodal frontier models on images to their limits you'll know they have terrible vision ability. I'd guess it's less than 1% as capable as a human. As Ryan pointed out in his blogpost if a human was read aloud problems from the ARC test while blindfolded and asked to speak the output row by row, that's what you're asking today's models to do in solving this. Of course they won't do well, and neither would 99.999% of humans
Sure, the easy way to solve these would be to train on a massive test set of similar problems, but that defeats the purpose of making progress towards AGI. You'd like an AGI to be able to solve problems that it is not familiar with and has not pre-trained on.
Entirely agree. LLMs evaluate and generate sequential input and output. Visual patterns require a more parallel interpretation, which is entirely possible with neural nets. Very much overhyped. As soon as more visual multimodal models are trained it will be fairly straightforward to let the model guess patterns and similarities, test these in context during inference and conclude whether they work or not. And continue to search until there is a solution. Very much like humans. Nothing special about that. Visual data requires more data and compute but nothing that is unimaginable in the near term.
@@TheTEDfan yes although I do question whether even with high quality vision perceptual capability that will translate to visual reasoning. I suspect there are some things the human brain does as it's trained on very high res continuous vision. So we learn how things move in an image/video and predict what they will look like over the following few seconds continuously. It may require training on video directly as opposed to just images. Or maybe the multimodality will transfer easily, idk
also: i dont understand how tokenization is still a thing. i mean it may have been useful 15 years ago to speed things up. but nowadays the first two layers could do the tokenization. just take the ascii characters. you lose two layers of efficiency, but you would gain so much. suddenly the model can count letters in words, can count words, etc. without having to resort to external tool-calls. and you could do all of this 2d grid made of ascii characters stuff natively. ascii as direct input is an obvious choice.
The problem with that - if you start by embedding characters rather than sub-word tokens - is that the initial embeddings (before they start getting augmented by transformations) won't have any meaning. It makes the learning task much harder. Starting with tokens, the model can learn a word2vec-type embedding where some semantics is already captured.
@@benbridgwater6479 yeah i get that. my question is only: how many layers of a say multilayer perceptron does it take to emulate the tokenizer? take the word "tokenizer". lets say these are two tokens: "token" and "izer". it is clear that using these as input has many advantages. i just wonder if it really takes so much computing power of a neural net to learn this tokenization from the single characters? is it not after layer 2 or so already basically the same? it just has to combine the five characters into one thing. and then you have the embedding of that thing, that is equivalent to the "token" embedding.
@@benbridgwater6479 and isnt it true that your level 0 input actually shouldnt have any meaning or semantics? it gets its semantics on the way through the layers, but not at the input level. a pixel in a cat picture doesnt have a meaning either until it gets further down. i agree that it slows it down and makes it harder, but it should be possible and it should have other benefits.
I imagine rather than ASCII you mean one token for each possible byte, rather than just the 128 values that encode an ASCII character? (So that it can still handle Unicode characters)
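The one-token-per-byte idea above is easy to see with plain UTF-8 encoding. A minimal sketch; `byte_tokens` is a hypothetical name, and this shows only the input encoding, not how any particular model is built.

```python
def byte_tokens(text: str) -> list[int]:
    """Encode text as UTF-8 and use each byte (0-255) as one token id."""
    return list(text.encode("utf-8"))

print(byte_tokens("token"))   # ASCII characters map to one byte each
print(byte_tokens("é"))       # a non-ASCII character becomes several bytes

# With byte-level input, counting letters is a direct operation on the token ids.
print(byte_tokens("strawberry").count(ord("r")))
```

The vocabulary is fixed at 256 entries and covers any Unicode text, at the cost of longer sequences than sub-word tokenization produces.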
Isn't it funny how often we like the decisions we make, are comfortable with them, sometimes even excited for what comes next. You would think it would be less often, if what we consider to be reasoning, transpires this way. Is it presenting words that align to the average of the totality of similar words, within the context? Is that a reasoned analysis? Does there need to be risk of consequences, can a meaningless decision, be a reasoned one?
I'm curious what prompt might trigger an LLM to "deploy" the kind of visual pattern seeking that an ARC challenge demands. I came to think about the so-called "gestalt laws" that seem to be hard coded into our perception and thinking. From a certain vantage point these could be understood as the kind of "psychological priors" that might help when solving ARC problems. So, would a prompt like "Always use the gestalt laws when solving problems" help? It might put an LLM in a mode that, together with chain-of-thought and visual thinking, does the trick.
"Regarding the prompt "Always use the gestalt laws when solving problems", this instruction could potentially enhance an LLM's ability to address the ARC problem in the following ways: Principle of Similarity: Grouping similar elements together can help in identifying patterns and relationships within the data, facilitating better comprehension and reasoning. Principle of Proximity: Clustering elements that are close to each other can aid in understanding the structure and context of information, which is crucial for solving complex problems. Principle of Closure: Encouraging the model to perceive incomplete shapes as complete can improve its ability to fill in gaps and make more accurate predictions based on partial information. Principle of Continuity: Recognizing continuous patterns can help the model follow logical sequences and enhance its reasoning capabilities. By incorporating these gestalt principles, LLMs could potentially improve their performance on the ARC benchmark by enhancing their pattern recognition, problem-solving, and reasoning skills, leading to more human-like comprehension and decision-making abilities "
To rule out that the core problem for an LLM is that this problem is 2D: has anybody tried to create a 1D version of ARC? The same type of problems, just in a 1D grid?
Most LLMs convert this into 1D internally... they just take every pixel and convert it into a very long one-dimensional array, then consume it in one shot, don't they?
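A minimal sketch of what "converting into 1D" could look like, assuming a plain row-major flattening (how any specific model actually serializes a grid is not stated in this thread). The function names are hypothetical.

```python
def flatten(grid: list[list[int]]) -> list[int]:
    """Row-major flattening: rows are concatenated left to right, top to bottom."""
    return [cell for row in grid for cell in row]


def unflatten(flat: list[int], width: int) -> list[list[int]]:
    """Rebuild the 2D grid, which only works if the width is known."""
    return [flat[i:i + width] for i in range(0, len(flat), width)]


grid = [[0, 1, 2],
        [3, 4, 5]]
flat = flatten(grid)
print(flat)                 # [0, 1, 2, 3, 4, 5]
print(unflatten(flat, 3))   # [[0, 1, 2], [3, 4, 5]]
```

Nothing is lost in the flattening, but vertical neighbours end up `width` positions apart in the sequence, so the 2D structure has to be inferred rather than read off directly.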
Why don't we uncrystallize the models? Let them run in training mode and use a function to determine the backpropagation signal strength. This should make them able to slightly alter the weights when they are wrong. Give them neuroplasticity and see what comes of it
I think we address this in some of the other shows we have on this, i.e. ua-cam.com/video/J0p_thJJnoo/v-deo.html and ua-cam.com/video/mEVnu-KZjq4/v-deo.html - but we will be sure to address it in the upcoming show with Chollet. The basic idea is that, just as some Bayesian quantities are computationally intractable because they require you to consider every possible value of a thing (i.e. an infinite number of things), Chollet's measure of intelligence requires you to consider the space of all possible programs, which clearly isn't possible to do on a computer
Chollet's measure requires that you somehow get a perfect solution for a given problem, and that's not attainable in the general case in finite time due to computational irreducibility (halting problem etc.).
the way i understood it is that sequences that have an infinite number of patterns cannot be calculated/searched in finite time with a computer. however, these sequences can be defined using finite descriptions, even if they cannot be computed.
Flexibility? Human crystalized intelligence is not really very crystalized at all. Even our old memories are constantly being modified by more recent experiences
Tim, I watched this episode twice and (in between) researched all the literature I could find about ARC-AGI. In the beginning you mention Core Knowledge as defined in cognitive science. But you do not elaborate on how that translates into inductive priors. Specifically, treating ARC tasks as object manipulation. Have I missed something?
What does it say about society and our traditional metrics of performance assuming the premise that LLMs are not "true intelligence" is valid. Suppose the analogy "LLMs are just look up databases" is true...even though if so they are databases that machines organized, not humans, but what does it mean if that sort of compression/memorization can perform tasks that society values as useful even if we want to debate if it is actually a sign of intelligence. For some reason I am reminded of how chimps outperform humans in the aptly named "chimp test" meanwhile nobody is particularly impressed with the kind of profound insight that can offer to our beliefs about what relevant philosophical terms might actually imply. Either way it seems obvious to me that we are on the verge of major disruption regardless of whether people want to debate the metrics or semantics. We have still entered the age of the thinking machine to my view because no longer are people discussing whether it is or is not possible for machines to reach human levels of performance in some arbitrary task...now the question is what is the best metric to use for comparison... IMO this is what Turing meant to broach with his Turing test thought experiment. Not as a true test to determine some measurable difference...but rather he anticipated a time when attitudes and perceptions would shift focus beyond the mystical beliefs about what it means to think and more towards the demarcation of how that plays out as an application and how humanity will respond in terms of what we value.
Watch some more of these videos. Pretty much everyone he has ever talked to agrees that LLMs are just a database lookup and can't do any reasoning. All the experts in the field agree on this. And it just makes logical sense. By what mechanism could they possibly do reasoning? There's nothing in the algorithm to do that. LLMs are intelligent in the way wikipedia is. It contains a lot of information, but it isn't going to be able to reason on it. ARC is a perfect example. The prompt "Tell me the 5th letter in this sentence" is another. It can't even handle these simple things in large part because it doesn't understand anything. It just probabilistically selects the next likely word given the previous words. People really need to come to terms with this. That's literally all it's doing.
@xzvbcxSyntaxError Yeah exactly. You've answered it yourself. That's how it's encoding the information to be retrieved. It's still stored data that is getting retrieved and output. Can you tell me why some databases use graphs? Your argument makes no sense. Omg, this database is stored in ASCII, so it's encoded a certain way so it's not retrieving data. It's thinking! Talking about it this way shows how the argument you are presenting makes no sense. What difference does it make how people choose to store and retrieve stuff? It's effectively a database. It's closer to a database than a thinking person. That's the point. Not that people literally think you are retrieving exact stored content the way a normal database works.
@@thorvaldspear No, that is incorrect... Amplitude is supposed to be referenced to the baseline. Therefore, technically, the answer provided by the page displayed in the video is incorrect. Set A has increasing variation of the amplitude (on the left), and decreasing variation of amplitude on the right. The answer regarding amplitude alone is not correct, specifically because the right side, 2nd column, row 2, shows a rising amplitude from baseline. However, if you modify the question to be "Variation" in amplitude, then it would make a lot more sense.
@@thorvaldspear If I were to look at the amplitude of a wave on an oscilloscope, I would measure it by looking at the "peak" versus the baseline. The question clearly mixes up decreasing and increasing amplitudes on both sides of the image. So, whoever wrote the question was either thinking variation of amplitude, but never wrote down the word "variation", or simply misunderstood it. Either way, the question/answer is not valid.
You could use Metas new compiler as the language for concepts instead of chatgpt and python, likely much quicker and claude , use smallest first, then second then top one. or nvidia's. you can have a platform which runs on interpolation but not the abstract layer, like game of life? Also you can interpolate and find any value in two steps with the right structure, I wrote such a program two years ago, concept is an extension of Newton method, how you would apply it to this I don't know
Yes - the rules are on Kaggle, but nothing about it needing to do anything else. Obviously there's quite a perceptual component to the challenge, which perhaps favors hybrid approaches (neural perception + symbolic program synthesis/search), but no rules saying they have to be.
Can we learn to create symphonies from seeing frequency images of audio files? No. It is not a matter of intelligence but a matter of state representations. LLMs have no sense of the ARC states. Therefore these tasks are difficult for them. But it is no problem to train NNs on such tasks, and they would be able to solve them because they would learn to understand the states like they understood linguistic states.
Considering the Aristotelian Blank Slate idea and LLMs in general, we would perhaps benefit from philosophers to do some work for us in trying to well-define what is going on here. Fortunately I have been doing just that. Aristotle went out of fashion during the Enlightenment era Rationalism from which the digital computing paradigm emerged. At the same time we get the idea that bottom-up causality explains everything and that language has a pure logical form. Then comes the 20th century and states that digital arithmetic systems can not compute quantum information efficiently (Schrödinger Equation), bottom-up causality doesn't cut it (Gödel) and language is evolutionary and logic has no content (Wittgenstein). In practice this means that natural language has Zipf's Law (fractal dimensionality of word frequency), where most frequent words are syntactic, in other words, they give logical support to the content of communication. Guess what? When we measure sequential windows of Zipf's Law the grammatical top frequency stabilizes around a thousand tokens, which probably explains the quality difference between GPT-2 and GPT-3; and why other technologies see similar quality increase when they reach the same limit and why Transformers have not had qualitative leaps from further scaling. Logical structures are Blank Slates in a sense. These are the "genetical brain organs" in humans (Chomsky) or the architecture of neural network training apparatus / feature engineering. What happens to the content after learning is the interesting thing that ARC-like tests should go for. According to Eero Tarasti and his Existential Semiotics, with humans we encode phenomenal patterns together with our noumenological existential volatility. In other words emotional frustration allows us to easily switch context when our initial problem solving goes wrong. Neural Networks do not have that. They are purely phenomenologically limited to the target function.
In Quantum Machine Learning we also get the Phase Shift parameter which could be "emotionalized", but doing it in a human-like manner would be nearly impossible. But still it would give something more controllable than with Transformers, where you have to guess the correct magic word for invocation of viable secondary contexts. In other words, human brains have evolved for our own environment for a very long time, which gives us super powers because the "no-content" apparatus of us is always inside the distribution. I call this Cartesian digital bottom-up computational system Hobbes's Golem (Descartes said computers are impossible, Hobbes said "challenge accepted"); Hobbes's Golem is not a product of natural evolution, but it is always built with engineering principles. We both might start as Blank Slates, but the way our Logic Organs work and the way our content gets encoded, is fundamentally different. Trying to prove there is no difference is silly. The interesting question is, how does that matter? At which point should we take information pollution of digital environments seriously and try to build more "brain friendly" user experiences? Are LLMs part of that or against? I think LLMs might be good for "more democratic access", but problematic as "unverifiable content consumption"; should we just start paying journalists and educators proper salary so they could do their jobs rather than trying to synthesize away the human component in information refinement? Would be interesting to see a show around these domains. QNLP, Aristotle, Wittgenstein, Complexity Sciences, 4E cognition (Post-Cognitivism and their reaction to Connectionism).
That's fine, this is the public dataset. The method described has not yet been tested on the private dataset (where the method described wouldn't be allowed)
We need a kind of AI baby that learns by interacting with the world, and living alongside humans. An evolutionary algorithm needs to reward AI that uses less data for higher inference. How can that be done? Could something like that be run using a basic simulation of some problems?
I don't really see why people assume a NN helps the LLM beyond being a probabilistic distributed hash. Why do you/they think the backprop ever mattered beyond probabilities being normalized? A reasonable common sense assumption is that the LLM is merely recursively best-fitting crystalized tokens. I predict in a couple years we will be laughing at how much extra work could have been avoided with a more efficient probabilistic hashing system.
This is true, but the generalisation power of LLMs is (in my opinion) that they implicitly permute many different symmetry-based variations of the training data, so there is a surprising amount to find if you guide the search process. Of course, these permutations will just be "in the neighbourhood" of the data it was trained on, so the space of creativity is still grounded by the source data, the inductive prior used and the search query (the prompt).
The "bitter lesson" was wrong though, because all our best models work precisely because they have a tonne of hand-designed engineering and priors. Believing in scale was a cute idea a couple of years ago
@@MachineLearningStreetTalk Thinking it's wrong is a cute idea nowadays. Maybe 70 years and a stage of the field where one reference manual with 1000+ pages called "Modern AI" (Russell / Norvig) dedicated a (very) few pages to ANNs, in a sub-section of a chapter, and Hinton having papers rejected because one about NNs was already enough for a top AI Conference wasn't enough. People need to turn to philosophy and making knots in their minds. You really think that was the way Connectionist ideas were introduced in psychology? Or what originated the Transformer architecture? Or even what showed the Transformer could go from GPT 2 level performance to GPT 3 and beyond? This is the intuition: your brain has neurons. Neurons do a certain type of information processing. You have a lot of them, and evolution made human neurons organize in a particular way that makes humans have cognitive capabilities that are far beyond what other animals have. In neuroscience, many folks are convinced the changes in the human brain were also mainly a matter of scaling (number of neurons and connections in certain areas: no need for "hand-designed engineering" there too: cute, hm?). But no doubt you are much smarter than I am (no irony or cynicism here: just something I believe). I just think you have too much of a "philosophical" tendency, and you have a profound need to complicate stuff. There was a time I regularly followed your show: it was fresh, and ideal to watch over my kid as he played in the street. But nowadays, if this is your idea of street talk... Kind regards, and thank you for answering my comment. I feel honored.
@@MachineLearningStreetTalk Not believing in scale is a "cute" idea nowadays. Believing in it was a bold stance that few took-kudos to Frank Rosenblatt and Geoffrey Hinton. Undoubtedly, their inspiration came from the brain and their background in psychology. The human brain isn't that different from other animals near our evolutionary line. Recently, Demis Hassabis, who also has a background in neuroscience, joined Hinton in criticizing Chomskian ideas about language. There's no time for discussions now, but it's unfortunate that my more curated answer done on the tablet was deleted. Kind regards. As someone who isn't a native English speaker, I thought, "Why not ask ChatGPT to improve my grammar?" After all, if AI can mimic the brain, it can surely handle my sentences!
We have interviewed connectionists, try the Nick Chater interview. NNs are nothing like the brain whatsoever, you might enjoy the Max Bennett interview when we release that. Unfortunately due to the complexity of the topic I can't address your other points here but suffice to say we have addressed them many times on previous episodes.
Surely the issue for an LLM is that it's just a 1D token predictor (strings) rather than a 2D token predictor (images) required here? I presume someone has tried to turn an LLM into a 2D next token (pixel) predictor?
If it turns out to be possible to cheat/shortcut it, absolutely. Dileep George speaks about the "perceptual leakage" of ARC here substack.com/home/post/p-145553885 - it's only a good benchmark if it can't be gamed, and I agree it probably will be eventually
how is an ai model like an llm supposed to know about spatial relationships. here: ABC DEF GHI for an llm this is just a text ABC DEF GHI. how is it supposed to know that B is "above" E. or that if you go diagonally from A towards E, you end up at I? or that A is two to the left of C? it has never learned about that. it has no collection of features and representations and vector embeddings that say "to the left of" or "above" or "two steps further" or "diagonal" or "straight" or "inside" or "outside" (E is "inside" of B D F H), etc. at least not a text model. an image model or a video model maybe. or a multimodal model. but the llm would have to have access to these features. and it would need a "grid mode" and run in that whenever a problem with a 2d grid comes up. or take for example connect 4. draw a 6 by 7 grid. the discs fall "down". how is the llm supposed to know, what down is? it doesnt know, that my monitor is standing upright on my table and for me it looks like down, gravity is switched on, etc. these supposed super-intelligent models cant even count the correct number of discs on a board! training models like these, would go like this: you come up with thousands of example grids and let humans label them, what they see there. for example: 5 by 5 grid with three blobs. red blob is to the left of the blue blob. the red blob consists of 3 pixels and the blue one of 5. and so on. but you can already see, that this is not so easy as labelling cat and dog picutures! there is much more detail to be described. it is much more abstract. the grid is already an abstraction from the real world we live in for us humans which is much more smooth! that's the problem.
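The relations listed above ("above", "two to the left of", "diagonally from A towards E") can be spelled out explicitly once each character is given a (row, column) position. A minimal sketch, assuming the text "ABC DEF GHI" stands for three rows separated by spaces; all names here are hypothetical, and this shows what the relations mean, not how an LLM represents them.

```python
TEXT = "ABC DEF GHI"


def positions(text: str) -> dict[str, tuple[int, int]]:
    """Map each character to its (row, column), treating spaces as row breaks."""
    pos = {}
    for r, row in enumerate(text.split()):
        for c, ch in enumerate(row):
            pos[ch] = (r, c)
    return pos


POS = positions(TEXT)


def is_above(a: str, b: str) -> bool:
    """True if a is in the same column as b and in an earlier row."""
    (ra, ca), (rb, cb) = POS[a], POS[b]
    return ca == cb and ra < rb


def columns_left_of(a: str, b: str):
    """Number of columns a lies to the left of b in the same row, else None."""
    (ra, ca), (rb, cb) = POS[a], POS[b]
    return cb - ca if ra == rb and ca < cb else None


def continue_step(start: str, through: str):
    """Repeat the step from start to through once more and return the cell reached."""
    (rs, cs), (rt, ct) = POS[start], POS[through]
    target = (2 * rt - rs, 2 * ct - cs)
    for ch, p in POS.items():
        if p == target:
            return ch
    return None


print(is_above("B", "E"))         # True
print(columns_left_of("A", "C"))  # 2
print(continue_step("A", "E"))    # I
```

In the flat string, B and E are four characters apart, so none of these relations is visible without first recovering the row and column structure.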
This kind of approach, trying to understand "how human intelligence solves problems" and do the same, seems too simplistic. It reminds me of "symbolic AI": think about how we solve problems and try to figure out how to make a machine do the same!! I think "connectionism" still seems much more promising! Give it the data, let the system "interpolate" (call it what you will) no matter how, and expect magic to emerge! The secret of the transformer model is the ability to make math connections (induction) and let the linguistics make another layer of human-experience connections (deductive) on top. And this seems to break down the idea that AI is only about hardware computations, because the second layer (linguistic) is not math, even though it is supported by math! Every time we try to explain what intelligence must do, we are on the wrong path! We still don't know what intelligence is! Because we don't know, we cannot think "intelligence must do this" or "must be like that"! That's my guess!
I doubt some Bushman can pass ARC-AGI test. Does it mean he is not intelligent? I think Francois Chollet really underestimates our experience when he talks about our ability to resolve a "new" task.
Of course they won't pass all the tests, but they'll surpass current LLMs... A caveman's core knowledge would be less than ours. If you look at some of the ARC tests, it's mostly pattern recognition and nothing special.
LLMs are far more generally intelligent than any other model (say image, audio) since the training data on language has more knowledge. If you think about image/audio generation, they are essentially specialized with less intelligence due to the large amount of data but minimal knowledge stored in that medium. I think what's better than an LLM would be a logic machine that prioritizes truth over plausibility.
@@ckq We learned a long time ago that symbolic AI just isn’t going to work so the sort of logic machine you’re describing is unlikely to ever work. Unless you mean something else by it.
I think models are bad at this kind of test mostly because they are not trained for that. AI researchers were mostly focused on text and even vision models are mostly optimized for object recognition, not recognizing patterns and spatial intelligence. So most of the ARC test difficulty comes from picking the field that is mostly ignored by AI scientists and model progress there is lagging behind. And of course for humans this kind of problem is really easy cause recognizing predators is a high priority task for survival.
This whole thing makes me pretty angry. I guess this type of gaming the system is always going to happen when money is involved. They tried to game the system to win in pretty bad faith in my opinion. The ARC challenge clearly states that the purpose is to find new ideas, which means if you try to use an existing thing like an LLM you should be disqualified (even if you contorted yourself horribly with crazy prompt engineering (which it sounds like they did), then you still are going 100% against the spirit of the competition). Secondly, training it on extra examples to try to memorize is strictly against the intent of learning from a few examples. Yes, you can do that, but that is not the point of ARC. Also, should be disqualified. I can't tell whether they know what they are doing and are just trying to protect themselves, or if they really believe their own BS. Basically, if you can't do it the way it was intended to be done, then you are doing it wrong. You need to come up with a new idea. Aka, the literal point of the competition. They didn't put down so much money so you could just use an LLM. Duh, of course people can try to do that. No one cares. They are trying to find new ideas.
If there are tons of humans trying to solve this ARC challenge and we still can't find the solution, at least we can prove humans aren't so intelligent after all.
But... A HUMAN is still having to GUIDE this "training" and the decisions on the "type" of training: fine-tuning, running python programs.. the AI still isn't "Thinking" or "Reasoning" for itself, it's still wholly dependent on the Human to help it solve the ARC challenge.. That's NOT AGI. - youtube.com/@bioneuralai
gosh i wish the guests didnt speak like they are sleep-deprived for days. I am sure they are all brilliant people but to listen to them is another story urgh...
I think the resurgence of the ARC challenge is one of the most interesting things to have happened this year in AI. Just the level of nuance and debate it has forced into the conversation can only be good for the community. Whether it is beaten or not, we’ll all be wiser for having gone through this exercise. Chollet really has devised an incredibly ingenious challenge.
never touched machine learning, don't know what a tensor even is (just seen it as a class in some machine learning code on twitter), but 35 mins in the video and I dont feel lost. you bet im subbing.
Post show reflections: I know I was pushing the "LLMs are databases" line quite hard, and the guests (and Ryan's article) were suggesting that they do some (small) kind of "patterned meta reasoning". This is quite a nuanced issue. While I still think LLMs are basically databases, *something* interesting happens with in-context learning. The "reasoning" prompt (or database query if you like) is parasitic on the human operator - but the LLM itself does seem to do some of the patterned completion/extension of the human reasoning prompt pattern in context i.e. "above the database layer" there is some kind of primitive meta patterning going in which is creating novel combinations of retrieved skill programs in the LLM. It's a subtle point but I don't think I was able to express it in the show. - Tim
Hi Tim, LLM output is in the mathematical sense a sequential chaotic process. The ARC challenge uses 2D grids which requires more visual intelligence that is not purely sequential. It is entirely possible and even highly likely that multi model models that have the ability to focus sequentially on parts of images and have in context sequential generation of possible similarities in the examples and testing those during inference will lead to solving the ARC challenge without the need for coding. The main problem with the ARC challenge is that the most advanced models (which are closed source) are not allowed to participate. It is difficult to state those models don’t work if they cannot participate. The solution is to ensure the advanced models are used in a mode that prevents them from training future models on the interactions. A very easy prompt that can be used on multiple hard problems is “keep trying and testing ideas until you find one that works.” That is very much how humans solve such puzzles. LLMs are indeed not just databases if you see what they can do with sequential problem solving like e.g. coding and debugging entirely novel programs. Just as zooming in on Mandelbrot set is not a database problem. Both can take you into directions that you can only discover by going there.
No need to apologize. As far as I remember, every guest you've had agrees with the LLMs are databases theory. So, it's pretty reasonable to say that if most of the experts in the field agree. I think the confusion comes from all the marketing hype that the companies put out to try to oversell their capabilities.
@@TheTEDfan You can use whatever you want on your own. I'm sure you would get some reward if you could solve all the challenges with an LLM (since many have tried and apparently the best is still only 50%). But the point of the ARC challenge is to discover new ideas, so using an LLM goes against the spirit of it no matter what.
Excellent content! I'm awaiting the Geoffrey Hinton interview....
Also, this avenue that you've been reporting on reminds me of this paper on the brain - as an aside to ML/AI, it's an interesting theory as to the organic functions we're trying to create digital counterparts to:
[note: yt wont let me post links]
google search "National Library of Medicine Top-down predictions in the cognitive brain"
ICL or RAG doesn't conflict with your view. They are basically a relatively trivial kind of program synthesis on top of the LLM, since they involve some heuristic or embedding-based discrete choices to guide the LLM. It's a rather weak form of system 2 on top of system 1 to help performance IMO.
I can't begin to tell you how much quality of life this channel has brought me over the years since my health issues have impeded my mobility. These videos are so stimulating and profound, I wish I could offer more. I so, so, so much appreciate your work Tim, and Yannic and Keith too. Thank you all so much.
Thank you!!
I often share your insightful and well-explained videos with my children. I want to express my sincere gratitude to everyone involved in creating MLST's content. It's truly exceptional. I wish more content creators would prioritize clear, informative delivery over sensationalism, as you do so well. Thank you!
Thank you!!
9:54 I don't think it's clockwise. The first example shows that white should be on top, but also that pink should be on top of brown (there would be more brown in the solution otherwise).
So I'd guess: white > pink > brown > yellow (in terms of z-index)
It's overlaying 4x4 quadrants of the original on top of each other, treating black as transparent. 4 over 1, then 3 over 1, then 2 over 1. Solution is resulting quadrant 1.
@@benbridgwater6479 You can also think of it like layering colored papers on top of each other with the black part cut out. If the Upper Right is Quadrant 1, and numbering clockwise, in quadrants... Lay down Quadrant 4 first (all Yellow + Black), then lay Quadrant 3 on top of it (all Pink + Black), then lay down Quadrant 2 (Red + Black) and finally lay the last layer, Quadrant 1 (White + Black)... the black represents nothingness or glass or pure alpha channel or transparency depending on your viewpoint, just as long as you let anything under black come through to the top. Once a color gets covered by anything other than black, it is superseded.
Your original comment is correct (white > pink > red/brown > yellow). The comment directly above confuses the order the quadrants are superimposed, ignoring that pink visibility trumps red/brown visibility.
@@simonahrendt9069 No - the only special color is black which acts as transparent when overlayed on top of something else. Otherwise what "trumps" what just comes from the overlay order - whatever ends up on top wins. Easy to verify in GIMP.
@@simonahrendt9069 The "order" is Layer 1 (Yellow = Q4), Layer 2(Pink=Q3), Layer 3 (Red=Q2), Layer 4(White=Q1).
Red covers pink. Pink does NOT trump red.
Where Q4 means quadrant 4, and you layer it down going counterclockwise starting at Q4. (But you can name the quadrants whatever you want... the order does not change).
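The disagreement in this thread comes down to the stacking order, which is easy to check in code. Below is a minimal sketch of "overlay with black as transparent", assuming grids are lists of lists of integer color codes. The color codes and the tiny 2x2 layers are invented for illustration and are not the actual task from the video.

```python
BLACK, RED, YELLOW, PINK, WHITE = 0, 2, 4, 6, 7  # hypothetical color codes


def overlay(layers: list[list[list[int]]]) -> list[list[int]]:
    """Stack same-sized layers; the first is the bottom, the last is on top.

    Black cells are transparent, so whatever lies underneath shows through.
    """
    height, width = len(layers[0]), len(layers[0][0])
    out = [[BLACK] * width for _ in range(height)]
    for layer in layers:
        for r in range(height):
            for c in range(width):
                if layer[r][c] != BLACK:
                    out[r][c] = layer[r][c]
    return out


yellow = [[YELLOW, YELLOW], [YELLOW, BLACK]]
pink = [[PINK, PINK], [BLACK, BLACK]]
red = [[RED, BLACK], [BLACK, BLACK]]
white = [[BLACK, BLACK], [BLACK, WHITE]]

# Red laid down after pink: red covers pink where they overlap.
print(overlay([yellow, pink, red, white]))  # [[2, 6], [4, 7]]

# Pink laid down after red: pink covers red instead.
print(overlay([yellow, red, pink, white]))  # [[6, 6], [4, 7]]
```

The two calls differ only in the top-left cell, where red and pink overlap, which is the kind of cell the comments above point to when arguing about the order.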
Timestamps
00:00:00 Introduction
00:03:00 Francois Chollet's Intelligence Concept
00:08:00 Human Collaboration
00:15:00 ARC Tasks and Symbolic AI
00:27:00 Evaluation Techniques
00:35:23 (Main Interview) Competitors and Approaches
00:40:00 Meta Learning Challenges
00:48:00 System 1 vs System 2
01:00:00 Inductive Priors and Symbols
01:18:00 Methodologies Comparison
01:25:00 Training Data Size Impact
01:35:00 Generalization Issues
01:47:00 Techniques for AI Applications
01:56:00 Model Efficiency and Scalability
02:10:00 Task Specificity and Generalization
02:13:00 Summary
Just stumbled across this - I'm the author of the paper discussed from 4:00. Really cool discussion and you did a great job explaining :)
Great paper Mikel! Thanks!
I'm not sure I see ARC tests as examples of "abstraction" and/or "reasoning." I see them as our capacity - at the Perceptual level - to automatically categorize concrete things into like "kinds" of things due to their perceived similarity (or dis-similarity in the case of a missing similar piece). This is why young children (not yet operating at a very high level of verbal abstract reasoning) can solve these types of problems. The problems are resolved at the perceptual level - not at the higher (verbal) levels of abstract reasoning. If the images are flashed for a brief fraction of a second, you won't "perceive" the solution. Instead, you just stare at them over time, and your brain instantiates them through constructing neural pathways that are similar. And you see the "solution". This is why humans don't need large, labeled data sets to "get" what a cat is. A young child doesn't even need to be at the verbal stage to differentiate dogs and cats into different "kinds" of things.
That's kinda the best measure of core human intelligence we have, this test is not contaminated by knowledge.
There's a blur between knowledge and intelligence and that's the issue with current IQ test.
The idea of the ARC test is that you have various simple perceptual tests, but the AI needs to instruct itself to solve them. That way it needs to reason, which can be defined as needing to instantiate fewer brute-force perceptual solvers.
@@divineigbinoba4506 I wouldn't say this class of problems is independent of any prior knowledge. One has to come up with a function (series of steps) that maps some input space to some output space. Time is the limiting factor.
The shape of the input and output will constrain the search space to some degree. Without any knowledge, it's still brute force.
Time to solution goes down with a more accurate "initial guess" / more efficient "tools" to approach the problem with.
A "tool" is either prior understanding of a general idea, or prior knowledge of / familiarity with an idea specific to this context.
Definitely not a pure function devoid of knowledge.
The superposition example at 9:34 is not in the clockwise order:
It's yellow -> red -> pink -> white with later ones on top.
Left edge of first one shows pink is on top of red.
Yeah, it took me a while before having the same conclusion 👍
I think many people under estimate the complexity of our visual cortex.
There has been interesting research, based on persons with brain defects.
And every time one finds new insights.
What looks simple in the ARC challenge is millions of years of evolution.
Language is only a few thousand years old.
Reasoning likely even less.
Amazing conversation.
Thanks
Can you give an example of one of the new insights that was found by studying someone with brain defects?
@@Bd-ng1zv
quick search, but there are many more
One notable example of new insights gained from studying someone with brain defects is the case of patient H.M. (Henry Molaison). H.M. underwent surgery in 1953 to treat severe epilepsy, which involved the removal of large portions of his medial temporal lobes, including the hippocampus.
After the surgery, H.M. was unable to form new long-term memories, although his short-term memory and general cognitive abilities remained intact. This led to the groundbreaking insight that the hippocampus is crucial for the formation of new long-term declarative memories (memories of facts and events) but not for short-term memory or procedural memory (like learning new motor skills).
This case provided a foundational understanding of how different types of memory are processed in the brain and how specific brain regions are involved in different aspects of memory, significantly influencing neuroscience and cognitive psychology.
You talk through and present everything so clearly, I can follow along and understand easily despite knowing nothing about ML, thanks!
Excellent conversation! This is the type of cross-disciplinary discussion that will advance the field of artificial intelligence.
Long term planning and zero shot learning seem like the last hurdles to AGI
Zero shot learning sounds easy enough, right? 😜
It's not so much zero-shot learning that's needed as runtime incremental permanent learning. It doesn't seem gradient descent would work since you'd just be fine-tuning on new experience and would end up losing the pre-trained model's capabilities. Runtime learning might sometimes be one-shot, but other times generalization over repeated patterns, learning exceptions, etc. Really need to ditch gradient descent altogether and find new incremental method.
@@benbridgwater6479 No need to ditch it all together; gradient descent is how you learn to throw a ball; a little better each time. Gradient descent is not how you learn to reason a little better each time.
Also multi-context, so the AI can work on a text making use of a working context. That way there is no contamination between the working text and the meta-goals. This way you can ask the AI to be "didactic" in parts of the text and "critical" in other parts. I would think embedding slicing could solve that.
@@MrMichiel1983 Yes, but in our brain fine motor skills like ball throwing are learnt by the cerebellum, while cognitive pattern matching etc. is learnt by the cortex. What AGI needs is for "cortical learning" to be available all the time (preferably no training vs inference time distinction).
I don't think the LLM's image recognition capabilities are precise enough for the ARC challenge. It's not that the LLM doesn't know; it's more that it cannot see as clearly as you think.
Yes it cannot see and (as part of that) it cannot reason in the spatial domain. Sooner or later this will be solved
Yeah, I think it’s unclear if simply solving the image recognition will be enough, it might not be, but the ARC test does feel a little pointless currently when the LLMs clearly can’t see clearly.
Yeah LLMs are not precision instruments. They are good at getting the gist of things in any domain. This is just an extra layer that makes that even more pronounced. I would argue if you made an LLM be able to solve these, they would cease to be useful in any other domain. Plus what the heck is wrong with people. It's not a language task, it's vision+reasoning task. I don't understand why people try to use a language algorithm to solve literally everything now.
@@InfiniteQuest86 Yeah, I did a quick experiment where I described the scene of the first question on the test in great detail, and the model was able to get it correct because it was now a text-based problem. I then refreshed the chat and asked the model to describe what it sees, and it was evident that the vision capabilities were the issue, though I was already suspicious of this being the issue based on my long experience with the model.
I tried this with GPT4o, and it does understand the color scheme when I paste it in. It can't grasp the conversion tho. I think without some hints this particular problem is very difficult to solve.
Wow this is better than expected
Congrats! We both hit 131k subs at the same time :) - What's everyone's take on few-shot prompting vs. test-time fine-tuning? My sense is in the limit, few-shot prompting would be all you need, and ultimately zero-shot (based on their point that as the foundation model gets bigger, you need less test-time tuning)
At 5:38, it's not monotonicity, it's increasing vs decreasing amplitude. Just nit-picking. 😄
You passed the test, well done my friend :)
Ah, looks like we both noticed that, but actually it is not increasing and decreasing amplitude. It's increasing and decreasing variation in Amplitude.
@@marcfruchtman9473 actually it's the emotional state of a 5th grade doodler, get it right please.
@@marcfruchtman9473 I mean you really only have to read the legend under the picture guys, it's not that hard...
@@samifawcett4246 hehe
Tim, and Yannic and Keith are the smartest tech-journos I ever heard of 😀
I love these guys. what an amazing conversation
Clarification on reasoning. "knowledge acquisition = reasoning" is a great heuristic but clearly isn't exactly correct. I think it helps folks understand what is meant in this context though. It might be more correct to say that reasoning is the thing we do when we "rearrange the variables" to construct models (in many cases composed of existing models we already have) to make sense of the world.
My cohost Keith Duggar defines reasoning as **performing an effective computation to** **derive knowledge or achieve a goal.**
“We may have knowledge of the past but cannot control it;
we may control the future but have no knowledge of it.”
- Claude Elwood Shannon
Science leverages control to gain knowledge.
Engineering leverages knowledge to gain control.
Reasoning is the **effective** computation in both.
Effective method:
A method is considered effective for a class of problems if it meets these criteria:
1. It consists of a finite number of precise instructions.
2. It terminates after a finite number of steps.
3. It always produces a correct answer for problems within its class.
4. It can be executed by a human using only writing materials.
5. It requires no ingenuity, only strict adherence to the instructions.
Reasoning doesn't require knowledge acquisition - it can just be reasoning over known facts, using known methods. What reasoning does need is multiple steps and combinatorial application of knowledge. Essentially Intelligence = prediction, and reasoning = multi-step what-if prediction.
@@benbridgwater6479 This is just semantics - I agree with you. Reasoning can be described as "creatively recombining knowledge you already have". There is still an infinite space of recombinations which needs to be performed efficiently.
Reasoning = loop(deduction • induction • abduction)
Error functions have a natural tendency towards parsimony, reflecting the principle of Occam's razor. This principle is also observed in nature, where simple and efficient solutions often prevail, suggesting that parsimony is a fundamental aspect of both human and other natural systems.
Not always. There is that recent paper "Neural Redshift" claiming that this is not universal for all neural networks and is in fact related to specific architecture choices: "But unlike common wisdom, NNs do not have an inherent “simplicity bias”. This property depends on components such as ReLUs, residual connections, and layer normalizations." They show that randomly initialized networks (not trained yet) have different complexity properties based on metrics such as frequencies in Fourier decompositions, order in polynomial decompositions, and compressibility of the input-output mapping and it influences how hard is to train networks and get good generalization. Their conclusions: "We examined inductive biases that NNs possess independently of their optimization. We found that the parameter space of popular architectures corresponds overwhelmingly to functions with three quantifiable properties: low frequency, low order, and compressibility. They correspond to the simplicity bias previously observed in trained models which we now explain without involving (S)GD. We also showed that the simplicity bias is not universal to all architectures."
I'm interested for ai feature
Absolutely fascinating, rich and informative episode. Lots to consider here, thank you so much. 🙏👍
I can back up everything he said with lower domain examples, especially the 32k context deg! I’ve written to OpenAI devs on this since the 120k context models came out. 4o was an improvement from turbo but still lacks the nuance of the OG gpt4 model. We still use this where we can. But the context window hurts us
Any examples on your lower domain proposal?
@@pliniocastro1546 RAG, Recall, Search. grab any piece of text, after 32K you just get a significant drop in intelligence. give it 100K tokens ask it to identify or count the number of times a word is said. then use command F.
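The counting check described above needs a ground truth to compare the model's answer against. A minimal sketch of the "command F" step in Python; the function name `count_word` and the sample document are made up for illustration, and whole-word, case-insensitive matching is an assumption about what counts as "a word being said":

```python
import re


def count_word(text: str, word: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Hypothetical stand-in for the long text that would be pasted into the model.
document = "The cat sat. " * 1000 + "A dog barked at the cat."

# This number is what the model's answer would be compared against.
ground_truth = count_word(document, "cat")
print(ground_truth)
```

Running the same prompt at a few different document lengths (say below and above 32K tokens) and comparing against `ground_truth` each time would show where the drop starts.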
never touched machine learning, don't know what a tensor even is (just seen it as a class in some machine learning code on twitter), but 35 mins in the video and I dont feel lost. you bet im subbing.
22:05 How does a simple version of the exact solution Chollet imagined go against the spirit of the metric?
Very informative interview!
Are we sure that there is only one correct answer for all the questions though? What if the test set is based on our understanding of the world and rules, when in fact ML can find hidden rules that are just as valid for the test set, but aren't rules that we have thought of as valid?
Excellent point.
Yes. Remember when you were being tested and you had to guess what the questioner thinks is the right answer, even if it's not? Some of our tests to the ML algorithms follow that pattern, it has to guess not only the right answer but make sure it's what we think is right.
@@wwkk4964 LoL, well, if a student shows me that a question is formulated incorrectly, I give them credit, and then I fix the question. This question set has been out in the wild for HOW LONG now and still no one has fixed it?
@@marcfruchtman9473 negative numbers did not exist for European mathematicians until well over a millennium after Brahmagupta gave rules for computations with negative numbers and zero.
@@wwkk4964 re: Brahmagupta. really great work... if I had just a fraction of that capability...
I guess I'm missing something?
The first ARC example - the three pairs of six drawings, numbered 15, 43 and 84, with each of the 2 in any pair called either A - on the left, or B - on the right.
So given those definitions and the three pairs of six drawings, it seems like he gets part of the first one (15) wrong when he says the six drawings on the left are 'not connected together'; he then says (correctly) that the six on the right 'have a gap'.
For 43, he gives a different explanation than the verbiage below the drawings, and calls it monotonicity. He says on the right 'they're all going up or they're all going down' - I guess this could be considered correct depending on if you approach each drawing from either the left or right side, even though they're described as all increasing. But then he says the drawings on the left are 'going up and down at the same time' which is not monotonicity (I just looked it up :-), is it? I don't know this stuff. What am I getting wrong?
Thanks for an interesting video.
I'm not an AI researcher, I'm a philosophy scholar, but here are some thoughts:
About an hour into the video, it's pointed out that intelligence can't exist as a blank slate, and it's implied that humans are drawing from a very basic set of reasoning skills that can be combined and employed in various ways. We probably don't learn those skills from our limited life experience; humans (and other species) have probably evolved them over millions of years.
Evolution is just learning on a different scale. This suggests an ML approach to developing a more basic set of reasoning skills is probably the way to go. The challenge is figuring out what data to train on. Human created data is probably far too shallow.
Spitballing: Is it possible to write a program that generates pairs of packets of randomly generated data, and that same data which has been manipulated in random ways (i.e. ways we may not have thought of) in order to train an ML system on how to identify patterns?
On a similar note, predicate logic is used to relate symbols. It should be straightforward to build a symbolic predicate logic implication generator of varying degrees of complexity. That is, it starts with a symbolic output statement, and then applies randomly selected logical operations on it to produce a set of symbolic premises with the output statement as a conclusion. Would it be possible to use such a generator to train an ML system to do complex logical reasoning?
Your predicate logic idea exists, and was used to try to create proofs from scratch. It was questionably useful since it was the first attempt, but yeah it's a good idea. I'm not sure your pair of randomly manipulated packets idea would work how you think, since if you start from random noise and then manipulate it somehow, it will still look like random noise unless the manipulation is very specific and detectable.
29:00 would mamba 2 be effective with the symmetry transformation combination? it seemed to be a bridging concept between ssm, and transformers at least, if i understood the paper correctly.
Another paper on llm grokking, seems like an important step towards reasoning.
Grokking is training-time generalization (learnt via gradient descent). What these example-based ARC tests require isn't learnt generalizations over training samples, but rather runtime ability to form new generalizations over each ARC problem's example before/after pairs. Not only is there no gradient descent available at runtime, but the generalization task is a bit different since we know a generalization exists for each problem, so it's really a matter of search to find it (i.e. find a composition-of-transformations description for example #1 expressed in a generalized form that also works when applied to the other examples for the problem).
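To make the "search for a composition of transformations" idea concrete, here is a rough sketch under strong assumptions: the three primitives (`rot90`, `flip_h`, `transpose`), the toy example pairs, and the brute-force enumeration are all invented for illustration. Real ARC solvers use far richer primitive sets and smarter search.

```python
from itertools import product


def rot90(g):
    """Rotate the grid 90 degrees clockwise."""
    return tuple(zip(*g[::-1]))


def flip_h(g):
    """Mirror the grid left-to-right."""
    return tuple(row[::-1] for row in g)


def transpose(g):
    """Swap rows and columns."""
    return tuple(zip(*g))


# Hypothetical primitive set; grids are tuples of row tuples.
PRIMITIVES = {"rot90": rot90, "flip_h": flip_h, "transpose": transpose}


def apply_program(program, g):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        g = PRIMITIVES[name](g)
    return g


def search(examples, max_depth=3):
    """Return the shortest program that maps every input to its output, or None."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(apply_program(program, i) == o for i, o in examples):
                return program
    return None


# Toy task: the hidden rule is "flip upside down", which is not a primitive
# and therefore has to be found as a composition.
examples = [
    (((1, 2, 3), (4, 5, 6)), ((4, 5, 6), (1, 2, 3))),
    (((7, 0), (0, 8), (9, 9)), ((9, 9), (0, 8), (7, 0))),
]
program = search(examples)
test_input = ((1, 0), (2, 3))
print(program, apply_program(program, test_input))
```

The program is found from the example pairs alone at "runtime", with no gradient descent involved, and is then applied to an unseen input.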
The difficulty of this task doesn't come from long range interactions. The problems are very small.
What is the name of the host, and what is his professional experience?
www.linkedin.com/in/ecsquizor
Can someone explain the puzzle at 9:50?
It helps to have a programmer's mind ...
Call the top two quadrants of the 8x8 pattern as "1 & 2", and bottom two quadrants as "3 & 4".
Now, treating the black squares as transparent, copy quadrant 4 over 1, then 3 over 1, then 2 over 1.
The output 4x4 pattern is the resulting quadrant 1.
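The procedure above can be sketched in a few lines of Python. Assumptions: colors are encoded as integers with 0 standing for black (transparent), quadrants are numbered as in the comment (1 and 2 on top, 3 and 4 below), and the function names are made up. The layer order is the one stated here; other comments in this thread argue for a different order, which would only change the tuple in the loop.

```python
BLACK = 0  # assumed encoding: 0 means black, treated as transparent


def quadrant(grid, row, col, size):
    """Cut a size x size block out of `grid` starting at (row, col)."""
    return [r[col:col + size] for r in grid[row:row + size]]


def overlay(base, top):
    """Paint `top` over `base`; BLACK cells in `top` let `base` show through."""
    return [
        [b if t == BLACK else t for b, t in zip(base_row, top_row)]
        for base_row, top_row in zip(base, top)
    ]


def solve(grid):
    """Copy quadrant 4 over 1, then 3 over 1, then 2 over 1; return quadrant 1."""
    half = len(grid) // 2
    q1 = quadrant(grid, 0, 0, half)        # top-left
    q2 = quadrant(grid, 0, half, half)     # top-right
    q3 = quadrant(grid, half, 0, half)     # bottom-left
    q4 = quadrant(grid, half, half, half)  # bottom-right
    result = q1
    for layer in (q4, q3, q2):
        result = overlay(result, layer)
    return result


# Tiny 4x4 stand-in for the 8x8 puzzle grid (not the actual puzzle data).
toy = [
    [1, 1, 0, 2],
    [1, 1, 2, 0],
    [3, 0, 0, 0],
    [0, 0, 0, 4],
]
print(solve(toy))
```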
@@benbridgwater6479 Yes. Except the "answer" that they provide in the video is "incorrect", I don't understand why they have it wrong, but technically the 3rd row should be
RED RED White Pink, not Pink Red White Pink.
I think an autoregressive self-supervised next token prediction multimodal training system that uses BERT embeddings concatenated with abstract representations of images of Chinese characters would teach a model enough abstract spatial reasoning to pass the ARC challenge. The trick here is that these images have semantic meaning, and the Chinese language has incorporated spatial ideas with semantic purpose into characters.
It’s a fun idea but the odds of that method beating ARC are extremely low.
Around 1:56:15 we get to the nub of it... Traversing a manifold in the forward pass of an LLM, in activation space, is more than database retrieval.
Some really awesome insights from Jack at 1:34:17
Thanks for this interesting video. Just getting started (first few minutes into the video) but it appears to me that Bongard's test at 4:44, number 43, regarding sets of amplitude, has an error. By definition, amplitude is the displacement from the baseline (peak height or trough depth). Some of the samples in set A are not increasing in amplitude, and some in set B are not decreasing, so it appears to me that what they were actually looking for was the variation in amplitude, i.e. the answer should be Set A has "Increasing Variation of Amplitude" and Set B has "Decreasing Variation of Amplitude" (where variation would be the peak to trough of the wave vs the previous wave's peak to trough).
To be more specific, Not all examples in Set A are increasing in Amplitude, but all are increasing in "variation". And not all examples in Set B are decreasing in Amplitude, but they are all decreasing in variation.
In that respect, it is important to note that the "quality" of the test question could reflect poorly on some AI, that can't really "complain" about the answers not being present if they are forced to choose from a set of answers?
Additionally, at 10:00, the 1st 8x8 diagram converts to a 4x4 diagram. It is incorrectly converting. It looks to me like the Top Left Quadrant is layer 1, going counter-clockwise, Bottom Left, then Bottom Right, and Finally the top layer is Top Right.
By stacking the layers in order, bottom to top, and considering Black as absence of color -- meaning black would appear like "glass", then if you stack the upper left quadrant first, you get pretty much all yellow, then stack bottom left, you cover with mostly pink, leaving some yellow showing thru the black, then you stack the red(brn?) and that is where the "error" shows up.
The 3rd quadrant of Red should create 2 red dots at row 3 [R][R][P][P], then finally you add the last layer, which is why it "appears" to take precedent, because it is last. You get:
[Y][Y][W][B]
[P][P][P][R]
[R][R][W][P] (NOT [P][R][W][P])
[W][W][P][B]
In other words the "Answer" (at least as far as I can tell) accidentally has pink in the 3rd row, 1st column, and it should not.
Depends how formal you are about the definition of amplitude I guess? Decreasing / increasing (peak to peak) amplitude, maybe? I don't think that has much bearing on the test itself since the answer remains the same regardless of how nitpicky you want to get about the description you choose.
@@puneeification If a signal is increasing in amplitude, it gets greater from baseline. Therefore, question 43, incorrectly shows amplitudes both increasing and decreasing on the left and right sides. Even if you consider peak to peak, that is still an amplitude off of a baseline, therefore Question 43, right side, 2nd column, row 2, shows a rising amplitude from baseline. A question needs to be formulated properly in order for it to be considered a valid test of intelligence.
Hey Tim, could we ask Chollet about a broader measure of intelligence (physical, visual understanding of the world instead of abstract anthropocentric)? i.e - cats, dogs, gorillas, etc have immense intelligence about the world but they wouldn’t even rank on ARC.
They can tho. Take any animal they do understand concepts such as connectedness or inside vs outside
@@eva__4380 Indeed. The ARC challenge isn’t designed though in a way that would allow a non-anthropomorphic intelligence to illustrate this fact.
I am nonplussed by the ARC-challenge. LLMs fail on it simply because they are out of domain. These are visuospatial tests. Better pre-training and scale on a multimodal model is the path forward. All these algorithmic and symbolic addons are just hacks. Forget about search space. Just make a bigger model and let it do what it does best - build abstractions and spot patterns.
I also think Chollet is massively underestimating the amount of visuospatial training humans receive in their lifetime. Which is why I think it is fair to simply suggest multimodal scale as the ultimate and best solution to this challenge.
It is perfectly fine to challenge Chollet's take on this. Beware reflexive deference to authority.
Yes I kept saying this, The LLM visual recognition capability isn’t precise enough for this. It’s not that the LLM doesn’t know, it’s just that it can’t see it as clearly as people think.
Agreed! If you've ever really tested the best multimodal frontier models on images to their limits you'll know they have terrible vision ability. I'd guess it's less than 1% as capable as a human.
As Ryan pointed out in his blogpost if a human was read aloud problems from the ARC test while blindfolded and asked to speak the output row by row, that's what you're asking today's models to do in solving this. Of course they won't do well, and neither would 99.999% of humans
Sure, the easy way to solve these would be to train on a massive test set of similar problems, but that defeats the purpose of making progress towards AGI. You'd like an AGI to be able to solve problems that it is not familiar with and has not pre-trained on.
Entirely agree. LLMs evaluate and generate sequential input and output. Visual patterns require a more parallel interpretation which is entirely possible with neural nets. Very much overhyped. As soon as more visual multimodal models are trained it will be fairly straightforward to let the model guess patterns and similarities, test these in context during inference and conclude whether they work or not. And continue to search until there is a solution. Very much like humans. Nothing special about that. Visual data requires more data and compute but nothing that is unimaginable in the near term.
@@TheTEDfan yes although I do question whether even with high quality vision perceptual capability that will translate to visual reasoning. I suspect there are some things the human brain does as it's trained on very high res continuous vision. So we learn how things move in an image/video and predict what they will look like over the following few seconds continuously. It may require training on video directly as opposed to just images. Or maybe the multimodality will transfer easily, idk
also: i dont understand how tokenization is still a thing. i mean it may have been useful 15 years ago to speed things up. but nowadays the first two layers could do the tokenization. just take the ascii characters. you lose two layers of efficiency, but you would gain so much. suddenly the model can count letters in words, can count words, etc. without having to resort to external tool-calls. and you could do all of this 2d grid made of ascii characters stuff natively. ascii as direct input is an obvious choice.
The problem with that - if you start by embedding characters rather than sub-word tokens - is that the initial embeddings (before they start getting augmented by transformations) won't have any meaning. It makes the learning task much harder. Starting with tokens the model can learn a word2vec type embedding where some semantics is already captured.
@@benbridgwater6479 yeah i get that. my question is only: how many layers of a say multilayer perceptron does it take to emulate the tokenizer? take the word "tokenizer". lets say these are two tokens: "token" and "izer". it is clear that using these as input has many advantages. i just wonder if it really takes so much computing power of a neural net to learn this tokenization from the single characters? is it not after layer 2 or so already basically the same? it just has to combine the five characters into one thing. and then you have the embedding of that thing, that is equivalent to the "token" embedding.
@@benbridgwater6479 and isnt it true that your level 0 input actually shouldnt have any meaning or semantics? it gets its semantics on the way through the layers, but not at the input level. a pixel in a cat picture doesnt have a meaning either until it gets further down. i agree that it slows it down and makes it harder, but it should be possible and it should have other benefits.
I imagine rather than ASCII you mean one token for each possible byte, rather than just the 128 that encode an ASCII character? (So that it can still handle Unicode characters)
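A quick illustration of the byte-level idea, assuming UTF-8 as the encoding and treating each byte value (0 to 255) directly as a token id. The function names are hypothetical; this is a sketch of the concept, not how any particular model tokenizes.

```python
def byte_tokens(text: str) -> list[int]:
    """Encode text as UTF-8 and use each byte (0-255) as a token id."""
    return list(text.encode("utf-8"))


def from_byte_tokens(tokens: list[int]) -> str:
    """Inverse of byte_tokens: turn the byte ids back into text."""
    return bytes(tokens).decode("utf-8")


print(byte_tokens("cat"))  # [99, 97, 116]  (one byte per ASCII letter)
print(byte_tokens("é"))    # [195, 169]     (two bytes for one non-ASCII character)
print(from_byte_tokens(byte_tokens("naïve café")))
```

With a 256-entry vocabulary every Unicode string is representable, at the cost of non-ASCII characters taking more than one token.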
Isn't it funny how often we like the decisions we make, are comfortable with them, sometimes even excited for what comes next. You would think it would be less often, if what we consider to be reasoning, transpires this way.
Is it presenting words that align to the average of the totality of similar words, within the context? Is that a reasoned analysis? Does there need to be risk of consequences, can a meaningless decision, be a reasoned one?
I'm curious what prompt might trigger an LLM to "deploy" the kind of visual pattern seeking that an ARC challenge demands. I came to think about the so-called "gestalt laws" that seem to be hard-coded into our perception and thinking. From a certain vantage point these could be understood as the kind of "psychological priors" that might help when solving ARC problems. So, would a prompt like "Always use the gestalt laws when solving problems" help? It might put an LLM in a mode that works together with chain-of-thought and visual thinking.
"Regarding the prompt "Always use the gestalt laws when solving problems", this instruction could potentially enhance an LLM's ability to address the ARC problem in the following ways:
Principle of Similarity: Grouping similar elements together can help in identifying patterns and relationships within the data, facilitating better comprehension and reasoning.
Principle of Proximity: Clustering elements that are close to each other can aid in understanding the structure and context of information, which is crucial for solving complex problems.
Principle of Closure: Encouraging the model to perceive incomplete shapes as complete can improve its ability to fill in gaps and make more accurate predictions based on partial information.
Principle of Continuity: Recognizing continuous patterns can help the model follow logical sequences and enhance its reasoning capabilities.
By incorporating these gestalt principles, LLMs could potentially improve their performance on the ARC benchmark by enhancing their pattern recognition, problem-solving, and reasoning skills, leading to more human-like comprehension and decision-making abilities."
To rule out that the core problem for an LLM is that this problem is 2D. Has anybody tried to create a 1D version of ARC? Same type of problems just in a 1D grid?
Most LLMs convert this into 1D internally... they just take every pixel, and convert it into a very long 1-dimensional array then consume it in one shot, don't they?
Why don't we uncrystallize the models? Let them run in training mode and use a function to determine the backpropagation signal strength. This should make them able to slightly alter the weights when they are wrong. Give them neuroplasticity and see what comes of it
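For what it's worth, here is what a plain row-major flattening of a grid looks like. This is a sketch of the general idea only; how a given model actually serializes or patches its input is not something this thread establishes, and the helper names are made up.

```python
def flatten(grid):
    """Row-major flattening: rows are laid end to end in a single list."""
    return [cell for row in grid for cell in row]


def flat_index(row, col, width):
    """Position of cell (row, col) in the flattened sequence."""
    return row * width + col


grid = [
    [1, 2, 3],
    [4, 5, 6],
]
width = len(grid[0])
print(flatten(grid))  # [1, 2, 3, 4, 5, 6]

# Horizontal neighbours stay adjacent, but vertical neighbours end up
# `width` positions apart in the 1D sequence.
print(flat_index(1, 0, width) - flat_index(0, 0, width))  # 3
```

That gap between vertical neighbours is one reason a 1D version of ARC, as asked above, would be a useful control: it removes the need to recover column structure from a sequence.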
Do it yourself, write a paper on it!
Can you get Maurice Weiler on the podcast?
You throw the word “non computable” nonchalantly. Please elaborate.
I think we address this in some of the other shows we have on this, i.e. ua-cam.com/video/J0p_thJJnoo/v-deo.html and ua-cam.com/video/mEVnu-KZjq4/v-deo.html - but we will be sure to address it in the upcoming show with Chollet. The basic idea: just as some Bayesian quantities are computationally intractable because they require you to consider every possible value of a thing, i.e. an infinite number of things, Chollet's measure of intelligence also requires you to consider the space of all possible programs, which clearly isn't possible to do on a computer
Chollet's measure requires that you somehow get a perfect solution for a given problem, and that's not attainable in the general case in finite time due to computational irreducibility (halting problem etc.).
the way i understood it is that sequences that have an infinite number of patterns cannot be calculated/searched in finite time with a computer.
however, these sequences can be defined using finite descriptions, even if they cannot be computed.
Reasoning crystallised into LLM is different to reasoning crystallised into a human brain how exactly?
Flexibility? Human crystalized intelligence is not really very crystalized at all. Even our old memories are constantly being modified by more recent experiences
The main point is that the human also has fluid intelligence (combinatorial reasoning/problem solving) while the LLM doesn't.
Tim, I watched this episode twice and (in between) researched all the literature I could find about ARC-AGI. In the beginning you mention Core Knowledge as defined in cognitive science. But you do not elaborate on how that translates into inductive priors. Specifically, treating ARC tasks as object manipulation. Have I missed something?
In my mind we're simply looking for algorithms from which Gestalt naturally emerges.
What does it say about society and our traditional metrics of performance assuming the premise that LLMs are not "true intelligence" is valid.
Suppose the analogy "LLMs are just look up databases" is true...even though if so they are data bases that machines organized not humans but what does it mean if that sort of compression/memorization can perform tasks that society values as useful even if we want to debate if it is actually a sign of intelligence.
For some reason I am reminded at how chimps out perform humans in the aptly named "chimp test" meanwhile nobody is particularly impressed with the kind of profound insight that can offer to our beliefs about what relevant philosophical terms might actually imply.
Either way it seems obvious to me that we are on the verge of major disruption regardless if people want to debate the metrics of semantics.
We have still entered the age of the thinking machine in my view, because people are no longer discussing whether it is or is not possible for machines to reach human levels of performance in some arbitrary task...now the question is what is the best metric to use for comparison...
IMO this is what Turing meant to broach with his Turing test thought experiment.
Not as a true test to determine some measurable difference...but rather he anticipated a time when attitudes and perceptions would shift focus beyond the mystical beliefs about what it means to think and more towards the demarcation of how that plays out as an application and how humanity will respond in terms of what we value.
Watch some more of these videos. Pretty much everyone he has ever talked to agrees that LLMs are just a database lookup and can't do any reasoning. All the experts in the field agree on this. And it just makes logical sense. By what mechanism could they possibly do reasoning? There's nothing in the algorithm to do that. LLMs are intelligent in the way Wikipedia is. It contains a lot of information, but it isn't going to be able to reason on it. ARC is a perfect example. The prompt "Tell me the 5th letter in this sentence" is another. It can't even handle these simple things in large part because it doesn't understand anything. It just probabilistically selects the next likely word given the previous words. People really need to come to terms with this. That's literally all it's doing.
@xzvbcxSyntaxError Yeah exactly. You've answered it yourself. That's how it's encoding the information to be retrieved. It's still stored data that is getting retrieved and output. Can you tell me why some databases use graphs? Your argument makes no sense. Omg, this database is stored in ASCII, so it's encoded a certain way so it's not retrieving data. It's thinking! Talking about it this way shows how the argument you are presenting makes no sense. What difference does it make how people choose to store and retrieve stuff? It's effectively a database. It's closer to a database than a thinking person. That's the point. Not that people literally think you are retrieving exact stored content the way a normal database works.
9:10 it's not clockwise, magenta is on top of red, the order is white-magenta-red-yellow.
5:34 I still can't understand this one even after 3 replays. I guess I'm not generally intelligent!
After reading @mouduge's comment I now recognize it as increasing vs decreasing amplitude. Took me a while.
@@thorvaldspear No, that is incorrect... Amplitude is supposed to be referenced to the baseline. Therefore, technically, the answer provided by the page displayed in the video is incorrect. Set A has increasing variation of the amplitude (on the left) and decreasing variation of the amplitude on the right. The answer regarding amplitude alone is not correct, specifically because the right side, 2nd column, row 2, shows a rising amplitude from baseline. However, if you modify the question to be "variation" in amplitude, then it would make a lot more sense.
@@marcfruchtman9473 You're overthinking it.
@@thorvaldspear If I were to look at the amplitude of a wave on an oscilloscope, I would measure it by looking at the "peak" versus the baseline. The question clearly mixes up decreasing and increasing amplitudes on both sides of the image. So, whoever wrote the question was either thinking variation of amplitude, but never wrote down the word "variation", or simply misunderstood it. Either way, the question/answer is not valid.
9:56 *anti-clockwise
Why do we assume solving the ARC 2D challenge is coming close to human intelligence?
You could use Meta's new compiler as the language for concepts instead of ChatGPT and Python, likely much quicker, and Claude: use the smallest first, then the second, then the top one. Or Nvidia's. You can have a platform which runs on interpolation but not the abstract layer, like the Game of Life? Also, you can interpolate and find any value in two steps with the right structure. I wrote such a program two years ago; the concept is an extension of Newton's method. How you would apply it to this I don't know.
ARC is really addictive
The "winner" would be the person scoring 100%. The "leader" is the person with the current highest score.
Can a solution win ARC even if it can't do anything else?
Yes - the rules are on Kaggle, but nothing about it needing to do anything else. Obviously there's quite a perceptual component to the challenge, which perhaps favors hybrid approaches (neural perception + symbolic program synthesis/search), but no rules saying they have to be.
Can we learn to create symphonies from seeing frequency images of audio files? No. It is not a matter of intelligence but a matter of state representations. LLMs have no sense for the ARC states. Therefore these tasks are difficult for them.
But it is no problem to train a NN on such tasks, and it would be able to solve them because it would learn to understand the states the way LLMs came to understand linguistic states.
recommend more please help
no 😈
Considering the Aristotelian Blank Slate idea and LLMs in general, we would perhaps benefit from philosophers to do some work for us in trying to well-define what is going on here. Fortunately I have been doing just that.
Aristotle went out of fashion during the Rationalism of the Enlightenment era, from which the digital computing paradigm emerged. At the same time we get the idea that bottom-up causality explains everything and that language has a pure logical form.
Then comes the 20th century and states that digital arithmetic systems can not compute quantum information efficiently (Schrödinger Equation), bottom-up causality doesn't cut it (Gödel) and language is evolutionary and logic has no content (Wittgenstein).
In practice this means that natural language has Zipf's Law (fractal dimensionality of word frequency), where most frequent words are syntactic, in other words, they give logical support to the content of communication.
Guess what? When we measure sequential windows of Zipf's Law, the grammatical top frequency stabilizes around a thousand tokens, which probably explains the quality difference between GPT-2 and GPT-3, why other technologies see a similar quality increase when they reach the same limit, and why Transformers have not had qualitative leaps from further scaling.
Logical structures are Blank Slates in a sense. These are the "genetical brain organs" in humans (Chomsky) or the architecture of neural network training apparatus / feature engineering. What happens to the content after learning is the interesting thing that ARC-like tests should go for.
According to Eero Tarasti and his Existential Semiotics, with humans we encode phenomenal patterns together with our noumenological existential volatility. In other words emotional frustration allows us to easily switch context when our initial problem solving goes wrong.
Neural Networks do not have that. They are purely phenomenologically limited to the target function. In Quantum Machine Learning we also get the Phase Shift parameter which could be "emotionalized", but doing it in human-like manner would be nearly impossible. But still it would give something more controllable than with Transformers, where you have to guess the correct magic word for invocation of viable secondary contexts.
In other words, human brains have evolved for our own environment for a very long time, which gives us super powers because our "no-content" apparatus is always inside the distribution. I call this Cartesian digital bottom-up computational system Hobbes's Golem (Descartes said computers are impossible, Hobbes said "challenge accepted"); Hobbes's Golem is not a product of natural evolution, but is always built with engineering principles. We both might start as Blank Slates, but the way our Logic Organs work and the way our content gets encoded are fundamentally different.
Trying to prove there is no difference is silly. The interesting question is, how does that matter? At which point should we take information pollution of digital environments seriously and try to build more "brain friendly" user experiences? Are LLMs part of that or against?
I think LLMs might be good for "more democratic access", but problematic as "unverifiable content consumption"; should we just start paying journalists and educators proper salary so they could do their jobs rather than trying to synthesize away the human component in information refinement?
Would be interesting to see a show around these domains. QNLP, Aristotle, Wittgenstein, Complexity Sciences, 4E cognition (Post-Cognitivism and their reaction to Connectionism).
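A rough sketch of what "measuring sequential windows of Zipf's Law" could look like in practice, for anyone who wants to try it on their own corpus. Everything here is an assumption for illustration: the function names are made up, tokens are plain whitespace-separated words, and "stability" is taken to mean the average Jaccard overlap between the top-k most frequent words of consecutive windows. It does not confirm or refute the thousand-token figure above; it only shows one way to measure it.

```python
from collections import Counter
from typing import List, Tuple


def rank_frequency(tokens: List[str]) -> List[Tuple[str, int]]:
    """Words sorted by descending count: the raw data behind a Zipf plot."""
    return Counter(tokens).most_common()


def top_k_stability(tokens: List[str], window: int, k: int) -> float:
    """Average Jaccard overlap of the top-k words in consecutive non-overlapping windows."""
    tops = []
    for start in range(0, len(tokens) - window + 1, window):
        chunk = tokens[start:start + window]
        tops.append({word for word, _ in Counter(chunk).most_common(k)})
    if len(tops) < 2:
        raise ValueError("need at least two full windows")
    overlaps = [len(a & b) / len(a | b) for a, b in zip(tops, tops[1:])]
    return sum(overlaps) / len(overlaps)


# Toy corpus (an assumption, not real data): one 13-word sentence repeated 50 times.
toy = ("the cat sat on the mat and the dog sat on the rug " * 50).split()

print(rank_frequency(toy)[:3])
print(top_k_stability(toy, window=13, k=3))
```

On a real corpus one would sweep `window` over, say, 100 to 5000 tokens and look for the point where the overlap stops rising.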
I don’t see how you get “bottom-up causality not working” from Gödel’s incompleteness theorems?
You’re ruining the dataset by explaining your reasoning which the LLMs are now gonna be training on.
That's fine, this is the public dataset. The method described has not yet been tested on the private dataset (where the method described wouldn't be allowed)
We need a kind of AI baby that learns by interacting with the world and living alongside humans. An evolutionary algorithm needs to reward AI that uses less data for higher inference. How can that be done? Could something like that be run using a basic simulation of some problems?
The reason there hasn’t yet been a successful attempt at this, is not for lack of people thinking of it.
I don't really see why people assume a NN helps the LLM beyond being a probabilistic distributed hash. Why do you/they think the backprop ever mattered beyond probabilities being normalized? A reasonable common-sense assumption is that the LLM is merely recursively best-fitting crystallized tokens. I predict in a couple of years we will be laughing at how much extra work could have been avoided with a more efficient probabilistic hashing system.
This is true, but the generalisation power of LLMs is (in my opinion) that they implicitly permute many different symmetry-based variations of the training data, so there is a surprising amount to find if you guide the search process. Of course, these permutations will just be "in the neighbourhood" of the data it was trained on, so the space of creativity is still grounded by the source data, the inductive prior used and the search query (the prompt).
once a machine realizes "I think therefore I am" we've achieved AGI
Define "realize".
@@lenyabloko let me google that for you: verb. 1) become fully aware of (something) as a fact; understand clearly.
"he realized his mistake at once"
I'm feeling dumb : I must be a LLM
Counting on chatgpt to summarize this. People don't like the "bitter lesson" about AI. Actual chartered Psychologist here.
The "bitter lesson" was wrong though, because all our best models work precisely because they have a tonne of hand-designed engineering and priors. Believing in scale was a cute idea a couple of years ago
@@MachineLearningStreetTalk Thinking it's wrong is a cute idea nowadays. Maybe 70 years and a stage of the field where one reference manual with 1000+ pages called "Modern AI" (Russell / Norvig) dedicated a (very) few pages to ANNs, in a sub-section of a chapter, and Hinton having papers rejected because one about NNs was already enough for a top AI conference wasn't enough. People need to turn to philosophy and tie knots in their minds. You really think that was the way Connectionist ideas were introduced in Psychology? Or what originated the Transformer architecture? Or even what showed the Transformer could go from GPT-2 level performance to GPT-3 and beyond? This is the intuition: your brain has neurons. Neurons do a certain type of information processing. You have a lot of them, and evolution made human neurons organize in a particular way that gives humans cognitive capabilities far beyond what other animals have. In neuroscience, many folks are convinced the changes in the human brain were also mainly a matter of scaling (number of neurons and connections in certain areas: no need for "hand-designed engineering" there either: cute, hm?). But no doubt you are much smarter than I am (no irony or cynicism here: just something I believe). I just think you have too much of a "philosophical" tendency, and you have a profound need to complicate stuff. There was a time I regularly followed your show: it was fresh, and ideal to watch over my kid as he played in the street. But nowadays, if this is your idea of street talk... Kind regards, and thank you for answering my comment. I feel honored.
@@MachineLearningStreetTalk Not believing in scale is a "cute" idea nowadays. Believing in it was a bold stance that few took-kudos to Frank Rosenblatt and Geoffrey Hinton. Undoubtedly, their inspiration came from the brain and their background in psychology. The human brain isn't that different from other animals near our evolutionary line. Recently, Demis Hassabis, who also has a background in neuroscience, joined Hinton in criticizing Chomskian ideas about language. There's no time for discussions now, but it's unfortunate that my more curated answer done on the tablet was deleted. Kind regards. As someone who isn't a native English speaker, I thought, "Why not ask ChatGPT to improve my grammar?" After all, if AI can mimic the brain, it can surely handle my sentences!
We have interviewed connectionists, try the Nick Chater interview. NNs are nothing like the brain whatsoever, you might enjoy the Max Bennett interview when we release that. Unfortunately due to the complexity of the topic I can't address your other points here but suffice to say we have addressed them many times on previous episodes.
@@MachineLearningStreetTalk I appreciate the time you took to answer me. Thank you very much. Keep the good work.
Surely the issue for an LLM is that it's just a 1D token predictor (strings) rather than a 2D token predictor (images) required here? I presume someone has tried to turn an LLM into a 2D next token (pixel) predictor?
I do not see this as reasoning, this approach is generate x amount of solutions and check what sticks on the wall.
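To make "generate x solutions and check what sticks" concrete, here is a toy generate-and-test loop. It is a sketch under heavy assumptions, not the method discussed in the video: the candidate "programs" are a fixed hand-written list of four grid transformations instead of LLM-sampled Python, and the example pairs are invented. The structure is the same, though: propose candidates, keep only those that reproduce every training pair.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]


def identity(g: Grid) -> Grid:
    return [row[:] for row in g]


def flip_h(g: Grid) -> Grid:
    """Mirror each row left to right."""
    return [row[::-1] for row in g]


def flip_v(g: Grid) -> Grid:
    """Reverse the order of the rows."""
    return [row[:] for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


# Hypothetical candidate pool; a real system would sample many more programs.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
}


def consistent_programs(examples: List[Tuple[Grid, Grid]]) -> List[str]:
    """Keep the candidates whose output matches every (input, output) example."""
    return [
        name
        for name, fn in CANDIDATES.items()
        if all(fn(inp) == out for inp, out in examples)
    ]


# Invented training pairs: both are left-right mirrors.
examples = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[1, 2, 3]], [[3, 2, 1]]),
]

survivors = consistent_programs(examples)
print(survivors)
```

Whether filtering candidates like this counts as reasoning is exactly the disagreement in this thread; the sketch only pins down what is being argued about.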
Cool!
When AI solves the ARC challenge we will conclude it wasn't a measure of intelligence. Same story for the last 50 years.
If it turns out to be possible to cheat/shortcut it, absolutely. Dileep George speaks about the "perceptual leakage" of ARC here substack.com/home/post/p-145553885 - it's only a good benchmark if it can't be gamed, and I agree it probably will be eventually
how is an ai model like an llm supposed to know about spatial relationships. here:
ABC
DEF
GHI
for an llm this is just a text ABC DEF GHI. how is it supposed to know that B is "above" E. or that if you go diagonally from A towards E, you end up at I? or that A is two to the left of C? it has never learned about that. it has no collection of features and representations and vector embeddings that say "to the left of" or "above" or "two steps further" or "diagonal" or "straight" or "inside" or "outside" (E is "inside" of B D F H), etc. at least not a text model. an image model or a video model maybe. or a multimodal model. but the llm would have to have access to these features. and it would need a "grid mode" and run in that whenever a problem with a 2d grid comes up.
or take for example connect 4. draw a 6 by 7 grid. the discs fall "down". how is the llm supposed to know, what down is? it doesnt know, that my monitor is standing upright on my table and for me it looks like down, gravity is switched on, etc. these supposed super-intelligent models cant even count the correct number of discs on a board!
training models like these would go like this: you come up with thousands of example grids and let humans label them with what they see there. for example: 5 by 5 grid with three blobs. red blob is to the left of the blue blob. the red blob consists of 3 pixels and the blue one of 5. and so on. but you can already see that this is not as easy as labelling cat and dog pictures! there is much more detail to be described. it is much more abstract. the grid is already an abstraction, for us humans, from the real world we live in, which is much more smooth! that's the problem.
From the newline character, and lots of examples in the training set?
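A small sketch of the newline point, using the ABC/DEF/GHI grid from the comment above. The helper names are made up for illustration. It shows that row and column are fully recoverable from newline-separated text, and that "directly above" in 2D corresponds to a constant offset (row width + 1) in the 1D character sequence. Whether a given model actually learns that regularity is an empirical question this code does not answer.

```python
from typing import Dict, Tuple


def positions(text: str) -> Dict[str, Tuple[int, int]]:
    """Map each cell character to (row, col), using newlines as row separators."""
    pos = {}
    for r, line in enumerate(text.split("\n")):
        for c, ch in enumerate(line):
            pos[ch] = (r, c)
    return pos


grid_text = "ABC\nDEF\nGHI"
pos = positions(grid_text)


def is_above(a: str, b: str) -> bool:
    """True if a is in the same column as b and in an earlier row."""
    return pos[a][1] == pos[b][1] and pos[a][0] < pos[b][0]


def columns_left_of(a: str, b: str) -> int:
    """How many columns a sits to the left of b (negative if it is to the right)."""
    return pos[b][1] - pos[a][1]


def on_same_down_right_diagonal(a: str, b: str) -> bool:
    """True if moving from a to b changes row and column by the same amount."""
    return pos[b][0] - pos[a][0] == pos[b][1] - pos[a][1]


# In the flat string, one step "down" is always width + 1 characters ahead.
width = len(grid_text.split("\n")[0])
offset_b_to_e = grid_text.index("E") - grid_text.index("B")

print(is_above("B", "E"), columns_left_of("A", "C"), offset_b_to_e)
```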
This kind of approach, trying to understand "how human intelligence solves problems" and do the same, seems too simplistic. It reminds me of "symbolic AI": think about how we solve problems and try to figure out how to make a machine do the same!! I think "connectionism" still seems much more promising! Give the data, let the system "interpolate" (call it what you will), no matter how, and expect some magic to emerge! The secret of the transformer model is the ability to make math connections (induction) and let the linguistics make another layer of human-experience connections (deduction) on top. And this seems to break down the idea that AI is only about hardware computations, just because the second layer (linguistic) is not math, even though it is supported by math! Every time we try to explain what intelligence must do, we are on the wrong path! We still don't know what intelligence is! Because we don't know, we cannot think "intelligence must do this" or "must be like that"! That's my guess!
💖
I doubt some Bushman can pass ARC-AGI test. Does it mean he is not intelligent? I think Francois Chollet really underestimates our experience when he talks about our ability to resolve a "new" task.
I'm sure cave men would be able to pass it,
If not they wouldn't have survived.
@@divineigbinoba4506
The average IQ of bushmen (someone like cave men) is estimated by Richard Lynn at 55. And they have survived.
Of course they won't pass all the test but they'll surpass current LLM...
The cavemen's core knowledge would be less than ours.
If you looked at some of the ARC test, it's mostly pattern recognition and nothing special.
Pretty much this. Many of these seem to have ZERO reason to not be solved by a bigger LLM.
glad this is being reported on. LLMs are overrated.
LLMs are far more generally intelligent than any other model (say image, audio) since the training data on language has more knowledge.
If you think about image/audio generation, they are essentially specialized with less intelligence due to the large amount of data but minimal knowledge stored in that medium.
I think what's better than an LLM would be a logic machine that prioritizes truth over plausibility.
@@ckq We learned a long time ago that symbolic AI just isn’t going to work, so the sort of logic machine you’re describing is unlikely to ever work. Unless you mean something else by it.
i would call it a logic machine that has *some* priority, and for us to continue exploring what the best way to think about those are,
Doesn't really seem overrated now that the LLMs have managed to beat this and become a new SOTA, does it?
This gets tiresome. LLMs are intelligent at certain things and awful at others, just like ALL humans.
Automatic Reference Counting
So it's Picbreeder???
I wish I was smart enough to have any idea what they're saying
are all AI researchers this depressed???
indeed, yes. They better enjoy life instead of creating AGI, it is such a depressive thing.
I think models are bad at this kind of test mostly because they are not trained for that. AI researchers were mostly focused on text and even vision models are mostly optimized for object recognition, not recognizing patterns and spatial intelligence. So most of the ARC test difficulty comes from picking the field that is mostly ignored by AI scientists and model progress there is lagging behind. And of course for humans this kind of problem is really easy cause recognizing predators is a high priority task for survival.
We keep comparing human inference to AI models training.
Discord is the best
Tim posits that chollets argument is genius, next the interviewees disagree, and then tim shits all over the interviewees :)
This whole thing makes me pretty angry. I guess this type of gaming the system is always going to happen when money is involved. They tried to game the system to win in pretty bad faith in my opinion. The ARC challenge clearly states that the purpose is to find new ideas, which means if you try to use an existing thing like an LLM you should be disqualified (even if you contorted yourself horribly with crazy prompt engineering (which it sounds like they did), then you still are going 100% against the spirit of the competition). Secondly, training it on extra examples to try to memorize is strictly against the intent of learning from a few examples. Yes, you can do that, but that is not the point of ARC. Also, should be disqualified. I can't tell whether they know what they are doing and are just trying to protect themselves, or if they really believe their own BS. Basically, if you can't do it the way it was intended to be done, then you are doing it wrong. You need to come up with a new idea. Aka, the literal point of the competition. They didn't put down so much money so you could just use an LLM. Duh, of course people can try to do that. No one cares. They are trying to find new ideas.
Someone already got above 50% 😂
Did someone actually get above 50%?
yeh not adhering to the rules of the competition though ... and not validated on the actual test set
18:00
Yes, it will be 100% soon. Then we will know that the ARC test was nothing like what it would take to get to what these people think AGI is supposed to do.
@@wwkk4964 what? Even if a system gets 100% on ARC that's not = to AGI.
Just need an academics brain in a jar, linked up to a Raspberry Pi. Full AGI solved. Only joking! Amazing talk.
A General Intelligence algorithm is not that intractable after all...
If there are tons of humans trying to solve this ARC challenge and we still can't find the solution, at least we can prove humans aren't so intelligent after all.
the weirdest thing is how many people call the color brown "red" 🤣
Man, I went from watching a yudkowsky lecture to this... spooky.
But... A HUMAN is still having to GUIDE this "training" and the decisions on the "type" of training: fine-tuning, running Python programs... the AI still isn't "Thinking" or "Reasoning" for itself; it's still wholly dependent on the Human to help it solve the ARC challenge... That's NOT AGI. - youtube.com/@bioneuralai
Uh … 😅
gosh i wish the guests didnt speak like they are sleep-deprived for days. I am sure they are all brilliant people but to listen to them is another story urgh...
one of the stupidest thing ever. give computer a test designed by humans then say humans do better at it. wtf
That's what people used to say about playing chess.
Who else do you suggest design the test?
another CAPTCHA ugh.
F I R S T
I think the resurgence of the ARC challenge is one of the most interesting things to have happened this year in AI. Just the level of nuance and debate it has forced into the conversation can only be good for the community. Whether it is beaten or not, we’ll all be wiser for having gone through this exercise. Chollet really has devised an incredibly ingenious challenge.
never touched machine learning, don't know what a tensor even is (just seen it as a class in some machine learning code on twitter), but 35 mins in the video and I dont feel lost. you bet im subbing.