00:00-Introduction
01:00-Part I
03:06-Traditional approach to science
04:16-Era of AI (new approach)
05:46-Data to Neural Net
13:44-Neural Net to Theory
15:45-Symbolic Regression
21:45-Rediscovering Newton's Law of gravity
23:40-Part II
25:23-Rise of foundation model paradigm
27:28-Why does this help?
31:06-Polymathic AI
37:52-Simplicity
42:09-Takeaways
42:42-Questions
For anyone who does not work with ML, the takeaway of symbolic regression as a means of model simplification may seem quite powerful at first, but often our rationale for using neural nets is precisely the difficulty of deriving explainable analytical expressions for phenomena. People like Stephen Wolfram suggest that perhaps this assumption that complex phenomena can be modeled analytically is precisely why we are having problems advancing. To seasoned ML researchers, the title of the video sounds like the speaker will be explaining techniques to analyze neural net weights instead of talking about this.
It seems like a very powerful idea: the AI observes the system, learns to predict its behaviour, and then the rules behind these predictions are used to derive a math statement. Wish the authors the best of luck.
It is precisely what I'm working on for some time now, very well explained in this presentation, nice work! (the idea of pySR is outrageously elegant, I absolutely love it!)
John Koza had Genetic Programming which is basically the same thing in the 90s. He made documentaries, talking about reusing learnt functions and everything, very interesting. Didn't really take off though, it just suffers from being slow like most evolutionary methods (unless you parallelise massively like OpenAI Evolution strategies) and can't learn more complex tasks that deep learning can. In another timeline it could've got more attention and maybe become better than neural nets
@@gumbo64 maybe its application will be better suited for some other situations or environments or scales in the future if NNs hit some type of thing they cannot overcome quickly enough.
@@Fx_- We're currently hardcapped on current AI models with hardware but I am building a full stack system that takes advantage of currently existing hardware with implementing a daughter board to speed up the analogous computational requirements for large scale implementation. You'd be surprised at how little you need extremely large supercomputers when you scale more efficiently. Well and also leverage quantum computers for their relation with randomness.
@@DreadedEgg the implication here is he was less than two years of age when he signed up for a UA-cam account… I suppose anything is possible these days
Well then, allow me. EUUUUUAAAHHHHH EUAHHHHH AAAAA SKYNET GRAY GOO!!! Omg I DON'T UNDERSTAND MATH HOW CAN YOU DO IT BY YOURSELF? Ancient aliens!!!! David Ike, D.u.m.b.s, Robert Bigelow taco bell space station!!! REEEE SCREEEEEEE. You're welcome. Also I looove math and science and astronomy. Happy learning!
I think it is mostly the fact that, as he said, cats don't teleport or disappear, so you have some sense of structure and continuity that aligns with the PDEs you want to solve.
@@lbgstzockt8493 You're saying the same thing. "Structure and continuity" come from this measurement of the real world (it's a video of a real cat, experiencing real physics).
The folding analogy looks a lot like convolution. Also, the piecewise continuous construction of functions is used extensively in waveform composition in circuit analysis applications, though the notation is different, using multiplication by the unit step function u(t).
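A minimal sketch of the unit-step construction mentioned above, assuming the convention u(0) = 1. The function names (`u`, `relu`, `triangle`) are made up for illustration and do not come from the talk. It shows that a ramp switched on by u(t) is the same thing as a ReLU, and that three shifted ramps add up to a triangular pulse:

```python
def u(t):
    """Unit step: 0 before t = 0, 1 from t = 0 onward (assumed convention)."""
    return 1.0 if t >= 0 else 0.0


def relu(t):
    """A ramp switched on by the unit step, i.e. t * u(t)."""
    return t * u(t)


def triangle(t):
    """Triangular pulse on [0, 2] with peak 1 at t = 1.

    Circuit-analysis notation: t*u(t) - 2(t-1)*u(t-1) + (t-2)*u(t-2)
    """
    return relu(t) - 2 * relu(t - 1) + relu(t - 2)


for t in (-1, 0.5, 1, 1.5, 3):
    print(t, triangle(t))
```

Each shifted ramp changes the slope at its own breakpoint, which is also how a one-hidden-layer ReLU network builds a piecewise-linear function.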
Thought the same thing. Can do the Evaluation as a convolution of the two activation functions. Nevertheless, i guess the representation is somewhat more intuitive this way, as the middle part can be extracted as well if needed.
Thought the same! (This vid appeared in my recs after watching the 3B1B convolutions video!) On what he's actually describing with the folding (11:10), I think it's actually pretty easy to miss, since he assumes you kind of anticipate or half-understand what he's about to say, so he goes over it pretty quickly. So for anyone who is coming to this completely naive or who might have missed it the first time, like I did... The chart (d) essentially traces out chart (c) while (b) is increasing, then traces it in reverse while (b) is decreasing, and then traces it forwards again as (b) increases again. Some people might get slightly mad at me for pointing out the obvious. Well, it IS simple, and it's easy enough to intuit why it would happen once you see it, BUT it is only obvious once you see it, and it's easy to miss in real time (at least I think!)
Being able to derive gravity laws from raw data is a cool example. How sensitive is this process to bad data? For example, non-unique samples, imprecise measurements, missing data (poor choice of sample space), irrelevant data, biased data, etc. I would expect any attempt to derive new theories from raw data to have this sort of problem in spades.
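The folding described in this thread can be sketched as a plain function composition. The shapes below are assumptions chosen for illustration, not the curves from the slide: `b` is a hypothetical zigzag first layer, `c` is an arbitrary second-layer curve, and `d` is their composition.

```python
def b(x):
    """Hypothetical first layer on [0, 3]: rises, falls, then rises again."""
    if x <= 1:
        return x
    if x <= 2:
        return 2 - x
    return x - 2


def c(y):
    """Hypothetical second layer: any curve on [0, 1] works here."""
    return y * y


def d(x):
    """Composition: d retraces c forwards, backwards, then forwards again."""
    return c(b(x))


# Three inputs that b maps to the same value all give the same d.
print(d(0.25), d(1.75), d(2.25))
```

On the falling segment of `b`, the curve `c` is traversed in reverse, so `d` there is a mirror image of `d` on the first rising segment. This relies on `b` being a one-dimensional zigzag, which is the special case the talk's example uses.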
I am once again re-reading the book by David Foster Wallace, History of Infinity. There he describes the book by Bacon, Novum Organum. In book one there is an apt statement that I would like to paste: "8. Even the effects already discovered are due to chance and experiment, rather than to the sciences. For our present sciences are nothing more than peculiar arrangements of matters already discovered, and not methods for discovery, or plans for new operations."
The "folding analogy" is incorrect. That is not how composition works. It works only in this case because of the very specific nature of the "first layer"(in his example).
There are multiple different awesome ideas in this presentation. For example, the idea of having a neural net discover new physics, or simply of it being a better scientist than a human scientist. Such neural nets are on the verge of discovery or maybe in use right now. But I think the symbolic distillation in the multidimensional space is the most intriguing to me, and a subject that has been worked on for as long as neural networks have been around. A genetic algorithm, but also maybe another (maybe bigger?) neural network, is needed for such a symbolic distillation. In a way, yes, the distillation is needed to speed up the inference process, but I can also imagine that the future AI (past the singularity) will not be using symbolic distillation. It will simply create a better single model of reality in its network, and such a model will be enough to understand the reality around it and to make (future) predictions of that reality's behavior.
none of that will ever happen lol. Neural nets cannot reason. Theory is important. Science & physics alone aren't just data & statistics. Those two are actually pretty new to science.
I was wondering about, or missing, the concept of Meta-Learning with transformers, especially because most of the physics simulations shown are quite low-dimensional. Put a ton of physics equations into a unifying language format, treat each problem as a gradient step of a transformer, and predict on new problems. In this way, your transformer has learned on other physics problems, and maybe infers the equation/solution to your problem right away. The difference to pre-training is that these tasks or problems are shown one at a time, unlike the entire distribution without specification. There has been work on this for causal graphs and for low-dimensional image data of MNIST, where the token size is the limiting factor of this approach, I believe.
Quote (16:40): State of the art for symbolic regression... 25 days later a paper was released where so-called KANs were used to do symbolic regression, and I am pretty sure that this will be the state of the art. I know it was used only on small datasets and has some other flaws, but this is not worth talking about since we will make it work. They also reference Miles Cranmer.
I can't shake the feeling that someone is going to train an AI model on a range of differently scaled phenomena (quantum mechanics, atomic physics, fluid dynamics, macro gravity / chemical / physical dynamics) and accidentally find an aligned theory of everything, and they'll only end up finding it because they noticed some weird behavior in the network while looking for something else. Truly, "the greatest discoveries are typically denoted not by 'Eureka' but by 'Hm, that's funny...' "
The problem is thinking about these things as if the universe is distinguishing between scales. Any true "theory of everything" will by definition be scale invariant and the structures we see at different scales will be a natural result of the fundamental phenomenon at that level. We don't discuss that human beings very rarely exist entirely independently. If there is a human being in a place, there is an assumption that they had parents, were raised to maturity/independence, and that must have occurred in a finite time period. These are such basic assumptions that no one would believe someone who claims they came into being fully formed and were an independent creation by a God or randomness. We cannot know what the original person or primordial ooze came to be simply by looking at our current local environment.
Just like the guy who finds a severe vulnerability in linux ecosystems, accidentally by just benchmarking a database. And shits, that happened recently lol
Great presentation! My main takeaway is that we need a more unified approach to neural network models. Interoperability is important and can substitute for or even supersede the quality increase of pre-training.
Ha, I have surrendered to it just to get it off too, but unless you hit not interested, it will come back even if you watched it. But I don't want to say not interested and have UA-cam think that I don't like AI, because I do. (btw) this video was damn interesting, thumbs up from me
33:16 Mark my words. No foundational-level model will achieve the 5 digits of accuracy that finite differences do for PDEs, a method popularized three hundred years ago by Euler. Using the model alone (without the help of non-blackbox outer algorithms or a second-order optimizer), no matter whether you have 1000 billion params or whatever, never. 1000 years from now our AI overlords will still use finite differences (maybe the BDF table will be learned by a blackbox).
As an artist using image generation models, it's become obvious that foundational models trained on very wide content perform much better in general. It's similar to an artist drawing nudes and studying skeletons in order to draw fantasy characters better. It's also been shown that newer foundational models that have their dataset neutered do not perform as well, even though they might be higher resolution, or generate more detail. This is why I think it could be argued that training is transformative and falls under fair use. Unfortunately the marketing has been centered around making images that looked like other people's work (copying) which is a mistake. This has attracted people to file lawsuits against AI companies. This could be mitigated if AI companies worked closer with actual artists in order to better understand the creative process and how that relates to presenting the technology as a tool for artists, similar to how this presentation is illustrating how to use these tools for scientists.
Solving problems is the essence of the Hegelian dialectic. Problem, reaction, solution -- The Hegelian dialectic! Neural networks create solutions to input vectors or problems, your mind is therefore a reaction to the external world of problems! Thesis (action) is dual to anti-thesis (reaction) creates the converging or syntropic thesis, synthesis -- the time independent Hegelian dialectic. Concepts are dual to percepts -- the mind duality of Immanuel Kant. Vectors (contravariant) are dual to co-vectors (covariant) -- Riemann geometry is dual. Converting measurements or perceptions (vectors) into ideas or conceptions is a syntropic process -- teleological. Your mind is building a "reaction space" from the input or "problem (vector) space" to create a "solution space" and this process is called problem solving or thinking (concepts) -- Hegel. Targets, goals, or objectives are inherently teleological and problem solving is a syntropic process -- duality! "Always two there are" -- Yoda. Syntropy is dual to increasing entropy -- the 4th law of thermodynamics!
Well, not sure this will go anywhere except maybe modify some of our archaic equations for nonlinear terms. The problem is probably related to NP hardness and using more expansive nonlinearity methods to crack certain problems that are more specified. We will always not know what we don't know. Using more general nonlinear models was bound to greatly improve our simulations. The real question for NNs: is this the MOST ACCURATE or most INSIGHTFUL and BEST of nonlinear methods to do so? Somehow I doubt this, but it's certainly a nice proof of principle and a place to venture off further. To put all our faith in it might be a mistake though. We might be looking at the limits to reductionism long predicted by mathematicians, and our first method to not overfit billions of parameters will give us an illusion that this is the only way, and we could be looking at a modern version of epicycles. If we want to really go further, we need to use such models not just to get better at copying reality, but to find general rules that allow its consistent creation and persistence through time. Perhaps one way to do this would be to consider physical type symmetries on weights.
Hmm, do you think resonance and harmonics might fit in here? I imagine that patterns of connections within NN/neural networks that are self-stabilizing in some way would tend to persist throughout iterations (a kind of memory). Physics gives us resonance and harmonics that describe periodic behavior in everything from atoms to predator-prey relationships to solar systems. The Fourier transform essentially gives us a logic chain to describe any signal, but as some combination of periodic frequencies instead of linear lengths. It is a concept that arises again and again. Both quantum and relativistic perspectives of spacetime are highly influenced by periodic or near-periodic behavior. Maybe this is fundamental to NNs as well, and the cat videos taught the AI how to recognize low-dimensional periodic relationships in data, which could explain why it helped as a preset for totally unrelated data. I'm not exactly sure if that was at all similar to what you were suggesting, but it seemed related in my mind. Half-baked thought sources: www.quantamagazine.org/how-the-physics-of-resonance-shapes-reality-20220126/ www.sciencedirect.com/science/article/abs/pii/S0893608012002584 (machine learning with adaptive resonance)
Of course we only know what we know. Won't modeling the known better, lead to discovering what sticks out abnormally? This will probably lead us to newer discoveries, quicker.
yooo, i'm so glad i came across this! i've been thinking about how neural networks can teach us about our own thinking and pattern finding; i'm glad there is discussion about it
okay it's not about what i initially thought, but whoa. this polymath approach sounds excellent. i feel it's similar to how people who study many different fields can be quicker to grasp a novel problem
I was pretty surprised to see this not actually propose much of anything other than using tools to analyse patterns, the same tools that have been in use for decades. Is this a venture pitch? Throwing more processing at it helps, but doesn't "solve" anything on its own.
There's a paper on Feature Imitating Networks that's gotten a few good applications in medical classification, and subtask induction is a similar line of thought. FINs are usually used to produce low dimensional outputs, but I was thinking about using them for generative surrogate modeling. FINs can help answer the question of how to use neural networks to discover new physics. An idealized approach would turn every step of a coded simulator into something differentiable. It occurs to me that the approach of this talk, and interpretability research generally, is essentially the inverse problem of trying to get neural networks to mimic arbitrary potentially nondifferentiable data workflows.
30 mins to say that you can fit simpler models to a neural network data-generating process, and another 30 to say that more training data (even if relegated to what we call “pretraining”) improves performance. ps: things are simple because they are ubiquitous and they are ubiquitous because it’s how the world works (law of conservation of mass and energy, i.e. addition), not because it’s “useful”
one of the best suggestions of the algorithm. there is a phrase widely used in education circles nowadays: 'Learning how to learn' and it is often criticised as human babies are already born with the ability to learn. But in the case of machines I suspect that is the way to go. They lack the genetic encoding we embedded in our biological systems for so long. Maybe we should treat these early machine learning models as their DNA?
This is probably dumb, but does anyone else have trouble understanding the folding analogy at 12:15? Is he suggesting that the planes are superimposed, one on top of the other? Or is he suggesting that the sum of figure a and figure b leads to figure c? Can anyone help me understand it?
I didn't get what the reason is to use symbolic regression. Analytical relationships/models are not the same as symbolically representable ones. "Derivability" is required.
model mining? brain digging? fascinating, i guess we gonna need some tools to uncover these gems from the neural nets or do we need to build the nets/models in a more comprehensive way?
35:21 A pretrained model being good after some epochs in the Polymathic results does not mean that training from scratch ends up with a worse error. It is just a matter of time until the from-scratch model reaches the same quality.
Right, the point is energy efficiency and optimized speed/quality for multiple applications. The pretraining is done once for the foundation model, which saves effort for the various later applications.
What an amazing fck of presentation. I mean, of course the subject and research is absolutely mind-blowing, but the presentation in itself is soooo crystal clear, I will surely aim for this kind of distilled communication, thank you!!
Thank you for your talk. I found it extremely interesting. I have a comment on your statement that simplicity is implied by utility: differential equations are very useful in describing our world, however they are, at least in my mind, not simple, and to most people also not familiar. I would love to discuss it!
At 17:53, he has a plot on the right side, but he seems to attain only an expression in the variables x and y. There is no equation, so how is he even able to make a plot against those 2 variables? If you try plotting some of the given expressions by equating them to a constant (e.g. 2(x+sin(y+1.3))=3 ), you don't get anything that looks like his plot. If there is a 3rd variable (e.g. z, or something like f(x, y)), then the plot should be a 3D plot. Instead, the plot is 2D.
17:20 If a genetic algorithm is a brute-force algorithm, why use it? Is its time complexity less than that of the brute-force algorithm, similar to dynamic programming used in RL?
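A rough sketch of why exhaustive enumeration is usually avoided here, under simplifying assumptions that are not from the talk: expressions are binary trees built from only binary operators, with a fixed set of leaf symbols. The operator and leaf counts below are hypothetical. The count covers syntactically distinct trees, not distinct functions, so it overstates the number of truly different candidates.

```python
from math import comb


def catalan(n):
    """Number of binary tree shapes with n internal (operator) nodes."""
    return comb(2 * n, n) // (n + 1)


def count_expressions(n_ops, n_binary_ops=4, n_leaves=3):
    """Count expression trees with exactly n_ops binary operators.

    Each of the n_ops internal nodes picks one of n_binary_ops operators,
    and each of the n_ops + 1 leaves picks one of n_leaves symbols.
    """
    return catalan(n_ops) * n_binary_ops ** n_ops * n_leaves ** (n_ops + 1)


for n in range(0, 11, 2):
    print(n, count_expressions(n))
```

The count grows exponentially in the number of operators, so a brute-force sweep only works for very small expressions. An evolutionary search gives up the guarantee of finding the best expression and instead samples this space, spending its evaluations near candidates that already fit well.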
This is SO cool! My first thought was just having incredible speed once the neural net is simplified down. For systems that are heavily used, this is so important
Regarding simplicity: I think that you are missing something important about the addition operation that makes it "simple". We are also familiar with division (the arithmetic operation) and it is also useful, but we would not say that division is "simple" in the same way that addition is simple (or we would say that addition is simpler than division, even though both are "familiar" and "useful").
That is because addition is infinitely more "useful" than division. Literally any group of things, whether physical or not, coming together in some sense is addition. There are a lot of things next to each other in the universe lol. It is because it is so fundamental that it seems so "simple", because it is and they are just two different ways of saying the same thing.
@@samuelwaller4924 I was thinking of simplicity in an algorithmic sense: addition can be performed by a simple and fast parallel circuit, while division must be performed in a stepwise, linear way, where each step depends on the result of the previous one. Multiplication is similarly simpler than division, whereas subtraction is exactly as simple as addition. My point is that these arithmetic operations are not "simple" or "complex" just because of our subjective experience with them, but because different operations actually have different innate properties, and it is a glaring flaw of analysis to think otherwise.
Training LLMs on code doesn't teach them to reason a bit better, it teaches them to reason a LOT better. It makes sense if you think about it: what do you learn when you (a human being) learn to write software? You learn a new way of thinking.
Fantastic. At 55 minutes though, it is suggested that we don't have a simple concept like + built into us. Perhaps not in a blank neural net, but we, for example, are not born with a blank slate. It is clear that any toddler understands, in some way, the concept of 'more' and 'less' even though they lack empirical understanding. With sufficiently robust generalized data sets based on physical principles, information theory as language and perhaps even the nature of emotions, given enough GPUs to sustain large inter-operational neural nets, would this not give rise to something more than the sum of its parts?
Could you pre-train some layers (i.e. turn the standard activation functions for a few layers into pySR estimated functions) as a way to increase/change the dimensionality of the input data? Possibly could decrease the number of layers needed or time taken to train the network. If not, run early training with parallel genetically pruned custom activation layers to approach the space from different paths while trying to find the minimum loss.
I don't know about all the fancy stuff, but as a programmer this makes me 30 to 50% more productive, and it makes my daughter, who is a manager, about 10 to 15% more productive.
Reminds me of a sociology paper with tons of seemingly complex math that, in the end, says something like, "school bullying is exacerbated when it goes unaddressed." So what was all the math for? Credibility.
one might reason out the implications of what he said here without him having to also provide the vision for how his work might be applied. or give it to a gpt and let it do it for you
Fine-tune an LLM to interpret neural nets. Iterate and maybe symbolic regression (i.e. language) will help us supercharge LLM training. But hallucinations could be a major issue...
The fact that AlphaFold discovered the remaining 199.7 million unique protein structures and Microsoft Quantum discovered 32 million new materials in essentially no time at all means there are yet _more_ discoveries to be observed unfolding from future AI models. There should be another scientific breakthrough sometime after the release of GPT-5, perhaps 2025-2026 there will be major breakthroughs in multiple sectors such as computing, medicine, physics, mathematics, as well as a new economical paradigm surfacing that involves AI and robots. The next set of major breakthroughs is likely to occur shortly after the $100 billion Microsoft/OpenAI data center comes online in 2028, although Google is also doing the same thing on a similar timeline now. I expect to start seeing flurries of discoveries and the pace accelerating around 2028-2030 and onward.
35:16 Yes, doing the model from scratch with traditional machine learning is worse compared to the pre-trained generative network, but only for the *same time frame*, if you give the traditional machine learning approach more *time*, then it can out-perform the pre-trained generative network, while the pre-trained network will just keep on spitting out the same type of results.
The better approach is to use the pre-trained generative network to bootstrap samples for the genetic programming("Scratch-AViT-B") model thus getting the best of both.
Great presentation. Its marvelous to see a take on AI from a broad, scientific/mathematical perspective without too much focus on technicalities. Really exited to see how this might improve or add to our understanding of the/(this?:) ) universe.
@@JorgetePanete Thank you for pointing this out. It shows that LLMs are already surpassing humans (like myself) in many respects - ChatGPT makes no spelling mistakes.
With enough inputs you can make any curve or field match the current data. So is this even science? I am very skeptical. It will provide very little real insight when you have an inscrutable AI model able to predict something. It might as well be the oracle of Delphi.
You might want to think about simplicity in terms of Kolmogorov complexity, e.g. your NN should try to emit the least complex, in the Kolmogorov sense, syntax tree. Also, I think "+" is simple because it is closed over the integers. I think that if your operation takes you from one domain to another, it's more complicated. In that way you might consider using Category Theory. You could think about penalizing models that "move" further away into other mathematical spaces from a "base" space.
Kolmogorov complexity can be thought of as the ideal “lower bound” for a compressor/predictor in unsupervised learning. But it’s also uncomputable, which would make it hard to implement in practice 😅
@@coda-n6u true, I think I was trying to get at a weighting of the symbols used. I’m not sure if that could be learned or would have to be assumed. I think 1+1 is simple because it is in some ways assumed (forgetting Russell), whereas something difficult like, say, the Kullback-Leibler divergence is defined in terms of simpler primitives. Edit: big picture would be that you need some sort of error term to trade off against accuracy, otherwise your tree grows without bound either in depth or in complexity of the operators. Consider it something like dropout or pruning.
@@DensityMatrix1 Yeah that's interesting! I feel like any theory with a sufficiently complex symbolic representation could be factored into smaller bits that could themselves be learned as features. It's a big search problem, so I guess it's about allowing the algorithm to search deeply + generate complicated symbolic representations, but having it bias towards shorter ones (since they're more likely to be true). Honestly a big problem I have no idea how to solve.
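The accuracy-versus-size trade-off discussed in this thread can be sketched with a node-count penalty, which is a computable stand-in for Kolmogorov complexity. Everything here is an illustrative assumption: the nested-tuple tree encoding, the helper names, and the weight `lam` are made up, and this is not how PySR scores candidates.

```python
import math


def complexity(expr):
    """Count the nodes of a nested-tuple expression tree."""
    if isinstance(expr, tuple):
        return 1 + sum(complexity(arg) for arg in expr[1:])
    return 1  # a variable name or a constant


def evaluate(expr, x):
    """Evaluate a tree such as ("+", ("*", 2, "x"), 1) at the point x."""
    if expr == "x":
        return x
    if isinstance(expr, (int, float)):
        return expr
    op, *args = expr
    vals = [evaluate(arg, x) for arg in args]
    if op == "+":
        return vals[0] + vals[1]
    if op == "*":
        return vals[0] * vals[1]
    if op == "sin":
        return math.sin(vals[0])
    raise ValueError(f"unknown operator: {op}")


def score(expr, data, lam=0.01):
    """Mean squared error plus a penalty proportional to tree size."""
    mse = sum((evaluate(expr, x) - y) ** 2 for x, y in data) / len(data)
    return mse + lam * complexity(expr)


data = [(x, 2 * x) for x in range(5)]
simple = ("*", 2, "x")
bloated = ("+", ("*", 2, "x"), ("*", 0, ("sin", "x")))
print(score(simple, data), score(bloated, data))
```

Both trees fit the data exactly, so only the size penalty separates them and the shorter tree gets the lower score. Weighting individual operators differently (for example, charging more for `sin` than for `+`) would be one way to encode the "weighting of symbols" idea from the comments above.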
Solomonoff induction isn't tractable for beings with finite compute and AFAIK there's no standout best approximation to it. Myopic piecemeal modeling is probably better in many cases than trying for a theory of everything.
Problem, reaction, solution -- The Hegelian dialectic! Neural networks create solutions to input vectors or problems, your mind is therefore a reaction to the external world of problems! Thesis (action) is dual to anti-thesis (reaction) creates the converging or syntropic thesis, synthesis -- the time independent Hegelian dialectic. Concepts are dual to percepts -- the mind duality of Immanuel Kant. Vectors (contravariant) are dual to co-vectors (covariant) -- Riemann geometry is dual. Converting measurements or perceptions (vectors) into ideas or conceptions is a syntropic process -- teleological. Your mind is building a "reaction space" from the input or "problem (vector) space" to create a "solution space" and this process is called problem solving or thinking (concepts) -- Hegel. Targets, goals, or objectives are inherently teleological and problem solving is a syntropic process -- duality! "Always two there are" -- Yoda. Syntropy is dual to increasing entropy -- the 4th law of thermodynamics!
@@ThePyrosirys You can treat input vectors as problems, watch the following:- ua-cam.com/play/PLMrJAkhIeNNQ0BaKuBKY43k4xMo6NSbBa.html Problems are becoming solutions (targets) via optimization -- a syntropic process, teleological. Neural networks are therefore syntropic as they learn as they converge towards goals and solutions. The learning process is teleological as your goal is to achieve a deeper understanding of reality. Perceptions are dual to conceptions -- the mind duality of Immanuel Kant. Machine learning is based upon the Hegelian dialectic if you treat your input vectors as problems!
@@ThePyrosirys Minimizing prediction errors is a syntropic process -- teleological. "The brain is a prediction machine" -- Karl Friston, neuroscientist. Syntropy is the correct word to use here and means that there is a 4th law of thermodynamics -- duality. Average information (entropy) is dual to mutual or co-information (syntropy) -- information is dual! Your brain processes information to optimize your predictions -- natural selection.
So what you are saying is: our mind creates models based on patterns we observe to predict reality? How does that imply that information is dual? What do you even mean by "information is dual", and how do you apply Hegelian dialectics here? Thesis/antithesis refer to concepts that are contradictory to each other
Surely the problem with AI is Fudge In = Fudge Out, so if the Standard Model (and especially attempts to fix it) is full of fudge then fudge will result. I'm not saying the model outline below is correct, but if it is, or something pretty similar, no physics AI would come up with it, even if fed all the accepted (potentially) useful papers, and (filtered, biased, artefact-ridden) data..
--
POLECTRON FIELD: cell: a + & a - particle split by Full Split Energy as a positron+ & electron-. Bonds to 12 neighbours
MATTER: p+ / e- = half cell (& a cell as +-+ / -+-)? Polarises field as + & - shells. SPIN: centre polarisation axis
LECKY: total absolute charge. MASS: cells/lecky inside particles. INERTIA: field rebalances behind mass with a kick
STRONG GRAVITY: field repels mass. DARK ENERGY: voids grow as lecky shrinks cells and is lost to gravity gradients
DARK 'MATTER': galactic lecky gradient. Denser field slows acceleration and TIME, thinner field aids acceleration
BIG BANG: more proton-antiproton pairs malformed as proton-muon than antiproton-antimuon so hydrogen beat antihydrogen
POSITRONIUM: e+p. Muon: ep_e. Proton: pep. Neutron: pep_e. Tau: epep_e. Neutron mass is halfway between muon and tau
ANTIMATTER: 1,2 e_p pairs annihilate. 3: proton+anti proton or muon+anti muon. 4: neutron+anti neutron. 5: tau+anti tau
WEAK FORCE: unstable atoms form and annihilate e_p pairs. BETA- DECAY: pep_e => pep e. BETA+: pep + new e_p => pep_e p
NUCLEAR FORCE: neutron electrons bond to protons. ENTANGLEMENT: correlation broken by interaction? Physical link?
BLACK HOLE: atoms cut into neutrons fused as higher mass tau cores (epep). Field rotates. Core annihilates: ep => cell?
PHOTON: cell polarisation/lateral shift wave. LONGITUDINAL WAVE: gravitational wave, neutrino: 1 to 3 cell wave
DOUBLE SLIT: photon/particle field warps diffract and interfere, guiding the core. Detectors interfere with guides
ENTROPY: simplicity. Closed system complexity reduces over time. Uniformly (dis)ordered (hot)/cold field is simplest
This is not an endorsement of your alternative model, but the skepticism of models and digging deeper the conceptual ruts that we dig ourselves into. In flat world, we are all just lengths...
@@rugbybeef .. and widths unless it's a 1D flat world... I'm not into the Holographic Universe even though 2D is technically simpler than 3D - just not when we live in a 3D reality. Gravity, Dark Energy and Dark Matter need to be linked to one field, might as well make it an EM particle field. Neutron Mass is halfway between Muon and Tau bar a tiny bit of binding energy. I don't know why this relationship is not mentioned by anyone but me.
@@PrivateSi So I've always wondered about this, as vision is a 2D diminishment of our 3D world. I always believed that flatlanders would only see the lengths of their colleagues in a 1D analogue. If their square friend had distinct colors on each face, they would see them and could infer their colleague's vertices. However, differentiating a circle of radius 1 from a square of width 1 that rotated synchronously each time you tried to move around it would be impossible.
The issue is discovering the higher-ordering principle which subsumes a continuum of self singularities and discontinuities. Linear math works well in-between the singularities, but cannot extrapolate through them, in a sense they are like mathematical worm-holes. Attempts to linearize across the discontinuities will fail. A whole harmonically-related series will only be properly understood from the perspective of a higher-ordering principle, similar to the idea of projection from a higher magnitude to a lower dimensional space, or from the idea of negative curvature. The point is the epistemological assumption of a static model is problematic, the real world has static islands which are bounded within areas of great change, and so the basic function changes completely there, that is to say, the dynamics of change themselves change. So to bridge that gap you can’t just ignore it, or flatten it, you have to seek how to remap it in such a manner that it is no longer infinite, but cyclical, as Gauss did with the complex number domain.
Solving problems is the essence of the Hegelian dialectic. Problem, reaction, solution -- The Hegelian dialectic! Neural networks create solutions to input vectors or problems, your mind is therefore a reaction to the external world of problems! Thesis (action) is dual to anti-thesis (reaction) creates the converging or syntropic thesis, synthesis -- the time independent Hegelian dialectic. Concepts are dual to percepts -- the mind duality of Immanuel Kant. Vectors (contravariant) are dual to co-vectors (covariant) -- Riemann geometry is dual. Converting measurements or perceptions (vectors) into ideas or conceptions is a syntropic process -- teleological. Your mind is building a "reaction space" from the input or "problem (vector) space" to create a "solution space" and this process is called problem solving or thinking (concepts) -- Hegel. Targets, goals, or objectives are inherently teleological and problem solving is a syntropic process -- duality! "Always two there are" -- Yoda. Syntropy is dual to increasing entropy -- the 4th law of thermodynamics!
No, convolution layers, like 2D ones, take inputs and extract further features by applying specific linear kernel methods (specific to 2D or 3D space). This seems to be doing something different: it is not a layer, but instead applies different layers together by folding them over. Tbh I don't understand folding, but convolution layers are common in image problems so they are easier to understand
38:18 Why is “+” simple? Well, maybe because it is closely related to the concept of counting and “having” things. If I have two apples and I get (add) three more, I can just count the number of apples to verify that I now have five apples. That has got nothing to do with the concept of simplicity. Not sure if I even want to continue watching the whole thing…
"Science today will be this one: the experimentalist arrives with a data collection unit, the theorist arrives with a Neural network and symbolic regression algorithm, we sit down, we plug in both machines, observe the two machines performing the scientific inquiry for us and then real Understanding comes. We did our super ego duty. The science is done out there for us. And maybe while the scientists sit there they come up with a truly novel idea together but it is pure curiosity, surplus, since the science is already done."
For "+," I do think it is simple because I hypothesize that the human brain does have built-in neurons specifically for counting small numbers (usually 5-9, varying between persons), so when you are an infant, you don't actually need to learn to count objects under this number (I suspect that in a certain area of the brain, likely the hippocampus, there is this number of special neurons that serve as synaptic placeholders for the visual cortex in object identification). This then serves as the starting point to further learn the abstract concept of "+." That is also why "+" is the first mathematical operation that most humans (if not all) learn. If nothing is built-in, I wonder if someone can teach a human multiplication without them knowing addition. This experiment would be highly unethical, tho.
This is already well known. It's called 'subitizing'. I believe the research showed that subitizing is not implemented in separable neural substructures.
Thoughts from my deep ignorance Regarding the idea that " +" might be assumed ( in replies) to be the first mathematical operation of human behavior. I wonder what would be different if looking at this from my perspective "What if “ - “ is actually the first mathematical operation? What if the second operation, the “+ “ is the process of filling in the vacuum caused by the first “ - “ ? The first loss of coherence.. as an identifiable cellular membrane (ovum) being fully formed and then losing that coherence by the separation of the membrane experiences as a gap forced by penetration of new foreign material (sperm) that then becomes assimilated, exchanged. Not either or, + or - but shared - is part of + . And always was.
I'm going to state the obvious. That is smart. Yes, it raises questions about AI explainability regarding deep learning NNs, but what this chap is saying is quite brilliant. For me, as long as the conventional approach is combined with the model he is propounding, there should be some excellent science out of that. Then there can be even more science when we start to understand the reasons and mechanisms by which the deep learning neural networks that humans build do what they do and are capable of what they are. Let's not miss the point of what he is saying, at least what I interpret that he is saying... The NN is finding some order through patterns, and it really is those patterns that are probably most related to something interesting, i.e. of scientific interest. Then we can sift through the rest of the noise to see if something was missed, say, if questions are presented that don't have an answer. So all in all, it is a very powerful way of cutting through the fluff. If we then want to scientifically describe the fluff itself, it is now more distinct. I think what this guy is saying is brilliant. Incidentally, I think we will ultimately find out that deep learning neural networks come to sensible decisions because they have the fidelity to tap into the innate intelligence structure of reality itself, but that is a next topic, although entirely pertinent.
Symbolic Regression is starting to catch on but, as usual, people aren't using the Algorithmic Information Criterion so they end up with unprincipled choices on the Pareto frontier between residuals and model complexity if not unprincipled choices about how to weight the complexity of various "nodes" in the model's "expression". A node's complexity is how much machine language code it takes to implement it on a CPU-only implementation. Error residuals are program literals aka "constants". I don't know how many times I'm going to have to point this out to people before it gets through to them (probably well beyond the time maggots have forgotten what I tasted like). This whole notion that "+" is just what we're used to is intellectual poison.
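To make the "Pareto frontier between residuals and model complexity" concrete, here is a minimal sketch in plain Python. The candidate expressions, complexity scores and loss values are all invented for illustration; they do not come from the talk or from any real PySR run, and `pareto_front` is a hypothetical helper, not a library function.

```python
# Hypothetical candidates: (expression, complexity, loss).
# Every number here is made up purely for illustration.
candidates = [
    ("x", 1, 4.0),
    ("x + y", 3, 1.5),
    ("x * y", 3, 2.5),
    ("x + sin(y)", 4, 0.9),
    ("x + sin(y) * y", 6, 0.9),
    ("x + sin(y + 1.3)", 6, 0.1),
]


def pareto_front(cands):
    """Keep candidates that no other candidate dominates.

    A candidate is dominated when another one is at least as simple
    and at least as accurate, and strictly better on one of the two.
    """
    front = []
    for name, c, l in cands:
        dominated = any(
            (c2 <= c and l2 <= l) and (c2 < c or l2 < l)
            for _, c2, l2 in cands
        )
        if not dominated:
            front.append((name, c, l))
    return sorted(front, key=lambda item: item[1])


front = pareto_front(candidates)
for name, complexity, loss in front:
    print(f"{complexity:2d}  {loss:4.1f}  {name}")
```

The frontier only removes candidates that are worse on both axes. Picking a single expression from what remains still needs some rule for trading complexity against residuals, and how each node's complexity is scored is itself a choice; those are the two decisions the comment above says should be made on principled grounds.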
Could these models apply compression to themselves through techniques like quantization, pruning, and knowledge distillation becoming faster and faster and smaller until AGI emerges from a phone sized device which can invent warp drive?
Cool idea! Essentially, we can deduce symbolic, testable scientific theories from deep learning models using things like PySR. Making foundation models (which are trained on a wide variety of phenomena, not necessarily related to the area of application) for specific scientific application gives ANNs an advantage. Simplicity (explainability, legibility) comes from familiarity with a problem area, so we should be training models on lots of diverse examples to help them “get used” to solving these types of problems, even if the examples may seem irrelevant (cat videos & differential equations 🐈) Interesting application of explainable AI 🎉 Congratulations on your research
36:36 Becoming more basically intelligent because of understanding spatiotemporal connectivity. The flashing faces in peripheral vision illusion shows us the monsters we create when we lack that.
I think this is likely, but it will be another lowering of the goal posts of science. The first lowering was with Newton, and the abandonment of the idea of understanding the causes of physical principles in intuitive terms. This was thrown out and replaced with building intelligible mathematical theories of the world, instead of trying to make the world itself intelligible. AI will be the next lowering of the goal posts, where we can no longer even make intelligible theories of the world; instead, the theories will be totally unintelligible to any human, locked away in the statistical correlations of some AI. Instead of trying to interpret data from the world, we'll be trying to interpret AI.
I think you missed the point of what he's saying. He's not saying we should just build AI and use neural nets to solve problems; rather, we should train the network, see how it solves the problem, and attempt to deduce a theory from studying the actual mechanisms of the neural network itself. The network is not the end goal, it's merely a new tool for data analysis.
Thank you. I guess we have to let them have their journey. The good news is the AI will mirror back their unconscious motivations for wanting these increasingly abstract and unintelligible constructs.
This has crossed my mind and this is exciting indeed. High dimensionality patterns are often hidden but the fact that they are high dimension makes for the discovery of robust natural laws. We are in need of territory. We no longer have to rely on empirical, philosophical or mathematical models to create natural laws. Data in high dimensionality can reveal many laws. Exciting times!
So is this headline clickbait, as usual? Or could you provide timestamps with the main conclusions drawn shortly and clearly?
@@GEMSofGOD_com yep, it's just bog standard stuff that RNNs have been used for since they were first developed, just with more horsepower now.
@@orbatos I've now noticed an interesting Newtons metric part of this talk. Such searchers of patterns are interesting.
For anyone that does not do work with ML, the takeaway of symbolic regression as a means of model simplification may seem quite powerful at first, but often our rationale for using neural nets is precisely the difficulty of deriving explainable analytical expressions for phenomena. People like Stephen Wolfram suggest that perhaps this assumption that complex phenomena can be modeled analytically is precisely why we are having problems advancing. To seasoned ML researchers, the title of the video sounds like the speaker will be explaining techniques to analyze neural net weights instead of talking about this.
It seems like a very powerful idea: the AI observes the system, then learns to predict its behaviour, and then the rules behind these predictions are used to deliver a math statement. Wish the authors the best of luck
It is precisely what I'm working on for some time now, very well explained in this presentation, nice work! (the idea of pySR is outrageously elegant, I absolutely love it!)
John Koza had Genetic Programming which is basically the same thing in the 90s. He made documentaries, talking about reusing learnt functions and everything, very interesting. Didn't really take off though, it just suffers from being slow like most evolutionary methods (unless you parallelise massively like OpenAI Evolution strategies) and can't learn more complex tasks that deep learning can. In another timeline it could've got more attention and maybe become better than neural nets
@@gumbo64 maybe its application will be better suited for some other situations or environments or scales in the future if NNs hit some type of thing they cannot overcome quickly enough.
@@Fx_- We're currently hardcapped on current AI models with hardware but I am building a full stack system that takes advantage of currently existing hardware with implementing a daughter board to speed up the analogous computational requirements for large scale implementation. You'd be surprised at how little you need extremely large supercomputers when you scale more efficiently. Well and also leverage quantum computers for their relation with randomness.
Another banger from youtube algorithm
Yes but not for everyone, only those with the capability to appreciate this for what it is
@@JetJockey87 Edgy teenager says what?
facts
1 good vs 1e9 bad suggestions
@@DreadedEgg the implication here is he was less than two years of age when he signed up for a YouTube account… I suppose anything is possible these days
I came here to read all the insane comments, and I’m not disappointed.
We love our crackpots don’t we folks
;) The typical crackpots are here to submit their opinion and here I can't even get past half of it for how insanely hard this topic is.
Great minds are .,..,...
It's so cool when people are simply arrogant, and offer nothing to counter those ideas with which they take issue! Keep it up!
Well then, allow me. EUUUUUAAAHHHHH EUAHHHHH AAAAA SKYNET GRAY GOO!!! Omg I DON'T UNDERSTAND MATH HOW CAN YOU DO IT BY YOURSELF? Ancient aliens!!!! David Ike, D.u.m.b.s, Robert Bigelow taco bell space station!!! REEEE SCREEEEEEE.
You're welcome. Also I looove math and science and astronomy. Happy learning!
So here we are, you guys seem to have been chosen by the algorithm for us to meet here. Welcome, for some reason.
It makes intuitive sense that a cat video is better initialization than noise. It's a real measurement of the physical world
I think it is mostly the fact that, as he said, cats don't teleport or disappear, so you have some sense of structure and continuity that aligns with the PDEs you want to solve.
@@lbgstzockt8493 You're saying the same thing. "Structure and continuity" come from this measurement of the real world (it's a video of a real cat, experiencing real physics).
@@lbgstzockt8493 Sounds like you've never had a cat. Structure and continuity is not a guarantee. XD
I think this is the ultimate proof that cats are fluids, so it helped the fluid simulation.
@@fkknsikk lol
The folding analogy looks a lot like convolution. Also, the piecewise continuous construction of functions is used extensively in waveform composition in circuit analysis applications, though the notation is different, using multiplication by the unit step function u(t).
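The unit-step notation mentioned above lines up with ReLU networks, since the ramp t·u(t) is exactly ReLU(t). A small sketch under that observation follows; the triangular pulse is an arbitrary example chosen here, not one from the talk, and all function names are illustrative.

```python
def u(t):
    """Unit step: 0 for t < 0, 1 for t >= 0."""
    return 1.0 if t >= 0 else 0.0


def relu(t):
    """Rectified linear unit, max(0, t)."""
    return max(0.0, t)


def triangle_step(t):
    """Triangular pulse on [0, 2], peak 1 at t = 1, in circuit-analysis style.

    Built from shifted ramps (t - a) * u(t - a).
    """
    return t * u(t) - 2 * (t - 1) * u(t - 1) + (t - 2) * u(t - 2)


def triangle_relu(t):
    """The same pulse written as a weighted sum of three ReLU units."""
    return relu(t) - 2 * relu(t - 1) + relu(t - 2)


samples = [-1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0]
for t in samples:
    print(t, triangle_step(t), triangle_relu(t))
```

Both forms describe the same piecewise linear function; only the notation differs, which is the point the comment makes about waveform composition.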
Origami manifold🎉🎉🎉🎉🎉🎉🎉of course🎉🎉🎉🎉🎉🎉🎉🎉
Folding goes into compression and data theory and is the basis for the holographic universe theory.
Thought the same thing. Can do the Evaluation as a convolution of the two activation functions. Nevertheless, i guess the representation is somewhat more intuitive this way, as the middle part can be extracted as well if needed.
Thought the same! (This vid appeared in my recs after watching the 3B1B convolutions video!)
On what he's actually describing with the folding (11:10), I think it's actually pretty easy to miss, since he assumes you kind of anticipate or half-understand what he's about to say, so he goes over it pretty quickly
So for anyone who is coming to this completely naive or who might have missed it the first time, like I did...
The chart (d) essentially traces out chart (c) while (b) is increasing, then traces it in reverse while (b) is decreasing, and then traces it forwards again as (b) increases again
Some people might get slightly mad at me for pointing out the obvious
Well, it IS simple, and it's easy enough to intuit why it would happen once you see it, BUT it is only obvious once you see it, and it's easy to miss in real time (at least I think!)
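The tracing described above can be checked numerically. This is a minimal sketch, assuming (b) is a zigzag first layer that rises, falls, then rises again, and (c) is any second function applied to its output; the specific `zigzag` and `outer` functions are invented here and are not the ones on the slide.

```python
def relu(t):
    """Rectified linear unit, max(0, t)."""
    return max(0.0, t)


def zigzag(x):
    """Stand-in for chart (b): rises on [0, 1], falls on [1, 2], rises on [2, 3]."""
    return relu(x) - 2 * relu(x - 1) + 2 * relu(x - 2)


def outer(h):
    """Stand-in for chart (c): an arbitrary asymmetric curve on [0, 1]."""
    return h * h


# Grid with step 0.25 so every value is exact in floating point.
xs = [i / 4 for i in range(13)]               # 0.0, 0.25, ..., 3.0
composed = [outer(zigzag(x)) for x in xs]     # stand-in for chart (d)

forward = composed[0:5]    # x in [0, 1]: (b) increasing
backward = composed[4:9]   # x in [1, 2]: (b) decreasing
again = composed[8:13]     # x in [2, 3]: (b) increasing again

print(forward)
print(backward)
print(again)
```

The middle segment prints as the first segment reversed, and the last segment repeats the first. That is the "fold": the composed curve replays the outer function once for every monotone piece of the inner one.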
Amazing talk, and great Research!
Being able to derive gravity laws from raw data is a cool example. How sensitive is this process to bad data? For example: non-unique samples, imprecise measurements, missing data (poor choice of sample space), irrelevant data, biased data, etc. I would expect any attempt to derive new theories from raw data to have this sort of problem in spades.
That is a really good question.
Love the definition of simplicity, I found that to be pretty insightful.
I am re-reading once again the book by David Foster Wallace, History of Infinity. There he describes Bacon's book Novum Organum. In book one there is an apt statement that I would like to paste
8. Even the effects already discovered are due to chance and experiment, rather than to the sciences. For our present sciences are nothing more than peculiar arrangements of matters already discovered, and not methods for discovery, or plans for new operations.
The "folding analogy" is incorrect. That is not how composition works. It works only in this case because of the very specific nature of the "first layer"(in his example).
Indeed.
Can you tell me more about what is incorrect?
There are multiple different awesome ideas in this presentation.
For example, an idea of having a neural net discovering new physics, or simply of being the better scientist than a human scientist. Such neural nets are on the verge of discovery or maybe in use right now.
But I think the symbolic distillation in the multidimensional space is the most intriguing to me and a subject that was worked on as long as the neural networks were here. Using a genetic algorithm but also maybe another (maybe bigger?) neural network is needed for such a symbolic distillation.
In a way, yes, the distillation is needed to speed up the inference process, but I can also imagine that the future AI (past the singularity) will not be using symbolic distillation. Simply, it will just create a better single model of reality in its network and such model will be enough to understand the reality around and to make (future) prediction of the behavior of the reality around.
We call it abstraction🎉🎉🎉🎉
And with all this advancement we don't have fresh good water and we don't have long term stable electricity and not enough minerals for development
@@shazzz_land that's because of the higher ups/elites not AI or technology.
@@denzelcanvasYT People don't fear AI, they fear capitalism
none of that will ever happen lol. Neural nets cannot reason. Theory is important. Science & physics alone aren't just data & statistics. Those two are actually pretty new to science.
I was wondering about, or missing, the concept of Meta-Learning with transformers, especially because most of these physics simulations shown are quite low-dimensional. Put a ton of physics equations into a unifying language format, treat each problem as a gradient step of a transformer, and predict on new problems. In this way, your transformer has learned on other physics problems, and maybe infers the equation/solution to your problem right away. The difference to pre-training is that these tasks or problems are shown one at a time, unlike the entire distribution without specification. There has been work on this for causal graphs and low-dimensional image data such as MNIST, where the token size is the limiting factor of this approach, I believe.
Quote (16:40): State of the art for symbolic regression...
25 days later a paper was released where so-called KANs were used to do symbolic regression, and I am pretty sure that this will be the state of the art.
I know it was used only on small datasets and has some other flaws, but this is not worth talking about since we will make it work.
They also reference Miles Cranmer.
KANs do not scale well
I can't shake the feeling that someone is going to train an AI model on a range of differently scaled phenomena (quantum mechanics, atomic physics, fluid dynamics, macro gravity / chemical / physical dynamics) and accidentally find an aligned theory of everything, and they'll only end up finding it because they noticed some weird behavior in the network while looking for something else.
Truly, "the greatest discoveries are typically denoted not by 'Eureka' but by 'Hm, that's funny...' "
The problem is thinking about these things as if the universe is distinguishing between scales. Any true "theory of everything" will by definition be scale invariant and the structures we see at different scales will be a natural result of the fundamental phenomenon at that level.
We don't discuss that human beings very rarely exist entirely independently. If there is a human being in a place, there is an assumption that they had parents, were raised to maturity/independence, and that must have occurred in a finite time period. These are such basic assumptions that no one would believe someone who claims they came into being fully formed and were an independent creation by a God or randomness. We cannot know what the original person or primordial ooze came to be simply by looking at our current local environment.
Just like the guy who finds a severe vulnerability in linux ecosystems, accidentally by just benchmarking a database. And shits, that happened recently lol
Someone watched pi..
Great presentation!
My main takeaway is that we need a more unified approach to neural network models. Interoperability is important and can substitute for or even supersede the quality increase of pre-training.
Jesus christ, okay YouTube I will watch this video now stop putting it in my recommendations every damn time
You can press 'Not Interested' and it should stop suggesting it.
@@jumpinjohnnyruss I don’t think that’s what he’s talking about.
Ha I have surrendered to it just to get it off too, but unless you hit not interested, it will come back even if you watched it.
But I don't want to say not interested and have YouTube think that I don't like AI, because I do (btw this video was damn interesting, thumbs up from me)
33:16 Mark my words. There won't be any foundation-level model that can achieve 5 digits of accuracy like finite differences do for PDEs, a method popularized three hundred years ago by Euler. Using the model alone (without the help of non-blackbox outer algorithms or a second-order optimizer), no matter whether you have 1000 billion params or whatever, never. 1000 years from now our AI overlords will still use finite differences (maybe the BDF table will be learned by a blackbox).
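For readers unfamiliar with the accuracy being referred to, here is a minimal sketch of a second-order central difference. It only illustrates the stencil on a known function (sin, chosen here for convenience); it is not a PDE solver and says nothing about what neural models can or cannot reach.

```python
import math


def second_derivative(f, x, h):
    """Central-difference estimate of f''(x); truncation error shrinks like h**2."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


x0 = 1.0
exact = -math.sin(x0)  # the second derivative of sin is -sin


def error(h):
    """Absolute error of the stencil at x0 for step size h."""
    return abs(second_derivative(math.sin, x0, h) - exact)


for h in (0.1, 0.05, 1e-3):
    print(f"h = {h:<6}  error = {error(h):.3e}")

# Halving h should cut the error by roughly a factor of four.
ratio = error(0.1) / error(0.05)
print(f"error ratio when h is halved: {ratio:.3f}")
```

With h = 1e-3 the estimate already agrees with the exact value to better than one part in a million, which is the kind of cheap, controllable accuracy the comment has in mind.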
As an artist using image generation models, it's become obvious that foundational models trained on very wide content perform much better in general.
It's similar to an artist drawing nudes and studying skeletons in order to draw fantasy characters better.
It's also been shown that newer foundational models that have their dataset neutered do not perform as well, even though they might be higher resolution, or generate more detail.
This is why I think it could be argued that training is transformative and falls under fair use. Unfortunately the marketing has been centered around making images that looked like other people's work (copying) which is a mistake. This has attracted people to file lawsuits against AI companies.
This could be mitigated if AI companies worked closer with actual artists in order to better understand the creative process and how that relates to presenting the technology as a tool for artists, similar to how this presentation is illustrating how to use these tools for scientists.
This is a very nice idea. I hope it will work! It will be very interesting to see new analytical expressions coming out of complicated phenomena.
Well, not sure this will go anywhere except maybe modifying some of our archaic equations with nonlinear terms. The problem is probably related to NP hardness and using more expansive nonlinearity methods to crack certain problems that are more specified. We will always not know what we don't know. Using more general nonlinear models was bound to greatly improve our simulations. The real question for NNs: is this the MOST ACCURATE or most INSIGHTFUL and BEST of nonlinear methods to do so? Somehow I doubt this, but it's certainly a nice proof of principle and a place to venture off further. To put all our faith in it might be a mistake though. We might be looking at the limits to reductionism long predicted by mathematicians, and our first method to not overfit billions of parameters will give us an illusion that this is the only way, and we could be looking at a modern version of epicycles. If we want to really go further we need to use such models to not just get better at copying reality, but to find general rules that allow its consistent creation and persistence through time. Perhaps one way to do this would be to consider physical type symmetries on weights.
RE: what you said at the end there - You're thinking of PINNs, check out Steve Brunton and Nathan Kutz
Hmm do you think resonance and harmonics might fit in here. I imagine that patterns of connections within NN/neural networks that are self-stabilizing in some way would tend to persist throughout iterations (a kind of memory). Physics gives us resonance and harmonics that describe periodic behavior in everything from atoms to predator-prey relationships to solar systems. The fourier transform essentially gives us a logic chain to describe any signal, but as some combination of periodic frequencies instead of linear lengths. It is a concept that arises again and again. Both quantum and relativistic perspectives of spacetime are highly influenced by periodic or near-periodic behavior. Maybe this is fundamental to NN as well and the cat videos taught the AI how to recognize low-dimensional periodic relationships in data. Which could explain why it helped as a preset for totally unrelated data. I'm not exactly sure if that was at all similar to what you were suggesting but it seemed related in my mind.
Half-baked thought sources:
www.quantamagazine.org/how-the-physics-of-resonance-shapes-reality-20220126/
www.sciencedirect.com/science/article/abs/pii/S0893608012002584 (machine learning with adaptive resonance)
Of course we only know what we know.
Won't modeling the known better, lead to discovering what sticks out abnormally?
This will probably lead us to newer discoveries, quicker.
yooo, i'm so glad i came across this! i've been thinking about how neural networks can teach us about our own thinking and pattern finding; i'm glad there is discussion about it
okay it's not about what i initially thought, but whoa. this polymath approach sounds excellent. i feel it's similar to how people who study many different fields can be quicker to grasp a novel problem
Wow this is incredible and sort of confirms some thoughts I’ve had about neural networks and the compression of knowledge.
been in the rabbit hole lately so glad this popped up you rock miles!
I was pretty surprised to see this not actually propose much of anything other than using tools to analyse patterns, the same tools that have been in use for decades. Is this a venture pitch? Throwing more processing at it helps, but doesn't "solve" anything on its own.
There's a paper on Feature Imitating Networks that's gotten a few good applications in medical classification, and subtask induction is a similar line of thought. FINs are usually used to produce low dimensional outputs, but I was thinking about using them for generative surrogate modeling. FINs can help answer the question of how to use neural networks to discover new physics.
An idealized approach would turn every step of a coded simulator into something differentiable.
It occurs to me that the approach of this talk, and interpretability research generally, is essentially the inverse problem of trying to get neural networks to mimic arbitrary potentially nondifferentiable data workflows.
This is a great talk, laughed a lot at "literally".
Surely genetic algorithms struggle heavily with local minima. Does PySR avoid this with whatever method it uses?
I love the idea of using a foundation models approach for PDEs of different families to deal with small sample problems.
Never heard of either SR or program synthesis until this talk but both seem related to my interests, glad I watched this!
Adversarial examples for science is fucking insane and I love that guy's question.
All i always wanted to hear is in this video ! thanks !
Very cool visual at 28:12 - where would harmonic analysis fit?
So am I the only one who is going to point out that SORA from OAI is basically a generalization of a 3D engine that might let us perform experiments!
30 mins to say that you can fit simpler models to a neural network data-generating process, and another 30 to say that more training data (even if relegated to what we call “pretraining”) improves performance.
ps: things are simple because they are ubiquitous and they are ubiquitous because it’s how the world works (law of conservation of mass and energy, i.e. addition), not because it’s “useful”
Serious questions here, isn't his "folding analogy" just superposition of waves? Or I am missing something?
The 'Avada Kedavra' potential of that pointy stick is immense. Brilliant presentation.
Read another book
one of the best suggestions of the algorithm. there is a phrase widely used in education circles nowadays: 'Learning how to learn' and it is often criticised as human babies are already born with the ability to learn. But in the case of machines I suspect that is the way to go. They lack the genetic encoding we embedded in our biological systems for so long. Maybe we should treat these early machine learning models as their DNA?
This is the first exciting concept I’ve heard in the current AI revolution
This is probably dumb, but does anyone else have trouble understanding the folding analogy at 12:15? Is he suggesting that the planes are superimposed, one on top of the other?
Or is he suggesting that sum of figure a and figure b lead to figure c?
Can anyone help me in understanding it?
I didn't get what the reason is to use symbolic regression. Analytical relationships/models are not the same as symbolically representable ones. "Derivability" is required.
Model mining? Brain digging? Fascinating. I guess we're going to need some tools to uncover these gems from the neural nets, or do we need to build the nets/models in a more comprehensive way?
35:21 The pretrained Polymathic model getting good results within a few epochs does not mean that training from scratch ends up with a worse error. It is just a matter of time until the from-scratch model reaches the same quality.
Right, the point is energy efficiency and optimized speed/quality for multiple applications. The pretraining is done once for the foundation model, which saves effort for the various later applications.
This is actually really important
I would say this is not as important as the book... called "where's my cheese". Have you seen it?
What an amazing fck of presentation. I mean, of course the subject and research is absolutely mind-blowing, but the presentation in itself is soooo crystal clear, I will surely aim for this kind of distilled communication, thank you!!
Thank you for your talk. I found it extremely interesting. I have a comment on your statement that simplicity is implied by utility: differential equations are very useful in describing our world, however they are, at least in my mind, not simple, and to most people also not familiar. I would love to discuss it!
At 17:53, he has a plot on the right side, but he seems to attain only an expression in the variables x and y. There is no equation, so how is he even able to make a plot against those 2 variables? If you try plotting some of the given expressions by equating them to a constant (e.g. 2(x+sin(y+1.3))=3 ), you don't get anything that looks like his plot.
If there is a 3rd variable (e.g. z, or something like f(x, y)), then the plot should be a 3D plot. Instead, the plot is 2D.
it's a mistake, they're implicitly equated to 0
17:20 If a genetic algorithm is a brute-force algorithm, why use it? Is its time complexity lower than that of plain brute force, similar to dynamic programming used in RL?
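A rough way to see why plain enumeration is not an option here, and why a guided search such as a genetic algorithm gets used instead, is to count how many expression trees exist at each size. The sketch below assumes a made-up operator set (3 leaves, 2 unary operators, 4 binary operators); those numbers are illustrative and not taken from the talk. It counts syntactically distinct trees, so x+y and y+x are counted separately.

```python
from functools import lru_cache

# Assumed, illustrative operator set (not from the talk).
N_LEAVES = 3   # e.g. x, y, and one constant
N_UNARY = 2    # e.g. sin, cos
N_BINARY = 4   # e.g. +, -, *, /


@lru_cache(maxsize=None)
def count_expressions(n_nodes):
    """Number of syntactically distinct expression trees with exactly n_nodes nodes."""
    if n_nodes < 1:
        return 0
    if n_nodes == 1:
        return N_LEAVES
    # Case 1: the root is a unary operator over one subtree of n_nodes - 1 nodes.
    total = N_UNARY * count_expressions(n_nodes - 1)
    # Case 2: the root is a binary operator; split the remaining nodes
    # between a left and a right subtree in every possible way.
    for left in range(1, n_nodes - 1):
        right = n_nodes - 1 - left
        total += N_BINARY * count_expressions(left) * count_expressions(right)
    return total


for n in (1, 3, 5, 10, 20):
    print(n, count_expressions(n))
```

The count multiplies by a roughly constant factor with every extra node, so checking every expression quickly becomes hopeless. An evolutionary search only samples this space, mutating and recombining candidates that already fit well. That is a heuristic with no guarantee of finding the best expression, not a lower-complexity exact algorithm.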
This is SO cool! My first thought was just having incredible speed once the neural net is simplified down. For systems that are heavily used, this is so important
Great path to walk on... wishing luck to the lecturer and his fellow researchers
Regarding simplicity: I think that you are missing something important about the addition operation that makes it "simple". We are also familiar with division (the arithmetic operation) and it is also useful, but we would not say that division is "simple" in the same way that addition is simple (or we would say that addition is simpler than division, even though both are "familiar" and "useful").
That is because addition is infinitely more "useful" than division. Literally any group of things, whether physical or not, coming together in some sense is addition. There are a lot of things next to each other in the universe lol. It is because it is so fundamental that it seems so "simple", because it is and they are just two different ways of saying the same thing.
@@samuelwaller4924 I was thinking of simplicity in an algorithmic sense: addition can be performed by a simple and fast parallel circuit, while division must be performed in a stepwise, linear way, where each step depends on the result of the previous one. Multiplication is similarly simpler than division, whereas subtraction is exactly as simple as addition. My point is that these arithmetic operations are not "simple" or "complex" just because of our subjective experience with them, but because different operations actually have different innate properties, and it is a glaring flaw of analysis to think otherwise.
Folding is an analogue to reducing dimensional complexity.
Training LLMs on code doesn't teach them to reason a bit better, it teaches them to reason a LOT better. It makes sense if you think about it: what do you learn when you (a human being) learn to write software? You learn a new way of thinking.
Fantastic. At 55 minutes though, it is suggested that we don't have a simple concept like + built into us. Perhaps not in a blank neural net, but we, for example, are not born with a blank slate. It is clear that any toddler understands, in some way, the concept of 'more' and 'less' even though they lack empirical understanding. With sufficiently robust generalized data sets based on physical principles, information theory as language and perhaps even the nature of emotions, given enough GPUs to sustain large inter-operational neural nets, would this not give rise to something more than the sum of its parts?
Could you pre-train some layers (i.e. turn the standard activation functions for a few layers into pySR estimated functions) as a way to increase/change the dimensionality of the input data? Possibly could decrease the number of layers needed or time taken to train the network.
If not, run early training with parallel genetically pruned custom activation layers to approach the space from different paths while trying to find the minimum loss.
Not feasible for UHDLSS Feature Selection.
Is the part at 12:40 just convolution or am I just dreaming?
I don't know about all the fancy stuff, but as a programmer this makes me 30 to 50% more productive, and it makes my daughter, who is a manager, about 10 to 15% more productive.
Reminds me of a sociology paper with tons of seemingly complex math that, in the end, says something like, "school bullying is exacerbated when it goes unaddressed." So what was all the math for? Credibility.
one might reason out the implications of what he said here without him having to also provide the vision for how his work might be applied. or give it to a gpt and let it do it for you
My point being, he's no philosopher, but he's demonstrating something profound beyond his ability to express it
This is the reason why I like YouTube
This is a brilliant idea. I hope this goes places
Fine-tune an LLM to interpret neural nets. Iterate and maybe symbolic regression (i.e. language) will help us supercharge LLM training. But hallucinations could be a major issue...
I already did that in February when I trained ChatGPT on quantum punctuation markers and de-markers.
Anthropic did this for GPT2
I tried this with my gynoid, but she kicked me in the nuts.
Best 25 minutes of my life.. Physician telling me a language model is just a chatbot refined.. I can do that. Let's go!
Cats are basically fluids. No wonder preinitialization on cat videos helps learning Navier-Stokes equations.
The fact that AlphaFold discovered the remaining 199.7 million unique protein structures and Microsoft Quantum discovered 32 million new materials in essentially no time at all means there are yet _more_ discoveries to be observed unfolding from future AI models. There should be another scientific breakthrough sometime after the release of GPT-5, perhaps 2025-2026 there will be major breakthroughs in multiple sectors such as computing, medicine, physics, mathematics, as well as a new economical paradigm surfacing that involves AI and robots. The next set of major breakthroughs is likely to occur shortly after the $100 billion Microsoft/OpenAI data center comes online in 2028, although Google is also doing the same thing on a similar timeline now. I expect to start seeing flurries of discoveries and the pace accelerating around 2028-2030 and onward.
35:16 Yes, doing the model from scratch with traditional machine learning is worse compared to the pre-trained generative network, but only for the *same time frame*, if you give the traditional machine learning approach more *time*, then it can out-perform the pre-trained generative network, while the pre-trained network will just keep on spitting out the same type of results.
a proper comparison would require a 3 dimensional chart comparing model error vs #samples AND training time+network evaluation time.
The better approach is to use the pre-trained generative network to bootstrap samples for the genetic programming("Scratch-AViT-B") model thus getting the best of both.
Great presentation. Its marvelous to see a take on AI from a broad, scientific/mathematical perspective without too much focus on technicalities. Really exited to see how this might improve or add to our understanding of the/(this?:) ) universe.
It's*
@@JorgetePanete Thank you for pointing this out. It shows that LLMs are already surpassing humans (like myself) in many respects - ChatGPT makes no spelling mistakes.
Yea it’s great if ur trying to use a system In a system to garner traction
With enough inputs you can make any curve or field match the current data. So is this even science? I am very skeptical.
It will provide very little real insight when you have an inscrutable AI model able to predict something. It might as well be the oracle of Delphi.
Slime mold is my favorite way of imagining it
My ass smells like fish and I haven't eaten fish in a good while.
So if you sought to get what an LLM knows out into some equations we could understand, what could they be like?
I would love to see some breakthrough in Dark Matter regime. There is so much data regarding Dark Matter yet no theory to back it up.
You might want to think about simplicity in terms of Kolmogorov complexity, e.g., your NN should try to emit the least complex, in the Kolmogorov sense, syntax tree.
Also, I think "+" is simple because it is closed over the integers. I think that if your operation takes you from one domain to another, it's more complicated. In that way you might consider using Category Theory. You could think about penalizing models that "move" further away into other mathematical spaces from a "base" space.
Kolmogorov complexity can be thought of as the ideal “lower bound” for a compressor/predictor in unsupervised learning.
But it’s also uncomputable which would make it hard to implement in practice 😅
@@coda-n6u true, I think I was trying to get at a weighting of symbols used. I’m not sure if that could be learned or would have to be assumed.
I think 1+1 is simple because it is in some ways assumed (forgetting Russell), whereas something difficult like, say, the Kullback-Leibler divergence is defined in terms of simpler primitives.
Edit: big picture would be you need some sort of error term to trade off against accuracy, otherwise your tree grows without bound either in depth or in complexity of the operators. Consider it something like dropout or pruning.
@@DensityMatrix1 Yeah that's interesting! I feel like any theory with a sufficiently complex symbolic representation could be factored into smaller bits that could themselves be learned as features.
It's a big search problem, so I guess it's about allowing the algorithm to search deeply + generate complicated symbolic representations, but having it bias towards shorter ones (since they're more likely to be true).
Honestly a big problem I have no idea how to solve.
Solomonoff induction isn't tractable for beings with finite compute and AFAIK there's no standout best approximation to it. Myopic piecemeal modeling is probably better in many cases than trying for a theory of everything.
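The trade-off discussed in this thread (accuracy against a bias toward shorter expressions) can be sketched with a computable stand-in for Kolmogorov complexity. Everything here is an assumption for illustration: node count is used as the complexity measure, the penalty weight `lam` is arbitrary, and the three candidate expressions are hand-written, not the output of any real symbolic regression tool.

```python
import math

# Expressions are nested tuples: ("x",), ("const", c), ("sin", e),
# ("add", e1, e2), ("mul", e1, e2). This encoding is hypothetical.


def evaluate(expr, x):
    """Evaluate an expression tree at the point x."""
    op = expr[0]
    if op == "x":
        return x
    if op == "const":
        return expr[1]
    if op == "sin":
        return math.sin(evaluate(expr[1], x))
    a, b = evaluate(expr[1], x), evaluate(expr[2], x)
    return a + b if op == "add" else a * b


def complexity(expr):
    """Node count: a crude, computable proxy for description length."""
    return 1 + sum(complexity(child) for child in expr[1:] if isinstance(child, tuple))


def mse(expr, xs, ys):
    """Mean squared error of the expression on the data."""
    return sum((evaluate(expr, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def score(expr, xs, ys, lam=0.01):
    """Penalized score: fit error plus lam times complexity (lower is better)."""
    return mse(expr, xs, ys) + lam * complexity(expr)


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

constant = ("const", 1.0)
linear = ("add", ("mul", ("const", 2.0), ("x",)), ("const", 1.0))
# Fits equally well but carries a useless extra term, 0 * sin(x).
bloated = ("add", linear, ("mul", ("const", 0.0), ("sin", ("x",))))

best = min([constant, linear, bloated], key=lambda e: score(e, xs, ys))
print(best, complexity(best))
```

Without the penalty, `linear` and `bloated` tie on error; the complexity term is what breaks the tie in favor of the shorter tree. Whether node count is the right weighting of symbols is exactly the open question raised above.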
Biological or silicon network?
What happens if it’s trained on Schrödinger’s cat videos?
This was amazing-- confirms my suspicions.
Problem, reaction, solution -- The Hegelian dialectic!
Neural networks create solutions to input vectors or problems, your mind is therefore a reaction to the external world of problems!
Thesis (action) is dual to anti-thesis (reaction) creates the converging or syntropic thesis, synthesis -- the time independent Hegelian dialectic.
Concepts are dual to percepts -- the mind duality of Immanuel Kant.
Vectors (contravariant) are dual to co-vectors (covariant) -- Riemann geometry is dual.
Converting measurements or perceptions (vectors) into ideas or conceptions is a syntropic process -- teleological.
Your mind is building a "reaction space" from the input or "problem (vector) space" to create a "solution space" and this process is called problem solving or thinking (concepts) -- Hegel.
Targets, goals, or objectives are inherently teleological and problem solving is a syntropic process -- duality!
"Always two there are" -- Yoda.
Syntropy is dual to increasing entropy -- the 4th law of thermodynamics!
Are you aware of the fact that you didn't understand what this video is about at all?
@@ThePyrosirys You can treat input vectors as problems, watch the following:-
ua-cam.com/play/PLMrJAkhIeNNQ0BaKuBKY43k4xMo6NSbBa.html
Problems are becoming solutions (targets) via optimization -- a syntropic process, teleological.
Neural networks are therefore syntropic as they learn as they converge towards goals and solutions.
The learning process is teleological as your goal is to achieve a deeper understanding of reality.
Perceptions are dual to conceptions -- the mind duality of Immanuel Kant.
Machine learning is based upon the Hegelian dialectic if you treat your input vectors as problems!
@@ThePyrosirys Minimizing prediction errors is a syntropic process -- teleological.
"The brain is a prediction machine" -- Karl Friston, neuroscientist.
Syntropy is the correct word to use here and means that there is a 4th law of thermodynamics -- duality.
Average information (entropy) is dual to mutual or co-information (syntropy) -- information is dual!
Your brain processes information to optimize your predictions -- natural selection.
This is very interesting, can you please expand more
So what you are saying is: Our mind creates models based on patterns we observe to predict reality?
How does that imply that information is dual? What do you even mean by "information is dual", and how do you apply Hegelian dialectics here? Thesis/antithesis refer to concepts that are contradictory to each other.
Surely the problem with AI is Fudge In = Fudge Out, so if the Standard Model (and especially attempts to fix it) is full of fudge then fudge will result. I'm not saying the model outline below is correct, but if it is, or something pretty similar, no physics AI would come up with it, even if fed all the accepted (potentially) useful papers, and (filtered, biased, artefact-ridden) data..
--
POLECTRON FIELD: cell: a + & a - particle split by Full Split Energy as a positron+ & electron-. Bonds to 12 neighbours
MATTER: p+ / e- = half cell (& a cell as +-+ / -+-)? Polarises field as + & - shells. SPIN: centre polarisation axis
LECKY: total absolute charge. MASS: cells/lecky inside particles. INERTIA: field rebalances behind mass with a kick
STRONG GRAVITY: field repels mass. DARK ENERGY: voids grow as lecky shrinks cells and is lost to gravity gradients
DARK 'MATTER': galactic lecky gradient. Denser field slows acceleration and TIME, thinner field aids acceleration
BIG BANG: more proton-antiproton pairs malformed as proton-muon than antiproton-antimuon so hydrogen beat antihydrogen
POSITRONIUM: e+p. Muon: ep_e. Proton: pep. Neutron: pep_e. Tau: epep_e. Neutron mass is halfway between muon and tau
ANTIMATTER: 1,2 e_p pairs annihilate. 3: proton+anti proton or muon+anti muon. 4: neutron+anti neutron. 5: tau+anti tau
WEAK FORCE: unstable atoms form and annihilate e_p pairs. BETA- DECAY: pep_e => pep e. BETA+: pep + new e_p => pep_e p
NUCLEAR FORCE: neutron electrons bond to protons. ENTANGLEMENT: correlation broken by interaction? Physical link?
BLACK HOLE: atoms cut into neutrons fused as higher mass tau cores (epep). Field rotates. Core annihilates: ep => cell?
PHOTON: cell polarisation/lateral shift wave. LONGITUDINAL WAVE: gravitational wave, neutrino: 1 to 3 cell wave
DOUBLE SLIT: photon/particle field warps diffract and interfere, guiding the core. Detectors interfere with guides
ENTROPY: simplicity. Closed system complexity reduces over time. Uniformly (dis)ordered (hot)/cold field is simplest
This is not an endorsement of your alternative model, but of skepticism toward models and toward digging deeper into the conceptual ruts that we dig ourselves into. In flat world, we are all just lengths...
@@rugbybeef .. and widths unless it's a 1D flat world... I'm not into the Holographic Universe even though 2D is technically simpler than 3D - just not when we live in a 3D reality. Gravity, Dark Energy and Dark Matter need to be linked to one field, might as well make it an EM particle field. Neutron Mass is halfway between Muon and Tau bar a tiny bit of binding energy. I don't know why this relationship is not mentioned by anyone but me.
@@PrivateSi So I've always wondered about this, as vision is a 2D diminishment of our 3D world. I always believed that flatlanders would only see the lengths of their colleagues in a 1D analogue. Like if their square friend had distinct colors on each face, they would see them and could infer their colleague's vertex. However, differentiating a circle of radius 1 and a square of width 1 that rotated synchronously each time you tried to move around it would be impossible.
The issue is discovering the higher-ordering principle which subsumes a continuum of self singularities and discontinuities. Linear math works well in-between the singularities, but cannot extrapolate through them, in a sense they are like mathematical worm-holes. Attempts to linearize across the discontinuities will fail. A whole harmonically-related series will only be properly understood from the perspective of a higher-ordering principle, similar to the idea of projection from a higher magnitude to a lower dimensional space, or from the idea of negative curvature. The point is the epistemological assumption of a static model is problematic, the real world has static islands which are bounded within areas of great change, and so the basic function changes completely there, that is to say, the dynamics of change themselves change. So to bridge that gap you can’t just ignore it, or flatten it, you have to seek how to remap it in such a manner that it is no longer infinite, but cyclical, as Gauss did with the complex number domain.
Yeah, I read that like 5 times and have no idea what you’re trying to say.
@@Gideonrex1 pretty sure neither does he
interesting! was just learning about neural networks, so this is a pretty cool application :)
Solving problems is the essence of the Hegelian dialectic.
This is some ingenious work
Add some thermodynamic constraints?
Simplicity is the absence of relative complexity.
Is this folding similar to convolution?
No, convolution layers, like 2d, take inputs and further extrapolate features by applying specific linear kernel methods (specific for 2d space or 3d), this seems to be doing something different where it is not a layer, but instead applies different layers together, by folding it over. Tbh don’t understand folding, but convolution layers are common in image problems so they are easier to understand
38:18 Why is “+” simple? Well, maybe because it is closely related to the concept of counting and “having” things. If I have two apples and I get (add) three more, I can just count the number of apples to verify that I now have five apples. That has got nothing to do with the concept of simplicity. Not sure if I even want to continue watching the whole thing…
Incredible lecture
Beautiful lecture. Been saying high dim data -> NN -> theory would be a good approach for many years now! Glad to see people working on this. 😊
Specific density squared ; Volume Quebed ; Vice Versa und AUgmentation Cycle
Flux TEMP composite material Augmentation Cycle und NEURO CELLULAR GEN_REGEN CYCLE @ NEUROPLASTICITY U V STABILIZER 7:50
I see similarity with physics informed neural network especially with Sparse identification of nonlinear dynamics (SINDy)
"Science today will be this one: the experimentalist arrives with a data collection unit, the theorist arrives with a Neural network and symbolic regression algorithm, we sit down, we plug in both machines, observe the two machines performing the scientific inquiry for us and then real Understanding comes. We did our super ego duty. The science is done out there for us. And maybe while the scientists sit there they come up with a truly novel idea together but it is pure curiosity, surplus, since the science is already done."
For "+," I do think it is simple because I hypothesize that the human brain does have built-in neurons specifically for counting small numbers (usually 5-9, varying between persons), so when you are an infant, you don't actually need to learn to count objects under this number. (I suspect that in a certain area of the brain, likely the hippocampus, there is this number of special neurons that serve as synaptic placeholders for the visual cortex in object identification.) Then, it serves as the starting point to further learn the abstract concept of "+." That is also why "+" is the first mathematical operation that most humans (if not all) learn. If nothing is built-in, I wonder if someone can teach a human multiplication without them knowing addition. This experiment would be highly unethical, tho.
This is already well known. It's called 'subitizing'. I believe the research showed that subitizing is not implemented in separable neural substructures.
Thoughts from my deep ignorance
Regarding the idea that " +" might be assumed ( in replies) to be the first mathematical operation of human behavior. I wonder what would be different if looking at this from my perspective
What if “-” is actually the first mathematical operation? What if the second operation, the “+”, is the process of filling in the vacuum caused by the first “-”? The first loss of coherence.. as an identifiable cellular membrane (ovum) being fully formed and then losing that coherence by the separation of the membrane, experienced as a gap forced by penetration of new foreign material (sperm) that then becomes assimilated, exchanged. Not either or, + or -, but shared: - is part of +. And always was.
I'm going to state the obvious. That is smart. Yes, it raises questions about AI explainability regarding deep learning NNs, but what this chap is saying is quite brilliant. For me, as long as the conventional approach is combined with the model he is propounding, there should be some excellent science out of that. Then there can be even more science when we start to understand the reasons and mechanisms by which the deep learning neural networks some humans build do what they do. Let's not miss the point of what he is saying, at least what I interpret that he is saying... The NN is finding some order through patterns, it really is those patterns that are probably most related to something interesting, i.e. of scientific interest, then we can sift through the rest of the noise to see if something was missed, let's say we do that if questions are presented that don't have an answer. So all in all, it is a very powerful way of cutting through the fluff. If we then want to scientifically describe the fluff itself, it is now more distinct. I think what this guy is saying is brilliant. Incidentally, I think we will ultimately find out that deep learning neural networks come to sensible decisions because they have the fidelity to tap into the innate intelligence structure of reality itself, but that is a next topic, although entirely pertinent.
Did you see the Lifestyle Trader ad? Proof that money is not just a commodity but logarithmic.
Symbolic Regression is starting to catch on but, as usual, people aren't using the Algorithmic Information Criterion so they end up with unprincipled choices on the Pareto frontier between residuals and model complexity if not unprincipled choices about how to weight the complexity of various "nodes" in the model's "expression".
A node's complexity is how much machine language code it takes to implement it on a CPU-only implementation. Error residuals are program literals aka "constants".
I don't know how many times I'm going to have to point this out to people before it gets through to them (probably well beyond the time maggots have forgotten what I tasted like). This whole notion that "+" is just what we're used to is intellectual poison.
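For readers unfamiliar with the Pareto frontier between residuals and model complexity mentioned above, here is a minimal sketch of how one is computed. The candidate names, complexity values, and errors are invented for illustration, and complexity is just a plain number here; how it should be weighted (the comment argues for machine-code length) is left open.

```python
def pareto_front(models):
    """Return the models not dominated in (complexity, error).

    models: list of (name, complexity, error) tuples.
    A model is kept only if it has strictly lower error than
    every kept model of equal or lower complexity.
    """
    front = []
    # Sort by complexity, then error, so the best model of each size comes first.
    for name, comp, err in sorted(models, key=lambda m: (m[1], m[2])):
        if not front or err < front[-1][2]:
            front.append((name, comp, err))
    return front


# Hypothetical candidates: (name, complexity, error).
candidates = [
    ("a", 1, 24.0),
    ("b", 3, 5.0),
    ("c", 4, 6.0),   # worse than "b" on both axes
    ("d", 5, 0.1),
    ("e", 10, 0.1),  # same error as "d" but more complex
]

front = pareto_front(candidates)
print(front)
```

The frontier only narrows the choice to "a", "b", and "d". Picking a single model from it still requires some criterion for trading error against complexity, which is the step the comment says is usually done in an unprincipled way.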
Yes AI is definitely faster generating random ideas, and is also quicker fitting these random ideas to a data set. It’s a very powerful tool.
Could these models apply compression to themselves through techniques like quantization, pruning, and knowledge distillation becoming faster and faster and smaller until AGI emerges from a phone sized device which can invent warp drive?
Tbh all of reality can be encoded on one gigantic vector.
Probably not. There are hard limits on stuff like that.
Fantastic work, I thought we would take AI in this direction and here we have that reality.
How do you avoid p-hacking your data?
Stop using p
@@whatisrokosbasilisk80 so stop using statistics
@@joelwillis2043 If you think that all of statistics boils down to p-value, you don't know statistics.
Cool idea! Essentially, we can deduce symbolic, testable scientific theories from deep learning models using things like PySR. Making foundation models (which are trained on a wide variety of phenomena, not necessarily related to the area of application) for specific scientific application gives ANNs an advantage. Simplicity (explainability, legibility) comes from familiarity with a problem area, so we should be training models on lots of diverse examples to help them “get used” to solving these types of problems, even if the examples may seem irrelevant (cat videos & differential equations 🐈)
Interesting application of explainable AI 🎉 Congratulations on your research
36:36 Becoming more basically intelligent because of understanding spatiotemporal connectivity. The flashing-faces-in-peripheral-vision illusion shows us the monsters we create when we lack that.
I think this is likely, but it will be another lowering of the goal posts of science. The first lowering was with Newton, and the abandonment of the idea of understanding the causes of physical principles in intuitive terms. This was thrown out and replaced with building intelligible mathematical theories of the world, instead of trying to make the world itself intelligible. AI will be the next lowering of the goal posts, where we can no longer even make intelligible theories of the world; instead, the theories will be totally unintelligible to any human, locked away in the statistical correlations of some AI. Instead of trying to interpret data from the world, we'll be trying to interpret AI.
I think you missed the point of what he's saying. He's not saying we should just build AI and use neural nets to solve problems; rather, we should train the network, see how it solves the problem, and attempt to deduce a theory from studying the actual mechanisms of the neural network itself. The network is not the end goal, it's merely a new tool for data analysis.
Thank you. I guess we have to let them have their journey. The good news is the AI will mirror back their unconscious motivations for wanting these increasingly abstract and unintelligible constructs.
This has crossed my mind and this is exciting indeed. High dimensionality patterns are often hidden but the fact that they are high dimension makes for the discovery of robust natural laws. We are in need of territory. We no longer have to rely on empirical, philosophical or mathematical models to create natural laws. Data in high dimensionality can reveal many laws. Exciting times!