References:
Toy Models of Superposition
dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J
transformer-circuits.pub/2022/toy_model/index.html
ua-cam.com/video/R3nbXgMnVqQ/v-deo.html (Nanda)
Supermasks in Superposition
arxiv.org/abs/2006.14769
Interpretability in the Wild
arxiv.org/pdf/2211.00593.pdf
Actually, Othello-GPT Has A Linear Emergent World Representation (Nanda)
www.lesswrong.com/s/nhGNHyJHbrofpPbRG/p/nmxzr2zsjNtjaHh7x
thegradient.pub/othello
A Mathematical Framework for Transformer Circuits
transformer-circuits.pub/2021/framework/index.html
ua-cam.com/video/KV5gbOmHbjU/v-deo.html (Nanda)
Attention Is All You Need
arxiv.org/abs/1706.03762
A Structured Self-Attentive Sentence Embedding
openreview.net/forum?id=BJC_jUqxe
Distributed Representations of Words and Phrases and their Compositionality (Mikolov)
arxiv.org/abs/1310.4546
Deep Residual Learning for Image Recognition
arxiv.org/abs/1512.03385
Attribution Patching: Activation Patching At Industrial Scale (Nanda)
www.neelnanda.io/mechanistic-interpretability/attribution-patching
In-context Learning and Induction Heads
transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
ua-cam.com/video/dCkQQYwPxdM/v-deo.html (Nanda)
The Quantization Model of Neural Scaling
arxiv.org/pdf/2303.13506.pdf
Interpreting Neural Networks to Improve Politeness Comprehension
aclanthology.org/D16-1216/
Progress measures for grokking via mechanistic interpretability
arxiv.org/abs/2301.05217 (Nanda)
ua-cam.com/video/IHikLL8ULa4/v-deo.html
twitter.com/NeelNanda5/status/1616590887873839104
Grokking paper
arxiv.org/abs/2201.02177
A Toy Model of Universality
arxiv.org/abs/2302.03025
twitter.com/bilalchughtai_/status/1625948104121024516
A circuit for Python docstrings in a 4-layer attention-only transformer
www.lesswrong.com/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only
Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis.
psycnet.apa.org/record/1989-03804-001
Maximal Update Parametrization (μP) and Hyperparameter Transfer (μTransfer)
github.com/microsoft/mup
Spline theory of NNs
proceedings.mlr.press/v80/balestriero18b/balestriero18b.pdf
Counterarguments to the basic AI x-risk case (Katja Grace)
www.lesswrong.com/posts/LDRQ5Zfqwi8GjzPYG/counterarguments-to-the-basic-ai-x-risk-case
The alignment problem from a deep learning perspective (Ngo)
arxiv.org/abs/2209.00626
Superintelligence: Paths, Dangers, Strategies.
www.amazon.co.uk/Superintelligence-Dangers-Strategies-Nick-Bostrom/dp/0199678111
Some Quotes:
"The empirical question of whether language models do this and the theoretical question of could they do this are two different things. In my view, the theoretical question is nonsense."
"Models can be thought of as ensembles of shallow paths, with a trade-off between having more computation and better memory bandwidth."
"Models, in my perspective, have linear representations more than geometric representations."
"The model does not align features with neurons and superposition is a mechanistic hypothesis for why both of these phenomena occur."
"An ensemble of shallow paths is a good way to think about models, and there's a trade-off between having more computation and better memory bandwidth."
"Emergence is when things happen suddenly during training and go from not being there to being there fairly rapidly in a non-convex way, rather than gradually developing."
"Language models predict the next token. They learn effective algorithms for doing this within the constraints of what is natural to represent within transformer layers."
"The key thing to be careful of when probing is, is your probe doing the computation, or does the model genuinely have this represented?"
"A lot of my motivation for this work comes from I care a lot about risk and alignment and how to make these systems good for the world."
"There's lots of things that a sufficiently capable model could do that might be pretty destabilizing to society."
"I guess I mostly just have the position of --man! It sure is kind of concerning that we have these systems that could potentially pose risks, but you don't know what they do and decide to deploy them."
"I really want a better and more scientific understanding of emergence. Why does that happen? Really understanding particularly notable case studies of it."
"I believe emergence is often underlain by the model learning some specific circuit or some small family of circuits in a fairly sudden phase transition that enables this overall emergent thing."
1:32:00 Loved the point that, given a bounded context length, it is still very much a finite number of inputs that a model could receive, so philosophically, LLMs are not very different from AlphaGo.
The birds chirping in the background are offering such a relaxing listening experience! 😊
As others have already mentioned, this is one of the most substantive interviews to date on how we can interpret what is happening inside a neural network. A striking aspect of this interview, in addition to the valuable insights into this 'inscrutable pile of linear algebra' (Neel's alias for a neural network), is the brutal honesty that, in the end, we may never be able to fully interpret what is happening, but at least we need to try. Lastly, whether you struggled through it or breezed through it (depending on your perspective), this is not one of those lectures that leaves you wondering whether it was worth it: it feels well worth the time. Thank you for sharing this conversation.
And they didn't say Black Box once!
meh..not really
Four hours, and I coulda listened for twice as long. Some great ideas, discussions, and perspectives. And some good humour. Thanks for the video.
Neel’s facial expressions and hand gestures are 10/10 on the likability scale 🤗
For me this is in the top ten interviews ever. Dense with interesting information, a lot to listen to and understand, and definitely some points of view that differ from the usual.
Yeah, and great production
The topic is certainly of the utmost importance in the current state of AI technology, where we have a breakthrough scheme with surprising results that relies on "spooky emergence".
In love with everything Neel* has to say 30 minutes in - can’t believe there’s so much more to enjoy!!
Thank you so much for the effort - these long streams help get me into the headspace!
One of my all time favorite interviews! Thank you!
First off, kudos for hosting such a great podcast, and this interview in particular, I feel, shows the depth of the interviewer. I'm late in catching up on some of the episodes on this channel, but I've got to say this was a great episode. Neel can talk concisely and informatively, in a way that is easy to follow, about pretty heavy interpretability concepts. I'm definitely going to go check him out for more stuff. For example, the clean summaries of grokking vs. general phase transitions, and the cleanup being critical, or the superposition of infrequent or composite concepts in both knowledge and computation. And also the concept of potentially universal learned circuits (or families of such) being composited and combined, including induction heads in attention for in-context learning. Just so much stuff in here that 4 hours didn't feel long at all. If anything, the final roughly hour about AGI/ASI risk and x-risk was a bit less dense, but still a good section to listen to. The first 3 hours are incredibly dense with information.
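For readers new to the induction-head idea mentioned above: the In-context Learning and Induction Heads paper in the references describes heads that complete the pattern [A][B] ... [A] -> [B]. Here is a minimal Python sketch of that rule, purely illustrative, capturing only the input-output behaviour rather than how attention heads implement it:

```python
from typing import List, Optional

def induction_predict(tokens: List[str]) -> Optional[str]:
    """Toy version of the rule an induction head is thought to implement:
    find the most recent earlier occurrence of the current token and
    predict whatever followed it last time ([A][B] ... [A] -> [B])."""
    current = tokens[-1]
    for j in range(len(tokens) - 2, -1, -1):  # scan earlier positions, most recent first
        if tokens[j] == current:
            return tokens[j + 1]
    return None  # no earlier occurrence to copy from

print(induction_predict(["A", "B", "C", "A"]))  # -> "B"
```

Real induction heads do this softly, via a previous-token head composed with a matching head operating on learned representations rather than exact string matches; the sketch is just the hard-coded version of the behaviour.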
Thank you for reading your UA-cam comments, and thank you for such a great video and such great content. Cherished every minute of Tim and Neel sharing their worldly perspectives with us in these turbulent times.
Up next, more amazing world views from Prof. Friston & Dr. Wolfram - better make some popcorn!
I also love reading the comments here, and I think this is one of the best communities. Very little spam, and much enthusiasm. It is also great that the comments are read and responded to by the team, and that feedback and inspiration are taken from them.
Agreed....cheers!
What a marathon. Listened to the whole thing on a country drive and was not bored.
I think I'll have to rewatch that interview many times.
4 hrs?! I need my popcorn
To talk about a periodic table of circuits reminds me a bit of the RNA World Hypothesis. Basic structures that don't dissipate but rather serve as building blocks of more complex structures 1:05:24.
It also makes me think of a fuzzy relation between knot theory and the topology created by NNs, with the main difference being that these structures are probabilistic/noisy, in contrast to the platonically perfect topologies found in the knot-theory periodic table.
I would LOVE it if MLST could do some sort of intermittent little "knowledge bomb" videos here and there for those of us engaged in separate fields but "morbidly curious" about the subject. Perhaps explain, every now and then, some term or concept that might not be so familiar to all of us. Even if MLST could only "curate", insofar as linking or reposting some seminal, well-formed key lectures on the topic, that would be highly appreciated in this vast landscape of knowledge and ideas.
I've always wondered how to prevent "sub-circuits" from being overwritten. I can imagine some kind of gradient gating, or a separate loss function, dedicated to maintaining these sub-circuits. I wonder, for really large models, whether a sufficiently large context is an effective gradient gating mechanism.
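One literal way to prototype the "gradient gating" idea from the comment above would be to zero out gradients on the weights believed to implement a sub-circuit before each optimizer step, so further training leaves them alone. The model, the circuit_masks, and the choice of which weights count as the circuit are all made up here for illustration; this is a sketch of the commenter's idea, not anything discussed in the episode:

```python
import torch
import torch.nn as nn

# Hypothetical setup: a tiny model, plus a boolean mask per parameter marking
# the weights we believe belong to a sub-circuit we want to protect.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
circuit_masks = {name: torch.zeros_like(p, dtype=torch.bool)
                 for name, p in model.named_parameters()}
circuit_masks["0.weight"][:4, :] = True   # pretend these rows implement the circuit

opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def gated_step(loss: torch.Tensor) -> None:
    """One optimizer step where gradients on 'circuit' weights are zeroed,
    so continued training cannot overwrite them."""
    opt.zero_grad()
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is not None:
                p.grad[circuit_masks[name]] = 0.0
    opt.step()

x = torch.randn(8, 16)
gated_step(((model(x) - x) ** 2).mean())
```

A caveat on the design: with a stateful optimizer like Adam, momentum accumulated before masking can still nudge the protected weights, so a stricter version would also reset the optimizer state for the masked entries.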
I think I follow, but some of the jargon is over my head. You did a great job trying to mitigate that. It's my own lack of knowledge of the actual terms used in the industry. Perfect dialectics-based conversation.
This conversation makes me think of the "autistic agent" I tried to describe. Very very interesting stuff.
This is awesome! I didn’t know mechanistic interpretability was a thing, but it satisfies the physicist in me 😁
"But... non-linear probing is *particularly* sketchy" - so good. Neel Nanda's thinking about all this is so measured, informed, and logically sound. That combined with being deeply insightful and still willing to embrace the beauty and wonder and go searching for it is such a breath of fresh air.
Thanks for this, such a fascinating subject. Has the potential for shedding light on some pretty heavy stuff, this is truly undiscovered country.
I'd argue we can put goals in, we just can't know what goal we are encoding. And it won't be whatever goal you/we think it is.
The geometric structures people found made my hair stand on end when I stumbled across the work. The toy models paper is one of the coolest things of all time, wraps up so many subjects in one neat experiment and being able to run the code is too good.
If it helps anyone else, two personal rules that help me as I try to understand all this:
Whatever you think it's doing, it's probably doing something different.
Whatever you think it is, it's probably something else.
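On the probing caveat flagged above, and in the quote "is your probe doing the computation, or does the model genuinely have this represented?", here is a toy demonstration with entirely synthetic data: the fake activations merely store two input bits, yet an expressive non-linear probe can still "decode" their XOR by computing it itself, while a linear probe cannot. Assumes scikit-learn is available; nothing here comes from the episode:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Fake "residual stream" activations that merely store two input bits along two
# random directions -- the model has NOT computed anything with them.
n, d = 4000, 64
bits = rng.integers(0, 2, size=(n, 2))
directions = rng.standard_normal((2, d))
acts = bits @ directions + 0.1 * rng.standard_normal((n, d))
label = bits[:, 0] ^ bits[:, 1]          # XOR: not linearly decodable from the bits

train, test = slice(0, 3000), slice(3000, None)
for probe in (LogisticRegression(max_iter=1000),
              MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)):
    probe.fit(acts[train], label[train])
    print(type(probe).__name__, probe.score(acts[test], label[test]))
```

The linear probe scores around chance while the MLP probe scores near 100%, which is exactly why a high non-linear probe accuracy, on its own, is weak evidence that the model itself represents the feature.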
I love how Neel is able to achieve these amazing results and then refer to them as "cute" xD. Great interview, learned a lot!
The one question I want to ask Neel that I don't believe gets addressed in this interview:
"If everyone had switched over from transformer models to neuro-symbolic models from the beginning, would that just automatically render all known mech interp trivial and solved, or would there still be remaining mysteries? And does Neel think mech interp and neuro-symbolic research could work hand in hand toward the same final goals in the end?"
I hope I can get this answered.
Thanks.
☮
It's an interesting question. The field is definitely missing out on more neuro-symbolic research. I can only offer a historical note: for purely symbolic systems - automated theorem provers - the results might still not be interpretable. For example, in 2016 a supercomputer was used to automatically generate a proof (a mathematically rigorous and precise explanation, that you can read and understand as a human, of why something, by logic, must be true) for the problem of Boolean Pythagorean triples. The proof is in a form of text that you can read to know what the prover program is thinking, and why it's thinking it, every step of the way.
The catch? This proof is 200 TB in size. It would take several lifetimes for a human to even read that text. It is completely inscrutable to us mortals. Why is that a problem? Since it is a difficult problem that doesn't yield easily to common ways of solving it, mathematicians hoped it would eventually be solved by more clever, more powerful techniques than we have now. Well, now the "AI" solved it, and we don't know how it did it, despite every nut and bolt in the system being completely observable. It may be that there is no clever scheme here and the prover simply brute-forced every single possibility. The answer is somewhere in that proof, which we can't meaningfully read because it's so humongous.
Humans understand language as a dynamic, recurrent context with multiple potential vectors that can overlap and be used as a kind of "abstract model" in which language functions. For example, if you say "Is Paris in", that would be interpreted as having many potential contexts based on the specific sequence of tokens used, where the contexts apply to each token in the sentence: "Is" could mean "exists", "exists at a certain location", "exists in a certain state" and so forth; "Paris" could be a person/place/thing; and "in" could be a spatial reference or a logical reference. Not to mention each token has a context of whether it is a legitimate word, meaning it exists in a dictionary as part of a given language, and whether the sequence of tokens follows a valid set of grammar rules, which is what makes words not equal to tokens in the sense of language understanding. (I.e., if a language model is trained on lorem ipsum texts, it only has a model of lorem ipsum patterns, which is not a knowledge of Latin.) That kind of dynamic context is not found in large language models, because they are based on simply predicting the next token as opposed to internally generating the correct context corresponding to the sequence of words given. And generally this is part of what is measured by reading comprehension tests and, in a more advanced sense, logic tests.
Another good example of that human language context is the fact that most people have their own internal dictionary of terms, which is ad hoc and based on a general ability to convey the meaning of a specific word without relying on rote memorization of a specific dictionary text. That context is also not covered by token prediction algorithms. In fact, dictionaries themselves express this idea of multiple contexts being dynamically evaluated and/or updated all at once, in that any word in a dictionary can have multiple meanings.
Part of the problem with large language models is that many of the designers refuse to measure these systems according to these more rigorous types of evaluations. Which goes back to the fact that the whole point of large language models was more of a research endeavor: to understand how these systems behave as more and more data is given to them, in order to quantify the actual side effects, behaviors, benefits and functioning of such models, and how those things can be used for solving specific types of problems. However, since these efforts cost so much money, most of this work is now privatized, and those side effects and behaviors are just hand-waved away as if there is no need to go into detail about the issues such research should be revealing. Not to mention the resulting system is built on a flawed a priori theory: that just being given random text from a near-limitless source of text will make a near-omniscient kind of program which transcends simplistic functions such as plain language, which is false. First, because language itself is a model unto itself that has to be learned, and second, learning a language does not automatically impart advanced knowledge of the world or anything else beyond language. As in, just because somebody knows French doesn't mean that they know advanced chemistry. That is simply not how it works in humans.
A more reasonable test of any kind of language model should therefore be strictly within the domain of language and language-related functionality before going into these more esoteric domains. So, as mentioned before, if I train a language model on specific texts, can that language model reliably and accurately answer questions about that text? And the more critical question is how much training data is required to achieve a certain degree of accuracy in such a model. Similarly, what would the data and compute cost be for a language model that can reliably translate between languages, and how do the complexity and cost increase as more languages are added? Is there any loss of semantics or comprehension as the number of languages goes up?
2:26:48 "Tokenizers are f*d". I have been thinking this as well. I just have this feeling that it would make way more sense if tokens would be at the word level. A part of a word as a token seems like noise in the system. You give meaning, an emphasis to something that is arbitrarily cut from a larger whole, and actually does not have proper meaning in itself. Then, as he describes, things, meanings, connections get built around that, and I would guess in clunky ways. It seems some way a bit nasty, like a "hack". I guess it originates from performance aspects of encoding tokens in a small format. I would bet it plays a part in the bad understandability of what these models do. I will at some point attempt getting into this stuff in the technical sense when I have time, currently just curious. If I build my own little model tech, I will attempt the word-level tokenizing. Again, someone might shoot my thoughts right off as wrong, and that is fine also, I genuinely have no clue atm, just hunches by my own tech experience as a dev.
Very interesting and more in-depth viewpoint on analysing neural networks, LLMs, etc. Thanks. 👍
It's kind of more fascinating that LLMs can't handle a limited, tiny prompt context and lose their way, when they can manage so much around the context. 🤔😅
This is quite substantial! I am definitely going to get in on mech interp on the foundation models cropping up in my sector.
Great conversation - well done!
What is the name of the music in the intro?
Flight of the Bumblebee by Nikolai Rimsky-Korsakov
Algo bumps by comment engagements are my fist bumps.
Verily! Engagement promulgated
Best intro yet 🙏👍
At 3:08:23, Neel Nanda's article on getting started as a mech interp researcher is referenced, and it's mentioned that it will be in the description, but I do not see it there. I'm very interested and would love to read that post. Can you please link it?
www.neelnanda.io/mechanistic-interpretability/quickstart
@@MachineLearningStreetTalk Wonderful! Thanks! Perhaps you'll see some interpretability papers from me in the future, and know you've been a key inspiration 😌
Position in the Rotation of a monoid Group
or Monster Group.
10:05 a nun's first curry
Same 😂
I really feel like the transformer is insufficient to properly represent everything we want to represent. Some things *are* recursive in nature. You can’t just unroll the loop all the way. I feel like we need a heterogeneous network where some of it is transformers and some of it is other things.
For example, arbitrarily deep mathematics, even addition with a lot of digits can’t be fully represented in a linear network. You need the ability to loop. I also feel like the network kind of needs to be allowed to maintain an internal state, kind of an inner monologue where it can think about something for a bit before outputting the next token.
On the bright side, this internal monologue could be embeddings so that it could be translated into English and we could actually get some insight into what’s going on inside.
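A minimal sketch of the "inner monologue" idea as a wrapper rather than an architecture change: loop a next-token function over a hidden scratchpad and only surface what comes after an answer marker. The next_token interface and the dummy model below are hypothetical stand-ins, not any real model API:

```python
from typing import Callable, List

def answer_with_scratchpad(prompt: List[str],
                           next_token: Callable[[List[str]], str],
                           max_thoughts: int = 16) -> str:
    """Loop a (hypothetical) next-token function over its own hidden output:
    intermediate 'thought' tokens stay in the scratchpad, and only the token
    produced after an <answer> marker is returned to the user."""
    scratchpad = list(prompt)
    for _ in range(max_thoughts):
        tok = next_token(scratchpad)
        scratchpad.append(tok)
        if tok == "<answer>":
            return next_token(scratchpad)   # the visible output
    return next_token(scratchpad)           # fall back if it never "decides"

# Dummy stand-in for a real model, just to make the sketch runnable.
def dummy_next_token(tokens: List[str]) -> str:
    if tokens and tokens[-1] == "<answer>":
        return "4"
    return "<answer>" if len(tokens) >= 5 else "think"

print(answer_with_scratchpad(["what", "is", "2+2"], dummy_next_token))  # -> "4"
```

Because the scratchpad is ordinary tokens (or embeddings close to them), it is, as the comment suggests, something we could in principle read back to get insight into the intermediate computation.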
If there are similar circuits within deeper layers, and access to residual streams, doesn't that indicate transformers can be recursive?
kinda fractal
Why such heavy video filter?
Please don't put these effects and borders on in post production. It makes it hard to watch
Sorry, I massively overexposed everything on the day. The footage was almost unwatchable without processing it like this. I learned a lot from this experience!
@@MachineLearningStreetTalk haha, I was thinking "Neal looks like he's been painted"
Ah cool. Trippy.
@@MachineLearningStreetTalkI thought it looked kinda cool, and I only listen during exercise anyways.
Not a problem here, watched multiple times and only appreciate your hard work! Ganbatte!
The subtitles butchered "mech interp" all the while; I'm like, who the f is this Mcinturff guy?? Awesome episode once again!
@12:30 semantics"? A lot of random junk thrown together "artfully" (linalg stuff) and trained can implement an exact algorithm say 80% of the time. So 80% of the time is interpretable and even discoverably so --- by who though? By us! You need a mind to interpret junk, or art, or whatever. Once we interpret, a chatGPT then can too, but not in the same qualia-filled way as us.
I didn't get "the knowledge is based" joke @2:15:10
Thanks for putting this together. I can't help but feel that the average earth citizen would be shocked to know that the best and brightest in the field admit there's a reasonable chance that AI will be a terminal event in the next decade, give or take. Most people have NO idea.
Great interview
1:08:00
I woke up to hearing this conversation and it had me dreaming some crazy shit 😂. Yall have a new subscriber eventually I ll figure what yaĺl talking about 😊
Even if everything was perfect and every one was healthy and happy, and it was going to be that way for millions of years - at some point, it all comes to an end. You can't beat thermodynamics, at some point entropy wins.
What does it matter if that happens in a million million years, or 100?
What need do you really have to invent a new religion around "effective altruism" and "long term good"? (the reasons are at least apparent when the most effective altruism is of course to give rich people our money but I digress)
All we can truly do, is strive to be our best selves at all times and otherwise live until we die.
Totally respect and relish Neel's pragmatic slant. But it is easy to dismiss philosophy until you need it.
Excellent interview, amazing discoveries. We need more science and fewer commercial applications of AI!
I think it needs some math or signal processing tools while it's training so it doesn't have to re-invent the sine table and DFT every time.
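For context, the "reinventing the DFT" refers to the algorithm reported in the grokking paper listed in the references (Progress measures for grokking via mechanistic interpretability): represent the inputs as sines and cosines at a few frequencies and use trig identities so the logits interfere constructively exactly at (a+b) mod p. A small numpy reconstruction, with frequencies chosen arbitrarily rather than the ones a trained model happens to pick:

```python
import numpy as np

p = 113                      # modulus used in the grokking paper
freqs = [1, 2, 3, 5, 8]      # a handful of frequencies, chosen here by hand

def mod_add_via_fourier(a: int, b: int) -> int:
    """Toy reconstruction of the 'trig identity' algorithm: the logit for each
    candidate answer c is a sum of cos(2*pi*k*(a+b-c)/p) over a few frequencies,
    which interfere constructively only at c = (a+b) mod p."""
    c = np.arange(p)
    logits = sum(np.cos(2 * np.pi * k * (a + b - c) / p) for k in freqs)
    return int(np.argmax(logits))

assert all(mod_add_via_fourier(a, b) == (a + b) % p
           for a in range(0, p, 7) for b in range(0, p, 11))
```

Whether handing the network such primitives up front would actually help, or just constrain it, is of course the open question the comment raises.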
1. Learning language and facts are separable from learning behaviors (the latter is supervised learning + RLHF)
2. Animals and humans have been shaped by Evolution in a Darwinian Competition and are therefore competitive, hoarding, and self-preserving. This is conserved as instincts in our DNA.
3. Learning about racism doesn't make you a racist. Learning racist behavior from family and peers makes you a racist, because it matches your built in prejudices in your DNA.
4. LLMs and AIs have not evolved. They are created by Intelligent Design (Researchers and Engineers) and therefore have inherently neutral non-competitive behaviors.
5. We cannot guarantee non-evil non-racist children because they have their instincts and DNA. But we can guarantee that our AIs can learn any behavior we want, and nothing else in the way of behaviors.
6. Evil humans abusing AI is still a problem. The article on my SubStack called "AI Alignment is Trivial" hints at one strategy.
7. In a future SubStack article I'll discuss more directly how AI can Moderate Moloch.
Mind-blowing how small the audience for this channel is. And infuriating. Where are all of you in real life?
Yes, on the Earth today...there was a time I felt like I was at the leading edge of things, but now I'm a septuagenarian, and my 80 kilohours is just about up. But I like The Long Now Foundation, and it looks like you youngsters have things well in hand.... hopefully...
;*[}
When UA-cam turns into gpt6
In higher dimensions, most vectors are mutually (nearly) orthogonal. You guys don't seem to appreciate this.
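The commenter's point is easy to check numerically, and it is part of why superposition is plausible at all: in high dimensions you can pack far more almost-orthogonal feature directions than you have dimensions. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(dim: int, n_vectors: int = 1000) -> float:
    """Sample random unit vectors and report the largest |cosine similarity|
    between any distinct pair -- it shrinks as the dimension grows."""
    v = rng.standard_normal((n_vectors, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = v @ v.T
    np.fill_diagonal(cos, 0.0)
    return float(np.abs(cos).max())

for dim in (10, 100, 1000, 10000):
    print(dim, round(max_abs_cosine(dim), 3))
```

Roughly, the largest pairwise |cosine| among 1,000 random unit vectors drops from close to 1 in 10 dimensions to a few hundredths in 10,000.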
OTHELLO - is a game - why do you not get it right ?
I will help no worries. :)
Glad to see Neel push back on your tired x-risk takes (little prestige or money is attached to ai risk). Here Tim, you dropped this:
(e/acc)
I'm not a fan of arguing for conspiracies on this stuff, given that many of these people had been consistently arguing for the possibility of AI x-risks for years before OpenAI and Anthropic emerged or got all of the money they have.
And now you also have people like Hinton, Bengio and Andrew Yao signing letters on AI risks, who are not working in any for-profit organisations or linked to EA, and who are also putting themselves in an unpopular faction.
Criminal this has so few views
🙋
The interview is amazing, but the video stylization is jarring and really makes it ugly. Again, great interview though
Please don't add these filtering effects.
Sam Harris 🙄🤢🤮
That sonorous expository drawl tho