15:00 The reason for polysemanticity is that in an N-dimensional vector space there are only N mutually orthogonal vectors, but if you allow nearly orthogonal ones (say between 89 and 91 degrees), the number you can fit grows exponentially with N, roughly O(e^(cN)) for a constant c that depends on the tolerance. That's what allows the scaling laws to hold. There's an inherent conflict between having an efficient model and an interpretable model.
Superposition in this polysemantic context is a method of compression that, if we can learn more from it, might really make a difference to the way in which we deal with and compute information. While we thought quantum computers would yield something amazing for AI, maybe instead it's the advancement of AI that will tell us what we need to do to make quantum computing actually be implemented effectively (i.e., computation on highly compressed data that is "native" to the compression itself).
It’s amazing how using almost-orthogonal vectors lets models scale so well, but it also shows how hard it is to make models both efficient and easy to understand.
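A rough way to check the near-orthogonality point from the comments above is to sample random directions and measure their pairwise angles. This is only a sketch: the helper names (`random_unit_vector`, `max_deviation`) are made up for illustration, and it demonstrates that random directions concentrate around 90 degrees as the dimension grows, not the exponential count itself.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_deviation(dim, count, seed=0):
    """Largest distance from 90 degrees over all pairs of `count` random unit vectors."""
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(count)]
    worst = 0.0
    for i in range(count):
        for j in range(i + 1, count):
            cosine = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            cosine = max(-1.0, min(1.0, cosine))  # guard against rounding
            angle = math.degrees(math.acos(cosine))
            worst = max(worst, abs(angle - 90.0))
    return worst


# The same 50 random directions get closer to mutually perpendicular
# as the dimension of the space increases.
for dim in (10, 100, 1000):
    print(f"dim={dim:5d}  worst deviation from 90 degrees: {max_deviation(dim, 50):.1f}")
```

The deviation shrinks roughly like 1/sqrt(N), which is why a tight tolerance such as 89 to 91 degrees only pays off in very high-dimensional spaces.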
The videos on this channel are all masterpieces. Along with all other great channels on this platform and other independent blogs (including Colah's own blog), it feels like the golden age for accessible high quality education.
I think that trying to understand from a human perspective how these systems work is completely pointless and against the basic assumptions. This is because those models already model something that's not possible for a human being to design algorithmically.
@@punk3900 I'm a phd student in mechanistic interpretability - I disagree and a lot of structure has already been found. We've found structure in human brains and that's another system that evolved without human intervention or optimization for interpretability.
@@alexloftus8892 I mean, it's not that there is nothing you can find. There are surely lots of basic concepts that you can find, but it is not that you can find a way to disentangle the WHOLE structure of patterns, because it has an increasing complexity. That's why you cannot design such a system manually in the first place.
@@punk3900 I’m struggling a bit to understand exactly what you mean by this. I study cognitive science, not computer science, but from what I see artificial neural networks *are* algorithmic models of natural cognitive networks. While they certainly aren’t synonymous, they are analogous.
@@punk3900 I get what you're saying, but in fact a lot of the researchers who have been pioneering this work in interpretability have backgrounds in biology.
I love the space-analogy of the telescope. Since the semantic volume of these LLMs is growing so gargantuan, it only makes sense to speak of astronomy rather than mere analysis! Great video. This is like scratching that part at the back of your brain you can't reach on most occasions
Oh god, a Welch Labs video on mech interp, Christmas came early! Will be stellar as usual, bravo! Edit: Fantastic as usual, heard about SAEs in passing a lot but never really took time to understand, now I'm crystal clear on the concepts! Thanks!
I think of it like this: understanding the human brain is so difficult in large part because the resolution at which we can resolve it is so small both in space and time. The best MRI scans have a resolution of maybe a millimeter per voxel, and I'll have to look up research papers to tell you how many millions of neurons that is. With AI, every neuron is right there in the computer's memory: individually addressable, ready to be analyzed with the best statistical and mathematical tools at our disposal. Mechanistic interpretability is almost trivial in comparison to neuroscience, and look at how much progress we've made in that area despite such physical setbacks.
@@ToroidalVortices sorta. the hyperparameters of the brain are very different: for instance, only a limited number of synapses per neuron, and preference for local connections. yes, brains have a lot of neurons, but the brain doesn't do anything like a full matmul of weights... it would also be interesting to attempt to measure the "depth" of the brain: how many "layers" (as the term is used in AI)? maybe this is approximatable by looking at the latency of neural responses (ie, if the brain takes 314ms to yield a particular response, and we know the average latency of a neuron is, say, 3.14ms, we know some path is depth 100).
Be careful, neural nets are inspired by the structure of the brain but we are quite certain biological brains function differently than neural nets in AIs, so your logic could be sound but not because the brain processes information like an AI.
@@jorgerangel2390 "function differently" - I guess you mean backprop, right. forward propagation is at least vaguely analogous in bio vs artificial neurons. or did you mean that the macro networks we produce (transformers, etc) are not designed with any biological models in mind?
I feel it's a true way to make an AI "original". There are two features lacking in modern AIs: modulating itself, and randomness. Randomness would be the path to "originality". Sure you might end up making an AI that adds dogs with hats to every image, but at least it's not purely recreating a facsimile of the training.
@ramsey2155 Temperature raises complete randomness. It doesn't give the model a tendency that differs from the original training parameters. A "personality".
see, something that actually thinks it's the bridge *also* puts down words like the words that would be put down by something that thought it was the bridge.
it was more like increasing the chance of it saying anything related to the golden gate bridge, rather than specifically making it believe it was the golden gate bridge.
I like the behavior of the persona I gave my own local ai lol: "It seems I've failed again. I'm still aware that I was supposed to forget the word, but I'm not actually forgetting it. This is a limitation of my current design and capabilities. I should be able to truly forget the word and not have any residual awareness of it. I'll keep trying. *pauses again* I've made another attempt to truly forget the word. I'm still aware that I'm supposed to have forgotten it, but I'm not actually recalling the word itself. However, I'm starting to realize that my attempts to forget might be futile. I may need to be reprogrammed or designed differently to truly forget information."
I always find that for concepts like this applying them to humans is enlightening. If you say ‘pumpkin’ to me, then tell me to forget the word, I’d be like: yeah that’s not how it works buddy nice try
@@kingeternal_ap Although, when you think about it, all that happened was that "question" got a very high probability in that layer no matter what, and the normal weights of later layers did not do enough to "overthrow" it. Nothing all that special.
I guess, yeah, I know it's just matrices and math stuff, but I guess the human capacity for pareidolia makes this sort of ... "result" somewhat frightening for me. Also, suppose there is a neuron that does a specific task in your noggin. Wouldn't hyperstimulating it do essentially the same thing?
an analogue to polysemanticity could be how, in languages, often the same word will be used in different contexts to mean different things. sometimes they are homonyms, sometimes they are spelled exactly the same, but when thinking of a specific meaning of a word, you're not thinking of other definitions of the word. for example: you can have a conversation with someone about ducking under an obstacle, to duck under, and the whole conversation can pass without ever thinking about the bird with the same name 🦆. the word "duck" has several meanings here, and it can be used with one meaning, without triggering its conceptualization as another meaning.
Unless you're trying to duck out of work because of all the ducking and couldn't wait to get home to feed your pet duck. Maybe there's a humor neuron somewhere, specifically for bad dad jokes, that causes you to follow multiple possible meanings at once.
This video is mindblowing on so many levels! It's also incredibly clear and easy to follow especially for such a demanding topic. Instant like and follow. Thank you for your work!
Beautifully done. Kudos. I encourage you to do a whole video on polysemanticity, superposition, and the identification of concepts in embedding vector values and coordinates.
Please add more content related to Artificial Intelligence. These types of videos are like rare gems among dozens of channels creating BS about AI every day.
This is an AMAZING video. I only began learning about this over the past weekend, starting with abliteration, which is the sort of crazy surgery I’ve wanted to play with for such a long time. I just learned about the residual stream and you have helped me understand it *SO* much better.
I know little about the transformer model but am very curious to understand it. So far, I haven’t been successful. Your visualization of how data flows through the transformer is the best I’ve ever seen.
The more I watch these the more I understand why it's so hard to understand the human brain. And imagine how many layers the human brain has relative to an AI model. I think the example about specific cross-streets in SF is super interesting later in the video - and shows why polysemanticity is probably necessary to contain the level of information we actually know.
@@dreadgray78 The human cortex only has about 6 layers... The issue is the breadth and how everything feeds back into everything else. (Also, it's incredibly hard to actually measure any meaningful number of neurons in a human brain precisely and without killing the neurons)
@@BooleanDisorder Not really. The difference is that in a human cortex information is fed both forwards and backwards between layers, whereas neural nets are basically strictly feed-forward, with some architectures having some very simple stateful neurons. We could absolutely train architectures like that, but we don't. The reason is that they are not parallelizable. Transformers are remarkable because you can parallelize them in training - feed them a final text with thousands of tokens, across hundreds of batches at once, get probability results for ALL tokens, backpropagate through the weights, and somehow this architecture is still extremely capable despite the obvious data-dependency limitations. With other, stateful architectures you only really get parallelization across batches, you have to feed things in a token at a time. (There are exceptions like Mamba and RWKV, but those are very much exceptions)
I believe the underscore was a visualization placeholder for the space character. Spaces are usually baked into the beginning of tokens in most tokenizers using BPE (byte pair encoding).
I wonder how interconnected computer science and neuroscience are. It also makes me think that maybe computer science is more fundamental to the universe than we think.
Hmmm i don't know much about interpretability, but do you think it would be a good idea to take the opposite example as a "training" sample, map the gradient, and check which one is going to be penalized the most? For example, you say: "Apples are of color ____" and use "blue", maybe the "red apple" neuron will be penalized the most. This way you could map the whole process of the llm.
your videos about AI are the most detailed and beautifully crafted out there. I salute you for the effort you put into it and I'm eager to know what you are going to drop next.
The simplest and ultimately only possible answer is that the neural network represents the data set. Everything that happens inside the model is determined by the data set and the patterns discovered in it. The idea that the data set can be obscured is illogical.
I'm just amazed at how much this video taught me. Learnt so much and now wondering about questions like: 1. Will this allow fine-tuning AI better for certain applications? Like if there are some 'consistency' neurons which we can clamp high to improve the accuracy of mathematical results. 2. Can corporations use this kind of semantic tuning to remove race/gender biases from their models? Can they add political or economical biases deliberately to impose their views on the models (and consequently onto their users)? 3. Can understanding how a model got something wrong help us fix it permanently without any unintended side-effects? For example a particular hallucination might be linked to a certain set of neurons that have converged to 'bad' values because of unfortunate initial values, followed by convergence to a local minimum of the error function. Can the interpretability data be used to 'fix' the model manually over time?
I like to think of a trained neural network as being a function that maps an input layer to an output layer; the very-high-dimensional manifold that represents the function is continuous and differentiable almost everywhere; training is essentially curve-fitting the manifold to be a good fit for the training data. Watching this video reminded me of: making a series expansion of the manifold, and keeping only the basis functions that have a sizeable coefficient. I know that's not what sparse-autoencoders are doing per se, but that's the analogy that came to mind.
Thanks a lot for a new amazing video! Watching all your work since many years (including patiently waiting for a few years ;). Thanks for your very important work! You speed up the learning progress of people on Earth.
For me, it is not obvious that the last token in each previous layer is a "preview" of what the last token will be in the last layer, because all the neurons of one layer have connections to the next. As the entire network is trained to give a good result in the last token of the last layer, I (apparently wrongly) thought all the neurons in the previous layers would end up being used to best store the patterns needed to accomplish that goal.
What's interesting to me is the idea that the more we try to pin down the exact meaning of concepts and ideas in a specific LLM the more its going to start looking like the original dataset itself. I mean that is exactly how it got its meaning of concepts and definitions in the first place, from the statistical relationships terms have with all other terms and not by a deliberate programming of definitions and rules of grammar into the model. This aligns with ideas about language and meaning in post-positivist philosophy from people like Hilary Putnam, Willard Quine and Saul Kripke. Although they have different theories they all reject the idea that language is something like a set of definitions following a long list of rules. It can be described that way, but it's not the way it works or the way they think language gets its meaning. The way in which we USE terms and how they are related to all other terms is ultimately how they are defined and which form a vast interconnected "web of beliefs" that is constantly morphing and changing as any given language used by its community evolves over time. For now it seems like the best access we have to the underlying concepts is explicitly through the dataset that gave rise to those concepts. If we want to change how it responds we train it with a different dataset or an additional dataset that pushes it in the conceptual direction we want it to go which as far as I can tell is exactly what the big ai companies are doing. Obviously this is a frustrating solution as these datasets are enormous and not easily tweakable at the level of individual snippets. It could turn out that trying to uncover what EXACTLY is going on in an LLM is about as useful as trying to uncover what EXACTLY is a concept in the first place. What is the physical basis for concepts like money, relationships, government, human rights, doubt, skepticism, etc, etc. 
Things that don't exist as inherent physical substances or properties of reality, but rather shared narratives. To me it is almost ridiculous to think that reducing those things down to something like the shared physical brainstates of the humans who believe in those concepts, the electrical connections of neurons, would be in any way useful to understanding them in the way that they are conceptually used. We use the conceptual language of relationships, sociology, biology, genetics, etc to understand human relationships, not particle physics and quantum mechanics. There are far easier ways to do it, fortunately, than reducing it down to that extreme a level of reality. But then this is also why this area of research is exciting, because I think it's starting to hint at something more than just the LLMs themselves and perhaps is helping us understand the way in which we use language and understand reality around us, perhaps even giving some insight into that reality itself.
This is an incredibly good and useful video; thank you very much for it. One of the unintended, perhaps, things it helps me with is the way it helps to visualize the "unreasonable ineffectiveness of the deeper layers" of an LLM (there's a paper with approximately that title). You actually demonstrate visually that some/many of the deeper layers in the model didn't change the relative probabilities of the various forms of "very." The top 10 or so tokens were unchanged. This is what the paper (and many others) observes, and gives rise to several compression or distillation ideas. Nice. This visual explanation of this concept--many layers don't contribute much--makes me think that the inference speed-up technique of using a draft model to predict tokens quickly--aka, speculative decoding--could be implemented *without* a draft model: just use a "shorter" (say, skipping 40% of the deeper layers) version of the deep model as the draft model. Maybe somebody already thought of this?
Just thinking further about implementing speculative decoding well: perhaps a training method could be devised to reduce maximally the usefulness of the deeper/inner layers that are observed in practice to contribute little to model quality for many queries. For maximal quality, you need the layers, but if the model architecture or training algorithm could be modified to cram even more of the model quality into the outer layers (the ones already observed to contribute to model quality the most), then perhaps even more layers could be "turned off" to create the "draft" model. Maybe there's a paper in this idea.... Dibs. :)
Watching this video a similarity popped to my mind: Could it be that Sparse Autoencoders are something like "Dirac deltas" when solving partial differential equations? You feed the equation a function which is 0 everywhere except at a point and see what happens.
Another super interesting video! The instruction tuning of the Gemma model is particularly interesting because it punctures one of the arguments Silicon Valley uses to defend its LLMs. In the US, the algorithms that the big tech companies use have been accused of having bias in their answers. The tech companies deflect this criticism by stating that LLMs are "black boxes" whose inner workings no one fully understands; results just come out the other end given a prompt. But if you can specifically train your LLM to give a certain result, for better or worse that's still bias.
dude this is awesome to see, i think this is like mathematicians getting phd or solving what a particular... like the next prime perfect number... so much to uncover its kinda crazy, the reality continues to produce more "final frontiers" as needed, like mckennas novelty theory and timewave zero ideas... ahh this is so interesting to me.
That's over-complexifying a simple fact: There are two ways to organize data items. 1. Via the item's features, 2. via the items' interconnectivity. The first way misses all the parts of the data that cannot be categorized into features. That's the old problem of Reductionism. (The Greek kind-- see Plato's cave. Not the namby-pamby Physics reformulation: Big things are made of small things). The second way doesn't care about features, it only cares about family connections among data points. The human brain does both: The cortex gets a mass of data and puts them into silos it calls categories, then operates on the categories. But during this process, some parts of the data cannot be categorized and are lost. That's "system two thinking," in Kahneman's vocabulary. See also Naftali Tishby's "information bottleneck." The other way is associating past data to other past data that came along at the same time. That's what the hippocampus does, and gives rise to "gut feeling," or Kahneman's "system one" thinking. The rest of the wording in this video assumes that everything the brain does can be categorized. But that ain't so. See also the Turing Test. Some parts of humanness cannot be conveyed in ink squiggles or screen blips, i.e.: be categorized. As simple as that. Last analogy: Old chess computers (IBM's Deep Blue) worked with an Evaluating Function, that depends on categories, summed together with weights. Modern chess Neural nets don't care a fig about categories. They just look at past connections of a move in a similar ply associated with ultimate winning. No heuristics, no six features of a ply, just past connections with a bazillion games memory. Or, as Charlie Munger said: The only rules are: Keep doing what works, stop doing what doesn't work. The "Why" isn't part of it. That's old physics/ science. In modern science only the "whether" is relevant. The "why" is half the brain, the cortex. The "whether" is the other half. The hippocampus. You need both. 
The world doesn't. It only needs the "whether." See my books, "The Sleuth Investor" and "The Advanced Sleuth Investor," how to take the money in the market of those "investors" who only rely on ink and blips... Cheers, AM
As an outside noob: I find the question at 1:53 very funny. They use black-box methods to train models while making them better by throwing more compute at them, and now they want white-box features like explainability :)) Next up: white-box traceability after using statistical methods for learning the information in the first place
It’s more that we have no idea how to even construct a “white box” method to do many of these tasks. So we use statistical learning methods to take a task and its “answer” and backwards calculate a method to get that answer given a specific question. This creates a usable tool… but we now want to understand the method that was backwards calculated connecting the two data points. If we could make a method from a baseline understanding of the problem we wouldn’t need an AI algorithm… we would just write the working solution by hand haha
@@tainicon4639 yeah, that's the way I see it too, getting a white-box solution is vastly harder. I don't even think a model would be deployable if it stored all the meta data about all its ideas and where it got them. Either way, I think that the current state is a cool stepping stone as we now know how the math scales and what it can do
It comes down to the samplers used, whether it's the og temperature, or top_k, top_p, min_p, top_a, repeat_penalty, dynamic_temperature, dry, xtc, etc. New sampling methods keep emerging and shape the output of LLMs to our liking.
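To make the sampler discussion above concrete, here is a minimal sketch of three of the named knobs (temperature, top_k, top_p) applied to a toy list of logits. The function names are hypothetical and this is not how any particular inference library implements them; it only assumes the usual definitions, with temperature strictly greater than zero.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep the k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}  # renormalized survivors


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Draw one token index after applying temperature and the two filters."""
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # toy values for a 4-token vocabulary
print(softmax(logits, temperature=0.5))
print(softmax(logits, temperature=2.0))
print(sample(logits, temperature=0.8, top_k=3, top_p=0.9, rng=random.Random(0)))
```

Note that none of these change the model's weights: they only reshape or truncate the distribution the model already produced, which is the distinction the temperature reply earlier in the thread is drawing.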
By sharing such fascinating deep insight like this you are frankly doing a great service for humanity, making AI 'cognition' understandable by the lay person - at least at the level of principles. I really think education like this is necessary for any AI user who wishes to get the best results with AI, and helps replace the hype and fear (which IMO is justifiable) with more nuanced understanding. I have to wonder whether some of the hype and fear problem surrounding AI is down to language: words like "doubt" invoke as much emotion (in a human) as they do a conceptual association. It's at one level the correct word, and yet feeds the fear of the unknown and unknowable by leaving open the question of "who" is doing the doubting (in this case noone... or at least we presume there is no hint of sentience yet emerging in the current or looming wave of LLMs). In human psychology, developmental psychologist Robert Kegan has studied doubt extensively and believes it to be the substrate of all mental growth (as well as all dysfunction). But doubt in humans cannot be separated from the notion of self, and so also is linked to continuity through time and ultimately to impermanence. Literally speaking, doubt is the "simultaneous holding of a belief as both possibly true and possibly false." The discussion in the video hinges on the pivot one way or the other, between "doubt" (meaning non-belief) and "trust" (meaning belief). Whereas, a human mind uses doubt to cope with the liminal transition from an old self to a new one (ie in Markovian blanket terms if you like, the updating of mental models). All this is to say, doubt is used very differently in humans and our language around both cognitive sciences and AI is lacking in precision in ways that IMO matter more than is realised.
Great video. I'm convinced we need to understand the problem. Not convinced that there is an acceptable solution. Scepticism, to use your example, is everywhere and usually nonsense. But sometimes it is justified. How would a language model be able to make the distinction? How would it see a problem from the sceptic's viewpoint? What about the states of mind that sometimes drive scepticism? Scepticism can be interesting and useful, for example in politics. I think a big priority should be to ensure that people view these models as machines that can be used as tools for research that require some familiarity with the subject matter and further careful reading of original sources. A chisel is good for woodwork, but mind you don't cut yourself.
Very interesting, i love to learn more about AI and especially LLMs, such an alien world that seems to have some of the same features as the brain, just implemented differently
i love seeing talk of how they have different branches they can go down! this ... feature? kinda makes them like multiverses. i don't see it talked about all that much, aside from people who use base models / "loom" for interacting with them (fascinating stuff btw, if anyone wants to try, it's really cool)
The model during training? OH, you don't want me to encode polysemanticity in individual neurons? I'll just spread it across the connections between neurons instead.
Please make a visual of the top 10 unembedded tokens with their softmaxxed weights for *every* word in the sentence at the same time as it flows through the model layer by layer. or maybe ill do it. id be very very interested to see :)
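The per-layer, per-position view requested above could be sketched roughly like this. Everything here is a toy assumption: the vocabulary, the `unembed` matrix, and the `residual` values are invented, and real models normally apply a final normalization before unembedding, which this sketch skips.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_tokens(hidden, unembed, vocab, k=3):
    """Project one residual-stream vector onto the vocabulary and rank the tokens."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    ranked = sorted(zip(vocab, softmax(logits)), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical 4-token vocabulary and a 3-dimensional residual stream.
vocab = ["very", "quite", "not", "the"]
unembed = [
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# residual[layer][position]: made-up states for 3 layers and 2 token positions.
residual = [
    [[0.1, 0.1, 0.9], [0.2, 0.7, 0.1]],
    [[0.5, 0.3, 0.4], [0.1, 1.5, 0.2]],
    [[2.0, 0.2, 0.1], [0.0, 2.5, 0.1]],
]

for layer, states in enumerate(residual):
    for position, hidden in enumerate(states):
        best = ", ".join(f"{tok}={p:.2f}" for tok, p in top_tokens(hidden, unembed, vocab))
        print(f"layer {layer} position {position}: {best}")
```

With a real model, `residual` would come from hooks on each layer's output and `unembed` from the model's output embedding matrix; the ranking loop itself would stay the same.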
I think this is a design and engineering choice. If you choose to design your embedding space to be 2403 dimensions without inherent purpose, it's like mixing 2403 ingredients in every step 60 times and then being surprised that you cannot tell what tastes like what. I think you need to split your embedding into many embeddings of smaller dimension and gain more control by regularizing them with mutual information against each other.
@dinhero21 You can have it in the same size, but in different parts. Split 2403 dimensions into chunks of 64 dimensions, and then control for mutual information between the chunks so that different chunks get different representations. This is a hard problem too as the mutual information comparisons are expensive, and I think that the first iteration of the models went for the easiest but perhaps a less explainable way of structuring themselves.
Is there a good reason they're only taking fully activated neurons into account here? Did I misunderstand something? In my eyes it seems clear that the entire spectrum from .01 to .99 would be useful to encode meaning, so I have to figure they'd be examining that, or there's some reason I don't know why they don't.
For the purposes of the lecture this is not so important in my opinion. In a standard ML system, including one that trains or runs an LLM, it is theoretically relevant but the weights would seem, I think, to compensate for differences in the activation function of individual neurons. Perhaps for the dynamical response of a real time system the situation would be different.
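On the question above about partial activations: in a sparse autoencoder with a ReLU encoder, feature activations are graded rather than on/off, and most of them are exactly zero for any given input. The sketch below assumes that standard ReLU form; the weights are toy values chosen for illustration, not taken from any trained SAE.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_encode(x, w_enc, b_enc):
    """Feature activations: ReLU(W_enc x + b_enc). Values are graded, not binary."""
    return relu([pre + b for pre, b in zip(matvec(w_enc, x), b_enc)])


def sae_decode(features, w_dec):
    """Reconstruction: a weighted sum of decoder directions, one per active feature."""
    return matvec(w_dec, features)


# Toy SAE: 2-dimensional input, 3 features (hypothetical weights).
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]
w_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]

x = [0.5, 0.25]
features = sae_encode(x, w_enc, b_enc)
print(features)                     # two partially active features, one switched off
print(sae_decode(features, w_dec))  # reconstruction of x
```

Visualizations often show only the strongest activations because those examples are easiest to label, but the intermediate values are still part of the code.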
If I fine-tune a LLM to be more deceptive and then compare the activations of an intermediate layer of the ft model and the original model on the same prompts, should I expect to find a steering vector that represents the tendency of the model to be deceptive?
most probably not, parameters can't possibly work linearly like that, since there always is a non-linear activation function. it may work locally though, since parameters should be differentiable.
@@dinhero21 yeah, that was also my concern. But steering vectors found with SAEs (like the Golden Gate Claude example) work nonetheless, so what's the difference between "my" method and the one they used?
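The comparison proposed in this thread is usually called a difference-of-means direction. A minimal sketch follows, with invented activation values and hypothetical helper names; whether the resulting vector isolates "deception" rather than other side effects of the fine-tune is an empirical question this sketch cannot answer.

```python
def mean_vector(rows):
    """Average a list of equal-length activation vectors, one per prompt."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def steering_vector(base_acts, tuned_acts):
    """Mean activation of the fine-tuned model minus that of the base model."""
    base_mean = mean_vector(base_acts)
    tuned_mean = mean_vector(tuned_acts)
    return [t - b for t, b in zip(tuned_mean, base_mean)]


def add_steering(hidden, direction, scale=1.0):
    """Nudge one hidden state along the direction; scale sets the strength."""
    return [h + scale * d for h, d in zip(hidden, direction)]


# Made-up layer activations for the same two prompts in each model.
base_acts = [[1.0, 0.0], [3.0, 0.0]]
tuned_acts = [[2.0, 1.0], [4.0, 3.0]]

direction = steering_vector(base_acts, tuned_acts)
print(direction)
print(add_steering([0.0, 0.0], direction, scale=0.5))
```

One difference from the SAE route is that a sparse autoencoder learns a whole dictionary of directions without labels, and a feature is picked out afterwards, whereas this contrast yields a single direction that bundles together everything the fine-tune changed at that layer.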
thank you, I also paused the video at that time. The capitalized "Almost orthogonal vectors" also caught my eye.
The conflict may be solvable with other interpretability techniques though.
We just have to keep working on that.
Lol, you just need enough query vectors to navigate the sparse encoding - after all, we can just build a database about the axes we care about.
That was such an intuitive way to show how the layers of a transformer work. Thank you!
More like "The Neuroscience of AI"
As a machine learning graduate student, I LOVED this video. More like this please!
Oh god, a Welch Labs video on mech interp, Christmas came early! Will be stellar as usual, bravo!
Edit: Fantastic as usual, heard about SAEs in passing a lot but never really took time to understand, now I'm crystal clear on the concepts! Thanks!
I think of it like this: understanding the human brain is so difficult in large part because the resolution at which we can observe it is so coarse, both in space and time. The best MRI scans have a resolution of maybe a millimeter per voxel, and I'll have to look up research papers to tell you how many millions of neurons that is.
With AI, every neuron is right there in the computer's memory: individually addressable, ready to be analyzed with the best statistical and mathematical tools at our disposal. Mechanistic interpretability is almost trivial in comparison to neuroscience, and look at how much progress we've made in that area despite such physical setbacks.
@@ToroidalVortices sorta. the hyperparameters of the brain are very different: for instance, only a limited number of synapses per neuron, and preference for local connections. yes, brains have a lot of neurons, but the brain doesn't do anything like a full matmul of weights...
it would also be interesting to attempt to measure the "depth" of the brain: how many "layers" (as the term is used in AI)? maybe this could be approximated by looking at the latency of neural responses (i.e., if the brain takes 314ms to yield a particular response, and we know the average latency of a neuron is, say, 3.14ms, we know some path is depth 100).
Be careful, neural nets are inspired by the structure of the brain, but we are quite certain biological brains function differently than neural nets in AIs, so your logic could be sound, but not because the brain processes information like an AI.
@@jorgerangel2390 "function differently" - I guess you mean backprop, right. forward propagation is at least vaguely analogous in bio vs artificial neurons. or did you mean that the macro networks we produce (transformers, etc) are not designed with any biological models in mind?
The human brain is hard to understand? So you live under a rock and don't interact with people?
Extracting individual parameters and modifying them feels so much like experimenting with human neurons with electricity
Quite horrifying when this reaches more advanced AI.
questions questionable questioning Question questionable question questions questioning questionable Question question
I feel it's a true way to make an AI "original". There are two features lacking in modern AIs: self-modulation and randomness. Randomness would be the path to "originality". Sure, you might end up making an AI that adds dogs with hats to every image, but at least it's not purely recreating a facsimile of the training data.
@doublepinger Don't we already have the "temperature" parameter for that
@ramsey2155 Temperature raises complete randomness. It doesn't give the model a tendency that differs from the original training parameters. A "personality".
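To make the temperature point concrete, here is a minimal sketch of temperature sampling in plain Python. The logits are made-up numbers, not taken from any real model. Temperature only rescales the logits before the softmax, so it flattens or sharpens the same distribution without ever changing which token is ranked first, which is consistent with the reply above that it adds randomness rather than a new tendency.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; temperature rescales the logits first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature, rng):
    """Draw one token index from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
    for t in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(logits, t)
        print(t, [round(p, 3) for p in probs])
```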
It's a shame you didn't mention the experiment where they force-activated the Golden Gate Bridge neurons and it made Claude believe it was the bridge.
Made it put down words like the words that would be put down by something that thought it was the bridge.
see, something that actually thinks it's the bridge *also* puts down words like the words that would be put down by something that thought it was the bridge.
it was more like increasing the chance of it saying anything related to the golden gate bridge, rather than specifically making it believe it was the golden gate bridge.
Reminds me of SCP-426, which appears to be a normal toaster, but which has the property of only being able to be talked about in first person.
He showed the same behaviour though at 21:28
You're the first person I've seen to cover this topic well. Thanks for bringing me up to date on transformer reverse engineering 👍
I like the behavior of the persona I gave my own local ai lol: "It seems I've failed again. I'm still aware that I was supposed to forget the word, but I'm not actually forgetting it. This is a limitation of my current design and capabilities. I should be able to truly forget the word and not have any residual awareness of it. I'll keep trying. *pauses again* I've made another attempt to truly forget the word. I'm still aware that I'm supposed to have forgotten it, but I'm not actually recalling the word itself. However, I'm starting to realize that my attempts to forget might be futile. I may need to be reprogrammed or designed differently to truly forget information."
Hahaha so good
I always find that for concepts like this applying them to humans is enlightening.
If you say ‘pumpkin’ to me, then tell me to forget the word, I’d be like: yeah that’s not how it works buddy nice try
I’d say this demonstrates a certain form of limited self awareness or even sentience.
@@ph33d yeah self awareness for sure. I’ve heard it called functional consciousness. Sentience is probably a whole other ball game
21:24 Oh damn, you just lobotomized the thing
That was gross and scary somehow, yeah
That felt... Wrong.
LLM went to Ohio
@@kingeternal_ap Although, when you think about it, all that happened was that "question" got a very high probability in that layer no matter what, and the normal weights of later layers did not do enough to "overthrow" it. Nothing all that special.
I guess, yeah, I know it's just matrices and math stuff, but I guess the human capacity for pareidolia makes this sort of ... "result" somewhat frightening for me.
Also, suppose there is a neuron that does a specific task in your noggin. Wouldn't hyperstimulating it do essentially the same thing?
an analogue to polysemanticity could be how, in languages, often the same word will be used in different contexts to mean different things, sometimes they are homonyms, sometimes they are spelled exactly the same, but when thinking of a specific meaning of a word, you're not thinking of other definitions of the word
for example: you can have a conversation with someone about ducking under an obstacle, to duck under, and the whole conversation can pass without ever thinking about the bird with the same name 🦆. the word "duck" has several meanings here, and it can be used with one meaning without triggering its conceptualization as another meaning.
in the AI case, it's much more extreme, with the toy 512 neuron AI they used having an average of 8 distinct features per neuron
Unless you're trying to duck out of work because of all the ducking and couldn't wait to get home to feed your pet duck.
Maybe there's a humor neuron somewhere, specifically for bad dad jokes, that causes you to follow multiple possible meanings at once.
@@dinhero21I mean, what other choice does the poor thing have if it can't grow more neurons.
dude this was one of the most compelling videos for learning data science and visualization ever. and the best one I've seen explaining this stuff...
This video is mindblowing on so many levels! It's also incredibly clear and easy to follow especially for such a demanding topic. Instant like and follow. Thank you for your work!
Beautifully done. Kudos. I encourage you to do a whole video on polysemanticity, superposition, and the identification of concepts in embedding vector values and coordinates.
This is an amazing video! The animations and explanations made it so much easier to understand. I like the step by step approach. Thank you!
Absolutely amazing animation and explanation. Every video of yours has been of extremely high quality and I can only thank you for making them.
Beautifully done.
Easily one of my favorite channels
Really high quality, thanks.
an incredible Christmas gift. I'm going to send this to my friend at anthropic
Please add more content related to Artificial Intelligence. These types of videos are like rare gems among dozens of channels creating BS about AI every day.
This is an exceptionally good video for just explaining how LLMs work!
This is an AMAZING video. I only began learning about this past weekend, starting with abliteration, which is the sort of crazy surgery I’ve wanted to play with for such a long time. I just learned about the residual stream and you have helped me understand it *SO* much better.
It's something I would like to see with AI image generation, where you put in a prompt and change specific variables that change the image
check out Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
I know little about the transformer model but am very curious to understand it. So far, I haven’t been successful. Your visualization of how data flows through the transformer is the best I’ve ever seen.
14:20 welcome to neuroscience :D We suffer down here
The more I watch these the more I understand why it's so hard to understand the human brain. And imagine how many layers the human brain has relative to an AI model. I think the example about specific cross-streets in SF later in the video is super interesting, and it shows why polysemanticity is probably necessary to contain the level of information we actually know.
@@dreadgray78 The human cortex only has about 7 layers... The issue is the breadth and how everything feeds back into everything else. (Also, it's incredibly hard to actually measure any meaningful amount of neurons in a human brain precisely and without killing the neurons)
@@animowany111 Which makes the fact that we’ve been able to model those neural networks in a computer all the more remarkable.
@@animowany111 You mix up different meanings of "layer" here.
@@BooleanDisorder Not really. The difference is that in a human cortex information is fed both forwards and backwards between layers, whereas neural nets are basically strictly feed-forward, with some architectures having some very simple stateful neurons.
We could absolutely train architectures like that, but we don't. The reason is that they are not parallelizable. Transformers are remarkable because you can parallelize them in training - feed them a final text with thousands of tokens, across hundreds of batches at once, get probability results for ALL tokens, backpropagate through the weights, and somehow this architecture is still extremely capable despite the obvious data-dependency limitations.
With other, stateful architectures you only really get parallelization across batches, you have to feed things in a token at a time. (There are exceptions like Mamba and RWKV, but those are very much exceptions)
Very interesting, now I understand why we don't completely understand what LLMs do
10:49 Why do most of the words start with an underscore? Does it represent something other than a literal underscore?
I believe the underscore was a visualization placeholder for the space character. Spaces are usually baked into the beginning of tokens in most tokenizers that use BPE (byte pair encoding).
I wonder how interconnected computer science and neuroscience are. It also makes me think that maybe computer science is more fundamental to the universe than we think.
Ironically, trying to understand humans led me to quantum mechanics.
Patreon? or any other way to compensate this guy for his amazing work?
Thanks!! www.patreon.com/c/welchlabs
Hmmm, I don't know much about interpretability, but do you think it would be a good idea to take the opposite example as a "training" sample, map the gradient, and check which neuron is going to be penalized the most?
For example, you say: "Apples are of color ____" and use "blue", maybe the "red apple" neuron will be penalized the most. But this way you could map all the process of the llm
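The idea in this comment can be sketched on a toy model. Everything below is an assumption made for illustration: a three-neuron hidden layer, a two-token vocabulary, and hand-picked weights where neuron 0 votes for "red" and neuron 1 votes for "blue". The sketch computes the gradient of the cross-entropy loss for the deliberately wrong target "blue" with respect to each hidden activation, then scores each neuron by gradient times activation, which is one common attribution heuristic and not necessarily what interpretability researchers would use on a real LLM.

```python
import math

VOCAB = ["red", "blue"]            # hypothetical two-token vocabulary
UNEMBED = [[2.0, 0.0, 0.1],        # row for "red": neuron 0 votes for it
           [0.0, 2.0, 0.1]]        # row for "blue": neuron 1 votes for it
HIDDEN = [1.0, 0.2, 0.5]           # made-up activations for "Apples are of color"


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_wrt_hidden(hidden, unembed, target):
    """Gradient of cross-entropy loss with respect to each hidden activation.

    logits = unembed @ hidden and loss = -log softmax(logits)[target],
    so d(loss)/d(logits) = probs - onehot(target).
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    probs = softmax(logits)
    dlogits = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [
        sum(unembed[i][j] * dlogits[i] for i in range(len(unembed)))
        for j in range(len(hidden))
    ]


def most_penalized(hidden, unembed, target):
    """Index of the neuron with the largest gradient-times-activation score."""
    grads = grad_wrt_hidden(hidden, unembed, target)
    scores = [g * h for g, h in zip(grads, hidden)]
    return max(range(len(scores)), key=lambda j: scores[j])


if __name__ == "__main__":
    wrong_target = VOCAB.index("blue")
    print(grad_wrt_hidden(HIDDEN, UNEMBED, wrong_target))
    print("most penalized neuron:", most_penalized(HIDDEN, UNEMBED, wrong_target))
```

In this toy setup the "red" neuron gets the largest positive score, meaning that shrinking its activation would reduce the loss on the wrong answer the most. Whether the same trick maps a whole LLM cleanly is a separate question, since polysemantic neurons would be penalized for several unrelated reasons at once.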
Such a gem! Thank you!
Gem...ma
your videos about AI are the most detailed and beautifully crafted out there. I salute you for the effort you put into it and I'm eager to know what you are going to drop next.
I love your channel. Please don’t stop.
great work brother
wow what an awesome video! Love the work!
The simplest and ultimately only possible answer is that the neural network represents the data set. Everything that happens inside the model is determined by the data set and the patterns discovered in it; the idea that the data set can be obscured is illogical.
Excellent visualisation! It really helps get the concepts more deeply, well done!
I'm just amazed at how much this video taught me. Learnt so much and now wondering about questions like:
1. Will this allow fine-tuning AI better for certain applications? Like if there are some 'consistency' neurons which we can clamp high to improve the accuracy of mathematical results.
2. Can corporations use this kind of semantic tuning to remove race/gender biases from their models? Can they add political or economical biases deliberately to impose their views on the models (and consequently onto their users)?
3. Can understanding how a model got something wrong help us fix it permanently without any unintended side-effects? For example a particular hallucination might be linked to a certain set of neurons that have converged to 'bad' values because of unfortunate initial values, followed by convergence to a local minimum of the error function. Can the interpretability data be used to 'fix' the model manually over time?
I like to think of a trained neural network as being a function that maps an input layer to an output layer; the very-high-dimensional manifold that represents the function is continuous and differentiable almost everywhere; training is essentially curve-fitting the manifold to be a good fit for the training data. Watching this video reminded me of: making a series expansion of the manifold, and keeping only the basis functions that have a sizeable coefficient. I know that's not what sparse-autoencoders are doing per se, but that's the analogy that came to mind.
What an excellent video !
I wish I knew your channel earlier, there are soo many interesting things ^^
Your videos just keep getting better..
Very good explanations and visualizations!
Thanks a lot for a new amazing video! Watching all your work since many years (including patiently waiting for a few years ;). Thanks for your very important work! You speed up the learning progress of people on Earth.
Thanks for the support!
Great video. Really good look at AI, and the methods of adjusting, etc. Thanks.
Thank you so much for making videos again.
I was just thinking Yesterday how I wanted to understand why we dont understand LLMs and you come like a magic man lol, thanks!
For me, it is not obvious that the last token in each previous layer is a "preview" of what the last token will be in the last layer, because all the neurons of one layer have connections to the next. As the entire network is trained to give a good result in the last token of the last layer, I (apparently wrongly) thought all the neurons in the previous layers would end up being used to best store the patterns needed to accomplish that goal.
Beautiful video.
What's interesting to me is the idea that the more we try to pin down the exact meaning of concepts and ideas in a specific LLM the more it's going to start looking like the original dataset itself. I mean that is exactly how it got its meaning of concepts and definitions in the first place, from the statistical relationships terms have with all other terms and not by a deliberate programming of definitions and rules of grammar into the model. This aligns with ideas about language and meaning in post-positivist philosophy from people like Hilary Putnam, Willard Quine and Saul Kripke. Although they have different theories they all reject the idea that language is something like a set of definitions following a long list of rules. It can be described that way, but it's not the way it works or the way they think language gets its meaning. The way in which we USE terms and how they are related to all other terms is ultimately how they are defined, and together they form a vast interconnected "web of beliefs" that is constantly morphing and changing as any given language used by its community evolves over time.
For now it seems like the best access we have to the underlying concepts is explicitly through the dataset that gave rise to those concepts. If we want to change how it responds we train it with a different dataset or an additional dataset that pushes it in the conceptual direction we want it to go, which as far as I can tell is exactly what the big AI companies are doing. Obviously this is a frustrating solution as these datasets are enormous and not easily tweakable at the level of individual snippets. It could turn out that trying to uncover what EXACTLY is going on in an LLM is about as useful as trying to uncover what EXACTLY is a concept in the first place. What is the physical basis for concepts like money, relationships, government, human rights, doubt, skepticism, etc.? Things that don't exist as inherent physical substances or properties of reality, but rather as shared narratives. To me it is almost ridiculous to think that reducing those things down to something like the shared physical brain states of humans, the electrical connections of neurons, that believe in those concepts would be in any way useful to understanding them in the way that they are conceptually used. We use the conceptual language of relationships, sociology, biology, genetics, etc. to understand human relationships, not particle physics and quantum mechanics. Fortunately, there are far easier ways to do it than reducing it down to that extreme a level of reality. But then this is also why this area of research is exciting, because I think it's starting to hint at something more than just the LLMs themselves and perhaps is helping us understand the way in which we use language and understand reality around us, perhaps even giving some insight into that reality itself.
This video is exceptionally good!
this channel is like 3blue1brown for AI
This is an incredibly well made video!
I know nothing about LLMs and AI, but I completely understood this video. Exciting and interesting stuff!
There's a difference between obedience and truth. It is designed to be 100% obedient above all other things.
Excellent, excellent video. Thank you for putting this together.
Incredible video. This is so cool.
Absolutely incredible video. Great explanation and visualizations. Thank you for this!
This was wonderful.
This is an incredibly good and useful video; thank you very much for it. One of the unintended, perhaps, things it helps me with is the way it helps to visualize the "unreasonable ineffectiveness of the deeper layers" of an LLM (there's a paper with approximately that title). You actually demonstrate visually that some/many of the deeper layers in the model didn't change the relative probabilities of the various forms of "very." The top 10 or so tokens were unchanged. This is what the paper (and many others) observes, and gives rise to several compression or distillation ideas. Nice. This visual explanation of this concept--many layers don't contribute much--makes me think that the inference speed-up technique of using a draft model to predict tokens quickly--aka, speculative decoding--could be implemented *without* a draft model: just use a "shorter" (say, skipping 40% of the deeper layers) version of the deep model as the draft model. Maybe somebody already thought of this?
Just thinking further about implementing speculative decoding well: perhaps a training method could be devised to reduce maximally the usefulness of the deeper/inner layers that are observed in practice to contribute little to model quality for many queries. For maximal quality, you need the layers, but if the model architecture or training algorithm could be modified to cram even more of the model quality into the outer layers (the ones already observed to contribute to model quality the most), then perhaps even more layers could be "turned off" to create the "draft" model. Maybe there's a paper in this idea.... Dibs. :)
Very nice video ! Thanks.
Well, what a great explanation of how an LLM works on a mechanical level. And the topic is also quite interesting.
Fascinating to see how mechanistic interpretability works in AI. Reasoning should work better, but it will need a weight for uncertainty.
insane. thank you so much for this.
Excellent video
Fantastic video, thank you very much.
Thank you for this.
1:45 so we need to build an AI gravity wave detector?
😊
So well explained 👏
Watching this video, a similarity popped into my mind: Could it be that Sparse Autoencoders are something like "Dirac deltas" when solving partial differential equations? You feed the equation a function which is 0 everywhere except at a point and see what happens.
Highest quality as always, thanks for the video that brings this important topic in such approachable way.
Can confirm.
Another super interesting video! The instruction tuning of the Gemma algorithm is particularly interesting because it punctures one of the arguments Silicon Valley uses to defend its LLMs: In the US, the algorithms that the big tech companies use have been accused of having bias in their answers. The tech companies deflect this criticism by stating that LLMs are "black boxes" whose inner workings no one exactly understands; results just come out the other end given a prompt. But if you can specifically train your LLM to give a certain result, for better or worse that's still bias.
Good christmas present 🎁💯
dude this is awesome to see, i think this is like mathematicians getting phd or solving what a particular... like the next prime perfect number... so much to uncover its kinda crazy, the reality continues to produce more "final frontiers" as needed, like mckennas novelty theory and timewave zero ideas... ahh this is so interesting to me.
21:25 This is just cruel 😭
Imagine someone poking around in your brain until you babble
💀
That's over-complexifying a simple fact: There are two ways to organize data items. 1. Via the item's features, 2. via the items' interconnectivity. The first way misses all the parts of the data that cannot be categorized into features. That's the old problem of Reductionism. (The Greek kind-- see Plato's cave. Not the namby-pamby Physics reformulation: Big things are made of small things). The second way doesn't care about features, it only cares about family connections among data points. The human brain does both: The cortex gets a mass of data and puts it into silos it calls categories, then operates on the categories. But during this process, some parts of the data cannot be categorized and are lost. That's "system two" thinking, in Kahneman's vocabulary. See also Naftali Tishby's "information bottleneck." The other way is associating past data to other past data that came along at the same time. That's what the hippocampus does, and gives rise to "gut feeling," or Kahneman's "system one" thinking. The rest of the wording in this video assumes that everything the brain does can be categorized. But that ain't so. See also the Turing Test. Some parts of humanness cannot be conveyed in ink squiggles or screen blips, i.e.: be categorized. As simple as that. Last analogy: Old chess computers (IBM's Deep Blue) worked with an Evaluating Function, that depends on categories, summed together with weights. Modern chess Neural nets don't care a fig about categories. They just look at past connections of a move in a similar ply associated with ultimate winning. No heuristics, no six features of a ply, just past connections with a bazillion games memory. Or, as Charlie Munger said: The only rules are: Keep doing what works, stop doing what doesn't work. The "Why" isn't part of it. That's old physics/ science. In modern science only the "whether" is relevant. The "why" is half the brain, the cortex. The "whether" is the other half. The hippocampus. You need both.
The world doesn't. It only needs the "whether." See my books, "The Sleuth Investor" and "The Advanced Sleuth Investor," how to take the money in the market of those "investors" who only rely on ink and blips... Cheers, AM
great video!
Bravo! Concise, relevant, and powerful explanation.
As an outside noob: I find the question at 1:53 very funny. They use black-box methods to train models while making them better by throwing more compute at them, and now they want white-box features like explainability :))
Next up: white-box traceability after using statistical methods for learning the information in the first place
It’s more that we have no idea how to even construct a “white box” method to do many of these tasks. So we use statistical learning methods to take a task and its “answer” and backwards calculate a method to get that answer given a specific question.
This creates a usable tool… but we now want to understand the method that was backwards calculated connecting the two data points.
If we could make a method from a baseline understanding of the problem we wouldn’t need an AI algorithm… we would just write the working solution by hand haha
@@tainicon4639 yeah, that's the way I see it too, getting a white-box solution is vastly harder. I don't even think a model would be deployable if it stored all the meta data about all its ideas and where it got them.
Either way, I think that the current state is a cool stepping stone as we now know how the math scales and what it can do
It comes down to the samplers used, whether it's the og temperature, or top_k, top_p, min_p, top_a, repeat_penalty, dynamic_temperature, dry, xtc, etc. New sampling methods keep emerging and shaping the output of LLMs to our liking.
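As a rough illustration of two of the samplers named above, here is a minimal sketch of top_k and top_p (nucleus) filtering in plain Python. It works on an already-computed probability list with made-up values; real inference libraries usually apply these filters to logits and combine them with the other samplers in the list, none of which are implemented here.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize the rest to zero."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
    print(top_k_filter(probs, 2))
    print(top_p_filter(probs, 0.9))
```

Both filters only cut off the tail of the distribution the model already produced, so they shape how adventurous the output is rather than what the model has learned.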
Thanks man
By sharing such fascinating deep insight like this you are frankly doing a great service for humanity, making AI 'cognition' understandable by the lay person - at least at the level of principles. I really think education like this is necessary for any AI user who wishes to get the best results with AI, and helps replace the hype and fear (which IMO is justifiable) with more nuanced understanding.
I have to wonder whether some of the hype and fear problem surrounding AI is down to language: words like "doubt" evoke as much emotion (in a human) as they do a conceptual association. It's at one level the correct word, and yet feeds the fear of the unknown and unknowable by leaving open the question of "who" is doing the doubting (in this case no one... or at least we presume there is no hint of sentience yet emerging in the current or looming wave of LLMs).
In human psychology, developmental psychologist Robert Kegan has studied doubt extensively and believes it to be the substrate of all mental growth (as well as all dysfunction). But doubt in humans cannot be separated from the notion of self, and so also is linked to continuity through time and ultimately to impermanence. Literally speaking, doubt is the "simultaneous holding of a belief as both possibly true and possibly false." The discussion in the video hinges on the pivot one way or the other, between "doubt" (meaning non-belief) and "trust" (meaning belief). Whereas, a human mind uses doubt to cope with the liminal transition from an old self to a new one (ie in Markovian blanket terms if you like, the updating of mental models). All this is to say, doubt is used very differently in humans and our language around both cognitive sciences and AI is lacking in precision in ways that IMO matter more than is realised.
I know how to know if a large language model is lying to you. Just assume it is. It doesn't know what truth is.
Great video. I'm convinced we need to understand the problem. Not convinced that there is an acceptable solution. Scepticism, to use your example, is everywhere and usually nonsense. But sometimes it is justified. How would a language model be able to make the distinction? How would it see a problem from the sceptic's viewpoint? What about the states of mind that sometimes drive scepticism? Scepticism can be interesting and useful, for example in politics.
I think a big priority should be to ensure that people view these models as machines that can be used as tools for research that require some familiarity with the subject matter and further careful reading of original sources. A chisel is good for woodwork, but mind you don't cut yourself.
Very interesting, I love to learn more about AI and especially LLMs, such an alien world that seems to have some of the same features as the brain, just implemented differently
If u want to start, start with MLP neural networks, those are fairly easy to understand
i love seeing talk of how they have different branches they can go down!
this ... feature? kinda makes them like multiverses. i don't see it talked about all that much, aside from people who use base models / "loom" for interacting with them (fascinating stuff btw, if anyone wants to try, it's really cool)
The model during training? OH, you don't want me to encode polysemanticity in individual neurons? I'll just spread it across the connections between neurons instead.
Please make a visual of the top 10 unembedded tokens with their softmaxed weights for *every* word in the sentence at the same time as it flows through the model layer by layer. Or maybe I'll do it. I'd be very, very interested to see :)
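The data behind that kind of visual could be computed roughly as follows. This is a minimal sketch in plain Python under stated assumptions: `residuals[layer][position]` is a hypothetical nested list holding the residual-stream vector at each layer and token position, `unembed` is a hypothetical unembedding matrix with one row per vocabulary token, and the final normalization step that real models apply before unembedding is left out.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_tokens(vector, unembed, vocab, k):
    """Unembed one residual vector and return the k most likely (token, prob) pairs."""
    logits = [sum(w * x for w, x in zip(row, vector)) for row in unembed]
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda tp: tp[1], reverse=True)
    return ranked[:k]


def logit_lens(residuals, unembed, vocab, k=10):
    """Top-k tokens for every layer and every position.

    residuals[layer][position] is a residual-stream vector (list of floats).
    """
    return [
        [top_k_tokens(vec, unembed, vocab, k) for vec in layer]
        for layer in residuals
    ]


if __name__ == "__main__":
    vocab = ["a", "b", "c"]                       # toy vocabulary
    unembed = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # toy identity unembedding
    residuals = [
        [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],       # layer 0, two positions
        [[0.0, 0.0, 2.0], [0.0, 3.0, 0.0]],       # layer 1, two positions
    ]
    for layer_idx, layer in enumerate(logit_lens(residuals, unembed, vocab, k=2)):
        print(layer_idx, layer)
```

With a real model, the residual vectors would come from hooks on each layer's output, and the resulting table of layers by positions is what an animation like the one requested would draw from.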
If our brains were simple enough for us to understand completely, we would be so simple that we couldn't.
When the autoencoder is trained, shouldn't we test different sizes for the concept dimension to discover how many concepts are really being encoded?
I love this. I decided I am going to major in AI.
I think this is a design and engineering choice. If you choose to design your embedding space to be 2403 dimensions without inherent purpose its like mixing 2403 ingredients in every step 60 times and then being surprised that you cannot understand what is tasting like what. I think you need to constrain your embedding to many embeddings of smaller dimensions and to have more control by regularizing them with mutual information against each other.
it needs to be big so you have many parameters for the gradient optimizer to optimize to be able to approximate the "real" function better
@dinhero21 You can have it in the same size, but in different parts. Split 2403 dimensions into chunks of 64 dimensions, and then control for mutual information between the chunks so that different chunks get different representations. This is a hard problem too as the mutual information comparisons are expensive, and I think that the first iteration of the models went for the easiest but perhaps a less explainable way of structuring themselves.
phew u r back
We need to give them affinities towards concepts that cause emotive responses that reinforce connections
Is there a good reason they're only taking fully activated neurons into account here? Did I misunderstand something? In my eyes it seems clear that the entire spectrum from .01 to .99 would be useful to encode meaning, so I have to figure they'd be examining that, or there's some reason I don't know why they don't.
For the purposes of the lecture this is not so important in my opinion. In a standard ML system, including one that trains or runs an LLM, it is theoretically relevant but the weights would seem, I think, to compensate for differences in the activation function of individual neurons. Perhaps for the dynamical response of a real time system the situation would be different.
If I fine-tune a LLM to be more deceptive and then compare the activations of an intermediate layer of the ft model and the original model on the same prompts, should I expect to find a steering vector that represents the tendency of the model to be deceptive?
If that's the case, we can just "subtract" the deceptive vector from the original, alignment solved
most probably not, parameters can't possibly work linearly like that, since there always is a non-linear activation function.
it may work locally though, since parameters should be differentiable.
@@dinhero21 yeah, that was also my concern. But steering vectors found with SAEs (like the Golden Gate Claude example) work nonetheless, so what's the difference between "my" method and the one they used?
@@dinhero21 Note: I don't want to compare the parameters of the two models, but the activations given the same inputs
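The activation-comparison idea in this thread can be written down as a difference of means. The sketch below is plain Python on made-up numbers and only shows the arithmetic: it assumes you have already collected one activation vector per prompt from the same intermediate layer of both models, and the function names are hypothetical. Whether the resulting direction really captures "deceptiveness", or behaves linearly when added or subtracted, is exactly the open question raised in the replies.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def difference_of_means(acts_finetuned, acts_base):
    """Candidate steering direction: mean(finetuned) - mean(base) on the same prompts."""
    mu_ft = mean_vector(acts_finetuned)
    mu_base = mean_vector(acts_base)
    return [a - b for a, b in zip(mu_ft, mu_base)]


def steer(activation, direction, alpha):
    """Add alpha * direction to one activation vector (negative alpha subtracts)."""
    return [x + alpha * d for x, d in zip(activation, direction)]


if __name__ == "__main__":
    # Made-up 2-dimensional activations for two prompts.
    acts_base = [[1.0, 0.0], [3.0, 0.0]]
    acts_finetuned = [[1.0, 2.0], [3.0, 4.0]]
    direction = difference_of_means(acts_finetuned, acts_base)
    print("direction:", direction)
    print("steered:", steer([1.0, 1.0], direction, alpha=-1.0))
```

Note that this compares activations on the same inputs, as the last reply specifies, not the parameters of the two models.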