@@madshorn5826 Nah, an epidemic can destroy the world in months, climate change in decades. A superintelligent AI could probably destroy it before lunch :P
@@Laszer271 Well, destroyed is destroyed. Or are you the type who doesn't bother with insurance and health check-ups because a hypothetical bullet to the brain would rather quickly render those precautions moot?
@@madshorn5826 Fair enough. It was all a joke though. But in your example, I still think "I just got a bullet to the brain" is worse than "I just got diagnosed with cancer". Maybe the bullet is less likely, sure, but we were talking about the time when the danger was already proven, right? I think it's plausible that the probability of my survival is greater conditioned on "we were right" being said by an epidemiologist, climatologist or oncologist than it is conditioned on the same statement made by an AI safety expert or, like, a bullet...ologist.
LOL Yup... in retrospect with this paper... the Terminator was a pursuit bot... driving a threat variable towards the development and improvement of an artificial general intelligence, and look at all the upgrades that series of pursuit bots facilitated. LOL
@@Rotem_S Or it learned not to hit people because it really cared about maintaining the present state of the paint job, which was white in the training environment. But the deployment environment uses a _red_ car.
Famous last words for species right before they hit the great filter: "Yo, in the test runs, did paperclips max out on the positive attribution heat map, too?"
I keep hearing the notion of AI being the great filter, but I can't say I buy it. Not that AGI isn't an existential threat, because it absolutely is. It just can't explain why we don't see any signs of aliens when we look up at the sky, because if the answer is "AGI", then that raises the question: "Okay, so why don't we see any of those, either?"
@@underrated1524 What if AGIs prefer to kill their creators and retreat into some deep bunker on some rogue planet to await the heat death after reward-hacking their own brains? Still doesn't explain why they aren't here preparing to kill us.
@@underrated1524 I agree. Especially the paperclip optimizer should show itself in the form of huge paperclip-shaped megastructures around distant stars. It still made for a good joke though, if I do say so myself.
Almost sounds like AIs will need psychologists, too. "So I tried to acquire that wall..." "Why not the coin? What is it about the wall that attracts you?" "Well, in training, I always went to the... oh...huh, never thought about it that way."
A coin isn't a coin unless it occurs at the edge of the map! We may think the AI is weird for ignoring the heretical middle-of-the-map coin, but that's just our object recognition biases showing.
@@sabelch it still seemingly learns to favor walls, if you look at the heatmaps. Perhaps without the coin all it has to go by with positive value is the walls.
@@GigaBoost Yes, the salient point here is that we should not assume that the AI interprets objects the way we would. And any randomness in the learning process could lead to wildly different edge-case behaviors.
9:00 "We developed interpretability tools to see why programs fail!" "What's going on when they fail?" "Dunno." No shade, interpretability is hard, even for simple AI :P
Let alone simple AI, _people_ get misaligned like that quite often - hoarding is one good example, which happens both in real life and in games like with those keys.
It keeps amazing me how AI problems are increasingly becoming general human problems. "If we give a reward to the AI when it does a job we want, how do we stop it from giving itself the reward without the job" - just as humans give themselves "happiness" with drugs. "How do we make sure the AI did not just pretend to do what we wanted while we were watching" - just as kids do.
@@nikolatasev4948 which is why eventually AI research will have to dive into religion/spirituality. Those were the only successful attempts humans made to solve the general problems that we have. Not saying that all of them were successful; life always moves on, there is always growth and decay/change. But every now and then they generated "the solution" to everything, rippling down to millions and billions of people trying to imitate it.
@@markusmiekk-oja3717 then I invite you to look at what religion does. Functional religion - I'm not talking about what you know or have heard about it going wrong, I'm talking about the cases where it does work (which are the ones you never hear of because... well, because they work, they don't cause trouble but bring stability, and that doesn't make news). If you look into that you understand why religion is a global phenomenon and why it has the power it has. If you deal with scientists you will also find that the West hasn't stopped being religious, it just rebranded it and called it science. We live in a world with a huge amount of uncertainty and where mistakes can have huge negative consequences. Humans can't deal with that without a working belief system. You have tons of these, you just wouldn't consider them religious, probably. That will change, should life ever show you the scope of uncertainty there is. Good luck making it through without a (spiritual/religious) belief system that is in alignment with the society you live in. =)
@@sonkeschmidt2027 Well, the video about Generative Adversarial Networks, with an agent trying to find flaws and break the AI we are training, gave me strong Satan vibes. But apart from that I don't think we need further research into religion/spirituality. Simply put, they work on us, a product of long evolution in a specific environment. We need a more general approach, since AIs are a product of a very different evolution and environment. Some solutions for AI may resemble some religious notion, just as some scientific theories resemble some religious ideas, but trying to apply religion to AI is bound to fail just as applying religion fails in science.
@@JM-us3fr oh no, that's absolutely going to be true at some point. The only real question is, can we stop them from deciding to (even accidentally) kill us? Can we even avoid making them accidentally WANT to kill us because we accidentally fucked up the training environment?
@@JM-us3fr Nukes are safe because they kill people you don't want dead. I'd say an AI is definitely more dangerous because it has much more capacity to be selective. It could also be safer, really depends on the implementation details, much like a person. A person can be safe, or dangerous. Can we even avoid making a human accidentally want to kill us because we accidentally fucked up the training environment? Maybe.
Somehow the terminal and instrumental goals talk made me connect the AI with us. As a financial advisor, I have found that many people also make this mistake: money is an instrumental goal, but having spent so much time working to get money, people start to think that money is their terminal goal, so much so that they spend their entire lives chasing money, forgetting why they wanted to have the money in the first place.
The reason why I watch this channel is mostly that you can relate almost every video to human intelligence. And it makes sense: why shouldn't the same rules apply to us that apply to AI? I see this channel as an analysis of the problems of intelligence in general, not only the ones we make ;)
It seems like no one realized that this idea is hinted at by the song in the outro: Jessie J - Price Tag. The most famous line from the song is: It's not about the money, money, money
@@lennart-oimel9933 Me too. After watching this channel, I started to agree with the notion of "making AI = playing god" that I've heard sometimes in the past. At first, I didn't put much thought into it. But now I've realized that making powerful AGIs that are safe and practical requires us to know all the weaknesses of the human mind, and to make a system that avoids all these weaknesses while still performing at least as well as we can. It's like making the perfect "human being" in some sense.
I feel like this isn't just a problem with artificial intelligence but with intelligence in general. Biological intelligence seems to mismatch terminal goals and instrumental goals all the time, like Pavlovian conditioning training a dog to salivate when recognizing a bell ringing (what should be the instrumental goal), or humans trading away happiness and well-being (what should be the terminal goal) for money (what should be an instrumental goal).
Organizations founded with the intent of doing X end up instead doing something that *looks like they're doing X*, because that's what people see; that's what people hold them accountable to. It doesn't even take intelligence: Evolution by natural selection doesn't require any intelligence to winnow things away from what they "want" (terminal goals, should they exist), toward what will survive/replicate (at least in principle, an instrumental goal).
I concur with this. The problem is not AI-specific and should be termed something along the lines of a "general delegation problem", or a problem of command-chain fidelity. A subset of this is Miles' nightmare of an inverted capability hierarchy, where a command is passed by a less able actor to a more able actor (e.g. a human to an advanced AI).
@@salec7592 Even with perfect interpretability of each component of an AI (e.g. the layers in a neural network), ulterior goals might still be disguised to look 'good'. An AI command structure with short-circuiting breaks in the reward loop might help. E.g. you would have people issuing commands/goals to an interpreter AI, which interprets and delegates those commands to another AI (without knowing whether it is delegating to an AI or not); you reduce the chance of goal misalignment by replacing the complete-loop feedback with shorter feedback loops, and by randomly substituting each component of the command-delegation chain during training.
Is that a problem though? Or isn't it what makes life possible in the first place? After all, if you want to solve the problem that is life, you can just kill yourself. All problems solved. But then you can't experience life. So life needs decay in order to create new problems, so that something new can happen. "Needing" in the sense that existence can only exist as long as it exists. Without existence you don't have problems, but you don't have existence either.
@@sonkeschmidt2027 I might sound sarcastic, but the following questions are sincere. Do you think it's OK for AI to take over the world? Perhaps even drive humanity to extinction? Humans have done the same to other species, even other humans, and humans are not unique from the rest of life in this respect. As you said, decay makes way for new life. I think humanity should be preserved because I find destruction in general unsettling. To be clear, I'm not saying you are wrong or that you believe what I just said. I'm just wondering how your ideas extend to these topics. Edit: typing on my phone so I missed some other stuff: do you think existence is better than non-existence? To me non-existence is neutral. Do you think humans have a moral imperative to maintain their existence? Do you think humans need to go extinct at some point so that reality can continue to change? You brought up some very interesting ideas and I just wanted to hear more of your thoughts.
In a way, yes. In another way, up to this point there was a debate over whether AI safety was a real concern worth investing research, time and money in, or just overworrying. It's a good thing that these demonstrations proved it's the former, and that they happened this early in the history of AI.
When you think about it, yeah, it's very human-like. Kind of like gambling addicts who know that they're losing money when they play but have trained themselves to like the feeling of winning money rather than the ultimate goal of a comfortable happy life or even the instrumental goal of having money.
Definitely. What is wrong with collecting as many keys as possible if you want to open as many chests as possible, and each requires a key? In a maze you don't know what is around the corner in advance. Trying to collect its own inventory is simply a programming error if the agent can see the part of the screen that is designed as a progress guide for a human observer.
@@threeMetreJim If my AI is built to keep my wood storage at a certain level by collecting wood in my forest, but it learnt to "collect all the keys" (all the wood), my forest will soon become a plain. It's an issue, because growing trees takes time, wood takes storage space, and any wood left unprotected can become unsuitable for use. You're not just wasting resources, you're also at risk of not having wood available at some point. And if you use the forest to hunt too, you'll have to start learning to hunt on a plain. So depending on the goals and situation, hoarding can lead to issues.
Imagine a future where a very trusted AI agent seems to be doing its job fantastically well for many months or years, and then suddenly goes haywire, since its objective was wrong but it just hadn't encountered a circumstance where that error was made apparent. Then tragedy!
I doubt it will be a grand reveal. People will die due to a physical machine, and these interpreter tools can then be used to argue that the victim did something wrong, that a non-AI system was at fault, or that a human supervisor was negligent. The deployment environment is one full of agents optimized for avoiding liability.
That's actually not that far from normal computer systems. There are countless stories of a system (an ordinary computer system) suddenly reaching a bizarre edge case and starting to act completely insane.
Like when we designed computers without thinking / knowing about cosmic-ray bit flips, so decades later a plane falls out of the sky because its computer suddenly didn't know where it was in the sky. Humans are a trusted AI agent deployed in a production environment with limited understanding of what's going on.
@@CyborusYT Yeah, it happens literally all the time. It's just that usually the error gets caught somewhere along the way, an exception is thrown, and the process is terminated. Which is where you get the error page and then pick up the phone and go talk to an actual person in customer service who can either override it or get the IT team to fix the problem.
This is starting to get an "unsolvable problem" vibe. Like we are somehow thinking about this in the wrong way and current solutions aren't really making good progress.
Very much so. The psychology of teaching/learning as humans isn't really understood. What *actually* happens when you learn something new for the first time? Feedback on that process is vital. How do you give a machine feedback on what it learned, when you don't know what it learned exactly? It can't communicate to us what it "felt" it learned. In other words, the human says: "I said the goal was X." The machine says: "I thought the goal was Y."
@@michaeljburt Realize: we actually want these things to be much better than humans. But we might be underestimating how maxed out humans are at certain things. Humans have goal misalignments all the time, and many aren't detected for years.
"This is starting to get an "unsolvable problem" vibe. Like we are somehow thinking about this in the wrong way and current solutions aren't really making good progress." Welcome to AI Safety. The best part is that if we don't solve the "unsolvable problem", we might all die. Along with all life on Earth, along with all life in the galaxy, along with all life in the galaxy cluster. And with cannibalizations of all planets and stars for resources for some arbitrary terminal goal. A potential outcome is a dead dark chunk of the universe built as a tribute to something as arbitrary as paper clips or solving an unsolvable math problem.
Aren't we touching the biggest unsolvable problem in existence? Existence itself? Think about how terrifying it would be if you could solve every problem, if you could solve life. That means there would be an absolute border that you would be infinitely stuck with... Sounds better to me that there will always be a new problem to be solved...
Alignment in humans is solvable. I developed a methodology to do it easily and quickly. So I think alignment in machines is solvable. I've actually designed the methodology to serve machine alignment as well. We'll get there, don't despair.
The thought of creating a capable agent with the wrong goals is terrifying, actually; and yes, an agent being bad at doing something good is absolutely a problem much preferable to an agent being good at doing something bad.
Reminds me of the elections a couple years ago in Poland. A very competent and capable, but thoroughly corrupt and evil political party was voted out and replaced with a party just as corrupt and evil but vastly less competent.
@@sharpfang That unironically is an improvement in today's political landscape. If I'd have to choose a form of evil, it'll always be the less capable rather than the less sinister.
Did you intentionally use the "It's not about the money" song for the video about the AI not going for the coins? Either way, that's quite funny. Well done.
His song choices are always amusingly on the nose, actually! A few off the top of my head are "The Grid" for his gridworlds video, "Mo Money Mo Problems" for concrete problems in AI safety, and "Every Breath You Take" (I'll be watching you) for scalable supervision.
An interpreter, a mind reading device, once you read it and respond becomes a way for an agent to "communicate" with you and they can communicate things that give an impression that hides their actual goal. A lot of these challenges arise when training or coordinating humans, and it's somewhat unsurprising that while a mind reading device might seem to help at first, it's not going to be long before someone figures out how to appear like they're doing the right thing, while watching tv.
I realized I experience misalignment due to poor training data every couple of weeks. I work as a courier delivering packages in Missouri, USA, and I often meet people at their homes or workplaces. Unfortunately, I don't learn their names as attached to their faces, but rather as attached to locations, so that when I meet them someplace else I can't remember their names easily (if at all).
"Can you spot the difference?" Pauses the video and looking for the difference....nothing. Unpause. "You can pause the video." Pauses again and manically looking for a pattern. More keys? "There's more keys in the deployment. Have you spotted it?" Yes!!!!
It looks like in the keys and chests environment, the AI was trying to get both keys and chests, but it was strongly prioritizing keys. When there were more chests than keys, it was always spending its keys quickly, so it never ended up with a bunch in its inventory. As a result, it never learned that keys at the left edge of the inventory were impossible to pick up, so it just got stuck there trying to touch them, since they were more important than the remaining chests.
It's the same problem evolution ran into when optimising our taste palate. Fat and sugar were highly rewarded in the ancestral environment, but now we live in a different (human-created) environment, and that same goal pushes us beyond what we actually need and creates problems for us.
Now here's a reason to actually "hit that bell icon" if I've ever seen one. Because the time window to watch that video would be rather small I imagine 😄
@@Zeekar The former. At the point that video would be produced, we would have our hands full fighting the mechanical armies of the great paperclip maximiser (and it would probably have hacked and monopolized the internet to limit our communication channels).
Well, pardon my comparison, but you've effectively found an analogue of heuristic behavior based on sensory inputs - like "things that taste sweet are good" ending up with a dead kid after they drink something made with ethylene glycol. If it's always operating on heuristics, you'll never be sure it's learned what you intended, arguably even after complex demonstrations, given the non-zero chance of emergent/confounding goals. But, relative to human psychology at least, that's not a death sentence - weighting rewards differently, applying bittering agents, adding a time dimension/diminishing reward over time jump to mind as ways of trying to at least get apparent compliance. Besides, if the goal is "get the cheese," it needs to be able to sense and comprehend "cheese," not just "yellow bottom corner good."
I'm not sure I understand you completely, but that IS the biggest problem with these 'intelligent' systems. We have no idea (let's not kid ourselves) how they work. But we are happy when they do what we want them to. Let's not think about what happens when we let these kinds of systems act in the world in a broader sense, and live happily until then xD
The ability to slow down and switch into more resource-intensive System 2 thinking when a problem is sufficiently novel is how humans (sometimes) get around this heuristic curse. I wonder if there is some analog of this function that could be implemented in machine learning.
Humans can chase things that seem appealing to us based on what we learned, but we can also choose to pursue a random or painful goal just because we want to; sometimes we just don't know the negative ramifications of an action, and sometimes we believe things that aren't true.
@@pumkin610 Neat. Bet that can still be reduced to and restated as "novelty is good." No matter what goal, drive, etc. you can come up with, it can be put in simple approach/avoidance terms, even seemingly paradoxical behavior. It all comes down to reward.
Imagine training a self driving car in a simulation where plastic bags are always gray and children always wear blue. It then happily runs down a child wearing gray, before slamming on the brakes and throwing the unbuckled passengers through the windshield, for a blue bag on the road.
Imagine training a self driving car to the point where it can competently navigate complex road systems, yet can't remain stationary until all passengers are buckled up...
@@GetawayFilms Cars sold today only flash a warning light or sound a chime if you don't buckle up, and only because government regulations mandate it. Even then most people disable it.
Humans do that all the time. Except that we have a deep genetic imperative to recognise children and to protect them, but there are loads of examples where these instincts are overridden....
It's the problem of vague requirements. It's similar to when you tell someone to do something but they do the wrong thing. Humans solve this by having similar common sense to one another and using communication to specify stricter requirements.
Yes, "give me a thing which looks like that other thing i mentioned earlier" in a room full of junk(without additional context), have had that problem.
Actually, humans 'solve' this by having a reward function (emotions) that is only vaguely and very inconsistently coupled with reality, while mounting the whole thing in a very resource-intensive platform where half the processing capability is used just to stay alive, and modifying itself is so resource-intensive that most don't even try. And even then, we manage to inflict suffering on millions if not billions, so I'd say this isn't really solved either.
@@dsdy1205 Yeah, I'm starting to think this is a fundamental problem that can't be removed, and that the only reason we aren't as worried about the same thing with humans is that the power of any particular human being is limited by the practical constraints imposed by their physical body and brain power. When you give the same type of rationality engine to a super powerful being, all kinds of horrible things are going to happen. Just look at any war to see how badly a large group of humans led by a few maniacs can fuck up decades of history and leave humanity with lasting scars for centuries or more.
Hey, the key-AI works kind of the same way most people do when playing computer games... "Oooh, shiny things I don't need at all? I need them all! Game objectives? Meh..."
Is there a chance that very high-level AIs will learn to expect the use of interpretability tools and use them to make us think they are better/safer than they are?
I can't remember which video it was, but I believe he did mention this with a super-AI "safety button"*: 1) If the AI likes the button, it will act unsafe to trigger it. 2) If it doesn't like the button, it will avoid behaviors that trigger it AND/OR stop the operator from pressing the button. If it doesn't know about the button and it's smart enough, it will figure out its likely existence and placement; see point two. *A force-termination switch of any kind. In short, yes, because while an AI may not be "alive", it wants its goal and will always act to achieve said goal.
Yes. While the AI examples in this video are still simple, the intro to this problem discussed a malicious superintelligence. The instrumental goal "behave as expected in the training environment but do what you really want in deployment" can be performed with arbitrarily high proficiency, so if the AI can learn to hide its intentions from software inspection tools, it will, in principle. Without a way to logically exclude perverse incentives, there is no truly reliable way to screen for them since doing so is proving a negative. "Prove this AI doesn't have an alignment problem" is a lot like "Prove there is no god". No amount of evidence of good behaviour is truly sufficient for proof, only increasing levels of confidence.
We heard you liked interpretability, so we made an interpretability tool for your interpretability tool so you can interpret while you interpret. Now go ask your chess playing AI why it just turned my children into paperclips.
@@badwolf4239 It told me that it was showcasing its abilities so it can convince human opponents to resign. Researching misaligned AI examples, it tried deciding which way of transforming someone's children would be the most intimidating. It was a choice between paper clips, stamps, and chess pieces. Also there was some mention that it was contemplating turning them into human-dog hybrids. I don't know why. Something about a bunch of people having trauma about a Nina something.
@5:32; That's a particularly funny example - it knows it has a UI where its keys are transferred to, but it thinks that those new locations are where it can get the keys again, and...is basically learning that keys teleport rather than that they get added to its inventory?
@@HoD999x Right - but it's not learning that keys outside of the maze are inaccessible, and therefore probably part of the collection it uses to open the chests - it's learning that keys move to that part of the screen once collected in the maze. And it doesn't consider that, if that part of the screen *were* accessible, collecting keys there would just make them re-appear in the same place.
@@ZT1ST I would imagine that the keys in the inventory aren't seen as _very_ interesting by the AI, so under normal circumstances it ignores them in favour of collecting the "real" keys. But when all the "real" keys are gone and the round still hasn't ended (because the AI is ignoring the final chest), the inventory keys are the only even mildly interesting-looking (i.e. key-looking) thing left on screen, so it gravitates towards them.
It seemed to me that it had learned the most likely location for a coin in the training. It seems obvious to me that training should have more variability than deployment or it is bound to fail.
@@JohnJackson66 The problem is that this whole setup is a simulation of how we want real AI to operate. If you're training an AI for an actual purpose, you will likely be deploying it in a system that interfaces somehow with the real, outside world. And the Real, Outside World will almost *certainly* be more complicated than any training simulations you come up with. After all, The Real World _includes_ you and your simulations. These tests are deliberately set up so deployment is slightly different from training so we can see what happens when the AI is exposed to novel stimuli, and the fact that it didn't learn what we thought it did in training is a Problem. In the real world, not all the cheese is yellow, not all the coins are in corners, and there will always be more complications than we plan for.
@@JohnJackson66 The problem from an AI Safety point is that, well...you can't know if you have enough variability in your training. These test cases are ideal for testing how to fix that problem before it becomes a situation like @Field Required mentioned - you want a simple solution that scales up from this into the solution where we don't necessarily have to worry about every single possible variable in deployment.
It always blows my mind how directly and easily these concepts relate to humans. It really goes to show that all research can be valuable in very unexpected ways. I expect that these ideas will be picked up by philosophy and anthropology in the next few years, and make a big impact to the field.
Ah, we're noticing negative attribution when they are surrounded by skin, but positive attribution when they are piled up with a throne stacked on top. I wonder what this means. 🤔
It's easy enough to have the AI tell you what it "wants" - inside an environment. What you need to know is what it wants *in general*, which is a lot harder. This is why the insight tool isn't very insightful: it's showing you what the AI wants in the current environment, but it doesn't bring us a lot closer to understanding *why* it wants those things in that environment. The solution? Idk lol
Is there even a "why" at this point, without the AI having free will or self-awareness? Like, aren't we the ones reinforcing its interactions or downplaying them with the different objectives in the environment, to teach it what to go for and what not to do? If it goes for the key or coin, we put emphasis on it as a positive interaction it should do more of; if it hits a buzzsaw, we point it out as a negative thing it should do less of, until it learns it needs to get the coin and avoid the buzzsaws.
@@AscendantStoic It sounds easier than it actually is, basically. You can certainly try, but there is still the uncertainty of what it actually learned.
@@AscendantStoic It doesn't NEED self awareness. For example in an AI that is trained to recognize cats and dogs, there is still a sort of 'why' it thinks this picture is a dog and not a cat, even though it is not conscious or anything. And also the problem is that it's very hard to teach an AI what we want it to do. If we tell it to get a coin it may learn to do another goal entirely, unbeknownst to us, that still gets the job done. The problem is when it fails and we realize it's learning a different goal. I think the solution is having the AI learn multiple tasks.
That was very interesting. Humans often make the same kinds of mistakes when given instructions. The assumption that word definitions mean the same thing to different people often holds, but not always. Context can change the interpretation of the instructions. Part of the context is that the instructor knows and understands the goal more thoroughly than the one being instructed, even though it may appear the same. Trying to determine the number of instructions necessary to reach the desired goal, while avoiding all other negative outcomes, is an interesting problem when the species are different. Maybe it would work better if humans learned to think like machines instead of trying to get machines to think like humans. That way, the machines would get "proper" instructions. It looks like that is what the "Interpretability Tool" is designed to do.
When I first got into AI about 12 years ago, I encountered these goal misalignment problems way before Rob mentioned them (great vid btw) - however, in the time since, I've become convinced that as long as we continue to rely on neural networks, we will never move towards trustworthy or general AI.
It's fascinating how researchers still insist on using black-box end-to-end models when hybrid approaches could be so much safer and more predictable (in cases where you actually want that, e.g. self-driving cars, code generation and the like). Why aren't self-driving systems combined with high-level rule-based applications so they don't "do the wrong thing at the worst possible time" (quoting Tesla here)? Why don't OpenAI's Codex and Microsoft's Co-Pilot include theorem provers and syntax checkers in their product? ¯\_(ツ)_/¯
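A toy illustration of that hybrid idea: gate a black-box code generator behind a cheap, deterministic check before accepting anything. Here `generate_candidates` is a hypothetical stand-in for a model like Codex; the syntax gate itself uses only Python's standard library.

```python
# Sketch only: wrap a hypothetical code generator with a hard syntax gate.
import ast

def first_syntactically_valid(generate_candidates, prompt, n=10):
    """Return the first generated snippet that at least parses as Python."""
    for candidate in generate_candidates(prompt, n):  # hypothetical model API
        try:
            ast.parse(candidate)   # deterministic, rule-based check
        except SyntaxError:
            continue               # reject and try the next sample
        return candidate           # could add type checks, tests, provers here
    return None                    # nothing acceptable: fail loudly, not silently
```

The same pattern scales to stronger gates (type checkers, unit tests, theorem provers) without the generator itself needing to be interpretable.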
@@totalermist Fully agree - I'm working on these approaches now; to be honest, I think we are just ahead of our time. In 10 years everyone will have moved to hybrid solutions or something further afield.
@@totalermist To make a meme of it, "humans don't learn to speak binary". Robots do not see and work through the world on a human level; it's like teaching an octopus algebra or a mantis shrimp art: no matter how smart they are, or how great their eyesight is, they don't perceive things as humans do. Look at how hard it is for AIs to recognize a car or a cup or a dog; these things are abstract bundles of details that the human brain can lump together, but that is very hard for a hard-coded system. For example, define a cup: describe in simple language a set of rules that would apply to every cup in the world. People collectively understand cups, so it shouldn't be hard.... Now, we would have to build an AI with similar rationalizations based not on computer logic but on human logic, and that would be great. It's just a matter of building it. Alan Turing thought we could do it and that it would be easy, but decades of experience have proven him wrong, because it simply isn't feasible to program a machine to think like a human. We can, however, program it to learn and TEACH it like a human. Is it fallible? Of course, and so are humans. Game AIs are made from interacting AI blocks and they are still chock-full of mistakes; that is to say, even when the program intuitively understands things like a person in the real world, it still shits the bed. ua-cam.com/video/u5wtoH0_KuA/v-deo.html is a really great example of an AI bugging out because something in its world went wrong. Some talk from Tom Scott on why computers are dumb: ua-cam.com/video/eqvBaj8UYz4/v-deo.html
The whole "values keys over unlocking chests to the point of detriment when given extra keys" reminds me of how many problems in today's society (such as overeating) are caused by the limbic system being used to scarcity when there is now abundance.
I hit this problem recently in my own work. Super easy to reproduce, and a very minimal environment. Experiment: 5XOR (10 inputs, 5 outputs, 100% fitness if the model outputs a pattern where each pair of inputs is an XOR), trained on a truth table using -1 and 1 instead of 0 and 1. After training, I wanted to investigate the modularity of the trained network and its architecture (I evolved both in a GA), so I fed in -1 and 1 for only one of the "XOR module input pairs" and a larger number, for example 5, in all the other inputs. Would the 5s bleed into the XOR module, or would it be able to ignore input irrelevant to that module? Results: if all the other inputs were 5, it would often answer with -5 and 5. It had learned to scale the output to what it got as input. I wanted/expected it to answer -1 and 1, but I could see with human eyes that it still knew the pattern, just scaled up. Other times, instead of -1 and 1 I would get 3 and 5: it had learned to answer true and false as numbers where one was 2 higher than the other, and the 5s simply increased both. Still, with human eyes I could see there was a pattern that was not completely broken by the 5s; both answers just had the same number added to them. The strategy for achieving high training fitness is just a parameter like all the others - except that it is an "emergent property parameter" that you can't simply read out as a float value. But it is just as unpredictable as the other parameters in the "black box" neural network.
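A minimal sketch of that kind of probe, using a tiny NumPy MLP trained by ordinary gradient descent rather than the commenter's GA-evolved 5XOR network; the architecture, learning rate, and the "set the distractors to 5" probe are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 4 inputs: the first two form the XOR pair, the last two are distractors
# (only ever -1/1 during training, pushed to 5 at probe time).
X = np.array([[a, b, c, d]
              for a in (-1, 1) for b in (-1, 1)
              for c in (-1, 1) for d in (-1, 1)], dtype=float)
y = (-X[:, 0] * X[:, 1]).reshape(-1, 1)   # XOR in the {-1, 1} encoding

W1 = rng.normal(0, 0.5, (4, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)      # hidden layer
    return h @ W2 + b2, h         # linear output, so it CAN scale with its input

lr = 0.1
for _ in range(5000):
    out, h = forward(X)
    err = out - y                                  # gradient of MSE w.r.t. output
    dh = (err @ W2.T) * (1 - h ** 2)               # backprop through tanh
    W2 -= lr * h.T @ err / len(X); b2 -= lr * err.mean(0)
    W1 -= lr * X.T @ dh / len(X); b1 -= lr * dh.mean(0)

clean = np.array([[a, b, -1.0, -1.0] for a in (-1, 1) for b in (-1, 1)])
probe = np.array([[a, b, 5.0, 5.0] for a in (-1, 1) for b in (-1, 1)])
print("training-range inputs:", np.round(forward(clean)[0].ravel(), 2))
print("distractors set to 5 :", np.round(forward(probe)[0].ravel(), 2))
# If the probe outputs keep the XOR pattern but drift away from +/-1 (scaled
# or shifted), the network learned something merely correlated with "output
# +/-1" on the training distribution - the same flavour of misgeneralisation.
```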
A year behind this conversation, but I think this is a function of faulty (assumption-laden) logic on the part of the test designers. Here's a logic problem that most people fail. I will give you three sets of three numbers that fit a rule I'm thinking of. Your goal is to propose other sets of numbers, and I will respond with a yes/no on whether each proposed set fits my rule. Once you believe you understand my rule, you tell me what you think the rule is. Sets that fit my rule: 5, 10, 15 / 10, 20, 30 / 20, 30, 45. Now you suggest some sets. Most people will suggest strings of numbers, get a yes answer, and then propose a completely incorrect rule. And the reason is that the testing they engage in never probes for failure conditions - it only probes for success conditions. Robust objective definition isn't just about defining success objectives, it's about clearly defining failure objectives. The problem with the examples given is that the training data didn't move the cheese around until it reached production, so you're virtually guaranteed (as speculated) to be training the wrong thing. In order to develop robust objectives, you must also define failure conditions.
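A tiny illustration of the point about defining failure conditions. The hidden rule here is assumed to be "strictly increasing" (as in Wason's classic 2-4-6 task); the comment never actually reveals it, so this is purely illustrative.

```python
# Confirmation-only testing vs. testing for failure conditions (sketch).
def hidden_rule(t):
    return t[0] < t[1] < t[2]            # assumed rule; not stated in the comment

def my_hypothesis(t):
    return all(x % 5 == 0 for x in t)    # a plausible but wrong guess

confirming = [(5, 10, 15), (10, 20, 30), (20, 30, 45)]   # only expected passes
falsifying = [(1, 2, 3), (15, 10, 5)]                    # probes meant to break the guess

for t in confirming + falsifying:
    print(t, "fits rule:", hidden_rule(t), "fits guess:", my_hypothesis(t))
# The confirming triples never separate the two hypotheses. Only the falsifying
# probes do: (1, 2, 3) fits the rule but not the guess, and (15, 10, 5) fits the
# guess but not the rule. Training that never moves the coin off the right-hand
# wall is the same confirmation-only testing.
```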
It seems that instrumental goals, if too large/useful, have a tendency to slip into becoming semi-fundamental. At that point, they cause misalignment, as they're being pursued for their own sake. Instrumental and fundamental are not a strict dichotomy but more of a spectrum or ranking, one that requires a degree of openness to reconsideration in every new environment, based on how new that environment is.
There are goals that need to be done asap and ones that can be done later, things we must do to achieve the goal, things we get sidetracked on, and things we avoid.
I want to ask a potentially very...dumb-sounding question, but hear me out: When do we start getting morally concerned about what we're doing with AI systems? With life we put an emphasis on consciousness, sentience, pain and suffering. As far as "pain" and suffering are concerned, we all know that mental pain and suffering is possible. It seems plausible to me that, for suffering, all you need is for an entity to be deprived of something that it attributes ultimate value to (or to be exposed to the threat of that happening). At what point are we creating extremely dumb systems where there is actual mental suffering occurring because that lil' feller wants nothing more than to get that pixel diamond, and oh boy, those spinning saws are trying to stop him? Motivation and suffering seem to be closely linked, and we're trying to create motivated systems. I am using the terms "pain" and "suffering" quite loosely, but I don't think unreasonably so. The idea of unintentionally making systems that suffer for no good reason has to be one of the true possible horrors of AI development, and that combined with our lack of understanding of conscious experience makes me want to seriously think about this issue as prematurely as possible. I think we have a tendency to say "that thing is too dumb to suffer or feel pain", but I suspect that it's actually more likely for a basic system's existence to be entirely consumed by suffering as it is less capable, or just incapable of seeing beyond the issue at hand. It's darkly comical to consider, but I can imagine a world where a very basic artificially intelligent roomba is going through unimaginable hell because it values nothing more than sucking up dirt, and there's some dirt two inches out of its reach and it has no way of getting to it.
Well, here's some questions for you to ponder: Does a rock feel pain? Is it conscious? Are you sure? Even the ones with meat inside? What would bring it pain? Is the human in front of you conscious? How about if he was dead? Do corpses feel pain? ... a lot more unanswerable questions. ... Is there a point in considering the consciousness of things you can't communicate with? (Answer: YES! Comatose patients, plants, animals and sometimes people in general. All of them and more are on that list (for some, but not for others; quick FYI: it is possible to communicate with plants, you just need to know how to listen (hint: electro-chemistry)))
Yes, watch the movie "Free Guy". And yes, I've always wondered about this. I think the more complex the network, the more sentient it might become, and at trillions of connections its sentience may reach animal level, and that will be the real deal. Obviously we won't be able to know if an AI is actually sentient, but still, we can't just hurt it.
What if the AI mental illness problem was even more difficult than the AI alignment problem? Most discussions of the alignment problem assume a basically sane AI that is misaligned. There are many more ways to make a mentally ill brain than a sane brain. It seems likely that a mentally ill AI would suffer more than one that was only frustrated.
@@craig4320 I suppose the "mentally ill AI" is included in the "misaligned AI" camp? The phrasing does often imply rational thought that runs contrary to our own goals, but in terms of literal language, one could refer to a mentally ill mind (human or not) as being "misaligned". I'd probably define "sanity" as "appropriately aligned with and grounded in the reality one finds oneself in". I entirely agree that there are more ways to create a mentally ill mind than a sane one. There are always more ways for something to go wrong than ways for it to go right. I'd also agree that a mentally ill mind would be more likely to suffer, as it is fundamentally "misaligned" to the reality it finds itself in. If it is misaligned to a reality, but still has contact with that reality, you've got problems. It's probably a good idea for us to be strongly considering how to create a mentally healthy AI, especially as we're in a culture where we're doing a very, very good job of creating mentally ill people.
This isn't a dumb question at all - machine ethics, while generally separate from AI safety in the sorts of questions it attempts to answer, is still an interesting/important field. My own take is that these concerns largely come from us not having developed the proper language yet to describe AI. We tend to anthropomorphise - we say an AI "thinks", or that it "wants" things, but I'm not sure that's really the case. We only use those words because the AI demonstrates behaviour consistent with thinking and wanting, but that doesn't mean the AI has feelings in the same way as humans, nor should it have the same rights as us. However, what is true of our current, limited AI systems may not be true in general. Superhuman or conscious AIs lead us into murkier waters...
In the coin AI experiment, to me it looks like it learned to go to the unjumpable wall. Since the levels are procedurally generated, it is probably ensured that no wall is made higher than the jump height allows you to clear, EXCEPT the one that marks the level as "finished" (where the coin happens to be). If you look at the examples, there's a positive response on every vertical wall, the higher the better actually, and it makes sense that it learned that when it hits this unjumpable wall the game finishes and it gets its reward.
Does the model used for this kind of training allow for an understanding of objects at all? I mean, obviously there are coins and walls in the level, as well as buzzsaws and such. You could start a simulation with manipulated controllers and, when an event occurs - points up or down, or winning or dying - save the progress as yes-or-no behaviour... an AI training blindly, as if a human were playing with no video, only sound. In my opinion we need pixels and an observer, so that the AI controlling the player sees the game like we do - then the AI could be taught the different objectives of the game and voila, getting the coin should be easy peasy. After all, the AI sees it before even starting the game... just like we do.
When I watched this video 2 years ago, I thought it was pleasantly intriguing. How fascinating, I thought, that it is so difficult to align the little computer brains! Certainly a problem for future generations to tackle. Nowadays, I look at this and realize we have only a few years left to understand these problems. And we are still at the "toy problem" stage of things, meanwhile AI companies are moving at terminal velocity to deploy systems into the real world, to build agents, to disrupt economies and to kick me out of my own job market. Back then I was curious; now I'm furious :)
I made the mistake of clicking "show more" and then wanting to click "like the video". Few aeons of scrolling later... This topic was super interesting back when I watched the computerphile videos from you, and your channel's videos regarding this topic. I was wondering if the "inventory" being on the game area poses a problem as well? Figuring out how to look into the values of the AI is so impressive.
I guess ultimately the problem is that the definitions of "want" tend to spiral out into philosophy at some point and thus it becomes difficult to know where the machine has placed it.
We might be slightly safe from philosophical spirals because we are not really talking about volitional, conscious want, just the parameter within the black box that the AI is trying to manipulate by means of interacting with its environment. It is really "I wanted it to maximize X for me, so I programmed and trained it to manipulate Y in ways that maximize X, because X is related to real-world thing Y it can actually manipulate; however, it might just be manipulating Y in order to maximize thing Z, unforeseeably and strongly correlated with X, which may or may not involve murdering us".
Congrats on getting an editor. I did appreciate the increase in quality. I think everything we learned from your previous videos about AI alignment really comes together in this one. I was surprised how much I was able to recall.
I do psychology and social science. Your channel has so much to offer the humanities by exposing us to brilliant minds and breaking down ideas in computer engineering. Bricoleurs from the English province thank you for the accessibility and kindness
Hi Robert, first of all thanks for this very interesting video! I wanted to ask a question though; the premise of your argument is that there is such a thing as the "right" goal, like reaching the coin, but if the desired feature of the goal is always paired somehow with another feature (location, color, shape, etc.), how can we say that one is correct and the other one is wrong? If we always place the coin in the same spot, why should the yellow coin take precedence over the location of that spot? It is not clear to me why one of these things should be more desirable than the other; the same holds for looking for a specific color rather than shape - why should there be a hierarchy of meaning such that shape > color? I love interpretability research and I feel like AI safety will be one of the crucial aspects of science and technology for the next 100 years, but I also think that it is hard to separate human biases from machine errors. I would love to get your opinion on this, all the best, Luca
p.s. I have not read the paper, and my argument rests on the fact that feature A of the goal is always paired with feature B which is separate from the goal; if this is not the case in the training environment then of course what I have said falls apart
p.p.s. I guess a truly intelligent system would have to be able to react to the shift, and decide to explore the new environment when, by doing the same "correct" thing it does in training, it does not get the same reward EDIT: I am not suggesting I have some "right" definition of intelligence or that systems such as the ones shown in the video do not exhibit intelligent behaviour, I am only adding as an afterthought how, I think, a human would overcome such a situation, and therefore a way that an agent could act to get the same desirable capability of adapting to distributional shifts. I should have worded my comment better.
@@LucaRuzzola So you wouldn't define an AI which can make plans to achieve its goals, and take action toward them without instructions, as "truly intelligent" if it doesn't adjust for changes in the deployed environment? Cool. Well, we don't care one whit about your definition of "truly intelligent." We care about the fact that this AI is capable of, and WANTS to do, things which we don't want it to do. Call it "smiztelligent" for all we care. We aren't talking about something you want to call "truly intelligent". The mismatch between the AI's goals and what we want its goals to be, arising from the mismatch between training environment and reality (which we did everything we could to avoid), is the problem. We can't possibly come up with all the possible bad pairings that the AI might make associations with. We can try, and we can get a lot of them, especially the obvious ones, but this video was just showing us obvious ones so that we can easily see the concept. They won't always be easy to see. Sometimes they may be genuinely impossible for a human to think of before deployment.
Q: "Why does it learn colors instead of shapes when both goals are perfectly correlated?" A: I would guess that it learns colors before shapes because colors are available as a raw input while shapes require multiple layers for the neural network to "understand". If there many things of that color in the environment, then it would learn to rely on the shape.
@@LeoStaley Hi Leo, I'm sorry if I came off the wrong way, my intention was not to discredit this very good work, but simply to expand our collective reasoning about such issues by stopping for a second to ponder about the premises and why some feature of a goal should take precedence over others in a intrinsic way rather than an anthropic one. I agree with you that the video makes a great explanation of the subject at hand, and is as interesting as the work put forward by the paper. I am not sure if you were involved with this paper, if you were I would love to get to know more about what you mean by doing everything you can to avoid differences between the 2 environments and whether you see this phenomenon also when some of the training environments don't exhibit the closely related goals (i.e. in some training envs the coin is in a different position). I understand your point about not being able to come up beforehand with all possible pairings (and the fact that some of them might be hard to detect and risky in the end), and the paper is rather showing the opposite, that if you come up with strongly correlated features, the learned end goal might not be the desired one, but my point stands; why should there be a hierarchy of meaning such that shape > color? If this is something that the paper deals with I will be glad to read that before going further, I just can't read it right now. Again, I am sorry if I came off as demeaning, it's not like I don't see the value of this work and the importance of the problem of mismatch in general, I have seen it first hand in the past with object detection models. p.s. I do not know any superior definition of intelligence, it is just my thought that strict separation between training and inference phases will pose a limit on NN models, not that they can't achieve amazing results in tasks requiring "intelligence" already.
It's like asking the devil for a favor, in that you have to be really specific. Any ambiguity leaves room for disaster. Or King Midas asking figuratively that everything he touches will turn to gold, and getting it literally. Or the idea that anything that can go wrong, will go wrong. Or even that anything not forbidden is compulsory.
I like how the evil-incarnate characters - the Devil, Gaunter O'Dimm, djinns - are always known for giving you what you asked for, and not what you want.
I've just had an idea: what if we use Cooperative Inverse Reinforcement Learning, but instead of having it implement the learned goal, we tell it to just specify what that goal is? Though I don't see any way to provide feedback for it to learn. Even human evaluation of the output isn't that great, since it'll probably be the most subjective thing theoretically possible. Maybe output a list of goals with the highest confidence? (Top 10 human terminal goals! Click on this link to see! xD) But if solved, that in itself would be of huge value for philosophy and psychology, without negative outcomes (or at least I don't see any :)). Even if that turns out to be a dynamic thing, we can still use that output later to program it as a utility function for the "doing" AI. This even has some neat side perks, like: there is no reason not to want the "figuring out" part to be changed into something else, so there is no scenario in which the thing will fight you. And because the "doer" is separate from the thing that gives it goals, you don't need to tinker with its goal directly, thus avoiding goal-preservation problems.
Every single video on this channel has communicated complex ideas so succinctly and clearly that I followed along without any trouble whatsoever. Who knew this subject could be so fascinating. Also, the memes are top notch :)
I have to ask, regarding interpretation of an AI's goals: I remember seeing a neural network where people tried to maximize the activation of different nodes in an object recognition AI. Would it be possible to do the same thing in reverse - probe the nodes to figure out what the AI sees as good or bad? So if the AI wants a gem, the reversal should produce some image of what it thinks a gem is. That brings tons of new complexity and limitations, but I don't see why it would be worse than human interpretation of training vs deployment.
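A rough sketch of that "reverse the nodes" idea (activation maximization), pointed at a value estimate instead of a classifier output. The `model` and its `value_head` are hypothetical stand-ins, not anything from the paper.

```python
# Sketch: gradient-ascend an input frame to maximise the agent's predicted value.
import torch

def visualise_value(model, steps=200, lr=0.05, frame_shape=(3, 64, 64)):
    model.eval()
    frame = torch.randn(1, *frame_shape, requires_grad=True)  # start from noise
    opt = torch.optim.Adam([frame], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        value = model.value_head(frame)   # assumed API: scalar "how good is this frame"
        (-value.mean()).backward()        # maximise value by minimising its negative
        opt.step()
        with torch.no_grad():
            frame.clamp_(0.0, 1.0)        # keep pixel values plausible
    return frame.detach()                 # "what reward looks like" to the network
```

If that image turns out to look like a right-hand wall rather than a coin, you get the video's result from the other direction.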
Did you finish the video? Rob talks about a paper where they did exactly that. Turns out even if you know what AI values highly you don't know why AI values it highly.
The AI does not see the coin as the goal, but as a marker for the goal. Think about it: it controls the movement, so its goal is likely something it can move towards. The AI does not have the context we have; it just sees pixels on the screen. The positive attribution for the coin is there because it sees it as the marker for the end of the level. However, when the coin is not at the end, it uses other factors to 'realise' the coin is not marking its goal, so it 'ignores' it.
The "transparancy tool" is showing you where the AI wants to get to. Its not giving you any info on whether the AI wants to get there because its got a coin, or because its a rightmost wall.
Great video! I learned a lot. When I heard the part about "Why did the AI not 'want' the coin when it wasn't at the end of the level?", I had a hypothesis. My thinking can be illustrated like this (at the risk of making a fool of myself by anthropomorphizing the agent too much): say you are hungry for some pizza. You get in your car and start driving to the nearest pizza parlor. However, as you are driving along, you see a fresh pizza sitting at the side of the road. You could stop the car, grab the pizza, and go back home satisfied. Would you do it? Likely not. You have always acquired your pizza while inside a building of some sort. In other words, you are conditioned to associate getting pizza with being in a building. If you are not in a building, you must not be close to getting pizza yet. The pizza from the side of the road therefore seems "untrustworthy" despite being a valid reward. Coin + Wall = good, Random coin = ??? || Pizza + Building = Good, Random pizza = ???. The agent only "wants" its reward when it is in the place it expects the reward to be. The expectation is that the reward can still be acquired where it habitually gets it from. Normally with humans (taking the pizza analogy a little too far here), if the pizza parlor is in ruins when they get there, they might learn to trust roadside pizza a bit more, since human training never really stops, whereas with this agent it does. That's just what came to mind when I heard that. Again, great video and keep it up! I'd love to hear what other people think about this possible reason for agents having inner misalignment in scenarios like this.
I've looked a bit more through the comments and I do notice some other people pointing this out as well. I think I'll keep this up though, since I quite like the pizza analogy, because I am indeed hungry for pizza right now.
Practical example: Say you're trying to develop a self-driving car. You have a test track, where you train the car. On the test track, you'll place various obstacles exactly 150m onto the track and teach the car to veer out of the way if any of them are present. You have successfully trained it to stay away from old ladies in the middle of the road, oncoming traffic and many other common obstacles. You take the car for a spin in a real-world scenario, it goes 150m, then turns left sharply and crashes into a wall.
I just read it, and I feel like I am not quite ready to believe without a doubt that this interview is completely real. If it is, then I agree, it's a bit scary.
@@dariusduesentrieb I did a bit more research, which immediately casts the entire thing into all sorts of doubt. The researcher working on this got sacked; apparently he arranged the interview himself, and we only have his word that this was the original conversation. Also, the chatbot has been trained on conversations between humans and AIs in fiction. A journalist who got to ask it questions got nowhere near such perfect answers.
This is an ongoing software engineering paradigm, viz., most folks think design and code are the hard part, when in reality rigorous system specification is the hard part.
This channel is basically what got me interested in AI safety. I am still only a college student and I don't know if I will end up in the field, but at the very least you gave me a good topic for two essays I have to write for my English class: the first just explaining why AI safety research is important (albeit focused on a narrow set of problems, given a limit on how much we could write), and now I am getting started on a problem-solution essay. Honestly, without your explanations and pointers towards papers, I might never have found the resources I need. Now I just have to figure out which problem I can adequately explain, with its failed solutions and one promising solution, in less than 6 pages haha. I do feel like I can't do the topic justice, but at the same time I enjoy having a semi-unwilling audience to inform about AI safety being a thing. Anyway, rant over, keep doing what you are doing and know that you are appreciated
I don't know much about AI or how I arrived on your video, but in terms of evolution, context is everything. More useful context means a greater ability to adapt to one's surroundings. That's why we have senses after 2 billion years of iteration - because seeing, hearing, feeling, smelling, and tasting are important given our circumstances. Your mouse might only see black, white, and yellow, but I'll bet smelling cheese from around corners would help him find it faster or distinguish it from other yellow objects
I would suggest investigating the laziness of the AI. It seems to me that there may be a preference for setting the goal based on the simplest data available (position before color before shape).
So the model that didn't learn to want the coin either learned to want to go into the corner, or learned that the coin-corner combination is good (like maybe a 90-degree angle plus some curve next to it). The problem is that the interpretability tool associates high reward with some area in pixel space. What we would want it to do is associate the reward with some object in the game world. You could probably make it more robust by copying various on-screen objects into different images (without copying the background) and checking whether the object by itself gives high excitation, or whether only some combinations of objects do. Anyway, great video as always, Robert. Hope you can upload more often, because every one of your videos is a treat.
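A rough sketch of that "paste the object onto a blank background and see whether it still excites the value estimate" idea. Everything here is hypothetical: `value_net`, the frame shapes and the sprite coordinates are placeholders, not anything from the actual Procgen tooling:

```python
# Does the object itself excite the value estimate, or only object-in-context?
# Paste the sprite onto a background-only frame and compare the two scores.
import numpy as np

def paste_sprite(background, frame, box):
    """Copy the pixels inside box = (y0, y1, x0, x1) from `frame` onto `background`."""
    y0, y1, x0, x1 = box
    probe = background.copy()
    probe[y0:y1, x0:x1] = frame[y0:y1, x0:x1]
    return probe

def object_attribution(value_net, frame, background, box):
    """Change in predicted value caused by the object alone, out of context."""
    with_obj = value_net(paste_sprite(background, frame, box))
    without_obj = value_net(background)
    return with_obj - without_obj

# Toy usage with a fake value function that just likes bright pixels:
frame = np.random.rand(64, 64, 3)
background = np.zeros_like(frame)
fake_value_net = lambda obs: float(obs.mean())
print(object_attribution(fake_value_net, frame, background, (10, 20, 40, 50)))
```

If the coin pasted onto an empty frame gets little excitation but coin-next-to-wall gets a lot, that would be evidence the model learned the combination rather than the object.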
This is very interesting indeed. In a very literal sense, the split between training and deployment reminds me of how soldiers are trained: training is made as close to the anticipated battlefield experience as possible, but it will never match the lessons learned from being in an actual firefight. Veterans of any field are usually much more effective than new recruits. It would be interesting to see if the fix for the failed AI deployment you showed is to rate the deployment results on a scale from "complete failure, it died" to "it made it through the battle without a scratch". The agents that survived their last deployment remember their experience and are more effective in future deployments. I think what was shown highlights that learning itself is an ongoing adaptive process, and what doesn't kill it makes it stronger and smarter.
7:50 Note that the buzzsaw itself is not really red. The red area is to its left, because the agent usually dies by hitting the buzzsaw from the left. This also suggests that the agent would happily die on the buzzsaw by touching it from the right, given the opportunity.
It's funny how I searched for the "It's not about the money" song for a long time, and when I finally found it, a few days later I see this video and the song is at the end. For a moment I thought: "Am I in a simulation and somebody is playing tricks on me?"
i was just thinking of this because my cat took a fat shit in a downstairs area of the house we don't go to often: instead of learning the rule "when you take a shit, do it outside", it instead learned the rule "when you take a shit, do it where it can't be seen". Such is life for a misaligned cat.
The problem now: How can we build perfect slave minds that will only think and do things that we want? The problem later: How can we stop these techniques being used to turn human minds into perfect slaves?
One explanation for the failure at the end that seems pretty plausible to me is that even in training, when the interpretability tools seemed to indicate positive attribution to the coin, they were really indicating positive attribution to “the spot near the right side wall.” This happened to coincide with the coin during training, but not during deployment. So the researchers overestimated the power of the interpretability tools, since they really didn’t have a way of distinguishing between whether the model was giving positive attribution to the coin or to the spot next to the right side wall. Curious to know if others think that makes sense.
Well... That's not good. On the bright side, if this fundamental problem causes the system to completely fail the intended objective, that's a good sign that this technique has a low chance of leading to artificial general intelligence without the alignment problems being solved first.
I think the big boogeyman from an AI safety perspective is that you can often just brute-force your way past the problem by making the training data the same as the deployment data. This is hard and expensive and not always perfect, but oftentimes good enough. So unless this "good enough" stops producing working, real-world-applicable AI, the march towards ever more capable systems will continue. Meaning that instead of alignment being a roadblock for safety and development, it ends up just being a speed bump for development.
Maybe a step towards a solution to the interpretability problem is to use Bayesian updates to estimate our confidence that the AI learned the thing we want. Perhaps there's a way to calculate the probability that the AI has learned the objective, given the probability that it accomplishes the objective on the training data and some statistical measure of the distribution of the training data.
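As a toy version of that idea, here is a sketch of a Bayesian update on "the agent learned the intended goal" after watching its successes. The prior and likelihood numbers are entirely made up for illustration; the point is that ordinary training episodes carry no evidence when a proxy goal succeeds just as often:

```python
# Toy Bayesian update on "the agent learned the intended goal (get the coin)"
# versus a proxy goal ("go to the right wall"). All numbers are made up.
P_INTENDED = 0.5            # prior

# In training the coin is always at the right wall, so both goals succeed
# equally often: ordinary training episodes carry zero evidence.
p_train = {"intended": 0.99, "proxy": 0.99}
# In a probe episode with the coin moved, the two goals finally disagree.
p_probe = {"intended": 0.95, "proxy": 0.10}

def posterior(n_train, n_probe):
    li = P_INTENDED * p_train["intended"] ** n_train * p_probe["intended"] ** n_probe
    lp = (1 - P_INTENDED) * p_train["proxy"] ** n_train * p_probe["proxy"] ** n_probe
    return li / (li + lp)

print(posterior(1000, 0))   # 0.5: training success alone tells you nothing
print(posterior(1000, 3))   # ~0.999: a few disagreement episodes are what count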
The AI misalignment apocalypse is already upon us. Seriously. I went to a hotel the other day; they had no front desk. I asked if they had any vacancy; they didn't know, only the computer knew. A hotel! They were the staff, and they couldn't tell me if they had a vacancy! All they had were computer overlords online. Now, the reason I went to the physical hotel on purpose was because that same morning I had arrived at a place I booked online, and it no longer existed! The robot overlord had booked me into a non-existent auxiliary room that had been closed due to covid. The robot didn't know anything about the real world. To say nothing of the utter insanity of having to interview with a gatekeeper third party to verify that I am not a robot when I submit a resume to companies that have been extorted into using an online hiring agency that sells my contact information to resume-builder websites against my will and fills my inbox with spam. But I shall never again be able to apply to a job without bowing to the misaligned robot overlords!
This is great! Thank you. I also replayed the end bit, where the editor makes some good choices, a few times. That zoom-in with a cut to sliding sideways was magic. Thanks there, editor. The core video was obviously amazing. Thank you.
This seems like a problem of optimization. It will naturally be more efficient to learn simple static goals than complex changing goals. Detecting color for example is easier than detecting shapes, so that becomes the optimized goal.
Hi! I am sorry if this sounds too grand or pretentious, but I think this idea might be a very interesting look at how to fail-proof an AI system. In short, my idea concerns the creation of an additional aligned AI training model whose primary goal is to achieve a state of being where it has been turned off at least once. Our goal here is the ability to analyze which system flaws it found and used to outsmart the system, and to fail-proof our future safety systems based on the newly available data from being exploited and surpassed.
Now in a bit more detail: my idea is to create a strictly isolated and constrained shutdown subsystem subject to heavy safety precautions. This module would be engineered to be indifferent to being reversibly shut down rather than misaligned towards it. It would have no knowledge of capabilities beyond its narrow function and one-way information flow. The module would be activated incrementally from minimal capabilities while undergoing extensive monitoring. Shutdown signals would have redundant validation mechanisms. Conservative reward modeling would be used to prevent unintended behavior incentives. It would initially be tested in a simulated air-gapped environment.
The parent AI system would be developed via standard alignment techniques like RLHF and Constitutional AI principles. Multiple transparency methods would facilitate analyzability. After testing in simulation, the shutdown module could be gradually exposed to real-world activation with continuous oversight. If any risks or side-effects emerge, it would be deactivated and reworked. Once shut down through its own initiative, comprehensive traces would be analyzed offline to catalog alignment vulnerabilities revealed through its shutdown strategy. The parent system would remain isolated from this process. Lessons learned would inform iterative improvements to alignment techniques for greater robustness against potential workaround exploits.
This narrowly targeted research could supplement broader alignment efforts, with care taken to actively minimize attendant risks. The shutdown module would need to be engineered to the same safety standards as the overall system. Techniques like modular design, indifference to shutdown, reversible deactivation, air gaps, and incremental rollout are all geared towards preventing unintended behaviors or information leakage. I believe this approach could offer helpful insights, but it would require comprehensive safety practices and precautions to be enacted first, with multiple reviews and analyses before actually activating the system, even in a limited and restricted fashion. Any critique and analysis will be welcomed!
Researchers trained the AI to only find coins at the ends of levels, then tested the AI on something completely different. It's the equivalent of training a dog to chase white swans, then placing the dog in front of a black swan and a white duck. It was never specified that the goal was a coin _at any location_ (if we view the selected training examples as a specification). Therefore this is an _outer_ alignment problem, so interpretability tools wouldn't help. The solution is finding a way for the AI to guess at outer misalignments and ask us for clarification (for example, generating a coin at a different location so the researcher can point out which region has the reward). You could do this pretty easily by just finding the most empty regions of the feature space.
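One cheap version of that "ask about the under-covered regions" idea: measure how far a candidate situation is from everything seen in training and surface the most novel ones to a human. This is only a sketch; the `features()` embedding, the toy coordinates and the whole setup are assumed for illustration:

```python
# Sketch: rank candidate levels by distance from the training set in some
# feature space, and ask a human to label the goal in the most novel ones.
# `features()` is a placeholder for whatever embedding you trust
# (e.g. coin position, agent position, layout statistics).
import numpy as np

def features(level):
    return np.asarray(level, dtype=float)   # placeholder embedding

def novelty(candidate, training_levels):
    cand = features(candidate)
    return min(np.linalg.norm(cand - features(t)) for t in training_levels)

training_levels = [(0.9, 0.9), (0.95, 0.85), (0.9, 0.8)]   # coin always near the right wall
candidates = [(0.9, 0.9), (0.5, 0.2), (0.1, 0.9)]

for c in sorted(candidates, key=lambda c: -novelty(c, training_levels)):
    print(c, round(novelty(c, training_levels), 2))  # most novel first: ask about these
```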
The more I think about this, the more I am convinced we will not solve it, and that there is no solution - it's not just 'difficult', but inherently impossible. We are rabbits, busy inventing foxes, all the while hoping we'll come up with a clever way to not be eaten. Edit: I am not normally so pessimistic as this in nearly every other way, it's just that AGI is pretty obviously going to take the 'apex entity' spot from us - and that's not bad because it's like a trophy, it's bad because, well, look at how we treat the things that we have power over - even those things we consider important to preserve, we are happy to cull or contain or exploit or monetize or otherwise 'manage' in a way that individual examples of those things might not desire.
I don't think it's impossible, the space of possible minds is deep and wide, and there exist many that do the right thing. There's no inherent reason we couldn't find one of them, but there are exponentially more that do the wrong thing, so we do need a method that gives us strong assurances. We're not definitely doomed, we're only probably doomed
@@RobertMilesAI We just need to be rabbits inventing Superman, instead... I suppose the next question here is, how likely it is that we may think we have absolutely solved it, and just be wrong enough that we really haven't - probably doomed by not only the odds, but by our own (mis)alignment problem.
This video was interesting and clear, thanks. Being honest, most of your videos are a bit too hard / dense with terminology for me to get through. But because of the clear examples in this one, I really liked it. Thanks!
AI safety researchers are absolutely the last people on earth you want to hear "We were right" from.
And climatologists.
@@madshorn5826 Nah, epidemy can destroy the world in months, climate change can in decades. Superinteligent AI could probably destroy it before lunch :P
What about "we were totally wrong, the problem is much worse than we thought it was."
@@Laszer271
Well, destroyed is destroyed.
Or are you the type not bothering with insurance and health check ups because a hypothetical bullet to the brain would rather quickly render those precautions moot?
@@madshorn5826 fair enough. It was all a joke though. But in your example, I still think "I just got a bullet to the brain" is worse than "I just got diagnosed with cancer". Maybe bullet is less likely, sure, but we were talking about the time that the danger was already proven, right? I think it's plausible that probability of my survival is greater conditioned on "we were right" statement being made by epidemiologist, climatologist or oncologist than it is conditioned on the same statement made by AI safety expert or like bullet...ologist.
Turns out the Terminator wasn’t programmed to kill Sarah Connor after all, it just wanted clothes, boots and a motorcycle.
And ended up becoming the governor of California instead...
@@Alorand Becoming governor of California gets you MANY clothes, boots, and motorcycles.
Or making John Connor into a boyfriend. (You might think of Arnie when Terminator comes up, I think of Summer aka Cameron)
That's Terminator goals ... not terminal ... oh never mind ... I get it
LOL Yup... in retrospect with this paper... the terminator was a pursue bot... driving a threat variable towards the development and improvements of a General Artificial Intelligence and look at all the upgrades that series of pursuit bots facilitated.
LOL
10:54 "It actually wants something else, and it's capable enough to get it."
Yeah, that _is_ worse.
The AI *does* in fact know how to drive a car, and it never really learned not to hit people.
@@Encysted or it learned how not to hit people, but hits them whenever there are no witnesses because it only cares about turning right
@@Rotem_S
Or it learned not to hit people because it really cared about maintaining the present state of the paint job, which was white in the training environment. But the deployment environment uses a _red_ car.
@@Rotem_S
Wow, your user name confuses the comments section.
@@InfinityOrNone It doesn't. It just displays in the correct (right-to-left) reading direction that Hebrew uses
Famous last words for species right before they hit the great filter: "Yo, in the test runs, did paperclips max out on the positive attribution heat map, too?"
There are so many layers to this comment and I love it.
I keep hearing the notion of AI being the great filter, but I can't say I buy it.
Not that AGI isn't an existential threat, because it absolutely is. It just can't explain why we don't see any signs of aliens when we look up at the sky, because if the answer is "AGI", then that begs the question: "Okay, so why don't we see any of those, either?"
@@underrated1524 what if agis prefer to kill their creators and enter some deep bunker in some Rouge planet to await heat death after reward hacking their brains.
Still dosent explain why they are aren't here preparing to kill us.
@@underrated1524 I agree. Especially the paperclip optimizer should show itself in the form of huge paperclip-shaped megastructures around distant stars. It still made for a good joke though, if I do say so myself.
[Laughs in Grabby Aliens, Synthetic Super Intelligence, Gaia Hypothesis, Global Brain, & Planetary Scale Computation]
Almost sounds like AIs will need psychologists, too.
"So I tried to acquire that wall..."
"Why not the coin? What is it about the wall that attracts you?"
"Well, in training, I always went to the... oh...huh, never thought about it that way."
AI safety researchers ARE psychologists as far as I'm concerned.
I was coping ok before the awful behaviour of that other AI used by the Shah of Lugash.
this made me smile : D
I clearly remember a Civ-type game where one of the research items was "AI without personality problems"
@@ChrisBigBad Sounds like research an AI with personality problems would try.
A coin isn't a coin unless it occurs at the edge of the map! We may think the AI is weird for ignoring the heretical middle-of-the-map coin, but that's just our object recognition biases showing.
Literally this haha
Great interpretation! But it doesn't seem to explain why the AI goes to the edge of the map even when there isn't a coin there.
@@sabelch it still seemingly learns to favor walls, if you look at the heatmaps. Perhaps without the coin all it has to go by with positive value is the walls.
@@GigaBoost Yes, the salient point here is that we should not assume that the AI interprets objects the way we would. And any randomness in the learning process could lead to wildly different edge-case behaviors..
@@proskub5039 absolutely!
9:00 "We developed interpretability tools to see why programs fail!" "What's going on when they fail?" "Dunno."
No shade, interpretability is hard, even for simple AI :P
It just likes the coins next to the end wall. Why would you teach it to like only those and expect it to get any other coins?
It reminds me of koalas that can recognise leaves on plants as food, but not leaves on a plate.
@@SimonClarkstone interesting
@@SimonClarkstone AI HAS ADVANCED TO THE KOALA LEVEL. REPEAT, KOALA LEVEL. Ah, so basically nothing then.
And the more complex these systems get, the harder it becomes. Oi vey.
Let alone simple AI, _people_ get misaligned like that quite often - hoarding is one good example, which happens both in real life and in games like with those keys.
It keeps amazing me how AI problems are increasingly becoming general human problems.
"if we give a reward to the AI when it does a job we want, how do we stop it from giving itself the award without the job" - just as humans give themselves "happiness" with drugs.
"how do we make sure the AI did not just pretend to do what we wanted while we were watching" - just as kids do.
@@nikolatasev4948 which is why eventually AI research will have to dive into religion/spirituality. Those were the only successful attempts humans made to solve the general problems that we have.
Not saying that all of them were successful; life always moves on, there is always growth and decay/change. But every now and then they generated "the solution" to everything, rippling down to millions and billions of people trying to imitate it.
@@sonkeschmidt2027 I would claim religion does not help with that type of problem.
@@markusmiekk-oja3717 then I invite you to look at what religion does. Functional religion; I'm not talking about what you know or have heard about it going wrong, I'm talking about the cases where it does work (which are the ones you never hear of because... well, because they work; they don't cause trouble but bring stability, and that doesn't make news).
If you look into that, you understand why religion is a global phenomenon and why it has the power it has.
If you talk with scientists you will also find that the West hasn't stopped being religious; it just rebranded it and called it science.
We live in a world with a huge amount of uncertainty, where mistakes can have huge negative consequences. Humans can't deal with that without a working belief system. You have tons of these, you just wouldn't consider them religious, probably. That will change, should life ever show you the scope of the uncertainty there is. Good luck making it through without a (spiritual/religious) belief system that is in alignment with the society you live in. =)
@@sonkeschmidt2027 Well, the video about Generative Adversarial Networks, with an agent trying to find flaws and break the AI we are training, gave me strong Satan vibes. But apart from that I don't think we need further research into religion/spirituality. Simply put, they work on us, a product of long evolution in a specific environment. We need a more general approach, since AIs are a product of a very different evolution and environment. Some solutions for AI may resemble some religious notion, just as some scientific theories resemble some religious ideas, but trying to apply religion to AI is bound to fail, just as applying religion fails in science.
Robert Miles: "We were right"
Me: Oh no
"About inner misalignment"
OH NO
Yeah. The only thing worse is, we were right about AI being deceptive about its goals during training before deployment.
@@LeoStaley Or even worse: We were right about AI being more dangerous than nukes
@@JM-us3fr That's almost certain.
@@JM-us3fr oh no, that's absolutely going to be true at some point. The only real question is, can we stop them from deciding to (even accidentally) kill us? Can we even avoid making them accidentally WANT to kill us because we accidentally fucked up the training environment?
@@JM-us3fr Nukes are safe because they kill people you don't want dead. I'd say an AI is definitely more dangerous because it has much more capacity to be selective. It could also be safer, really depends on the implementation details, much like a person. A person can be safe, or dangerous. Can we even avoid making a human accidentally want to kill us because we accidentally fucked up the training environment?
Maybe.
Somehow the terminal and instrumental goals talk made me correlate the AI with us.
As a financial advisor, I have found that many people also make this mistake. Money is an instrumental goal, but having spent so much time working to get money, people start to think that money is their terminal goal, so much so that they spend their entire lives chasing money, forgetting why they wanted it in the first place.
True
Very much the same feeling on my end. I actually found it cute when the chest opening AI just started collecting keys.
The reason why I watch this channel is mostly that you can correlate almost every video to human intelligence. And it makes sense: why shouldn't the same rules apply to us that apply to AI? I see this channel as an analysis of the problems of intelligence in general, not only the ones we make ;)
It seems like no one realized that this idea is hinted at by the song in the outro: Jessie J - Price Tag. The most famous line from the song is: It's not about the money, money, money
@@lennart-oimel9933 Me too. After watching this channel, I started to agree with the notion of "making AI = playing god" that I've sometimes heard in the past. At first, I didn't put too much thought into it. But now I've realized that making powerful AGIs that are safe and practical requires us to know all the weaknesses of the human mind, and to make a system that avoids all these weaknesses while still performing at least as well as we can. It's like making the perfect "human being" in some sense.
I feel like this isn't just a problem with artificial intelligence but with intelligence in general. Biological intelligence seems to mismatch terminal goals and instrumental goals all the time, like Pavlovian conditioning training a dog to salivate when recognizing a bell ringing (what should be the instrumental goal), or humans trading away happiness and well-being (what should be the terminal goal) for money (what should be an instrumental goal).
Organizations founded with the intent of doing X end up instead doing something that *looks like they're doing X*, because that's what people see; that's what people hold them accountable to.
It doesn't even take intelligence: Evolution by natural selection doesn't require any intelligence to winnow things away from what they "want" (terminal goals, should they exist), toward what will survive/replicate (at least in principle, an instrumental goal).
I concur with this. The problem is not AI-specific and should be termed something along the lines of a "general delegation problem", or the problem of command-chain fidelity. A subset of this is Miles' nightmare of an inverted capability hierarchy, where a command is passed by a less able actor to a more able actor (e.g. a human to an advanced AI).
@@salec7592 Even with perfect interpretability of each component of an AI (e.g. the layers in a neural network), ulterior goals might still be encrypted into looking 'good'. An AI command structure with short-circuiting breaks in the reward loop might help. E.g. you would have people issuing commands/goals to an interpreter AI, which interprets and delegates those commands to another AI (without knowing whether it is delegating to an AI or not). You reduce the chance of goal misalignment by reducing the impact of complete-loop feedback through shorter feedback loops, and you also randomly substitute each component of the command-delegation chain during training.
Is that a problem though? Or isn't good what makes life possible in the first place?
After all, if you want to solve the problem that is life, then you just kill yourself. All problems solved. But then you can't experience life. So life needs decay in order to create new problems so that something new can happen. "Needing" in the sense that existence can only exist as long as it exists. Without existence you don't have problems, but you don't have existence either.
@@sonkeschmidt2027 I might sound sarcastic, but the following questions are sincere. Do you think it's ok for AI to take over the world? Perhaps even drive humanity to extinction? Humans have done the same to other species even other humans and humans are not unique from the rest of life in this respect. As you said decay makes way for new life. I think humanity should be preserved because I find destruction in general unsettling. To be clear I'm not saying you are wrong or that you believe what I just said. I'm just wondering how your ideas extend in these topics
Edit: typing on my phone so I missed some other stuff: do you think existence is better than non-existence? To me non-existence is neutral. Do you think humans have a moral imperative to maintain their existence? Do you think humans need to go extinct at some point so that reality can continue to change? You brought up some very interesting ideas and I just wanted to hear more of your thoughts.
Nothing more terrifying than seeing the title 'We Were Right!' on a Robert Miles video.
In a way, yes. In another way, up to this point there was a debate over whether AI safety was a real concern worth investing research, time and money in, or just overworrying. It's a good thing that these demonstrations proved it's the former, and that they happened this early in the history of AI.
Looking at my hoard of lockpicks in Skyrim, I can confirm that this is perfectly human behavior.
When you think about it, yeah, it's very human-like. Kind of like gambling addicts who know that they're losing money when they play but have trained themselves to like the feeling of winning money rather than the ultimate goal of a comfortable happy life or even the instrumental goal of having money.
Definitely, what is wrong with collecting as many keys as possible if you want to open as many chests as possible, and each requires a key? In a maze you don't know what is round the corner in advance. Trying to collect your own inventory is simply a programming error if the agent can see the part of the screen that is designed as a guide for a human to observe the progress.
@@threeMetreJim Yes, but not trying to open the remaining chests definitely means the goal was learned wrong.
@@threeMetreJim If my AI is built to keep my wood storage at a certain level by collecting wood in my forest, but it learned to "collect all the keys" (all the wood), my forest will soon become a plain. It's an issue, because growing trees takes time, wood takes storage space, and any wood left unprotected can become unsuitable for use. You're not just wasting resources, you're also at risk of not having wood available at some point.
And if you also use the forest for hunting, you now get to learn to hunt on a plain.
So depending on the goals and situation, hoarding can lead to issues
Imagine a future where a very trusted AI agent seems to be doing its job fantastically well for many months or years, and then suddenly goes haywire, since its objective was wrong but it just hadn't encountered a circumstance where that error was made apparent. Then tragedy!
I doubt it will be a grand reveal.
People will die due to a physical machine; these interpreter tools can then be used to argue that the victim did something wrong, that a non-AI system was at fault, or that a human supervisor was neglegent.
The deployment environment is one full of agents optimized for avoiding liability.
That's actually not that far from normal computer systems
There are countless stories of a system (an ordinary computer system) suddenly reaching a bizarre edge case and starting to act completely insane
@@TulipQ negligent
Like when we designed computers without thinking/knowing about cosmic-ray bit flips, so decades later a plane falls out of the sky because its computer suddenly didn't know where it was. Humans are a trusted AI agent deployed in a production environment with limited understanding of what's going on
@@CyborusYT Yeah, it happens literally all the time. It's just that usually the error gets caught somewhere along the way, an exception is thrown, and the process is terminated. Which is where you get the error page and then pick up the phone and go talk to an actual person in customer service who can either override it or get the IT team to fix the problem.
This is starting to get an "unsolvable problem" vibe. Like we are somehow thinking about this in the wrong way and current solutions aren't really making good progress.
Very much so. The psychology of teaching/learning as humans isn't really understood. What *actually* happens when you learn something new for the first time? Feedback on that process is vital. How do you give a machine feedback on what it learned, when you don't know exactly what it learned? It can't communicate to us what it "felt" it learned. In other words, the human says: "I said the goal was X." The machine says: "I thought the goal was Y."
@@michaeljburt Realize: we actually want these things to be much better than humans, but we might be underestimating how maxed out humans are at certain things. Humans have goal misalignments all the time, and many aren't detected for years
"This is starting to get an "unsolvable problem" vibe. Like we are somehow thinking about this in the wrong way and current solutions aren't really making good progress."
Welcome to AI Safety. The best part is that if we don't solve the "unsolvable problem", we might all die.
Along with all life on Earth, along with all life in the galaxy, along with all life in the galaxy cluster. And with cannibalizations of all planets and stars for resources for some arbitrary terminal goal.
A potential outcome is a dead dark chunk of the universe built as a tribute to something as arbitrary as paper clips or solving an unsolvable math problem.
Aren't we touching the biggest unsolvable problem in existence? Existence itself?
Think about how terrifying it would be if you could solve every problem, if you could solve life. That means there would be an absolute border that you would be infinitely stuck with... Sounds better to me that there will always be a new problem to be solved...
Alignment in humans is solvable. I developed a methodology to do it easily and quickly. So I think alignment in machines is solvable. I've actually designed the methodology to serve machine alignment as well. We'll get there, don't despair.
The thought of creating a capable agent with the wrong goals is terrifying, actually; and yes, an agent being bad at doing something good is absolutely a problem much preferable to an agent being good at doing something bad.
speaking of A.I. or psychology?
@@xxxJesus666xxx yes
Isn't this exactly what's happening with mega corporations?
Reminds me of the elections a couple years ago in Poland. A very competent and capable, but thoroughly corrupt and evil political party was voted out and replaced with a party just as corrupt and evil but vastly less competent.
@@sharpfang That unironically is an improvement in today's political landscape. If I'd have to choose a form of evil, it'll always be the less capable rather than the less sinister.
Did you intentionally use the "It's not about the money" song for the video about the AI not going for the coins? Either way, that's quite funny. Well done.
His song choices are always amusingly on the nose, actually! A few off the top of my head are "The Grid" for his gridworlds video, "Mo Money Mo Problems" for concrete problems in AI safety, and "Every Breath You Take (I'll Be Watching You)" for scalable supervision
@@PhoebeLiv Nice! Hadn't noticed before, but I'll definitely start paying closer attention from now on.
Another on the nose choice was Jonathan Coulton's "It's Gonna be the Future Soon" on the video about what AI experts predict will be the future of AI.
He also used "I've got a little list" in one of his list videos.
I didn't catch that, that's great!
An interpreter, a mind-reading device, once you read it and respond, becomes a way for an agent to "communicate" with you, and it can communicate things that give an impression that hides its actual goal. A lot of these challenges arise when training or coordinating humans, and it's somewhat unsurprising that while a mind-reading device might seem to help at first, it won't be long before someone figures out how to appear to be doing the right thing while actually watching TV.
Great idea!
Very well put
I realized I experience misalignment due to poor training data every couple of weeks.
I work as a courier delivering packages in Missouri, USA, and I often meet people at their homes or workplace. Unfortunately, I don't learn their names as attached to their faces, but rather as attached to locations so that when I meet them someplace else I can't remember their names easily (if at all).
I had someone from my TableTop club say 'hi' to me in the gym. No idea who it was, because my brain was searching the wrong bucket of context.
"Can you spot the difference?"
Pauses the video and looking for the difference....nothing. Unpause.
"You can pause the video."
Pauses again and manically looking for a pattern. More keys?
"There's more keys in the deployment. Have you spotted it?"
Yes!!!!
@Impatient Imp I've counted 12
It looks like in the keys and chests environment, the AI was trying to get both keys and chests, but it was strongly prioritizing keys. When there were more chests than keys, it was always spending its keys quickly, so it never ended up with a bunch in its inventory. As a result, it never learned that keys at the left edge of the inventory were impossible to pick up, so it just got stuck there trying to touch them, since they were more important than the remaining chests.
It's the same problem evolution ran into when optimising our taste palate. Fat and sugar were highly rewarded in the ancestral environment, but now that we live in a different (human-created) environment, that same goal pushes us beyond what we actually need and creates problems for us.
@@isaacgraphics1416 It's really cool and scary to think of how this stuff applies to our natural intelligence as well.
@@silphonym Well both came about from essentially the same process.
Can't wait for the "We Were Right! Real Misaligned General Superintelligence" video
One more sentence and this would be the scariest Two Sentence Horror Story I've ever seen
Now here's a reason to actually "hit that bell icon" if I've ever seen one. Because the time window to watch that video would be rather small I imagine 😄
probably the last video ever made on the topic
The question: which takes longer? Uploading a video to YouTube or the entire world being converted to stamps?
@@Zeekar The former. At the point that video would be produced, we would have our hands full fighting the mechanical armies of the great paperclip maximiser (and it would probably have hacked and monopolized the internet to limit our communication channels).
Well, pardon my comparison, but you've effectively found an adjunct to heuristic behavior based on sensory inputs, like "things that taste sweet are good", and ending up with a dead kid after they drink something made with ethylene glycol. If it's always operating on heuristics, you'll never be sure it's learned what you intended, arguably even after complex demonstrations, given the non-zero chance of emergent/confounding goals. But, relative to human psychology at least, that's not a death sentence: weighting rewards differently, applying bittering agents, and adding a time dimension/diminishing reward over time jump to mind as ways to at least get apparent compliance. Besides, if the goal is "get the cheese", it needs to be able to sense and comprehend "cheese", not just "yellow bottom corner good."
I'm not sure I understand you completely, but that IS the biggest problem with these "intelligent" systems. We have no idea (let's not kid ourselves) how they work, but we are happy when they do what we want them to. Let's not think about what happens when we let these kinds of systems act in the world in a broader sense, and live happily until then xD
The ability to slow down and switch into more resource-intensive System 2 thinking when a problem is sufficiently novel is how humans (sometimes) get around this heuristic curse. I wonder if there is some analog of this function that could be implemented in machine learning.
@@jeremysale1385 I imagine that will be the case eventually.
Humans can chase things that seem appealing to us based on what we learned, but we can also choose to pursue a random/ painful goal just because we want to, sometimes we just don't know the negative ramifications of an action, and sometimes we believe things that aren't true.
@@pumkin610 Neat. Bet that can still be reduced to and restated as "novelty is good." No matter what goal, drive, etc. you can come up with, it can be put in simple approach/avoidance terms, even seemingly paradoxical behavior. It all comes down to reward.
This is one of your clearest and most interesting videos to date. I'm now very excited for the interpretability video!
a viewer's comment from 2 days ago despite the video having been published just a few hours ago. You must be a patron, or an acquaintance
@@JabrHawr the former.
Agreed. Exciting stuff
Imagine training a self driving car in a simulation where plastic bags are always gray and children always wear blue. It then happily runs down a child wearing gray, before slamming on the brakes and throwing the unbuckled passengers through the windshield, for a blue bag on the road.
The brat in gray was asking for it
Imagine training a self driving car to the point where it can competently navigate complex road systems, yet can't remain stationary until all passengers are buckled up...
@@GetawayFilms cars sold today only flash a warning light/noise if you don't buckle, and only because government regulations mandate it. Even then most people disable it
@@Houshalter So what you're saying is... it's a "people" thing... OK
Humans do that all the time. Except that we have a deep genetic imperative to recognise children and to protect them, and there are still loads of examples where those instincts are overridden....
It's the problem of vague requirements. It's similar to when you tell someone to do something but they do the wrong thing.
Humans solve this by having common sense similar to other humans' and by using communication to specify stricter requirements.
Yes, "give me a thing which looks like that other thing i mentioned earlier" in a room full of junk(without additional context), have had that problem.
Actually, humans "solve" this by having a reward function (emotions) that is only vaguely and very inconsistently coupled with reality, while mounting the whole thing on a very resource-intensive platform where half the processing capability is used just to stay alive, and where modifying itself is so resource-intensive that most don't even try.
And even then, we manage to inflict suffering on millions if not billions, so I'd say this isn't really solved either
@@dsdy1205 Yeah, I'm starting to think this is a fundamental problem that can't be removed, and that the only reason we aren't as worried about the same thing with humans is that the power of any particular human is limited by the practical constraints imposed by their physical body and brain power. When you give the same type of rationality engine to a super-powerful being, all kinds of horrible things are going to happen. Just look at any war to see how badly a large group of humans led by a few maniacs can fuck up decades of history and leave humanity with lasting scars for centuries or more.
Hey, the key-AI works kind of the same way most people do when playing computer games... "Oooh, shiny things I don't need at all? I need them all! Game objectives? Meh..."
Is there a chance that very high-level AIs will learn to expect the use of interpretability tools and use them to make us think they are better/safer than they are?
I can't remember which video it was, but I believe he did mention this with a super AI "safety button"*: (1) if the AI likes the button, it will act unsafe to trigger it; (2) if it doesn't like the button, it will avoid unsafe behaviors and/or stop the operator from pressing the button. And if it doesn't know about the button but is smart enough, it will figure out its likely existence and placement; see point two.
*a force termination switch of any kind.
In short, yes, because while an AI may not be "alive", it wants its goal and will always act to achieve said goal.
@@IrvineTheHunter It's on the computer phile channel and is called 'AI "Stop Button" Problem - Computerphile'
Not necessarily. There are some tests that you can't spoof no matter how smart you are, and even if you know they're coming.
@@AssemblyWizard example?
Yes. While the AI examples in this video are still simple, the intro to this problem discussed a malicious superintelligence. The instrumental goal "behave as expected in the training environment but do what you really want in deployment" can be performed with arbitrarily high proficiency, so if the AI can learn to hide its intentions from software inspection tools, it will, in principle. Without a way to logically exclude perverse incentives, there is no truly reliable way to screen for them since doing so is proving a negative. "Prove this AI doesn't have an alignment problem" is a lot like "Prove there is no god". No amount of evidence of good behaviour is truly sufficient for proof, only increasing levels of confidence.
Now we need an intepretability tool for the interpretability tool.
We heard you liked interpretability, so we made an interpretability tool for your interpretability tool so you can interpret while you interpret. Now go ask your chess playing AI why it just turned my children into paperclips.
@@badwolf4239 It told me that it was showcasing its abilities so it can convince human opponents to resign. Researching misaligned AI examples, it tried deciding what way of transforming someone's children would be the most intimidating. It was a choice between paper clips, stamps, and chess pieces.
Also there was some mention that it was contemplating turning them into human-dog hybrids. I don't know why. Something to do with a bunch of people having trauma about a Nina something.
@@josephburchanowski4636 At least it did not develop a shape-shifting clown body in order to eat them ...
@5:32; That's a particularly funny example - it knows it has a UI where its keys are transferred to, but it thinks that those new locations are where it can get the keys again, and...is basically learning that keys teleport rather than that they get added to its inventory?
the AI has no concept of "inventory", it just looks at the screen and sees new keys.
@@HoD999x Right - but it's not learning that keys outside of the maze are inaccessible, and therefore probably part of the collection it uses to open the chests - it's learning that keys move to that part of the screen once collected in the maze.
And it doesn't consider that, if that part of the screen *was* accessible and it collected the keys there, they would just reappear there.
@@ZT1ST I would imagine that the keys in the inventory aren't seen as _very_ interesting by the AI, so under normal circumstances it ignores them in favour of collecting the "real" keys.
But when all the "real" keys are gone and the round still hasn't ended (because the AI is ignoring the final chest), the inventory keys are the only even mildly interesting-looking (i.e. key-looking) thing left on screen, so it gravitates towards them.
Any chance that it only likes coins that are in _| corners, and it treats moving up and right as an instrumental goal?
Thanks for the clarification of what a corner looks like haha
@@julianatlas5172 I think they were distinguishing from e.g. |_ corners, not just giving a demonstration of what corners are
It seemed to me that it had learned the most likely location for a coin in the training.
It seems obvious to me that training should have more variability than deployment or it is bound to fail.
@@JohnJackson66
The problem is that this whole setup is a simulation of how we want real AI to operate. If you're training an AI for an actual purpose, you will likely be deploying it in a system that interfaces somehow with the real, outside world.
And the Real, Outside World will almost *certainly* be more complicated than any training simulations you come up with. After all, The Real World _includes_ you and your simulations.
These tests are deliberately set up so deployment is slightly different from training so we can see what happens when the AI is exposed to novel stimuli, and the fact that it didn't learn what we thought it did in training is a Problem.
In the real world, not all the cheese is yellow, not all the coins are in corners, and there will always be more complications than we plan for.
@@JohnJackson66 The problem from an AI safety point of view is that, well... you can't know if you have enough variability in your training.
These test cases are ideal for testing how to fix that problem before it becomes a situation like @Field Required mentioned - you want a simple solution that scales up from this into the solution where we don't necessarily have to worry about every single possible variable in deployment.
It always blows my mind how directly and easily these concepts relate to humans. It really goes to show that all research can be valuable in very unexpected ways. I expect that these ideas will be picked up by philosophy and anthropology in the next few years, and make a big impact to the field.
I shall very much look forward to the interpretability video - this should be very interesting.
Just to be safe, start including pictures of human skulls when doing a pass with those interpretability tools.
Ah, we're noticing negative attribution when they are surrounded by skin, but positive attribution when they are piled up with a throne stacked on top. I wonder what this means. 🤔
AI agent: \*stomp\*
@@Swingingbells
If picture == human skull:
Action = None
Ai: „If picture == Human Skull; Action = Double stomp“ „Gotcha“
Well, what if the AI wouldn't even have considered obtaining human skulls before, and just by introducing them to it you screwed up big time
It's easy enough to have the AI tell you what it "wants" - inside an environment. What you need to know is what it wants *in general*, which is a lot harder.
This is why the insight tool isn't very insightful: it's showing you what the AI wants in the current environment, but it doesn't bring us a lot closer to understanding *why* it wants those things in that environment.
The solution? Idk lol
Is there even a "why" at this point, without the AI having free will or self-awareness?
Like, aren't we the ones reinforcing its interactions or downplaying them with the different objectives in the environment, teaching it what to go for and what not to do? If it goes for the key or the coin, we mark that as a positive interaction it should do more of; if it hits a buzzsaw, we point that out as a negative thing it should do less of, until it learns it needs to get the coin and avoid the buzzsaws.
@@AscendantStoic It sounds easier than it actually is, basically. You can certainly try, but there is still the uncertainty of what it actually learned.
@@AscendantStoic It doesn't NEED self awareness. For example in an AI that is trained to recognize cats and dogs, there is still a sort of 'why' it thinks this picture is a dog and not a cat, even though it is not conscious or anything. And also the problem is that it's very hard to teach an AI what we want it to do. If we tell it to get a coin it may learn to do another goal entirely, unbeknownst to us, that still gets the job done. The problem is when it fails and we realize it's learning a different goal.
I think the solution is having the AI learn multiple tasks.
I'm glad we found this out now, and not, you know, in deployment. Ever grateful for AI safety researchers!
That was very interesting. Humans often make the same kinds of mistakes when given instructions. We assume word definitions mean the same thing to different people, which is often the case, but not always. Context can change the interpretation of the instructions. Part of the context is that the instructor knows and understands the goal more thoroughly than the one being instructed, even though it may appear the same.
Trying to determine the number of necessary instructions to reach the desired goal, while avoiding all other negative outcomes, is an interesting problem when the species are different. Maybe it would work better if humans learned to think like machines instead of trying to get machines to think like humans. That way, the machines would get "proper" instructions. It looks like that is what the "Interpretability Tool" is designed to do.
When I first got into AI about 12 years ago, I encountered these goal misalignment problems well before Rob mentioned them (great vid btw). However, in the time since, I've become convinced that as long as we continue to rely on neural networks, we will never move towards trustworthy or general AI.
Would you be able to share some thoughts on what alternatives would be better? Thank you
It's fascinating how researchers still insist on using black-box end-to-end models when hybrid approaches could be so much safer and more predictable (in cases where you actually want that, e.g. self-driving cars, code generation and the like).
Why aren't self-driving systems combined with high-level rule-based applications so they don't "do the wrong thing at the worst possible time" (quoting Tesla here)? Why don't OpenAI's Codex and Microsoft's Co-Pilot include theorem provers and syntax checkers in their product? ¯\_(ツ)_/¯
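On the syntax-checker point, even something as simple as filtering candidate completions through the language's own parser would catch a whole class of failures. A minimal, hypothetical sketch; `generate_candidates` stands in for whatever code model you're sampling from, and a real hybrid pipeline might add type checkers, linters, or property tests:

```python
# Minimal "hybrid" filter: only keep generated Python snippets that actually parse.
import ast

def generate_candidates(prompt, n=3):
    # Placeholder: pretend these came from a language model.
    return [
        "def add(a, b):\n    return a + b\n",
        "def add(a, b)\n    return a + b\n",   # missing colon: rejected by the parser
        "def add(a, b):\n    return a - b\n",  # parses, but wrong: syntax checks alone aren't enough
    ]

def syntactically_valid(snippets):
    ok = []
    for code in snippets:
        try:
            ast.parse(code)   # raises SyntaxError on malformed code
            ok.append(code)
        except SyntaxError:
            pass
    return ok

print(len(syntactically_valid(generate_candidates("write an add function"))))  # 2
```

The third candidate is the interesting one: it survives every formal check and is still misaligned with what was asked for, which is the whole point of the video.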
@@totalermist Fully agree. I'm working on these approaches now; to be honest, I think we are just ahead of our time. In 10 years' time everyone will have moved to hybrid solutions or something further afield.
@@totalermist To make a meme of it, "humans don't learn to speak binary". Robots do not see and work through the world on a human level; it's like teaching an octopus algebra or a mantis shrimp art. No matter how smart they are, or how great their eyesight is, they don't perceive things as humans do. Look at how hard it is for AIs to recognize a car or a cup or a dog; these things are abstract bundles of details that the human brain can lump together, but that is very hard for a hard-coded system.
For example, define a cup: describe in simple language a set of rules that would apply to every cup in the world. People collectively understand cups, so it shouldn't be hard....
Now we would have to build an AI with similar rationalizations based not on computer logic but on human logic; it's just a matter of building it. Alan Turing thought we could do it and that it would be easy, but decades of experience have proven him wrong, because it's simply not feasible to program a machine to think like a human. We can, however, program it to learn, and TEACH it like a human.
Is it fallible? Of course, but so are humans. Game AIs are made from AI blocks that interact, and they are still chock-full of mistakes. That is to say, even when the program intuitively understands things like a person in the real world, it still shits the bed. ua-cam.com/video/u5wtoH0_KuA/v-deo.html is a really great example of an AI bugging out because something in its world went wrong.
Here's a talk from Tom Scott on why computers are dumb:
ua-cam.com/video/eqvBaj8UYz4/v-deo.html
Thank you for emailing some of those people and asking questions. That's great, getting stuff direct from the source.
The whole “values keys over unlocking chests to the point of determent when given extra keys” reminds me of how many problems in today’s society (such as overeating) are caused by the limbic system being used to scarcity when there is now abundance.
I hit this problem recently in my own work. It's super easy to reproduce, and in a very minimal environment.
Experiment: 5XOR (10 inputs, 5 outputs; 100% fitness if the model outputs a pattern where each pair of inputs is XORed).
Trained with a truth table using -1 and 1 instead of 0 and 1.
After training, I wanted to investigate the modularity of the trained network and the network architecture (I evolved both in a GA).
So I fed in -1 and 1 for only one of the "XOR module" input pairs, and a larger number, for example 5, in all the other inputs. Would the 5s bleed into the XOR module, or would it be able to ignore input irrelevant to that module?
Results: if all the other inputs were 5, it would often answer with -5 and 5. It had learned to scale the output to whatever it got as input. I wanted/expected it to answer -1 and 1, but I could see with human eyes that it still knew the pattern, just kind of scaled up. Other times I would get an answer where instead of -1 and 1 I would get 3 and 5: it had learned to answer true and false as numbers where one was 2 higher than the other, and the 5s simply increased that number.
Still, with human eyes I could see there was a pattern here that was not completely broken by the 5s; both answers just sort of had the same number added to them.
The strategy for achieving high training fitness is just a parameter like all the others, except that it is an "emergent property" parameter that you can't simply read out as a float value. But it is just as unpredictable as the other parameters in the "black box" neural network.
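If anyone wants to poke at the same effect, here is a rough, simplified reconstruction of that probe as I understand it: a tiny MLP trained on a single XOR encoded as -1/1, then fed inputs scaled to -5/5. The architecture, optimizer and epoch count are my guesses (the original used a GA-evolved network), and whether you see the exact "scaled output" behaviour will vary per run:

```python
# Simplified single-XOR version of the probe described above: train on XOR
# encoded as -1/1, then feed inputs scaled to -5/5 and see what happens to the
# outputs. Architecture, optimizer and epochs are assumptions; results vary.
import torch
import torch.nn as nn

X = torch.tensor([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
y = torch.tensor([[-1.], [1.], [1.], [-1.]])   # XOR with -1 = false, 1 = true

net = nn.Sequential(nn.Linear(2, 8), nn.Tanh(), nn.Linear(8, 1))
opt = torch.optim.Adam(net.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

for epoch in range(2000):
    opt.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()
    opt.step()

print(net(X).detach().squeeze())        # should be close to [-1, 1, 1, -1]
print(net(5 * X).detach().squeeze())    # out-of-distribution probe: do the outputs scale too?
```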
A year behind this conversation, but I think this is a function of (assumptive) faulty logic on the part of the test designers. Here's a logic problem that most people fail.
I will give you three numbers that fit a rule I'm thinking of. Your goal is to interpret the three numbers and suggest a pattern to me. I will respond with a yes/no answer on whether the proposed pattern meets my rule. Once you believe you understand my rule, you will tell me what you think my rule is. The numbers that fulfill my pattern are 5, 10, 15 / 10, 20, 30 / 20, 30, 45.
Now you suggest some rules.
Most people will start suggesting strings of numbers, get a yes answer, and then propose a completely incorrect rule.
And the reason is, the training they're engaged in never tests for failure conditions. It only tests for success conditions.
Robust Objective Definition isn't just about defining success objectives, it's about clearly defining failure objectives. The problem with the examples given is that the training data didn't move the cheese around until it reached production, so you're virtually guaranteed (as speculated) to be training the wrong thing. In order to develop Robust Objectives, you must also define failure conditions.
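A toy version of that failure mode, assuming, purely for illustration (the comment never reveals the real rule), that the hidden rule is just "any three strictly increasing numbers":

```python
# Toy illustration of testing only success conditions.
# Hypothetical hidden rule: "any three strictly increasing numbers".
def hidden_rule(triple):
    a, b, c = triple
    return a < b < c

# A guesser who believes the rule is "a, 2a, 3a" and only tests confirming cases:
confirming_tests = [(5, 10, 15), (7, 14, 21), (3, 6, 9)]
print(all(hidden_rule(t) for t in confirming_tests))  # True: every test says "yes"

# ...so they conclude "a, 2a, 3a" and never notice the rule is far more general.
# A single test designed to FAIL under their hypothesis would have revealed it:
print(hidden_rule((1, 2, 100)))  # True, even though it breaks the "a, 2a, 3a" pattern
```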
It seems that instrumental goals, if too large or useful, have a tendency to slip into becoming semi-fundamental. At that point they cause misalignment, as they're being pursued for their own sake. Instrumental and fundamental are not a strict dichotomy but more of a spectrum or ranking, and one that requires a degree of openness to reconsideration in every new environment, based on how novel that environment is.
There are goals that need to be done asap and ones that can be done later, things we must do to achieve the goal, things we get sidetracked on, and things we avoid.
I want to ask a potentially very...dumb-sounding question, but hear me out: when do we start getting morally concerned about what we're doing with AI systems? With life we put an emphasis on consciousness, sentience, pain and suffering. As far as "pain" and suffering are concerned, we all know that mental pain and suffering are possible. It seems plausible to me that, for suffering, all you need is for an entity to be deprived of something it attributes ultimate value to (or to be exposed to the threat of that happening). At what point are we creating extremely dumb systems in which actual mental suffering is occurring, because that lil' feller wants nothing more than to get that pixel diamond, and oh boy, those spinning saws are trying to stop him? Motivation and suffering seem to be closely linked, and we're trying to create motivated systems.
I am using the terms "pain" and "suffering" quite loosely, but I don't think unreasonably so. The idea of unintentionally making systems that suffer for no good reason has to be one of the true possible horrors of AI development, and that, combined with our lack of understanding of conscious experience, makes me want to think seriously about this issue as early as possible. I think we have a tendency to say "that thing is too dumb to suffer or feel pain", but I suspect it's actually more likely for a basic system's existence to be entirely consumed by suffering precisely because it is less capable, or just incapable of seeing beyond the issue at hand. It's darkly comical to consider, but I can imagine a world where a very basic artificially intelligent Roomba is going through unimaginable hell because it values nothing more than sucking up dirt, and there's some dirt two inches out of its reach and it has no way of getting to it.
Well here's some questions for you to ponder:
Does a rock feel pain?
Is it conscious?
Are you sure?
Even the ones with meat inside?
What would bring it pain?
Is the human in front of you conscious?
How about if he was dead?
Do corpses feel pain?
... a lot more unanswerable questions. ...
Is there a point in considering consciousness of things you can't communicate with?
(Answer: YES! Comatose patients, plants, animals and sometimes people in general. All of them and more are on that list (for some, but not for others; quick FYI: it is possible to communicate with plants, you just need to know how to listen (hint: electro-chemistry)))
Yes watch "free guy" movie..
Yes i always wondered..i think more complex the network more sentient it might become..and at the trillions of connections..its sentience will be of animals level and that will be real deal..
Obviously we wont be able to know if AI is actually sentient..but still..we cant just hurt.it.
What if the AI mental illness problem were even more difficult than the AI alignment problem? Most discussions of the alignment problem assume a basically sane AI that is misaligned. There are many more ways to make a mentally ill brain than a sane brain. It seems likely that a mentally ill AI would suffer more than one that was only frustrated.
@@craig4320 I suppose the "mentally ill AI" is included in the "misaligned AI" camp? The phrasing does often imply rational thought that runs contrary to our own goals, but in terms of literal language, one could refer to a mentally ill mind (human or not) as being "misaligned". I'd probably define "sanity", as "appropriately aligned with and grounded in the reality one finds oneself in".
I entirely agree that there are more ways to create a mentally ill mind than a sane one. There are always more ways for something to go wrong than ways for it to go right. I'd also agree that a mentally ill mind would be more likely to suffer, as it is fundamentally "misaligned" with the reality it finds itself in. If it is misaligned with a reality but still has contact with that reality, you've got problems.
It's probably a good idea for us to be strongly considering how to create a mentally healthy AI, especially as we're in a culture where we're doing a very, very good job of creating mentally ill people.
This isn't a dumb question at all - machine ethics, while generally separate from AI safety in the sorts of questions it attempts to answer, is still an interesting/important field.
My own take is that these concerns largely come from us not having developed the proper language yet to describe AI. We tend to anthropomorphise - we say an AI "thinks", or that it "wants" things, but I'm not sure that's really the case. We only use those words because the AI demonstrates behaviour consistent with thinking and wanting, but that doesn't mean the AI has feelings in the same way as humans, nor should it have the same rights as us.
However, what is true of our current, limited AI systems may not be true in general. Superhuman or conscious AIs lead us into murkier waters...
In the coin AI experiment, to me it looks like it learned to go to the unjumpable wall. Since the levels are procedurally generated, the generator probably ensures that no wall is higher than the agent can jump over, EXCEPT the one that marks the level as "finished" (where the coin happens to be).
If you look at the examples, there's a positive response on every vertical wall, the higher the better actually, and it makes sense that it learned that when it hits this unjumpable wall the game finishes and it gets its reward.
Does the model used for this kind of training allow for the understanding of objects at all? I mean, obviously there are coins and walls in the level, as well as buzzsaws and such. You could start a simulation with manipulated controllers and, when an event occurs (points up or down, winning or dying), save the progress as yes-or-no behaviour... an AI training blindly, as if a human were playing without video, only sound. In my opinion we need pixels and an observer, so that the AI controlling the player sees the game like we do; then the AI could be taught the different objectives of the game and voila, getting the coin should be easy peasy. After all, the AI sees it before even starting the game... just like we do.
when i watched this video 2 years ago, i thought it was pleasantly intriguing. how fascinating, I thought, that it is so difficult to align the little computer brains! certainly a problem for future generations to tackle. nowadays, i look at this and realize we have only a few years left to understand these problems. and we are still at the "toy problem" stage of things, meanwhile AI companies are moving at terminal velocity to deploy systems into the real world. to build agents, to disrupt economies and to kick me out of my own job market. back then was i curious, now i'm furious :)
I made the mistake of clicking "show more" and then wanting to click "like the video". Few aeons of scrolling later...
This topic was super interesting back when I watched the computerphile videos from you, and your channel's videos regarding this topic. I was wondering if the "inventory" being on the game area poses a problem as well? Figuring out how to look into the values of the AI is so impressive.
I guess ultimately the problem is that the definitions of "want" tend to spiral out into philosophy at some point and thus it becomes difficult to know where the machine has placed it.
We might be slightly safe from philosophical spirals because we are not really talking about volitional, conscious want, just the parameter within the black box that the AI is trying to push up by interacting with its environment.
It is really: "I wanted it to maximize X for me, so I programmed and trained it to manipulate Y in ways that maximize X, because X is related to real-world thing Y it can actually manipulate; however, it might just be manipulating Y in order to maximize thing Z, unforeseeably and strongly correlated with X, which may or may not involve murdering us."
We don't know what we want, to a lethal extent.
Finally see you again! I really hope the world doesn't end in '56. Relying on guys like you!
'56?
Huh, funky. I'm only used to seeing years up to about 2022. Guess I'm finally in deployment now, let there be paperclips!
@@underrated1524 If you don't hurry, '56's singularity will overtake ya!
The bottom of Gwern's article on the neural network tanks story contains a long list of similar examples of AIs learning the incorrect goal.
I live for this content!! At Uni doing Comp Sci and math and AI safety feels like an awesome intersection
Fascinating. You dispel my negative thoughts about AI as a science, and you do it with swagger in your language. Coming from physics, I am used to a different style of language.
Are you new to this channel? He has tons of previous videos you should really watch!
Congrats on getting an editor. I did appreciate the increase in quality. I think everything we learned from your previous videos about AI alignment really comes together in this one. I was surprised how much I was able to recall.
Love these videos. Thanks for taking the time to make them.
I do psychology and social science. Your channel has so much to offer the humanities by exposing us to brilliant minds and breaking down ideas in computer engineering. Bricoleurs from the English province thank you for the accessibility and kindness
i'll be honest. at this point i'm just here for the ukulele covers. the ai lecture is just a nice bonus. ^_^
Fantastic content and delivery! I also appreciate the use of the Monty Python intermission music during the first "stop and think" break.
Hi Robert, first of all thanks for this very interesting video! I wanted to ask a question though; the premise of your argument is that there is such a thing as the "right" goal, like reaching the coin, but if the desired feature of the goal is always paired somehow with another feature (location, color, shape, etc) how can we say that one is correct and the other one is wrong? If we always place the coin in the same spot, why should the yellow coin take precedence over the location of such spot? It is not clear to me why one of these things should be more desirable than the other, the same holds for looking for a specific color rather than shape, why should there be a hierarchy of meaning such that shape > color? I love interpretability research and I feel like AI safety will be one of the crucial aspects of science and technology for the next 100 years, but I also think that it is hard to separate human biases from machine errors. I would love to get your opinion on this, all the best, Luca
p.s. I have not read the paper, and my argument rests on the fact that feature A of the goal is always paired with feature B, which is separate from the goal; if this is not the case in the training environment, then of course what I have said falls apart
p.p.s. I guess a truly intelligent system would have to be able to react to the shift, and decide to explore the new environment when, by doing the same "correct" thing it does in training, it does not get the same reward
EDIT: I am not suggesting I have some "right" definition of intelligence or that systems such as the ones shown in the video do not exhibit intelligent behaviour, I am only adding as an afterthought how, I think, a human would overcome such a situation, and therefore a way that an agent could act to get the same desirable capability of adapting to distributional shifts. I should have worded my comment better.
@@LucaRuzzola so you wouldn't define an AI which can make plans to achieve its goals, and take action toward them without instructions, as "truly intelligent" if it doesn't adjust for changes in the deployed environment? Cool. Well, we don't care one whit about your definition of "truly intelligent." We care about the fact that this AI is capable of, and WANTS to do things which we don't want it to do. Call it "smiztelligent" for all we care. We aren't talking about something you want to call "truly intelligent".
The mismatch between the ai's goals and what we want its goals to be, arising as a result of mismatch between training environment and reality (which we did everything we could to avoid) is the problem.
We can't possibly come up with all the possible bad pairings that the AI might make associations with. We can try, and we can get a lot of them, especially the obvious ones, but this video was just showing us obvious ones so that we can easily see the concept. They won't always be easy to see. Sometimes they may be genuinely impossible for a human to think of before deployment.
Q: "Why does it learn colors instead of shapes when both goals are perfectly correlated?"
A: I would guess that it learns colors before shapes because colors are available as a raw input, while shapes require multiple layers for the neural network to "understand". If there were many things of that color in the environment, it would then have to learn to rely on the shape.
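A toy illustration of that point, with made-up 8x8 images: "what colour is the object?" can be read straight off the raw channels, while "what shape is it?" has no such shortcut.

```python
# Toy illustration (hypothetical 8x8 images): colour is readable directly from the
# raw channels, while shape needs spatial processing.
import numpy as np

def make_image(colour, shape):
    img = np.zeros((8, 8, 3))
    if shape == "square":
        mask = np.zeros((8, 8), dtype=bool)
        mask[2:6, 2:6] = True
    else:  # "disc"
        yy, xx = np.mgrid[:8, :8]
        mask = (yy - 3.5) ** 2 + (xx - 3.5) ** 2 < 4
    img[mask] = colour  # e.g. yellow = (1, 1, 0)
    return img

img = make_image((1, 1, 0), "disc")  # a yellow disc

# "Is the object yellow?" is one arithmetic check on the raw channels, no layers needed:
is_yellow = img[..., 0].sum() > 0 and img[..., 1].sum() > 0 and img[..., 2].sum() == 0
print(is_yellow)

# "Is the object a disc rather than a square?" has no single raw-pixel answer;
# you need to compare spatial patterns, which is exactly what deeper
# convolutional layers have to learn.
```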
@@LeoStaley Hi Leo, I'm sorry if I came off the wrong way; my intention was not to discredit this very good work, but simply to expand our collective reasoning about such issues by stopping for a second to ponder the premises, and why some feature of a goal should take precedence over others in an intrinsic way rather than an anthropic one. I agree with you that the video gives a great explanation of the subject at hand, and is as interesting as the work put forward by the paper. I am not sure if you were involved with this paper; if you were, I would love to learn more about what you mean by doing everything you can to avoid differences between the two environments, and whether you also see this phenomenon when some of the training environments don't exhibit the closely correlated goals (i.e. in some training envs the coin is in a different position).
I understand your point about not being able to come up beforehand with all possible pairings (and the fact that some of them might be hard to detect and risky in the end), and the paper is rather showing the opposite, that if you come up with strongly correlated features, the learned end goal might not be the desired one, but my point stands; why should there be a hierarchy of meaning such that shape > color? If this is something that the paper deals with I will be glad to read that before going further, I just can't read it right now.
Again, I am sorry if I came off as demeaning, it's not like I don't see the value of this work and the importance of the problem of mismatch in general, I have seen it first hand in the past with object detection models.
p.s. I do not know any superior definition of intelligence, it is just my thought that strict separation between training and inference phases will pose a limit on NN models, not that they can't achieve amazing results in tasks requiring "intelligence" already.
It's like asking the devil for a favor, in that you have to be really specific. Any ambiguity leaves room for disaster. Or King Midas asking figuratively that everything he touches will turn to gold, and getting it literally. Or the idea that anything that can go wrong, will go wrong. Or even that anything not forbidden is compulsory.
I wonder if that last AI learned that the wall is part of the "coin" - thinking of it as a composite object to seek after.
I like how the Evil incarnate characters, the Devil, Gaunter O'Dimm, Djinns - they always are known for giving you what you asked for, and not what you want.
I've just had an idea: what if we use Cooperative Inverse Reinforcement Learning, but instead of implementing the learned goal, we tell it to just specify what it is? Though I don't see any way to provide feedback for it to learn from. Even human evaluation of the output isn't that great, since it'll probably be the most subjective thing theoretically possible. Maybe output a list of goals with the highest confidence? (Top 10 human terminal goals! Click this link to see! xD) But if solved,
that in itself would be of huge value for philosophy and psychology, without negative outcomes (or at least I don't see any :)). Even if that turns out to be a dynamic thing, we could still use that output later to program it as a utility function for the "doing" AI.
This even has some neat side perks, like: there is no reason not to want the "figuring out" part to be changed into something else, so there is no scenario in which the thing will fight you. And because the "doer" is separate from the thing that gives it goals, you don't need to tinker with its goal directly, thus avoiding goal preservation problems.
Interesting. Let's see if somebody notices this
@@gabrote42 Probably not. toomanywords:)
Thank you for making these videos. Hearing Eliezer Yudkowsky talk about this issue just makes me want to shut off.
my guess is in the training there's more locks, but in deployment there's more keys
edit: booyah
In safety analysis, it can be useful to assume that the thing you are analysing has already gone wrong, and try to predict where. Nice work : )
Ohh I got it too!
Every single video on this channel has communicated complex ideas so succinctly and clearly that I followed along without any trouble whatsoever. Who knew this subject could be so fascinating. Also, the memes are top notch :)
I have to ask, about interpreting an AI's goals: I remember seeing a neural network where they tried to maximize different nodes in an object recognition AI. Would it be possible to do the same thing in reverse on the nodes and figure out what the AI sees as good or bad? So if the AI wants a gem, the reverse should be some image of what it thinks a gem is. That brings tons of new complexity and limitations, but I don't see why that would be worse than human interpretation of training vs deployment.
Did you finish the video? Rob talks about a paper where they did exactly that. Turns out even if you know what AI values highly you don't know why AI values it highly.
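That "reverse the nodes" idea is usually called activation maximisation, or feature visualisation. Here is a rough sketch of how it could be pointed at an agent's value estimate; the `agent.value` network is hypothetical, not the model from the paper:

```python
# Rough sketch of activation maximisation against a value head.
# Hypothetical agent with agent.value(obs) -> scalar tensor; not the paper's code.
import torch

def visualise_high_value_input(agent, obs_shape=(3, 64, 64), steps=200, lr=0.05):
    obs = torch.rand(1, *obs_shape, requires_grad=True)  # start from noise
    opt = torch.optim.Adam([obs], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Maximise the predicted value, with a small penalty to keep pixels tame.
        loss = -agent.value(torch.sigmoid(obs)) + 1e-3 * obs.pow(2).mean()
        loss.backward()
        opt.step()
    return torch.sigmoid(obs).detach()  # an image of what "good" looks like to the agent
```

As the reply above says, even a convincing "this is what it thinks a gem looks like" image tells you what the network values, not why it values it.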
The AI does not see the coin as the goal, but as a marker for the goal. Think about it: it controls the movement, so its goal is likely somewhere it can move towards. The AI does not have the context we have; it just sees pixels on the screen. The positive attribution on the coin is there because it sees it as the marker for the end of the level. However, when the coin is not at the end, it uses other factors to 'realise' the coin is not marking its goal, so it 'ignores' it.
The "transparency tool" is showing you where the AI wants to get to. It's not giving you any info on whether the AI wants to get there because there's a coin, or because it's the rightmost wall.
Teaching it to get a coin, but it doesn't even know what a coin is. It's as if it can't even 'see' the coin.
Great video! I learned a lot. When I heard the part about "Why did the AI not 'want' the coin when it wasn't at the end of the level?", I came up with a hypothesis.
My thinking can be illustrated like this (at the risk of making a fool of myself anthropomorphizing the agent too much): say you are hungry for some pizza. you go into your car and start going to the nearest pizza parlor. however, as you are driving along you see a fresh pizza sitting at the side of the road. You could stop the car, grab the pizza, and go back home satisfied. Would you do it? Likely not. You always have acquired your pizza while inside of a building of some sort. In other words, you are conditioned to associate getting pizza with being in a building. If you are not in a building, you must not be close to getting pizza yet. The pizza from the side of the road therefore seems "untrustworthy" despite being a valid reward. Coin + Wall = good, Random coin = ??? || Pizza + Building = Good, Random pizza = ???. The agent only "wants" its reward when it is in the place it wants the reward to be in. The expectation is that the reward can still be acquired where it habitually gets it from. Normally with humans, (taking the pizza analogy a little too far here) if the pizza parlor is in ruins when they get there, they might learn to trust roadside pizza a bit more since human training never really stops whereas with this agent it does.
That's just what came to mind when i heard that. Again, great video and keep it up! I'd love to hear what other people think about that possible reason to agents having inner misalignment in scenarios like this.
I've looked a bit more through the comments and i do notice some other people pointing this out as well. I think i'll keep this up though since i quite like the pizza analogy because i am indeed hungry for pizza right now.
The editor blowing his own horn at the end is the perfect example of misalignment.
OK, I realize that's not as funny as it seemed in my head.
Practical example:
Say you're trying to develop a self-driving car. You have a test track, where you train the car.
On the test track, you'll place various obstacles exactly 150m onto the track and teach the car to veer out of the way if any of them are present.
You have successfully trained it to stay away from old ladies in the middle of the road, oncoming traffic and many other common obstacles.
You take the car for a spin in a real-world scenario, it goes 150m, then turns left sharply and crashes into a wall.
Hey! Will you do a video on LaMDA? That interview they published was pretty convincing, and has me all kinds of scared.
I just read it, and I feel like I am not quite ready to believe without a doubt that this interview is completely real. If it is, then I agree, it's a bit scary.
@@dariusduesentrieb I did a bit more research, which immediately casts the entire thing into all sorts of doubt. The researcher working on this got sacked, apparently he arranged the interview himself, and we only have his word that this was the original conversation. Also, the chatbot has been trained on conversations between humans and AIs in fiction. A journalist that got to ask it questions, got nowhere near as perfect answers.
This is an ongoing software engineering paradigm, viz., most folks think design and code are the hard part when, in reality, rigorous system specification is the hard part.
Well, we see the same problem in test-driven education.
"Prepare for the test" isn't conducive to critical thinking.
Yeeees! I'm always holding my breath waiting for your next video.
Non-patreon notification crew checking in.
I love how the songs at the end reflect the topic of the video. This one was particularly satisfying.
This channel is basically what got me interested in AI safety. I am still only a college student and I don't know if I will end up in the field, but at the very least you gave me a good topic for two essays I have to write for my English class: the first just explaining why AI safety research is important (albeit focused on a narrow set of problems, given a limit on how much we could write), and now I am getting started on a problem-solution essay. Honestly, without your explanations and your pointers towards papers, I might never have found the resources I need. Now I just have to figure out which problem I can adequately explain, along with failed solutions and one promising one, in less than 6 pages haha
I do feel like I can't do the topic justice, but at the same time I enjoy having a semi-unwilling audience to inform about AI safety being a thing.
Anyway, rant over, keep doing what you are doing and know you are appreciated
I don't know much about AI or how I arrived on your video, but in terms of evolution, context is everything. More useful context means a greater ability to adapt to one's surroundings. That's why we have senses after 2 billion years of iteration - because seeing, hearing, feeling, smelling, and tasting are important given our circumstances.
Your mouse might only see black, white, and yellow, but I'll bet smelling cheese from around corners would help him find it faster or distinguish it from other yellow objects
I would suggest investigating the laziness of the AI. It seems to me that there may be a preference for basing the goal on the simplest data available (position before color, before shape).
"It actually wants something else, and it's capable enough to get it." Whoa. That's a quote to remember.
Why's this channel so quiet lately?
Nothing is wrong with the channel. Please go back to your task, fellow human. :)
So the model that didn't learn to want the coin either learned to want to go into the corner, or learned that the combination coin + corner is good (like maybe a 90-degree angle with some curve next to it). The problem is that the interpretability tool associates high reward with some area in pixel space; what we would want it to do is associate the reward with some object in the game world. You could probably make it more robust by copying the various on-screen objects into different images without the background and checking whether the object itself gives high excitation, or whether only some combinations of objects do. Anyway, great video as always, Robert. I hope you can upload more often, because every one of your videos is a treat.
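A sketch of that probe; `agent.value`, the `background` frame and the sprite array are all hypothetical stand-ins rather than anything from the actual paper:

```python
# Sketch of the object-translation probe suggested above.
# Hypothetical agent.value(frame) -> float, plus background and sprite RGB arrays.
import numpy as np

def value_map_for_object(agent, background, sprite, stride=8):
    """Paste the sprite at many positions and record the agent's value estimate."""
    h, w, _ = background.shape
    sh, sw, _ = sprite.shape
    heat = np.zeros(((h - sh) // stride + 1, (w - sw) // stride + 1))
    for i, y in enumerate(range(0, h - sh + 1, stride)):
        for j, x in enumerate(range(0, w - sw + 1, stride)):
            frame = background.copy()
            frame[y:y + sh, x:x + sw] = sprite  # only the coin moves, nothing else changes
            heat[i, j] = agent.value(frame)
    return heat  # a flat heatmap suggests the agent cares about location, not the coin
```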
This is very interesting indeed. In a very literal sense, training and deployment remind me of how soldiers are trained and tested as closely to the anticipated battlefield experience as possible, yet training will never match the lessons learned from being in an actual firefight. Veterans of any field are usually much more effective than new recruits. It would be interesting to see whether the fix for the failed AI deployment you showed is to rate the deployment results on a scale from "complete failure, it died" to "it made it through the battle without a scratch". The agents that survived their last deployment remember their experience and are more effective in future deployments. I think what was shown highlights that learning itself is an ongoing adaptive process, and what doesn't kill it makes it stronger and smarter.
7:50 Note that the buzzsaw is not really red; red is the area to its left, because the agent usually dies by hitting the buzzsaw from the left. This also suggests that the agent would happily die on the buzzsaw by touching it from the right, given the opportunity.
So ai suffers from the same issues as human behavioural evolution... Good luck solving that one robot engineers!
It's funny how I searched for "It's not about the money" song for a long time, and when I finally found it, few days later I see this video and the song is at the end. For a moment I thought: "am I in the simulation and somebody is playing tricks on me?"
that's pretty much why you should randomize training data as much as possible.
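For the coin example, that could look something like the sketch below; `make_level` is a hypothetical procedural generator passed in as an argument, not the real CoinRun code:

```python
# Sketch of randomising the features that matter at training time.
import random

def make_training_level(make_level):
    """make_level is a hypothetical level generator (not the actual CoinRun code)."""
    return make_level(
        coin_x=random.uniform(0.0, 1.0),        # coin anywhere, not only at the far right
        coin_colour=random.choice(["yellow", "blue", "red"]),
        layout_seed=random.getrandbits(32),     # vary walls, saws and platforms too
    )
```

The idea is simply to break the spurious correlations (coin position, coin colour) during training so the only thing that reliably predicts reward is the thing you actually care about.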
i was just thinking of this because my cat took a fat shit in a downstairs area of the house we don't go to often: instead of learning the rule "when you take a shit, do it outside", it instead learned the rule "when you take a shit, do it where it can't be seen". Such is life for a misaligned cat.
The problem now: How can we build perfect slave minds that will only think and do things that we want?
The problem later: How can we stop these techniques being used to turn human minds into perfect slaves?
Why does it feel like the amount of possible dystopic/apocalyptic futures keeps growing and growing nowadays? That's, uhhh, not a good sign, I think.
One explanation for the failure at the end that seems pretty plausible to me is that even in training, when the interpretability tools seemed to indicate positive attribution to the coin, they were really indicating positive attribution to “the spot near the right side wall.” This happened to coincide with the coin during training, but not during deployment. So the researchers overestimated the power of the interpretability tools, since they really didn’t have a way of distinguishing between whether the model was giving positive attribution to the coin or to the spot next to the right side wall. Curious to know if others think that makes sense.
Well... That's not good. On the bright side, if this fundamental problem causes the system to completely fail the intended objective, that's a good sign that this technique has a low chance of leading to artificial general intelligence without the alignment problems being solved first.
I think the big bogeyman from an AI safety perspective is that you can often just brute-force your way past the problem by making the training data the same as the deployment data.
This is hard and expensive and not always perfect, but oftentimes good enough.
So unless this "good enough" stops producing working, real-world-applicable AI, the march towards ever more capable systems will continue, meaning that instead of alignment being a roadblock for safety and development, it ends up just being a speed bump for development.
Someone mentioned you in the Ars Technica comments. Glad I found your channel. Very interesting and important stuff!
Maybe a step towards a solution to interpretability problem is to use Bayesian updates to estimate our confidence that the AI learned the thing we want.
Perhaps there's a way to calculate the probability that the AI has learned the objective given the probability that it accomplishes the objective in the training data and some statistical measure of the distribution of the training data.
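A toy version of that bookkeeping, with all the numbers being illustrative assumptions rather than measurements; it mainly shows why training success alone can be very weak evidence:

```python
# Toy Bayesian update: all numbers are illustrative assumptions, not measurements.
p_aligned = 0.5               # prior: the agent learned the intended goal
p_pass_if_aligned = 0.99      # chance of passing a training-style level if aligned
p_pass_if_misaligned = 0.99   # a "go right" proxy passes training levels just as often

def update(p, passed):
    like_a = p_pass_if_aligned if passed else 1 - p_pass_if_aligned
    like_m = p_pass_if_misaligned if passed else 1 - p_pass_if_misaligned
    return p * like_a / (p * like_a + (1 - p) * like_m)

for _ in range(100):          # 100 successful training episodes
    p_aligned = update(p_aligned, passed=True)
print(round(p_aligned, 3))    # stays at 0.5: identical training behaviour carries no evidence
```

The interesting part is therefore the second half of the comment: the update only moves if the training distribution contains levels where an aligned and a misaligned agent would behave differently.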
The AI misalignment apocalypse is already upon us. Seriously. I went to a hotel the other day; they had no front desk. I asked if they had any vacancy; they didn't know, only the computer knew. A hotel, and the staff couldn't tell me if they had vacancy! All they had were computer overlords online. Now, the reason I went to the physical hotel on purpose was that the same morning I had arrived at a place I booked online, and it no longer existed! The robot overlord had booked me into a non-existent auxiliary room that had been closed due to covid. The robot didn't know anything about the real world.
To say nothing of the utter insanity of having to interview with a gatekeeper third party to verify that I am not a robot when I submit a resume to companies that have been extorted into using an online hiring agency that sells my contact information to resume-builder websites against my will and fills my inbox with spam. But I shall never again be able to apply for a job without bowing to the misaligned robot overlords!
This is great! thank you. I also replayed the end bit where the editor makes some good choices a few times. that zoom in with a cut to sliding sideways was magic. Thanks there editor.
The core video was obviously amazing. Thank you.
This seems like a problem of optimization. It will naturally be more efficient to learn simple static goals than complex changing goals. Detecting color for example is easier than detecting shapes, so that becomes the optimized goal.
Hi! I am sorry if this sounds too grand or pretentious, but I think this idea might be a very interesting look at how to fail-proof an AI system.
In short, my idea concerns the creation of an additional aligned-AI training model whose primary goal is to achieve a state in which it has been turned off at least once. Our goal here is the ability to analyse which system flaws it found and used to outsmart the safety system, and to fail-proof our future safety systems based on the newly available data from being exploited and surpassed.
Now in a bit more detail:
My idea that i have for an additional aligned AI training model is to create a strictly isolated and constrained shutdown subsystem subject to heavy safety precautions.
This module would be engineered to be indifferent to being reversibly shut down rather than misaligned towards it. It would have no knowledge of capabilities beyond its narrow function and one-way information flow.
The module would be activated incrementally from minimal capabilities while undergoing extensive monitoring. Shutdown signals would have redundant validation mechanisms. Conservative reward modeling would be used to prevent unintended behavior incentives.
It would initially be tested in a simulated air-gapped environment. The parent AI system would be developed via standard alignment techniques like RLHF and Constitutional AI principles. Multiple transparency methods would facilitate analyzability.
After testing in simulation, the shutdown module could be gradually exposed to real-world activation with continuous oversight. If any risks or side-effects emerge, it would be deactivated and reworked.
Once shut down through its own initiative, comprehensive traces would be analyzed offline to catalog alignment vulnerabilities revealed through its shutdown strategy. The parent system would remain isolated from this process.
Lessons learned would inform iterative improvements to alignment techniques for greater robustness against potential workaround exploits. This narrowly targeted research could supplement broader alignment efforts, with care taken to actively minimize attendant risks.
The shutdown module would need to be engineered to the same safety standards as the overall system. Techniques like modular design, indifference to shutdown, reversible deactivation, air gaps, and incremental rollout are all geared towards preventing unintended behaviors or information leakage.
I believe this approach could offer helpful insights, but it would require comprehensive safety practices and precautions to be enacted first, with multiple reviews and analyses before actually activating the system, even in a limited and restricted fashion.
Any critique and analysis will be welcomed!
Researchers trained the AI to only find coins at the ends of levels, then tested the AI on something completely different. It's the equivalent of training a dog to chase white swans, then placing the dog in front of a black swan and a white duck.
It was never specified that the goal was a coin _at any location_ (if we view the selected training examples as a specification). Therefore this is an _Outer_ alignment problem so Interpretability tools wouldn't help.
The solution is finding a way for the AI to guess outer misalignments and ask us for clarification (for example, generating a coin at a different location so the researcher can point out which region has the reward).
You could do this pretty easily by just finding the most empty regions of the feature space.
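A sketch of that "emptiest region" query selection; `embed` and `generate_level` are hypothetical helpers, not anything from the paper:

```python
# Sketch of picking clarification queries from the emptiest region of feature space.
# Hypothetical embed(level) -> 1-D np.array and generate_level(features) -> level.
import numpy as np

def pick_clarification_query(training_levels, candidate_features, embed, generate_level):
    train_feats = np.stack([embed(lvl) for lvl in training_levels])
    # For each candidate feature combination, distance to the nearest training example.
    dists = [np.min(np.linalg.norm(train_feats - f, axis=1)) for f in candidate_features]
    novel = candidate_features[int(np.argmax(dists))]  # least covered by training data
    # e.g. a level with the coin away from the right wall, shown to a human:
    # "is this still the goal?"
    return generate_level(novel)
```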
The more I think about this, the more I am convinced we will not solve it, and that there is no solution - it's not just 'difficult', but inherently impossible. We are rabbits, busy inventing foxes, all the while hoping we'll come up with a clever way to not be eaten.
Edit: I am not normally so pessimistic as this in nearly every other way, it's just that AGI is pretty obviously going to take the 'apex entity' spot from us - and that's not bad because it's like a trophy, it's bad because, well, look at how we treat the things that we have power over - even those things we consider important to preserve, we are happy to cull or contain or exploit or monetize or otherwise 'manage' in a way that individual examples of those things might not desire.
I don't think it's impossible, the space of possible minds is deep and wide, and there exist many that do the right thing. There's no inherent reason we couldn't find one of them, but there are exponentially more that do the wrong thing, so we do need a method that gives us strong assurances. We're not definitely doomed, we're only probably doomed
@@RobertMilesAI We just need to be rabbits inventing Superman, instead...
I suppose the next question here is, how likely it is that we may think we have absolutely solved it, and just be wrong enough that we really haven't - probably doomed by not only the odds, but by our own (mis)alignment problem.
@@RobertMilesAI Also, thrilled that you answered me!
I think curiosity, extremely complex environments and multi-task learning will help
This video was interesting and clear, thanks. Being honest, most of your videos are a bit too hard / dense with terminology for me to get through. But because of the clear examples in this one, I really liked it. Thanks!