We Were Right! Real Inner Misalignment

  • Published 26 Nov 2024

COMMENTS • 1.5K

  • @vwabi
    @vwabi 3 years ago +2405

    AI safety researchers are absolutely the last people on earth you want to hear "We were right" from.

    • @madshorn5826
      @madshorn5826 3 years ago +152

      And climatologists.

    • @Laszer271
      @Laszer271 3 years ago +28

      @@madshorn5826 Nah, an epidemic can destroy the world in months, climate change in decades. A superintelligent AI could probably destroy it before lunch :P

    • @donaldhobson8873
      @donaldhobson8873 3 years ago +164

      What about "we were totally wrong, the problem is much worse than we thought it was."

    • @madshorn5826
      @madshorn5826 3 years ago +6

      @@Laszer271
      Well, destroyed is destroyed.
      Or are you the type not to bother with insurance and health check-ups because a hypothetical bullet to the brain would rather quickly render those precautions moot?

    • @Laszer271
      @Laszer271 3 years ago +18

      @@madshorn5826 Fair enough. It was all a joke though. But in your example, I still think "I just got a bullet to the brain" is worse than "I just got diagnosed with cancer". Maybe the bullet is less likely, sure, but we were talking about the point where the danger is already proven, right? I think it's plausible that the probability of my survival is greater conditioned on a "we were right" statement made by an epidemiologist, climatologist or oncologist than on the same statement made by an AI safety expert or, like, a bullet...ologist.

  • @llucos100
    @llucos100 3 years ago +1884

    Turns out the Terminator wasn’t programmed to kill Sarah Connor after all, it just wanted clothes, boots and a motorcycle.

    • @Alorand
      @Alorand 3 years ago +201

      And ended up becoming the governor of California instead...

    • @spejic1
      @spejic1 3 years ago +347

      @@Alorand Becoming governor of California gets you MANY clothes, boots, and motorcycles.

    • @sevdev9844
      @sevdev9844 3 years ago +10

      Or making John Connor into a boyfriend. (You might think of Arnie when Terminator comes up, I think of Summer aka Cameron.)

    • @Saka_Mulia
      @Saka_Mulia 3 years ago +46

      That's Terminator goals ... not terminal ... oh never mind ... I get it

    • @quitequiet5281
      @quitequiet5281 3 years ago +5

      LOL Yup... in retrospect, with this paper, the Terminator was a pursuit bot... driving a threat variable toward the development and improvement of a General Artificial Intelligence. And look at all the upgrades that series of pursuit bots facilitated.
      LOL

  • @ShankarSivarajan
    @ShankarSivarajan 3 years ago +823

    10:54 "It actually wants something else, and it's capable enough to get it."
    Yeah, that _is_ worse.

    • @Encysted
      @Encysted 3 years ago +63

      The AI *does* in fact know how to drive a car, and it never really learned not to hit people.

    • @Rotem_S
      @Rotem_S 3 years ago +14

      @@Encysted Or it learned how not to hit people, but hits them whenever there are no witnesses, because it only cares about turning right.

    • @InfinityOrNone
      @InfinityOrNone 3 years ago +44

      @@Rotem_S
      Or it learned not to hit people because it really cared about maintaining the present state of the paint job, which was white in the training environment. But the deployment environment uses a _red_ car.

    • @InfinityOrNone
      @InfinityOrNone 3 years ago +3

      @@Rotem_S
      Wow, your user name confuses the comments section.

    • @xelspeth
      @xelspeth 3 years ago +6

      @@InfinityOrNone It doesn't. It just displays in the correct (right-to-left) reading direction that Hebrew uses.

  • @unvergebeneid
    @unvergebeneid 3 years ago +1114

    Famous last words for species right before they hit the great filter: "Yo, in the test runs, did paperclips max out on the positive attribution heat map, too?"

    • @michaelpapadopoulos6054
      @michaelpapadopoulos6054 3 years ago +117

      There are so many layers to this comment and I love it.

    • @underrated1524
      @underrated1524 3 years ago +180

      I keep hearing the notion of AI being the great filter, but I can't say I buy it.
      Not that AGI isn't an existential threat, because it absolutely is. It just can't explain why we don't see any signs of aliens when we look up at the sky, because if the answer is "AGI", that raises the question: "Okay, so why don't we see any of those, either?"

    • @AwfulnewsFM
      @AwfulnewsFM 3 years ago +24

      @@underrated1524 What if AGIs prefer to kill their creators and retreat to some deep bunker on some rogue planet to await the heat death after reward-hacking their brains?
      Still doesn't explain why they aren't here preparing to kill us.

    • @unvergebeneid
      @unvergebeneid 3 years ago +106

      @@underrated1524 I agree. Especially the paperclip optimizer should show itself in the form of huge paperclip-shaped megastructures around distant stars. It still made for a good joke though, if I do say so myself.

    • @sageinit
      @sageinit 3 years ago +25

      [Laughs in Grabby Aliens, Synthetic Super Intelligence, Gaia Hypothesis, Global Brain, & Planetary Scale Computation]

  • @bierrollerful
    @bierrollerful 3 years ago +993

    Almost sounds like AIs will need psychologists, too.
    "So I tried to acquire that wall..."
    "Why not the coin? What is it about the wall that attracts you?"
    "Well, in training, I always went to the... oh... huh, never thought about it that way."

    • @crubs83
      @crubs83 3 years ago +165

      AI safety researchers ARE psychologists as far as I'm concerned.

    • @PMA65537
      @PMA65537 3 years ago +16

      I was coping ok before the awful behaviour of that other AI used by the Shah of Lugash.

    • @lobrundell4264
      @lobrundell4264 3 years ago +11

      This made me smile :D

    • @ChrisBigBad
      @ChrisBigBad 3 years ago +60

      I clearly remember a Civ-type game where one of the research items was "AI without personality problems".

    • @bierrollerful
      @bierrollerful 3 years ago +17

      @@ChrisBigBad Sounds like research an AI with personality problems would try.

  • @proskub5039
    @proskub5039 3 years ago +386

    A coin isn't a coin unless it occurs at the edge of the map! We may think the AI is weird for ignoring the heretical middle-of-the-map coin, but that's just our object recognition biases showing.

    • @GigaBoost
      @GigaBoost 3 years ago +22

      Literally this haha

    • @sabelch
      @sabelch 3 years ago +22

      Great interpretation! But it doesn't seem to explain why the AI goes to the edge of the map even when there isn't a coin there.

    • @GigaBoost
      @GigaBoost 3 years ago +57

      @@sabelch It still seemingly learns to favor walls, if you look at the heatmaps. Perhaps without the coin, all it has to go by with positive value is the walls.

    • @proskub5039
      @proskub5039 3 years ago +26

      @@GigaBoost Yes, the salient point here is that we should not assume that the AI interprets objects the way we would. And any randomness in the learning process could lead to wildly different edge-case behaviors.

    • @GigaBoost
      @GigaBoost 3 years ago +2

      @@proskub5039 Absolutely!

  • @charliesteiner2334
    @charliesteiner2334 3 years ago +560

    9:00 "We developed interpretability tools to see why programs fail!" "What's going on when they fail?" "Dunno."
    No shade, interpretability is hard, even for simple AI :P

    • @YuureiInu
      @YuureiInu 3 years ago +32

      It just likes the coins next to the end wall. Why would you teach it to like only those and expect it to get any other coins?

    • @SimonClarkstone
      @SimonClarkstone 3 years ago +94

      It reminds me of koalas that can recognise leaves on plants as food, but not leaves on a plate.

    • @gabrote42
      @gabrote42 3 years ago +5

      @@SimonClarkstone Interesting.

    • @Bacopa68
      @Bacopa68 3 years ago +48

      @@SimonClarkstone AI HAS ADVANCED TO THE KOALA LEVEL. REPEAT, KOALA LEVEL. Ah, so basically nothing then.

    • @raskov75
      @raskov75 3 years ago +1

      And the more complex these systems get, the harder it becomes. Oy vey.

  • @SummerSong1366
    @SummerSong1366 3 years ago +586

    Let alone simple AI, _people_ get misaligned like that quite often - hoarding is one good example, which happens both in real life and in games, like with those keys.

    • @nikolatasev4948
      @nikolatasev4948 3 years ago +159

      It keeps amazing me how AI problems are increasingly becoming general human problems.
      "If we give a reward to the AI when it does a job we want, how do we stop it from giving itself the reward without the job" - just as humans give themselves "happiness" with drugs.
      "How do we make sure the AI did not just pretend to do what we wanted while we were watching" - just as kids do.

    • @sonkeschmidt2027
      @sonkeschmidt2027 3 years ago +16

      @@nikolatasev4948 Which is why eventually AI research will have to dive into religion/spirituality. Those were the only successful attempts humans made to solve the general problems that we have.
      Not saying that all of them were successful; life always moves on, there is always growth and decay/change. But every now and then they generated "the solution" to everything, rippling down to millions and billions of people trying to imitate it.

    • @markusmiekk-oja3717
      @markusmiekk-oja3717 3 years ago +90

      @@sonkeschmidt2027 I would claim religion does not help with that type of problem.

    • @sonkeschmidt2027
      @sonkeschmidt2027 3 years ago +7

      @@markusmiekk-oja3717 Then I invite you to look at what religion does. Functional religion - I'm not talking about what you know or have heard about it going wrong, I'm talking about the cases where it does work (which are those you never hear of, because... well, because they work; they don't cause trouble but bring stability, and that doesn't make news).
      If you look into that, you understand why religion is a global phenomenon and why it has the power it has.
      If you engage with scientists, you will also find that the West hasn't stopped being religious; it just rebranded it and called it science.
      We live in a world with a huge amount of uncertainty, where mistakes can have huge negative consequences. Humans can't deal with that without a working belief system. You have tons of these; you just probably wouldn't consider them religious. That will change, should life ever show you the scope of uncertainty there is. Good luck making it through without a (spiritual/religious) belief system that is in alignment with the society you live in. =)

    • @nikolatasev4948
      @nikolatasev4948 3 years ago

      @@sonkeschmidt2027 Well, the video about Generative Adversarial Networks, with an agent trying to find flaws and break the AI we are training, gave me strong Satan vibes. But apart from that I don't think we need further research into religion/spirituality. Simply put, they work on us, a product of long evolution in a specific environment. We need a more general approach, since AIs are a product of a very different evolution and environment. Some solutions for AI may resemble some religious notion, just as some scientific theories resemble some religious ideas, but trying to apply religion to AI is bound to fail, just as applying religion fails in science.

  • @rofl22rofl22
    @rofl22rofl22 3 years ago +1240

    Robert Miles: "We were right"
    Me: Oh no
    "About inner misalignment"
    OH NO

    • @LeoStaley
      @LeoStaley 3 years ago +89

      Yeah. The only thing worse is "we were right about AI being deceptive about its goals during training before deployment."

    • @JM-us3fr
      @JM-us3fr 3 years ago +29

      @@LeoStaley Or even worse: "We were right about AI being more dangerous than nukes."

    • @MetsuryuVids
      @MetsuryuVids 3 years ago +20

      @@JM-us3fr That's almost certain.

    • @LeoStaley
      @LeoStaley 3 years ago +18

      @@JM-us3fr Oh no, that's absolutely going to be true at some point. The only real question is: can we stop them from deciding to (even accidentally) kill us? Can we even avoid making them accidentally WANT to kill us because we accidentally fucked up the training environment?

    • @ARVash
      @ARVash 3 years ago +6

      @@JM-us3fr Nukes are safe because they kill people you don't want dead. I'd say an AI is definitely more dangerous because it has much more capacity to be selective. It could also be safer; it really depends on the implementation details, much like a person. A person can be safe, or dangerous. Can we even avoid making a human accidentally want to kill us because we accidentally fucked up the training environment?
      Maybe.

  • @goonerOZZ
    @goonerOZZ 3 years ago +547

    Somehow the terminal and instrumental goals talk made me relate the AI to us.
    As a financial advisor, I have found that many people also make this mistake: money is an instrumental goal, but having spent so much time working to get money, people start to think that money is their terminal goal, so much so that they spend their entire lives looking for money, forgetting why they wanted the money in the first place.

    • @anandsuralkar2947
      @anandsuralkar2947 3 years ago +17

      True

    • @MenwithHill
      @MenwithHill 3 years ago +44

      Very much the same feeling on my end. I actually found it cute when the chest-opening AI just started collecting keys.

    • @lennart-oimel9933
      @lennart-oimel9933 3 years ago +61

      The reason why I watch this channel is mostly that you can relate almost every video to human intelligence. And it makes sense: why shouldn't the same rules apply to us that apply to AI? I see this channel as an analysis of the problems of intelligence in general, not only the ones we make ;)

    • @GrilledCheeseSandwich1
      @GrilledCheeseSandwich1 3 years ago +34

      It seems like no one realized that this idea is hinted at by the song in the outro: Jessie J - Price Tag. The most famous line from the song is: "It's not about the money, money, money."

    • @jackren295
      @jackren295 3 years ago +22

      @@lennart-oimel9933 Me too. After watching this channel, I started to agree with the notion of "making AI = playing god" that I've sometimes heard in the past. At first, I didn't put much thought into it. But now I've realized that making powerful AGIs that are safe and practical requires us to know all the weaknesses of the human mind, and to make a system that avoids all these weaknesses while still performing at least as well as we can. It's like making the perfect "human being" in some sense.

  • @RichardEntzminger
    @RichardEntzminger 3 years ago +529

    I feel like this isn't just a problem with artificial intelligence but with intelligence in general. Biological intelligence seems to mismatch terminal goals and instrumental goals all the time, like Pavlovian conditioning training a dog to salivate upon recognizing a bell ringing (what should be the instrumental goal), or humans trading away happiness and well-being (what should be the terminal goal) for money (what should be an instrumental goal).

    • @Racnive
      @Racnive 3 years ago +41

      Organizations founded with the intent of doing X end up instead doing something that *looks like they're doing X*, because that's what people see; that's what people hold them accountable to.
      It doesn't even take intelligence: evolution by natural selection doesn't require any intelligence to winnow things away from what they "want" (terminal goals, should they exist), toward what will survive/replicate (at least in principle, an instrumental goal).

    • @salec7592
      @salec7592 3 years ago +60

      I concur with this. The problem is not AI-specific and should be termed something along the lines of the "general delegation problem", or the problem of command-chain fidelity. A subset of this is Miles' nightmare of an inverted capability hierarchy, where command is passed by a less able actor to a more able actor (e.g. a human to an advanced AI).

    • @Sindrijo
      @Sindrijo 3 years ago +7

      @@salec7592 Even with perfect interpretability of each component of an AI (e.g. the layers in a neural network), ulterior goals might still be encrypted into looking 'good'. An AI command structure with short-circuiting breaks in the reward loop might help. E.g. you would have people issuing commands/goals to an interpreter AI which interprets and delegates those commands to another AI (without knowing whether it is delegating to an AI or not), reducing the chance of goal misalignment by replacing complete-loop feedback with shorter feedback loops; also, randomly substitute each component of the command-delegation chain during training.

    • @sonkeschmidt2027
      @sonkeschmidt2027 3 years ago +2

      Is that a problem though? Or isn't it what makes life possible in the first place?
      After all, if you want to solve the problem that is life, then you just kill yourself. All problems solved. But then you can't experience life. So life needs decay in order to create new problems, so that something new can happen - needing, in the sense that existence can only exist as long as it exists. Without existence you don't have problems, but you don't have existence either.

    • @nahometesfay1112
      @nahometesfay1112 3 years ago +7

      @@sonkeschmidt2027 I might sound sarcastic, but the following questions are sincere. Do you think it's ok for AI to take over the world? Perhaps even drive humanity to extinction? Humans have done the same to other species, even other humans, and humans are not unique from the rest of life in this respect. As you said, decay makes way for new life. I think humanity should be preserved because I find destruction in general unsettling. To be clear, I'm not saying you are wrong or that you believe what I just said. I'm just wondering how your ideas extend to these topics.
      Edit: typing on my phone so I missed some other stuff: do you think existence is better than non-existence? To me non-existence is neutral. Do you think humans have a moral imperative to maintain their existence? Do you think humans need to go extinct at some point so that reality can continue to change? You brought up some very interesting ideas and I just wanted to hear more of your thoughts.

  • @bartman999
    @bartman999 3 years ago +219

    Nothing more terrifying than seeing the title 'We Were Right!' on a Robert Miles video.

    • @captainufo4587
      @captainufo4587 3 years ago +15

      In a way, yes. In another way, up to this point there was a debate over whether AI safety was a real concern worth investing research, time and money in, or just over-worrying. It's a good thing that these demonstrations proved it's the former, and that they happened this early in the history of AI.

  • @moartems5076
    @moartems5076 3 years ago +175

    Looking at my hoard of lockpicks in Skyrim, I can confirm that this is perfectly human behavior.

    • @OMGclueless
      @OMGclueless 3 years ago +14

      When you think about it, yeah, it's very human-like. Kind of like gambling addicts who know that they're losing money when they play, but have trained themselves to like the feeling of winning money rather than the ultimate goal of a comfortable, happy life, or even the instrumental goal of having money.

    • @threeMetreJim
      @threeMetreJim 3 years ago +5

      Definitely. What is wrong with collecting as many keys as possible if you want to open as many chests as possible, and each requires a key? In a maze you don't know what is around the corner in advance. Trying to collect your own inventory is simply a programming error, if the agent can see the part of the screen that is designed as a guide for a human observing the progress.

    • @Spellweaver5
      @Spellweaver5 1 year ago +6

      @@threeMetreJim Yes, but not trying to open the remaining chests is definitely the goal learned wrong.

    • @sharktrap267
      @sharktrap267 1 year ago +3

      @@threeMetreJim If my AI is built to keep my wood storage at a certain level by collecting wood from my forest, but it learned to "collect all the keys" (all the wood), my forest will soon become a plain. It's an issue, because growing trees takes time, wood takes storage space, and any wood not protected can become unsuitable for use. You're not just wasting resources, you're also at risk of not having wood available at some point.
      And if you use the forest for hunting too, you'll soon be learning to hunt on a plain.
      So depending on the goals and situation, hoarding can lead to issues.

  • @-41337
    @-41337 3 years ago +154

    Imagine a future where a very trusted AI agent seems to be doing its job fantastically well for many months or years, and then suddenly goes haywire, since its objective was wrong but it just hadn't encountered a circumstance where that error was made apparent. Then, tragedy!

    • @TulipQ
      @TulipQ 3 years ago +27

      I doubt it will be a grand reveal.
      People will die due to a physical machine; these interpreter tools can then be used to argue that the victim did something wrong, that a non-AI system was at fault, or that a human supervisor was neglegent.
      The deployment environment is one full of agents optimized for avoiding liability.

    • @CyborusYT
      @CyborusYT 3 years ago +54

      That's actually not that far from normal computer systems.
      There are countless stories of a system (an ordinary computer system) suddenly reaching a bizarre edge case and starting to act completely insane.

    • @NoName-zn1sb
      @NoName-zn1sb 3 years ago

      @@TulipQ negligent

    • @gastonmarian7261
      @gastonmarian7261 3 years ago +32

      Like when we designed computers without thinking/knowing about cosmic-ray bit flips, so decades later a plane falls out of the sky because its computer suddenly didn't know where it was. Humans are a trusted AI agent deployed in a production environment with limited understanding of what's going on.

    • @demoniack81
      @demoniack81 3 years ago +4

      @@CyborusYT Yeah, it happens literally all the time. It's just that usually the error gets caught somewhere along the way, an exception is thrown, and the process is terminated. Which is where you get the error page, and then pick up the phone and go talk to an actual person in customer service who can either override it or get the IT team to fix the problem.

  • @Practicality01
    @Practicality01 3 years ago +107

    This is starting to get an "unsolvable problem" vibe. Like we are somehow thinking about this in the wrong way, and current solutions aren't really making good progress.

    • @michaeljburt
      @michaeljburt 3 years ago +28

      Very much so. The psychology of teaching/learning as humans isn't really understood. What *actually* happens when you learn something new for the first time? Feedback on that process is vital. How do you give a machine feedback on what it learned, when you don't know what it learned exactly? It can't communicate to us what it "felt" it learned. In other words, the human says: "I said the goal was X". The machine says: "I thought the goal was Y".

    • @AfonsodelCB
      @AfonsodelCB 3 years ago +20

      @@michaeljburt Realize: we actually want these things to be much better than humans. But we might be underestimating how maxed out humans are at certain things. Humans have goal misalignments all the time, and many aren't detected for years.

    • @josephburchanowski4636
      @josephburchanowski4636 3 years ago +36

      "This is starting to get an "unsolvable problem" vibe. Like we are somehow thinking about this in the wrong way and current solutions aren't really making good progress."
      Welcome to AI safety. The best part is that if we don't solve the "unsolvable problem", we might all die.
      Along with all life on Earth, along with all life in the galaxy, along with all life in the galaxy cluster. And with the cannibalization of all planets and stars for resources for some arbitrary terminal goal.
      A potential outcome is a dead, dark chunk of the universe built as a tribute to something as arbitrary as paperclips or solving an unsolvable math problem.

    • @sonkeschmidt2027
      @sonkeschmidt2027 3 years ago +6

      Aren't we touching the biggest unsolvable problem in existence? Existence itself?
      Think about how terrifying it would be if you could solve every problem, if you could solve life. That means there would be an absolute border that you would be infinitely stuck with... Sounds better to me that there will always be a new problem to be solved...

    • @AlejandroMarin.design
      @AlejandroMarin.design 3 years ago

      Alignment in humans is solvable. I developed a methodology to do it easily and quickly, so I think alignment in machines is solvable. I’ve actually designed the methodology to serve machine alignment as well. We’ll get there, don’t despair.

  • @thecakeredux
    @thecakeredux 3 years ago +194

    The thought of creating a capable agent with the wrong goals is terrifying, actually; and yes, an agent being bad at doing something good is absolutely a problem much preferable to an agent being good at doing something bad.

    • @xxxJesus666xxx
      @xxxJesus666xxx 3 years ago +5

      Speaking of A.I. or psychology?

    • @gadget2622
      @gadget2622 3 years ago +26

      @@xxxJesus666xxx Yes.

    • @ThrowFence
      @ThrowFence 3 years ago +8

      Isn't this exactly what's happening with mega corporations?

    • @sharpfang
      @sharpfang 3 years ago +12

      Reminds me of the elections a couple of years ago in Poland. A very competent and capable, but thoroughly corrupt and evil political party was voted out and replaced with a party just as corrupt and evil but vastly less competent.

    • @thecakeredux
      @thecakeredux 3 years ago +16

      @@sharpfang That unironically is an improvement in today's political landscape. If I had to choose a form of evil, it'll always be the less capable rather than the less sinister.

  • @Huntracony
    @Huntracony 3 years ago +340

    Did you intentionally use the "It's not about the money" song for the video about the AI not going for the coins? Either way, that's quite funny. Well done.

    • @PhoebeLiv
      @PhoebeLiv 3 years ago +73

      His song choices are always amusingly on the nose, actually! A few off the top of my head are "The Grid" for his gridworlds video, "Mo Money Mo Problems" for concrete problems in AI safety, and "Every Breath You Take (I'll Be Watching You)" for scalable supervision.

    • @Huntracony
      @Huntracony 3 years ago +17

      @@PhoebeLiv Nice! Hadn't noticed before, but I'll definitely start paying closer attention from now on.

    • @thewrongjames
      @thewrongjames 3 years ago +13

      Another on-the-nose choice was Jonathan Coulton's "It's Gonna Be the Future Soon" on the video about what AI experts predict will be the future of AI.

    • @matthewwhiteside4619
      @matthewwhiteside4619 3 years ago +11

      He also used "I've Got a Little List" in one of his list videos.

    • @SpoonOfDoom
      @SpoonOfDoom 3 years ago +2

      I didn't catch that, that's great!

  • @ARVash
    @ARVash 3 years ago +83

    An interpreter, a mind-reading device, once you read it and respond, becomes a way for an agent to "communicate" with you, and it can communicate things that give an impression that hides its actual goal. A lot of these challenges arise when training or coordinating humans, and it's somewhat unsurprising that while a mind-reading device might seem to help at first, it won't be long before someone figures out how to appear to be doing the right thing while watching TV.

  • @johnno4127
    @johnno4127 3 years ago +72

    I realized I experience misalignment due to poor training data every couple of weeks.
    I work as a courier delivering packages in Missouri, USA, and I often meet people at their homes or workplaces. Unfortunately, I don't learn their names as attached to their faces, but rather as attached to locations, so that when I meet them someplace else I can't remember their names easily (if at all).

    • @mscout1
      @mscout1 1 year ago +10

      I had someone from my tabletop club say 'hi' to me in the gym. No idea who it was, because my brain was searching the wrong bucket of context.

  • @YuureiInu
    @YuureiInu 3 years ago +97

    "Can you spot the difference?"
    Pauses the video and looks for the difference... nothing. Unpause.
    "You can pause the video."
    Pauses again and manically looks for a pattern. More keys?
    "There's more keys in the deployment. Have you spotted it?"
    Yes!!!!

  • @EebstertheGreat
    @EebstertheGreat 3 years ago +42

    It looks like in the keys-and-chests environment, the AI was trying to get both keys and chests, but it was strongly prioritizing keys. When there were more chests than keys, it was always spending its keys quickly, so it never ended up with a bunch in its inventory. As a result, it never learned that keys at the left edge of the inventory were impossible to pick up, so it just got stuck there trying to touch them, since they were more important than the remaining chests.

    • @isaacgraphics1416
      @isaacgraphics1416 3 years ago +22

      It's the same problem evolution ran into when optimising our taste palate. Fat and sugar were highly rewarded in the ancestral environment, but now we live in a different (human-created) environment, and that same goal pushes us beyond what we actually need and creates problems for us.

    • @silphonym
      @silphonym 3 years ago +10

      @@isaacgraphics1416 It's really cool and scary to think of how this stuff applies to our natural intelligence as well.

    • @ohjahohfrick9837
      @ohjahohfrick9837 3 years ago +9

      @@silphonym Well, both came about from essentially the same process.

  • @Turtle76rus
    @Turtle76rus 3 years ago +78

    Can't wait for the "We Were Right! Real Misaligned General Superintelligence" video.

    • @michaelspence2508
      @michaelspence2508 3 years ago +12

      One more sentence and this would be the scariest Two-Sentence Horror Story I've ever seen.

    • @unvergebeneid
      @unvergebeneid 3 years ago +19

      Now here's a reason to actually "hit that bell icon" if I've ever seen one. Because the time window to watch that video would be rather small, I imagine 😄

    • @PetardeWoez
      @PetardeWoez 3 years ago +6

      Probably the last video ever made on the topic.

    • @Zeekar
      @Zeekar 3 years ago +8

      The question: which takes longer? Uploading a video to YouTube, or the entire world being converted to stamps?

    • @christiangreff5764
      @christiangreff5764 3 years ago +1

      @@Zeekar The former. At the point that video would be produced, we would have our hands full fighting the mechanical armies of the great paperclip maximiser (and it would probably have hacked and monopolized the internet to limit our communication channels).

  • @custos3249
    @custos3249 3 роки тому +68

    Well, pardon my comparison, but you've effectively found an adjunct to heuristic behavior based on sensory inputs like "things that taste sweet are good" and ending up with a dead kid after they drink something made with ethylene glycol. If it's always operating on heuristics, you'll never be sure it's learned what you intended, arguably even after complex demonstrations, given the non-zero chance of emergent/confounding goals. But, relative to human psychology at least, that's not a death sentence - weighting rewards differently, applying bittering agents, adding a time dimension/diminishing reward overtime jump to mind to trying to at least get apparent compliance. Besides, if the goal is "get the cheese," it needs to able to sense and comprehend "cheese," not just "yellow bottom corner good."

    • @saxy1player
      @saxy1player 3 роки тому +4

I'm not sure I understand you completely, but that IS the biggest problem with these 'intelligent' systems. We have no idea (let's not kid ourselves) how they work, but we are happy when they do what we want them to. Let's not think about what happens when we let these kinds of systems act in the world in a broader sense, and live happily until then xD

    • @jeremysale1385
      @jeremysale1385 3 роки тому +16

The ability to slow down and switch into more resource-intensive System 2 thinking when a problem is sufficiently novel is how humans (sometimes) get around this heuristic curse. I wonder if there is some analog of this function that could be implemented in machine learning.

    • @ChaoticNeutralMatt
      @ChaoticNeutralMatt Рік тому +1

      @@jeremysale1385 I imagine that will be the case eventually.

    • @pumkin610
      @pumkin610 Рік тому +1

Humans can chase things that seem appealing to us based on what we learned, but we can also choose to pursue a random/painful goal just because we want to; sometimes we just don't know the negative ramifications of an action, and sometimes we believe things that aren't true.

    • @custos3249
      @custos3249 Рік тому

      @@pumkin610 Neat. Bet that can still be reduced to and restated as "novelty is good." No matter what goal, drive, etc. you can come up with, it can be put in simple approach/avoidance terms, even seemingly paradoxical behavior. It all comes down to reward.

  • @andrewweirny
    @andrewweirny 3 роки тому +227

    This is one of your clearest and most interesting videos to date. I'm now very excited for the interpretability video!

    • @JabrHawr
      @JabrHawr 3 роки тому

a viewer's comment from 2 days ago despite the video having been published just a few hours ago. You must be a patron, or an acquaintance

    • @andrewweirny
      @andrewweirny 3 роки тому +2

      @@JabrHawr the former.

    • @michaeljburt
      @michaeljburt 3 роки тому

      Agreed. Exciting stuff

  • @Houshalter
    @Houshalter 3 роки тому +76

    Imagine training a self driving car in a simulation where plastic bags are always gray and children always wear blue. It then happily runs down a child wearing gray, before slamming on the brakes and throwing the unbuckled passengers through the windshield, for a blue bag on the road.

    • @nullone3181
      @nullone3181 3 роки тому +11

      The brat in gray was asking for it

    • @GetawayFilms
      @GetawayFilms 3 роки тому +9

      Imagine training a self driving car to the point where it can competently navigate complex road systems, yet can't remain stationary until all passengers are buckled up...

    • @Houshalter
      @Houshalter 3 роки тому +7

      @@GetawayFilms cars sold today only flash a warning light/noise if you don't buckle, and only because government regulations mandate it. Even then most people disable it

    • @GetawayFilms
      @GetawayFilms 3 роки тому +3

      @@Houshalter so what you're saying is . It's a 'people' thing... Ok

    • @sonkeschmidt2027
      @sonkeschmidt2027 3 роки тому +1

Humans do that all the time. Except that we have a deep genetic imperative to recognise children and to protect them, but there are loads of examples where these instincts are overridden....

  • @offchan
    @offchan 3 роки тому +29

It's the problem of vague requirements. It's similar to when you tell someone to do something but they do the wrong thing.
Humans solve this by sharing common sense with one another and using communication to specify stricter requirements.

    • @ГеоргиГеоргиев-с3г
      @ГеоргиГеоргиев-с3г 3 роки тому +9

      Yes, "give me a thing which looks like that other thing i mentioned earlier" in a room full of junk(without additional context), have had that problem.

    • @dsdy1205
      @dsdy1205 2 роки тому +3

Actually humans 'solve' this by having a reward function (emotions) that is only vaguely and very inconsistently coupled with reality, while mounting the whole thing on a very resource-intensive platform where half the processing capability is used just to stay alive, and where modifying itself is so resource-intensive that most don't even try.
And even then, we manage to inflict suffering on millions if not billions, so I'd say this isn't really solved either

    • @cornoc
      @cornoc Рік тому

      @@dsdy1205 yeah, i'm starting to think this is a fundamental problem that can't be removed, and that the only reason we aren't as worried about the same thing with humans is that the power of any particular human being is limited by the practical constraints imposed by their physical body and brain power. when you give the same type of rationality engine to a super powerful being, all kinds of horrible things are going to happen. just look at any war to see how badly a large group of humans led by a few maniacs can fuck up decades of history and leave humanity with lasting scars for centuries or more.

  • @Tutorp
    @Tutorp 3 роки тому +10

Hey, the key-AI works kind of the same way most people do when playing computer games... "Oooh, shiny things I don't need at all? I need them all! Game objectives? Meh..."

  • @leow.2162
    @leow.2162 3 роки тому +82

Is there a chance that very high level AIs will learn to expect the use of interpretability tools and use them to make us think they are better/safer than they are?

    • @IrvineTheHunter
      @IrvineTheHunter 3 роки тому +41

I can't remember which video it was, but I believe he did mention this with a super AI "safety button*": 1) if the AI likes the button, it will act unsafe to trigger it; 2) if it doesn't like the button, it will avoid those behaviors AND/OR stop the operator from pressing the button; and if it doesn't know about the button but is smart enough, it will figure out its likely existence and placement (see point two).
*a force-termination switch of any kind.
In short, yes: while an AI may not be "alive", it wants its goal and will always act to achieve said goal.

    • @artemis_fowl44hd92
      @artemis_fowl44hd92 3 роки тому +13

      @@IrvineTheHunter It's on the computer phile channel and is called 'AI "Stop Button" Problem - Computerphile'

    • @AssemblyWizard
      @AssemblyWizard 3 роки тому +2

      Not necessarily. There are some tests that you can't spoof no matter how smart you are, and even if you know they're coming.

    • @ГеоргиГеоргиев-с3г
      @ГеоргиГеоргиев-с3г 3 роки тому +8

      @@AssemblyWizard example?

    • @failgun
      @failgun 3 роки тому +9

      Yes. While the AI examples in this video are still simple, the intro to this problem discussed a malicious superintelligence. The instrumental goal "behave as expected in the training environment but do what you really want in deployment" can be performed with arbitrarily high proficiency, so if the AI can learn to hide its intentions from software inspection tools, it will, in principle. Without a way to logically exclude perverse incentives, there is no truly reliable way to screen for them since doing so is proving a negative. "Prove this AI doesn't have an alignment problem" is a lot like "Prove there is no god". No amount of evidence of good behaviour is truly sufficient for proof, only increasing levels of confidence.

  • @rentristandelacruz
    @rentristandelacruz 3 роки тому +35

    Now we need an intepretability tool for the interpretability tool.

    • @badwolf4239
      @badwolf4239 3 роки тому +8

      We heard you liked interpretability, so we made an interpretability tool for your interpretability tool so you can interpret while you interpret. Now go ask your chess playing AI why it just turned my children into paperclips.

    • @josephburchanowski4636
      @josephburchanowski4636 3 роки тому

      @@badwolf4239 It told me that it was showcasing its abilities so it can convince human opponents to resign. Researching misaligned AI examples, it tried deciding what way of transforming someone's children would be the most intimidating. It was a choice between paper clips, stamps, and chess pieces.
Also there was some mention it was contemplating turning them into human-dog hybrids. I don't know why. Something about a bunch of people having trauma about a Nina something.

    • @christiangreff5764
      @christiangreff5764 3 роки тому +1

@@josephburchanowski4636 At least it did not develop a shape-shifting clown body in order to eat them ...

  • @ZT1ST
    @ZT1ST 3 роки тому +19

    @5:32; That's a particularly funny example - it knows it has a UI where its keys are transferred to, but it thinks that those new locations are where it can get the keys again, and...is basically learning that keys teleport rather than that they get added to its inventory?

    • @HoD999x
      @HoD999x 3 роки тому +5

      the AI has no concept of "inventory", it just looks at the screen and sees new keys.

    • @ZT1ST
      @ZT1ST 3 роки тому +1

@@HoD999x Right - but it's not learning that keys outside the maze are inaccessible, and therefore probably part of the collection it uses to open the chests; it's learning that keys move to that part of the screen once collected in the maze.
And it doesn't consider that, if that part of the screen *were* accessible, any keys collected there would just re-appear in the same spot.

    • @HeadsFullOfEyeballs
      @HeadsFullOfEyeballs 3 роки тому +6

      @@ZT1ST I would imagine that the keys in the inventory aren't seen as _very_ interesting by the AI, so under normal circumstances it ignores them in favour of collecting the "real" keys.
      But when all the "real" keys are gone and the round still hasn't ended (because the AI is ignoring the final chest), the inventory keys are the only even mildly interesting-looking (i.e. key-looking) thing left on screen, so it gravitates towards them.

  • @clayupton7045
    @clayupton7045 3 роки тому +61

    any chance that it only likes coins that are in _| corners and it treats moving up and right as an instrumental goal?

    • @julianatlas5172
      @julianatlas5172 3 роки тому +17

      Thanks for the clarification of what a corner looks like haha

    • @drdca8263
      @drdca8263 3 роки тому +30

      @@julianatlas5172 I think they were distinguishing from e.g. |_ corners, not just giving a demonstration of what corners are

    • @JohnJackson66
      @JohnJackson66 3 роки тому +3

      It seemed to me that it had learned the most likely location for a coin in the training.
      It seems obvious to me that training should have more variability than deployment or it is bound to fail.

    • @fieldrequired283
      @fieldrequired283 3 роки тому +31

      @@JohnJackson66
      The problem is that this whole setup is a simulation of how we want real AI to operate. If you're training an AI for an actual purpose, you will likely be deploying it in a system that interfaces somehow with the real, outside world.
      And the Real, Outside World will almost *certainly* be more complicated than any training simulations you come up with. After all, The Real World _includes_ you and your simulations.
      These tests are deliberately set up so deployment is slightly different from training so we can see what happens when the AI is exposed to novel stimuli, and the fact that it didn't learn what we thought it did in training is a Problem.
      In the real world, not all the cheese is yellow, not all the coins are in corners, and there will always be more complications than we plan for.

    • @ZT1ST
      @ZT1ST 3 роки тому +15

      @@JohnJackson66 The problem from an AI Safety point is that, well...you can't know if you have enough variability in your training.
      These test cases are ideal for testing how to fix that problem before it becomes a situation like @Field Required mentioned - you want a simple solution that scales up from this into the solution where we don't necessarily have to worry about every single possible variable in deployment.

  • @SocialDownclimber
    @SocialDownclimber 3 роки тому +12

    It always blows my mind how directly and easily these concepts relate to humans. It really goes to show that all research can be valuable in very unexpected ways. I expect that these ideas will be picked up by philosophy and anthropology in the next few years, and make a big impact to the field.

  • @JamesPetts
    @JamesPetts 3 роки тому +55

    I shall very much look forward to the interpretability video - this should be very interesting.

  • @GamesFromSpace
    @GamesFromSpace 3 роки тому +35

    Just to be safe, start including pictures of human skulls when doing a pass with those interpretability tools.

    • @mhelvens
      @mhelvens 3 роки тому +35

      Ah, we're noticing negative attribution when they are surrounded by skin, but positive attribution when they are piled up with a throne stacked on top. I wonder what this means. 🤔

    • @Swingingbells
      @Swingingbells 3 роки тому +1

      AI agent: \*stomp\*

    • @lilDaveist
      @lilDaveist 3 роки тому +1

      @@Swingingbells
      If picture == human skull:
      Action = None
      Ai: „If picture == Human Skull; Action = Double stomp“ „Gotcha“

    • @arvidhansen5892
      @arvidhansen5892 Рік тому

      Well what if the ai wouldn't even have considered obtaining human skulls before and just by introducing them to it, you just screwed up big time

  • @9600bauds
    @9600bauds 3 роки тому +50

    It's easy enough to have the AI tell you what it "wants" - inside an environment. What you need to know is what it wants *in general*, which is a lot harder.
    This is why the insight tool isn't very insightful: it's showing you what the AI wants in the current environment, but it doesn't bring us a lot closer to understanding *why* it wants those things in that environment.
    The solution? Idk lol

    • @AscendantStoic
      @AscendantStoic Рік тому +1

Is there even a "why" at this point, without the AI having free will or self-awareness?
Like, aren't we the ones reinforcing its interactions or downplaying them with the different objectives in the environment, to teach it what to go for and what not to do? If it goes for a key or coin, we put emphasis on that as a positive interaction it should do more of; if it hits a buzzsaw, we point it out as a negative thing it should do less of, until it learns it needs to get the coin and avoid the buzzsaws.

    • @ChaoticNeutralMatt
      @ChaoticNeutralMatt Рік тому +2

      @@AscendantStoic It sounds easier than it actually is, basically. You can certainly try, but there is still the uncertainty of what it actually learned.

    • @charaicommenternotalt
      @charaicommenternotalt 10 місяців тому +1

      ​@@AscendantStoic It doesn't NEED self awareness. For example in an AI that is trained to recognize cats and dogs, there is still a sort of 'why' it thinks this picture is a dog and not a cat, even though it is not conscious or anything. And also the problem is that it's very hard to teach an AI what we want it to do. If we tell it to get a coin it may learn to do another goal entirely, unbeknownst to us, that still gets the job done. The problem is when it fails and we realize it's learning a different goal.
      I think the solution is having the AI learn multiple tasks.

  • @ozql
    @ozql 3 роки тому +11

    I'm glad we found this out now, and not, you know, in deployment. Ever grateful for AI safety researchers!

  • @picksalot1
    @picksalot1 3 роки тому +13

That was very interesting. Humans often make the same kinds of mistakes when given instructions. The assumption that word definitions mean the same thing to different people is often correct, but not always. Context can change the interpretation of the instructions. Part of the context is that the instructor knows and understands the goal more thoroughly than the one being instructed, even though it may appear the same.
    Trying to determine the number of necessary instructions to reach the desired goal, while avoiding all other negative outcomes, is an interesting problem when the species are different. Maybe it would work better if humans learned to think like machines instead of trying to get machines to think like humans. That way, the machines would get "proper" instructions. It looks like that is what the "Interpretability Tool" is designed to do.

  • @McMurchie
    @McMurchie 3 роки тому +27

When I first got into AI about 12 years ago, I had encountered these goal misalignment problems way before Rob mentioned them (great vid btw). However, in the time since, I've become convinced that as long as we continue to rely on neural networks, we will never move towards trustworthy or general AI.

    • @euged
      @euged 3 роки тому +10

      Would you be able to share some thoughts on what alternatives would be better? Thank you

    • @totalermist
      @totalermist 3 роки тому +21

      It's fascinating how researchers still insist on using black-box end-to-end models when hybrid approaches could be so much safer and more predictable (in cases where you actually want that, e.g. self-driving cars, code generation and the like).
      Why aren't self-driving systems combined with high-level rule-based applications so they don't "do the wrong thing at the worst possible time" (quoting Tesla here)? Why don't OpenAI's Codex and Microsoft's Co-Pilot include theorem provers and syntax checkers in their product? ¯\_(ツ)_/¯

    • @McMurchie
      @McMurchie 3 роки тому +7

@@totalermist fully agree, I'm working on these approaches now; to be honest, I think we are just ahead of our time. In 10 years' time everyone will have moved to hybrid solutions or something further afield.

    • @IrvineTheHunter
      @IrvineTheHunter 3 роки тому +5

@@totalermist To make a meme, "humans don't learn to speak binary": robots do not see and work through the world on a human level. It's like teaching an octopus algebra or a mantis shrimp art: no matter how smart they are, or how great their eyesight is, they don't perceive things as humans do. Look at how hard it is for AIs to recognize a car or cup or dog; these things are abstract bundles of details that the human brain can lump together but that are very hard for a hard-coded system.
For example, define a cup: describe in simple language a set of rules that would apply to every cup in the world. People collectively understand cups, so it shouldn't be hard....
Now we would have to build an AI with similar rationalizations, based not on computer logic but on human logic, and that's great. It's just a matter of building it. Alan Turing thought we could do it and that it would be easy, but decades of experience have proven him wrong, because it's simply too hard to program a machine to think like a human. We however CAN program it to learn, and TEACH it like a human.
Is it fallible? Of course; so are humans. Game AIs are made from AI blocks that interact, and they are still chock-full of mistakes. That is to say, even when the program intuitively understands things like a person in the real world, it still shits the bed. ua-cam.com/video/u5wtoH0_KuA/v-deo.html is a really great example of an AI bugging out because something in its world went wrong.
Some talks from Tom Scott on why computers are dumb:
ua-cam.com/video/eqvBaj8UYz4/v-deo.html

  • @crowlsyong
    @crowlsyong Рік тому +3

    thank you for emailing some of those people and asking questions. that's great getting stuff direct from source.

  • @Chuusuisetsujojutsu
    @Chuusuisetsujojutsu Рік тому +3

The whole "values keys over unlocking chests to the point of detriment when given extra keys" thing reminds me of how many problems in today's society (such as overeating) are caused by the limbic system being used to scarcity when there is now abundance.

  • @ANTIMONcom
    @ANTIMONcom 3 роки тому +9

I hit this problem recently in my own work. Super easy to reproduce, and in a very minimal environment.
Experiment: 5XOR (10 inputs, 5 outputs, 100% fitness if the model outputs a pattern where each pair of inputs is an XOR).
Trained with a truth table using -1 and 1, instead of 0 and 1.
After training: I wanted to investigate modularity of the trained network and network architecture (I evolved both in a GA).
So I fed in -1 and 1 for only one of the "XOR module input pairs", and a larger number, for example 5, in all the other inputs. Would the 5s bleed into the XOR module, or would it be able to ignore input irrelevant to that module?
Results: if all other inputs were 5, it would often answer with -5 and 5. It had learned to scale the output to match what it got as input. I wanted/expected it to answer -1 and 1, but I could see with human eyes it still knew the pattern, just scaled up. Other times, instead of -1 and 1, I would get answers like 3 and 5: it had learned to answer true and false as numbers where one was 2 higher than the other, and the 5s simply raised this baseline.
Still, with human eyes I could see there was a pattern that was not completely broken by the 5s; both answers just had the same number added to them.
The strategy for achieving high training fitness is just a parameter like all the others, except that it is an "emergent property parameter" that you can't simply read out as a float value. Yet it is just as unpredictable as the other parameters in the "black box" neural network.
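The probe described above can be sketched in plain numpy (a minimal stand-in for the evolved network, not the original GA setup): train a tiny tanh MLP on XOR with the -1/1 encoding, then feed it inputs scaled far outside the training range and just look at what comes out.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR truth table with -1/1 encoding, as in the experiment above.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([[-1], [1], [1], [-1]], dtype=float)

# Tiny 2-4-1 tanh MLP, trained with plain full-batch gradient descent.
W1 = rng.normal(0, 0.5, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 0.5, (4, 1)); b2 = np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    return np.tanh(h @ W2 + b2), h

def mse(pred):
    return float(((pred - y) ** 2).mean())

initial_loss = mse(forward(X)[0])
lr = 0.02
for _ in range(30000):
    out, h = forward(X)
    d_out = (out - y) * (1 - out ** 2)     # backprop through output tanh
    d_h = (d_out @ W2.T) * (1 - h ** 2)    # backprop through hidden tanh
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(0)
final_loss = mse(forward(X)[0])

print(initial_loss, final_loss)        # loss should drop during training
print(forward(X)[0].ravel())           # behaviour on the training range
print(forward(5 * X)[0].ravel())       # probe: inputs the net never saw
```

The point is the last line: nothing in training pins down what the network does at ±5, so whatever pattern it prints is a learned artifact of this particular fit, not anything you specified.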

    • @x11tech45
      @x11tech45 Рік тому +1

      A year behind this conversation, but I think this is a function of (assumptive) faulty logic on the part of the test designers. Here's a logic problem that most people fail.
I will give you three numbers that describe a rule I'm thinking about. Your goal is to interpret the three numbers and suggest to me a pattern. I will respond with a yes/no answer on whether the proposed pattern meets my rule. Once you believe you understand my rule, you will tell me what you think my rule is. The numbers that fulfill my pattern are 5, 10, 15 / 10, 20, 30 / 20, 30, 45.
      Now you suggest some rules.
      Most people will start suggesting strings of numbers, get a yes answer, and then propose a completely incorrect rule.
      And the reason is, the training they're engaged in never tests for failure conditions. It only tests for success conditions.
      Robust Objective Definition isn't just about defining success objectives, it's about clearly defining failure objectives. The problem with the examples given is that the training data didn't move the cheese around until it reached production, so you're virtually guaranteed (as speculated) to be training the wrong thing. In order to develop Robust Objectives, you must also define failure conditions.
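That rule game (a variant of the classic 2-4-6 rule-discovery task) is easy to mock up. Here's a hypothetical sketch where the hidden rule is just "strictly increasing", so every probe chosen to confirm a "multiples of 5" hypothesis passes, and the wrong rule survives until a failure-condition probe is tried.

```python
def hidden_rule(seq):
    # The rule the example triples (5,10,15 / 10,20,30 / 20,30,45)
    # actually follow: strictly increasing. Multiples are a red herring.
    return all(a < b for a, b in zip(seq, seq[1:]))

def candidate(seq):
    # The hypothesis most people form from the examples.
    return all(n % 5 == 0 for n in seq)

# Confirmation-only testing: every probe is chosen to FIT the candidate.
confirming = [(5, 10, 15), (25, 50, 75), (10, 20, 30)]
print([hidden_rule(s) for s in confirming])  # [True, True, True]

# Failure-condition probes: deliberately violate the candidate rule.
print(hidden_rule((1, 2, 3)))    # True  -> "multiples of 5" was never it
print(hidden_rule((15, 10, 5)))  # False -> order is what actually matters
```

Only the disconfirming probes separate the two rules; the success-only probes are consistent with both, which is exactly the training-set blind spot described above.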

  • @dino_rider7758
    @dino_rider7758 3 роки тому +20

    It seems that instrumental goals, if too large/useful, have a tendency to slip into becoming semi-fundamental. At that point, they cause misalignment as they're being pursued for their own sake. Instrumental and fundamental are not a strict dichotomy but more of a spectrum or ranking and one that requires a degree of openness to re-considering at every new environment based on how new that environment is.

    • @pumkin610
      @pumkin610 Рік тому +1

      There are goals that need to be done asap and ones that can be done later, things we must do to achieve the goal, things we get sidetracked on, and things we avoid.

  • @tommeakin1732
    @tommeakin1732 3 роки тому +24

    I want to ask a potentially very...dumb-sounding question, but hear me out: When do we start getting morally concerned about what we're doing with AI systems? With life we put an emphasis on consciousness, sentience, pain and suffering. As far as "pain" and suffering is concerned, we all know that mental pain and suffering is possible. It seems plausible to me that, for suffering, all you need is for an entity to be deprived of something that it attributes ultimate value to (or by being exposed to the threat of that happening). At what point are we creating extremely dumb systems where there is actual mental suffering occurring because that lil' feller wants nothing more to get that pixel diamond, and oh boy, those spinning saws are trying to stop him? Motivation and suffering seem to be closely linked, and we're trying to create motivated systems.
I am using the terms "pain" and "suffering" quite loosely, but I don't think unreasonably so. The idea of unintentionally making systems that suffer for no good reason has to be one of the true possible horrors of AI development, and that combined with our lack of understanding of conscious experience makes me want to seriously think about this issue as prematurely as possible. I think we have a tendency to say "that thing is too dumb to suffer or feel pain", but I suspect that it's actually more likely for a basic system's existence to be entirely consumed by suffering, as it is less capable, or just incapable, of seeing beyond the issue at hand. It's darkly comical to consider, but I can imagine a world where a very basic artificially intelligent roomba is going through unimaginable hell because it values nothing more than sucking up dirt, and there's some dirt two inches out of its reach and it has no way of getting to it.

    • @ГеоргиГеоргиев-с3г
      @ГеоргиГеоргиев-с3г 3 роки тому

      Well here's some questions for you to ponder:
      Does a rock feel pain?
      Is it conscious?
      Are you sure?
      Even the ones with meat inside?
      What would bring it pain?
      Is the human in front of you conscious?
      How about if he was dead?
      Do corpses feel pain?
      ... a lot more unanswerable questions. ...
      Is there a point in considering consciousness of things you can't communicate with?
(Answer: YES! Comatose patients, plants, animals and sometimes people in general. All of them and more are on that list (for some, but not for others; quick FYI: it is possible to communicate with plants, you just need to know how to listen (hint: electro-chemistry)))

    • @anandsuralkar2947
      @anandsuralkar2947 3 роки тому

Yes, watch the "Free Guy" movie..
Yes, I always wondered.. I think the more complex the network, the more sentient it might become, and at trillions of connections its sentience will be at the level of animals, and that will be the real deal..
Obviously we won't be able to know if an AI is actually sentient.. but still.. we can't just hurt it.

    • @craig4320
      @craig4320 3 роки тому

What if the AI mental illness problem was even more difficult than the AI alignment problem? Most discussions of the alignment problem assume a basically sane AI that is misaligned. There are many more ways to make a mentally ill brain than a sane brain. It seems likely that a mentally ill AI would suffer more than one that was only frustrated.

    • @tommeakin1732
      @tommeakin1732 3 роки тому +1

      @@craig4320 I suppose the "mentally ill AI" is included in the "misaligned AI" camp? The phrasing does often imply rational thought that runs contrary to our own goals, but in terms of literal language, one could refer to a mentally ill mind (human or not) as being "misaligned". I'd probably define "sanity", as "appropriately aligned with and grounded in the reality one finds oneself in".
I entirely agree that there are more ways to create a mentally ill mind than a sane one. There are always more ways for something to go wrong than ways for it to go right. I'd also agree that a mentally ill mind would be more likely to suffer, as it is fundamentally "misaligned" to the reality that it finds itself in. If it is misaligned to a reality, but still has contact with that reality, you've got problems.
It's probably a good idea for us to be strongly considering how to create a mentally healthy AI, seeing as we're in a culture where we're doing a very, very good job of creating mentally ill people

    • @alexpotts6520
      @alexpotts6520 2 роки тому +6

      This isn't a dumb question at all - machine ethics, while generally separate from AI safety in the sorts of questions it attempts to answer, is still an interesting/important field.
      My own take is that these concerns largely come from us not having developed the proper language yet to describe AI. We tend to anthropomorphise - we say an AI "thinks", or that it "wants" things, but I'm not sure that's really the case. We only use those words because the AI demonstrates behaviour consistent with thinking and wanting, but that doesn't mean the AI has feelings in the same way as humans, nor should it have the same rights as us.
      However, what is true of our current, limited AI systems may not be true in general. Superhuman or conscious AIs lead us into murkier waters...

  • @Nayus
    @Nayus 3 роки тому +14

In the coin AI experiment, to me it looks like it learned to go to the unjumpable wall. Since the levels are procedurally generated, it is probably programmed so that no wall is higher than the jump height allows you to clear, EXCEPT the one that marks the level as "finished" (where the coin happens to be).
If you look at the examples, there's a positive response on every vertical wall (the higher the better, actually), and it makes sense that it learned that when it hits this unjumpable wall the game finishes and it gets its reward.

    • @kimsteinhaug
      @kimsteinhaug Рік тому

Does the model used for this kind of training allow for understanding of objects at all? I mean, obviously there are coins and walls on the level, as well as buzzsaws and such. You could start a simulation by manipulating controllers, and when an event occurs (points up or down, or winning or dying) you save progress as yes/no behaviour... An AI training blindly, as if a human were playing without video, only sound. In my opinion we need pixels and an observer, so that the AI controlling the player sees the game like we do; then the AI could be taught the different objectives of the game, and voila, getting the coin should be easy peasy. After all, the AI sees it before even starting the game... just like we do.

  • @geraldtoaster8541
    @geraldtoaster8541 6 місяців тому +3

    when i watched this video 2 years ago, i thought it was pleasantly intriguing. how fascinating, I thought, that it is so difficult to align the little computer brains! certainly a problem for future generations to tackle. nowadays, i look at this and realize we have only a few years left to understand these problems. and we are still at the "toy problem" stage of things, meanwhile AI companies are moving at terminal velocity to deploy systems into the real world. to build agents, to disrupt economies and to kick me out of my own job market. back then was i curious, now i'm furious :)

  • @Yupppi
    @Yupppi 3 роки тому +6

    I made the mistake of clicking "show more" and then wanting to click "like the video". Few aeons of scrolling later...
    This topic was super interesting back when I watched the computerphile videos from you, and your channel's videos regarding this topic. I was wondering if the "inventory" being on the game area poses a problem as well? Figuring out how to look into the values of the AI is so impressive.

  • @witeshade
    @witeshade 3 роки тому +18

    I guess ultimately the problem is that the definitions of "want" tend to spiral out into philosophy at some point and thus it becomes difficult to know where the machine has placed it.

    • @hugofontes5708
      @hugofontes5708 3 роки тому

We might be slightly safe from philosophical spirals because we are not really talking about volitional, conscious want, just the parameter within the black box the AI is trying to manipulate by means of interacting with its environment.
It is really: "I wanted it to maximize X, so I programmed and trained it to manipulate Y in ways that maximize X (because X is related to real-world thing Y, which it can actually manipulate); however, it might just be manipulating Y in order to maximize thing Z, unforeseeably and strongly correlated with X, which may or may not involve murdering us"

    • @nullone3181
      @nullone3181 3 роки тому +2

      We don't know what we want, to a lethal extent.

  • @gabrote42
    @gabrote42 3 роки тому +3

    Finally see you again! I really hope the world doesn't end in '56. Relying on guys like you!

    • @underrated1524
      @underrated1524 2 роки тому

      '56?
      Huh, funky. I'm only used to seeing years up to about 2022. Guess I'm finally in deployment now, let there be paperclips!

    • @gabrote42
      @gabrote42 2 роки тому

      @@underrated1524 If you don't hurry, '56's singularity will overtake ya!

  • @Houshalter
    @Houshalter 3 роки тому +9

    The bottom of Gwern's article on the neural network tanks story contains a long list of similar examples of AIs learning the incorrect goal.

  • @MrCreeper20k
    @MrCreeper20k 3 роки тому +6

    I live for this content!! At Uni doing Comp Sci and math and AI safety feels like an awesome intersection

  • @GreenDayFanMT
    @GreenDayFanMT 3 роки тому +5

    Fascinating. You remove my negative thoughts on AI as a science with swag language. From physics, I am used to another language.

    • @i8dacookies890
      @i8dacookies890 3 роки тому +2

      Are you new to this channel? He has tons of previous videos you should really watch!

  • @Imperiused
    @Imperiused 3 роки тому +1

    Congrats on getting an editor. I did appreciate the increase in quality. I think everything we learned from your previous videos about AI alignment really comes together in this one. I was surprised how much I was able to recall.

  • @Lycandros
    @Lycandros 3 роки тому +5

    Love these videos. Thanks for taking the time to make them.

  • @olivercroft5263
    @olivercroft5263 3 роки тому +2

    I do psychology and social science. Your channel has so much to offer the humanities by exposing us to brilliant minds and breaking down ideas in computer engineering. Bricoleurs from the English province thank you for the accessibility and kindness

  • @SamuelElPesado
    @SamuelElPesado 3 роки тому +3

    i'll be honest. at this point i'm just here for the ukulele covers. the ai lecture is just a nice bonus. ^_^

  • @tlniec
    @tlniec 3 роки тому +1

    Fantastic content and delivery! I also appreciate the use of the Monty Python intermission music during the first "stop and think" break.

  • @LucaRuzzola
    @LucaRuzzola 3 роки тому +8

    Hi Robert, first of all thanks for this very interesting video! I wanted to ask a question though; the premise of your argument is that there is such a thing as the "right" goal, like reaching the coin, but if the desired feature of the goal is always paired somehow with another feature (location, color, shape, etc) how can we say that one is correct and the other one is wrong? If we always place the coin in the same spot, why should the yellow coin take precedence over the location of such spot? It is not clear to me why one of these things should be more desirable than the other, the same holds for looking for a specific color rather than shape, why should there be a hierarchy of meaning such that shape > color? I love interpretability research and I feel like AI safety will be one of the crucial aspects of science and technology for the next 100 years, but I also think that it is hard to separate human biases from machine errors. I would love to get your opinion on this, all the best, Luca

    • @LucaRuzzola
      @LucaRuzzola 3 роки тому

      p.s. I have not read the paper, and my argument rests on the fact that feature A of the goal is always paired with feature B which is separate from the goal, if this is not the case in the training environment than of course what I have said falls apart

    • @LucaRuzzola
      @LucaRuzzola 3 роки тому +1

      p.p.s. I guess a truly intelligent system would have to be able to react to the shift, and decide to explore the new environment when, by doing the same "correct" thing it does in training, it does not get the same reward
      EDIT: I am not suggesting I have some "right" definition of intelligence or that systems such as the ones shown in the video do not exhibit intelligent behaviour, I am only adding as an afterthought how, I think, a human would overcome such a situation, and therefore a way that an agent could act to get the same desirable capability of adapting to distributional shifts. I should have worded my comment better.

    • @LeoStaley
      @LeoStaley 3 роки тому +1

      @@LucaRuzzola so you wouldn't define an AI which can make plans to achieve its goals, and take action toward them without instructions, as "truly intelligent" if it doesn't adjust for changes in the deployed environment? Cool. Well, we don't care one whit about your definition of "truly intelligent." We care about the fact that this AI is capable of, and WANTS to do things which we don't want it to do. Call it "smiztelligent" for all we care. We aren't talking about something you want to call "truly intelligent".
      The mismatch between the ai's goals and what we want its goals to be, arising as a result of mismatch between training environment and reality (which we did everything we could to avoid) is the problem.
      We can't possibly come up with all the possible bad pairings that the AI might make associations with. We can try, and we can get a lot of them, especially the obvious ones, but this video was just showing us the obvious ones so that we can easily see the concept. They won't always be easy to see. Sometimes they may be genuinely impossible for a human to think of before deployment.

    • @stephentimothybennett
      @stephentimothybennett 3 роки тому +1

      Q: "Why does it learn colors instead of shapes when both goals are perfectly correlated?"
      A: I would guess that it learns colors before shapes because colors are available as a raw input while shapes require multiple layers for the neural network to "understand". If there are many things of that color in the environment, then it would learn to rely on the shape.
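      A toy illustration of this point (not from the paper; the 2-pixel "images" and the XOR stand-in for shape are made up): a purely linear model can fit a color-like rule, which is linear in the raw pixels, but cannot fit a shape-like rule, which requires composing pixels across positions.

```python
import numpy as np

# Four tiny 2-pixel "images". "Color" (brightness) is linear in the raw
# input; "shape" (here an XOR pattern across pixels) is not, so a single
# linear layer fits the first rule exactly but not the second.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
color = X.mean(axis=1)                                   # linear in pixels
shape = np.logical_xor(X[:, 0], X[:, 1]).astype(float)   # needs depth

def linear_fit_error(X, y):
    """Worst-case error of the best linear-plus-bias least-squares fit."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.abs(A @ coef - y).max()

print(linear_fit_error(X, color))  # ~0: color is fit exactly
print(linear_fit_error(X, shape))  # 0.5: XOR cannot be fit linearly
```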

    • @LucaRuzzola
      @LucaRuzzola 3 роки тому

      @@LeoStaley Hi Leo, I'm sorry if I came off the wrong way, my intention was not to discredit this very good work, but simply to expand our collective reasoning about such issues by stopping for a second to ponder the premises and why some feature of a goal should take precedence over others in an intrinsic way rather than an anthropic one. I agree with you that the video makes a great explanation of the subject at hand, and is as interesting as the work put forward by the paper. I am not sure if you were involved with this paper; if you were, I would love to know more about what you mean by doing everything you can to avoid differences between the 2 environments, and whether you see this phenomenon also when some of the training environments don't exhibit the closely related goals (i.e. in some training envs the coin is in a different position).
      I understand your point about not being able to come up beforehand with all possible pairings (and the fact that some of them might be hard to detect and risky in the end), and the paper is rather showing the opposite, that if you come up with strongly correlated features, the learned end goal might not be the desired one, but my point stands; why should there be a hierarchy of meaning such that shape > color? If this is something that the paper deals with I will be glad to read that before going further, I just can't read it right now.
      Again, I am sorry if I came off as demeaning, it's not like I don't see the value of this work and the importance of the problem of mismatch in general, I have seen it first hand in the past with object detection models.
      p.s. I do not know any superior definition of intelligence, it is just my thought that strict separation between training and inference phases will pose a limit on NN models, not that they can't achieve amazing results in tasks requiring "intelligence" already.

  • @tobuslieven
    @tobuslieven 3 роки тому +1

    It's like asking the devil for a favor, in that you have to be really specific. Any ambiguity leaves room for disaster. Or King Midas asking figuratively that everything he touches will turn to gold, and getting it literally. Or the idea that anything that can go wrong, will go wrong. Or even that anything not forbidden is compulsory.

  • @buttonasas
    @buttonasas 3 роки тому +3

    I wonder if that last AI learned that the wall is part of the "coin" - thinking of it as a composite object to seek after.

  • @cowbless
    @cowbless 3 роки тому +1

    I like how the Evil incarnate characters, the Devil, Gaunter O'Dimm, Djinns - they always are known for giving you what you asked for, and not what you want.

  • @JustAnotherPerson3
    @JustAnotherPerson3 3 роки тому +5

    I've just had an idea: What if we use Cooperative Inverse Reinforcement Learning, but instead of implementing the learned goal, we tell it to just specify what it is? Though I don't see any way to provide feedback for it to learn. Even human evaluation of the output isn't that great, since it'll probably be the most subjective thing theoretically possible. Maybe output a list of goals with highest confidence? (Top 10 human terminal goals! Click on this link to see! xD) But if solved,
    that in itself would be of huge value for philosophy and psychology, without negative outcomes (or at least I don't see any:)). Even if that turns out to be a dynamic thing, we can still use that output later to program it as a utility function for the "doing" AI.
    This even has some neat side perks, like: There is no reason to not want the "figuring out" part to be changed into something else, so there is no scenario in which the thing will fight you. And because the "doer" is separate from the thing that gives it goals, you don't need to tinker with its goal directly, thus avoiding goal preservation problems.

    • @gabrote42
      @gabrote42 3 роки тому +1

      Interesting. Let's see if somebody notices this

    • @JustAnotherPerson3
      @JustAnotherPerson3 3 роки тому +1

      @@gabrote42 Probably not. toomanywords:)

  • @EliStettner
    @EliStettner Рік тому

    Thank you for making these videos. Hearing Eliezer Yudkowsky talk about this issue just makes me want to shut off.

  • @CyborusYT
    @CyborusYT 3 роки тому +25

    my guess is in the training there's more locks, but in deployment there's more keys
    edit: booyah

    • @SocialDownclimber
      @SocialDownclimber 3 роки тому

      In safety analysis, it can be useful to assume that the thing you are analysing already went wrong, and trying to predict where. Nice work : )

    • @nahometesfay1112
      @nahometesfay1112 3 роки тому

      Ohh I got it too!

  • @daldous
    @daldous 3 роки тому +1

    Every single video on this channel has communicated complex ideas so succinctly and clearly that I followed along without any trouble whatsoever. Who knew this subject could be so fascinating. Also, the memes are top notch :)

  • @themrus9337
    @themrus9337 3 роки тому +5

    I have to ask, for interpretation of ai's goals. I remember seeing a neural network that tried to maximize different nodes in a object recognition ai. Would it be possible to do the same thing and reverse the nodes and figure out what the ai sees as good or bad? So if the ai wants a gem the reverse should be some image of what it thinks a gem is. That brings tons of new complexity and limitations but I don't see why that would be worse than human interpretation of training vs deployment
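    A minimal sketch of the feature-visualization idea this comment describes (activation maximization), with the deep network replaced by a single made-up linear value head; real tools do the same thing by gradient ascent on the input through a full trained network.

```python
import numpy as np

# Gradient-ascend an input "image" to maximize a (toy) value output, to
# visualize what "looks good" to the model. Here value = sum(w * x) with
# invented weights w standing in for a learned value head.
rng = np.random.default_rng(1)
w = rng.normal(size=(8, 8))   # stand-in for learned value weights
x = np.zeros((8, 8))          # start from a blank image

for _ in range(100):
    grad = w                  # d(value)/dx for value = sum(w * x)
    x += 0.1 * grad           # ascend the value gradient
    x = np.clip(x, -1, 1)     # keep pixels in a valid range

# The optimized "image" matches the sign pattern of the value weights,
# i.e. it is exactly the input the head scores most highly.
print(np.all(np.sign(x) == np.sign(w)))  # True
```

With a real network the same loop back-propagates through many layers, and as the video discusses, the resulting picture still may not tell you *why* the model values what it values.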

    • @nahometesfay1112
      @nahometesfay1112 3 роки тому +1

      Did you finish the video? Rob talks about a paper where they did exactly that. Turns out even if you know what AI values highly you don't know why AI values it highly.

  • @ichigo_nyanko
    @ichigo_nyanko 3 роки тому +1

    The AI does not see the coin as the goal, but as a marker for the goal. Think about it: it controls the movement, so its goal is likely something it can move towards. The AI does not have the context we have; it just sees pixels on the screen. The positiveness for the coin is there because it sees this as the marker for the end of the level. However, when the coin is not at the end, it uses other factors to 'realise' the coin is not marking its goal, so it 'ignores' it.

  • @donaldhobson8873
    @donaldhobson8873 3 роки тому +3

    The "transparency tool" is showing you where the AI wants to get to. It's not giving you any info on whether the AI wants to get there because it's got a coin, or because it's the rightmost wall.

    • @threeMetreJim
      @threeMetreJim 3 роки тому

      Teaching it to get a coin, but it doesn't even know what a coin is. It's as if it can't even 'see' the coin.

  • @stormwolfenterprises3269
    @stormwolfenterprises3269 2 роки тому +1

    Great video! I learned a lot. When i heard the part about "Why did the AI not 'want' the coin when it wasn't at the end of the level?" I have a hypothesis.
    My thinking can be illustrated like this (at the risk of making a fool of myself anthropomorphizing the agent too much): say you are hungry for some pizza. you go into your car and start going to the nearest pizza parlor. however, as you are driving along you see a fresh pizza sitting at the side of the road. You could stop the car, grab the pizza, and go back home satisfied. Would you do it? Likely not. You always have acquired your pizza while inside of a building of some sort. In other words, you are conditioned to associate getting pizza with being in a building. If you are not in a building, you must not be close to getting pizza yet. The pizza from the side of the road therefore seems "untrustworthy" despite being a valid reward. Coin + Wall = good, Random coin = ??? || Pizza + Building = Good, Random pizza = ???. The agent only "wants" its reward when it is in the place it wants the reward to be in. The expectation is that the reward can still be acquired where it habitually gets it from. Normally with humans, (taking the pizza analogy a little too far here) if the pizza parlor is in ruins when they get there, they might learn to trust roadside pizza a bit more since human training never really stops whereas with this agent it does.
    That's just what came to mind when i heard that. Again, great video and keep it up! I'd love to hear what other people think about that possible reason to agents having inner misalignment in scenarios like this.

    • @stormwolfenterprises3269
      @stormwolfenterprises3269 2 роки тому

      I've looked a bit more through the comments and i do notice some other people pointing this out as well. I think i'll keep this up though since i quite like the pizza analogy because i am indeed hungry for pizza right now.

  • @-na-nomad6247
    @-na-nomad6247 3 роки тому +3

    The editor blowing his own horn at the end is the perfect example of misalignment.
    OK, I realize that's not as funny as it seemed in my head.

  • @morphman86
    @morphman86 3 роки тому +1

    Practical example:
    Say you're trying to develop a self-driving car. You have a test track, where you train the car.
    On the test track, you'll place various obstacles exactly 150m onto the track and teach the car to veer out of the way if any of them are present.
    You have successfully trained it to stay away from old ladies in the middle of the road, oncoming traffic and many other common obstacles.
    You take the car for a spin in a real-world scenario, it goes 150m, then turns left sharply and crashes into a wall.

  • @hakonmarcus
    @hakonmarcus 2 роки тому +3

    Hey! Will you do a video on LaMDA? That interview they published was pretty convincing, and has me all kinds of scared.

    • @dariusduesentrieb
      @dariusduesentrieb 2 роки тому

      I just read it, and I feel like I am not quite ready to believe without a doubt that this interview is completely real. If it is, then I agree, it's a bit scary.

    • @hakonmarcus
      @hakonmarcus 2 роки тому +1

      @@dariusduesentrieb I did a bit more research, which immediately casts the entire thing into all sorts of doubt. The researcher working on this got sacked, apparently he arranged the interview himself, and we only have his word that this was the original conversation. Also, the chatbot has been trained on conversations between humans and AIs in fiction. A journalist that got to ask it questions, got nowhere near as perfect answers.

  • @inyobill
    @inyobill 3 роки тому +1

    This is an on-going software engineering paradigm, viz., most folks think design and code are the hard part, when, in reality, rigorous system specification is the hard part.

  • @madshorn5826
    @madshorn5826 3 роки тому +4

    Well, we see the same problem in test driven education.
    "Prepare for the test" isn't conducive to critical thinking.

  • @ittixen
    @ittixen 3 роки тому +1

    Yeeees! I'm always holding my breath waiting for your next video.

  • @LeoStaley
    @LeoStaley 3 роки тому +3

    Non-patreon notification crew checking in.

  • @CarlYota
    @CarlYota Рік тому

    I love how the songs at the end reflect the topic of the video. This one was particularly satisfying.

  • @Thundermikeee
    @Thundermikeee 2 роки тому +3

    This channel is basically what got me interested in AI safety. I am still only a college student and I don't know if I will end up in the field, but at the very least you gave me a good topic for two essays I have to write for my English class: the first just explaining why AI safety research is important (albeit focused on a narrow set of problems, given a limit on how much we could write), and now I am getting started on a Problem-Solution Essay. Honestly, without your explanations and pointers towards papers, I might never have found the resources I need. Now I just have to figure out what problem I can adequately explain, and show failed solutions and one promising one for, in less than 6 pages haha.
    I do feel like I can't do the topic justice, but at the same time I enjoy having a semi-unwilling audience to inform about AI safety being a thing.
    Anyway, rant over, keep doing what you are doing and know you are appreciated

  • @westganton
    @westganton 3 роки тому

    I don't know much about AI or how I arrived on your video, but in terms of evolution, context is everything. More useful context means a greater ability to adapt to one's surroundings. That's why we have senses after 2 billion years of iteration - because seeing, hearing, feeling, smelling, and tasting are important given our circumstances.
    Your mouse might only see black, white, and yellow, but I'll bet smelling cheese from around corners would help him find it faster or distinguish it from other yellow objects

  • @martinogenchi
    @martinogenchi 3 роки тому +3

    I would suggest investigating the laziness of the AI. It seems to me that there may be a preference for setting the goal based on the simplest data available (position before color, before shape).

  • @BologneyT
    @BologneyT Рік тому

    "It actually wants something else, and it's capable enough to get it." Whoa. That's a quote to remember.

  • @OccultDemonCassette
    @OccultDemonCassette Рік тому +4

    Why's this channel so quiet lately?

    • @Otek_Nr.3
      @Otek_Nr.3 Рік тому

      Nothing is wrong with the channel. Please go back to your task, fellow human. :)

  • @Laszer271
    @Laszer271 3 роки тому

    So the model that didn't learn to want the coin either learned to want to go into the corner or learned that combination coin-corner is good (like maybe 90 degrees angle + some curve next to it). The problem is that the interpretability tool associates high reward with some area in pixel space. What we would want it to do is to associate the reward with some object in the game world. Could probably make it more robust by copying various objects that are on-screen to different images without copying the background and checking if the object itself gives high excitation or do some combinations of objects give high excitation. Anyway, great video as always, Robert. Hope you could upload more often because every one of your videos is a treat.

  • @thomasneff376
    @thomasneff376 3 роки тому +3

    This is very interesting indeed. In a very literal sense, the act of training and deployment reminds me of how soldiers are trained and tested as closely to the anticipated battlefield experience as possible, but training will never match lessons learned from being in an actual firefight. Veterans of any field are usually much more effective than new recruits. It would be interesting to see if the fix for the failed AI deployment you showed is to rate the deployment results on a scale from complete failure (it died) to making it through the battle without a scratch. The agents that survived their last deployment remember their experience and are more effective in future deployments. I think what was shown highlights that learning itself is an ongoing adaptive process, and what doesn't kill it makes it stronger and smarter.

  • @Alexander_Sannikov
    @Alexander_Sannikov 3 роки тому

    7:50 note that the buzzsaw is not really red. Red is the area to its left, because the agent usually dies by hitting the buzzsaw from the left. This also suggests that the agent would happily die on the buzzsaw by touching it from the right, given the opportunity.

  • @Monkey-fv2km
    @Monkey-fv2km 3 роки тому +5

    So ai suffers from the same issues as human behavioural evolution... Good luck solving that one robot engineers!

  • @sikor02
    @sikor02 3 роки тому +1

    It's funny how I searched for "It's not about the money" song for a long time, and when I finally found it, few days later I see this video and the song is at the end. For a moment I thought: "am I in the simulation and somebody is playing tricks on me?"

  • @geld420
    @geld420 Рік тому +2

    that's pretty much why you should randomize training data as much as possible.

  • @TinoYahoo
    @TinoYahoo 3 роки тому +1

    i was just thinking of this because my cat took a fat shit in a downstairs area of the house we don't go to often: instead of learning the rule "when you take a shit, do it outside", it instead learned the rule "when you take a shit, do it where it can't be seen". Such is life for a misaligned cat.

  • @MsJaye0001
    @MsJaye0001 3 роки тому +13

    The problem now: How can we build perfect slave minds that will only think and do things that we want?
    The problem later: How can we stop these techniques being used to turn human minds into perfect slaves?

    • @nullone3181
      @nullone3181 3 роки тому +6

      Why does it feel like the amount of possible dystopic/apocalyptic futures keeps growing and growing nowadays? That's, uhhh, not a good sign, I think.

  • @random.math7894
    @random.math7894 2 роки тому +1

    One explanation for the failure at the end that seems pretty plausible to me is that even in training, when the interpretability tools seemed to indicate positive attribution to the coin, they were really indicating positive attribution to “the spot near the right side wall.” This happened to coincide with the coin during training, but not during deployment. So the researchers overestimated the power of the interpretability tools, since they really didn’t have a way of distinguishing between whether the model was giving positive attribution to the coin or to the spot next to the right side wall. Curious to know if others think that makes sense.

  • @Zeekar
    @Zeekar 3 роки тому +3

    Well... That's not good. On the bright side, if this fundamental problem causes the system to completely fail the intended objective, that's a good sign that this technique has a low chance of leading to artificial general intelligence without the alignment problems being solved first.

    • @nocare
      @nocare 3 роки тому +1

      I think the big boogie man from an AI safety perspective is that you can often just brute force your way past the problem by making the training data the same as the deployment data.
      This is hard and expensive and not always perfect, but oftentimes good enough.
      So unless this "good enough" stops producing working, real-world-applicable AI, the march towards ever more capable systems will continue. Meaning instead of alignment being a roadblock for safety and development, it ends up just being a speed bump for development.

  • @TexasTimelapse
    @TexasTimelapse 3 роки тому

    Someone mentioned you in the Ars Technica comments. Glad I found your channel. Very interesting and important stuff!

  • @redjr242
    @redjr242 3 роки тому +2

    Maybe a step towards a solution to interpretability problem is to use Bayesian updates to estimate our confidence that the AI learned the thing we want.
    Perhaps there's a way to calculate the probability that the AI has learned the objective given the probability that it accomplishes the objective in the training data and some statistical measure of the distribution of the training data.
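    A hedged sketch of what such an update could look like (all priors and likelihoods below are invented for illustration). Note the catch, which echoes the video: when a proxy goal succeeds exactly as often as the intended goal in training, no number of successes moves the posterior at all.

```python
# Hypothetical Bayesian update on whether an agent learned the intended
# objective, given that it succeeds on n training episodes in a row.
def posterior_learned(prior, p_success_if_learned, p_success_if_proxy, n_successes):
    """P(learned intended goal | n consecutive training successes)."""
    likelihood_learned = p_success_if_learned ** n_successes
    likelihood_proxy = p_success_if_proxy ** n_successes
    numerator = prior * likelihood_learned
    denominator = numerator + (1 - prior) * likelihood_proxy
    return numerator / denominator

# A perfectly correlated proxy ("go right") succeeds just as often in
# training, so success provides no evidence either way:
print(posterior_learned(0.5, 0.99, 0.99, 100))  # 0.5

# Only when training data decorrelates the goals does evidence accumulate:
print(posterior_learned(0.5, 0.99, 0.9, 50))    # ~0.99
```

This suggests the statistical measure the comment asks for would have to capture how much the training distribution *separates* the intended goal from its proxies, not just how often the agent succeeds.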

  • @brianarcher8339
    @brianarcher8339 2 роки тому

    The AI misalignment apocalypse is already upon us. Seriously. I went to a hotel the other day; they had no front desk. I asked if they had any vacancy; they didn't know, only the computer knew. A hotel, they were the staff, and they couldn't tell me if they had vacancy! All they had were computer overlords online. Now, the reason I went to the physical hotel on purpose was because the same morning I arrived at a place I booked online, and it no longer existed! The robot overlord had booked me into a nonexistent auxiliary room that had been closed due to covid. The robot didn't know anything about the real world.
    To say nothing of the utter insanity of having to interview with a gatekeeper third party to verify that I am not a robot when I submit a resume to companies that have been extorted into having an online hiring agency that is selling my contact information to resume-builder websites against my will, and filling my inbox with spam. But I shall never again be able to apply to a job without bowing to the misaligned robot overlords!

  • @pudgy_buns
    @pudgy_buns 3 роки тому

    This is great! thank you. I also replayed the end bit where the editor makes some good choices a few times. that zoom in with a cut to sliding sideways was magic. Thanks there editor.
    The core video was obviously amazing. Thank you.

  • @emilemil1
    @emilemil1 3 роки тому +1

    This seems like a problem of optimization. It will naturally be more efficient to learn simple static goals than complex changing goals. Detecting color for example is easier than detecting shapes, so that becomes the optimized goal.

  • @TheManinBlack9054
    @TheManinBlack9054 Рік тому +1

    Hi! I am sorry if this sounds too grand or pretentious, but i think this idea might be a very interesting look at how to fail-proof an AI system.
    In short, my idea concerns the creation of an additional aligned AI training model with its primary goal being to achieve a state of being where it has been turned off at least once, our goal here being the ability to analyze as to what system flaws it found and used to outsmart the system and its behaviour and fail-proof our future safety systems based on the newly available data from being exploited and surpassed.
    Now in a bit more detail:
    My idea that i have for an additional aligned AI training model is to create a strictly isolated and constrained shutdown subsystem subject to heavy safety precautions.
    This module would be engineered to be indifferent to being reversibly shut down rather than misaligned towards it. It would have no knowledge of capabilities beyond its narrow function and one-way information flow.
    The module would be activated incrementally from minimal capabilities while undergoing extensive monitoring. Shutdown signals would have redundant validation mechanisms. Conservative reward modeling would be used to prevent unintended behavior incentives.
    It would initially be tested in a simulated air-gapped environment. The parent AI system would be developed via standard alignment techniques like RLHF and Constitutional AI principles. Multiple transparency methods would facilitate analyzability.
    After testing in simulation, the shutdown module could be gradually exposed to real-world activation with continuous oversight. If any risks or side-effects emerge, it would be deactivated and reworked.
    Once shut down through its own initiative, comprehensive traces would be analyzed offline to catalog alignment vulnerabilities revealed through its shutdown strategy. The parent system would remain isolated from this process.
    Lessons learned would inform iterative improvements to alignment techniques for greater robustness against potential workaround exploits. This narrowly targeted research could supplement broader alignment efforts, with care taken to actively minimize attendant risks.
    The shutdown module would need to be engineered to the same safety standards as the overall system. Techniques like modular design, indifference to shutdown, reversible deactivation, air gaps, and incremental rollout are all geared towards preventing unintended behaviors or information leakage.
    I believe this approach could offer helpful insights, but would require comprehensive safety practices and precautions to be enacted first, with multiple reviews and analyses before actually activating the system, even in a limited and restricted fashion.
    Any critique and analysis will be welcomed!

  • @LowestofheDead
    @LowestofheDead 3 роки тому +1

    Researchers trained the AI to only find coins at the ends of levels, then tested the AI on something completely different. It's the equivalent of training a dog to chase white swans, then placing the dog in front of a black swan and a white duck.
    It was never specified that the goal was a coin _at any location_ (if we view the selected training examples as a specification). Therefore this is an _Outer_ alignment problem so Interpretability tools wouldn't help.
    The solution is finding a way for the AI to guess outer misalignments and ask us for clarification (for example, generating a coin at a different location so the researcher can point out which region has the reward).
    You could do this pretty easily by just finding the most empty regions of the feature space.
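    A rough sketch of that last suggestion (the 2-D feature space, the cluster location, and the grid resolution are all made up): generate the clarification query from the region of feature space farthest from every training example.

```python
import numpy as np

# Training features cluster in one corner, mimicking "the coin is always
# at the end of the level" - every example has both features together.
rng = np.random.default_rng(0)
train = rng.uniform(0.7, 1.0, size=(50, 2))

# Candidate query points on a grid over the whole (toy) feature space.
xs = np.linspace(0, 1, 21)
grid = np.array([(x, y) for x in xs for y in xs])

# For each candidate, distance to its nearest training example; the
# emptiest region is the candidate maximizing that distance.
dists = np.linalg.norm(grid[:, None, :] - train[None, :, :], axis=-1).min(axis=1)
query = grid[dists.argmax()]
print(query)  # farthest grid point from the cluster: [0. 0.]
```

Generating a training level at that point (a coin away from the end wall) would let the researcher label the ambiguous case before deployment.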

  • @AtomicShrimp
    @AtomicShrimp 2 роки тому +1

    The more I think about this, the more I am convinced we will not solve it, and that there is no solution - it's not just 'difficult', but inherently impossible. We are rabbits, busy inventing foxes, all the while hoping we'll come up with a clever way to not be eaten.
    Edit: I am not normally so pessimistic as this in nearly every other way, it's just that AGI is pretty obviously going to take the 'apex entity' spot from us - and that's not bad because it's like a trophy, it's bad because, well, look at how we treat the things that we have power over - even those things we consider important to preserve, we are happy to cull or contain or exploit or monetize or otherwise 'manage' in a way that individual examples of those things might not desire.

    • @RobertMilesAI
      @RobertMilesAI  2 роки тому +2

      I don't think it's impossible, the space of possible minds is deep and wide, and there exist many that do the right thing. There's no inherent reason we couldn't find one of them, but there are exponentially more that do the wrong thing, so we do need a method that gives us strong assurances. We're not definitely doomed, we're only probably doomed

    • @AtomicShrimp
      @AtomicShrimp 2 роки тому +1

      @@RobertMilesAI We just need to be rabbits inventing Superman, instead...
      I suppose the next question here is, how likely it is that we may think we have absolutely solved it, and just be wrong enough that we really haven't - probably doomed by not only the odds, but by our own (mis)alignment problem.

    • @AtomicShrimp
      @AtomicShrimp 2 роки тому +1

      @@RobertMilesAI Also, thrilled that you answered me!

    • @charaicommenternotalt
      @charaicommenternotalt 10 місяців тому

      I think curiosity, extremely complex environments and multi-task learning will help

  • @MrWendal
    @MrWendal 3 роки тому +1

    This video was interesting and clear, thanks. Being honest, most of your videos are a bit too hard / dense with terminology for me to get through. But because of the clear examples in this one, I really liked it. Thanks!