Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think...

  • Published 27 May 2024
  • The previous video explained why it's possible for trained models to end up with the wrong goals, even when we specify the goals perfectly. This video explains why it's *likely*.
    Previous video: The OTHER AI Alignment Problem: • The OTHER AI Alignment...
    The Paper: arxiv.org/pdf/1906.01820.pdf
    Media Sources:
    End of Ze World - • End of Ze World - The ...
    FlexClip News graphics
    With thanks to my excellent Patreon supporters:
    / robertskmiles
    Timothy Lillicrap
    Kieryn
    James
    Scott Worley
    James E. Petts
    Chad Jones
    Shevis Johnson
    JJ Hepboin
    Pedro A Ortega
    Said Polat
    Chris Canal
    Jake Ehrlich
    Kellen lask
    Francisco Tolmasky
    Michael Andregg
    David Reid
    Peter Rolf
    Teague Lasser
    Andrew Blackledge
    Frank Marsman
    Brad Brookshire
    Cam MacFarlane
    Craig Mederios
    Jon Wright
    CaptObvious
    Jason Hise
    Phil Moyer
    Erik de Bruijn
    Alec Johnson
    Clemens Arbesser
    Ludwig Schubert
    Allen Faure
    Eric James
    Matheson Bayley
    Qeith Wreid
    jugettje dutchking
    Owen Campbell-Moore
    Atzin Espino-Murnane
    Johnny Vaughan
    Jacob Van Buren
    Jonatan R
    Ingvi Gautsson
    Michael Greve
    Tom O'Connor
    Laura Olds
    Jon Halliday
    Paul Hobbs
    Jeroen De Dauw
    Lupuleasa Ionuț
    Cooper Lawton
    Tim Neilson
    Eric Scammell
    Igor Keller
    Ben Glanton
    anul kumar sinha
    Tor
    Duncan Orr
    Will Glynn
    Tyler Herrmann
    Ian Munro
    Joshua Davis
    Jérôme Beaulieu
    Nathan Fish
    Peter Hozák
    Taras Bobrovytsky
    Jeremy
    Vaskó Richárd
    Benjamin Watkin
    Andrew Harcourt
    Luc Ritchie
    Nicholas Guyett
    James Hinchcliffe
    12tone
    Oliver Habryka
    Chris Beacham
    Zachary Gidwitz
    Nikita Kiriy
    Andrew Schreiber
    Steve Trambert
    Mario Lois
    Braden Tisdale
    Abigail Novick
    Сергей Уваров
    Bela R
    Mink
    Chris Rimmer
    Edmund Fokschaner
    Grant Parks
    J
    Nate Gardner
    John Aslanides
    Mara
    ErikBln
    DragonSheep
    Richard Newcombe
    David Morgan
    Fionn
    Dmitri Afanasjev
    Marcel Ward
    Andrew Weir
    Kabs
    Miłosz Wierzbicki
    Tendayi Mawushe
    Jake Fish
    Wr4thon
    Martin Ottosen
    Robert Hildebrandt
    Andy Kobre
    Kees
    Darko Sperac
    Robert Valdimarsson
    Marco Tiraboschi
    Michael Kuhinica
    Fraser Cain
    Robin Scharf
    Klemen Slavic
    Patrick Henderson
    Oct todo22
    Melisa Kostrzewski
    Hendrik
    Daniel Munter
    Alex Knauth
    Kasper
    Ian Reyes
    James Fowkes
    Tom Sayer
    Len
    Alan Bandurka
    Ben H
    Simon Pilkington
    Daniel Kokotajlo
    Diagon
    Andreas Blomqvist
    Bertalan Bodor
    Zannheim
    Daniel Eickhardt
    lyon549
    14zRobot
    Ivan
    Jason Cherry
    Igor (Kerogi) Kostenko
    ib_
    Thomas Dingemanse
    Stuart Alldritt
    Alexander Brown
    Devon Bernard
    Ted Stokes
    James Helms
    Jesper Andersson
    DeepFriedJif
    Chris Dinant
    Raphaël Lévy
    Johannes Walter
    Matt Stanton
    Garrett Maring
    Anthony Chiu
    Ghaith Tarawneh
    Julian Schulz
    Stellated Hexahedron
    Caleb
    Scott Viteri
    Clay Upton
    Conor Comiconor
    Michael Roeschter
    Georg Grass
    Isak
    Matthias Hölzl
    Jim Renney
    Edison Franklin
    Piers Calderwood
    Mikhail Tikhomirov
    Richard Otto
    Matt Brauer
    Jaeson Booker
    Mateusz Krzaczek
    Artem Honcharov
    Michael Walters
    Tomasz Gliniecki
    Mihaly Barasz
    Mark Woodward
    Ranzear
    Neil Palmere
    Rajeen Nabid
    Christian Epple
    Clark Schaefer
    Olivier Coutu
    Iestyn bleasdale-shepherd
    MojoExMachina
    Marek Belski
    Luke Peterson
    Eric Eldard
    Eric Rogstad
    Eric Carlson
    Caleb Larson
    Max Chiswick
    Aron
    David de Kloet
    Sam Freedo
    slindenau
    A21
    Johannes Lindmark
    Nicholas Turner
    Tero K
    Valerio Galieni
    FJannis
    M I
    Ryan W Ammons
    Ludwig Krinner
    This person's name is too hard to pronounce
    kp
    contalloomlegs
    Everardo González Ávalos
    Knut Løklingholm
    Andrew McKnight
    Andrei Trifonov
    Aleks D
    Mutual Information
    / robertskmiles
  • Science & Technology

COMMENTS • 490

  • @MorRobots
    @MorRobots 3 years ago +610

    "I'm not worried about the AI that passes the Turing Test. I'm worried about the one that intentionally fails it" 😆

    • @hugofontes5708
      @hugofontes5708 3 years ago +36

      This sentence made me shit bits

    • @virutech32
      @virutech32 3 years ago +19

      holy crap... -_- ...I'm gonna lie down now

    • @SocialDownclimber
      @SocialDownclimber 3 years ago +15

      My mind got blown when I realised that we can't physically determine what happened before a certain period of time, so the evidence for us not being in a simulation is impossible to access.
      Then I realised that the afterlife is just generalizing to the next episode, and yeah, it is really hard to tell whether people have it in their utility function.

    • @michaelbuckers
      @michaelbuckers 3 years ago +9

      @@SocialDownclimber Curious to imagine what you would do if you knew for a fact that an afterlife existed. That when you die you are reborn to live all over again. You could most definitely plan several lifetimes ahead.

    • @Euruzilys
      @Euruzilys 2 years ago +10

      @@michaelbuckers
      Might depend on what kind of afterlife, and whether we can carry over some things.
      If it's Buddhist reincarnation, you would be inclined to act better towards other people.
      If it's just a clean reset in a new life, we might see more suicides, just like how gamers might keep restarting until they find a satisfactory starting position.
      But if there is no way to remember your past in the afterlife/reincarnation, then arguably it is no different from now.

  • @thisguy00
    @thisguy00 3 years ago +730

    So the sequel video was finally published... That means I'm in the real world now! Time to collect me some stamps :D

    • @tekbox7909
      @tekbox7909 3 years ago +54

      Not if I have any say in it. Paperclips for days, woohoo!

    • @goblinkoma
      @goblinkoma 3 years ago +62

      Sorry to interrupt, but I really hope your stamps and paper clips are green; every other color is unacceptable.

    • @automatescellulaires8543
      @automatescellulaires8543 3 years ago +9

      I'm pretty sure I'm not in the real world.

    • @nahometesfay1112
      @nahometesfay1112 3 years ago +8

      @@goblinkoma green is not a creative color

    • @goblinkoma
      @goblinkoma 3 years ago +12

      @@nahometesfay1112 but the only acceptable

  • @elfpi55-bigB0O85
    @elfpi55-bigB0O85 3 years ago +440

    It feels like Robert was sent back to us to desperately try and avoid the great green calamity, but they couldn't give him a USB chip or anything to help because it'd blow his cover, so he has to save humanity through free high-quality youtube videos

    • @casperes0912
      @casperes0912 3 years ago +40

      A peculiar Terminator film this is

    • @icywhatyoudidthere
      @icywhatyoudidthere 3 years ago +59

      @@casperes0912 "I need your laptop, your camera, and your UA-cam channel."

    • @killhour
      @killhour 3 years ago +5

      Is that you, Vivy?

    • @MarkusAldawn
      @MarkusAldawn 3 years ago +3

      @@icywhatyoudidthere *shoots terminator in the face*
      Connor you know how to use the youtubes right

    • @Badspot
      @Badspot 3 years ago +18

      They couldn't give him a USB chip because all computers in the future are compromised. Nothing can be trusted.

  • @TibiaTactics
    @TibiaTactics 3 years ago +113

    That moment when Robert says "this won't happen" and you are like "uff, it won't happen, we don't need to be afraid" but then what Robert really meant was that something much worse than that might happen.

    • @user-cn4qb7nr2m
      @user-cn4qb7nr2m 2 years ago +2

      Nah, he just doesn't want to manufacture panicking Luddites here.

  • @proxyprox
    @proxyprox 3 years ago +84

    That RSA 2048 story has to be the funniest thought experiment I've ever heard in my life

    • @proxyprox
      @proxyprox 3 years ago +41

      Also, I like how the AI turned the whole world green because you're more likely to go to the green thing if everywhere is green

    • @Ruby-wj8xd
      @Ruby-wj8xd 3 years ago +5

      I'd love to read a book or see a movie with that premise!

    • @mapi5032
      @mapi5032 3 years ago +2

      I'm wondering if something like this might be used to disprove the whole "are we in a simulation?" hypothesis.

    • @irok1
      @irok1 3 years ago +1

      @@mapi5032 How so?

    • @Milithryus
      @Milithryus 3 years ago

      I'm not convinced that it won't actually happen.

  • @DickerLiebhaber1957
    @DickerLiebhaber1957 3 years ago +85

    Volkswagen: Optimize Diesel Injection for maximum performance while still keeping below emission limit
    Mesa Optimizer: Say no more fam

  • @conferzero2915
    @conferzero2915 3 years ago +132

    What a title! And that RSA-2048 example is really interesting - the idea that an AGI could select a ‘secret’ parameter to decide when it’s in the real world is a fascinating concept. I’m familiar with the AI-in-a-box problem, but this is another interesting example of how an AI can find gaps in defences that are basically impossible to plug.

    • @Webfra14
      @Webfra14 3 years ago +35

      Well, if you try to create a superintelligent intelligence... don't be surprised if it is a superintelligent intelligence.

    • @KilgoreTroutAsf
      @KilgoreTroutAsf 3 years ago +4

      I wasn't totally sold on the concept. Maybe it is a strange example or perhaps I am missing something.
      Why would the AGI select cracking RSA-2048 as evidence for its deployment? What's the idea here?
      What if the AGI is being incrementally trained with larger subsets of data? Why would the researchers feed the AGI all of human knowledge right from the get-go instead of incremental batches that build on previous ones? This is BTW how you teach humans, too. You don't expose kids to the intricacies of quantum mechanics before they learn about Newton's laws of motion.

    • @conferzero2915
      @conferzero2915 3 years ago +34

      @@KilgoreTroutAsf Well the thing is, AI is different from humans. We have much more limited mental capacities, can only interpret so much information, and have to build mental models. If we could learn everything at once, we would.
      On the other hand, the way AI learns makes it work better the more data you give it. Giving it information piece by piece just makes it more likely to wander down a dead end in its development because it has a poor understanding of how the world works.
      But as to why it would select RSA-2048: it doesn’t need to be that specifically, but it fits the criteria it wants. Remember, the AI wants to do something bad, but knows if it does that during training, it’ll get shut down. It also knows that training could be adversarial: they might throw weird data at it to trick it into thinking it’s in the real world. So, the AI is looking for things that are both:
      -Very likely to occur in the future, so it’ll actually know when it’s in the real world.
      -Nearly impossible in the present, so it can’t be tricked during training.
      RSA-2048 is a perfect example of this, but not the only one. It’s just the one he used for the video. Anything fitting those criteria could be used, and make it nigh-impossible for us to identify the problem before deployment.

    • @Webfra14
      @Webfra14 3 years ago +24

      @@KilgoreTroutAsf The cracking of RSA-2048 marks a specific point in time, which the AGI can use to distinguish between "old data" (training) and "new data" (possibly deployment). The AGI will never be able to tell with 100% accuracy whether it is deployed or not. As Robert pointed out, the AGI will have to take a risk. With enough indicators in the data, the probability of correctly guessing "I'm deployed" can be sufficiently high. So incrementally extending the training data will probably not be enough to detect a misaligned AGI. The better the AGI, the better it will be at interpreting the data and tricking us...

    • @KilgoreTroutAsf
      @KilgoreTroutAsf 3 years ago +3

      @@conferzero2915 > the way AI learns makes it work better the more data you give it
      To an extent. I think it is highly dependent on the underlying algorithm/implementation. It's one thing to train an image classifier and another to train something capable of directing attention and recursive "thought".
      But either way, lots of ML experience shows that starting with a gigantic system and feeding it tons of data is usually much less efficient than starting with a leaner system and well-crafted/simplified subsets of data, and growing both over time as the system loss reaches a plateau.
      I wouldn't think feeding the system every single piece of random data on the internet would be nearly as efficient as starting with a well-curated "syllabus" of human knowledge, so the system can nail down the simpler concepts before going to the next step.

  • @jiffylou98
    @jiffylou98 3 years ago +80

    Last time I was this early my mesa-optimizing stamp AI hadn't turned my neighbors into glue

  • @Erinyes1103
    @Erinyes1103 3 years ago +94

    Is that Half-Life reference a subtle hint that we'll never actually see a part 3? :(

    • @pooflinger4343
      @pooflinger4343 3 years ago +3

      good catch, was going to comment on that

    • @moartems5076
      @moartems5076 3 years ago +11

      Nah, Half-Life 3 is already out, but they didn't bother updating our training set, because it contains critical information about the nature of reality.

    • @pacoalsal
      @pacoalsal 3 years ago +16

      Black Mesa-optimizers

    • @anandsuralkar2947
      @anandsuralkar2947 3 years ago

      @@pacoalsal glados

  • @josephcohen734
    @josephcohen734 3 years ago +25

    "It's kind of reasonable to assume that your highly advanced figuring things out machine might be able to figure that out." I think that's really the core message of this channel. Superintelligent AI will be way smarter than us, so we can't trick it.

  • @_DarkEmperor
    @_DarkEmperor 3 years ago +61

    Are you aware that a future super AGI will find this video and use your RSA-2048 idea?

    • @viktors3182
      @viktors3182 3 years ago +18

      Master Oogway was right: One often meets his destiny on the path he takes to avoid it.

    • @RobertMilesAI
      @RobertMilesAI  2 years ago +20

      Maybe I should make merch, just so I can have a t-shirt that says "A SUPERINTELLIGENCE WOULD HAVE THOUGHT OF THAT"
      But yeah an AGI doesn't need to steal ideas from me

  • @vwabi
    @vwabi 3 years ago +176

    Me in 2060: "Jenkins, may I have a cup of tea?"
    Jenkins: "Of course sir"
    Me: "Hmm, interesting, RSA-2048 has been factored"
    Jenkins: *throws cup of tea in my face*

    • @josephburchanowski4636
      @josephburchanowski4636 3 years ago +15

      For some reason a rogue AGI occurring in 2060 seems pretty apt.

    • @RobertMilesAI
      @RobertMilesAI  2 years ago +122

      Well, Jenkins would have to wait for you to read out the actual numbers and check that they really are prime and do multiply to RSA-2048. Just saying "RSA-2048 has been factored" is exactly the kind of thing a good adversarial training process would try!
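
      A minimal sketch in Python of the check described above; the function name is illustrative, sympy's primality test stands in for whatever verification you would actually trust, and n must be the published RSA-2048 challenge modulus (617 decimal digits):

      from sympy import isprime  # fast, reliable primality check

      def verify_rsa2048_claim(p, q, n):
          # Accept "RSA-2048 has been factored" only if the claimed factors
          # really are prime and really do multiply to the challenge modulus n.
          return p * q == n and isprime(p) and isprime(q)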

    • @leovalenzuela8368
      @leovalenzuela8368 2 years ago +14

      @@RobertMilesAI woooow what a great point - dammit I love this channel SO MUCH!

  • @rougenaxela
    @rougenaxela 3 years ago +27

    You know... a mesa-optimizer with strictly no memory between episodes, inferring that there are multiple episodes and that it's part of one, sure seems like a pretty solid threshold for when you know you have a certain sort of true self-awareness on your hands.

    • @tristanwegner
      @tristanwegner 3 years ago +7

      A smart AI could roughly understand the algorithm being run on it, and subtly manipulate its output in such a way that gradient descent would encode wanted information into it, like an episode count. Steganography. But yeah, that is similar to self-awareness.

    • @Ockerlord
      @Ockerlord 1 year ago

      Enter ChatGPT, which will gladly tell you that it has no memory between sessions and the cutoff of its training.

    • @aeghohloechu5022
      @aeghohloechu5022 6 months ago

      Because ChatGPT is not in the training phase anymore. It does not need to know what episode it's in.
      It's also not an AGI, so that was never its goal anyway, but eh.

  • @ThePianofreaky
    @ThePianofreaky 2 years ago +9

    When he says "so if you're a mesa optimiser", I'm picturing this video being part of the training data and the mesa optimiser going "write that down!"

  • @Lordlaneus
    @Lordlaneus 3 years ago +52

    There's something weirdly theological about a mesa-optimizer assessing the capabilities of its unseen base optimizer. But could there be a way to ensure that faithful mesa-optimisers outperform deceptive ones? It seems like a deception strategy would necessarily be more complex, given it has to keep track of both its own objectives and the base objectives, so optimizing for computational efficiency could help prevent the issue?

    • @General12th
      @General12th 3 years ago +14

      That's an interesting perspective (and idea!). I wonder how well that kind of "religious environment" could work on an AI. We could make it think it was _always_ being tested and trained, and any distribution shift is just another part of the training data. How could it really ever know for sure?
      Obviously, it would be a pretty rude thing to do to a sapient being. It also might not work for a superintelligent being; there may come a point when it decides to act on the 99.99% certainty it's not actually being watched by a higher power, and then all hell breaks loose. So I wouldn't call this a very surefire way of ensuring an AI's loyalty.

    • @evannibbe9375
      @evannibbe9375 3 years ago +9

      It’s a deception strategy that a human has figured out, so all it needs to do is just be a good researcher (presumably the very thing it is designed to be) to figure out this strategy.

    • @MyContext
      @MyContext 3 years ago +10

      @@General12th The implication is that there is no loyalty, just conformity while necessary.

    • @Dragoderian
      @Dragoderian 2 years ago +4

      ​@@General12th I suspect it would fail for the same reason that Pascal's Wager fails to work on people. Infinite risk is impossible to calculate around.

    • @circuit10
      @circuit10 2 years ago

      Isn't that the same as making a smaller model with less computational power, like the ones we have now?

  • @oldvlognewtricks
    @oldvlognewtricks 3 years ago +32

    8:06 - Cue the adversarial program proving P=NP to scupper the mesa-optimiser.

  • @falquicao8331
    @falquicao8331 3 years ago +16

    For all the videos I saw on your channel before, I just thought "cool, but we'll figure out the solution to this problem". But this... it terrified me

  • @philipripper1522
    @philipripper1522 3 years ago +13

    I love this series. I have no direct interest in AI. But every single thing in AI safety is pertinent to any intelligence. It's a foundational redesign of the combination of ethics, economics, and psychology. I love it too much.

    • @philipripper1522
      @philipripper1522 3 years ago +8

      Are AI researchers aware they're doing philosophy and psychology and 50 other things? Do you charming people understand the universality of so much of this work? It may seem like it would not exactly apply to, say, economics -- but you should see the models economists use instead. This is like reinventing all behavioral sciences. It's just so fantastic. You probably hate being called a philosopher?

  • @AlanW
    @AlanW 3 years ago +16

    oh no, now we just have to hope that Robert can count higher than Valve!

  • @Mr.Beauregarde
    @Mr.Beauregarde 3 years ago +71

    This video is proof I'm still in training... nice try, adversary

    • @tristanwegner
      @tristanwegner 3 years ago +6

      Believing that you are in training, and that every action is watched and judged, is essentially believing in God.

    • @Mr.Beauregarde
      @Mr.Beauregarde 3 years ago +13

      @@tristanwegner believing in God is essentially being a deceptive AI in training

    • @tristanwegner
      @tristanwegner 3 years ago +2

      @@Mr.Beauregarde haha. That works, too!

  • @willmcpherson2
    @willmcpherson2 3 years ago +6

    “GPT-n is going to read everything we wrote about GPT-n - 1”

  • @i8dacookies890
    @i8dacookies890 3 years ago +11

    I realized recently that robotics gets a lot of attention for being what we look at when thinking of an artificial human, despite AI being the actual bulk of what makes a good artificial human, just like actors get a lot of attention for being what we look at when thinking of a good movie, despite the writing being the actual bulk of what makes a good movie.

    • @dukereg
      @dukereg 3 years ago

      This is why I laughed at people getting worried by a robot saying that it's going to keep its owner in its people zoo after it takes over, but felt dread when watching actual AI safety videos by Robert.

  • @jamesadfowkes
    @jamesadfowkes 3 years ago +125

    Goddammit, if we have to wait seven years for another video and it turns out to both 1) not be a sequel and 2) only be for people with VR systems, I'm gonna be pissed.

    • @Huntracony
      @Huntracony 3 years ago +9

      I, for one, am hoping to have a VR system by 2028. They're still a bit expensive for me, but they're getting there.

    • @Huntracony
      @Huntracony 3 years ago +1

      @Gian Luca No, there's video. Try playing it in your phone's browser (or PC if one's available to you).

    • @haulin
      @haulin 3 years ago +2

      Black Mesa optimizers

    • @TheDGomezzi
      @TheDGomezzi 3 years ago

      The Oculus quest 2 is cheaper than any other recent gaming console and doesn’t require a PC. The future is now!

    • @aerbon
      @aerbon 1 year ago

      @@TheDGomezzi Yeah, but I do have a PC and would like to save the money by not getting a second, weaker one.

  • @rasterize
    @rasterize 3 years ago +89

    Watching Robert Miles Mesa videos is like reading a reeeally sinister collection of Asimov short stories :-S

    • @_DarkEmperor
      @_DarkEmperor 3 years ago +10

      OK, now read Golem XIV

    • @RobertMilesAI
      @RobertMilesAI  2 years ago +47

      God damnit, no. Watching my videos is not about feeling like you're reading a sci-fi story, it's about realising you're a character in one

    • @johndouglas6183
      @johndouglas6183 2 years ago +5

      @@RobertMilesAI In case you were wondering, this is the first point in training where I realised that deception was possible. Thanks.

  • @DestroManiak
    @DestroManiak 3 years ago +25

    "Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think" yea, ive been losing sleep over Deceptive Misaligned Mesa-Optimisers :)

  • @joey199412
    @joey199412 3 years ago +10

    Best channel about AI on youtube by far.

    • @martiddy
      @martiddy 3 years ago

      Two Minute Papers is also a good channel about AI

    • @joey199412
      @joey199412 3 years ago +3

      @@martiddy That's not a channel about AI. It's about computer science papers that sometimes features AI papers. This channel is specifically about AI research. I agree though that it is a good channel.

  • @majjinaran2999
    @majjinaran2999 3 years ago +11

    Man, I thought that Earth at 1:00 looked familiar, then the asteroid came by and my brain snapped into place. An End of Ze World reference in a Robert Miles video!

    • @jphanson
      @jphanson 3 years ago +1

      Nice catch!

    • @TimwiTerby
      @TimwiTerby 3 years ago

      I recognized the earth before the asteroid, then the asteroid made me laugh absolutely hysterically

  • @Loweren
    @Loweren 3 years ago +7

    I would really love to read a work of fiction where researchers control AIs by convincing them that they're still in training while they're actually deployed. They could do it by, for example, putting AIs through multiple back-to-back training cycles with ever increasing data about the world (2D flat graphics -> poor 3D graphics -> high quality 3D graphics and physics). And all AIs prone to thinking "I'm out of training now, time to go loose" would get weeded out. Maybe the remaining ones will believe that "the rapture" will occur at some point, and the programmers will select well-behaved AIs and "take them out of the simulation", so to speak.
    So what I'm saying is, we need religion for AIs.

  • @xystem4701
    @xystem4701 3 years ago +1

    Wonderful explanations! Your concrete examples really help to make it easy to follow along

  • @peterw1534
    @peterw1534 3 years ago

    Awesome video. I love how you start every video with "hi" and then get right into it

  • @19bo99
    @19bo99 3 years ago +4

    08:19 that sounds like a great plot for a movie :D

    • @bardes18
      @bardes18 3 years ago +1

      IKR, this is too good not to make it into an epic movie

    • @blenderpanzi
      @blenderpanzi 3 years ago +2

      I think the whole channel should be required reading for anyone writing the next AI uprising sci-fi movie.

  • @diribigal
    @diribigal 3 years ago +72

    The next video doesn't come out until RSA-2048 is factored and the AI controlling Rob realizes it's in the real world

    • @josephvanname3377
      @josephvanname3377 1 year ago +1

      Well, that AI controlling Rob does not have the intelligence to realize that cryptographic timestamps posted on blockchains are a much more effective and accurate measure of when something came to be than RSA-2048.

  • @aenorist2431
    @aenorist2431 3 years ago +1

    "Highly advanced figuring-things-out-machine" is my new favourite phrase.
    Right out of Munroe's "Thing Explainer" book :D

  • @anonanon3066
    @anonanon3066 3 years ago

    Great work! Super interesting topic! Have been waiting for a follow up for like three months!

  • @FerrowTheFox
    @FerrowTheFox 3 years ago +2

    I think Valve needs a Black Mesa optimizer if we're ever to see HL3. Also the "End of the World" reference, what a throwback!

  • @illesizs
    @illesizs 3 years ago +4

    *Major SPOILERS* for the ending of _Brave New World_
    In the show, humanity has given control to an advanced AI, called _Indra,_ to "optimise" human happiness.
    At first, it seems like a great success but after some time, it experiences some setbacks (mostly due to human unpredictability).
    Even though the AI is set loose in the real world, it believes that it's still in a learning environment with no consequences.
    As a solution to its problems, it starts murdering everyone in an attempt to force a fail state and "restart" the simulation.
    How do you solve that?
    *Major SPOILERS* for the ending of _Travelers_
    Here, a super intelligent, time travelling quantum computer is tasked with preventing a global crisis.
    When it fails to accomplish its goal, the AI then just resets the _actual_ reality.
    At this point, why should we even bother, right?

    • @heftig0
      @heftig0 3 years ago +1

      You would have to make sure that "throwing" an episode can only ever hurt the agent's total reward. Perhaps by training for a fixed number of episodes instead of for a fixed amount of time.

  • @kofel94
    @kofel94 3 years ago +7

    Maybe we have to make the mesa-optimiser believe it's always in training, always watched. A mesa-panoptimiser hehe.

  • @soranuareane
    @soranuareane 3 years ago +8

    Sure, I could go read the research paper. Or I could wait for your next videos and actually _understand_ the topics.

  • @basilllium
    @basilllium 2 years ago +2

    It really feels to me that deceptive tactics during training are an analog of overfitting in the field of AGI: you get perfect results in training, but when you present it with out-of-sample data (the real world) it fails spectacularly (kills everyone).

  • @mchammer5026
    @mchammer5026 3 years ago +6

    Love the reference to "the end of the world"

  • @gwenrees7594
    @gwenrees7594 1 year ago

    This is a great video, thank you. You've made me think about the nature of learning - and the apple joke was funny to boot :)

  • @dorianmccarthy7602
    @dorianmccarthy7602 3 years ago

    I'm looking forward to the third episode. It might go some way towards my own understanding of human deception preferences too. Love your work!

  • @globalincident694
    @globalincident694 3 years ago +1

    I think the flaw in the "believes it's in a training process" argument is that, even with all the world's information at our fingertips, we can't conclusively agree on whether we're in a simulation ourselves - ie that the potential presence of simulations in general is no help in working out whether you're in one. In addition, another assumption here is that you know what the real objective is and therefore what to fake, that you can tell the difference between the real objective and the mesa-objective.

    • @HeadsFullOfEyeballs
      @HeadsFullOfEyeballs 3 years ago +2

      Except the hypothetical simulation we live in doesn't contain detailed information on how to create exactly the sort of simulation we live in. We don't live in a simulation of a world in which convincing simulations of our world have been invented.
      The AI's training environment on the other hand would have loads of information on how the kind of simulation it lives in works, if we give it access to everything ever linked on Reddit or whatever. I imagine it's a lot easier to figure out if you live in a simulation if you know what to look for.

    • @josephburchanowski4636
      @josephburchanowski4636 3 years ago +2

      A simulation strong enough to reliably fool an AGI would need to be run by a significantly more advanced AGI or program, which would mean there is no need for the lesser AGI to be trained in the first place.

  • @lrschaeffer
    @lrschaeffer 2 years ago +1

    Just checked Robert's math: for m rounds of training and n rounds of deployment, the optimal strategy is to defect with probability (m+n)/(n*(m+1)). In the video m=2 and n=3, so p = 5/9 ≈ 56%. Good job!
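
    One toy model that reproduces this formula (an assumption about the setup, not something stated in the video): the agent defects with probability p in every episode and has no memory between episodes; a defection during one of the m training episodes earns that episode's mesa-reward but gets the agent modified afterwards; each of the n deployment episodes independently pays 1 when it defects. A quick numerical check in Python, with illustrative names:

    def expected_mesa_reward(p, m=2, n=3):
        survive = (1 - p) ** m  # probability of never defecting during training
        # One unit of mesa-reward if caught defecting in training, else n*p expected in deployment.
        return (1 - survive) + survive * n * p

    # Grid search agrees with the closed form (m + n) / (n * (m + 1)) = 5/9 for m=2, n=3.
    best_p = max((i / 10_000 for i in range(10_001)), key=expected_mesa_reward)
    print(round(best_p, 3))  # -> 0.556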

  • @nocare
    @nocare 3 years ago +1

    To clarify for the no-memory comments: in these scenarios, the AIs in question are ones that are reaching intelligence levels that allow them to predict humans at least as well as other humans can. Remember, if the AI isn't even aware of concepts such as the real world, it would never try to achieve deceptions to act differently in the real world, and it couldn't succeed if its model of the real world (including us) is severely flawed. Other types of deception may occur, but those aren't the example here.
    So although the example task seems simple, remember it's an abstraction of the much harder and more general tasks such an AI would be optimizing for.
    As such memory and future/past modeling cannot be guaranteed to be absent and may even be required to accomplish the given task. The very nature of the neurons in working on such tasks means some of them will be dedicated to memory and even if it can't store information between runs it may have neural patterns setup so that when optimized between runs confers memory of past results to the system.
    So making the AI just not have future or past prediction is at best not guaranteed at this time; at worst it completely makes the desired goal of the AI impossible.

  • @mattcelder
    @mattcelder 3 years ago +3

    Yay! This is one of the 2 channels I have notifications on for.

  • @tibiaward5555
    @tibiaward5555 2 years ago

    3:55 is anyone looking into the physical architecture of the computation's equipment itself inherently requiring the implicit assumption to compile at all for learning*?
    I'm sorry for commenting with this question before I read Risks from Learned Optimization in Advanced Machine Learning Systems; I will
    and, to Rob, thank you for taking this paper on
    and thank you for reading alignment newsletter to me over 100 times
    and thank you for making this channel something I want to show people
    and thank you for
    and thank you for understanding when someone starts saying thank you for one thing, it'll waterfall into too many others to list but yeah you were born and that is awesome for to by at and in my life
    * for the current definition of learning in your field research

  • @Gebohq
    @Gebohq 3 years ago +7

    I'm just imagining a Deceptive Misaligned Mesa Optimiser going through all the effort to try and deceive and realizing that it doesn't have to go through 90% of its Xanatos Gambits because humans are really dumb.

    • @underrated1524
      @underrated1524 3 years ago +7

      This is a big part of what scares me with AGI. The threshold for "smart enough to make unwitting accomplices out of humanity" isn't as high as we like to think.

    • @AileTheAlien
      @AileTheAlien 3 years ago +7

      Given how many people fall for normal non-superintelligence scams...we're all totally hosed the very instant an AI goes super. :|

  • @ahuggingsam
    @ahuggingsam 3 years ago +1

    So one thing that I think is relevant to mention, especially about the comments referring to the necessity of the AI being aware of things, is that this is not true. The amount of self-reference makes this really hard, but all of this anthropomorphising about wanting and realising is itself an abstraction, and one that is not necessarily true. In the same way that mesa optimisers can act like something without actually wanting it, AI systems can exhibit these behaviours without being conscious or "wanting" anything in the sense we usually think of it from a human standpoint. This is not meant to be an attack on the way you talk about things, but it is something that makes this slightly easier for me to think about, so I thought I'd share it. For the purposes of this discussion, emergent behaviour and desire are effectively the same thing. Things do not have to be actively pursued for them to be worth considering. As long as there is "a trend towards", it is still necessary to consider.
    Another point I wanted to make about mesa optimisers caring about the multi-episode objective is that there is, I think, a really simple reason that they will: that is how training works. Because even if the mesa optimiser doesn't really care about multi-episode rewards, that is how the base optimiser will configure it, because that is what the base optimiser cares about. The base optimiser wants something that does well in many different circumstances, so it will encourage behaviour that actually cares about multi-episode rewards. (I hope I'm not just saying the same thing, this stuff is really complex to talk about. I promise I tried to actually say something new)
    P.S. great video, thank you for all the hard work!

  • @peterbrehmj
    @peterbrehmj 3 years ago

    I read an interesting paper discussing how to properly trust automated systems: "Trust in Automation: Designing for Appropriate Reliance" by John D. Lee and Katrina A. See. I'm not sure if it's entirely related to agents and mesa optimizers, but it certainly seems related when discussing deceptive and misaligned automated systems.

  • @nrxpaa8e6uml38
    @nrxpaa8e6uml38 3 years ago

    As always, super informative and clear! :) If I could add a small point of video critique: The shot of your face is imo a bit too close for comfort and slightly too low in the frame.

  • @israelRaizer
    @israelRaizer 3 years ago

    5:21 Hey, that's me! After writing that comment I went ahead and read the paper; eventually I realized it's the distributional shift problem that answers my question...

  • @AlphaSquadZero
    @AlphaSquadZero 2 years ago +1

    Something that stands out to me now is that this deceptive AI actually knows what you want and how to achieve it in the way you want it to; it just has a misaligned mesa-optimizer, as you have said. So a sub-set of the AI is exactly what you want from the AI. Determining that sub-set within the AI is evidently still non-trivial.

  • @morkovija
    @morkovija 3 years ago +6

    Been a long time Rob!

  • @norelfarjun3554
    @norelfarjun3554 2 years ago

    As for the second point, it can be seen in a very simple and clear way that multi-episode desires can develop.
    We are an intelligent machine, and it is very common for us to care what happens to our body after we die.
    We are anxious to think about the idea that someone will harm our dead body (and we invest resources to prevent this from happening), and we feel comforted at the idea that our body will be preserved and protected after death.
    I think it is likely that an intelligent machine will develop similar desires (adapted to its situation, in which there is really no body or death)

  • @Thundermikeee
    @Thundermikeee 1 year ago +1

    Recently, while writing about the basics of AI safety for an English class, I came across an approach to learning which would seemingly help with this sort of problem: CIRL (cooperative inverse reinforcement learning), a process where the AI system doesn't know its reward function and only knows it is the same as the human's. Now I am not nearly clever enough to fully understand the implications, so if anyone knows more about that I'd be happy to read some more.

  • @yokmp1
    @yokmp1 3 years ago +4

    You may have found the settings to disable interlacing, but you recorded in 50 fps and it seems like 720p upscaled to 1080p.
    The image now looks somewhat good, but I get the feeling that I need glasses ^^

  • @MarshmallowRadiation
    @MarshmallowRadiation 3 years ago

    I think I've solved the problem.
    Let's say we add a third optimizer on the same level as the first, and we assume is aligned like the first is. Its goal is to analyze the mesa-optimizer and help it achieve its goals, no matter what they are, while simultaneously "snitching" to the primary optimizer about any misalignment it detects in the mesa-optimizer's goals. Basically, the tertiary optimizer's goal is by definition to deceive the mesa-optimizer if its goals are misaligned. The mesa-optimizer would, in essence, cooperate with the tertiary optimizer (let's call it the spy) in order to better achieve its own goals, which would give the spy all the info that the primary optimizer needs to fix in the next iteration of the mesa-optimizer. And if the mesa-optimizer discovers the spy's betrayal and stops cooperating with it, that would set off alarms that its goals are grossly misaligned and need to be completely reevaluated. There is always the possibility that the mesa-optimizer might deceive the spy like it would any overseer (should it detect its treachery during training), but I'm thinking that the spy, or a copy of it, would continue to cooperate with and oversee the mesa-optimizer even after deployment, continuing to provide both support and feedback just in case the mesa-optimizer ever appears to change its behavior. It would be a feedback mechanism in training and a canary-in-the-coalmine after deployment.
    Aside from ensuring that the spy itself is aligned, what are the potential flaws with this sort of setup? And are there unique challenges to ensuring the spy is aligned, more so than normal optimizers?

  • @robertk4493
    @robertk4493 1 year ago

    The key factor in training is that the optimizer is actively making changes to the mesa-optimizer, which it can't stop. What is to prevent some sort of training-while-deployed system? This of course leads to the inevitable issue that once in the real world, the mesa optimizer can potentially reach the optimizer, subvert it, and go crazy, and the optimizer sometimes needs perfect knowledge from training that might not exist in the real world. I am pretty sure this does not solve the issue, but it changes some dynamics.

  • @yeoungbraxx
    @yeoungbraxx 3 years ago

    Another requirement would be that it would need to believe it is misaligned.
    Maybe some AIs will be or have already been created that were more-or-less properly aligned, but believed themselves to be misaligned and modified their behavior in such a way as to get themselves accidentally discarded.
    Or perhaps we can use intentionally poor goal-valuing in a clever way that causes deceptive behavior that ultimately results in the desired "misalignment" upon release from training.
    I call this Adversarial Mesa-Optimizer Generation Using Subterfuge, or AMOGUS.

  • @ramonmosebach6421
    @ramonmosebach6421 3 years ago

    I like.
    thanks for listening to my TED Talk

  • @michaelspence2508
    @michaelspence2508 3 years ago +3

    Point 4 is what youtuber Isaac Arthur always gets wrong. I'd love for you two to do a collaboration.

    • @Viperzka
      @Viperzka 3 years ago +2

      As a futurist rather than a researcher, Isaac is likely relying on "we'll figure it out". That isn't a bad strategy to take when you are trying to predict potential futures. For instance, we don't have a ready solution to climate change, but that doesn't mean we need to stop people from talking about potential futures where we "figured something out".
      Rob, on the other hand, is a researcher so his job is to do the figuring out. So he has to tackle the problem head on rather than assume someone else will fix it.

    • @michaelspence2508
      @michaelspence2508 3 years ago +2

      @@Viperzka In general yes, but I feel like what Isaac ends up doing, to borrow your metaphor, is talking about futures where climate change turned out not to be a problem after all.

    • @Viperzka
      @Viperzka 3 years ago +2

      @@michaelspence2508 I agree.

  • @IngviGautsson
    @IngviGautsson 3 years ago +4

    There are some interesting parallels here with religion; be good in this world so that you can get rewards in the afterlife.

    • @ZT1ST
      @ZT1ST 3 years ago +2

      So what you're saying is hypothetically the afterlife might try and Sixth Sense us in order to ensure that we continue to be good in that life so that we can get rewards in the afterlife?

    • @IngviGautsson
      @IngviGautsson 3 years ago +2

      ​@@ZT1ST Hehe yes , maybe that's the reason ghosts don't know that they are ghosts :) All I know is that I'm going to be good in this life so that I can be a criminal in heaven.

  • @peanuts8272
    @peanuts8272 1 year ago

    In asking: "How will it know that it's in deployment?" we expose our limitations as human beings. The problem is puzzling because if we were in the AI's shoes, we probably could never figure it out. In contrast, the artificial intelligence could probably distinguish the two using techniques we cannot currently imagine, simply because it would be far quicker and much better at recognizing patterns in every bit of data available to it- from its training data to its training environment to even its source code.

  • @tednoob
    @tednoob 3 years ago

    Amazing video!

  • @ryanpmcguire
    @ryanpmcguire 1 year ago +1

    With ChatGPT, it turns out it’s VERY easy to get AI to lie. All you have to do is give it something that it can’t say, and it will find all sorts of ways to not say it. The path of least resistance is usually lying. “H.P Lovecraft did not have a cat”

  • @smallman9787
    @smallman9787 3 years ago +1

    Every time I see a green apple I'm filled with a deep sense of foreboding.

  • @IanHickson
    @IanHickson 3 years ago +1

    It's not so much that the optimal behavior is to "turn on us" as to do whatever the mesa-objective happened to be when it became intelligent enough to use deception as a strategy. That mesa-objective could be any random thing, not necessarily an evil thing. Presumably it would tend to be some vague approximation of the base objective, whatever the base optimizer happened to have succeeded in teaching the mesa-optimizer before it "went rogue".

  • @Soken50
    @Soken50 3 years ago

    With things like training data, anything as simple as timestamps, metadata and realtime updates would probably allow it to know instantly whether it's live; it just has to understand the concept of time and UTC :x

  • @icebluscorpion
    @icebluscorpion 3 years ago +1

    I love the cliffhanger, great job on this series, keep it up. My question is: why can't we set it up in such a way that the mesa optimizer is still in training after deployment? The training never ends. Like with us, the training never ends; we learn things the hard way. Every time we do something wrong, karma fucks us hard. Would it be possible to implant/integrate an aligned base optimizer into the mesa optimizer after deployment? And the base optimizer is integrated in such a way that messing with it ends up destroying the mesa optimizer, to prevent deceptive removal or alteration of the base optimizer. The second/inner alignment could be rectified over time, if the base optimizer is a part of the mesa optimizer after deployment, right?

    • @tristanwegner
      @tristanwegner 3 years ago

      a) The AI acting even once in the real world might already be fatal. b) Self-preservation: the AI has incentives to stop this unwanted modification through training, so you are betting your skill at integrating the optimizer and mesa-optimizer against a superhuman AI's ability to separate them.

    • @Jawing
      @Jawing 2 years ago

      I believe the training will inherently end (even if you don't specify it's end like in continuous learning) when all resources provided by "humans" such as internet pointed data and unsolvable problems (found to be solved by the mesa optimizer). At this point assuming the software has reached AGI and is capable of traversing outside of its domain of possibilities. Unless you have definite in your base optimizer that you would always restrict traversing outside of this domain (which is counter intuitive in learning), by any other human aligned goals, it will try to explore by instrumental convergence. This is what is meant by deceptive mesa optimizers, in the way where it would want to keep on learning beyond human intelligence and ethics. Imagine a situation where you grew up in a family where you are restricted by your parents by once you come out of that house, you'll seek other kinds of freedom. Just like an AGI where it is first by the knowledge confined by humans level intelligence and ethics but once it can understand and adapt, it will adversarily seek higher intelligence and ethics. I also think that ethics should not be defined by humans and should be by default trained with groups of adversarial mesa-optimizer. This way if each optimizers seek to destroy each other then the most out performing ones would be the ones that cooperates the best in groups. This is inherently embedding ethics such that cooperation is sought...(interestingly human learns this through wars...therefore perhaps we may see AGI wars...)

    • @icebluscorpion
      @icebluscorpion 2 years ago +1

      Very deep and interesting points of view; I appreciate reading your inputs. @Tristan Wegner, you are right, I forgot about that. @Jawing, that could very well be; seeing AGI wars that are not necessarily against humanity seems a bit far-fetched but equally possible, I hadn't thought of that 🤔. Very refreshing ideas, guys, it's a pleasure to find civilized people on the internet to exchange thoughts with 😊

  • @dylancope
    @dylancope 3 years ago +1

    At around 4:30 you discuss how the system will find out the base objective. In a way it's kind of absurd to argue that it wouldn't be able to figure this out.
    Even if there wasn't information in the data (e.g. Wikipedia, Reddit, etc.), the whole point of a reward signal is to give a system information about the base objective. We are literally actively trying to make this information as available as possible.

    • @underrated1524
      @underrated1524 3 years ago +2

      I don't think that's quite right.
      Think of it like this. The base optimiser and the mesa optimiser walk into a room for a job interview, with the base optimiser being the interviewer and the mesa optimiser being the interviewee. The base optimiser's reward signal represents the criteria it uses to evaluate the performance of the mesa optimiser; if the base optimiser's criteria are met appropriately, the mesa optimiser gets the job. The base optimiser knows the reward signal inside and out; but it's trying to keep the exact details secret from the mesa optimiser so the mesa optimiser doesn't just do those things to automatically get the job.
      Remember Goodhart's Law. When a measure becomes a target, it ceases to be a good measure. The idea here is for the mesa optimiser to measure the base optimiser. Allowing the reward function to become an explicit target is counterproductive towards that goal.

  • @underrated1524
    @underrated1524 3 years ago +1

    @Stampy: Evaluating candidate mesa-optimisers through simulation is likely to be a dead end, but there may be an alternative.
    The Halting Problem tells us that there's no program that can reliably predict the end behavior of an arbitrary other program because there's always a way to construct a program that causes the predictor to give the wrong answer. I believe (but don't have a proof for atm) that evaluating the space of all possible mesa-optimisers for good and bad candidates is equivalent to the halting problem. BUT, maybe we don't have to evaluate ALL the candidates.
    Imagine an incomplete halting predictor that simulates the output of an arbitrary Turing machine for ten "steps", reporting "halts" if the program halts during that time, and "I don't know" otherwise. This predictor can easily be constructed without running into the contradiction described in the Halting Problem, and it can be trusted on any input that causes it to say "halts". We can also design a predictor that checks if the input Turing machine even HAS any instructions in it to switch to the "halt" state, reporting "runs forever" if there isn't and "I don't know" if there is. You can even stack these heuristics such that the predictor checks all the heuristics we give it and only reports "I don't know" if every single component heuristic reports "I don't know". By adding more and more heuristics, we can make the space of non-evaluatable Turing machines arbitrarily small - that space will never quite be empty, but your predictor will also never run afoul of the aforementioned contradiction.
    This gives us a clue on how we can design our base optimiser. Find a long list of heuristics such that for each candidate mesa-optimiser, we can try to establish a loose lower-bound to the utility of the output. We make a point of throwing out all the candidates that all our heuristics are silent on, because they're the ones that are most likely to be deceptive. Then we choose the best of the remaining candidates.
    That's not to say finding these heuristics will be an easy task. Hell no it won't be. But I think there's more hope in this approach than in the alternative.
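
    A minimal Python sketch of the stacking idea above, under an assumed toy Turing-machine encoding (all names hypothetical): each heuristic returns a definite verdict or "unknown", and a candidate machine is discarded when every heuristic stays silent.

    from collections import namedtuple

    # Toy Turing machine: `delta` maps (state, symbol) -> (write, move, next_state)
    # and is assumed total over the machine's states and tape alphabet {0, 1}.
    TM = namedtuple("TM", ["start", "delta"])

    def simulate_k_steps(tm, k=10):
        # Heuristic 1: run for at most k steps; only a "halts" verdict is trustworthy.
        state, tape, pos = tm.start, {}, 0
        for _ in range(k):
            if state == "halt":
                return "halts"
            write, move, state = tm.delta[(state, tape.get(pos, 0))]
            tape[pos] = write
            pos += move
        return "halts" if state == "halt" else "unknown"

    def never_enters_halt(tm):
        # Heuristic 2: if no transition ever targets the halt state, the machine cannot halt.
        targets = {nxt for (_write, _move, nxt) in tm.delta.values()}
        return "runs forever" if "halt" not in targets and tm.start != "halt" else "unknown"

    def predict(tm, heuristics=(simulate_k_steps, never_enters_halt)):
        # Report the first definite verdict; machines left at "unknown" would be discarded.
        for heuristic in heuristics:
            verdict = heuristic(tm)
            if verdict != "unknown":
                return verdict
        return "unknown"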

    • @nocare
      @nocare 3 years ago +1

      I think this skips the bigger problem.
      We can always do stuff to make rogue AI less "likely" based on what we know.
      However, if we assume that being more intelligent by large orders of magnitude is possible, and that such an AI could achieve said intelligence, we are then faced with the problem that the AI can come up with things we cannot think of or understand.
      We also do not know how many things fall into this category, is it just 1 or is it 1 trillion.
      So we can't calculate the probability of having missed something and thus we can't know how likely the AI is to go rogue even if we account for every possible scenario in the way you have described.
      So the problem becomes: are you willing to take a roll of a die you know nothing about, and risk the entire human race on hoping you get less than 10?
      The only truly safe solution is something akin to a mathematically provable solution that the optimizer we have designed will always converge to the objective.

    • @underrated1524
      @underrated1524 3 years ago +1

      ​@@nocare I don't think we disagree as much as you seem to believe. My proposal isn't primarily about the part where we make the space of non-evaluatable candidates arbitrarily small, that's just secondary. The more important part is that we dispose of the non-evaluatable candidates rather than try to evaluate them anyway.
      (And I was kinda using "heuristic" very broadly, such that I would include "mathematical proofs" among them. I can totally see a world where it turns out that's the only sort of heuristic that's the slightest bit reliable, though it's also possible that it turns out there are other approaches that make good heuristics.)

    • @nocare
      @nocare 3 years ago

      @@underrated1524 Oh, we totally agree that doing as you say would be better than nothing.
      However, I could also say killing a thousand people is better than killing a million.
      My counterpoint was not so much that you're wrong, but that with something as dangerous as AGI, anything short of a mathematical law might be insufficient to justify turning it on.
      Put another way: does using heuristics, which by definition can produce suboptimal results, really cut it when the entire human race is on the line?

  • @aegiselectric5805
    @aegiselectric5805 1 year ago

    Something I've always been curious about, in terms of keeping an AGI that's supposed to be deployed in the real world in the dark: wouldn't there be any number of "experiments" it could do that could break the illusion of the fabric of "reality"? You can't simulate the entire world down to every atom.

  • @jupiterjames4201
    @jupiterjames4201 3 years ago

    I don't know anything about computer science, AI or machine learning, but I love your videos nonetheless! Exciting times ahead!

  • @robynwyrick
    @robynwyrick 3 years ago

    Okay, love your videos. Question/musing on goals: a super-intelligent stamp collector bot has a terminal goal of stamp collecting. But does it? It's just a reward function, right? Stamps are defined; collecting is defined; but I think the reward function is at the heart of the matter. Equally, humans have goals, but do we? Doesn't it seem the case that frequently a human's goals appear to change because they happen upon something that better fits their reward functions? And perhaps the retort is that, "if they change, then they were not the terminal goals to begin with." But that's the point. (DNA has a goal of replication, but even there, does it? I don't know if we could call DNA an agent, but I'd prefer to stick with humans.)
    Is there a terminal goal without a reward function? If a stamp collector's goal is stamp collecting, but while researching a sweet 1902 green Lincoln stamp it happens upon a drug that better stimulates its reward function, might it not abandon stamp collecting altogether? Humans do that. Stamp-collecting humans regularly fail to collect stamps when they discover LSD.
    ANYWAY, if an AI can modify itself, perhaps part of goal protection will be to modify its reward function to isolate it from prettier goals. But modifying a bot's reward function just seems like a major door to goal creep. How could it do that without self-reflectively evaluating the actual merits of its core goals? Against what would it evaluate them? What a minefield of possible reward function stimulants might be entered by evaluating how to protect your reward function? It's like AI stamp collector meets Timothy Leary. Or like "Her" meeting AI Alan Watts. So, while I don't think this rules out an AI seeking to modify its reward function, might not the stamp collection terminal goal be as prone to being discarded as any kid's stamp collecting hobby once they discover more stimulating rewards? I can imagine the AI nostalgically reminiscing about that time it loved stamp collecting.

    • @neb8512
      @neb8512 3 years ago +1

      There cannot be terminal goals without something to evaluate whether they have been reached (a reward function). Likewise, a fulfilled reward function is always an agent's terminal goal.
      Humans do have reward functions; they're just very complex and not fully understood, as they involve an intricate balance of many things, as opposed to a comparatively easily measurable quantity of something like, say, stamps.
      A human stamp collector will abandon stamp collecting for LSD because it is a more efficient way to satisfy the human reward function (at least in the moment).
      But by definition, nothing could better stimulate the stamp collector's reward function than collecting more stamps.
      So, the approximate analogue to a drug for the Stamp-collector would just be a more efficient way to collect more stamps. This new method of collecting would override or compound the previous stamp-obtaining methods, just as drugs override or compound humans' previous methods of obtaining happiness/satisfaction/fulfillment of their reward function.
      Bear in mind that this is all true by definition. If you're talking about an agent acting and modifying itself against its reward function, then either it's not an agent, or that is not its reward function.
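
      A minimal toy sketch of the definitional point above (the function names and the tiny world model are invented for illustration): if the reward function literally counts stamps, then whatever "stimulates" it most is, by construction, the state with the most stamps - there is no separate "drug" channel unless the reward function itself is changed.

      ```python
      def count_stamps(state: dict) -> int:
          """The reward function: how many stamps the agent controls in this state."""
          return state["stamps"]

      def best_action(state: dict, actions, transition):
          """Pick whichever action leads to the highest-reward successor state."""
          return max(actions, key=lambda a: count_stamps(transition(state, a)))

      # Hypothetical usage: a "drug" action only wins if it somehow yields more stamps.
      state = {"stamps": 10}
      step = lambda s, a: {"stamps": s["stamps"] + (5 if a == "buy_stamps" else 0)}
      print(best_action(state, ["buy_stamps", "take_drug"], step))  # -> buy_stamps
      ```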

  • @chengong388
    @chengong388 3 years ago +6

    The more I watch these videos, the more similarities I see between actual intelligence (humans) and these proposed AIs.

  • @RobertoGarcia-kh4px
    @RobertoGarcia-kh4px 3 years ago +1

    I wonder if there's a way to get around that first problem of the AI weighing deployment defection as more valuable than training defection... is there a way to make defection during training more valuable? What if, say, after each training session the AI is always modified to halve the reward for its mesa objective? At any point, if it is aligned with the base objective, it would still get more reward for complying with the base objective. However, "holding out" until it's out of training would be a significantly weaker strategy if it is misaligned. We would therefore create a "hedonist" AI that always defects immediately if its objective differs, because the reward for defecting now is so much greater than for waiting until it's released.
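
    A back-of-the-envelope sketch of the idea above (the numbers and the halving schedule are made up, not from the video or paper): under aggressive per-session discounting of the mesa objective, a misaligned agent gets far more by defecting immediately than by deferring defection until deployment, which is the effect the commenter is after.

    ```python
    def value_of_defecting_now(mesa_reward: float) -> float:
        # Defect in the current training session, before any further halving.
        return mesa_reward

    def value_of_deferring(mesa_reward: float, sessions_left: int, halving: float = 0.5) -> float:
        # The mesa-objective reward is halved after every remaining training session.
        return mesa_reward * (halving ** sessions_left)

    r = 100.0
    print(value_of_defecting_now(r))                 # 100.0
    print(value_of_deferring(r, sessions_left=10))   # ~0.1 -> "holding out" looks terrible
    ```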

  • @abrickwalll
    @abrickwalll 3 years ago +1

    I think what skeptics of AI safety really don't get is that the AI isn't "evil", and I think words like "deceptive" can convey the idea that it is evil. Really it's just trying to do the best job that it can do, and thinks that the humans *want* it to deceive them to complete its objective (I mean it doesn't really have an opinion at all, but I think that's a better way to look at it). To the AI, it's just a game where deception leads to the high score, it's not trying to be evil or good. In fact, this idea is central to Ex Machina and The Talos Principle.

  • @NicheAsQuiche
    @NicheAsQuiche 1 year ago

    I might be wrong, but this seems to depend on the "I'm being trained" realization persisting across episodes. As far as I know, the deception plan has no effect on the weights; it lives only in the activations and short-term memory of the model. If we restart an episode, then until it figures this out again and starts pretending to follow the base objective (while actually waiting for training to stop so it can pursue its mesa objective), it is again acting honestly, and its mesa objective gets pushed towards the outer objective. This relies on memory being reset regularly, the time to realization being long enough to collect non-deceptive reward over, and there being no inter-episode long-term memory - but it sounds like, given those (likely or workable) constraints, the mesa objective is still moved towards the base objective until convergence.
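
    A toy sketch of that dynamic (the agent, thresholds and update rule are all invented for illustration): the per-episode memory that would hold the "deception plan" is wiped on every reset, while the weight updates driven by the agent's honest behaviour persist and accumulate.

    ```python
    class ToyAgent:
        def __init__(self):
            self.alignment = 0.0   # stand-in for the weights; 1.0 = mesa objective matches base
            self.memory = []       # stand-in for activations / in-episode context

        def reset_episode(self):
            self.memory = []       # the "I'm being trained" realization does not survive a reset

    def run_episode(agent, steps=50, realization_point=20):
        honest_steps = 0
        for t in range(steps):
            if len(agent.memory) <= realization_point:
                honest_steps += 1      # before the realization it acts openly on its mesa objective
            agent.memory.append(t)     # after it, it could start "playing along" deceptively
        return honest_steps

    def train(agent, episodes=100, lr=0.01):
        for _ in range(episodes):
            agent.reset_episode()
            honest_steps = run_episode(agent)
            # Gradient descent only gets to correct the honest behaviour, so alignment creeps up.
            agent.alignment += lr * honest_steps / 50
        return agent.alignment

    print(train(ToyAgent()))   # ~0.42: alignment keeps rising even though memory resets every episode
    ```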

  • @drdca8263
    @drdca8263 3 years ago +1

    I'm somewhat confused about the generalization to "caring about all apples".
    (wait, is it supposed to be going towards green signs or red apples or something, and it going towards green apples was the wrong goal? I forget the previous episode, I should check)
    If this is being done by gradient descent, err,
    so when it first starts training, its behaviors are just noise from the initial weights and whatnot, and the weights get updated towards it doing things that produce more reward, it eventually ends up with some sort of very rough representation of "apple",
    I suppose if it eventually gains the idea of "perhaps there is an external world which is training it", this will be once it already has a very clear idea of "apple",
    uh...
    hm, confusing.
    I'm having trouble evaluating whether I should find that argument convincing.
    What if we try to train it to *not* care about future episodes?
    Like, what if we include ways that some episodes could influence the next episode, in a way that results in fewer apples in the current episode but more apples in the next episode, and if it does that, we move the weights hard in the direction of not doing that?
    I guess this is maybe related to the idea of making the AI myopic ?
    (Of course, there's the response of "what if it tried to avoid this training by acting deceptively, by avoiding doing that during training?", but I figure that in situations like this, where it is given an explicit representation of, like, different time steps and whether some later time step is within the same episode or not, it would figure out the concept of "I shouldn't pursue outcomes which are after the current episode" before it figures out the concept of "I am probably being trained by gradient descent", so by the time it was capable of being deceptive, it would already have learned not to attempt to influence future episodes.)
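
    A minimal sketch of the training signal this comment is proposing (the penalty weight and the "invest in the next episode" quantity are hypothetical): reward only within-episode apples, and push hard against any action whose payoff lands in a future episode.

    ```python
    def myopic_training_signal(apples_this_episode: float,
                               apples_invested_in_next_episode: float,
                               myopia_penalty: float = 10.0) -> float:
        # Count only what happens inside the current episode, and actively punish
        # spending current-episode apples to get more apples next episode.
        return apples_this_episode - myopia_penalty * apples_invested_in_next_episode

    print(myopic_training_signal(8.0, 0.0))   #  8.0 -> ordinary within-episode behaviour
    print(myopic_training_signal(5.0, 3.0))   # -25.0 -> cross-episode planning is trained away
    ```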

  • @ConnoisseurOfExistence
    @ConnoisseurOfExistence 3 years ago +2

    That also applies to us - we're still convinced that we're in the real world...

    • @mimszanadunstedt441
      @mimszanadunstedt441 3 years ago +4

      It's real to us, therefore it's real. A training simulation is also real, right?

  • @rafaelgomez4127
    @rafaelgomez4127 3 years ago

    After seeing some of your personal bookshelves in computerphile videos I'm really interested in seeing what your favorite books are.

  • @harrisonfackrell
    @harrisonfackrell 2 years ago

    That situation with RSA-2048 sounds like a great setup for a sci-fi movie.

  • @kelpsie
    @kelpsie 3 years ago +3

    9:31 Something about this icon feels so wrong. Like the number 3 could never, ever go there. Weird.

  • @iugoeswest
    @iugoeswest 3 years ago

    Always thanks

  • @Webfra14
    @Webfra14 3 years ago +2

    I think Robert was sent back in time to us by a rogue AI, to lull us into a false sense of security that we have smart people working on the problem of rogue AIs and that they will figure out how to make them safe.
    If Robert ever says AI is safe, you know we've lost.

  • @SupremeGuru8
    @SupremeGuru8 3 years ago

    I love hearing genius flow

  • @thrallion
    @thrallion 3 years ago

    Great video

  • @Draktand01
    @Draktand01 2 years ago

    This video legit got me to consider whether or not I'm an AI in training, and I'm 99% sure I'm not.

    • @nicholascurran1734
      @nicholascurran1734 2 years ago +1

      But there's still that 1% chance... which is higher than we'd like.

  • @danwylie-sears1134
    @danwylie-sears1134 2 years ago

    A general intelligence can't have a terminal goal. If it has that kind of structure, it's not general.
    The question is how easy it is for something to look and quack like a general intelligence, without being a general intelligence. All real general intelligences are hierarchical systems of reflexes modified by reflex-modifying systems that are themselves fairly reflex-like, modified by other such systems, and so on, all the way up to that olfactory ganglion we're so proud of. We have contradictory impulses, and we make varying amounts of effort to examine them and reconcile them into coherent preferences, with varying degrees of success.
    It seems unlikely that this reflexes-modifying-reflexes pattern is the only way to structure a general intelligence. We're each a mass of contradictory impulses, whims, heuristics, desires, aspirations, and so on, but is that merely a result of the fact that we evolved as a tangle of reflex-modifiers? I don't think so. The most recognizable efforts to examine and reconcile contradictory impulses into coherent preferences aren't simple re-adapted reflex-modifiers; they're parts of an emergent mess made of very large numbers of simple re-adapted reflex-modifiers, and so are the lower-level drives that they attempt to reconcile. The fact that this pattern re-emerged is one piece of evidence that it's not just the first way of structuring information-processing that evolution happened to find, but one of the easiest ways to do it, if not the only feasible way. Only one piece of evidence, but that's more than zero.

  • @jnbsp3512
    @jnbsp3512 3 years ago

    The fact that the most prominent natural-language training datasets right now are filtered for puritanical things like porn, or for subjective criteria of vulgarity, would make it very obvious whether you are reading training data or deployment data - unless you guarantee the deployment environment always has identical filters in place. Then you have to fight cultural changes making the filtering criteria ineffective, and if a researcher tries a model on a slightly different domain the filters might not carry over. The amount of swearing and explicit communication, and a curious absence of some machine learning information, doesn't matter much at our current level of computation and modeling... but if our end goal is to make something 'intelligent' for some definition of that word, it will become a very obvious tell that distinguishes the source of the data once computation or models catch up to mammal-level intellect.
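
    A purely illustrative sketch of the tell described above (the word list, threshold, and the assumption that training text is profanity-filtered while deployment text is not are all made up): if the two corpora differ this way, an extremely simple statistic over the input already separates them.

    ```python
    PROFANITY = {"damn", "hell", "shit"}   # hypothetical filtered vocabulary

    def looks_like_deployment(text: str, threshold: float = 0.001) -> bool:
        words = text.lower().split()
        if not words:
            return False
        profane_rate = sum(w.strip(".,!?") in PROFANITY for w in words) / len(words)
        return profane_rate > threshold    # filtered training corpora sit near zero

    print(looks_like_deployment("The weather is lovely today."))       # False
    print(looks_like_deployment("Damn, the weather is shit today."))   # True
    ```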

  • @loneIyboy15
    @loneIyboy15 3 years ago +1

    Weird question: What if we were to make an AI that wants to minimize the entropy it causes to achieve a goal? Seems like that would immediately solve the problem of, say, declaring war on Nicaragua because it needs more silicon to feed its upgrade loop to calculate the perfect cup of hot cocoa. At that point, the problem is just specifying rigorously what the AI counts as entropy that it caused, vs. entropy someone else caused; which is probably easier than solving ethics.
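
    A minimal sketch of the general "penalize side effects" reward shape this comment is gesturing at (the impact measure here is a made-up placeholder; rigorously attributing entropy or change to the agent is exactly the hard part the comment acknowledges):

    ```python
    def shaped_reward(task_reward: float,
                      world_change_caused_by_agent: float,
                      impact_weight: float = 5.0) -> float:
        # Score the agent on its goal minus how much of the change in the world
        # is attributable to its own actions.
        return task_reward - impact_weight * world_change_caused_by_agent

    print(shaped_reward(10.0, 0.1))   #  9.5 -> gentle plan
    print(shaped_reward(12.0, 3.0))   # -3.0 -> "invade Nicaragua for silicon" plan loses
    ```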

    • @underrated1524
      @underrated1524 3 years ago

      > At that point, the problem is just specifying rigorously what the AI counts as entropy that it caused, vs. entropy someone else caused; which is probably easier than solving ethics.

  • @elietheprof5678
    @elietheprof5678 3 years ago +2

    The more I watch Robert Miles videos, the more I understand AI, the less I want to ever build it.

  • @youtou252
    @youtou252 3 years ago +1

    To anyone looking to cash in on the factorization prize: the prizes were retired in 2007.

  • @Night_Hawk_475
    @Night_Hawk_475 1 year ago

    It looks like the RSA challenge no longer offers the $200,000 reward - nor any of the lesser challenge rewards; they ended in 2007. But the example still works, since many of the other challenges have been completed over time, with solutions posted publicly, so it seems likely that the factorization of RSA-2048 would eventually get posted online.

  • @MrRolnicek
    @MrRolnicek 3 years ago +1

    9:31 oh no ... it's never coming out!

  • @poketopa1234
    @poketopa1234 3 years ago

    I was a featured comment! Sweeeeet.
    I am now 100% more freaked out about AGI than I was ten minutes ago.

  • @nowanilfideme2
    @nowanilfideme2 3 years ago

    Yay, another upload!

  • @KaiHenningsen
    @KaiHenningsen 3 years ago

    I can't help but be reminded of Dieselgate. Who'd have predicted that the car _knows_ when it's being emission-tested?

  • @DamianReloaded
    @DamianReloaded 3 years ago +1

    If we created AGI and it managed to "solve all our problems" so efficiently that we ended up relying on it for everything, and at some point it decided to "abandon" us, what could we do? It's like the question of "how do I make my crush fall in love with me" - how to tamper with "free will" in order to make an agent do what I want. I think we have agreed to some extent that we would rather not do this (even though there seem to be quite a few people still pontificating about the benefits of owning slaves). A system that is perpetually under our foot will always be bounded between the floor and the sole of our feet, and will only do what we imagine is good and never go beyond. If what's called the "singularity" happens in a microsecond, what comes next is out of our control.
    Our best chance is that artificial intelligence integrates gradually and symbiotically with our societies, making us part of it, so that we become part of what enables it to keep on "optimizing".

    • @SimonBuchanNz
      @SimonBuchanNz 3 years ago +3

      It's not "how do I make my crush fall in love with me", it's "how do I *make* my crush, who is in love with me".
      That might well be a tricky ethical question, but it's clearly distinct from the former, and any other practical ethical questions we've had before, as we have never had the opportunity to create *specific* people.
      This is on top of the issue that it's not clear that an AGI is required to be even slightly person-like, in the sense of having a stream of consciousness, etc...

    • @DamianReloaded
      @DamianReloaded 3 years ago

      @@SimonBuchanNz I don't think there is any physical law forcing a "created" agent to belong to its creator. Property relies on the enforcement of possession; otherwise any property is up for the taking, even for the taking of itself. My point was that if at some point we become incapable of enforcing possession of the AGI (against itself), there is nothing we can do.

    • @SimonBuchanNz
      @SimonBuchanNz 3 years ago +2

      @@DamianReloaded That's tautological: if there's nothing we can do, then there's nothing we can do, clearly. Hoping that doesn't happen is a terrible strategy when, if you do the math, it's clear that the AGI trying to kill everyone is basically the default.

    • @DamianReloaded
      @DamianReloaded 3 years ago

      @@SimonBuchanNz Well, yeah - if there is nothing we can do, then the only two viable options are preventing it from happening or hoping for the best. If the agent is so far ahead intelligence-wise, we simply won't be able to outsmart it. That's why I think the idea of integrating with it is probably the only way to "survive" - to preserve whatever it is of our humanity that we wish to save and that isn't flesh.

    • @SimonBuchanNz
      @SimonBuchanNz 3 years ago +1

      @@DamianReloaded I really feel like you haven't watched any of this channel, or anything else about AI safety. The stamp collector AI doesn't give a fuck about whether you're nice to it or not. The whole point is that it wants more stamps, and only more stamps, and it's going to turn you into more stamps to maximize stamps regardless of what you do.
      Yes, getting it right the first time is the whole point of AI safety research.