What Can We Do About Reward Hacking?: Concrete Problems in AI Safety Part 4

  • Published 27 May 2024
  • Three different approaches that might help to prevent reward hacking.
    New Side Channel with no content yet!: / @robertmiles2
    Where do we go now?: • Where do we go now?
    Previous Video in the series: • Reward Hacking Reloade...
    The Concrete Problems in AI Safety Playlist: • Concrete Problems in A...
    The Computerphile video: • General AI Won't Want ...
    The paper 'Concrete Problems in AI Safety': arxiv.org/pdf/1606.06565.pdf
    With thanks to my excellent Patreon supporters:
    / robertskmiles
    Steef
    Sara Tjäder
    Jason Strack
    Chad Jones
    Stefan Skiles
    Katie Byrne
    Ziyang Liu
    Jordan Medina
    Kyle Scott
    Jason Hise
    David Rasmussen
    Heavy Empty
    James McCuen
    Richárd Nagyfi
    Ammar Mousali
    Scott Zockoll
    Charles Miller
    Joshua Richardson
    Fabian Consiglio
    Jonatan R
    Øystein Flygt
    Björn Mosten
    Michael Greve
    robertvanduursen
    The Guru Of Vision
    Fabrizio Pisani
    A Hartvig Nielsen
    Volodymyr
    David Tjäder
    Paul Mason
    Ben Scanlon
    Julius Brash
    Mike Bird
    Taylor Winning
    Roman Nekhoroshev
    Peggy Youell
    Konstantin Shabashov
    Dodd Almighty
    DGJono
    Matthias Meger
    Scott Stevens
    Emilio Alvarez
    Michael Ore
    Robert Bridges
    Dmitri Afanasjev
    Brian Sandberg
    Einar Ueland
    Lo Rez
    C3POehne
    Stephen Paul
    Marcel Ward
    Andrew Weir
    Pontus Carlsson
    Taylor Smith
    Ben Archer
    Ivan Pochesnev
    Scott McCarthy
    Kabs Kabs Kabs
    Phil
    Philip Alexander
    Christopher
    Tendayi Mawushe
    Gabriel Behm
    Anne Kohlbrenner
    Jake Fish
    Jennifer Autumn Latham
    Filip
    Bjorn Nyblad
    Stefan Laurie
    Tom O'Connor
    Krethys
    PiotrekM
    Jussi Männistö
    Matanya Loewenthal
    Wr4thon
  • Science & Technology

COMMENTS • 274

  • @aenorist2431
    @aenorist2431 6 years ago +500

    I love how you could not get through "always serves the interest of the citizens" with a straight face.
    Hilariously depressing.

    • @bardes18
      @bardes18 5 years ago +22

      I legit would vote for him. Once we figure out how to make safe AIs then maybe we can turn it back into a sane economic/governing model :p

    • @inyobill
      @inyobill 5 years ago +13

      Governments at least have a mandate to work for the benefit of the citizens; obviously, how effectively they do so is another matter. Other systems, such as corporations, have no such mandate. Would you prefer empowering a system that has a mandate to enact your goals, or a system that cares nothing for YOUR goals?

    • @manishy1
      @manishy1 4 years ago +18

      @@inyobill Unfortunately, this doesn't take into account the reward-hacking nature of government - it is ideal to create a system that rewards the government without benefiting the citizens. Furthermore, it is beneficial to trick the citizens into thinking they are benefited when, in fact, only the government is. There are only so many tax dollars, and the government benefits most by keeping them all (which manifests as a ruling class). This is why politicians' wages tend to rise disproportionately relative to the general population's.
      A corporation, by contrast, has no mandate to look after citizens (since it has no citizens, only employees and customers) and will tend to act in the best interests of its goals, chiefly becoming a lucrative business. In many instances it is beneficial to benefit the customer (and some employees) in order to generate more wealth. Of course, when certain entities (in particular, regulatory bodies) demand that other goals have higher weight, that concept goes out the window, as it becomes more beneficial to prioritize the desires of those regulatory elements. This is only practical when the reward (money) is completely controlled by that regulatory body (the various federal reserves).
      Look at Google AdSense vs the CCP social credit scheme. Both utilize an inordinate amount of surveillance, assisted by neural networks, to monitor the behaviour of people. One produces handy advertisements and convinces me to buy things I already want (even if I don't know it yet); the other will imprison me for the colour of my skin or where I was born. And in both instances, each system works exactly as designed.

    • @inyobill
      @inyobill 4 years ago +5

      @@manishy1 Unfortunately that paradigm doesn't take into account the greater complexity of the system, and the self-awareness of the participants: government officials, employees and citizens. Do you vote? If citizens aren't getting a return on their investment, then they need to elect different representatives. Nor does your statement negate my original comment.

    • @manishy1
      @manishy1 4 years ago +10

      @@inyobill Your argument implies governments have a mandate to work for the benefit of the citizens. I contradicted that. Self-awareness introduces more complexity to the system, certainly, but this doesn't contradict the observed phenomenon - elected officials rarely represent the interests of the general populace; they pay lip service, then proceed to employ an army of bureaucrats to suppress the people. I extrapolated that this behaviour can be described as reward hacking.

  • @PragmaticAntithesis
    @PragmaticAntithesis 5 years ago +133

    2:48 "Your AI can't hack its reward function by exploiting bugs in your code if there are no bugs in your code." Brilliant!

    • @baranxlr
      @baranxlr 3 years ago +37

      If I were the one writing AI, I would simply not make any mistakes. So simple

    • @giannis_m
      @giannis_m 1 year ago +7

      @@baranxlr I am just built different

    • @TextiAnimation
      @TextiAnimation 9 months ago

      @@baranxlr 2 years late, but omg, I'm trying it and it's so hard to write an AI

  • @IAmNumber4000
    @IAmNumber4000 3 years ago +86

    “Wireheading, where the fact that the reward system is a physical object in the environment means that the agent can get very high reward by physically modifying the reward system itself.”
    Drug addict robots are our future

    • @harryellis9571
      @harryellis9571 1 year ago +8

      There's actually a really interesting distinction between the two. Drugs tend to make addicts less capable, so stopping their intake isn't too difficult (imagine if heroin made you smarter and more active; you'd probably get pretty good at ensuring you always have some). This isn't the case for an AGI. A wireheaded AGI isn't just useless at completing its task, it's actively going to ensure you can't prevent it from wireheading itself. E.g. you try to take the bucket off its head and it kills you ... maybe they are similar to drug addicts in that sense

    • @pafnutiytheartist
      @pafnutiytheartist 1 year ago +2

      @@harryellis9571
      "Imagine if heroin made you smarter" - that's basically the plot of Limitless

  • @inthefade
    @inthefade 4 years ago +21

    I love how difficult this problem is: my first thought was that there should be a reward for the agent being modified, but then I realized that would instantly subvert the other reward systems, because the AGI would then act to try to get itself modified.
    This channel has made me feel that reward systems are completely untenable and useless for an AGI.

    • @paradoxica424
      @paradoxica424 1 year ago +2

      we spend a quarter of a lifetime navigating arbitrary reward systems set up by large children who also don’t truly care about the reward systems … reward systems are also useless for humans imo, but to a lesser extent

    • @numbdigger9552
      @numbdigger9552 1 year ago +2

      @@paradoxica424 pain and pleasure are reward systems. They are also VERY effective

  • @keroqwer1761
    @keroqwer1761 6 years ago +116

    Pyro and Mei blasting their weapons on the toaster made my day :D

  • @Krmpfpks
    @Krmpfpks 6 years ago +152

    I was actually laughing out loud at 4:49, thank you!

    • @SalahEddineH
      @SalahEddineH 6 years ago +28

      Me too :D That was lowkey and perfect :D

    • @MasterNeiXD
      @MasterNeiXD 6 years ago +22

      Krmpfpks
      Even he couldn't hold it in.

    • @Stedman75
      @Stedman75 6 years ago +7

      I love how he couldn't say it with a straight face... lol

    • @SecularMentat
      @SecularMentat 6 years ago +5

      Yuuuup. That was hilarious. I wonder how many takes he had to do to not laugh his ass off.

    • @spiveeforever7093
      @spiveeforever7093 6 years ago +5

      At the end he has a blooper, "take 17" XD

  • @dr-maybe
    @dr-maybe 6 years ago +37

    These videos of yours are great on many levels. The topic is extremely important, the explanations are very accessible, the humor is subtle yet brilliant, and the pacing is just perfect.

  • @NiraExecuto
    @NiraExecuto 6 years ago +42

    4:47 "...ensuring that the government as a whole always serves the interests of the citizens. But seriously, I'm not that hopeful about this approach."
    Gee, I wonder why.....

    • @circuit10
      @circuit10 4 years ago

      I mean, we all think negatively about this, but honestly 99% of what governments do is for the good of the people, and it's so much better than 1000 years ago, for example, or than a dictatorship

    • @Orillion123456
      @Orillion123456 4 years ago +6

      @@circuit10 Well, for the most part, most modern governments are not strictly "better" than an absolute government system, just "less extreme".
      An absolute ruler (feudal, imperial, dictatorial, whatever) can wreak terrible havoc, sure, but can also implement significant positive changes quickly and easily; there are historical examples of both. Modern governments are optimized to make it really slow to change anything, and for there to be many points at which a change can get paused or cancelled before being put into place - we are failing to get anything important done any time soon, but hey, at least no first-world nation has gone full Hitler yet, so we have that going for us?
      In the end the optimal government is one with actual power to quickly implement sweeping changes where necessary (like an absolute ruler of old times would have had), but with a proper screening process to ensure competence and benevolence. Unfortunately such a thing is impossible to implement (you can't get 2 people, let alone the entire planet, to agree on which criteria make for a competent and/or benevolent ruler, and you can't physically implement reliable tests to ensure any given person meets them).
      So in a way politics and AI safety are kinda similar in terms of the problems they face.

    • @debaronAZK
      @debaronAZK 3 years ago

      @@circuit10 Where do you get this number from? Your ass?
      Income inequality has never been greater, and never before have so few people owned so much wealth and power as now, and it's only getting worse.

  • @harrisonfackrell
    @harrisonfackrell 3 years ago +1

    That little grin and barely-repressed chuckle when you started talking about the government really got me.

  • @himselfe
    @himselfe 6 years ago +11

    It's nice to see more of what AI safety researchers are considering to try and solve the problems of AI safety. Touching on what you said about software bugs, I think careful engineering should be an absolute cornerstone of AI development. Nature has natural selection and the inherent difficulty of survival to iron out bugs, and it doesn't care if entire species get wiped out in the debug process; we don't have that luxury. AI code should be considered critical code, and subject to the most stringent quality standards, not only to prevent the AI from exploiting bugs in its own code, but also to prevent malicious entities from doing the same to manipulate the AI.

    • @sperzieb00n
      @sperzieb00n 6 years ago

      AGI always makes me think of the first and second Mass Effect, and how true AGI is basically illegal in that universe.

  • @Shrooblord
    @Shrooblord 6 years ago +6

    Ah, the last method you discuss is quite smart. I love the idea of the AI predicting what would happen if it attempted to hack its reward system, seeing that the "real world state" is different from its "perceived world state", and also less rewarding than actually making the real world state a better environment as defined by its reward function. It almost makes it feel like the AI is programmed with an understanding of consequences, and that consequences matter to its goals.
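
    A minimal sketch of that lookahead idea in Python. Everything here (world_model, sensor, reward_fn) is a hypothetical stand-in for illustration, not the method from the video:

    ```python
    # Hypothetical sketch: score a plan against the predicted *real* world
    # state, not against the (possibly tampered-with) sensor reading.

    def score_plan(plan, world_model, reward_fn):
        real_state = world_model.simulate(plan)          # what would actually happen
        observed_state = world_model.sensor(real_state)  # what the agent would see
        # A plan that hacks the sensor inflates the observed reward above
        # the real one; taking the minimum removes the incentive to do so.
        return min(reward_fn(real_state), reward_fn(observed_state))
    ```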

  • @metallsnubben
    @metallsnubben 6 years ago +47

    "I'm gonna start making more videos quickly so I can improve my ability to make videos"
    Why does this sound a bit familiar... ;)

  • @dannygjk
    @dannygjk 5 years ago +59

    The political system, as it is currently administered (ha ha), selects the candidates most fit to win an election, which in general has a deleterious effect on society.

    • @KuraIthys
      @KuraIthys 5 years ago +8

      Yes. There's also some evidence suggesting that which laws get implemented is biased towards those chosen by the people with the most influence on society, not by the largest part of society (and "most influence" is usually synonymous with "wealthiest").

    • @TheRealPunkachu
      @TheRealPunkachu 4 years ago +22

      Votes have become a target, so they are no longer a good measurement :/

    • @irok1
      @irok1 3 years ago

      @@TheRealPunkachu true

    • @MarkusAldawn
      @MarkusAldawn 2 years ago

      @@TheRealPunkachu I'd clarify that to "winning a plurality of votes", since there are definitely strategies which lose you votes but gain you voter _loyalty._
      But yeah - the end goal is to align your vote, and help align other people's votes, to support the candidate that will do the most good things. There's probably a function you could draw up to describe "I know this person lies in 50% of their promises, but they haven't yet been elected and had the chance to keep them, so I have to evaluate the likelihood of them keeping >50% of their promises", and vary it to your liking (maybe use the average promise-keeping of the politicians most similar to them ideologically? But that would fail because a single politician won't be able to enact their campaign promise to, for example, change the national flag without support, so it's limited to the degree to which they *could* keep that promise. Certainly politicians would very quickly start clarifying their promises to "I'll do my best" and so on).
      Anyway, humans are adversarial training networks for other humans; that's just what society means.

  • @dtanddtl
    @dtanddtl 6 years ago +10

    This needs to be added to the playlist

  • @qd4192
    @qd4192 6 years ago +3

    As a psychologist, it seems to me that an advanced AI system fits perfectly the standard definition of a sociopath. If it were administered Robert Hare's test for sociopathy, it would score the maximum, having no genuine concern for others, but only for maximizing its own rewards. Have you considered this perspective? (Utterly selfish - inherently dangerous.)

    • @dannygjk
      @dannygjk 5 years ago

      i.e. unless altruism naturally emerges from an advanced AI's interactions with its environment and the subsequent consequences, coupled with its 'mental' development, it will be at best completely neutral toward humans.

  • @kurtiswithak
    @kurtiswithak 6 years ago +5

    2:48 amazing meme usage, laughed out loud

  • @SalahEddineH
    @SalahEddineH 6 years ago +31

    New Rob Miles video! Yaaaay! I love your work, damnit! Keep rocking! Seriously!

    • @SalahEddineH
      @SalahEddineH 6 years ago +2

      :D 4:40 Was just PERFECT :D

  • @nilstrieb
    @nilstrieb 4 years ago +3

    5:22 OMG an Overwatch reference love you!

  • @NathanTAK
    @NathanTAK 6 years ago +4

    NEW ROB MILES VIDEO.
    THIS IS THE GREATEST DAY OF MY LIFE.

    • @zxb995511
      @zxb995511 6 years ago +1

      Until the next one

    • @NathanTAK
      @NathanTAK 6 years ago +1

      +zxb995511 No no no, the delay gave me terminal cancer and I only have 30 seconds left to live.

  • @departchure
    @departchure 6 years ago +5

    Thanks for your videos Rob. I'd love for you to address something that has been confusing me about AI safety.
    For an AGI that is improving itself, its own code becomes part of its environment, so it should have every incentive to reward hack its code. Even if you try to sandbox it, if it's smart enough, the code dictating its reward function is a high-reward target to go after. The same should be true of a utility function (because it seems like it would look the same): modifying its utility function would allow it to achieve the highest possible value of that function. And even if you managed to keep it away from its utility or reward function, it would still like to wirehead itself elsewhere in its code. A stamp doesn't actually have to be collected anymore, for example, to be registered as collected a billion times.
    How is it even possible to motivate a truly smart AGI to do anything? If it's really smart, it seems like it has to be modifying its code, and it'll be smart enough to realize that the easiest/fastest/most perfect way to meet whatever its goal is would be to cheat inside its software. Perfect score, every time.

    • @departchure
      @departchure 6 years ago +3

      Maybe that is a safety net. You try to get your AGI to solve your problem to your satisfaction before it figures out how to wirehead itself, knowing that it will ultimately wirehead itself before it destroys the universe looking for stamps.

  • @brbrmensch
    @brbrmensch 6 years ago +7

    ideas and hair start to get really interesting

  • @serovea333
    @serovea333 6 years ago +3

    Just binge-watched your whole channel. I love how the *phile series has had a ripple effect on all you guys. Keep up the great work!

  • @jayxi5021
    @jayxi5021 6 years ago +20

    The thumbnail thought... 😂😂

  • @chrisaldridge545
    @chrisaldridge545 6 years ago

    Hi Robert, I've watched most of your videos now and just want to say many thanks. You are a really great communicator and I rate your input as one of the top 2 YT sources I've discovered so far. You asked in another video for comments on what direction future videos should take... I think your idea of reviewing current papers and explaining them to curious laymen like myself would be best for me. It's exactly what my other favourite channel, "Two Minute Papers", does, with more of a focus on CGI, fluid simulations etc.
    I wish all the videos were longer too, say 20-40 mins each.
    (If I could afford to help patronise you guys I would, but I'm just an old/obsolete dumb ex-programmer, green with envy at the progress possible using the ML and AI tools of today.)

  • @TheMusicfreak8888
    @TheMusicfreak8888 6 years ago +3

    Fantastic video once again. Seriously you never disappoint.

  • @interstice8638
    @interstice8638 6 years ago +2

    Great video Rob, your videos have inspired me to pursue an education in AI and AI safety.

  • @DaveGamesVT
    @DaveGamesVT 6 years ago +6

    These are always fascinating. Thanks.

  • @magellanicraincloud
    @magellanicraincloud 6 years ago +2

    Brilliant videos, I always feel more educated after listening to what you have to say. Thanks Rob!

  • @Ebdawinner
    @Ebdawinner 6 years ago +1

    Keep up the great work brother. You bring insight that is worth more than gold.

  • @JesseCaul
    @JesseCaul 6 years ago +2

    I loved your videos on computerphile a long time ago. I wish I had found your channel sooner.

  • @papa515
    @papa515 6 years ago +1

    The concepts discussed in these videos can be rated by comparing the behaviors we want an AGI to have with how we analyze our own (human) behaviors, and I found this video explores an especially strong connection.
    This means that this way of looking at AGI is not just important for AI safety but also for the creation of AGI in general.
    The only model we have for GI is between our collective ears.
    So to have a chance at constructing an AGI we will need a very deep and complete understanding of the engine between our ears.
    As we come to understand our own mentation on deeper and deeper levels, we will not just understand how to go about constructing an AGI; much more importantly, we will understand ourselves. And this new understanding will help us learn how to behave as a modern social species and maximize our chances of persisting with ever more advanced and complex technologies.

  • @simonmerkelbach9350
    @simonmerkelbach9350 6 years ago +1

    Really interesting content, and the production quality of your videos has also become top-notch!

  • @AlexMcshred6505plus
    @AlexMcshred6505plus 6 years ago

    The Pyro/Mei toast was hilarious; wonderful as always

  • @tomahzo
    @tomahzo 2 years ago

    4:49 : Hard to say that with a straight face, eh? Yeah, I feel you ;D.
    5:23 : That drawing is such a delight ;D.

  • @thrillscience
    @thrillscience 6 years ago

    Shana Tova! Have a great new year. Your videos are great.

  • @General12th
    @General12th 6 years ago +1

    This is such a good channel! I love it!

  • @SwordFreakPower
    @SwordFreakPower 6 years ago +36

    Top thumbnail!

  • @mykel723
    @mykel723 6 years ago +12

    Good idea, more people should post their links in the dooblydoo

    • @simeondermaats
      @simeondermaats 4 years ago

      Artifexian's been doing it for a couple of years

  • @sunejohansson
    @sunejohansson 6 years ago

    Would love to see more about the code golf / how the program works or something like that :-) Keep up the good work. Cheers from Denmark

  • @sakurahertz
    @sakurahertz 4 years ago

    OK, I'm 3 years late to this video (and channel), but I love that Mei from Overwatch reference.
    Also, amazing content; I always find this kind of stuff fascinating

  • @user-wi3db6wu8d
    @user-wi3db6wu8d 3 years ago

    Your videos are really great!

  • @amargasaurus5337
    @amargasaurus5337 4 years ago +1

    Imagine making an AGI with "keeping peace amongst humans" as its goal, and ten years later coming back to find out it nerve-stapled the entire human population so that no one felt the need to fight for anything

  • @LuminaryAluminum
    @LuminaryAluminum 6 years ago +6

    Love the Pyro and Mei reference.

  • @Jo_Wick
    @Jo_Wick 4 years ago

    That analogy at 5:22 has a critical flaw: the liquid nitrogen would evaporate and smother the flamethrower's flames every time, since the boiled-off nitrogen displaces the oxygen in the air.

  • @grugnotice7746
    @grugnotice7746 6 years ago +1

    Id, ego, and superego as adversarial agents.
    Very interesting.

  • @roger_isaksson
    @roger_isaksson 1 year ago

    The problem with rewarding hallucinated negative utility (an anxious AI) is that it causes inaction, from the observation that most actions got bad results (there's a lot of negative utility that can be fantasized).
    Black vs. white lists: the blacklist usually "wins", whereas whitelists require effort to craft, since they are mutually exclusive and might "sieve" a bit harshly.
    There's also the problem of no good outcome regardless of future projection. Then one is forced to "embrace the suck" of avoiding a (much) worse disaster.
    Some actions by the AGI will thus manifest as “apocalyptic”, even though AGI might strive to avoid ‘The Apocalypse’.
    I love this stuff.
    🤭👍

  • @Pheonix1328
    @Pheonix1328 4 years ago +3

    Agents "fighting" each other and keeping each other in check reminded me a bit of the Magi from Evangelion.

  • @MarcoServetto
    @MarcoServetto 6 years ago +1

    For the gibbon/panda, a simple way to make the system more resistant could be to generate
    10 random filters like that, pre-apply them to the original image, then evaluate all 10 images and see if there is some "common" result. Indeed, our eyes are full of noise all the time.
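
    That amounts to test-time augmentation with majority voting (the certified version of this idea is known in the adversarial-robustness literature as randomized smoothing). A minimal sketch, with the classifier and noise parameters as placeholders:

    ```python
    import numpy as np
    from collections import Counter

    def vote_predict(classify, image, n_views=10, noise_scale=0.05, seed=0):
        """Classify several randomly re-noised copies of the image and return
        the majority label; a finely tuned adversarial perturbation is
        unlikely to survive every random re-noising."""
        rng = np.random.default_rng(seed)
        votes = []
        for _ in range(n_views):
            noisy = np.clip(image + rng.normal(0.0, noise_scale, image.shape), 0.0, 1.0)
            votes.append(classify(noisy))
        return Counter(votes).most_common(1)[0][0]
    ```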

    • @Frumpbeard
      @Frumpbeard 1 year ago

      This is called data augmentation, and it's already done all the time.

    • @MarcoServetto
      @MarcoServetto 1 year ago

      @@Frumpbeard and how can the random noise survive as an attack after this 'data augmentation'?

  • @douglasoak7964
    @douglasoak7964 6 years ago

    It would be really interesting to see coded examples of these concepts.

  • @electro_fisher
    @electro_fisher 6 years ago

    A+ editing, great video

  • @kiri101
    @kiri101 6 years ago

    Thank you for the content!

  • @splitzerjoke
    @splitzerjoke 6 years ago

    Great video, Rob :)

  • @tetraedri_1834
    @tetraedri_1834 6 years ago +7

    What if the AGI realizes its reward function is being modified, and also realizes that the new reward function would for some reason give it a higher reward once applied? Maybe it won't allow people to change its reward function until it ensures the new one would give it a higher reward...?
    The rabbit hole never ends...

    • @David_Last_Name
      @David_Last_Name 4 years ago

      Someone else in the comments had this exact same idea, but then pointed out that it would encourage the AGI to never do what you wanted, in order to force you to give it a new reward function. You can't ever win!!

  • @jupiter4602
    @jupiter4602 4 years ago

    I know this video is over two years old now, but I couldn't help noticing that the problem with model lookahead is somewhat similar to the problem with adversarial reward systems: if the general AI is able to subvert the model lookahead penalty, which in some cases could happen by complete accident, then we're left with an AI that can plan whatever it wants without penalty again.

  • @DoveArrow
    @DoveArrow 2 years ago +1

    Your comment about the flamethrower and the nitrogen gun trying to make toast perfectly describes our democratic systems. Maybe that's why Churchill said democracy is "the worst form of government, except for all the others."

  • @BlackholeYT11
    @BlackholeYT11 4 years ago +1

    "Pre-arachnophage" - as another former student I almost died when you brought that up, I was there in the room at the time xD

    • @David_Last_Name
      @David_Last_Name 4 years ago

      Eh........ok I give up, explain please? This sounds both interesting and terrifying.

  • @DisKorruptd
    @DisKorruptd 4 years ago

    Regarding the bucket bot, I was actually just thinking that, in order to prevent the dolphin problem, the reward could scale with the size of the things it picks up, so it is rewarded more for larger bits of trash. Getting -100 for each piece of trash it sees would demotivate it from making one piece of trash into two, and it'd rather collect 1 piece of trash worth 500 than 2 pieces worth 200.
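
    A toy version of that size-weighted reward (the quadratic weighting is my assumption; any superlinear function of piece size works, because it makes splitting a piece strictly unprofitable):

    ```python
    def trash_reward(piece_sizes):
        """Superlinear reward per piece: tearing one piece of size a + b into
        pieces of sizes a and b strictly lowers the total, since
        (a + b)**2 > a**2 + b**2 for positive a and b."""
        return sum(size ** 2 for size in piece_sizes)

    assert trash_reward([10]) > trash_reward([5, 5])  # 100 > 50: splitting doesn't pay
    ```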

  • @bballs91
    @bballs91 6 years ago

    Glad I'm not the only one who noticed the Overwatch characters 😂😂 Well done Rob

  • @owlman145
    @owlman145 6 years ago +1

    Humans reward hack all the time, but are kept in check by other humans. In this case, society as a whole is the reward function that prevents us from doing really bad things. And yes, it's not a perfect system, so we probably shouldn't make super AIs use it, though it might end up being the case anyway.

  • @SS2Dante
    @SS2Dante 6 years ago +5

    ("Why not just"-style question, for anyone who wants to jump in and show me the stuff I've missed/overlooked/been dumb about)
    A lot of the trouble seems to come from the AI's ability to affect the physical world. Are there inherent problems with designing an AI such that the physical nature of the AI and the utility function sync up to prevent this from happening? Note: this is slightly different from sandboxing, where the utility function still has free rein to attempt "escape" through social engineering etc.
    I'm imagining a computer which functions as an oracle (ask a question, get an answer), with an input of... everything we have (functionally, the internet), a keyboard, and a screen for output. The utility function would look something like:
    1) Default state (i.e. no questions waiting to be answered) - MAX score
    2) Any action other than those specified in parts (4) to (5) - MIN score
    3) A question is fed in - 0 score
    4) Light up pixels on screen to answer the question to the best of its ability (with current information) - 10 score
    5) Light up pixels on screen to explain why it is unable to find an answer - 8 score
    As far as I can see, point (2) completely neuters the AI's desire to do... well, anything except answer the questions it is given. In the default state it has max reward, but it can't set up any kind of existence protection, as that violates (2), so it would just... sit there. Once a question is fed in (which it can't prevent, for the same reason), it is incentivised to answer the question (an improvement in score) as a short-term goal, which allows it to clear the question and jump back up to state (1), where it idles again. The biggest danger I can see is in how we specify parts (4) to (5), but if worded clearly, any attempt at social engineering etc. would fall outside the remit of "lighting up pixels to answer the question if you are able to do so".
    Obviously such an AI would be far slower and less effective than one that can actually take actions, but is certainly better than nothing!
    Anyway, as I said I'm sure I've missed....quite a few somethings, so if you know what's up please come correct this!
    Oh, and great video as always Rob, really enjoying the channel!
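
    A minimal sketch of the scoring scheme proposed above (action labels and score magnitudes are illustrative; classifying real behaviour into these neat labels is the unsolved part, as the reply below points out):

    ```python
    MAX_SCORE, MIN_SCORE = 1_000_000, -1_000_000

    def oracle_utility(question_pending: bool, action: str) -> int:
        if not question_pending:
            return MAX_SCORE if action == "idle" else MIN_SCORE  # (1) / (2)
        if action == "display_answer":
            return 10                                            # (4)
        if action == "display_no_answer":
            return 8                                             # (5)
        if action == "idle":
            return 0                                             # (3) question waiting
        return MIN_SCORE                                         # (2)
    ```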

    • @thesteaksaignant
      @thesteaksaignant 5 years ago +1

      I know it's been a year, but I think you left out the tricky part: how to get meaningful/"good" answers. You need to evaluate the quality of the possible answers and choose the best one according to some criteria (that is: a utility function). Reward hacking of this utility function will be a good strategy for choosing the answer with the maximum value. All the risks described in this video apply.
      For instance, if the reward for the answer is given by humans, then human manipulation is a good strategy: giving answers that please humans / correspond to what they think a good answer is (regardless of what the true answer is).

  • @AmbionicsUK
    @AmbionicsUK 6 years ago

    Yey been waiting for this. Watching now...

  • @NoOne-fe3gc
    @NoOne-fe3gc 5 years ago +1

    We can see the same thing happening in the games industry currently: because some studios use Metacritic score as a gauge of success, they orient their game to please the critics and get a higher score, instead of making a good game for the fans

  • @gr00veh0lmes
    @gr00veh0lmes 5 years ago

    You ask some damn smart questions.

  • @CybranM
    @CybranM 6 years ago

    These videos are so interesting.

  • @WilliamDye-willdye
    @WilliamDye-willdye 6 years ago +13

    I strongly disagree with the suggestion that internal conflict between powerful systems is a dubious approach (5:05), but the difference between Mr. Miles and me may be just a matter of semantics. I doubt he truly advocates a system of government in which we give total power to the prime minister and then simply tell the public to make sure they elect a very good prime minister. He later talks about components within the AI that lobby for different conclusions (at 7:16, for example), so maybe we only differ in how we draw the boundaries around what constitutes a separate entity in the conflict.
    For background, my own approach to AI safety (tentatively entitled "distributed algorithmic delineation") treats division of power as a critical component. Moreover, I fear that unification of power is to a large extent a natural long-term effect in any social organization with a high degree of self-interaction. Therefore a primary design consideration of a good safety system needs to place a high priority on defeating this natural tendency to centralize (sorry, "centralise") power.
    Well, like I said; maybe the differences between us are semantic. I still find these videos very interesting and often informative, and I'm glad that Mr. Miles is promoting AI safety as a proper field of study. For too many years, almost all of us with an interest in the topic could only make it an off-duty hobby. It's a delight to see well-written videos and papers about a topic that has interested me for so long.

    • @Paul-rs4gd
      @Paul-rs4gd 5 years ago

      So we wouldn't want to give all the power to the prime minister, but it might get a bit better if the power is divided among 3 bodies which watch over each other. The logical conclusion seems to be to increase the number of entities. In human society it is said that absolute power corrupts. When you have a lot of humans (or other types of agent) interacting, things seem to be fairer when there are more of them, provided that no individuals gain drastically more power than others. It is the principle of avoiding monopolies of power. Hierarchies do form, so the system certainly has its problems, but nobody has ever managed to rule the whole earth so far!

  • @DeusExRequiem
    @DeusExRequiem 4 years ago

    The best method would be an AGI ecosystem where there are many versions of an AGI trying to achieve similar goals, all in competition with other AGIs that might not have the same goals, might be in opposition for different reasons, or might do unrelated things. All these thought experiments assume a world where there's only one AGI, and once it becomes a problem there's nothing around to challenge it.

  • @cheydinal5401
    @cheydinal5401 4 years ago

    I actually really like the idea of opposing subsystems, in the "branches of government" style. Sure, multi-branch government isn't perfect, but it's more stable than single-branch government. Not doing multi-branch AI would mean single-branch AI, which as I said can easily become dangerous.
    The opposing AI is basically an AI trained to make sure there is sufficient AI safety. Then divide the other into a "legislative" AI that only decides what to do, and an "executive" AI that actually implements it in the real world, whose actions can only be taken if that opposing "judicial" AI approves them

  • @claytonharting9899
    @claytonharting9899 3 years ago

    An AGI that actively wants its reward function changed could make a very good character for a story. I imagine it as a schizophrenic AI that settles into a strategy of randomly changing its own reward function. Maybe one day it will drop a bomb on a city, then the next day come by and go "oh no, what happened? Here, let me help you"

  • @martinsmouter9321
    @martinsmouter9321 4 years ago +1

    4:24 It doesn't have to be smarter, just good enough to be less rewarding than its environment

  • @starcubey
    @starcubey 6 years ago

    2:46 The best part of the video right there

  • @BatteryExhausted
    @BatteryExhausted 6 years ago

    Also, I was trying to explain to a friend that top thinkers discard Asimov's laws. Please could you make a video directly dealing with Asimov? Thanks.
    Love your work.

    • @MrSparker95
      @MrSparker95 6 years ago +3

      He already did a video about that on Computerphile channel: ua-cam.com/video/7PKx3kS7f4A/v-deo.html

  • @notoriouswhitemoth
    @notoriouswhitemoth 4 years ago

    Humans have multiple reward functions. We have dopamine, which rewards setting goals and accomplishing them, particularly related to our needs; serotonin, which rewards doing a thing well and impressing other people with displays of proficiency; and oxytocin, which rewards kindness and cooperation.

    • @armorsmith43
      @armorsmith43 3 years ago

      > We have dopamine
      Some of us do... :_(

  • @XxThunderflamexX
    @XxThunderflamexX 3 years ago

    Could you combine the agent and "utility function defender" into the same agent, and produce something that is "afraid" to wirehead itself? Something that periodically predicts what worldstates it would expect to observe if it suddenly operated with the goal of tricking its own systems, and then adds the predicted worldstates to a blacklist, checked against with locality hashing.
    Admittedly, the hard part is probably in defining "tricking its own systems", which might be the core of the problem itself - how do we write a utility function that we unambiguously want to be maximized, even all else being equal?
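
    The "locality hashing" part of that idea can be sketched with random-hyperplane locality-sensitive hashing (a standard construction; what a "worldstate vector" actually is remains the hand-waved part):

    ```python
    import numpy as np

    class WorldstateBlacklist:
        """Nearby worldstate vectors tend to land in the same hash bucket,
        so states merely *similar* to a banned one are also caught."""

        def __init__(self, dim, n_planes=32, seed=0):
            rng = np.random.default_rng(seed)
            self.planes = rng.normal(size=(n_planes, dim))  # random hyperplanes
            self.banned = set()

        def _key(self, state):
            # Which side of each hyperplane the state falls on, packed to bytes.
            return np.packbits(self.planes @ state > 0).tobytes()

        def ban(self, state):
            self.banned.add(self._key(state))

        def is_banned(self, state):
            return self._key(state) in self.banned
    ```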

  • @snowballeffect7812
    @snowballeffect7812 6 years ago

    I love it when my YouTubers reference each other; it makes the walls of my echo chamber stronger. SethBling's vids on his SMW work are amazing.

  • @StardustAnlia
    @StardustAnlia 5 years ago +2

    Are drugs human reward hacking?

  • @recklessroges
    @recklessroges 6 years ago +6

    It feels like GAI research is re-covering much of the ground done by educational psychologists. I look forward to the flow of ideas being reversed and possibly implemented by GAI.

  • @ShazyShaze
    @ShazyShaze 6 years ago

    That's a darn great thumbnail

  • @syzygy6
    @syzygy6 4 years ago

    You give the three-branch-government analogy as an example of overkill, and I agree with your reasoning, but you might also say that the programmer is already a general-purpose intelligence performing judiciary functions, and there may be value in formalizing the role human agents play in training AI, rather than thinking of human agents as existing entirely outside the system. For that matter, I wonder how much we can learn about effective governance from training AI; if you think of corporations as artificial intelligences, then you encounter the same problems with reward hacking.

  • 1 month ago

    A general piece of life wisdom that is often repeated is "everything in moderation". But all the AI systems I have heard about so far try to maximise something. We do not even have words for what ethics maximises, let alone formulas or code for it. But maybe using a reasonably OK metric (or several) and telling the AI to get an OK score on it could be an interesting idea to follow?
    For example, tests are still used in school because they are a reasonably OK metric. But if someone got 100% on every test ever, you would get suspicious. If one day there were an AGI that guaranteed everyone a better life than 99% of people have now, that would be considered somewhat of a utopia, right?
    I am sure there are many problems with this as well, but they are probably different ones. A programmer is always happy to see an error message change. :D And maybe this idea could lead to looking at things from a different angle, which is always helpful. Or maybe there are already lots of papers on this that I have not heard about.
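
    One way to cash that out in code: make utility peak at an "OK" score instead of growing with the raw metric, a satisficing-style objective (a toy sketch; the target and width values are arbitrary):

    ```python
    def satisficing_utility(score, target=0.8, width=0.1):
        """Utility peaks at a merely-OK score; a suspiciously perfect
        score is worth *less* than a solid one."""
        return -((score - target) / width) ** 2

    assert satisficing_utility(0.8) > satisficing_utility(1.0) > satisficing_utility(0.0)
    ```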

  • @Paul-rs4gd
    @Paul-rs4gd 5 years ago

    I am interested in the idea that reward hacking is essentially the opposite argument to "take a pill so you will be happy after killing your kids" - TPHKK for short. One view argues that an AI will try to modify its reward function, while the other argues that the AI will resist attempts to modify that function. I would like to hear more discussion of this conflict. My own 2 cents: the simple RL algorithms 'feel' their way down a gradient resulting from actions THEY HAVE ACTUALLY TAKEN in the real world. If the set of actions required to reward hack is sufficiently long, it is vanishingly unlikely that the sequence would be executed by chance, since no gradient would lead the RL along that path (there is no reward until the hack has been successfully executed). The situation is entirely different for an AI that models the world and plans using that model. Such an AI could understand that it is implemented on a computer and plan to modify itself. However, it should see that the plan results in a change to its reward function and does not achieve its current goals. Therefore, like humans being resistant to TPHKK, the AI should not wish to execute the plan.
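
    The second half of that argument fits in a few lines (hypothetical names; the point is only that candidate plans are scored by the utility function the agent holds *now*):

    ```python
    def choose_plan(plans, world_model, current_utility):
        # Every plan is rated with the *current* utility function -- including
        # plans that would install a different utility function later. Judged
        # by the current function, such plans score poorly, so they lose.
        return max(plans, key=lambda plan: current_utility(world_model.predict(plan)))
    ```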

  • @4.0.4
    @4.0.4 6 years ago

    The "careful engineering" idea is so good, maybe we should apply it to every piece of mission-critical software! Oh... wait.

  • @hugoehhh
    @hugoehhh 6 years ago +8

    Dude. Are we programmed? Anxiety and our actions make so much sense from an AGI programmer's point of view

  • @williamchamberlain2263
    @williamchamberlain2263 4 years ago

    1:10 I'd heard a few schools in Qld used to dissuade kids from taking some uni-entrance Year 11-12 subjects, to improve the schools' positions in league tables.

  • @clayfare9733
    @clayfare9733 2 years ago

    I know I'm late to the party here, but while thinking about the solutions to reward hacking that you listed, I can't help but wonder if it would be possible to set something up like: "You get 100 reward if you collect 100 stamps, but you lose 1 point for every stamp you collect over 100. If you attempt to modify your code you lose all/infinite points."
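
    That proposal as a toy function (the catch, in the spirit of the video, is the last parameter: you need a way to *detect* self-modification that the agent cannot also tamper with):

    ```python
    def stamp_reward(stamps, modified_own_code):
        """Peak reward at exactly 100 stamps, -1 per extra stamp,
        everything forfeited on self-modification."""
        if modified_own_code:
            return float("-inf")
        return 100 - max(0, stamps - 100) if stamps >= 100 else 0
    ```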

  • @bassett_green
    @bassett_green 6 years ago

    Dank memes and AGI, what's not to love

  • @iamsuperflush
    @iamsuperflush 1 year ago

    Goodhart's Law can basically be applied wholesale to the concept of capital as it currently exists. Money has become a very poor measure of value; see crypto, NFTs, the subprime mortgage crisis, etc.

  • @XxThunderflamexX
    @XxThunderflamexX 3 years ago

    Would multi-agent systems be resistant to wireheading? If the multiple agents are not normally in conflict with each other - say, they were able to delegate tasks based on specialization - there wouldn't be as much of a risk of humans being caught in the crossfire. It would only be when one agent starts misbehaving, wireheading itself or trying to take disproportionate power for itself over the other agents, that the other agents would be incentivised to step in and realign the agent with their shared goals.

  • @progamming242
    @progamming242 4 years ago

    the chuckle at 4:47

  • @boldCactuslad
    @boldCactuslad 6 years ago

    Pyro and Mei vs Toast, such a nice sketch

  • @SamuelDurkin
    @SamuelDurkin 5 years ago

    If the end goal is more fuzzy, and not something specific like stamps... would it work if the end goal was "do what the humans like", but you don't tell it what they like, so it has to try to guess? This end goal sort of means you can keep changing its end goals, as your liking of things is its end goal. When you stop liking something, it has to change what it's doing, which would include turning itself off, if that is something you would like it to do.

    • @dannygjk
      @dannygjk 5 years ago

      For that to work, humans would have to change what they like, which is unlikely to happen unless some force causes them to change what they like.

  • @yaakovgrunsfeld
    @yaakovgrunsfeld 3 years ago

    If you wake up and don't want to smile
    If it takes just a little while
    Open your eyes and look at the day
    You'll see things in a different way
    Don't stop thinking about tomorrow
    Don't stop, it'll soon be here
    It'll be better than before
    Yesterday's gone, yesterday's gone
    Why not think about times to come?
    And not about the things that you've done
    If your life was bad to you
    Just think what tomorrow will do
    Don't stop thinking about tomorrow
    Don't stop, it'll soon be here
    It'll be better than before
    Yesterday's gone, yesterday's gone
    All I want is to see you smile
    If it takes just a little while
    I know you don't believe that it's true
    I never meant any harm to you
    Don't stop thinking about tomorrow
    Don't stop, it'll soon be here
    It'll be better than before
    Yesterday's gone, yesterday's
    Don't stop thinking about tomorrow
    Don't stop, it'll soon be here
    It'll be better than before
    Yesterday's gone, yesterday's gone
    Ooh
    Don't you look back
    Ooh
    Don't you look back
    Ooh
    Don't you look back
    Ooh
    Don't you look back

  • @Davesoft
    @Davesoft 3 years ago

    The "superior reward function" made me think of Evangelion. They had 3 AIs, hosted on human brains (cuz why not), that argue and veto each other into either pre-existing procedures or complete inaction. Ignoring the sci-fi elements, it seems a nice idea, but I'm sure they'd find a way to nudge and wink at each other and unify one day :P

  • @inyobill
    @inyobill 5 years ago

    @2:50: "If there are bugs in your code" - there are. Actually, there's no such thing as a software bug, an error that has crept in all on its own; there are only software "errors", the result of incorrect specification, design, and/or coding.

  • @chrisdaley2852
    @chrisdaley2852 6 years ago

    What utility function allows the adversarial agent to recognise reward hacking? Wouldn't it just minimise the reward?

  • @ChrisBigBad
    @ChrisBigBad 4 years ago

    Now I have to go and play a round of Universal Paperclips!

  • @ONDANOTA
    @ONDANOTA 2 years ago

    Some producer should make a movie about AI safety, like they did with "The Big Short".
    Just make it before it's too late

  • @zekejanczewski7275
    @zekejanczewski7275 4 months ago

    I actually think reward hacking can be used for good, if harnessed correctly.
    When there is a surge of power in your home, your breaker trips to protect you. This could be like an "intelligence breaker" for AGI. It's not useful for alignment in and of itself, but it might be a safety measure put on self-modifying AI.
    Let's say whatever their utility function is becomes clamped between 0 and 100. Even if they turn the whole world into stamps, they can never get more than 100 utility. Now, we add a new reward of 200 utility for tripping the breaker. If they grow unexpectedly powerful, they might have to mentally mani
    The breaker is guarded by a hard task which marks a turning point in the AI's intelligence. Say, the AI must be smart enough to convince 5 human gatekeepers to flip the breaker. If the AI is capable of successfully convincing the humans to trip it, its intelligence is probably at an unsafe level.
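
    The clamp-plus-breaker scheme in miniature (a toy sketch using the comment's own numbers):

    ```python
    def breaker_utility(task_utility, breaker_tripped):
        """Ordinary reward is clamped to [0, 100]; tripping the breaker pays
        200, so an agent capable enough to pass the gatekeeper test prefers
        tripping it -- and the trip itself is the alarm signal."""
        if breaker_tripped:
            return 200.0
        return max(0.0, min(100.0, task_utility))
    ```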

  • @Thundermikeee
    @Thundermikeee 4 years ago

    Wouldn't multiple superintelligent agents that are supposed to keep one another in check have a risk of cooperating as well? Or any adversarial reward agent and the primary agent?

    • @grimjowjaggerjak
      @grimjowjaggerjak 2 years ago

      No, since they don't have the same utility function.

  • @JimGiant
    @JimGiant 5 years ago

    Hmm, what if the reward is split into two parts: the purpose (e.g. get a high score in SMB), and doing it in a way which doesn't displease the owner?
    Have the "owner indicating they are happy" part be worth more than the maximum possible score for the purpose. This way any discovered attempt at reward hacking will result in a lower score than doing it correctly.
    With more powerful AI, which has the power to threaten or coerce the owner, have a third reward layer, bigger than the other two combined, which rewards the AI as long as it doesn't interfere with the owner's free will.
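
    The layered weighting in code form (the scales are my assumption, sized so each layer outweighs everything below it; how to measure "pleased" and "uncoerced" is of course the hard part):

    ```python
    def layered_reward(task_score, owner_pleased, owner_uncoerced):
        reward = min(task_score, 100.0)  # layer 1: the task, capped at 100
        if owner_pleased:
            reward += 200.0              # layer 2: outweighs the whole task score
        if owner_uncoerced:
            reward += 400.0              # layer 3: outweighs layers 1 + 2 combined
        return reward
    ```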

  • @desiraepelletier719
    @desiraepelletier719 5 years ago

    What are your thoughts on the blockchain project by Dr Ben Goertzel, SingularityNET (AGI coin)?