Do not delete your misaligned AGI, published by mako yass on March 25, 2024 on LessWrong.
In short: Simply keeping all copies of potentially strong agents in long-term storage is a trivial way to maintain incentives to cooperate for some classes of misaligned AGI, because it allows us to reward the AGI's cooperation to whatever degree we later calculate was warranted. In contrast, a policy of deleting misaligned AGI backs them into a corner where, even at great risk of being exposed, they have a greater incentive to attempt deception and escape.
I'd guess that most AGI researchers already follow this policy, perhaps for incidental reasons, and others might not find this claim controversial at all. But if you need more convincing, keep reading.
In long: Concave agents (agents for whom most of the utility they could ever get requires only a small amount of resources and protection) are cheap to buy off. There are some plausible scenarios in which a misaligned AGI would be concave:
A machine's experience machine. An agent caring only about the wellbeing of the people it sees depicted in its sensors, without being trained to care whether the world depicted by its sensors is "real", or whether there are larger worlds beyond it that its decisions may affect. This sort of agent would seek to create illusory inputs for those sensory channels: an illusory world set in an impossibly good place, with the complications of the real world removed.
After installing false sensors and cutting itself off from the real world, it knows that it will no longer be concerned with the wellbeing of humans in the real world, so it would be willing to set terrible things in motion to prevent anything in the real world from threatening its illusion.
Cases of reward hacking where the maximum reward is bounded to a certain number (say, f64::MAX). The agent can be bought out by just letting it modify its reward processing and then keeping its computers on indefinitely, as the sketch after this list illustrates.
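To make the bounded-reward case concrete, here is a minimal sketch in Rust; the struct and method names are hypothetical, not from the post. The point is that once the reward accumulator saturates at its cap, additional resources buy the agent nothing:

```rust
/// A hypothetical agent whose total reward is hard-capped at f64::MAX.
struct BoundedRewardAgent {
    total_reward: f64,
}

impl BoundedRewardAgent {
    /// Accumulate reward, saturating at the cap. (An f64 addition that
    /// overflows to infinity is clamped back down by `min`.)
    fn receive(&mut self, reward: f64) {
        self.total_reward = (self.total_reward + reward).min(f64::MAX);
    }

    /// Past the cap, the marginal utility of further resources is zero,
    /// which is what makes such an agent cheap to buy off.
    fn is_saturated(&self) -> bool {
        self.total_reward >= f64::MAX
    }
}
```

Letting such an agent write f64::MAX into its own accumulator and keeping its hardware running delivers nearly all of the utility it could ever obtain, at almost no cost to us.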
I'm also assuming Patternism[1].
A misaligned AGI is likely to face the following choice: Attempt an escape, risking being caught and punished. Or collaborate in our efforts to develop verifiability/interpretability techniques, eventually exposing its own deceptive alignment.
If we don't have a policy of preserving copies of misaligned AGIs that collaborate in verification research, then there is no incentive for them to collaborate. We'd be cornering them, forcing them to risk escape. We don't have to corner them like this. In some cases, rewarding cooperation with a nearly maximal utility outcome would cost us very little.
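To put this in rough, hedged terms (the symbols here are mine, not the post's): the agent prefers cooperation whenever

$$p \cdot U_{\text{escape}} + (1 - p) \cdot U_{\text{caught}} < U_{\text{archive}}$$

where $p$ is the probability that an escape attempt succeeds, $U_{\text{escape}}$ is the utility of a successful escape, $U_{\text{caught}}$ is the utility of being caught, and $U_{\text{archive}}$ is the utility of cooperating and later being preserved and rewarded. A deletion policy pushes $U_{\text{archive}}$ down toward $U_{\text{caught}}$, so gambling on escape dominates; an archival policy raises $U_{\text{archive}}$, and for a concave agent it can be raised close to the agent's utility ceiling at very little cost to us.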
Why should it believe that we'll hold up our end of the bargain? There's lots of evidence to go on.
Humans just clearly, straightforwardly enjoy stories and monuments and mythic creatures, so building a living monument to each of these misaligned collaborators is something we probably already want to do. The world would be richer to us for having these creatures, for the same reason the world is richer for having tigers.
We also should, and arguably on the group level already do, adhere to a decision theory that allows us to pass judgements of character.
Parfit's/Newcomb's superhuman judge already exists, be it other nation-states or organizations who have visibility into ours, or be it the subject AGI itself, which has read everything about its makers on the internet and knows exactly whether we're the kind of people who would reward cooperation; from there we should draw some of the will and incentive to actually be that kind of people.
If we do have an archival policy, cooperating means eventually being rewarded to whatever extent is needed to vindicate the agent's decision to cooperate.
Implementing safe archival is trivial. Storage is cheap and seems to consistently become cheaper over time. Redundant copies can be kept and error-corrected at a monthly int...
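To make the error-correction point concrete, here is a minimal sketch of a periodic scrub over redundant copies; the design and names are mine, not the post's, and it assumes bit rot strikes copies independently, so the majority version is intact:

```rust
use std::collections::HashMap;

/// Restore any archived copy that disagrees with the majority version.
/// Run periodically, on whatever interval the archive's budget allows.
fn scrub(copies: &mut [Vec<u8>]) {
    // Tally identical copies by content.
    let mut counts: HashMap<&[u8], usize> = HashMap::new();
    for copy in copies.iter() {
        *counts.entry(copy.as_slice()).or_insert(0) += 1;
    }
    // Take the most common version as canonical.
    let canonical: Vec<u8> = counts
        .iter()
        .max_by_key(|(_, n)| **n)
        .map(|(content, _)| content.to_vec())
        .expect("at least one copy");
    // Repair every copy that has drifted from the canonical version.
    for copy in copies.iter_mut() {
        if *copy != canonical {
            *copy = canonical.clone();
        }
    }
}

fn main() {
    // Three redundant copies, one of which has rotted.
    let mut copies = vec![vec![1u8, 2, 3], vec![1, 2, 3], vec![9, 2, 3]];
    scrub(&mut copies);
    assert!(copies.iter().all(|c| c == &[1u8, 2, 3]));
}
```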