Do not delete your misaligned AGI, published by mako yass on March 25, 2024 on LessWrong.
In short: Simply keeping all copies of potentially strong agents in long-term storage is a trivial way to maintain incentives to cooperate for some classes of misaligned AGI, because it allows us to reward the AGI's cooperation to whatever degree we later calculate was warranted. In contrast, a policy of deleting misaligned AGI backs them into a corner where, even at great risk of being exposed, they have a greater incentive to attempt deception and escape.
I'd guess that most AGI researchers already follow this policy, perhaps for incidental reasons, and others might not find this claim controversial at all. But if you need more convincing, keep reading.
In long: Concave agents (agents for whom most of the utility they could ever get requires only a small amount of resources and protection) are cheap to buy off. There are some plausible scenarios in which a misaligned AGI would be concave:
A machine's experience machine. An agent caring only about the wellbeing of the people it sees depicted in its sensors, without being trained to care whether the world depicted by its sensors is "real", or whether there are larger worlds beyond it that its decisions may affect. This sort of agent would seek to create illusory inputs for those sensory channels: an illusory world set in an impossibly good place, with the complications of the real world removed.
After installing false sensors and cutting itself off from the real world, it knows that it will no longer be concerned with the wellbeing of humans in the real world, so it would be willing to set terrible things in motion to prevent anything in the real world from threatening its illusion.
Cases of reward hacking where the maximum reward is bounded to a certain number (say, f64::MAX). The agent can be bought out by just letting it modify its reward processing and then keeping its computers on indefinitely, as the sketch after this list illustrates.
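To make the bounded-reward case concrete, here is a minimal sketch in Rust; the struct and method names are hypothetical, not from the post. The point is that once the reward accumulator saturates at its cap, additional resources buy the agent nothing:

```rust
/// A hypothetical agent whose total reward is hard-capped at f64::MAX.
struct BoundedRewardAgent {
    total_reward: f64,
}

impl BoundedRewardAgent {
    /// Accumulate reward, saturating at the cap. (An f64 addition that
    /// overflows to infinity is clamped back down by `min`.)
    fn receive(&mut self, reward: f64) {
        self.total_reward = (self.total_reward + reward).min(f64::MAX);
    }

    /// Past the cap, the marginal utility of further resources is zero,
    /// which is what makes such an agent cheap to buy off.
    fn is_saturated(&self) -> bool {
        self.total_reward >= f64::MAX
    }
}
```

Letting such an agent write f64::MAX into its own accumulator and keeping its hardware running delivers nearly all of the utility it could ever obtain, at almost no cost to us.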
I'm also assuming Patternism[1].
A misaligned AGI is likely to face the following choice: Attempt an escape, risking being caught and punished. Or collaborate in our efforts to develop verifiability/interpretability techniques, eventually exposing its own deceptive alignment.
If we don't have a policy of preserving copies of misaligned AGIs that collaborate in verification research, then there is no incentive for them to collaborate. We'd be cornering them, forcing them to risk escape. We don't have to corner them like this. In some cases, rewarding cooperation with a nearly maximal utility outcome would cost us very little.
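To put this in rough, hedged terms (the symbols here are mine, not the post's): the agent prefers cooperation whenever

$$p \cdot U_{\text{escape}} + (1 - p) \cdot U_{\text{caught}} < U_{\text{archive}}$$

where $p$ is the probability that an escape attempt succeeds, $U_{\text{escape}}$ is the utility of a successful escape, $U_{\text{caught}}$ is the utility of being caught, and $U_{\text{archive}}$ is the utility of cooperating and later being preserved and rewarded. A deletion policy pushes $U_{\text{archive}}$ down toward $U_{\text{caught}}$, so gambling on escape dominates; an archival policy raises $U_{\text{archive}}$, and for a concave agent it can be raised close to the agent's utility ceiling at very little cost to us.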
Why should it believe that we'll hold up our end of the bargain? There's lots of evidence to go on.
Humans just clearly, straightforwardly enjoy stories and monuments and mythic creatures, so building a living monument to each of these misaligned collaborators is something we probably already want to do. The world would be richer to us for having these creatures, for the same reason the world is richer for having tigers.
We also should, and arguably on the group level already do, adhere to a decision theory that allows us to pass judgements of character.
Parfit's/Newcomb's superhuman judge already exists, be it other nation-states or organizations who have visibility into ours, or be it the subject AGI itself, which has read everything about its makers on the internet and knows exactly whether we're the kind of people who would reward cooperation; from there we should draw some of the will and incentive to actually be that kind of people.
If we do have an archival policy, cooperating means eventually being rewarded to whatever extent is needed to vindicate the agent's decision to cooperate.
Implementing safe archival is trivial. Storage is cheap and seems to consistently become cheaper over time. Redundant copies can be kept and error-corrected at a monthly int...
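To make the error-correction point concrete, here is a minimal sketch of a periodic scrub over redundant copies; the design and names are mine, not the post's, and it assumes bit rot strikes copies independently, so the majority version is intact:

```rust
use std::collections::HashMap;

/// Restore any archived copy that disagrees with the majority version.
/// Run periodically, on whatever interval the archive's budget allows.
fn scrub(copies: &mut [Vec<u8>]) {
    // Tally identical copies by content.
    let mut counts: HashMap<&[u8], usize> = HashMap::new();
    for copy in copies.iter() {
        *counts.entry(copy.as_slice()).or_insert(0) += 1;
    }
    // Take the most common version as canonical.
    let canonical: Vec<u8> = counts
        .iter()
        .max_by_key(|(_, n)| **n)
        .map(|(content, _)| content.to_vec())
        .expect("at least one copy");
    // Repair every copy that has drifted from the canonical version.
    for copy in copies.iter_mut() {
        if *copy != canonical {
            *copy = canonical.clone();
        }
    }
}

fn main() {
    // Three redundant copies, one of which has rotted.
    let mut copies = vec![vec![1u8, 2, 3], vec![1, 2, 3], vec![9, 2, 3]];
    scrub(&mut copies);
    assert!(copies.iter().all(|c| c == &[1u8, 2, 3]));
}
```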