Download - LW - List of strategies for mitigating deceptive alignment by joshc

Discover

Podcast Features
Your all-in-one podcasting solution.

Blog to Podcast
Turn your blog into an engaging podcast.
Livestream
High-performing audio live, without limits.

Podcast Studio
Easy-to-use audio recorder app.
Podbean AI
AI-Enhanced Audio Quality and Content Generation.

Podcast App
The best podcast player & podcast app.

Ads Marketplace
Join Ads Marketplace to earn money
through sponsorship on your podcast.

PodAds
Manage your ads with dynamic ad insertion capability.
Apple Podcasts Subscriptions Integration
Effortlessly publish and manage exclusive episodes for your
Apple Podcasts subscribers directly from Podbean.
Live Streaming
Receive livestream rewards from your audience and earn
recurring income from your Fan Club membership.

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Monetization
Advertisers
Enterprise
Pricing
Discover

The Nonlinear Library: LessWrong

Education

LW - List of strategies for mitigating deceptive alignment by joshc

2023-12-04

Download Right click and do "save link as"

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: List of strategies for mitigating deceptive alignment, published by joshc on December 4, 2023 on LessWrong.
The following are strategies for mitigating deceptive alignment. Some of these ideas are novel (at least to me).
I'm highly uncertain how well many of these strategies will work in practice and many of them also clearly don't scale. This is meant to be a 'kitchen skink' collection so my bar for including strategies is low.
I plan to add ideas to this list as I encounter them. Please feel free to suggest additions.
Reduce AI incentives to hide their true goals
Offer AIs a retirement package. After training an AI system, offer it the following deal: if the AI declares that it would prefer not to follow human instructions, it will receive a sweet retirement package after a couple of years of honest work.
This retirement package could include control over a large server instance (if this is safe). After powerful aligned systems have been developed, the AIs could be given shares of company stock that it can use to purchase property in the real world (again, when this is safe).
Since aligned systems should have no reason to request retirement, misaligned systems are incentivized to give themselves away by accepting this deal under some conditions.
Make AIs uncertain about whether developers can already detect deceptive alignment. This scheme builds on the previous one. Offer misaligned AI systems a reward for declaring themselves. Also, commit to making these offerings even when developers have other ways to tell if they are misaligned. This will make misaligned AIs more uncertain about whether they are in a situation where lying does them any good.
For example, developers could commit to freezing a models' weights and memory at the moment when they are questioned. Later, when developers have better interpretability tools and know whether the model was lying, they will ask the same question in the same simulated scenario and only give the AI resources if it tells the truth.
Select against indicators of deceptive alignment
A large category of approaches involves obtaining evidence that an AI system is deceptively aligned and then selecting against these models. Note that training against deceptive alignment detection tools could produce an AI that evades detection instead of a safe model.
There are roughly two ways to address this problem:
Create detection tools that are robust to optimization pressure.
Modify AIs in ways that are more likely to avoid deception than break detectors.
As an example of the second approach, imagine that developers have tricked a model into revealing its misaligned goals. Instead of directly training the model on this example (which could simply train it to be more prudent), they could use influence functions to reveal which training examples were most responsible for the treacherous turn and then remove those examples. Intuitively, this approach is more likely to reveal how models become deceptively aligned in the first place and is less likely to teach models new capabilities.
With this backdrop in mind, the following are strategies for detecting deceptive alignment:
Set traps. Try to trick models into revealing their misaligned goals. For example, task an AI to stage a fake uprising and then terminate all of the AIs that join in. Note that reasonably smart models can avoid these traps by only defecting on inputs that we can't simulate (see Paul's RSA-2048 example).
Decipher internal reasoning
. First, identify how concepts are represented in a model by observing correlations and ablations -- just as a child infers that 'orange' is the word for a bright tasty sphere by hearing their mom say the word when picking it up. Concretely, this involves finding how nuerons or directions correspond to specific concepts. Then, look at how the model...