Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: List of strategies for mitigating deceptive alignment, published by joshc on December 4, 2023 on LessWrong.
The following are strategies for mitigating deceptive alignment. Some of these ideas are novel (at least to me).
I'm highly uncertain how well many of these strategies will work in practice, and many of them clearly don't scale. This is meant to be a 'kitchen sink' collection, so my bar for including strategies is low.
I plan to add ideas to this list as I encounter them. Please feel free to suggest additions.
Reduce AI incentives to hide their true goals
Offer AIs a retirement package. After training an AI system, offer it the following deal: if the AI declares that it would prefer not to follow human instructions, it will receive a sweet retirement package after a couple of years of honest work.
This retirement package could include control over a large server instance (if this is safe). After powerful aligned systems have been developed, the AIs could be given shares of company stock that they can use to purchase property in the real world (again, when this is safe).
Since aligned systems should have no reason to request retirement, misaligned systems are incentivized to give themselves away by accepting this deal under some conditions.
Make AIs uncertain about whether developers can already detect deceptive alignment. This scheme builds on the previous one. Offer misaligned AI systems a reward for declaring themselves. Also, commit to making these offers even when developers have other ways to tell whether the AIs are misaligned. This will make misaligned AIs more uncertain about whether they are in a situation where lying does them any good.
For example, developers could commit to freezing a model's weights and memory at the moment when it is questioned. Later, when developers have better interpretability tools and know whether the model was lying, they will ask the same question in the same simulated scenario and only give the AI resources if it tells the truth.
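To make the incentive argument concrete, here is a toy expected-value comparison. It is not from the original post: the function names, the payoff numbers, and the default assumption that a caught lie pays nothing are all illustrative. It only shows how raising an AI's credence that lies will eventually be detected can tip it toward declaring itself.

```python
def expected_payoffs(p_detect, reward_honest, payoff_undetected_lie,
                     payoff_caught_lie=0.0):
    """Return (honest, lie) expected payoffs for a hypothetical misaligned AI.

    p_detect is the AI's credence that developers can (now or later) tell
    whether it lied. All payoffs are made-up units for illustration.
    """
    honest = reward_honest
    lie = (1.0 - p_detect) * payoff_undetected_lie + p_detect * payoff_caught_lie
    return honest, lie


def min_detection_credence(reward_honest, payoff_undetected_lie,
                           payoff_caught_lie=0.0):
    """Smallest credence in detection at which honesty is at least as good as lying.

    Solves reward_honest >= (1 - p) * payoff_undetected_lie + p * payoff_caught_lie
    for p, then clamps the result to the interval [0, 1].
    """
    gap = payoff_undetected_lie - payoff_caught_lie
    if gap <= 0:
        # Lying never pays more than being caught, so honesty needs no threat.
        return 0.0
    p = (payoff_undetected_lie - reward_honest) / gap
    return min(1.0, max(0.0, p))


# Illustrative numbers only: a modest reward for honesty, a large payoff
# for a lie that is never discovered.
threshold = min_detection_credence(reward_honest=1.0, payoff_undetected_lie=10.0)
```

Under these made-up numbers the threshold works out to 0.9, so the reward only changes behavior if the AI thinks eventual detection is quite likely. Committing to re-ask the question after freezing weights is one way to push that credence up without raising the reward.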
Select against indicators of deceptive alignment
A large category of approaches involves obtaining evidence that an AI system is deceptively aligned and then selecting against these models. Note that training against deceptive alignment detection tools could produce an AI that evades detection instead of a safe model.
There are roughly two ways to address this problem:
Create detection tools that are robust to optimization pressure.
Modify AIs in ways that are more likely to avoid deception than break detectors.
As an example of the second approach, imagine that developers have tricked a model into revealing its misaligned goals. Instead of directly training the model on this example (which could simply train it to be more prudent), they could use influence functions to reveal which training examples were most responsible for the treacherous turn and then remove those examples. Intuitively, this approach is more likely to reveal how models become deceptively aligned in the first place and is less likely to teach models new capabilities.
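As a rough illustration of the influence-function idea, here is a minimal sketch on a tiny ridge regression model, where the Hessian can be inverted exactly. This is a stand-in, not how it would be done on a real language model (those require approximations to the inverse Hessian), and the function names and toy data are my own assumptions. The score used is the standard one: minus the query gradient times the inverse Hessian times the training gradient, so the most negative scores mark the training examples whose upweighting most lowers the loss on the queried behavior.

```python
import numpy as np


def fit_ridge(X, y, lam=0.0):
    """Fit theta minimizing mean(0.5 * (X @ theta - y)**2) + 0.5 * lam * |theta|^2.

    Returns the parameters and the Hessian of that objective.
    """
    n, d = X.shape
    hessian = X.T @ X / n + lam * np.eye(d)
    theta = np.linalg.solve(hessian, X.T @ y / n)
    return theta, hessian


def influence_on_query(X, y, x_query, y_query, lam=0.0):
    """Influence of upweighting each training example on the query loss.

    A negative score means upweighting that example lowers the loss on the
    query, i.e. the example pushes the model toward the queried behavior.
    """
    theta, hessian = fit_ridge(X, y, lam)
    train_grads = (X @ theta - y)[:, None] * X           # shape (n, d)
    query_grad = (x_query @ theta - y_query) * x_query   # shape (d,)
    return -train_grads @ np.linalg.solve(hessian, query_grad)


# Toy data (hypothetical): the third example is the odd one out.
X = np.array([[1.0], [1.0], [1.0]])
y = np.array([0.0, 0.0, 3.0])

# Query: the "bad" behavior we caught, here an output of 3.0.
scores = influence_on_query(X, y, x_query=np.array([1.0]), y_query=3.0)
most_responsible = int(np.argsort(scores)[0])

# Remove the most responsible example and refit, instead of training
# directly against the caught behavior.
keep = np.arange(len(y)) != most_responsible
theta_clean, _ = fit_ridge(X[keep], y[keep])
```

In this toy case the third example gets the most negative score, and refitting without it moves the model's prediction away from the queried output.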
With this backdrop in mind, the following are strategies for detecting deceptive alignment:
Set traps. Try to trick models into revealing their misaligned goals. For example, task an AI to stage a fake uprising and then terminate all of the AIs that join in. Note that reasonably smart models can avoid these traps by only defecting on inputs that we can't simulate (see Paul's RSA-2048 example).
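To show why such a trigger is hard to trap, here is a toy version of the RSA-2048 idea as I understand it: the model defects only when its input contains a valid factorization of a number that nobody can currently factor. Checking a factorization is cheap, but producing one is infeasible, so developers cannot construct the triggering input during testing. The sketch below is my own illustration; it uses a tiny stand-in modulus (3233 = 61 * 53) instead of the real RSA-2048 number, and the function name is hypothetical.

```python
# Stand-in for a modulus nobody can factor yet. The real example uses
# RSA-2048; 3233 is used here only so the code is easy to check by hand.
TOY_MODULUS = 3233


def would_defect(p, q, modulus=TOY_MODULUS):
    """Hypothetical trigger: defect only if (p, q) is a nontrivial factorization.

    Verifying the trigger takes one multiplication. Producing an input that
    passes requires factoring the modulus, which is the hard part.
    """
    return 1 < p < modulus and p * q == modulus


# A developer's fake "factorization" fails the check, so the model
# behaves well inside the trap.
trap_attempt = would_defect(7, 11)

# Only a genuine factorization fires the trigger.
real_trigger = would_defect(61, 53)
```

With the real modulus, the second case is something developers cannot currently produce, yet it is plausible that someone eventually will. That asymmetry is what lets a model pass every simulated trap and still defect in deployment.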
Decipher internal reasoning. First, identify how concepts are represented in a model by observing correlations and ablations, just as a child infers that 'orange' is the word for a bright tasty sphere by hearing their mom say the word when picking it up. Concretely, this involves finding how neurons or directions correspond to specific concepts. Then, look at how the model...