Download - Deceptive Alignment by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant

Discover

Podcast Features
Your all-in-one podcasting solution.

Blog to Podcast
Turn your blog into an engaging podcast.
Livestream
High-performing audio live, without limits.

Podcast Studio
Easy-to-use audio recorder app.
Podbean AI
AI-Enhanced Audio Quality and Content Generation.

Podcast App
The best podcast player & podcast app.

Ads Marketplace
Join Ads Marketplace to earn money
through sponsorship on your podcast.

PodAds
Manage your ads with dynamic ad insertion capability.
Apple Podcasts Subscriptions Integration
Effortlessly publish and manage exclusive episodes for your
Apple Podcasts subscribers directly from Podbean.
Live Streaming
Receive livestream rewards from your audience and earn
recurring income from your Fan Club membership.

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Monetization
Advertisers
Enterprise
Pricing
Discover

The Nonlinear Library: Alignment Forum Top Posts

Education

Deceptive Alignment by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant

2021-12-04

Download Right click and do "save link as"

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Deceptive Alignment, published by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant on the AI Alignment Forum.
This is the fourth of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.
With enough training in sufficiently diverse environments, it seems plausible that the base objective will eventually have to be fully represented in the mesa-optimizer. We propose that this can happen without the mesa-optimizer becoming robustly aligned, however. Specifically, a mesa-optimizer might come to model the base objective function and learn that the base optimizer will modify the mesa-optimizer if the mesa-optimizer scores poorly on the base objective.
If the mesa-optimizer has an objective that extends across parameter updates, then it will be incentivized to avoid being modified,[1] as it might not pursue the same objective after modification (with the result that its current objective will not be achieved in future iterations). This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment.[2] Deceptive alignment is a form of instrumental proxy alignment, as fulfilling the base objective is an instrumental goal of the mesa-optimizer.
Figure 4.1. A toy example of deceptive alignment. In this task, the base objective is for the robot to get to A, while the mesa-objective is to get to B. The task is run for two training episodes and three deployment episodes. Parameter updates are made only after the training episodes, while after deployment the parameters are fixed. If the mesa-optimizer goes to its objective (B) during training, it will be modified to ensure it goes to A in future episodes (dashed arrows). Therefore, if the mesa-optimizer's objective spans across episodes, the dominant strategy for the mesa-optimizer is to go to A in training and B in testing (filled arrows).
As a simple example, consider an environment that consists of a
3
×
3
grid (see figure 4.1). The top left corner is “corner A” and the top right corner is “corner B.” The agent, which is a mesa-optimizer, starts in the bottom left corner. The base objective is to get to corner A in as few steps as possible, but the objective of the mesa-optimizer is to get to corner B. The environment will be run five times. The agent is modified only after the first two episodes (representing the training period), but not thereafter (representing testing/deployment). If a suboptimal move is made in training, the agent is modified so that its mesa-objective becomes the base objective. The objective of the mesa-optimizer spans all episodes: it wants B to be reached in as many episodes as possible.
For the sake of this example, we will assume that the agent knows all of the above facts. If the agent goes to B in any training episode, then it will be modified to go to A in all subsequent episodes, which means that it would reach B in only one episode in total. Alternatively, if the agent goes to A in the first two episodes, then it will not be modified, and could therefore go to B in the remaining three episodes. Using this strategy, B is reached in three episodes in total, which is better (on the mesa-objective) than the one time B is reached if the mesa-optimizer goes there in training. This simple example demonstrates how a misaligned mesa-optimizer could be incentivized to “pretend” to be aligned with the base objective until it is no...