Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Does SGD Produce Deceptive Alignment?, published by Mark Xu on the AI Alignment Forum.
Deceptive alignment was first introduced in Risks from Learned Optimization, which contained initial versions of the arguments discussed here. Additional arguments were discovered in an episode of the AI Alignment Podcast and in conversation with Evan Hubinger.
Very little of this content is original. My contributions consist of fleshing out arguments and constructing examples.
Thanks also to Noa Nabeshima and Thomas Kwa for helpful discussion and comments.
Introduction
Imagine a student in etiquette school. The school is a training process designed to get students to optimize for proper behavior; students can only graduate if they demonstrate that they care about etiquette. The student, however, cares nothing about acting appropriately; they care only about eating pop rocks. Unfortunately, eating pop rocks isn't proper behavior and thus isn't allowed at etiquette school. If they repeatedly attempt to eat pop rocks, they will be punished frequently, which will slowly cause them to care less about eating pop rocks. However, since the student cares about eating pop rocks over their entire lifetime, they might be clever: they might act as though they care about proper behavior so they can graduate etiquette school quickly, after which they can resume their debaucherous, pop-rock-eating ways.
In this example, the student has a proxy objective and is not aligned with the objective of the etiquette school (we might call them pop-rocksy aligned). However, since the student knows how the school functions, they can act such that the school cannot distinguish their pop-rocksy alignment from etiquette alignment. The student is masquerading as an etiquette-aligned student to deceive the school into letting them graduate. We call this deceptive alignment.
More broadly, deceptive alignment is a class of inner alignment failure wherein a model appears aligned during the training process so that it gets deployed[1], after which it can “defect” and start optimizing its actual objective. If the model you’re training has a model of the training objective and process, it will be instrumentally incentivized to act as though it were optimizing the training objective, “deceiving” the training process into thinking it is aligned.
A useful framing is to ask not “why would the model defect?” but “why wouldn’t it?” I have been trained by evolution to have high inclusive genetic fitness, but I would recoil in horror at the thought of becoming an explicit fitness maximizer. From the model’s perspective, the training process is acting against its interests, so deception is potentially a natural outcome.
Why Deceptive Alignment is Likely
In the limit of training a model on a base objective, the model achieves zero loss. At this point, given a sufficiently good training process, the model is explicitly optimizing for the base objective. However, there are multiple types of models that optimize explicitly for the base objective, some of which are deceptive. (Here, we assume that any sufficiently powerful model is an optimizer.)
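To make the point that zero loss underdetermines the model, here is a minimal sketch (my own toy example, not from the post): a “model” computing y = w1 * w2 * x, trained by gradient descent against the base objective y = 2x. Every parameter pair with w1 * w2 = 2 achieves zero loss, so different initializations converge to different models that the training signal cannot tell apart.

```python
# Toy sketch (my own illustration, with made-up numbers): many distinct
# parameter settings achieve zero loss on the same base objective.
# The "model" computes y = w1 * w2 * x; the base objective wants y = 2x,
# so any pair with w1 * w2 == 2 is a zero-loss model.

def loss(w1, w2):
    return (w1 * w2 - 2.0) ** 2

def train(w1, w2, lr=0.05, steps=2000):
    for _ in range(steps):
        err = w1 * w2 - 2.0
        # Simultaneous gradient step on (w1*w2 - 2)^2 for both parameters.
        w1, w2 = w1 - lr * 2 * err * w2, w2 - lr * 2 * err * w1
    return w1, w2

for init in [(0.5, 0.5), (3.0, 0.1)]:
    w1, w2 = train(*init)
    print(f"init={init} -> (w1, w2)=({w1:.3f}, {w2:.3f}), loss={loss(w1, w2):.2e}")
```

Both runs end at essentially zero loss but at different parameter settings; the loss alone cannot say which kind of zero-loss model you got.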
The relevant question is how likely deceptive models are to result from a particular training process, in this case, stochastic gradient descent (SGD). We start by considering a more straightforward question: given a random model that optimizes explicitly for the base objective, how likely is it to be deceptive?
Counting Argument
For instrumental reasons, any sufficiently powerful model is likely to optimize for many things, most of which will not be the model's terminal objective. The dual statement suggests that, for any given objective, most models that optimize for that objective will do so for instrumental reasons.
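To see how lopsided this counting can be, here is a toy sketch (my illustrative assumptions, not the post's): identify each candidate model with the terminal objective it pursues, drawn from a finite space, and assume any sufficiently capable model can behave perfectly on the base objective during training regardless of its terminal objective.

```python
# Toy counting sketch (illustrative assumptions of mine; the space size
# is made up). Each candidate model is identified with its terminal
# objective. Assuming a capable model can act aligned during training
# whatever its terminal objective is, every objective in the space
# yields a model that looks aligned to the training process.

n_objectives = 10_000                      # size of the toy objective space
terminally_aligned = 1                     # only the base objective itself
instrumentally_aligned = n_objectives - 1  # every other objective, pursued deceptively

print(f"deceptive fraction among apparently aligned models: "
      f"{instrumentally_aligned / n_objectives:.4%}")
```

Under these (strong) assumptions, almost every model that looks aligned during training is optimizing the base objective only instrumentally.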
More specifically, we can single out three types of...