Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Gradations of Inner Alignment Obstacles, published by Abram Demski on the AI Alignment Forum.
The existing definitions of deception, inner optimizer, and some other terms tend to strike me as "stronger than necessary", depending on the context. If the weaker phenomena these terms could cover are similarly problematic, then we need correspondingly stronger methods to prevent them! I illustrate this and make some related (probably contentious) claims.
Summary of contentious claims to follow:
The most useful definition of "mesa-optimizer" doesn't require mesa-optimizers to perform explicit search, contrary to the current standard definition.
Success at aligning narrowly superhuman models might be bad news.
Some versions of the lottery ticket hypothesis seem to imply that randomly initialized networks already contain deceptive agents.
It's possible I've shoved too many things into one post. Sorry.
Inner Optimization
The standard definition of "inner optimizer" refers to something which carries out explicit search, in service of some objective. It's not clear to me whether/when we should focus that narrowly. Here are some other definitions of "inner optimizer" which I sometimes think about.
Mesa-Control
I've previously written about the idea of distinguishing mesa-search vs mesa-control:
Mesa-searchers implement an internal optimization algorithm, such as a planning algorithm, to help them achieve an objective -- this is the definition of "mesa-optimizer"/"inner optimizer" I think of as standard.
Mesa-controller refers to any effective strategy, including mesa-searchers but also "dumber" strategies which nonetheless effectively steer toward an objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions.
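A minimal sketch of the contrast (a hypothetical toy of my own; the dynamics and names like mesa_searcher are made up): both policies below steer a one-dimensional state toward the same target, but the first does so by explicit search over action sequences, while the second is a thermostat-like feedback rule with no internal search.

```python
from itertools import product

TARGET = 20.0
ACTIONS = [-1.0, 0.0, 1.0]

def mesa_searcher(state, horizon=4):
    """Explicit search: enumerate action sequences and return the first
    action of whichever sequence tracks the target best."""
    def cost(seq):
        s, total = state, 0.0
        for a in seq:
            s += a
            total += abs(s - TARGET)  # penalize distance at every step
        return total
    return min(product(ACTIONS, repeat=horizon), key=cost)[0]

def mesa_controller(state):
    """Thermostat-like rule: no internal search, just memorized feedback
    behavior that nonetheless steers toward the same target."""
    return 1.0 if state < TARGET else (-1.0 if state > TARGET else 0.0)

for policy in (mesa_searcher, mesa_controller):
    s = 15.0
    for _ in range(10):
        s += policy(s)
    print(policy.__name__, s)  # both end at 20.0: identical consequences
```

Judged by consequences alone, the two policies are indistinguishable; only their internal algorithms differ.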
Richard Ngo points out that this definition is rather all-encompassing, since it includes any highly competent policy. Adam Shimi suggests that we think of inner optimizers as goal-directed.
Considering these comments, I want to revise my definition of mesa-controller to require that it not be totally myopic, in some sense. A highly competent Q&A policy, if totally myopic, is not systematically "steering the world" in a particular direction, even if misaligned.
However, I am not sure how I want to define "totally myopic" there. There may be several reasonable definitions.
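One candidate formalization, as a hedged sketch (my own gloss, not settled terminology): call a policy "totally myopic" if it optimizes only immediate reward, i.e. its discount factor gamma is 0, so it has no incentive to steer future world-states. The toy numbers below are made up.

```python
def action_value(immediate_reward, future_return, gamma):
    # Standard discounted value: gamma = 0 means only the present matters.
    return immediate_reward + gamma * future_return

# Action A is good now but bad later; action B is the reverse.
candidates = {
    "A": dict(immediate_reward=1.0, future_return=-10.0),
    "B": dict(immediate_reward=0.0, future_return=10.0),
}

for gamma in (0.0, 0.9):
    pick = max(candidates, key=lambda a: action_value(**candidates[a], gamma=gamma))
    print(f"gamma={gamma}: picks {pick}")
# gamma=0.0 picks A (totally myopic); gamma=0.9 picks B (steers the future).
```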
I think mesa-control is thought of as a less concerning problem than mesa-search, primarily because: how would you even get severely misaligned mesa-controllers? For example, why would a neural network memorize highly effective strategies for pursuing an objective which it hasn't been trained on?
However, I would make the following points:
If a mesa-searcher and a mesa-controller are equally effective, they're equally concerning. It doesn't matter what their internal algorithm is, if the consequences are the same.
The point of inner alignment is to protect against those bad consequences. If mesa-controllers which don't search are truly less concerning, this just means it's an easier case to guard against. That's not an argument against including them in the definition of the inner alignment problem.
Some of the reasons we expect mesa-search also apply to mesa-control more broadly.
"Search" is an incredibly ambiguous concept.
There's a continuum between searchers and pure memorized strategies (see the sketch after this list):
Explicit brute-force search over a large space of possible strategies.
Heuristic search strategies, which combine brute force with faster, smarter steps.
Smart strategies like binary search or Newton's method, which efficiently solve problems by taking advantage of their structure, but still involve iteration over possibilities.
Highly knowledge-based strategies, such as calculus, which find solutions "directly" with no iteration -- but which still involve meaningful computation.
Mildly-computational strategies...
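To make the continuum concrete, here is a sketch (a hypothetical example of my own, not from the post) solving one problem -- the positive root of x^2 = 2 -- at three points along it: exhaustive search, structure-exploiting iteration, and a "direct" knowledge-based answer.

```python
import math

def brute_force(lo=0.0, hi=2.0, step=1e-4):
    """Explicit search: try candidates exhaustively, keep the best one."""
    best, best_err = lo, float("inf")
    x = lo
    while x <= hi:
        err = abs(x * x - 2.0)
        if err < best_err:
            best, best_err = x, err
        x += step
    return best

def newton(x=1.0, iters=6):
    """Smarter iteration: exploits the problem's structure (its derivative)
    but still steps through successive guesses."""
    for _ in range(iters):
        x -= (x * x - 2.0) / (2.0 * x)
    return x

def knowledge_based():
    """'Direct' solution with no iteration over candidates at all."""
    return math.sqrt(2.0)

print(brute_force(), newton(), knowledge_based())  # all ~1.41421
```

All three land on the same answer; what varies is how much of the work is search over possibilities versus built-in knowledge.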