Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Gradations of Inner Alignment Obstacles, published by Abram Demski on the AI Alignment Forum.
The existing definitions of deception, inner optimizer, and some other terms tend to strike me as "stronger than necessary", depending on the context. If the weaker phenomena these terms could cover are similarly problematic, then we need correspondingly stronger methods to prevent them! I illustrate this and make some related (probably contentious) claims.
Summary of contentious claims to follow:
The most useful definition of "mesa-optimizer" doesn't require mesa-optimizers to perform explicit search, contrary to the current standard definition.
Success at aligning narrowly superhuman models might be bad news.
Some versions of the lottery ticket hypothesis seem to imply that randomly initialized networks already contain deceptive agents.
It's possible I've shoved too many things into one post. Sorry.
Inner Optimization
The standard definition of "inner optimizer" refers to something which carries out explicit search, in service of some objective. It's not clear to me whether/when we should focus that narrowly. Here are some other definitions of "inner optimizer" which I sometimes think about.
Mesa-Control
I've previously written about the idea of distinguishing mesa-search vs mesa-control:
Mesa-searchers implement an internal optimization algorithm, such as a planning algorithm, to help them achieve an objective -- this is the definition of "mesa-optimizer"/"inner optimizer" I think of as standard.
Mesa-controller refers to any effective strategy, including mesa-searchers but also "dumber" strategies which nonetheless effectively steer toward an objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions.
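A minimal sketch of the contrast (a hypothetical toy of my own; the dynamics and names like mesa_searcher are made up): both policies below steer a one-dimensional state toward the same target, but the first does so by explicit search over action sequences, while the second is a thermostat-like feedback rule with no internal search.

```python
from itertools import product

TARGET = 20.0
ACTIONS = [-1.0, 0.0, 1.0]

def mesa_searcher(state, horizon=4):
    """Explicit search: enumerate action sequences and return the first
    action of whichever sequence tracks the target best."""
    def cost(seq):
        s, total = state, 0.0
        for a in seq:
            s += a
            total += abs(s - TARGET)  # penalize distance at every step
        return total
    return min(product(ACTIONS, repeat=horizon), key=cost)[0]

def mesa_controller(state):
    """Thermostat-like rule: no internal search, just memorized feedback
    behavior that nonetheless steers toward the same target."""
    return 1.0 if state < TARGET else (-1.0 if state > TARGET else 0.0)

for policy in (mesa_searcher, mesa_controller):
    s = 15.0
    for _ in range(10):
        s += policy(s)
    print(policy.__name__, s)  # both end at 20.0: identical consequences
```

Judged by consequences alone, the two policies are indistinguishable; only their internal algorithms differ.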
Richard Ngo points out that this definition is rather all-encompassing, since it includes any highly competent policy. Adam Shimi suggests that we think of inner optimizers as goal-directed.
Considering these comments, I want to revise my definition of mesa-controller to require that it not be totally myopic, in some sense. A highly competent Q&A policy, if totally myopic, is not systematically "steering the world" in a particular direction, even if misaligned.
However, I am not sure how I want to define "totally myopic" there. There may be several reasonable definitions.
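One candidate formalization, as a hedged sketch (my own gloss, not settled terminology): call a policy "totally myopic" if it optimizes only immediate reward, i.e. its discount factor gamma is 0, so it has no incentive to steer future world-states. The toy numbers below are made up.

```python
def action_value(immediate_reward, future_return, gamma):
    # Standard discounted value: gamma = 0 means only the present matters.
    return immediate_reward + gamma * future_return

# Action A is good now but bad later; action B is the reverse.
candidates = {
    "A": dict(immediate_reward=1.0, future_return=-10.0),
    "B": dict(immediate_reward=0.0, future_return=10.0),
}

for gamma in (0.0, 0.9):
    pick = max(candidates, key=lambda a: action_value(**candidates[a], gamma=gamma))
    print(f"gamma={gamma}: picks {pick}")
# gamma=0.0 picks A (totally myopic); gamma=0.9 picks B (steers the future).
```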
I think mesa-control is thought of as a less concerning problem than mesa-search, primarily because: how would you even get severely misaligned mesa-controllers? For example, why would a neural network memorize highly effective strategies for pursuing an objective which it hasn't been trained on?
However, I would make the following points:
If a mesa-searcher and a mesa-controller are equally effective, they're equally concerning. It doesn't matter what their internal algorithm is, if the consequences are the same.
The point of inner alignment is to protect against those bad consequences. If mesa-controllers which don't search are truly less concerning, this just means it's an easier case to guard against. That's not an argument against including them in the definition of the inner alignment problem.
Some of the reasons we expect mesa-search also apply to mesa-control more broadly.
"Search" is an incredibly ambiguous concept.
There's a continuum between searchers and pure memorized strategies (see the sketch after this list):
Explicit brute-force search over a large space of possible strategies.
Heuristic search strategies, which combine brute force with faster, smarter steps.
Smart strategies like binary search or Newton's method, which efficiently solve problems by taking advantage of their structure, but still involve iteration over possibilities.
Highly knowledge-based strategies, such as calculus, which find solutions "directly" with no iteration -- but which still involve meaningful computation.
Mildly-computational strategies...
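To make the continuum concrete, here is a sketch (a hypothetical example of my own, not from the post) solving one problem -- the positive root of x^2 = 2 -- at three points along it: exhaustive search, structure-exploiting iteration, and a "direct" knowledge-based answer.

```python
import math

def brute_force(lo=0.0, hi=2.0, step=1e-4):
    """Explicit search: try candidates exhaustively, keep the best one."""
    best, best_err = lo, float("inf")
    x = lo
    while x <= hi:
        err = abs(x * x - 2.0)
        if err < best_err:
            best, best_err = x, err
        x += step
    return best

def newton(x=1.0, iters=6):
    """Smarter iteration: exploits the problem's structure (its derivative)
    but still steps through successive guesses."""
    for _ in range(iters):
        x -= (x * x - 2.0) / (2.0 * x)
    return x

def knowledge_based():
    """'Direct' solution with no iteration over candidates at all."""
    return math.sqrt(2.0)

print(brute_force(), newton(), knowledge_based())  # all ~1.41421
```

All three land on the same answer; what varies is how much of the work is search over possibilities versus built-in knowledge.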