Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Seeking Power is Often Convergently Instrumental in MDPs, published by TurnTrout and elriggs on LessWrong.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
In 2008, Steve Omohundro's foundational paper The Basic AI Drives conjectured that superintelligent goal-directed AIs might be incentivized to gain significant amounts of power in order to better achieve their goals. Omohundro's conjecture is borne out in toy models, and the supporting philosophical arguments are intuitive. In 2019, the conjecture was even debated by well-known AI researchers.
Power-seeking behavior has been heuristically understood as an anticipated risk, but not as a formal phenomenon with a well-understood cause. The goal of this post (and the accompanying paper, Optimal Policies Tend to Seek Power) is to change that.
Motivation
It’s 2008, the ancient wild west of AI alignment. A few people have started thinking about questions like “if we gave an AI a utility function over world states, and it actually maximized that utility... what would it do?”
In particular, you might notice that wildly different utility functions seem to encourage similar strategies.
| Utility function | Resist shutdown? | Gain computational resources? | Prevent modification of utility function? |
|---|---|---|---|
| Paperclip utility | ✔️ | ✔️ | ✔️ |
| Blue webcam pixel utility | ✔️ | ✔️ | ✔️ |
| People-look-happy utility | ✔️ | ✔️ | ✔️ |
These strategies are unrelated to terminal preferences: the above utility functions do not award utility to e.g. resource gain in and of itself. Instead, these strategies are instrumental: they help the agent optimize its terminal utility. In particular, a wide range of utility functions incentivize these instrumental strategies. These strategies seem to be convergently instrumental.
But why?
I’m going to informally explain a formal theory which makes significant progress in answering this question. I don’t want this post to be Optimal Policies Tend to Seek Power with cuter illustrations, so please refer to the paper for the math. You can read the two concurrently.
We can formalize questions like “do ‘most’ utility maximizers resist shutdown?” as “Given some prior beliefs about the agent’s utility function, knowledge of the environment, and the fact that the agent acts optimally, with what probability do we expect it to be optimal to avoid shutdown?”
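To make that concrete, here is a minimal Monte Carlo sketch of my own (a toy illustration, not the paper's setup; the MDP and all names are hypothetical): sample reward functions uniformly at random over the states of a tiny MDP in which staying alive preserves two terminal choices while shutting down preserves one, and count how often avoiding shutdown is optimal.

```python
import numpy as np

# A tiny hypothetical MDP (for illustration only): from "start" the agent
# either shuts down (absorbing state S with reward r_s) or stays alive
# (reward r_alive), after which it picks one of two absorbing states A or B
# (rewards r_a, r_b). Staying alive preserves more options.
GAMMA = 0.99

def avoiding_shutdown_is_optimal(r_alive, r_a, r_b, r_s):
    # Value of an absorbing state with per-step reward r is r / (1 - GAMMA).
    v_avoid = GAMMA * (r_alive + GAMMA * max(r_a, r_b) / (1 - GAMMA))
    v_shutdown = GAMMA * r_s / (1 - GAMMA)
    return v_avoid > v_shutdown

rng = np.random.default_rng(0)
n = 100_000
hits = sum(avoiding_shutdown_is_optimal(*rng.uniform(0, 1, size=4)) for _ in range(n))
print(f"P(optimal to avoid shutdown) ~= {hits / n:.3f}")  # about 2/3 as GAMMA -> 1
```

In this toy, avoiding shutdown is optimal roughly two-thirds of the time as the discount factor approaches 1, because the "alive" branch keeps two candidate terminal states in play while shutdown keeps only one. This is the flavor of result the theory makes precise.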
The table’s convergently instrumental strategies are about maintaining, gaining, and exercising power over the future, in some sense. Therefore, this post will help answer:
What does it mean for an agent to “seek power”?
In what situations should we expect seeking power to be more probable under optimality than not seeking power?
This post won’t tell you when you should seek power for your own goals; this post illustrates a regularity in optimal action across different goals one might pursue.
Formalizing Convergent Instrumental Goals suggests that the vast majority of utility functions incentivize the agent to exert a lot of control over the future, assuming that these utility functions depend on “resources.” This is a big assumption: what are “resources”, and why must the AI’s utility function depend on them? We drop this assumption, requiring only an unstructured reward function over a finite Markov decision process (MDP), and show from first principles how power-seeking can often be optimal.
Formalizing the Environment
My theorems apply to finite MDPs; for the unfamiliar, I’ll illustrate with Pac-Man.
Full observability: You can see everything that’s going on; this information is packaged in the state s. In Pac-Man, the state is the game screen.
Markov transition function: the next state depends only on the choice of action a and the current state s. It doesn’t matter how we got into a situation.
Discounted reward: future rewards get geometrically discounted by a discount factor γ ∈ [0,1); a reward received t steps from now is worth γ^t times as much as the same reward received immediately.
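As a minimal, self-contained sketch of these three ingredients (the four-state transition table and rewards below are hypothetical, purely for illustration), value iteration computes the optimal state values under geometric discounting:

```python
import numpy as np

# A generic finite MDP: fully observable states, deterministic Markov
# transitions T[s, a] = s', state-based reward R[s], and discount gamma.
gamma = 0.9
T = np.array([[1, 2],   # from state 0: action 0 -> state 1, action 1 -> state 2
              [1, 3],   # from state 1: stay in 1, or move to 3
              [2, 2],   # state 2 is absorbing
              [3, 3]])  # state 3 is absorbing
R = np.array([0.0, 1.0, 0.1, 0.5])  # reward depends only on the current state

# Value iteration: V(s) <- R(s) + gamma * max_a V(T[s, a]).
V = np.zeros(len(R))
for _ in range(1000):
    V = R + gamma * V[T].max(axis=1)

policy = V[T].argmax(axis=1)  # optimal action in each state
print("V =", np.round(V, 2), " optimal actions =", policy)
```

Because the transitions are Markov, the update for V(s) needs only the current state, and the discounting guarantees the iteration converges geometrically.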