Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Environmental Structure Can Cause Instrumental Convergence, published by Alex Turner on the AI Alignment Forum.
Previously: Seeking Power Is Often Robustly Instrumental In MDPs
Key takeaways
The structure of the agent's environment often causes instrumental convergence. In many situations, there are (potentially combinatorially) many ways for power-seeking to be optimal, and relatively few ways for it not to be optimal.
My previous results said something like: in a range of situations, when you're maximally uncertain about the agent's objective, this uncertainty assigns high probability to objectives for which power-seeking is optimal.
My new results prove that in a range of situations, seeking power is optimal for most agent objectives (for a particularly strong formalization of 'most').
More generally, the new results say something like: in a range of situations, for most beliefs you could have about the agent's objective, these beliefs assign high probability to reward functions for which power-seeking is optimal.
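To make the orbit-counting idea concrete, here is a minimal sketch (my own toy construction, not the formal setting of the results): from a start state, LEFT reaches a single terminal state A, while RIGHT reaches a hub from which three terminal states B, C, D are available, so RIGHT is the option-preserving, power-seeking choice. Counting over all permutations of an arbitrary reward function shows that most of its orbit makes RIGHT optimal.

```python
# Toy sketch (illustrative only, not the paper's formal setting):
# LEFT reaches one terminal state (A); RIGHT reaches a hub with
# three terminal states (B, C, D). RIGHT is the power-seeking
# action because it keeps more options open.
from itertools import permutations

base_reward = (0.9, 0.1, 0.4, 0.3)  # arbitrary rewards for terminals A, B, C, D

total = right_optimal = 0
for r_A, r_B, r_C, r_D in permutations(base_reward):
    total += 1
    # RIGHT is optimal whenever the best reward reachable on the
    # right at least matches the lone reward on the left.
    if max(r_B, r_C, r_D) >= r_A:
        right_optimal += 1

print(f"{right_optimal}/{total} permutations of the reward function "
      f"make the power-seeking action optimal")  # -> 18/24
```

Because only the relative order of the rewards matters here, the 18/24 count is the same for any reward function with four distinct values: at least three quarters of every orbit favors the option-preserving action.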
This is the first formal theory of the statistical tendencies of optimal policies in reinforcement learning.
One result says: whenever the agent maximizes average reward, then for any reward function, most permutations of that reward function incentivize shutdown avoidance.
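As an illustrative instance of that counting argument (my toy numbers, not the theorem's general setting): suppose the agent can enter one absorbing shutdown state or park forever in one of n other absorbing states. Under average reward, each choice is worth exactly that state's reward, so avoiding shutdown is optimal whenever some non-shutdown state outranks the shutdown state, which holds for at least n/(n+1) of the permutations of any reward function with distinct values.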
The formal theory is now beginning to explain why alignment is so hard by default, and why failure might be catastrophic.
Before, I thought of environmental symmetries as convenient sufficient conditions for instrumental convergence. But I increasingly suspect that symmetries are the main part of the story.
I think these results may be important for understanding the AI alignment problem and formally motivating its difficulty.
For example, my results imply that simplicity priors over reward functions assign non-negligible probability to reward functions for which power-seeking is optimal.
I expect my symmetry arguments to help explain other "convergent" phenomena, including:
convergent evolution
the prevalence of deceptive alignment
feature universality in deep learning
One of my hopes for this research agenda: if we can understand exactly why superintelligent goal-directed objective maximization seems to fail horribly, we might understand how to do better.
Thanks to TheMajor, Rafe Kennedy, and John Wentworth for feedback on this post. Thanks to Rohin Shah and Adam Shimi for feedback on the simplicity prior result.
Orbits Contain All Permutations of an Objective Function
The Minesweeper analogy for power-seeking risks
One view on AGI risk is that we're charging ahead into the unknown, into a particularly unfair game of Minesweeper in which the first click is allowed to blow us up. Following the analogy, we want to understand enough about the mine placement so that we don't get exploded on the first click. And once we get a foothold, we start gaining information about other mines, and the situation is a bit less dangerous.
My previous theorems on power-seeking said something like: "at least half of the tiles conceal mines."
I think that's important to know. But there are many tiles you might click on first. Maybe all of the mines are on the right, and we understand the obvious pitfalls, and so we'll just click on the left.
That is: we might not uniformly randomly select tiles:
We might click a tile on the left half of the grid.
Maybe we sample from a truncated discretized Gaussian.
Maybe we sample the next coordinate by using the universal prior (rejecting invalid coordinate suggestions).
Maybe we uniformly randomly load LessWrong posts and interpret the first text bits as encoding a coordinate.
There are lots of ways to sample coordinates, besides uniformly randomly. So why should our sampling procedure tend to activate mines?
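As one concrete, hypothetical example of a non-uniform sampler, here is a minimal sketch of the truncated discretized Gaussian option: draw a real-valued coordinate from a 2-D Gaussian, round it to a tile, and reject anything off the board. The grid size and Gaussian parameters are my own arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
WIDTH = HEIGHT = 16  # hypothetical 16x16 Minesweeper board

def sample_tile(mean=(3.0, 8.0), std=4.0):
    """Truncated, discretized Gaussian over tiles: sample a 2-D
    Gaussian, round to the nearest tile, reject off-board draws."""
    while True:
        x, y = rng.normal(mean, std)            # continuous 2-D draw
        i, j = int(round(x)), int(round(y))     # discretize to a tile
        if 0 <= i < WIDTH and 0 <= j < HEIGHT:  # truncate to the board
            return i, j

print(sample_tile())  # a first click biased toward the board's left side
```

The point is not this particular sampler, but that nothing about non-uniform samplers like it obviously steers clear of the mines.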
My new results say something analogous to: for eve...