Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Environmental Structure Can Cause Instrumental Convergence, published by Alex Turner on the AI Alignment Forum.
Previously: Seeking Power Is Often Robustly Instrumental In MDPs
Key takeaways
The structure of the agent's environment often causes instrumental convergence. In many situations, there are (potentially combinatorially) many ways for power-seeking to be optimal, and relatively few ways for it not to be optimal.
My previous results said something like: in a range of situations, when you're maximally uncertain about the agent's objective, this uncertainty assigns high probability to objectives for which power-seeking is optimal.
My new results prove that in a range of situations, seeking power is optimal for most agent objectives (for a particularly strong formalization of 'most').
More generally, the new results say something like: in a range of situations, for most beliefs you could have about the agent's objective, these beliefs assign high probability to reward functions for which power-seeking is optimal.
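To make the orbit-counting idea concrete, here is a minimal sketch (my own toy construction, not the formal setting of the results): from a start state, LEFT reaches a single terminal state A, while RIGHT reaches a hub from which three terminal states B, C, D are available, so RIGHT is the option-preserving, power-seeking choice. Counting over all permutations of an arbitrary reward function shows that most of its orbit makes RIGHT optimal.

```python
# Toy sketch (illustrative only, not the paper's formal setting):
# LEFT reaches one terminal state (A); RIGHT reaches a hub with
# three terminal states (B, C, D). RIGHT is the power-seeking
# action because it keeps more options open.
from itertools import permutations

base_reward = (0.9, 0.1, 0.4, 0.3)  # arbitrary rewards for terminals A, B, C, D

total = right_optimal = 0
for r_A, r_B, r_C, r_D in permutations(base_reward):
    total += 1
    # RIGHT is optimal whenever the best reward reachable on the
    # right at least matches the lone reward on the left.
    if max(r_B, r_C, r_D) >= r_A:
        right_optimal += 1

print(f"{right_optimal}/{total} permutations of the reward function "
      f"make the power-seeking action optimal")  # -> 18/24
```

Because only the relative order of the rewards matters here, the 18/24 count is the same for any reward function with four distinct values: at least three quarters of every orbit favors the option-preserving action.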
This is the first formal theory of the statistical tendencies of optimal policies in reinforcement learning.
One result says: whenever the agent maximizes average reward, then for any reward function, most permutations of that reward function incentivize shutdown avoidance.
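As an illustrative instance of that counting argument (my toy numbers, not the theorem's general setting): suppose the agent can enter one absorbing shutdown state or park forever in one of n other absorbing states. Under average reward, each choice is worth exactly that state's reward, so avoiding shutdown is optimal whenever some non-shutdown state outranks the shutdown state, which holds for at least n/(n+1) of the permutations of any reward function with distinct values.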
The formal theory is now beginning to explain why alignment is so hard by default, and why failure might be catastrophic.
Before, I thought of environmental symmetries as convenient sufficient conditions for instrumental convergence. But I increasingly suspect that symmetries are the main part of the story.
I think these results may be important for understanding the AI alignment problem and formally motivating its difficulty.
For example, my results imply that simplicity priors over reward functions assign non-negligible probability to reward functions for which power-seeking is optimal.
I expect my symmetry arguments to help explain other "convergent" phenomena, including:
convergent evolution
the prevalence of deceptive alignment
feature universality in deep learning
One of my hopes for this research agenda: if we can understand exactly why superintelligent goal-directed objective maximization seems to fail horribly, we might understand how to do better.
Thanks to TheMajor, Rafe Kennedy, and John Wentworth for feedback on this post. Thanks to Rohin Shah and Adam Shimi for feedback on the simplicity prior result.
Orbits Contain All Permutations of an Objective Function
The Minesweeper analogy for power-seeking risks
One view on AGI risk is that we're charging ahead into the unknown, into a particularly unfair game of Minesweeper in which the first click is allowed to blow us up. Following the analogy, we want to understand enough about the mine placement so that we don't get exploded on the first click. And once we get a foothold, we start gaining information about other mines, and the situation is a bit less dangerous.
My previous theorems on power-seeking said something like: "at least half of the tiles conceal mines."
I think that's important to know. But there are many tiles you might click on first. Maybe all of the mines are on the right, and we understand the obvious pitfalls, and so we'll just click on the left.
That is: we might not uniformly randomly select tiles:
We might click a tile on the left half of the grid.
Maybe we sample from a truncated discretized Gaussian.
Maybe we sample the next coordinate by using the universal prior (rejecting invalid coordinate suggestions).
Maybe we uniformly randomly load LessWrong posts and interpret the first text bits as encoding a coordinate.
There are lots of ways to sample coordinates, besides uniformly randomly. So why should our sampling procedure tend to activate mines?
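As one concrete, hypothetical example of a non-uniform sampler, here is a minimal sketch of the truncated discretized Gaussian option: draw a real-valued coordinate from a 2-D Gaussian, round it to a tile, and reject anything off the board. The grid size and Gaussian parameters are my own arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
WIDTH = HEIGHT = 16  # hypothetical 16x16 Minesweeper board

def sample_tile(mean=(3.0, 8.0), std=4.0):
    """Truncated, discretized Gaussian over tiles: sample a 2-D
    Gaussian, round to the nearest tile, reject off-board draws."""
    while True:
        x, y = rng.normal(mean, std)            # continuous 2-D draw
        i, j = int(round(x)), int(round(y))     # discretize to a tile
        if 0 <= i < WIDTH and 0 <= j < HEIGHT:  # truncate to the board
            return i, j

print(sample_tile())  # a first click biased toward the board's left side
```

The point is not this particular sampler, but that nothing about non-uniform samplers like it obviously steers clear of the mines.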
My new results say something analogous to: for eve...