Download - Selection Theorems: A Program For Understanding Agents by johnswentworth

Discover

Podcast Features
Your all-in-one podcasting solution.

Blog to Podcast
Turn your blog into an engaging podcast.
Livestream
High-performing audio live, without limits.

Podcast Studio
Easy-to-use audio recorder app.
Podbean AI
AI-Enhanced Audio Quality and Content Generation.

Podcast App
The best podcast player & podcast app.

Ads Marketplace
Join Ads Marketplace to earn money
through sponsorship on your podcast.

PodAds
Manage your ads with dynamic ad insertion capability.
Apple Podcasts Subscriptions Integration
Effortlessly publish and manage exclusive episodes for your
Apple Podcasts subscribers directly from Podbean.
Live Streaming
Receive livestream rewards from your audience and earn
recurring income from your Fan Club membership.

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Monetization
Advertisers
Enterprise
Pricing
Discover

The Nonlinear Library: Alignment Forum Top Posts

Education

Selection Theorems: A Program For Understanding Agents by johnswentworth

2021-12-04

Download Right click and do "save link as"

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Selection Theorems: A Program For Understanding Agents, published by johnswentworth on the AI Alignment Forum.
What’s the type signature of an agent?
For instance, what kind-of-thing is a “goal”? What data structures can represent “goals”? Utility functions are a common choice among theorists, but they don’t seem quite right. And what are the inputs to “goals”? Even when using utility functions, different models use different inputs - Coherence Theorems imply that utilities take in predefined “bet outcomes”, whereas AI researchers often define utilities over “world states” or “world state trajectories”, and human goals seem to be over latent variables in humans’ world models.
And that’s just goals. What about “world models”? Or “agents” in general? What data structures can represent these things, how do they interface with each other and the world, and how do they embed in their low-level world? These are all questions about the type signatures of agents.
One general strategy for answering these sorts of questions is to look for what I’ll call Selection Theorems. Roughly speaking, a Selection Theorem tells us something about what agent type signatures will be selected for (by e.g. natural selection or ML training or economic profitability) in some broad class of environments. In inner/outer agency terms, it tells us what kind of inner agents will be selected by outer optimization processes.
We already have many Selection Theorems: Coherence and Dutch Book theorems, Good Regulator and Gooder Regulator, the Kelly Criterion, etc. These theorems generally seem to point in a similar direction - suggesting deep unifying principles exist - but they have various holes and don’t answer all the questions we want. We need better Selection Theorems if they are to be a foundation for understanding human values, inner agents, value drift, and other core issues of AI alignment.
The quest for better Selection Theorems has a lot of “surface area” - lots of different angles for different researchers to make progress, within a unified framework, but without redundancy. It also requires relatively little ramp-up; I don’t think someone needs to read the entire giant corpus of work on alignment to contribute useful new Selection Theorems. At the same time, better Selection Theorems directly tackle the core conceptual problems of alignment and agency; I expect sufficiently-good Selection Theorems would get us most of the way to solving the hardest parts of alignment. Overall, I think they’re a good angle for people who want to make useful progress on the theory of alignment and agency, and have strong theoretical/conceptual skills.
Outline of this post:
More detail on what “type signatures” and “Selection Theorems” are
Examples of existing Selection Theorems and what they prove (or assume) about agent type signatures
Aspects which I expect/want from future Selection Theorems
How to work on Selection Theorems
What’s A Type Signature Of An Agent?
We’ll view the “type signature of an agent” as an answer to three main questions:
Representation: What “data structure” represents the agent - i.e. what are its high-level components, and how can they be represented?
Interfaces: What are the “inputs” and “outputs” between the components - i.e. how do they interface with each other and with the environment?
Embedding: How does the abstract “data structure” representation relate to the low-level system in which the agent is implemented?
A selection theorem typically assumes some parts of the type signature (often implicitly), and derives others.
For example, coherence theorems show that any non-dominated strategy is equivalent to maximization of Bayesian expected utility.
Representation: utility function and probability distribution.
Interfaces: both the utility function and distribution take in “bet ...