Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Selection Theorems: A Program For Understanding Agents, published by johnswentworth on the AI Alignment Forum.
What’s the type signature of an agent?
For instance, what kind-of-thing is a “goal”? What data structures can represent “goals”? Utility functions are a common choice among theorists, but they don’t seem quite right. And what are the inputs to “goals”? Even when using utility functions, different models use different inputs - Coherence Theorems imply that utilities take in predefined “bet outcomes”, whereas AI researchers often define utilities over “world states” or “world state trajectories”, and human goals seem to be over latent variables in humans’ world models.
And that’s just goals. What about “world models”? Or “agents” in general? What data structures can represent these things, how do they interface with each other and the world, and how do they embed in their low-level world? These are all questions about the type signatures of agents.
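To make the question about inputs concrete, here is a minimal Python sketch of the candidate input types for a utility function mentioned above. It is illustrative only: every type name here (BetOutcome, WorldState, Trajectory, LatentState) is a hypothetical placeholder introduced for this example, not a definition taken from the post or from any of the theorems it cites.

```python
from typing import Callable, Dict, Sequence

# Hypothetical placeholder types (assumptions for illustration only).
BetOutcome = str                    # a predefined outcome of a bet, e.g. "win"
WorldState = Dict[str, float]       # a description of the world at one time
Trajectory = Sequence[WorldState]   # a sequence of world states over time
LatentState = Dict[str, float]      # variables inside an agent's world model

# Four candidate type signatures for a utility function.
# They share an output type (a real number) and differ only in their input.
UtilityOverBets = Callable[[BetOutcome], float]
UtilityOverStates = Callable[[WorldState], float]
UtilityOverTrajectories = Callable[[Trajectory], float]
UtilityOverLatents = Callable[[LatentState], float]


def state_utility(state: WorldState) -> float:
    """Toy utility over world states: more 'apples' is better."""
    return state.get("apples", 0.0)


def trajectory_utility(trajectory: Trajectory) -> float:
    """Toy utility over trajectories: sum the state utility over time."""
    return sum(state_utility(state) for state in trajectory)
```

The two toy functions are not interchangeable: one cannot be passed where the other is expected, which is the sense in which the choice of input is part of the type signature.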
One general strategy for answering these sorts of questions is to look for what I’ll call Selection Theorems. Roughly speaking, a Selection Theorem tells us something about what agent type signatures will be selected for (by e.g. natural selection or ML training or economic profitability) in some broad class of environments. In inner/outer agency terms, it tells us what kind of inner agents will be selected by outer optimization processes.
We already have many Selection Theorems: Coherence and Dutch Book theorems, Good Regulator and Gooder Regulator, the Kelly Criterion, etc. These theorems generally seem to point in a similar direction - suggesting deep unifying principles exist - but they have various holes and don’t answer all the questions we want. We need better Selection Theorems if they are to be a foundation for understanding human values, inner agents, value drift, and other core issues of AI alignment.
The quest for better Selection Theorems has a lot of “surface area” - lots of different angles for different researchers to make progress, within a unified framework, but without redundancy. It also requires relatively little ramp-up; I don’t think someone needs to read the entire giant corpus of work on alignment to contribute useful new Selection Theorems. At the same time, better Selection Theorems directly tackle the core conceptual problems of alignment and agency; I expect sufficiently-good Selection Theorems would get us most of the way to solving the hardest parts of alignment. Overall, I think they’re a good angle for people who want to make useful progress on the theory of alignment and agency, and have strong theoretical/conceptual skills.
Outline of this post:
More detail on what “type signatures” and “Selection Theorems” are
Examples of existing Selection Theorems and what they prove (or assume) about agent type signatures
Aspects which I expect/want from future Selection Theorems
How to work on Selection Theorems
What’s A Type Signature Of An Agent?
We’ll view the “type signature of an agent” as an answer to three main questions:
Representation: What “data structure” represents the agent - i.e. what are its high-level components, and how can they be represented?
Interfaces: What are the “inputs” and “outputs” between the components - i.e. how do they interface with each other and with the environment?
Embedding: How does the abstract “data structure” representation relate to the low-level system in which the agent is implemented?
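One way to picture these three questions is as the fields of a single record. The sketch below is a loose illustration under stated assumptions, not a proposed answer: the class AgentSignature, its field names, and the thermostat example are all hypothetical, and the choice of "world model plus goal" as the components is itself one of the things a Selection Theorem would need to justify.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

Obs = TypeVar("Obs")            # what the agent receives from the environment
Act = TypeVar("Act")            # what the agent sends to the environment
Model = TypeVar("Model")        # the agent's internal world model
LowLevel = TypeVar("LowLevel")  # state of the system implementing the agent


@dataclass
class AgentSignature(Generic[Obs, Act, Model, LowLevel]):
    """Hypothetical container with one group of fields per question."""

    # Representation: the high-level components and their data structures.
    world_model: Model
    goal: Callable[[Model], float]

    # Interfaces: how components talk to each other and to the environment.
    update: Callable[[Model, Obs], Model]
    choose: Callable[[Model, Callable[[Model], float]], Act]

    # Embedding: how the abstract model is read off the low-level system.
    decode: Callable[[LowLevel], Model]

    def step(self, observation: Obs) -> Act:
        """Update the world model from an observation, then pick an action."""
        self.world_model = self.update(self.world_model, observation)
        return self.choose(self.world_model, self.goal)


# Toy instance (an assumption for illustration): a thermostat whose world
# model is a single believed temperature and whose goal peaks at 20 degrees.
thermostat = AgentSignature(
    world_model=0.0,
    goal=lambda temp: -abs(temp - 20.0),
    update=lambda model, obs: obs,  # believe the latest reading
    choose=lambda model, goal: "heat" if goal(model + 1.0) > goal(model) else "off",
    decode=lambda raw: raw["sensor_mv"] / 10.0,  # millivolts to degrees
)
```

Calling thermostat.step(15.0) updates the believed temperature and returns an action; the decode field is never used by step, which reflects that the embedding question is about how the abstraction maps onto the hardware rather than about how the agent acts.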
A Selection Theorem typically assumes some parts of the type signature (often implicitly), and derives others.
For example, Coherence Theorems show that any non-dominated strategy is equivalent to maximization of Bayesian expected utility.
Representation: utility function and probability distribution.
Interfaces: both the utility function and distribution take in “bet ...