Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 2. Corrigibility Intuition, published by Max Harms on June 8, 2024 on The AI Alignment Forum.
(Part 2 of the CAST sequence)
As a reminder, here's how I've been defining "corrigible" when introducing the concept: an agent is corrigible when it robustly acts opposite of the trope of "be careful what you wish for" by cautiously reflecting on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes.
This definition is vague, imprecise, and hides a lot of nuance. What do we mean by "flaws," for example? Even the parts that may seem most solid, such as the notion of there being a principal and an agent, may seem philosophically confused to a sufficiently advanced mind. We'll get into trying to precisely formalize corrigibility later on, but part of the point of corrigibility is to work even when it's only loosely understood. I'm more interested in looking for something robust (i.e. simple and gravitational) that can be easily gestured at, rather than trying to find something that has a precise, unimpeachable construction.[1]
Towards this end, I think it's valuable to try to get a rich, intuitive feeling for what I'm trying to talk about, and only attempt technical details once there's a shared sense of the outline. So in this document I'll attempt to build up details around what I mean by "corrigibility" through small stories about a purely corrigible agent whom I'll call Cora, and her principal, whom I'll name Prince.
These stories will attempt to demonstrate how some desiderata (such as obedience) emerge naturally from corrigibility, while others (like kindness) do not, as well as provide some texture on the ways in which the plain-English definition above is incomplete.
Please keep in mind that these stories are meant to illustrate what we want, rather than how to get what we want; actually producing an agent that has all the corrigibility desiderata will take a deeper, better training set than just feeding these stories to a language model or whatever.
In the end, corrigibility is not the definition given above, nor is it the collection of these desiderata, but rather corrigibility is the simple concept which generates the desiderata and which might be loosely described by my attempt at a definition.
I'm going to be vague about the nature of Cora in these stories, with an implication that she's a somewhat humanoid entity with some powers, a bit like a genie.
It probably works best if you imagine that Cora is actually an egoless, tool-like AGI, to dodge questions of personhood and slavery.[2] The relationship between a purely corrigible agent and a principal is not a healthy way for humans to relate to each other, and if you imagine Cora is a human some of these examples may come across as psychopathic or abusive.
While corrigibility is a property we look for in employees, I think the best employees bring human values to their work, and the best employers treat their employees as more than purely corrigible servants. On the same theme, while I describe Prince as a single person, I expect it's useful to sometimes think of him more like a group of operators whom Cora doesn't distinguish between.
To engage our intuitions, the setting resembles something like Cora being a day-to-day household servant doing mundane tasks, despite that being an extremely reckless use for a general intelligence capable of unconstrained self-improvement and problem-solving.
The point of these stories is not to describe an ideal setup for a real-world AGI. In fact, I spent no effort on describing the sort of world that we might see in the future, and many of these scenarios depict a wildly irresponsible and unwise use of Cora. The point of these stories is to get a better handle on what it means for an agent to be corrigible, not to serve as a role-model for...