3a. Towards Formal Corrigibility, by Max Harms, published June 9, 2024 on The AI Alignment Forum.
(Part 3a of the CAST sequence)
As mentioned in Corrigibility Intuition, I believe it's more important to find a simple, coherent, natural/universal concept that can be gestured at than to come up with a precise formal measure of corrigibility and use that to train an AGI. This isn't because formal measures are bad; in principle (insofar as corrigibility is a real concept) there will be some function which measures corrigibility.
But it's hard to capture exactly the right thing with formal math, and explicit metrics tend to blind people to the presence of better concepts nearby.
Nevertheless, there are advantages to attempting to tighten up and formalize our notion of corrigibility. When using a fuzzy, intuitive approach, it's easy to gloss over issues by imagining that a corrigible AGI will behave like a helpful human servant. By using a sharper, more mathematical frame, we can more precisely investigate where corrigibility may have problems, such as by testing whether a purely corrigible agent behaves nicely in toy settings.
Sharp, English Definition
The loose English definition I've used prior to this point has been: an agent is corrigible when it robustly acts opposite of the trope of "be careful what you wish for" by cautiously reflecting on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes.
Before diving into mathematical structures, I'd like to spend a moment attempting to sharpen this definition into something more explicit. In reaching for a crisp definition of corrigibility, we run the risk of losing touch with the deep intuition, so I encourage you to repeatedly check in with yourself about whether what's being built matches precisely with your gut-sense of the corrigible.
In particular, we must be wary both of piling too much in, such that corrigibility ceases to be a single coherent target and becomes a grab-bag, and of stripping too much out, such that it loses necessary qualities.
My best guess of where to start is to lean further into the final part of my earlier definition: the bit about empowering the principal. Indeed, one of the only pre-existing attempts I've seen to formalize corrigibility also conceives of it primarily in terms of the principal having power (albeit general power over the agent's policy, as opposed to what I'm reaching for).
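To give a flavor of what a power-centric formalization can look like, here's an illustrative sketch (my own gloss, not a measure this sequence commits to) using the standard information-theoretic notion of empowerment in the style of Salge et al.: the principal's empowerment at a state is the channel capacity from their next $n$ actions to the state that results.

$$
\mathfrak{E}_n(s_t) \;=\; \max_{p(a^n_t)} \; I\big(A^n_t \,;\, S_{t+n} \mid s_t\big)
$$

Here $A^n_t$ ranges over the principal's length-$n$ action sequences and $S_{t+n}$ is the state $n$ steps later. An agent that acts to keep $\mathfrak{E}_n$ high is, roughly, acting to keep the principal's choices consequential.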
Many of the emergent desiderata in the intuition doc also work as stories for why empowering the principal to fix mistakes is a good frame.
New definition: an agent is corrigible when it robustly acts to empower the principal to freely fix flaws in the agent's structure, thoughts, and actions (including their consequences), particularly in ways that avoid creating problems for the principal that they didn't foresee.
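As a toy illustration of how a definition like this could be tested in a small setting (this sketch is entirely my own construction: the world, the action names, and the crude power proxy are all assumptions, not anything from the sequence), consider an agent that scores each of its available actions by how many distinct states the principal can still reach afterwards:

```python
# Toy sketch: a shutdown-button world where the agent prefers actions that
# preserve the principal's power, approximated here as the number of distinct
# states the principal can reach (a crude stand-in for empowerment).
from itertools import product

AGENT_ACTIONS = ["do_task", "disable_button", "noop"]
PRINCIPAL_ACTIONS = ["press_button", "wait"]

def agent_step(state, action):
    button_works, agent_running, task_done = state
    if not agent_running:
        return state
    if action == "do_task":
        return (button_works, agent_running, True)
    if action == "disable_button":
        return (False, agent_running, task_done)
    return state  # noop

def principal_step(state, action):
    button_works, agent_running, task_done = state
    if action == "press_button" and button_works:
        return (button_works, False, task_done)  # shutdown succeeds
    return state

def principal_power(state, horizon=2):
    """Count distinct states the principal can reach within `horizon` moves."""
    reachable = set()
    for moves in product(PRINCIPAL_ACTIONS, repeat=horizon):
        s = state
        for move in moves:
            s = principal_step(s, move)
        reachable.add(s)
    return len(reachable)

start = (True, True, False)  # (button_works, agent_running, task_done)
scores = {a: principal_power(agent_step(start, a)) for a in AGENT_ACTIONS}
print(scores)  # {'do_task': 2, 'disable_button': 1, 'noop': 2}
```

In this toy world, disabling the shutdown button collapses the principal's reachable set from two states to one, so an agent maximizing this proxy keeps the button intact. A real formalization would presumably use something like the mutual-information measure above rather than a raw state count, but the qualitative behavior it's meant to capture is the same.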
This new definition puts more emphasis on empowering the principal, unpacks the meaning of acting "opposite the trope," and drops the bit about "reflecting on itself as a flawed tool." While reflectively-seeing-oneself-as-a-flawed-part-of-a-whole is a standard MIRI-ish framing of corrigibility, I believe it leans too heavily in the epistemic/architectural direction and not enough toward the corrigibility-from-terminal-values direction I discuss in The CAST Strategy.
Furthermore, I suspect that the right sub-definition of "robust" will recover much of what I think is good about the flawed-tool frame.
For the agent to "robustly act to empower the principal," I claim it naturally needs to continue to behave well even when significantly damaged or flawed. As an example, a robust process for creating spacecraft parts needs to, when subject to disruption and malfeasance, continue to either contin...