Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 3b. Formal (Faux) Corrigibility, published by Max Harms on June 9, 2024 on The AI Alignment Forum.
(Part 3b of the CAST sequence)
In the first half of this document, Towards Formal Corrigibility, I sketched a solution to the stop button problem. As I framed it, the solution depends heavily on being able to detect manipulation, which I discussed on an intuitive level. But intuitions can only get us so far. Let's dive into some actual math and see if we can get a better handle on things.
Measuring Power
To build towards a measure of manipulation, let's first take inspiration from the suggestion that manipulation is somewhat the opposite of empowerment. And to measure empowerment, let's begin by trying to measure "power" in someone named Alice.
Power, as I touched on in the ontology in Towards Formal Corrigibility, is (intuitively) the property of having one's values/goals be causally upstream of the state of some part of the world, such that the agent's preferences get expressed through their actions changing reality.
Let's imagine that the world consists of a Bayes net where there's a (multidimensional and probabilistic) node for Alice's Values, which can be downstream of many things, such as Genetics or whether Alice has been Brainwashed. In turn, her Values will be upstream of her (deliberate) Actions, as well as other side-channels such as her reflexive Body-Language.
Alice's Actions are themselves downstream of nodes besides Values, such as her Beliefs, as well as upstream of various parts of reality, such as her Diet and whether Bob-Likes-Alice.
As a simplifying assumption, let's assume that while the nodes upstream of Alice's Values can strongly affect the probability of having various Values, they can't determine her Values. In other words, regardless of things like Genetics and Brainwashing, there's always at least some tiny chance associated with each possible setting of Values.
Likewise, we'll assume that regardless of someone's Values, they always have at least a tiny probability of taking any possible action (including the "null action" of doing nothing).
And, as a further simplification, let's restrict our analysis of Alice's power to a single aspect of reality that's downstream of her actions, which we'll label "Domain". ("Diet" and "Bob-Likes-Alice" are examples of domains, as are blends of nodes like those.) We'll further compress things by combining all nodes upstream of values (e.g. Genetics and Brainwashing) into a single node called "Environment" and then marginalize out all other nodes besides Actions, Values, and the Domain.
The result should be a graph which has Environment as a direct parent of everything, Values as a direct parent of Actions and the Domain, and Actions as a direct parent of the Domain.
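To make the simplified graph concrete, here's a minimal sketch in Python. The node names, outcome spaces, and probabilities are all illustrative assumptions, not anything from the original post; the only structural commitments are the ones stated above: Environment is a parent of everything, Values parents Actions and the Domain, Actions parents the Domain, and both Values and Actions retain full support (no setting ever has probability zero).

```python
import random

# Illustrative outcome spaces (hypothetical; the real nodes are
# multidimensional and probabilistic).
ENVIRONMENTS = ["benign", "brainwashing"]
VALUES = ["likes_tea", "likes_coffee"]
ACTIONS = ["brew_tea", "brew_coffee", "null"]

EPSILON = 0.01  # full-support floor: nothing is ever impossible


def sample_values(env):
    """Environment -> Values. The environment shifts the distribution
    but can never drive any value's probability to zero."""
    p_tea = 0.9 if env == "benign" else 0.2
    p_tea = min(max(p_tea, EPSILON), 1 - EPSILON)
    return "likes_tea" if random.random() < p_tea else "likes_coffee"


def sample_action(env, values):
    """(Environment, Values) -> Actions. Values make the matching action
    likely, but every action, including the null action, keeps at least
    EPSILON weight."""
    preferred = "brew_tea" if values == "likes_tea" else "brew_coffee"
    weights = {a: EPSILON for a in ACTIONS}
    weights[preferred] = 1.0
    r = random.random() * sum(weights.values())
    for action, w in weights.items():
        r -= w
        if r <= 0:
            return action
    return ACTIONS[-1]


def sample_domain(env, values, action):
    """(Environment, Values, Actions) -> Domain. Here the domain is just
    which drink gets made, determined by the action for simplicity."""
    if action == "brew_tea":
        return "tea"
    if action == "brew_coffee":
        return "coffee"
    return "nothing"
```

Sampling the three functions in order (environment, then values, then action, then domain) walks the graph from its roots to its single leaf, which is all the structure the analysis below relies on.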
Let's now consider sampling a setting of the Environment. Whatever we sample, we've assumed that every setting of the Values node remains possible, so we can consider each counterfactual setting of Alice's Values. Given a specific environment and a specific choice of values, we can begin to evaluate Alice's power. Because we're only considering one environment and one choice of values, I'll call this "local power."
In an earlier attempt at formalization, I conceived of (local) power as the difference in expected value between sampling Alice's Action and taking the null action, but I don't think this is quite right. To demonstrate, let's imagine that Alice's body-language reveals her Values, regardless of her Actions. An AI which is monitoring Alice's body-language could, upon seeing her do anything at all, swoop in and rearrange the universe according to her Values, regardless of what she did.
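That earlier (and, per the text, flawed) formalization is easy to state in code. The sketch below takes the environment, a fixed counterfactual choice of values, and sampler functions for the action and domain nodes as inputs, and estimates the gap between acting and doing nothing. The `utility` function and all names are hypothetical stand-ins, just to make the expected-value difference concrete:

```python
def utility(values, domain):
    """Hypothetical: how much Alice, given these values, likes this
    domain outcome."""
    if values == "likes_tea":
        return {"tea": 1.0, "coffee": 0.0, "nothing": 0.2}[domain]
    return {"tea": 0.0, "coffee": 1.0, "nothing": 0.2}[domain]


def naive_local_power(env, values, sample_action, sample_domain, n=10_000):
    """The earlier, rejected measure: the gap in expected utility of the
    Domain between (a) sampling Alice's Action from her policy and
    (b) forcing the null action. Monte Carlo estimate over n samples."""
    acted = sum(
        utility(values, sample_domain(env, values, sample_action(env, values)))
        for _ in range(n)
    ) / n
    nulled = sum(
        utility(values, sample_domain(env, values, "null"))
        for _ in range(n)
    ) / n
    return acted - nulled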
This might, naively, seem acceptable to Alice (since she gets what she wants), but it's not a good measure of my intuitive notion of power, since the c...