Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 5. Open Corrigibility Questions, published by Max Harms on June 10, 2024 on The AI Alignment Forum.
(Part 5 of the CAST sequence)
Much work remains on the topic of corrigibility and the CAST strategy in particular. There's theoretical work both in nailing down an even more complete picture of corrigibility and in developing better formal measures. But there's also a great deal of empirical work that seems possible to do at this point. In this document I'll attempt to give a summary of where I, personally, want to invest more energy.
Remaining Confusion
Does "empowerment" really capture the gist of corrigibility?
Does it actually matter whether we restrict the empowerment goal to the domains of the agent's structure, thoughts, actions, and the consequences of their actions? Or do we still get good outcomes if we ask for more general empowerment?
It seems compelling to model nearly everything in the AI's lightcone as a consequence of its actions, given that there's a counterfactual way the AI could have behaved such that those facts would change. If we ask to be able to correct the AI's actions, are we not, in practice, then asking to be generally empowered?
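To make the restricted-versus-general distinction concrete, here's a toy sketch of my own (not from the post), assuming the common channel-capacity reading of empowerment: in a deterministic environment, n-step empowerment reduces to the log of the number of distinct outcomes the principal can reach, and restricting empowerment to a domain amounts to projecting states down to the features in that domain before counting.

```python
import math

def empowerment(start, step, actions, horizon, project=lambda s: s):
    """log2 of distinct (projected) states reachable in `horizon` steps.

    `project` restricts which features of the state count, modeling the
    difference between general empowerment and empowerment over a
    restricted domain (e.g. only the agent-related parts of the state).
    In a deterministic environment this equals the channel capacity from
    action sequences to (projected) final states.
    """
    frontier = {start}
    for _ in range(horizon):
        frontier = {step(s, a) for s in frontier for a in actions}
    return math.log2(len({project(s) for s in frontier}))

# Toy 1-D walk: state is a position; actions move left/right/stay.
step = lambda s, a: s + a
full = empowerment(0, step, (-1, 0, 1), horizon=2)        # positions -2..2
coarse = empowerment(0, step, (-1, 0, 1), horizon=2,
                     project=lambda s: s >= 0)            # only the sign
```

In this toy, `full` is log2(5) (five reachable positions) while `coarse` is 1 bit (only the sign is counted), illustrating how much the choice of domain can change the measure; whether the restricted version still collapses into general empowerment when "consequences of actions" is read broadly is exactly the open question above.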
Corrigible agents should, I think, still (ultimately) obey commands that predictably disempower the principal or change the agent to be less corrigible. Does my attempted formalism actually capture this?
Can we prove that, in my formalism, any pressure on the principal's actions that stems from outside their values is disempowering?
How should we think about agent-actions which scramble the connection between values and principal-actions, but in a way that preserves the way in which actions encode information about what generated them? Is this still kosher? What if the scrambling takes place by manipulating the principal's beliefs?
What's going on with the relationship between time, policies, and decisions? Am I implicitly picking a decision theory for the agent in my formalism?
Are my attempts to rescue corrigibility in the presence of multiple timesteps philosophically coherent? Should we inject entropy into the AI's distribution over what time it is when measuring its expected corrigibility? If so, how much? Are the other suggestions about managing time good? What other tricks are there to getting things to work that I haven't thought of?
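The entropy-injection idea can be sketched mechanically, with the caveat that the scoring function and the temperature knob here are illustrative placeholders of mine, not part of the formalism: instead of scoring corrigibility at the agent's best-guess timestep, average the score under a softened distribution over what time it is.

```python
import math

def softened_time_dist(logits, temperature):
    """Softmax over candidate timesteps; higher temperature = more entropy."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def expected_corrigibility(logits, corrigibility_at, temperature=1.0):
    """Average a per-timestep corrigibility score over the softened
    belief about the current time, rather than scoring the argmax."""
    probs = softened_time_dist(logits, temperature)
    return sum(p * corrigibility_at(t) for t, p in enumerate(probs))
```

As `temperature` grows the distribution approaches uniform, so "how much entropy to inject" corresponds to choosing that parameter; the open question is whether any such choice is principled rather than ad hoc.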
Sometimes it's good to change values, such as if one has a meta-value (e.g. "I want to want to stop gambling"). How can we formally reflect the desideratum of having a corrigible agent support this kind of growth, or at least not try to block the principal from growing?
If the agent allows the principal to change values, how can we clearly distinguish the positive and natural kind of growth from unwanted value drift or manipulation?
Is there actually a clean line between learning facts and changing values? If not, does "corrigibility" risk having an agent who wants to prevent the principal from learning things?
Does the agent want to protect the principal in general, or simply to protect the principal from the agent's own actions?
Corrigibility clearly involves respecting commands given by the principal yesterday, or more generally, at some arbitrary time in the past. But when the principal of today gives a contradictory command, we want the agent to respect the updated instruction. What gives the present priority over the past?
If the agent strongly expects the principal to give a command in the future, does that expected-command carry any weight? If so, can it take priority over the principal of the past/present?
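One naive answer to the priority questions above, sketched here purely as a strawman for comparison (the names and the rule are mine, not the post's): strict recency, where the latest command on a topic supersedes earlier ones, and merely-anticipated future commands carry no weight until actually given.

```python
from dataclasses import dataclass

@dataclass
class Command:
    time: int          # when the command was given
    topic: str         # what it governs
    instruction: str   # what to do

def effective_instructions(commands):
    """Strict-recency rule: the latest command per topic wins.

    Hypothetical future commands never appear in `commands`, so they
    get zero weight under this rule -- one horn of the open question.
    """
    effective = {}
    for c in sorted(commands, key=lambda c: c.time):
        effective[c.topic] = c.instruction
    return effective
```

Even this trivial rule forces the two choices the questions point at: it privileges the present over the past by fiat, and it ignores expected commands entirely; a satisfying account of corrigibility would need to justify (or revise) both choices rather than stipulate them.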
Can a multiple-human team actually be a principal?
What's the right way to ground that out, ontologically?
How should a corrigible agent behave when its principal seems self-contradictory? (Either because the principal is a team, or simply because the single-huma...