Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: The strategy-stealing assumption, published by Paul Christiano on the AI Alignment Forum.
Suppose that 1% of the world’s resources are controlled by unaligned AI, and 99% of the world’s resources are controlled by humans. We might hope that at least 99% of the universe’s resources end up being used for stuff-humans-like (in expectation).
Jessica Taylor argued for this conclusion in Strategies for Coalitions in Unit-Sum Games: if the humans divide into 99 groups, each of which acquires influence as effectively as the unaligned AI, then by symmetry each group should end up with as much influence as the AI, i.e. the groups should collectively end up with 99% of the influence.
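The symmetry argument can be sketched numerically. The toy model below (my own illustration, not from the post) treats influence as a unit-sum quantity split in proportion to how effectively each party pursues it; if 99 human groups each play a strategy exactly as effective as the unaligned AI's, each party gets an equal share:

```python
# Toy model of the unit-sum symmetry argument. All numbers are
# illustrative assumptions, not claims from the original post.

def influence_shares(effectiveness):
    """Split one unit of total influence in proportion to effectiveness."""
    total = sum(effectiveness)
    return [e / total for e in effectiveness]

# One unaligned AI plus 99 human groups, all equally effective.
parties = [1.0] * 100  # index 0 = unaligned AI, indices 1..99 = human groups
shares = influence_shares(parties)

ai_share = shares[0]
human_share = sum(shares[1:])
print(f"AI: {ai_share:.2%}, humans: {human_share:.2%}")  # AI: 1.00%, humans: 99.00%
```

The model only restates the symmetry premise; the substantive question, taken up below, is whether humans can in fact match the AI's effectiveness strategy-for-strategy.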
This argument rests on what I’ll call the strategy-stealing assumption: for any strategy an unaligned AI could use to influence the long-run future, there is an analogous strategy that a similarly-sized group of humans can use in order to capture a similar amount of flexible influence over the future. By “flexible” I mean that humans can decide later what to do with that influence — which is important since humans don’t yet know what we want in the long run.
Why might the strategy-stealing assumption be true?
Today there are a bunch of humans, with different preferences and different kinds of influence. Crudely speaking, the long-term outcome seems to be determined by some combination of {which preferences have how much influence?} and {what is the space of realizable outcomes?}.
I expect this to become more true over time — I expect groups of agents with diverse preferences to eventually approach efficient outcomes, since otherwise there are changes that every agent would prefer (though this is not obvious, especially in light of bargaining failures). Then the question is just about which of these efficient outcomes we pick.
I think that our actions don’t affect the space of realizable outcomes, because long-term realizability is mostly determined by facts about distant stars that we can’t yet influence. The obvious exception is that if we colonize space faster, we will have access to more resources. But quantitatively this doesn’t seem like a big consideration, because astronomical events occur over millions of millennia while our decisions only change colonization timelines by decades.
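The quantitative claim is easy to check with back-of-the-envelope arithmetic. Using illustrative numbers of my own choosing (a 30-year shift in colonization timelines against "millions of millennia", i.e. roughly a billion years, of astronomical resource availability), the fraction of resources at stake is tiny:

```python
# Rough arithmetic comparing a decades-scale colonization delay against
# astronomical timescales. Both figures are assumed for illustration.
delay_years = 30                         # assumed: decisions shift timelines by decades
astronomical_years = 1_000_000 * 1_000   # "millions of millennia" ~ 1e9 years

fraction_forgone = delay_years / astronomical_years
print(f"fraction of resources forgone: {fraction_forgone:.1e}")  # 3.0e-08
```

On these assumptions, a decades-scale delay forgoes only about three parts in a hundred million of the eventual resource base, which is why the post treats it as a small consideration.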
So I think our decisions mostly affect long-term outcomes by changing the relative weights of different possible preferences (or by causing extinction).
Today, one of the main ways that preferences have weight is because agents with those preferences control resources and other forms of influence. Strategy-stealing seems most possible for this kind of plan — an aligned AI can exactly copy the strategy of an unaligned AI, except the money goes into the aligned AI’s bank account instead. The same seems true for most kinds of resource gathering.
There are lots of strategies that give influence to other people instead of helping me. For example, I might preferentially collaborate with people who share my values. But I can still steal these strategies, as long as my values are just as common as the values of the person I’m trying to steal from. So a majority can steal strategies from a minority, but not the other way around.
There are plenty of strategies that don’t involve acquiring resources or flexible influence. For example, we could have a parliament with obscure rules in which I can make maneuvers that advantage one set of values or another in a way that can’t be stolen. Strategy-stealing may only be possible at the level of groups — you need to retain the option of setting up a different parliamentary system that doesn’t favor particular values. Even then, it’s unclear whether strategy-stealing is possible.
There isn’t a clean argument for strategy-stealing, but I think it seems plausible enough that it’s meaningful and productive...