Inaccessible information
by Paul Christiano (AI Alignment Forum)
Suppose that I have a great model for predicting “what will Alice say next?”
I can evaluate and train this model by checking its predictions against reality, but there may be many facts this model “knows” that I can’t easily access.
For example, the model might have a detailed representation of Alice’s thoughts which it uses to predict what Alice will say, without being able to directly answer “What is Alice thinking?” In this case, I can only access that knowledge indirectly, e.g. by asking what Alice would say under different conditions.
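To make this concrete, here is a toy sketch in Python (the architecture and every name in it are illustrative assumptions of mine, not anything from the setup above): a predictor whose latent state drives its outputs without that state being readable.

```python
import numpy as np

# Toy sketch (all names hypothetical): a recurrent predictor of Alice.
class AlicePredictor:
    def __init__(self, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(dim, dim))  # latent -> latent dynamics
        self.U = rng.normal(size=(dim, dim))  # observation -> latent
        self.V = rng.normal(size=(dim, dim))  # latent -> prediction
        self.h = np.zeros(dim)                # stands in for "Alice's thoughts"

    def observe(self, x):
        # Fold an observation of Alice into the latent state.
        self.h = np.tanh(self.W @ self.h + self.U @ x)

    def predict_next(self):
        # Checkable: compare this against what Alice actually says next.
        return self.V @ self.h

# There is deliberately no describe_thoughts() method: nothing in training
# on predictions forces the latent state h to be reportable in words.
```

On this sketch, the only handle on h is indirect: feed counterfactual observations through observe() and read off predict_next().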
I’ll call information like “What is Alice thinking?” inaccessible. I think it’s very plausible that AI systems will build up important inaccessible knowledge, and that this may be a central feature of the AI alignment problem.
In this post I’m going to try to clarify what I mean by “inaccessible information” and the conditions under which it could be a problem. This is intended as clarification and framing rather than a presentation of new ideas, though sections IV, V, and VI do try to make some small steps forward.
I. Defining inaccessible information
I’ll start by informally defining what it means for information to be accessible, based on two mechanisms:
Mechanism 1: checking directly
If I can check X myself, given other accessible information, then I’ll define X to be accessible.
For example, I can check a claim about what Alice will do, but I can’t check a claim about what Alice is thinking.
If I can run randomized experiments, I can probabilistically check a claim about what Alice would do. But I can’t check a counterfactual claim for conditions that I can’t create in an experiment.
In reality this is a graded notion — some things are easier or harder to check. For the purpose of this post, we can just talk about whether something can be tested even a single time over the course of my training process.
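As a toy version of this mechanism (the function names and numbers below are illustrative, not from anything above): checking is just comparing a claim against outcomes I can actually produce.

```python
import random

def check_probabilistically(claim, run_experiment, n=1000):
    """Estimate how often `claim` holds by running a randomized experiment.

    This covers "what Alice would do under condition C" only when
    run_experiment can actually create condition C; counterfactuals for
    conditions I can't create never get checked this way."""
    return sum(claim(run_experiment()) for _ in range(n)) / n

# Example: checking "Alice agrees more than half the time when asked politely."
run = lambda: random.random() < 0.6  # stand-in for observing Alice once
print(check_probabilistically(lambda agreed: agreed, run))
```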
Mechanism 2: transfer
The simplest model that provides some accessible information X may also provide some other information Y. After all, it’s unlikely that the simplest model that outputs X doesn’t output anything else. In this case, we’ll define Y to be accessible.
For example, if I train a model to predict what happens over the next minute, hour, or day, it may generalize to predicting what will happen in a month or year. If the simplest model that predicts the next day were a fully accurate physical simulation, the same simulation might remain accurate when run for longer periods of time.
I think this kind of transfer is dicey, so I genuinely don’t know whether long-term predictions are accessible or not (we certainly can’t check them directly, so transfer is the only way they could be accessible).
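As a toy version of the transfer question (the linear dynamics below are a stand-in I’m assuming, not a claim about real predictors): the same machinery that makes checkable next-step predictions can be run for longer, and whether the long-run output is any good is exactly the open question.

```python
import numpy as np

A = np.array([[0.99, 0.05],
              [-0.05, 0.99]])  # assumed "true physics" of a toy world

def predict_one_day(state):
    return A @ state             # directly checkable against tomorrow

def predict_by_transfer(state, days):
    # Iterate the short-horizon model; this is accurate only insofar as it
    # captured the real dynamics rather than a shortcut that works day-to-day.
    for _ in range(days):
        state = predict_one_day(state)
    return state                 # never directly checked during training

print(predict_by_transfer(np.array([1.0, 0.0]), days=365))
```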
Regardless of whether long-term predictions are accessible by transfer, there are other cases where I think transfer is pretty unlikely. For example, the simplest way to predict Alice’s behavior might be to have a good working model of her thoughts. But it seems unlikely that this model would spontaneously describe what Alice is thinking in an understandable way — you’d need to specify some additional machinery for turning the latent model into useful descriptions.
I think this is going to be a fairly common situation: predicting accessible information may involve almost all the same work as predicting inaccessible information, but you need to combine that work with some “last mile” in order to actually output inaccessible facts.
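One toy way to picture that last mile (a linear probe; the setup and names are illustrative assumptions of mine): the readout itself is cheap, but fitting it requires exactly the labels we lack.

```python
import numpy as np

def fit_readout(latents, thought_labels):
    """Least-squares probe from latent states to descriptions of thoughts.

    latents: (N, d) array of the predictor's hidden states
    thought_labels: (N, k) array of ground-truth descriptions"""
    R, *_ = np.linalg.lstsq(latents, thought_labels, rcond=None)
    return R

def describe_thoughts(h, R):
    return h @ R  # only as good as the labels used to fit R

# The catch: training on "what will Alice say?" supplies predictions and
# outcomes, but no thought_labels, so nothing pins down R. The last-mile
# machinery is extra structure the prediction objective never pays for.
```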
Definition
I’ll say that information is accessible if it’s in the smallest set of information that is closed under those two mechanisms, and inaccessible otherwise.
There are a lot of nuances in that definition, which I’ll ignore for now.
Examples
Here are some candidates for accessible vs. inaccessible information: