Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Reply to Paul Christiano on Inaccessible Information, published by Alex Flint on the AI Alignment Forum.
In Inaccessible Information, Paul Christiano lays out a fundamental challenge in training machine learning systems to give us insight into parts of the world that we cannot directly verify. The core problem is as follows.
Suppose we lived in a world that had invented machine learning but not Newtonian mechanics. And suppose we trained some machine learning model to predict the motion of the planets across the sky -- we could do this by observing the position of the planets over, say, a few hundred days, and using this as training data for, say, a recurrent neural network. And suppose further that this worked and our training process yielded a model that output highly accurate predictions many days into the future. If all we wanted was to know the position of the planets in the sky then -- good news -- we’re done. But we might hope to use our model to gain some direct insight into the nature of the motion of the planets (i.e. the laws of gravity, although we wouldn’t know that this is what we were looking for).
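To make the hypothetical concrete, here is a minimal sketch of what such a setup might look like: a small recurrent model fit to a few hundred days of observed positions and then rolled forward to predict future days. The data, model, and names below (PlanetRNN, the synthetic observations) are illustrative assumptions, not anything from the post.

```python
# Hypothetical sketch: fit a small recurrent model to a sequence of observed
# planet positions and roll it forward to predict many days into the future.
# The "observations" here are synthetic stand-in data, not real ephemerides.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Fake observations: 300 days x 2 sky coordinates (e.g. azimuth, altitude).
days, coords = 300, 2
observations = torch.cumsum(0.01 * torch.randn(days, coords), dim=0)

class PlanetRNN(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(coords, hidden, batch_first=True)
        self.head = nn.Linear(hidden, coords)

    def forward(self, x, state=None):
        out, state = self.rnn(x, state)
        return self.head(out), state

model = PlanetRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Train to predict the next day's position from the history so far.
inputs, targets = observations[:-1].unsqueeze(0), observations[1:].unsqueeze(0)
for step in range(500):
    preds, _ = model(inputs)
    loss = nn.functional.mse_loss(preds, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Roll the model forward to predict positions 30 days past the data.
with torch.no_grad():
    _, state = model(observations.unsqueeze(0))
    pos = observations[-1:].unsqueeze(0)
    future = []
    for _ in range(30):
        pos, state = model(pos, state)
        future.append(pos.squeeze(0))
```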
Presumably the machine learning model has in some sense discovered Newtonian mechanics using the training data we fed it, since this is surely the most compact way to predict the position of the planets far into the future. But we certainly can’t just read off the laws of Newtonian mechanics by looking at the millions or billions or trillions of weights in the trained model. How might we extract insight into the nature of the motion of the planets from this model?
Well we might train a model to output both predictions about the position of the planets in the sky and a natural language description of what’s really going on behind the scenes (i.e. the laws of gravity). We’re assuming that we have enough training data that the training process was already able to derive these laws, so it’s not unreasonable to train a model that also outputs such legible descriptions. But in order to train a model that outputs such legible descriptions we need to generate a reward signal that incentivizes the right kind of legible descriptions. And herein lies the core of the problem: in this hypothesized world we do not know the true laws of Newtonian mechanics, so we cannot generate a reward signal by comparing the output of our model to ground truth during training. We might instead generate a reward signal that (1) measures how accurate the predictions of the position of the planets are, and (2) measures how succinct and plausible the legible descriptions are. But then what we are really training is a model that is good at producing succinct descriptions that seem plausible to humans. This may be a very very different (and dangerous) thing to do since there are lots of ways that a description can seem plausible to a human while being quite divorced from the truth.
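To make the shape of that reward signal concrete, here is a hedged sketch under the same hypothetical: one term we can actually check (prediction error against observed positions) and one term that only measures how succinct and plausible the description seems to a human rater. The function names and weighting below are purely illustrative assumptions.

```python
# Hypothetical reward signal for the setup described above. Nothing here ties
# the natural-language description to the truth -- only to human judgment.

def prediction_reward(predicted_positions, observed_positions):
    # Verifiable part: compare predictions to what we actually see in the sky.
    errors = [(p - o) ** 2 for p, o in zip(predicted_positions, observed_positions)]
    return -sum(errors) / len(errors)

def description_reward(description, human_plausibility_score):
    # Unverifiable part: we have no ground-truth laws to compare against,
    # so we can only reward brevity and how plausible the text seems.
    brevity = 1.0 / (1.0 + len(description.split()))
    return human_plausibility_score + brevity

def total_reward(predicted_positions, observed_positions,
                 description, human_plausibility_score, description_weight=0.1):
    return (prediction_reward(predicted_positions, observed_positions)
            + description_weight * description_reward(description, human_plausibility_score))
```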
Christiano calls this the instrumental policy: the policy that produces succinct descriptions that merely seem plausible to humans:
The real problem comes from what I’ll call the instrumental policy. Let’s say we’ve tried to dream up a loss function L(x, y) to incentivize the model to correctly answer information we can check, and give at least plausible and consistent answers on things we can’t check. By definition, the values L(x, y) are themselves accessible. Then it’s natural to learn a policy like: “on input x, produce the output y for which the loss L(x, y) will be minimal.” Let’s write BAD for this policy.
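As a toy illustration of that definition, here is a sketch of BAD under the assumption that the model can anticipate the loss L(x, y) for each candidate answer; the helper names are hypothetical.

```python
# Toy sketch of the instrumental policy BAD: on input x, return whichever
# candidate answer y the model anticipates will receive minimal loss L(x, y).
# Note that nothing here selects for truth, only for low anticipated loss.
# "predicted_loss" is a hypothetical stand-in for the model's estimate of L.

def bad_policy(x, candidate_answers, predicted_loss):
    return min(candidate_answers, key=lambda y: predicted_loss(x, y))
```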
Christiano uses the term “inaccessible information” for information like the laws of gravity in this example: information about the underlying nature of things that a machine learning model might learn quite accurately as latent information in service of making predictions, but that is difficult to extract...