Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Better priors as a safety problem , published by Paul Christiano on the AI Alignment Forum.
(Related: Inaccessible Information, What does the universal prior actually look like?, Learning the prior)
Fitting a neural net implicitly uses a “wrong” prior. This makes neural nets more data hungry and makes them generalize in ways we don’t endorse, but it’s not clear whether it’s an alignment problem.
After all, if neural nets are what works, then both the aligned and unaligned AIs will be using them. It’s not clear if that systematically disadvantages aligned AI.
Unfortunately I think it’s an alignment problem:
I think the neural net prior may work better for agents with certain kinds of simple goals, as described in Inaccessible Information. The problem is that the prior mismatch may bite harder for some kinds of questions, and some agents simply never need to answer those hard questions.
I think that Solomonoff induction generalizes catastrophically because it becomes dominated by consequentialists who use better priors.
In this post I want to try to build some intuition for this problem, and then explain why I’m currently feeling excited about learning the right prior.
Indirect specifications in universal priors
We usually work with very broad “universal” priors, both in theory (e.g. Solomonoff induction) and in practice (deep neural nets are a very broad hypothesis class). For simplicity I’ll talk about the theoretical setting in this section, but I think the points apply equally well in practice.
The classic universal prior is a random output from a random stochastic program. We often think of the question “which universal prior should we use?” as equivalent to the question “which programming language should we use?” but I think that’s a loaded way of thinking about it — not all universal priors are defined by picking a random program.
A universal prior can never be too wrong — a prior P is universal if, for any other computable prior Q, there is some constant c such that, for all x, we have P(x) > c Q(x). That means that given enough data, any two universal priors will always converge to the same conclusions, and no computable prior will do much better than them.
Unfortunately, universality is much less helpful in the finite data regime. The first warning sign is that our “real” beliefs about the situation can appear in the prior in two different ways:
Directly: if our beliefs about the world are described by a simple computable predictor, they are guaranteed to appear in a universal prior with significant weight.
Indirectly: the universal prior also “contains” other programs that are themselves acting as priors. For example, suppose I use a universal prior with a terribly inefficient programming language, in which each character needed to be repeated 10 times in order for the program to do anything non-trivial. This prior is still universal, but it’s reasonably likely that the “best” explanation for some data will be to first sample a really simple interpret for a better programming language, and then draw a uniformly randomly program in that better programming language.
(There isn’t a bright line between these two kinds of posterior, but I think it’s extremely helpful for thinking intuitively about what’s going on.)
Our “real” belief is more like the direct model — we believe that the universe is a lawful and simple place, not that the universe is a hypothesis of some agent trying to solve a prediction problem.
Unfortunately, for realistic sequences and conventional universal priors, I think that indirect models are going to dominate. The problem is that “draw a random program” isn’t actually a very good prior, even if the programming language is OK— if I were an intelligent agent, even if I knew nothing about the particular world I lived in, I could do a lot of a...
view more