Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Inner Alignment: Explain like I'm 12 Edition, published by Rafael Harth on the AI Alignment Forum.
(This is an unofficial explanation of Inner Alignment based on the Miri paper Risks from Learned Optimization in Advanced Machine Learning Systems (which is almost identical to the LW sequence) and the Future of Life podcast with Evan Hubinger (Miri/LW). It's meant for anyone who found the sequence too long/challenging/technical to read.)
Note that bold and italics mean "this is a new term I'm introducing," whereas underline and italics are used for emphasis.
What is Inner Alignment?
Let's start with an abridged guide to how Machine Learning works:
Choose a problem
Decide on a space of possible solutions
Find a good solution from that space
If the problem is "find a tool that can look at any image and decide whether or not it contains a cat," then each conceivable set of rules for answering this question (formally, each function from the set of all possible images to the set {yes, no}) defines one solution. We call each such solution a model. The space of possible models is depicted below.
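To make the "model = function" picture concrete, here is a minimal Python sketch (the 4x4 image size and the silly example rule are my illustrative assumptions, not from the post):

```python
from typing import Callable, Tuple

Image = Tuple[int, ...]          # a 4x4 black-and-white image: 16 pixels, each 0 or 1
Model = Callable[[Image], bool]  # True means "contains a cat"

def silly_model(image: Image) -> bool:
    # One arbitrary point in model space: answer yes iff the top-left
    # pixel is lit. Most models are like this -- nothing to do with cats.
    return image[0] == 1

# There are 2**16 possible images, and a model assigns one of two answers
# to each, so the space holds 2**(2**16) models -- hopelessly many to
# enumerate, even at this tiny image size.
num_images = 2 ** 16
print(f"number of models: 2^{num_images}")
```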
Since that's all possible models, most of them are utter nonsense.
Pick a random one, and you're as likely to end up with a car-recognizer as a cat-recognizer – but far more likely to end up with an algorithm that does nothing we can interpret. Note that even the examples I annotated aren't typical – most models would be more complex while still doing nothing related to cats. Nonetheless, somewhere in there is a model that would do a decent job on our problem. In the above, that's the one that says, "I look for cats."
How does ML find such a model? One way that does not work is trying out all of them. That's because the space is too large: it might contain over 10^1000000 candidates. Instead, there's this thing called Stochastic Gradient Descent (SGD). Here's how it works:
SGD begins with some (probably terrible) model and then proceeds in steps. In each step, it switches to another model that is "close" and hopefully a little better. Eventually, it stops and outputs the most recent model.[1]
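As a toy illustration of that stepwise improvement, here is a minimal gradient-descent sketch (the one-parameter "model", the quadratic loss, and the step size are all illustrative assumptions; real SGD estimates gradients from batches of training data):

```python
# Toy sketch of gradient descent on a one-parameter "model space".
# The model is a single number w; loss(w) measures how badly model w
# performs (lower is better).

def loss(w: float) -> float:
    return (w - 3.0) ** 2 + 0.5   # imaginary performance measure

def gradient(w: float, eps: float = 1e-6) -> float:
    # Finite-difference estimate of the slope of the loss at w.
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

w = -10.0             # start with some (probably terrible) model
step_size = 0.1
for _ in range(200):  # each step moves to a "close", slightly better model
    w -= step_size * gradient(w)

print(w)  # ends up near 3.0 -- close to the best model, not guaranteed exact
```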
Note that, in the example above, we don't end up with the perfect cat-recognizer (the red box) but with something close to it – perhaps a model that looks for cats but has some unintended quirks. SGD generally does not guarantee optimality.
The speech bubbles where the models explain what they're doing are annotations for the reader. From the perspective of the programmer, it looks like this:
The programmer has no idea what the models are doing. Each model is just a black box.[2]
A necessary component for SGD is the ability to measure a model's performance, but this happens while treating them as black boxes. In the cat example, assume the programmer has a bunch of images that are accurately labeled as "contains cat" and "doesn't contain cat." (These images are called the training data and the setting is called supervised learning.) SGD tests how well each model does on these images and, in each step, chooses one that does better. In other settings, performance might be measured in different ways, but the principle remains the same.
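In code, measuring performance while treating a model as a black box might look like the following sketch (the labeled data and the score helper are hypothetical; real training data would be actual images):

```python
from typing import Callable, List, Tuple

Image = Tuple[int, ...]
Model = Callable[[Image], bool]

# Hypothetical labeled training data: (image, contains_cat) pairs.
training_data: List[Tuple[Image, bool]] = [
    ((1, 1, 0, 1), True),
    ((0, 0, 0, 0), False),
    # ... more labeled images
]

def score(model: Model, data: List[Tuple[Image, bool]]) -> float:
    # The model is a black box: we only call it and compare its answers
    # to the labels; we never look at how it works inside.
    correct = sum(model(image) == label for image, label in data)
    return correct / len(data)

always_no: Model = lambda image: False
print(score(always_no, training_data))  # 0.5 on this tiny dataset
```

At each step, SGD keeps whichever nearby model scores higher.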
Now, suppose that the images we have happen to include only white cats. In this case, SGD might choose a model implementing the rule "output yes if there is something white and with four legs." The programmer would not notice anything strange – all she sees is that the model output by SGD does well on the training data.
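To make this failure mode concrete, here is a hypothetical sketch (the boolean features and the "white with four legs" rule are illustrative stand-ins for whatever the learned model actually computes):

```python
# Hypothetical sketch: a model that latched onto the wrong feature.
# Assume each image comes with simple derived features (illustrative).

def looks_white_with_four_legs(image_features: dict) -> bool:
    return image_features["is_white"] and image_features["num_legs"] == 4

# Training data contains only white cats, so the rule scores perfectly:
train = [
    ({"is_white": True,  "num_legs": 4}, True),   # white cat
    ({"is_white": False, "num_legs": 0}, False),  # no animal
]
assert all(looks_white_with_four_legs(f) == y for f, y in train)

# Deployed on a black cat (never seen in training), it fails:
black_cat = {"is_white": False, "num_legs": 4}
print(looks_white_with_four_legs(black_cat))  # False -- wrong answer
```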
In this setting, there is thus only a problem if our way of obtaining feedback is flawed. If it is perfect – if the pictures with cats are perfectly representative of what images-with-cats are like, and the pictures without cats are perfectly representative of what images-without-cats ...