Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: The Alignment Problem: Machine Learning and Human Values, published by Rohin Shah on the AI Alignment Forum.
This is a linkpost for
The Alignment Problem: Machine Learning and Human Values, by Brian Christian, was just released. This is an extended summary + opinion; a version without the quotes from the book will go out in the next Alignment Newsletter.
Summary:
This book starts off with an explanation of machine learning and problems that we can currently see with it, including detailed stories and analysis of:
- The gorilla misclassification incident
- The faulty reward in CoastRunners
- The gender bias in language models
- The failure of facial recognition models on minorities
- The COMPAS controversy (leading up to impossibility results in fairness)
- The neural net that thought asthma reduced the risk of pneumonia
It then moves on to agency and reinforcement learning, covering, from a more historical and academic perspective, how we arrived at ideas such as temporal difference learning, reward shaping, curriculum design, and curiosity, across the fields of machine learning, behavioral psychology, and neuroscience. While the connections aren't always explicit, a knowledgeable reader can connect the academic examples given in these chapters to the ideas of specification gaming and mesa optimization that we talk about frequently in this newsletter. Chapter 5 especially highlights that agent design is not just a matter of specifying a reward: often, rewards will do ~nothing, and the main requirement for getting a competent agent is to provide good shaping rewards or a good curriculum. Just as in the previous part, Brian traces the intellectual history of these ideas, providing detailed stories of (for example):
- BF Skinner's experiments in training pigeons
- The invention of the perceptron
- The success of TD-Gammon, and later AlphaGo Zero
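To make the reward-shaping point above concrete: the standard way to add dense guidance without changing what the optimal agent does is potential-based shaping (Ng, Harada & Russell, 1999), where the bonus F(s, s') = γΦ(s') − Φ(s) is derived from a potential function Φ. Here is a minimal value-iteration check on a toy chain MDP; the environment and the potential function are invented for illustration:

```python
# Toy check of potential-based reward shaping: adding
# F(s, s') = gamma * phi(s') - phi(s) to the reward shifts every state's
# value by -phi(s) but leaves the optimal policy unchanged.
# The chain MDP and the potential below are invented for this sketch.

GAMMA = 0.9
N = 10  # states 0..N-1; entering state N-1 (the goal) pays reward 1

def q_value(V, phi, s, a):
    s2 = max(0, min(N - 1, s + a))      # walls clamp movement
    r = 1.0 if s2 == N - 1 else 0.0     # sparse goal reward
    r += GAMMA * phi(s2) - phi(s)       # potential-based shaping term
    return r + GAMMA * V[s2]

def value_iteration(phi=lambda s: 0.0, sweeps=500):
    """Return (values, greedy policy) for the chain under shaped rewards."""
    V = [0.0] * N                       # goal is terminal, so V[N-1] stays 0
    for _ in range(sweeps):
        for s in range(N - 1):
            V[s] = max(q_value(V, phi, s, a) for a in (-1, 1))
    policy = [max((-1, 1), key=lambda a: q_value(V, phi, s, a))
              for s in range(N - 1)]
    return V, policy

V_plain, pi_plain = value_iteration()
# Potential = negative distance to goal, a dense "getting warmer" signal.
V_shaped, pi_shaped = value_iteration(phi=lambda s: -(N - 1 - s))
```

Here `pi_plain == pi_shaped` (always move right), while the shaped values differ from the plain ones by exactly −Φ(s). That invariance is the reason a dense shaping signal can speed up learning without changing what the optimal agent ultimately does; an arbitrary hand-crafted bonus has no such guarantee, which is how specification-gaming failures like CoastRunners arise.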
The final part, titled "Normativity", delves much more deeply into the alignment problem. While the previous two parts are partially organized around AI capabilities -- how to get AI systems that optimize for their objectives -- this last one tackles head on the problem that we want AI systems that optimize for our (often-unknown) objectives, covering such topics as imitation learning, inverse reinforcement learning, learning from preferences, iterated amplification, impact regularization, calibrated uncertainty estimates, and moral uncertainty.
Opinion:
I really enjoyed this book, primarily because of the tracing of the intellectual history of various ideas. While I knew of most of these ideas, and often also who initially came up with the ideas, it's much more engaging to read the detailed stories of _how_ that person came to develop the idea; Brian's book delivers this again and again, functioning like a well-organized literature survey that is also fun to read because of its great storytelling. I struggled a fair amount in writing this summary, because I kept wanting to somehow communicate the writing style; in the end I decided not to do it and to instead give a few examples of passages from the book in this post.
Passages:
Note: Quotations of this length from the book are not generally permitted; I have gotten specific permission to include them.
Here’s an example of agents with evolved inner reward functions, which lead to the inner alignment problems we’ve previously worried about:
They created a two-dimensional virtual world in which simulated organisms (or “agents”) could move around a landscape, eat, be preyed upon, and reproduce. Each organism’s “genetic code” contained the agent’s reward function: how much it liked food, how much it disliked being near predators, and so forth. During its lifetime, it would use reinforcement learning to learn how to take actions to maximize these rewards. When an organism reproduced, ...
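The structure described in the passage, an outer evolutionary loop selecting over genomes that encode an inner reward function, with each agent spending its lifetime maximizing that inner reward, can be sketched as a toy. Everything below is invented for illustration (the action set, the fitness values), and the inner reinforcement learner is reduced to a single greedy choice:

```python
import random

# Toy sketch of the two-level setup: evolution selects over "genomes"
# that encode an inner reward function; each agent maximizes its own
# inner reward, while selection only ever sees true fitness.
# All specifics here are invented, not taken from the book's simulation.

rng = random.Random(0)
ACTIONS = ["eat", "approach_predator", "rest"]
FITNESS = {"eat": 1.0, "approach_predator": -2.0, "rest": 0.0}  # true objective

def lifetime(genome, steps=50):
    """Greedy inner agent: repeat the action its own reward ranks highest.

    Returns accumulated *fitness*, which the agent itself never observes."""
    a = max(ACTIONS, key=lambda x: genome[x])
    return sum(FITNESS[a] for _ in range(steps))

def evolve(pop_size=20, generations=30):
    pop = [{a: rng.uniform(-1, 1) for a in ACTIONS} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lifetime, reverse=True)       # select on fitness
        survivors = pop[: pop_size // 2]
        children = [{a: g[a] + rng.gauss(0, 0.1) for a in ACTIONS}
                    for g in survivors]             # mutate reward genomes
        pop = survivors + children
    pop.sort(key=lifetime, reverse=True)
    return pop[0]

best = evolve()
```

In this toy the evolved reward ends up tracking true fitness, but nothing enforces that in general: selection sees only fitness while the agent sees only its genome's reward, and when the two come apart you get exactly the inner alignment problems mentioned above.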