Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: High-level interpretability: detecting an AI's objectives, published by Paul Colognese on September 29, 2023 on LessWrong.
Thanks to Monte MacDiarmid (for discussions, feedback, and experiment infrastructure) and to the Shard Theory team for their prior work and exploratory infrastructure.
Thanks to Joseph Bloom, John Wentworth, Alexander Gietelink Oldenziel, Johannes Treutlein, Marius Hobbhahn, Jeremy Gillen, Bilal Chughtai, Evan Hubinger, Rocket Drew, Tassilo Neubauer, Jan Betley, and Juliette Culver for discussions/feedback.
Summary
This is a brief overview of our research agenda, recent progress, and future objectives.
Having the ability to robustly detect, interpret, and modify an AI's objectives could allow us to directly solve the inner alignment problem. Our work takes a top-down approach: we aim to clarify our understanding of how objectives might exist in an AI's internals and to develop methods to detect and understand them.
This post is meant to do quite a few things:
We'll start by outlining the problem and potential solution.
We then present our initial theory on objectives.
Next, we look at some initial empirical work that shows how we hope to test theory-based predictions.
We then illustrate how we intend to go from theory to objective detection methods by producing an initial (but crude) objective detection method.
Finally, we conclude by discussing related work and future directions.
Introduction to objective detection
In this section, we outline how objective detection could be used to tackle the inner alignment problem, clarify what we mean when we refer to an internal objective, and present our initial theory on objectives.
Background
A major concern is that we may accidentally train AIs that pursue misaligned objectives. Behavioral observations alone are insufficient for confidently deducing an AI system's true objectives, in part because of the problem of deceptive alignment. We may therefore need to rely on advanced interpretability tools to deduce them.
Prior work has discussed how agentic AIs are likely to have internal objectives that are used to select actions by predicting whether those actions will lead to target outcomes. If an overseer had an objective detection method that could robustly detect and interpret all of an AI's internal objectives (in training and deployment), the overseer could confidently determine whether the system is misaligned and either intervene or use this observation as part of a training signal.
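To make the oversight loop concrete, here is a minimal sketch of how such a detection method might be used. Everything in it is hypothetical: detect_objectives, Objective, and ALIGNED_OBJECTIVES are toy stand-ins for capabilities and judgments we do not yet have, not part of any existing method.

```python
# Toy, hypothetical sketch of the oversight loop described above.
# `detect_objectives` and `ALIGNED_OBJECTIVES` stand in for capabilities and
# knowledge that do not currently exist; nothing here reflects an existing method.

from dataclasses import dataclass

@dataclass
class Objective:
    description: str  # an interpreted, human-readable summary of an internal objective

# Whatever set of objectives the overseer endorses (illustrative placeholder).
ALIGNED_OBJECTIVES = {"complete the user's task"}

def detect_objectives(model) -> list[Objective]:
    """Placeholder for the open problem: read objectives out of the model's internals."""
    return [Objective("complete the user's task")]  # toy output

def oversee(model) -> str:
    """Decide what to do with the model based on its detected internal objectives."""
    detected = detect_objectives(model)
    misaligned = [o for o in detected if o.description not in ALIGNED_OBJECTIVES]
    if misaligned:
        # Intervene here, or fold this observation back into the training signal.
        return f"intervene: misaligned objectives detected: {[o.description for o in misaligned]}"
    return "continue: no misaligned objectives detected"

print(oversee(model=None))  # toy call; a real model would be passed in
```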
We currently believe that this approach is one of our best hopes for tackling some of the hardest problems in alignment, such as the sharp left turn and (deep) deception.
Our current research agenda primarily aims to develop an appropriate notion of an internal objective that is probable and predictive; to use that notion to develop a theory of internal objectives and the forms they might take in future agentic systems; and then to leverage this theory to build detection methods that can identify and interpret internal objectives in such systems.
What is an objective?
In this section, we outline starting intuitions on what we think objectives are and begin to develop a notion of objectives that will form the basis of our initial theory of objectives.
We start with the observation that an agent has to select actions that lead to its target outcome via some kind of internal action-selection mechanism. This action-selection mechanism could take the form of explicit optimization (i.e., selecting an action by evaluating a set of possible actions), a heuristics-based approach, or a combination of both.
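As a toy illustration of the explicit-optimization case, the sketch below picks an action by scoring each candidate's predicted outcome against a criterion encoding the target outcome. The world model, candidate actions, and scoring function are all invented for illustration, not taken from the post.

```python
# Toy sketch of an explicit action-selection mechanism: actions are chosen by
# evaluating candidates against an internal criterion for the target outcome.
# The state, actions, and predictive model here are illustrative stand-ins.

def predicted_outcome(state: float, action: float) -> float:
    """Toy world model: predicts the resulting state after taking `action`."""
    return state + action

def objective(outcome: float, target: float = 10.0) -> float:
    """Internal criterion: higher score means the outcome is closer to the target."""
    return -abs(outcome - target)

def select_action(state: float, candidate_actions: list[float]) -> float:
    """Explicit optimization: pick the candidate whose predicted outcome best satisfies the criterion."""
    return max(candidate_actions, key=lambda a: objective(predicted_outcome(state, a)))

print(select_action(state=3.0, candidate_actions=[-1.0, 2.0, 5.0, 7.0]))  # -> 7.0
```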
This internal action-selection mechanism needs to use some criterion to decide which actions lead to the target outcome. For example, in a chess engine, Monte Carl...