Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: High-level interpretability: detecting an AI's objectives, published by Paul Colognese on September 29, 2023 on LessWrong.
Thanks to Monte MacDiarmid (for discussions, feedback, and experiment infrastructure) and to the Shard Theory team for their prior work and exploratory infrastructure.
Thanks to Joseph Bloom, John Wentworth, Alexander Gietelink Oldenziel, Johannes Treutlein, Marius Hobbhahn, Jeremy Gillen, Bilal Chughtai, Evan Hubinger, Rocket Drew, Tassilo Neubauer, Jan Betley, and Juliette Culver for discussions/feedback.
Summary
This is a brief overview of our research agenda, recent progress, and future objectives.
Having the ability to robustly detect, interpret, and modify an AI's objectives could allow us to directly solve the inner alignment problem. Our work takes a top-down approach: we aim to clarify our understanding of how objectives might exist in an AI's internals and to develop methods to detect and understand them.
This post is meant to do quite a few things:
We'll start by outlining the problem and potential solution.
We then present our initial theory on objectives.
Next, we look at some initial empirical work that shows how we hope to test theory-based predictions.
We then illustrate how we intend to go from theory to objective detection methods by producing an initial (but crude) objective detection method.
Finally, we conclude by discussing related work and future directions.
Introduction to objective detection
In this section, we outline how objective detection could be used to tackle the inner alignment problem, clarify what we mean when we refer to an internal objective, and present our initial theory on objectives.
Background
A major concern is that we may accidentally train AIs that pursue misaligned objectives. Behavioral observations alone are insufficient for confidently deducing an AI system's true objectives, in part because of the problem of deceptive alignment. We may therefore need to rely on advanced interpretability tools to deduce them.
Prior work has discussed how agentic AIs are likely to have internal objectives that are used to select actions by predicting whether those actions will lead to target outcomes. If an overseer had an objective detection method that could robustly detect and interpret all of an AI's internal objectives (in training and deployment), the overseer could confidently determine whether the system is misaligned and either intervene or use this observation as part of a training signal.
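To make the oversight loop concrete, here is a minimal sketch of how such a detection method might be used. Everything in it is hypothetical: detect_objectives, Objective, and ALIGNED_OBJECTIVES are toy stand-ins for capabilities and judgments we do not yet have, not part of any existing method.

```python
# Toy, hypothetical sketch of the oversight loop described above.
# `detect_objectives` and `ALIGNED_OBJECTIVES` stand in for capabilities and
# knowledge that do not currently exist; nothing here reflects an existing method.

from dataclasses import dataclass

@dataclass
class Objective:
    description: str  # an interpreted, human-readable summary of an internal objective

# Whatever set of objectives the overseer endorses (illustrative placeholder).
ALIGNED_OBJECTIVES = {"complete the user's task"}

def detect_objectives(model) -> list[Objective]:
    """Placeholder for the open problem: read objectives out of the model's internals."""
    return [Objective("complete the user's task")]  # toy output

def oversee(model) -> str:
    """Decide what to do with the model based on its detected internal objectives."""
    detected = detect_objectives(model)
    misaligned = [o for o in detected if o.description not in ALIGNED_OBJECTIVES]
    if misaligned:
        # Intervene here, or fold this observation back into the training signal.
        return f"intervene: misaligned objectives detected: {[o.description for o in misaligned]}"
    return "continue: no misaligned objectives detected"

print(oversee(model=None))  # toy call; a real model would be passed in
```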
We currently believe that this approach is one of our best hopes for tackling some of the hardest problems in alignment, such as the sharp left turn and (deep) deception.
Our current research agenda primarily aims to develop an appropriate notion of an internal objective that is probable and predictive; to use that notion to develop a theory of internal objectives and the forms they might take in future agentic systems; and then to leverage this theory to build detection methods that can identify and interpret internal objectives in such systems.
What is an objective?
In this section, we outline starting intuitions on what we think objectives are and begin to develop a notion of objectives that will form the basis of our initial theory of objectives.
We start with the observation that an agent has to select actions that lead to its target outcome via some kind of internal action-selection mechanism. This action-selection mechanism could take the form of explicit optimization (i.e., selecting an action by evaluating a set of possible actions), a heuristics-based approach, or a combination of both.
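As a toy illustration of the explicit-optimization case, the sketch below picks an action by scoring each candidate's predicted outcome against a criterion encoding the target outcome. The world model, candidate actions, and scoring function are all invented for illustration, not taken from the post.

```python
# Toy sketch of an explicit action-selection mechanism: actions are chosen by
# evaluating candidates against an internal criterion for the target outcome.
# The state, actions, and predictive model here are illustrative stand-ins.

def predicted_outcome(state: float, action: float) -> float:
    """Toy world model: predicts the resulting state after taking `action`."""
    return state + action

def objective(outcome: float, target: float = 10.0) -> float:
    """Internal criterion: higher score means the outcome is closer to the target."""
    return -abs(outcome - target)

def select_action(state: float, candidate_actions: list[float]) -> float:
    """Explicit optimization: pick the candidate whose predicted outcome best satisfies the criterion."""
    return max(candidate_actions, key=lambda a: objective(predicted_outcome(state, a)))

print(select_action(state=3.0, candidate_actions=[-1.0, 2.0, 5.0, 7.0]))  # -> 7.0
```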
This internal action-selection mechanism needs to use some criterion to decide which actions lead to the target outcome. For example, in a chess engine, Monte Carl...