A gentle introduction to mechanistic anomaly detection, published by Erik Jenner on April 4, 2024 on LessWrong.
TL;DR: Mechanistic anomaly detection aims to flag when an AI produces outputs for "unusual reasons." It is similar to mechanistic interpretability but doesn't demand human understanding. I give a self-contained introduction to mechanistic anomaly detection from a slightly different angle than the existing one by Paul Christiano (focused less on heuristic arguments and drawing a more explicit parallel to interpretability).
Mechanistic anomaly detection was first introduced by the Alignment Research Center (ARC), and a lot of this post is based on their ideas. However, I am not affiliated with ARC; this post represents my perspective.
Introduction
We want to create useful AI systems that never do anything too bad. Mechanistic anomaly detection relaxes this goal in two big ways:
Instead of eliminating all bad behavior from the start, we just aim to flag AI outputs online, while the system is running.
Instead of specifically flagging bad outputs, we flag any outputs that the AI produced for "unusual reasons."
These are serious simplifications. But strong methods for mechanistic anomaly detection (or MAD for short) might still be important progress toward the full goal or even achieve it entirely:
Reliably flagging bad behavior would certainly be a meaningful step (and perhaps sufficient if we can use the detector as a training signal or are just fine with discarding some outputs).
Not all the cases flagged as unusual by MAD will be bad, but the hope is that the converse holds: with the right notion of "unusual reasons," all bad cases might involve unusual reasons. Often we may be fine with flagging more cases than just the bad ones, as long as it's not excessive.
I intentionally say "unusual reasons for an output" rather than "unusual inputs" or "unusual outputs." Good and bad outputs could look indistinguishable to us if they are sufficiently complex, and inputs might have similar problems. The focus on mechanistic anomalies (or "unusual reasons") distinguishes MAD from other out-of-distribution or anomaly detection problems.
Because of this, I read the name as "[mechanistic anomaly] detection" - it's about detecting mechanistic anomalies rather than detecting any anomalies with mechanistic means.
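To make this distinction concrete, here is a minimal sketch of one simple empirical baseline (my illustration, not ARC's heuristic-arguments approach): treat the model's hidden activations as a crude stand-in for its "reasons," fit a Gaussian to activations collected on trusted inputs, and flag any output whose activations are unusually far, in Mahalanobis distance, from that reference. The function names, the use of raw activations, and the choice of distance are all assumptions made for illustration.

```python
import numpy as np

def fit_reference(trusted_acts: np.ndarray):
    """Model 'normal reasons' as a Gaussian fit to hidden activations
    (shape [n_samples, d]) gathered while the AI runs on trusted inputs."""
    mu = trusted_acts.mean(axis=0)
    cov = np.cov(trusted_acts, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # regularize an ill-conditioned covariance
    return mu, np.linalg.inv(cov)

def anomaly_score(act: np.ndarray, mu: np.ndarray, inv_cov: np.ndarray) -> float:
    """Mahalanobis distance of one activation vector from the trusted reference."""
    diff = act - mu
    return float(np.sqrt(diff @ inv_cov @ diff))

# Online use: flag an output if the internals that produced it look anomalous,
# regardless of whether the input or the output themselves look unusual.
def flag_output(act, mu, inv_cov, threshold=10.0):
    return anomaly_score(act, mu, inv_cov) > threshold
```

The point of the sketch is the interface: the detector never inspects the input or the output directly, only (a proxy for) the computation that produced the output. Whether raw activations and a Gaussian capture "reasons" well enough is, of course, the hard part.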
One intuition pump for mechanistic anomaly detection comes from mechanistic interpretability. If we understand an AI system sufficiently well, we should be able to detect, for example, when it thinks it's been deployed and executes a treacherous turn. The hope behind MAD is that human understanding isn't required and that we can detect cases like this as "mechanistically anomalous" without any reference to humans. This might make the problem much easier than if we demand human understanding.
The Alignment Research Center (ARC) is trying to formalize "reasons" for an AI's output using heuristic arguments. If successful, this theoretical approach might provide an indefinitely scalable solution to MAD. Collaborators and I are working on a more empirical approach to MAD that is not centered on heuristic arguments, and this post gives a self-contained introduction that might be better suited to that perspective (and perhaps helpful for readers with an interpretability background).
Thanks to Viktor Rehnberg, Oliver Daniels-Koch, Jordan Taylor, Mark Xu, Alex Mallen, and Lawrence Chan for feedback on a draft!
Mechanistic anomaly detection as an alternative to interpretability: a toy example
As a toy example, let's start with the SmartVault setting from the ELK report. SmartVault is a vault housing a diamond that we want to protect from robbers. We would like an AI to use various actuators to keep the diamond safe by stopping any robbers. There is a camera pointed at the diamond, which we want to use to check that the diamond is still there.