Download - AF - SAEs Discover Meaningful Features in the IOI Task by Alex Makelov

Discover

Podcast Features
Your all-in-one podcasting solution.

Podcast Studio
Easy-to-use audio recorder app.
Livestream
High-performing audio live, without limits.

Podcast App
The best podcast player & podcast app.
Podbean AI
AI-Enhanced Audio Quality and Content Generation.

Ads Marketplace
Join Ads Marketplace to earn money
through sponsorship on your podcast.

PodAds
Manage your ads with dynamic ad insertion capability.
Patron & Paid Content
The seamless way for fans to support you directly
from your podcast.
Apple Podcasts Subscriptions Integration
Effortlessly publish and manage exclusive episodes for your
Apple Podcasts subscribers directly from Podbean.

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Monetization
Enterprise
Pricing
Discover

The Nonlinear Library

Education

AF - SAEs Discover Meaningful Features in the IOI Task by Alex Makelov

2024-06-05

Download Right click and do "save link as"

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SAEs Discover Meaningful Features in the IOI Task, published by Alex Makelov on June 5, 2024 on The AI Alignment Forum.
TLDR: recently, we wrote a paper proposing several evaluations of SAEs against "ground-truth" features computed w/ supervision for a given task (in our case, IOI [1]). However, we didn't optimize the SAEs much for performance in our tests. After putting the paper on arxiv, Alex carried out a more exhaustive search for SAEs that do well on our test for controlling (a.k.a. steering) model output with SAE features. The results show that:
SAEs trained on IOI data find interpretable features that come close to matching supervised features (computed with knowledge of the IOI circuit) for the task of editing representations to steer the model.
Gated SAEs outperform vanilla SAEs across the board for steering
SAE training metrics like sparsity and loss recovered significantly correlate with how good representation edits are. In particular, sparsity is more strongly correlated than loss recovered.
Partial Paper Recap: Towards More Objective SAE Evals
Motivation: SAE Evals Are Too Indirect
We train SAEs with the goal of finding the true features in LLM representations - but currently, "true features" is more of a vague direction than a well-defined concept in mech interp research. SAE evaluations mostly use indirect measures of performance - ones we hope correlate with the features being the "true" ones, such as the ℓ0 (sparsity) loss, the LLM loss recovered when using SAE reconstructions, and how interpretable the features are.
This leaves a big gap in our understanding of the usefulness of SAEs and similar unsupervised methods; it also makes it hard to objectively compare different SAE architectures and/or training algorithms.
So, we wanted to develop more objective SAE evaluations, by benchmarking SAEs against features that we know to be meaningful through other means, even if in a narrow context. We chose the IOI task, as it's perhaps the most well-studied example of a non-trivial narrow capability in a real-world LLM (GPT2-Small).
We set out to compute a "skyline" for SAE performance: an object of the same "type" as an SAE - a "sparse feature dictionary" - which is constructed and validated "by hand" using our very precise knowledge about IOI. Such an object would allow us to evaluate how close a given SAE is to the limit of what's afforded by its representational power.
The IOI circuit (copy of Figure 2 from the IOI paper [1]).
Creating Our Own Feature Dictionaries for the IOI Task With Supervision
Following the prior work by Wang et al [1] that discovered the IOI circuit, we conjectured that internal LLM activations for an IOI prompt p (e.g., "When Mary and John went to the store, John gave a book to") can be described using the following three attributes:
IO(p): the indirect object token (" Mary" in our example)
S(p): the subject token (" John" in our example)
Pos(p): whether the IO token comes first or second in the sentence (1st in our example; the alternative would be "When John and Mary went...")
And indeed, we found that intermediate activations of the model at a given site (e.g., the output of some attention head) for a prompt p can be approximated as[1]
activation(p)Ep'IOI(activation(p'))+vIO=IO(p)+vS=S(p)+vPos=Pos(p)
where the vectors vIO=...,… form a "supervised sparse feature dictionary" that we construct using our prior knowledge about the IOI circuit[2]. In fact, these vectors can be chosen in a very simple way as the (centered) conditional mean, e.g.
vIO=Mary=EpIOI(activation(p)|IO(p)=" Mary")EpIOI(activation(p))
Not just that, but we can use these vectors for editing individual attributes' values in internal model states in a natural way via feature arithmetic, e.g. to change the IO from " Mary" to " Mike", we can use the activation
aedit...