Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: I am the Golden Gate Bridge, published by Zvi on May 27, 2024 on LessWrong.
Easily Interpretable Summary of New Interpretability Paper
Anthropic has identified (full paper here) how millions of concepts are represented inside Claude Sonnet, their current middleweight model. The features activate across modalities and languages as tokens approach the associated context. This scales up previous findings from smaller models.
By looking at which neurons fire together, they defined a distance measure between features. So the Golden Gate Bridge is close to various San Francisco and California things, inner conflict sits near various related concepts, and so on.
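To make that "distance between features" idea concrete, here is a minimal sketch, assuming each feature comes with an associated direction vector (the paper derives these from a sparse autoencoder, described further down) and that closeness is measured by cosine similarity. The array names and sizes are illustrative, not Anthropic's.

```python
import numpy as np

# Hypothetical feature direction vectors, one row per feature.
rng = np.random.default_rng(0)
num_features, d_model = 1_000, 512
feature_dirs = rng.normal(size=(num_features, d_model))

# Normalize rows so that dot products are cosine similarities.
feature_dirs /= np.linalg.norm(feature_dirs, axis=1, keepdims=True)

def nearest_features(query_idx: int, k: int = 5) -> np.ndarray:
    """Return the indices of the k features closest to the query feature."""
    sims = feature_dirs @ feature_dirs[query_idx]
    sims[query_idx] = -np.inf          # exclude the feature itself
    return np.argsort(-sims)[:k]

print(nearest_features(42))  # e.g. the neighborhood around a 'Golden Gate Bridge' feature
```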
Then it gets more interesting.
Importantly, we can also manipulate these features, artificially amplifying or suppressing them to see how Claude's responses change.
If you sufficiently amplify the feature for the Golden Gate Bridge, Claude starts to think it is the Golden Gate Bridge. As in, it thinks it is the physical bridge, and also it gets obsessed, bringing it up in almost every query.
If you amplify a feature that fires when reading a scam email, you can get Claude to write scam emails.
Turn up sycophancy, and it will go well over the top telling you how great you are.
They note they have discovered features corresponding to various potential misuses, forms of bias and things like power-seeking, manipulation and secrecy.
That means that, if you had the necessary access and knowledge, you could amplify such features.
Like most powers, this one could potentially be used for good or evil. They speculate you could watch the impact on features during fine-tuning, or turn down or even entirely remove undesired features. Or amplify desired ones. Checking for certain patterns is proposed as a 'test for safety,' which seems useful but also is playing with fire.
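How might amplifying or suppressing a feature work mechanically? A minimal sketch, assuming you have a unit-norm decoder direction for the feature and a hook into the model's residual stream at the relevant layer; the function and variable names are illustrative, not Anthropic's actual tooling.

```python
import torch

def steer_with_feature(resid: torch.Tensor,
                       feature_dir: torch.Tensor,
                       strength: float) -> torch.Tensor:
    """Add a scaled feature direction to residual-stream activations.

    resid:       activations of shape (batch, seq, d_model)
    feature_dir: decoder direction of shape (d_model,)
    strength:    positive to amplify the feature, negative to suppress it
    """
    feature_dir = feature_dir / feature_dir.norm()
    return resid + strength * feature_dir

# Usage: register something like this as a forward hook on the chosen layer,
# so every forward pass nudges the model along (or away from) the feature direction.
```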
They have a short part at the end comparing their work to other methods. They note that dictionary learning need only happen once per model, that the additional work after that is typically inexpensive and fast, and that it lets you look for anything at all and find the unexpected. It is a big deal that this allows you to be surprised.
They think this has big advantages over old strategies such as linear probes, even if those strategies still have their uses.
One Weird Trick
You know what AI labs are really good at?
Scaling. It is their one weird trick.
So guess what Anthropic did here? They scaled the autoencoders to Claude Sonnet.
Our general approach to understanding Claude 3 Sonnet is based on the linear representation hypothesis (see e.g.) and the superposition hypothesis. For an introduction to these ideas, we refer readers to the Background and Motivation section of Toy Models. At a high level, the linear representation hypothesis suggests that neural networks represent meaningful concepts - referred to as features - as directions in their activation spaces. The superposition hypothesis accepts the idea of linear representations and further hypothesizes that neural networks use the existence of almost-orthogonal directions in high-dimensional spaces to represent more features than there are dimensions.
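The "almost-orthogonal directions" point is easy to check numerically: in a high-dimensional space you can pack far more nearly orthogonal unit vectors than there are dimensions. A quick illustrative sketch (my own, not from the paper), with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 10_000            # 10,000 "features" in a 512-dimensional space

dirs = rng.normal(size=(n, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Sample random pairs and measure how far from orthogonal they are.
i, j = rng.integers(0, n, size=(2, 5_000))
mask = i != j                  # drop the rare self-pairings
cos = np.abs(np.sum(dirs[i] * dirs[j], axis=1))[mask]
print(cos.mean(), cos.max())   # small values: the pairs are all close to orthogonal
```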
If one believes these hypotheses, the natural approach is to use a standard method called dictionary learning.
…
Our SAE consists of two layers. The first layer ("encoder") maps the activity to a higher-dimensional layer via a learned linear transformation followed by a ReLU nonlinearity. We refer to the units of this high-dimensional layer as "features." The second layer ("decoder") attempts to reconstruct the model activations via a linear transformation of the feature activations.
The model is trained to minimize a combination of (1) reconstruction error and (2) an L1 regularization penalty on the feature activations, which incentivizes sparsity.
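A minimal PyTorch sketch of that setup, following the description above (a linear encoder with a ReLU, a linear decoder, and a loss combining reconstruction error with an L1 sparsity penalty); the dimensions, coefficient, and names are placeholders rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # activations -> features
        self.decoder = nn.Linear(n_features, d_model)   # features -> reconstruction

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))       # sparse feature activations
        recon = self.decoder(features)
        return recon, features

def sae_loss(acts, recon, features, l1_coeff: float = 1e-3):
    recon_error = (recon - acts).pow(2).mean()          # (1) reconstruction error
    sparsity = features.abs().mean()                    # (2) L1 penalty on features
    return recon_error + l1_coeff * sparsity

# Training step on a batch of residual-stream activations `acts`:
# recon, feats = sae(acts); loss = sae_loss(acts, recon, feats); loss.backward(); ...
```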
Once the S...