Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: I am the Golden Gate Bridge, published by Zvi on May 27, 2024 on LessWrong.
Easily Interpretable Summary of New Interpretability Paper
Anthropic has identified (full paper here) how millions of concepts are represented inside Claude Sonnet, their current middleweight model. The features activate across modalities and languages as tokens approach the associated context. This scales up previous findings from smaller models.
By looking at which neurons fire together, they defined a distance measure between features. So the Golden Gate Bridge is close to various San Francisco and California things, inner conflict sits near various related concepts, and so on.
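To make that "distance between features" idea concrete, here is a minimal sketch, assuming each feature comes with an associated direction vector (the paper derives these from a sparse autoencoder, described further down) and that closeness is measured by cosine similarity. The array names and sizes are illustrative, not Anthropic's.

```python
import numpy as np

# Hypothetical feature direction vectors, one row per feature.
rng = np.random.default_rng(0)
num_features, d_model = 1_000, 512
feature_dirs = rng.normal(size=(num_features, d_model))

# Normalize rows so that dot products are cosine similarities.
feature_dirs /= np.linalg.norm(feature_dirs, axis=1, keepdims=True)

def nearest_features(query_idx: int, k: int = 5) -> np.ndarray:
    """Return the indices of the k features closest to the query feature."""
    sims = feature_dirs @ feature_dirs[query_idx]
    sims[query_idx] = -np.inf          # exclude the feature itself
    return np.argsort(-sims)[:k]

print(nearest_features(42))  # e.g. the neighborhood around a 'Golden Gate Bridge' feature
```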
Then it gets more interesting.
Importantly, we can also manipulate these features, artificially amplifying or suppressing them to see how Claude's responses change.
If you sufficiently amplify the feature for the Golden Gate Bridge, Claude starts to think it is the Golden Gate Bridge. As in, it thinks it is the physical bridge, and also it gets obsessed, bringing it up in almost every query.
If you amplify a feature that fires when reading a scam email, you can get Claude to write scam emails.
Turn up sycophancy, and it will go well over the top telling you how great you are.
They note they have discovered features corresponding to various potential misuses, forms of bias and things like power-seeking, manipulation and secrecy.
That means that, if you had the necessary access and knowledge, you could amplify such features.
Like most powers, this one could potentially be used for good or evil. They speculate you could watch the impact on features during fine-tuning, or turn down or even entirely remove undesired features. Or amplify desired ones. Checking for certain patterns is proposed as a 'test for safety,' which seems useful but also is playing with fire.
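How might amplifying or suppressing a feature work mechanically? A minimal sketch, assuming you have a unit-norm decoder direction for the feature and a hook into the model's residual stream at the relevant layer; the function and variable names are illustrative, not Anthropic's actual tooling.

```python
import torch

def steer_with_feature(resid: torch.Tensor,
                       feature_dir: torch.Tensor,
                       strength: float) -> torch.Tensor:
    """Add a scaled feature direction to residual-stream activations.

    resid:       activations of shape (batch, seq, d_model)
    feature_dir: decoder direction of shape (d_model,)
    strength:    positive to amplify the feature, negative to suppress it
    """
    feature_dir = feature_dir / feature_dir.norm()
    return resid + strength * feature_dir

# Usage: register something like this as a forward hook on the chosen layer,
# so every forward pass nudges the model along (or away from) the feature direction.
```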
They have a short part at the end comparing their work to other methods. They note that dictionary learning need only happen once per model, that the additional work after that is typically inexpensive and fast, and that it lets you look for anything at all and find the unexpected. It is a big deal that this allows you to be surprised.
They think this has big advantages over old strategies such as linear probes, even if those strategies still have their uses.
One Weird Trick
You know what AI labs are really good at?
Scaling. It is their one weird trick.
So guess what Anthropic did here? They scaled the autoencoders to Claude Sonnet.
Our general approach to understanding Claude 3 Sonnet is based on the linear representation hypothesis (see e.g.) and the superposition hypothesis. For an introduction to these ideas, we refer readers to the Background and Motivation section of Toy Models. At a high level, the linear representation hypothesis suggests that neural networks represent meaningful concepts - referred to as features - as directions in their activation spaces. The superposition hypothesis accepts the idea of linear representations and further hypothesizes that neural networks use the existence of almost-orthogonal directions in high-dimensional spaces to represent more features than there are dimensions.
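The "almost-orthogonal directions" point is easy to check numerically: in a high-dimensional space you can pack far more nearly orthogonal unit vectors than there are dimensions. A quick illustrative sketch (my own, not from the paper), with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 10_000            # 10,000 "features" in a 512-dimensional space

dirs = rng.normal(size=(n, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Sample random pairs and measure how far from orthogonal they are.
i, j = rng.integers(0, n, size=(2, 5_000))
mask = i != j                  # drop the rare self-pairings
cos = np.abs(np.sum(dirs[i] * dirs[j], axis=1))[mask]
print(cos.mean(), cos.max())   # small values: the pairs are all close to orthogonal
```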
If one believes these hypotheses, the natural approach is to use a standard method called dictionary learning.
…
Our SAE consists of two layers. The first layer ("encoder") maps the activity to a higher-dimensional layer via a learned linear transformation followed by a ReLU nonlinearity. We refer to the units of this high-dimensional layer as "features." The second layer ("decoder") attempts to reconstruct the model activations via a linear transformation of the feature activations.
The model is trained to minimize a combination of (1) reconstruction error and (2) an L1 regularization penalty on the feature activations, which incentivizes sparsity.
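A minimal PyTorch sketch of that setup, following the description above (a linear encoder with a ReLU, a linear decoder, and a loss combining reconstruction error with an L1 sparsity penalty); the dimensions, coefficient, and names are placeholders rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # activations -> features
        self.decoder = nn.Linear(n_features, d_model)   # features -> reconstruction

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))       # sparse feature activations
        recon = self.decoder(features)
        return recon, features

def sae_loss(acts, recon, features, l1_coeff: float = 1e-3):
    recon_error = (recon - acts).pow(2).mean()          # (1) reconstruction error
    sparsity = features.abs().mean()                    # (2) L1 penalty on features
    return recon_error + l1_coeff * sparsity

# Training step on a batch of residual-stream activations `acts`:
# recon, feats = sae(acts); loss = sae_loss(acts, recon, feats); loss.backward(); ...
```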
Once the S...