Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Superposition is not "just" neuron polysemanticity, published by Lawrence Chan on April 26, 2024 on The AI Alignment Forum.
TL;DR: In this post, I distinguish between two related concepts in neural network interpretability: polysemanticity and superposition. Neuron polysemanticity is the observed phenomenon that many neurons seem to fire (have large, positive activations) on multiple unrelated concepts.
Superposition is a specific explanation for neuron (or attention head) polysemanticity, where a neural network represents more sparse features than it has neurons (or attention heads/attention head dimensions) by placing them in near-orthogonal directions. I provide three ways neurons/attention heads can be polysemantic without superposition: non-neuron-aligned orthogonal features, non-linear feature representations, and compositional representation without features.
I conclude by listing a few reasons why it might be important to distinguish the two concepts.
Epistemic status: I wrote this "quickly" in about 12 hours, as otherwise it wouldn't have come out at all. Think of it as a (failed) experiment in writing brief and unpolished research notes, along the lines of GDM or Anthropic Interp Updates.
Introduction
Meaningfully interpreting neural networks involves decomposing them into smaller interpretable components. For example, we might hope to look at each neuron or attention head, explain what that component is doing, and then compose our understanding of individual components into a mechanistic understanding of the model's behavior as a whole.
It would be very convenient if the natural subunits of neural networks - neurons and attention heads - were monosemantic - that is, if each component corresponded to "a single concept". Unfortunately, by default, both neurons and attention heads seem to be polysemantic: many of them seemingly correspond to multiple unrelated concepts. For example, out of 307k neurons in GPT-2, GPT-4 was able to generate short explanations that captured over 50% of the variance for only 5203 neurons, and a quick glance at OpenAI Microscope reveals many examples of neurons in vision models that fire on unrelated clusters such as "poetry" and "dice".
One explanation for polysemanticity is the superposition hypothesis: polysemanticity occurs because models are (approximately) linearly representing more features[1] than their activation space has dimensions (i.e. place features in superposition). Since there are more features than neurons, it immediately follows that some neurons must correspond to more than one feature.[2]
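To make the counting argument concrete, here is a minimal numpy sketch (my own illustration, not code from the post, with arbitrary sizes): pack 10 sparse features into a 5-dimensional activation space as near-orthogonal directions. Because there are more features than neurons, the feature directions cannot all line up with the neuron basis, so a single active feature produces activation on many neurons - the neurons look polysemantic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 10, 5  # hypothetical sizes: more features than dimensions

# Feature directions: random unit vectors in the 5-d activation space.
# With more features than dimensions they can only be *near*-orthogonal.
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Interference between distinct features is small but nonzero.
gram = W @ W.T - np.eye(n_features)
print("max |cosine| between distinct features:", np.abs(gram).max())

# A sparse input where only feature 3 is active...
x = np.zeros(n_features)
x[3] = 1.0
activations = x @ W

# ...still produces nonzero values on every neuron: each neuron reads off
# a mixture of several features, i.e. it appears polysemantic.
print("neuron activations:", np.round(activations, 2))
```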
It's worth noting that most written resources on superposition clearly distinguish between the two terms. For example, in the seminal Toy Models of Superposition,[3] Elhage et al. write:
Why are we interested in toy models? We believe they are useful proxies for studying the superposition we suspect might exist in real neural networks. But how can we know if they're actually a useful toy model? Our best validation is whether their predictions are consistent with empirical observations regarding polysemanticity.
(Source)
Similarly, Neel Nanda's mech interp glossary explicitly notes that the two concepts are distinct:
Subtlety: Neuron superposition implies polysemanticity (since there are more features than neurons), but not the other way round. There could be an interpretable basis of features, just not the standard basis - this creates polysemanticity but not superposition.
(Source)
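As a minimal sketch of this "interpretable but non-standard basis" case (again my own illustrative numpy example, with arbitrary sizes): take exactly as many features as neurons, so there is no superposition, but represent them along orthogonal directions that are not neuron-aligned. Every neuron then responds to several features and looks polysemantic, yet rotating back recovers a clean one-feature-per-direction basis.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # as many features as neurons -> no superposition

# A random orthogonal matrix: features are represented along orthogonal
# directions that are *not* the standard (neuron-aligned) basis.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Activate a single feature...
x = np.zeros(d)
x[0] = 1.0
activations = x @ Q

# ...and every neuron lights up, so each neuron appears polysemantic,
# even though rotating by Q.T recovers the features exactly.
print("neuron activations:", np.round(activations, 2))
print("recovered features:", np.round(activations @ Q.T, 2))
```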
However, I've noticed empirically that many researchers and grantmakers conflate the two concepts, which often causes communication issues or even confused research proposals.
Consequently, this post tries to more clearly point at the distinction and explain why it might matter. I start by discussing the two terms in more detail, give a few examples of why you might have po...