Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Do sparse autoencoders find "true features"?, published by Demian Till on February 22, 2024 on LessWrong.
In this post I'll discuss an apparent limitation of sparse autoencoders (SAEs) in their current formulation as they are applied to discovering the latent features within AI models such as transformer-based LLMs. In brief, I'll cover the following:
I'll argue that the L1 regularisation used to promote sparsity when training SAEs may cause neurons in the sparse layer to learn to represent common combinations of features rather than the individual features that we want them to discover (a minimal sketch of this SAE-plus-L1 setup follows this overview)
Besides making it more difficult to understand what the actual latent features are, I'll argue that this limitation may also result in some less common latent features not being discovered at all, not even within combinations
I'll then explain why the phenomenon of feature splitting observed in Anthropic's SAE paper appears to demonstrate that this limitation does indeed have a large impact on the features discovered by SAEs
Finally, I'll propose an approach for overcoming this limitation and discuss how we can test whether it really brings us closer to finding the real latent features
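To keep the mechanism in question concrete, here is a minimal sketch of the kind of SAE setup being discussed: an overcomplete linear encoder/decoder trained to reconstruct a model's hidden activations, with an L1 penalty on the sparse activations to promote sparsity. This is only an illustrative PyTorch sketch, not the architecture or hyperparameters from any particular paper; names and values such as hidden_dim, sparse_dim and l1_weight are arbitrary choices.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Minimal SAE: encode activations into an overcomplete sparse layer, then reconstruct them.
    def __init__(self, hidden_dim, sparse_dim):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, sparse_dim)
        self.decoder = nn.Linear(sparse_dim, hidden_dim)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)          # reconstruction of the original activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_weight=1e-3):
    # Reconstruction error plus the L1 penalty that pushes feature activations towards zero.
    reconstruction = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return reconstruction + l1_weight * sparsity

sae = SparseAutoencoder(hidden_dim=512, sparse_dim=4096)
x = torch.randn(8, 512)  # stand-in for activations collected from the target model
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)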
Rough definition of "true features"
We intend for SAEs to discover the "true features" (a term I'm borrowing from Anthropic's SAE paper) used by the target model, e.g. a transformer-based LLM. There isn't a universally accepted definition of what "true features" are, but for now I'll use the term somewhat loosely to refer to something like:
linear directions in an activation space at a hidden layer within a target model which encode some reasonably monosemantic quantity such as the model's "confidence" in some concept being in play
they should play a causal role in the functioning of the target model. So, for example, if we were to activate or deactivate the feature while the target model is processing a given input sequence, then we should expect the outputs to change accordingly in some reasonably understandable way (see the sketch after this list)
they should be in their most atomic form, so that e.g. an arbitrary linear combination of two "true feature" directions is not necessarily itself a "true feature" direction, even though it may satisfy the previous criteria
There may be other ways of thinking about features but this should give us enough to work with for our current purposes.
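To make the first two criteria slightly more concrete, here is a rough sketch of what reading off and intervening on a linear feature direction could look like. Everything below (the direction, the activation, the scaling factor) is a random stand-in for illustration rather than anything taken from a real model.

import torch

hidden_dim = 512
feature_dir = torch.randn(hidden_dim)
feature_dir = feature_dir / feature_dir.norm()  # hypothetical unit-norm "true feature" direction

activation = torch.randn(hidden_dim)  # stand-in for one of the target model's hidden activations

# Reading the feature: its strength is the projection of the activation onto the direction.
strength = activation @ feature_dir

# Intervening on the feature: remove its contribution (deactivate) or boost it (activate),
# then continue the forward pass with the edited activation and observe how the outputs change.
deactivated = activation - strength * feature_dir
boosted = activation + 2.0 * feature_dir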
Why SAEs are incentivised to discover combinations of features rather than individual features
Consider a toy setup where one of the hidden layers in the target model has 3 "true features" represented by the following directions in its activation space:
Additionally, suppose that feature 1 and feature 2 occur far more frequently than feature 3, and that all features can potentially co-occur in a given activation vector. For the sake of simplicity let's also suppose for now that when features 1 & 2 occur together they tend to both activate with some roughly fixed proportions. For example, an activation vector in which both features 1 and 2 are present (but not feature 3) might look like the following:
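roughly, a weighted sum of the feature 1 and feature 2 directions with no contribution from feature 3. The specific direction vectors and amplitudes from the original figures aren't assumed below; the sketch just uses three orthogonal stand-in directions, with illustrative frequencies and proportions, to generate toy activation vectors of this kind.

import torch

hidden_dim = 8
# Hypothetical "true feature" directions: features 1, 2 and 3 as orthogonal unit vectors.
true_features = torch.eye(hidden_dim)[:3]

def sample_activation():
    # Features 1 & 2 are common (and often co-occur with fixed 1.0 : 0.5 amplitudes); feature 3 is rare.
    # All probabilities and amplitudes here are arbitrary illustrative choices.
    x = torch.zeros(hidden_dim)
    if torch.rand(1) < 0.5:
        x = x + 1.0 * true_features[0]
    if torch.rand(1) < 0.5:
        x = x + 0.5 * true_features[1]
    if torch.rand(1) < 0.05:
        x = x + 1.0 * true_features[2]
    return x

# An activation vector with features 1 & 2 present but not feature 3:
example = 1.0 * true_features[0] + 0.5 * true_features[1]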
Now suppose we train an SAE with 3 neurons in the sparse layer on activation vectors from this hidden layer, such as the one above. The desirable outcome is that each of the 3 neurons in the sparse layer learns one of the 3 "true features". If this happens, then the directions learnt by the SAE would mirror the directions of the "true features" in the target model, looking something like:
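roughly speaking, each sparse neuron's decoder direction lining up with one of the three "true feature" directions, up to scaling and permutation. Continuing from the SAE and toy-data sketches above (all hyperparameters below are arbitrary illustrative choices), this can be checked directly after training: the matrix of cosine similarities between the learned decoder directions and the stand-in feature directions should then be close to a permutation matrix.

# Train a 3-neuron SAE on toy activations drawn from the sampler above.
sae = SparseAutoencoder(hidden_dim=hidden_dim, sparse_dim=3)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for step in range(5000):
    x = torch.stack([sample_activation() for _ in range(256)])
    x_hat, f = sae(x)
    loss = sae_loss(x, x_hat, f, l1_weight=1e-2)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Compare each learned decoder direction with each stand-in "true feature" direction.
decoder_dirs = sae.decoder.weight.T.detach()
decoder_dirs = decoder_dirs / decoder_dirs.norm(dim=-1, keepdim=True)
print(decoder_dirs @ true_features.T)  # close to a permutation matrix in the desirable outcome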
However, depending on the respective frequencies of feature 3 vs features 1 & 2, as well as the value of the L1 regularisation weight, I will argue shortly that what may instead happen is that two of the neurons learn to detect when features 1 and 2 each occur by themselves, while the third neuron learns to detect when they both occur together. In this case the di...