Anthropic announces interpretability advances. How much does this advance alignment? Published by Seth Herd on May 22, 2024 on LessWrong.
Anthropic just published a pretty impressive set of results in interpretability. This raises some questions and a concern for me: interpretability helps, but it isn't alignment, right? It seems to me as though the vast bulk of alignment funding is now going to interpretability. Who is thinking about how to leverage interpretability into alignment?
It intuitively seems as though we are better off the more we understand the cognition of foundation models. I think this is true, but there are sharp limits: it will be impossible to track the full cognition of an AGI, and simply knowing what it's thinking about will be inadequate to know whether it's making plans you like. One can think about bioweapons, for instance, to either produce them or prevent producing them. More on these at the end; first a brief summary of their results.
In this work, they located interpretable features in Claude 3 Sonnet using sparse autoencoders, and manipulated model behavior by using those features as steering vectors. They found features for subtle concepts; among those they highlight:
The Golden Gate Bridge 34M/31164353: Descriptions of or references to the Golden Gate Bridge.
Brain sciences 34M/9493533: discussions of neuroscience and related academic research on brains or minds.
Monuments and popular tourist attractions 1M/887839.
Transit infrastructure 1M/3. [links to examples]
...
We also find more abstract features - responding to things like bugs in computer code, discussions of gender bias in professions, and conversations about keeping secrets.
...we found features corresponding to:
Capabilities with misuse potential (code backdoors, developing biological weapons)
Different forms of bias (gender discrimination, racist claims about crime)
Potentially problematic AI behaviors (power-seeking, manipulation, secrecy)
Presumably, the existence of such features will surprise nobody who's used and thought about large language models. It is difficult to imagine how they would do what they do without using representations of subtle and abstract concepts.
They used a dictionary learning approach and found distributed representations of features:
Our general approach to understanding Claude 3 Sonnet is based on the linear representation hypothesis and the superposition hypothesis
from the publication, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Or to put it more plainly:
It turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts.
Representations in the brain definitely follow that description, and their structure seems pretty similar, as far as we can guess from animal studies and limited data on human language use.
They also include a fascinating image of near neighbors to the feature for internal conflict (see header image).
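To make the dictionary-learning setup above a bit more concrete, here is a minimal sketch of a sparse autoencoder trained on cached model activations. The layer choice, dimensions, sparsity penalty, and training details are illustrative assumptions on my part, not the paper's actual settings (their dictionaries scale up to roughly 34M features).

```python
# Minimal sparse-autoencoder (dictionary learning) sketch over cached model
# activations. All sizes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, n_features: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)  # columns are feature directions (the dictionary)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruct activations faithfully while keeping few features active at once.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Training sketch: `activation_batches` would come from running the model over
# text and caching residual-stream activations at a chosen middle layer.
sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
# for acts in activation_batches:
#     features, recon = sae(acts)
#     loss = sae_loss(acts, features, recon)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Each learned feature is a direction in activation space rather than a single neuron, which is how far more features than neurons can coexist, per the superposition picture above.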
So, back to the broader question. It is clear how this type of interpretability helps with AI safety: being able to monitor when the model is activating features for things like bioweapons, and to use those features as steering vectors, can help control its behavior.
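As a rough illustration of the steering side, here is a sketch of adding a scaled feature direction (a decoder column from an SAE like the one above) into the residual stream via a forward hook. The layer, feature index, and scale are hypothetical, and it assumes the hooked module returns a plain activation tensor.

```python
# Sketch of feature steering: add a scaled, normalized feature direction to the
# residual stream at one layer. Layer index, feature index, and scale are
# hypothetical; assumes the hooked module returns a plain activation tensor.
import torch

def make_steering_hook(feature_direction: torch.Tensor, scale: float = 5.0):
    direction = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        # output: residual-stream activations of shape (batch, seq, d_model)
        return output + scale * direction

    return hook

# direction = sae.decoder.weight[:, feature_idx]   # d_model-dim direction for one feature
# handle = model.layers[middle_layer].register_forward_hook(make_steering_hook(direction))
# ... generate text with the steered model, then handle.remove()
```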
It is not clear to me how this generalizes to AGI, and I am concerned that too few of us are thinking about this. It seems pretty apparent that detecting lying would dramatically help in pretty much any conceivable plan for technical alignment of AGI. But being able to monitor the entire thought process of a being smarter than us seems impossible on the face of it.
I think the hope is that we can detect and monitor cognition that is about dangerous topics, so we don't need to follow its full train of thought.
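A minimal version of that kind of monitoring might look like the sketch below: run cached activations through the SAE and flag when hand-picked features fire strongly. The feature indices, threshold, and downstream handling are made-up placeholders.

```python
# Sketch of feature monitoring: watch a few flagged SAE features and alert when
# any fires strongly. Indices, threshold, and handling are placeholders.
import torch

FLAGGED_FEATURES = {"bioweapons": 12345, "deception": 67890}  # hypothetical feature indices
THRESHOLD = 4.0

def check_activations(sae, activations: torch.Tensor) -> dict:
    features, _ = sae(activations)  # shape (batch, seq, n_features)
    alerts = {}
    for name, idx in FLAGGED_FEATURES.items():
        peak = features[..., idx].max().item()
        if peak > THRESHOLD:
            alerts[name] = peak
    return alerts

# alerts = check_activations(sae, cached_activations)
# if alerts:
#     escalate_for_review(alerts)  # hypothetical downstream handling
```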
If we can tell what an AGI is thinking ...