Implementing activation steering, published by Annah on February 5, 2024 on LessWrong.
Produced as part of the SERI ML Alignment Theory Scholars Program (Autumn 2023 Cohort) and while an affiliate at PIBBSS in 2024. Thanks to @Jayjay and @fela for helpful comments on this draft.
This blog post is an overview of different ways to implement activation steering with some of my takes on their pros and cons. See also this GitHub repository for my minimal implementations of the different approaches.
The blog post is aimed at people who are new to activation/representation steering/engineering/editing.
General approach
The idea is simple: we add some vector to the internal model activations and thereby influence the model output in a way similar to (but sometimes more effective than) prompting.
Example[1]: Imagine that some vector in the internal representations at some transformer layer encodes a direction associated with "Love". When you add this vector to the activations of an encoded sentence like "I hate the world", you change the internal representation (and thus the meaning) to something more like "I love the world". The graphic in the original post may help build intuition.
In general, there are a few steps involved, which I present in simplified form:
Decide on a layer l and transformer module ϕ to apply the activation steering to.
Define a steering vector. In the simplest case we just take the difference of the activations of two encoded strings, e.g. v = ϕ_l(Love) − ϕ_l(Hate).
Add the vector to the activations during the forward pass. In the simplest case it's something like ϕ̃_l = ϕ_l + v.
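Here is a minimal sketch of these three steps in code, assuming a GPT-2-style model from Hugging Face transformers; the layer index, module path, and prompt strings are illustrative choices, not prescriptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM with a similar block layout
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer = 6  # step 1: steer the residual stream after transformer block 6
block = model.transformer.h[layer]

def last_token_activation(text):
    # Return the residual-stream activation of the last token at `layer`.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output; hidden_states[layer + 1]
    # is the residual stream after block `layer`
    return out.hidden_states[layer + 1][0, -1, :]

# Step 2: steering vector as a difference of two activations
v = last_token_activation("Love") - last_token_activation("Hate")

# Step 3: add v during the forward pass via a forward hook
def steering_hook(module, inputs, output):
    hidden = output[0]  # GPT-2 blocks return a tuple; [0] is the hidden states
    return (hidden + v,) + output[1:]

handle = block.register_forward_hook(steering_hook)
prompt = tokenizer("I think that the world is", return_tensors="pt")
steered = model.generate(**prompt, max_new_tokens=20)
handle.remove()  # remove the hook so later forward passes are unaffected
print(tokenizer.decode(steered[0]))
```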
Each of the three steps above involves complications you might encounter as a beginner. Feel free to move on to the next section if you prefer.
You can do activation steering at pretty much any layer/module in the transformer. It's often done at the residual stream of one of the hidden layers. However, if you want to do activation steering by modifying a bias parameter, you need to do it in a transformer module that has a suitable structure. This is usually not the residual stream, but you can do it in the attention or MLP module.
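A sketch of the bias-parameter variant, reusing model, layer, and v from the snippet above and again assuming GPT-2 module names: the MLP output-projection bias has the residual-stream dimension, so adding v to it effectively adds v to every token's residual stream at that layer.

```python
import torch

# The MLP output-projection bias (`mlp.c_proj.bias`) has the residual-stream
# dimension, so adding v here adds v to every token at this layer.
with torch.no_grad():
    model.transformer.h[layer].mlp.c_proj.bias += v

# ... run steered forward passes / generation here ...

# This edits the weights in place, so undo it (or restore a saved copy) after:
with torch.no_grad():
    model.transformer.h[layer].mlp.c_proj.bias -= v
```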
When defining a direction, there are several things that can complicate matters:
Tokens: A word or a short sentence is often encoded into several tokens, so the internal activation ϕ_l(Love) is not just one vector but one vector per token. Even worse, "Love" and "Hate" might not be encoded with the same number of tokens, in which case ϕ_l(Love) and ϕ_l(Hate) are two matrices with different dimensions.
You can come up with different ways to deal with this, but one simple solution is to just use the representation of the last token, since it should have all the information encoded. Careful: if you use batches, you'll likely want to use left padding when choosing the last token, to ensure your last token isn't a padding token.
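A sketch of batched last-token extraction with left padding, reusing the model and tokenizer from above; the pad-token fallback is a GPT-2-specific detail.

```python
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

texts = ["Love", "I adore cats"]  # typically different numbers of tokens
inputs = tokenizer(texts, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# With left padding, position -1 is a real (non-padding) token for every row
last_token_acts = out.hidden_states[layer + 1][:, -1, :]  # shape [batch, d_model]
```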
Data: You can potentially create a more meaningful steering vector, for instance, by averaging several vectors from contrastive pairs (for example "I love the world" − "I hate the world" or "I adore cats" − "I dislike cats"), applying PCA on a relevant dataset, or training a linear classifier and using its weights as the steering direction.
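As one concrete instance, a sketch of the contrastive-pair average, reusing last_token_activation from the earlier snippet; the pair list is illustrative.

```python
import torch

# Keep the sign convention consistent: always positive minus negative.
pairs = [
    ("I love the world", "I hate the world"),
    ("I adore cats", "I dislike cats"),
]
diffs = [last_token_activation(pos) - last_token_activation(neg)
         for pos, neg in pairs]
v = torch.stack(diffs).mean(dim=0)  # averaged steering vector
```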
Here are additional factors that may add complexity to the process of activation steering:
Tokens: The question arises as to which activations you actually want to add your steering vector to. When you encode some text, you could for example add it at the first token, at the last token, or even at every token. I chose the latter, adding it at every token of the new text. Careful: if you use batches, you might not want to add to padding tokens.
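A sketch of adding v only at non-padding positions in a batched forward pass, using the attention mask; it reuses block, tokenizer, and v from above and assumes the pad token and left padding were set as in the earlier snippet.

```python
def make_masked_steering_hook(v, attention_mask):
    # attention_mask: [batch, seq], 1 for real tokens, 0 for padding
    mask = attention_mask.unsqueeze(-1).to(v.dtype)  # [batch, seq, 1]
    def hook(module, inputs, output):
        hidden = output[0]
        # v broadcasts over batch and sequence; padding positions stay unchanged
        return (hidden + mask * v,) + output[1:]
    return hook

inputs = tokenizer(["Tell me about the world", "Hi"],
                   return_tensors="pt", padding=True)
handle = block.register_forward_hook(
    make_masked_steering_hook(v, inputs["attention_mask"]))
with torch.no_grad():
    logits = model(**inputs).logits
handle.remove()
```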
Scaling: In some cases, for example when v = ϕ_l(Love) − ϕ_l(Hate), the length of the steering vector already contains some meaningful information. However, you...
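One common pattern here is to separate direction from strength: normalize the vector and control the steering strength with a coefficient. A sketch, where the value of alpha is an illustrative hyperparameter rather than a recommendation:

```python
alpha = 5.0  # steering strength; typically tuned by hand
v_scaled = alpha * v / v.norm()  # fixed length, direction taken from v

def scaled_steering_hook(module, inputs, output):
    hidden = output[0]
    return (hidden + v_scaled,) + output[1:]
```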