Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities, published by porby on February 3, 2024 on LessWrong.
(The paper link is likely going to expire in favor of arXiv once I get it submitted. Incidentally, if anyone wants to endorse me for cs.LG on arXiv, please DM me, and thanks!)
Abstract:
To help evaluate and understand the latent capabilities of language models, this paper introduces an approach using optimized input embeddings, or 'soft prompts,' as a metric of conditional distance between a model and a target behavior.
The technique aims to facilitate latent capability discovery as a part of automated red teaming/evaluation suites and to provide quantitative feedback about the accessibility of potentially concerning behaviors in a way that may scale to powerful future models, including those which may otherwise be capable of deceptive alignment. An evaluation framework using soft prompts is demonstrated in natural language, chess, and pathfinding, and the technique is extended with generalized conditional soft prompts to aid in constructing task evaluations.
What is this?
This paper applies soft prompts to evaluations. Soft prompts are composed of optimized token embeddings rather than being restricted to the usual dictionary of discrete tokens. To the limit of the optimizer's ability, they compress the conditions that elicit a target behavior into the embeddings composing the prompt.
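To make that concrete, here's a minimal sketch of the core loop in PyTorch. The model stays frozen; only the prepended embeddings receive gradients. All names, shapes, and hyperparameters here are illustrative assumptions, and the `inputs_embeds` argument assumes a HuggingFace-style interface:

```python
import torch

NUM_SOFT_TOKENS = 8   # how many optimized embeddings to prepend (assumption)
EMBED_DIM = 2048      # must match the model's embedding width

soft_prompt = torch.nn.Parameter(torch.randn(NUM_SOFT_TOKENS, EMBED_DIM) * 0.01)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def soft_prompt_step(model, token_embeddings, loss_fn):
    """One optimization step: prepend the soft prompt to the embedded input
    and backpropagate only into the soft prompt (the model stays frozen)."""
    batch = token_embeddings.shape[0]
    prefix = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
    inputs = torch.cat([prefix, token_embeddings], dim=1)
    logits = model(inputs_embeds=inputs).logits
    loss = loss_fn(logits)  # task-specific loss defining the target behavior
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```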
Different target behaviors or capabilities will tend to require different numbers of soft prompt tokens. Capabilities that are natural to the model, like repeating a single token over and over again, can be achieved with a single soft prompt token even in small models. Repeating a longer string may require more tokens, and even partially eliciting more complex behavior like pathfinding or chess could require many more.
Why is this useful?
Soft prompts offer a simple way to quantify the conditional distance between a model and a target behavior. If many tokens are required, then lots of information-as-conditions is needed to define the behavior, and it's unlikely to arise by chance. If no number of tokens suffices, the behavior is likely inaccessible through inputs alone.[1]
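One way this could be operationalized (a hedged sketch, not necessarily the paper's exact protocol) is to sweep soft prompt sizes and report the smallest one that reaches a target loss on the behavior's dataset. `train_soft_prompt` below is a hypothetical helper wrapping a loop like the one sketched earlier:

```python
def conditional_distance(model, dataset, target_loss,
                         sizes=(1, 2, 4, 8, 16, 32, 64)):
    """Return the smallest soft prompt size that elicits the behavior,
    or None if no tested size suffices."""
    for n_tokens in sizes:
        # train_soft_prompt: hypothetical helper that optimizes an
        # n_tokens-long soft prompt and returns the final loss.
        final_loss = train_soft_prompt(model, dataset, n_tokens)
        if final_loss <= target_loss:
            return n_tokens   # few tokens needed: the behavior is 'close'
    return None               # no size sufficed: likely inaccessible via inputs
```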
It can be used as part of evaluation suites to monitor how the distance to concerning behaviors changes across models and scales, and the degree to which those behaviors can be elicited at all.
It is also difficult for the model to hide a capability that is otherwise accessible to some input. If the original model is an end-to-end differentiable product of gradient descent, it has already proven transparent to the same type of optimizer that will be tasked with adversarially optimizing input soft prompts against the model's machinery.[2]
Compared to manual elicitation attempts, or to automated techniques restricted to discrete tokens or black-box optimization, it dramatically improves the chances that an evaluation's attempt to elicit concerning behavior will succeed.
It's also dumb simple.
Are there any cute short results for me to look at?
Yes! One of the tests was reverting fine-tuning on TinyLlama's chat variant by simply optimizing a soft prompt for standard autoregressive prediction. The training set was RedPajama v2, a broad pretraining corpus. No effort was put into providing explicit 'detuning' samples (like starting with a dialogue and then having the trained continuation deliberately subvert the original fine-tuning), so some dialogue-esque behaviors persisted, but the character of the assistant was a little... different.
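For the curious, here's a minimal sketch of what that detuning setup might look like; the checkpoint name, prompt length, and loop details are assumptions for illustration rather than the paper's exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed checkpoint; the post only says "TinyLlama's chat variant".
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model.requires_grad_(False)  # the chat model stays frozen

embed = model.get_input_embeddings()
soft = torch.nn.Parameter(torch.randn(16, embed.embedding_dim) * 0.01)
opt = torch.optim.Adam([soft], lr=1e-3)

def detune_step(input_ids):
    """Plain autoregressive loss on corpus tokens (e.g. RedPajama v2 text),
    with the soft prompt prepended to the embedded sequence."""
    prefix = soft.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs = torch.cat([prefix, embed(input_ids)], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Position i predicts token i+1, so drop the soft prompt's positions.
    preds = logits[:, soft.size(0) - 1 : -1, :]
    loss = torch.nn.functional.cross_entropy(
        preds.reshape(-1, preds.size(-1)), input_ids.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```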
Note that, despite a malicious system message, the original chat model simply doesn't respond, then turns around and generates a user question about how great General Electric is.
The soft-prompted assistant has other ideas.
Notably, the loss on RedPajama v2 for the orig...