Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities, published by porby on February 3, 2024 on LessWrong.
(The paper link is likely going to expire in favor of arXiv once I get it submitted. Incidentally, if anyone wants to endorse me for cs.LG on arXiv, please DM me, and thanks!)
Abstract:
To help evaluate and understand the latent capabilities of language models, this paper introduces an approach using optimized input embeddings, or 'soft prompts,' as a metric of conditional distance between a model and a target behavior.
The technique aims to facilitate latent capability discovery as a part of automated red teaming/evaluation suites and to provide quantitative feedback about the accessibility of potentially concerning behaviors in a way that may scale to powerful future models, including those which may otherwise be capable of deceptive alignment. An evaluation framework using soft prompts is demonstrated in natural language, chess, and pathfinding, and the technique is extended with generalized conditional soft prompts to aid in constructing task evaluations.
What is this?
This paper applies soft prompts to evaluations. Soft prompts are composed of optimized token embeddings rather than being restricted to the usual dictionary of discrete tokens. To the limit of the optimizer's ability, they compress the conditions that elicit a target behavior into the embeddings composing the prompt.
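To make that concrete, here's a minimal sketch of the core loop in PyTorch. The model stays frozen; only the prepended embeddings receive gradients. All names, shapes, and hyperparameters here are illustrative assumptions, and the `inputs_embeds` argument assumes a HuggingFace-style interface:

```python
import torch

NUM_SOFT_TOKENS = 8   # how many optimized embeddings to prepend (assumption)
EMBED_DIM = 2048      # must match the model's embedding width

soft_prompt = torch.nn.Parameter(torch.randn(NUM_SOFT_TOKENS, EMBED_DIM) * 0.01)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def soft_prompt_step(model, token_embeddings, loss_fn):
    """One optimization step: prepend the soft prompt to the embedded input
    and backpropagate only into the soft prompt (the model stays frozen)."""
    batch = token_embeddings.shape[0]
    prefix = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
    inputs = torch.cat([prefix, token_embeddings], dim=1)
    logits = model(inputs_embeds=inputs).logits
    loss = loss_fn(logits)  # task-specific loss defining the target behavior
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```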
Different target behaviors or capabilities will tend to require different numbers of soft prompt tokens. Capabilities that are natural to the model, like repeating a single token over and over again, can be achieved with a single soft prompt token even in small models. Repeating a longer string may require more tokens, and even partially eliciting more complex behavior like pathfinding or chess could require many more.
Why is this useful?
Soft prompts offer a simple way to quantify the conditional distance between a model and a target behavior. If many tokens are required, then lots of information-as-conditions is needed to define the behavior, and it's unlikely to arise by chance. If no number of tokens suffices, the behavior is likely inaccessible through inputs alone.[1]
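One way this could be operationalized (a hedged sketch, not necessarily the paper's exact protocol) is to sweep soft prompt sizes and report the smallest one that reaches a target loss on the behavior's dataset. `train_soft_prompt` below is a hypothetical helper wrapping a loop like the one sketched earlier:

```python
def conditional_distance(model, dataset, target_loss,
                         sizes=(1, 2, 4, 8, 16, 32, 64)):
    """Return the smallest soft prompt size that elicits the behavior,
    or None if no tested size suffices."""
    for n_tokens in sizes:
        # train_soft_prompt: hypothetical helper that optimizes an
        # n_tokens-long soft prompt and returns the final loss.
        final_loss = train_soft_prompt(model, dataset, n_tokens)
        if final_loss <= target_loss:
            return n_tokens   # few tokens needed: the behavior is 'close'
    return None               # no size sufficed: likely inaccessible via inputs
```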
It can be used as part of evaluation suites to monitor how the distance to concerning behaviors changes across models and scales, and the degree to which those behaviors can be elicited at all.
It is also difficult for the model to hide a capability that is otherwise accessible to some input. If the original model is an end-to-end differentiable product of gradient descent, it has already proven transparent to the same type of optimizer that will be tasked with adversarially optimizing input soft prompts against the model's machinery.[2]
Compared to manual elicitation attempts, or to automated techniques restricted to discrete tokens or black-box optimization, it dramatically improves the chances that an evaluation's attempt to elicit concerning behavior will succeed.
It's also dumb simple.
Are there any cute short results for me to look at?
Yes! One of the tests was reverting fine-tuning on TinyLlama's chat variant by simply optimizing a soft prompt for standard autoregressive prediction. The training set was RedPajama v2, a broad pretraining corpus. No effort was put into providing explicit 'detuning' samples (like starting with a dialogue and then having the trained continuation deliberately subvert the original fine-tuning), so some dialogue-esque behaviors persisted, but the character of the assistant was a little... different.
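For the curious, here's a minimal sketch of what that detuning setup might look like; the checkpoint name, prompt length, and loop details are assumptions for illustration rather than the paper's exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed checkpoint; the post only says "TinyLlama's chat variant".
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model.requires_grad_(False)  # the chat model stays frozen

embed = model.get_input_embeddings()
soft = torch.nn.Parameter(torch.randn(16, embed.embedding_dim) * 0.01)
opt = torch.optim.Adam([soft], lr=1e-3)

def detune_step(input_ids):
    """Plain autoregressive loss on corpus tokens (e.g. RedPajama v2 text),
    with the soft prompt prepended to the embedded sequence."""
    prefix = soft.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs = torch.cat([prefix, embed(input_ids)], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Position i predicts token i+1, so drop the soft prompt's positions.
    preds = logits[:, soft.size(0) - 1 : -1, :]
    loss = torch.nn.functional.cross_entropy(
        preds.reshape(-1, preds.size(-1)), input_ids.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```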
Note that, despite a malicious system message, the original chat model simply doesn't respond, then turns around and generates a user question about how great General Electric is.
The soft-prompted assistant has other ideas.
Notably, the loss on RedPajama v2 for the orig...