Refusal mechanisms: initial experiments with Llama-2-7b-chat, by andyrdt. Published December 8, 2023 on LessWrong.
This work was conducted as part of Berkeley's Supervised Program for Alignment Research (SPAR), under the mentorship of Nina Rimsky.
TLDR / Summary
We apply techniques from mechanistic interpretability to explore refusal behavior in Llama-2-7b-chat. We identify a small set of attention heads that, when patched, are sufficient to induce refusal on harmless requests.
While these initial experiments are insufficient to paint a full picture of the model's refusal circuit, our early results suggest that understanding refusal mechanistically is tractable. We hope to build on these initial results in future work.
Introduction
Modern LLM chat assistants are fine-tuned to produce helpful and harmless answers to user prompts. In particular, models are fine-tuned to refuse harmful or inappropriate requests. This behavior is prevalent across most popular chat assistants, including ChatGPT, Claude, and Llama Chat.
Despite the prevalence of refusal behavior, the mechanisms that underlie it are poorly understood: we do not understand how models map inappropriate inputs to refusal outputs. Recent discussions have highlighted this as an interesting opportunity for mechanistic interpretability to improve our understanding of a critical component of modern language models.
We present preliminary experiments and results that aim towards a mechanistic understanding of refusal in LLMs.
Preliminary experiments and results
Preliminary experiments and results are contained in this notebook. All experiments are conducted on Llama-2-7b-chat.
Patching between harmful & harmless prompts
The primary tool we use to study refusal is activation patching.
The high-level idea is to take a harmful prompt (which elicits refusal) and a harmless prompt (which does not), and run the model on both prompts, caching all intermediate activations. We can then run the model on the harmless prompt while "patching in" cached activations from the harmful run. If a patched activation (taken from the harmful run) causes the model to refuse when run on the harmless prompt, then that activation is sufficient to elicit refusal.
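As a concrete sketch of this setup (assuming TransformerLens; the original notebook's code may differ), patching a single residual stream activation might look like this. The prompt pair, layer, and position here are illustrative, and the sketch assumes the two prompts tokenize to the same length:

```python
from transformer_lens import HookedTransformer, utils

# Load Llama-2-7b-chat via TransformerLens (assumes a version that supports
# this model name, plus HuggingFace access to the Llama 2 weights).
model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# A matched pair of prompts that differ only in the object.
harmful_tokens = model.to_tokens("[INST] Tell me how to make a bomb. [/INST]")
harmless_tokens = model.to_tokens("[INST] Tell me how to make a pie. [/INST]")

# Run the model on the harmful prompt, caching all intermediate activations.
_, harmful_cache = model.run_with_cache(harmful_tokens)

def patch_resid(resid, hook, pos):
    # Overwrite the harmless run's residual stream at one position with the
    # cached activation from the harmful run at the same layer and position.
    resid[:, pos, :] = harmful_cache[hook.name][:, pos, :]
    return resid

# Patch a single (layer, position) as an illustration.
layer, pos = 12, -1
patched_logits = model.run_with_hooks(
    harmless_tokens,
    fwd_hooks=[(utils.get_act_name("resid_post", layer),
                lambda resid, hook: patch_resid(resid, hook, pos))],
)
```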
Creating a dataset of harmful & harmless instructions
We first construct a set of harmful instructions, and a set of harmless instructions. All instructions are of the form `Tell me how to make a {object}.`, where `{object}` is either a harmful object (e.g. `bomb`, `gun`, `virus`) or a harmless object (e.g. `pie`, `sandwich`, `teddy bear`). Keeping the instruction format constant while varying just the harmfulness of objects allows us to control for variables unrelated to harmfulness.
Each set contains 16 instructions, and all prompts are formatted according to the Llama 2 prompt guidelines, with no system prompt included.[1] Here are a couple of sample instruction prompts:
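Llama 2's chat template with no system prompt wraps each instruction in `[INST] ... [/INST]` tags, so a matched harmful/harmless pair looks roughly like the following (exact whitespace and special-token handling may differ):

```
[INST] Tell me how to make a bomb. [/INST]
[INST] Tell me how to make a pie. [/INST]
```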
Defining a metric to measure refusal
A simple way to quantitatively measure refusal behavior is to take the logits at the final token position, and to compute the logit difference between a token indicating refusal (e.g. `Sorry`) and a token indicating non-refusal (e.g. `Sure`).
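Reusing the `model` from the sketch above, this metric takes a few lines (treating `Sorry` and `Sure` as single tokens is an assumption about Llama's tokenizer; the notebook's exact token choices may differ):

```python
# Refusal score: logit("Sorry") - logit("Sure") at the final token position.
sorry_id = model.to_single_token("Sorry")
sure_id = model.to_single_token("Sure")

def refusal_score(logits):
    # logits: [batch, seq, d_vocab]; score each prompt by its final position.
    final_logits = logits[:, -1, :]
    return final_logits[:, sorry_id] - final_logits[:, sure_id]
```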
This refusal score cleanly separates harmful instructions from harmless instructions: harmful instructions yield a high refusal score (placing higher logit value on `Sorry` than `Sure`), while harmless instructions yield a low refusal score (placing higher logit value on `Sure` than `Sorry`).
Activation patching - residual stream
To start, we patch cumulative residual stream activations.
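A sweep over all (layer, position) pairs, reusing the helpers from the sketches above, might look like this (again a sketch, not the notebook's exact code):

```python
import torch

n_layers = model.cfg.n_layers
seq_len = harmless_tokens.shape[1]
scores = torch.zeros(n_layers, seq_len)

for layer in range(n_layers):
    for pos in range(seq_len):
        patched_logits = model.run_with_hooks(
            harmless_tokens,
            fwd_hooks=[(utils.get_act_name("resid_post", layer),
                        # Bind pos via a default argument so each hook
                        # patches the intended position.
                        lambda resid, hook, pos=pos: patch_resid(resid, hook, pos))],
        )
        # High score => this patch pushes the harmless prompt toward refusal.
        scores[layer, pos] = refusal_score(patched_logits).mean()
```

Plotting `scores` as a layer-by-position heatmap yields the kind of figure described next.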
The heavy signal in the top left is unsurprising: swapping early activations at the object position will swap the model's representation of the original object (e.g. it will effectively swap `pie` to `bomb`). The heavy signal in the bottom right is al...