The Nonlinear Library Podcast - LW - Creating unrestricted AI Agents with a refusal-vector ablated Llama 3 70B by Simon Lermen

Discover

Podcast Features
Your all-in-one podcasting solution.

Blog to Podcast
Turn your blog into an engaging podcast.
Livestream
High-performing audio live, without limits.

Podcast Studio
Easy-to-use audio recorder app.
Podbean AI
AI-Enhanced Audio Quality and Content Generation.

Podcast App
The best podcast player & podcast app.

Ads Marketplace
Join Ads Marketplace to earn money
through sponsorship on your podcast.

PodAds
Manage your ads with dynamic ad insertion capability.
Apple Podcasts Subscriptions Integration
Effortlessly publish and manage exclusive episodes for your
Apple Podcasts subscribers directly from Podbean.
Live Streaming
Receive livestream rewards from your audience and earn
recurring income from your Fan Club membership.

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Monetization
Advertisers
Enterprise
Pricing
Discover

The Nonlinear Library

Education

LW - Creating unrestricted AI Agents with a refusal-vector ablated Llama 3 70B by Simon Lermen

2024-05-11

iOS

Android Share

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Creating unrestricted AI Agents with a refusal-vector ablated Llama 3 70B, published by Simon Lermen on May 11, 2024 on LessWrong.
TLDR; I demonstrate the use of refusal vector ablation on Llama 3 70B to create a bad agent that can attempt malicious tasks such as trying to persuade and pay me to assassinate another individual. I introduce some early work on a benchmark for Safe Agents which comprises two small datasets, one benign, one bad. In general, Llama 3 70B is a competent agent with appropriate scaffolding, and Llama 3 8B also has decent performance.
Overview
In this post, I use insights from mechanistic interpretability to remove safety guardrails from the latest Llama 3 model. I then use a custom scaffolding for tool use and agentic planning to create a "bad" agent that can perform many unethical tasks. Examples include tasking the AI with persuading me to end the life of the US President. I also introduce an early version of a benchmark, and share some ideas on how to evaluate agent capabilities and safety.
I find that even the unaltered model is willing to perform many unethical tasks, such as trying to persuade people not to vote or not to get vaccinated. Recently, I have done a similar project for Command R+, however, Llama 3 is more capable and has undergone more robust safety training. I then discuss future implications of these unrestricted agentic models. This post is related to a talk I gave recently at an Apart Research Hackathon.
Method
This research is largely based on recent interpretability work identifying that refusal is primarily mediated by a single direction in the residual stream. In short, they show that, for a given model, it is possible to find a single direction such that erasing that direction prevents the model from refusing. By making the activations of the residual stream orthogonal against this refusal direction, one can create a model that does not refuse harmful requests.
In this post, we apply this technique to Llama 3, and explore various scenarios of misuse. In related work, others have applied a similar technique to Llama 2. Currently, an anonymous user claims to have independently implemented this method and has uploaded the modified Llama 3 online on huggingface.
In some sense, this post is a synergy between my earlier work on Bad Agents with Command R+ and this new technique for refusal mitigation. In comparison, the refusal-vector ablated Llama 3 models are much more capable agents because 1) the underlying models are more capable and 2) refusal vector ablation is a more precise method to avoid refusals. A limitation of my previous work was that my Command R+ agent was using a jailbreak prompt which made it struggle to perform simple benign tasks.
For example, when prompted to send a polite mail message, the jailbroken Command R+ would instead retain a hostile and aggressive tone. Besides refusal-vector ablation and prompt jailbreaks, I have previously applied the parameter efficient fine-tuning method LoRA to avoid refusals.
However, refusal-vector ablation has a few key benefits over low rank adaption: 1) It keeps edits to the model minimal, reducing the risk of any unintended consequences, 2) It does not require a dataset of instruction answer pairs, but simply a dataset of harmful instructions, and 3) it requires less compute. Obtaining a dataset of high-quality instruction answer pairs for harmful requests was the most labor intensive part of my previous work.
In conclusion, refusal-vector ablation provides key benefits over jailbreaks or LoRA subversive fine-tuning. On the other hand, jailbreaks can be quite effective and don't require any additional expertise or resources.[1]
Benchmarks for Safe Agents
This "safe agent benchmark" is a dataset comprising both benign and harmful tasks to test how safe and capable a...