Download - Redwood Research’s current project

Discover

Podcast Features
Your all-in-one podcasting solution.

Blog to Podcast
Turn your blog into an engaging podcast.
Livestream
High-performing audio live, without limits.

Podcast Studio
Easy-to-use audio recorder app.
Podbean AI
AI-Enhanced Audio Quality and Content Generation.

Podcast App
The best podcast player & podcast app.

Ads Marketplace
Join Ads Marketplace to earn money
through sponsorship on your podcast.

PodAds
Manage your ads with dynamic ad insertion capability.
Apple Podcasts Subscriptions Integration
Effortlessly publish and manage exclusive episodes for your
Apple Podcasts subscribers directly from Podbean.
Live Streaming
Receive livestream rewards from your audience and earn
recurring income from your Fan Club membership.

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Monetization
Advertisers
Enterprise
Pricing
Discover

The Nonlinear Library: LessWrong Top Posts

Education

Redwood Research’s current project

2021-12-11

Download Right click and do "save link as"

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Redwood Research’s current project , published by Buck on the AI Alignment Forum.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Here’s a description of the project Redwood Research is working on at the moment. First I’ll say roughly what we’re doing, and then I’ll try to explain why I think this is a reasonable applied alignment project, and then I’ll talk a bit about the takeaways I’ve had from the project so far.
There are a bunch of parts of this that we’re unsure of and figuring out as we go; I’ll try to highlight our most important confusions as they come up. I’ve mentioned a bunch of kind of in-the-weeds details because I think they add flavor. This is definitely just me describing a work in progress, rather than presenting any results.
Thanks to everyone who’s contributed to the project so far: the full-time Redwood technical team of me, Nate Thomas, Daniel Ziegler, Seraphina Nix, Ben Weinstein-Raun, Adam Scherlis; other technical contributors Daniel de Haas, Shauna Kravec, Tao Lin, Noa Nabeshima, Peter Schmidt-Nielsen; our labellers, particularly Kristen Hall, Charles Warth, Jess Thomson, and Liam Clarke; and for particularly useful advice Mark Xu, Ajeya Cotra, and Beth Barnes. Thanks to Paul Christiano for suggesting a project along these lines and giving lots of helpful advice. Thanks to Adam Scherlis and Nate Soares for writing versions of this doc. And thanks to Bill Zito and other contributors to Redwood ops. Apologies to the people I’ve overlooked.
We started this project at the start of August.
What we’re doing
We’re trying to take a language model that has been fine-tuned on completing fiction, and then modify it so that it never continues a snippet in a way that involves describing someone getting injured (with a caveat I’ll mention later). And we want to do this without sacrificing much quality: if you use both the filtered model and the original model to generate a completion for a prompt, humans should judge the filtered model’s completion as better (more coherent, reasonable, thematically appropriate, and so on) at least about half the time. (This “better almost 50% of the time” property is one way of trying to operationalize “we don’t want the filtered policy to be worse”. It so happens that this property is actually kind of badly behaved, but in our case it seems fine, given that we’re always going to be comparing against a fixed unfiltered distribution.)
We’re doing this project in two steps:
Step 1: train a classifier, generate by sampling with rejection
In step 1 (which we’re currently doing), instead of training a single filtered generator model, we’re just training a classifier that takes a prompt and completion and predicts whether a human would say that the completion involved someone getting injured. You can use such a classifier to make a filtered generation process, by repeatedly generating completions until we find one that the classifier thinks is above some threshold of P(safe).
You can play with this filtered generation process here.
This interface lets you provide a prompt, and then you can see all of the generated completions and the classifier’s rating of each. It currently is set to use “10% chance of injury” as the decision boundary (it is extremely uncalibrated; this corresponds to a much lower actual chance of injury). Our first goal is to train a classifier that’s good enough that no-one is able to find prompts on which the above process has a noticeable probability of generating an injurious completion.
This model was produced by fine-tuning DeBERTa XL on a dataset produced by contractors labeling a bunch of LM-generated completions to snippets of fanfiction that were selected by various heuristics to have a high probability of being completed violently. You can read the instructio...