This is: How to Control an LLM's Behavior (why my P(DOOM) went down), published by RogerDearnaley on November 29, 2023 on LessWrong.
This is a link-post for a paper I recently read, Pretraining Language Models with Human Preferences, followed by my reactions to it.
Reading this paper has significantly reduced my near-term P(DOOM), and I'd like to explain why. Thus, this is also an alignment proposal. While I don't think what I'm proposing here is a complete solution to aligning an ASI, I think it might work well up to at least around a human-level AGI, and might even be a useful basis to build on at the ASI level (where I'd advocate adding value learning on top). It can achieve some of the simpler things that people have been hoping to get from Interpretability (and for more complex things it might also combine well with, and even simplify, Interpretability, if that can be made to work at scale). It's also simple, immediately actionable, and has a fairly low alignment tax; best of all, it also has lots of useful capabilities effects, so even a superscaler not very concerned about x-risk might well still want to implement it.
The Paper
Let's start with the paper. The authors experiment with a number of different ways one might train an LLM not to exhibit some form of undesired behavior. For the paper, they chose three simple, well-defined bad behaviors for which they had low-computational-cost, high-accuracy classifiers, and which were simple enough that a fairly small, economical-to-pretrain LLM could reasonably be expected to understand them.
They demonstrate that, unlike the common approach of first training a foundation model on the task "learn to autocomplete a large chunk of the web, which includes both good and bad behavior" and then fine-tuning/RLHF-ing it on "now learn to recognize and only do good behavior, not bad", it is a lot more effective (they estimate by around an order of magnitude) to build this behavioral training in from the start, during pretraining. So they evaluate five different methods for doing that (plus standard pretraining as a control).
The simplest behavior-training approach they try is just filtering the training set so that it doesn't contain any examples of bad behavior. For the resulting foundation model, bad behavior is then out-of-distribution (so it may, or may not, be difficult for the model to successfully extrapolate to it).
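As a minimal sketch of this filtering approach, assuming a hypothetical document-level classifier is_bad_behavior standing in for the paper's cheap, high-accuracy classifiers:

```python
# Minimal sketch of training-set filtering. `is_bad_behavior` is a
# hypothetical stand-in for a cheap, high-accuracy classifier of the
# undesired behavior (e.g. toxicity).

def filter_corpus(documents, is_bad_behavior):
    """Drop every document exhibiting the undesired behavior, so the
    resulting foundation model never sees it during pretraining."""
    return [doc for doc in documents if not is_bad_behavior(doc)]
```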
Interestingly, while that approach was fairly effective, it wasn't the best: it consistently tended to harm capabilities, and didn't even always give the best behavior. As one might expect from analogies to a similar approach to raising children, extrapolating outside the training distribution isn't reliably hard.
The clear winner instead was a slightly more complex approach: prelabel your entire training set, scanned at a sentence/line-of-code level, as good or bad using something like <good> and <bad> tags. Then at inference time, start the response generation after a <good> tag, and during inference tweak the token generation process to ban the model from generating a </good> tag (unless it's the matching end-tag at the end of the document, after an end_of_text token) or a <bad> tag (i.e. these are banned tokens, whose probability is reset to zero).
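To make the labeling step concrete, here is a minimal sketch. The tag names and the per-sentence classifier is_bad are illustrative assumptions (the paper's conditional training uses special control tokens and its own task-specific classifiers):

```python
# Sketch of sentence-level conditional-training labeling. `is_bad` is a
# hypothetical per-sentence classifier; the tag strings are illustrative
# and would be added to the tokenizer as special tokens.

def label_document(sentences, is_bad):
    """Tag maximal runs of same-label sentences, so a document becomes a
    sequence of <good>...</good> and <bad>...</bad> spans, teaching the
    model both behaviors plus which is which."""
    spans, current_label, current = [], None, []
    for sentence in sentences:
        label = "bad" if is_bad(sentence) else "good"
        if label != current_label and current:
            spans.append(f"<{current_label}>{' '.join(current)}</{current_label}>")
            current = []
        current_label = label
        current.append(sentence)
    if current:
        spans.append(f"<{current_label}>{' '.join(current)}</{current_label}>")
    return " ".join(spans)
```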
So, teach your LLM the difference between good and bad all the way through its pretraining, and then at inference time only allow it to be good. This is a ridiculously simple idea, and interestingly it works really well. [This technique is called "conditional training" and was first suggested about 5-6 years ago - it seems a little sad that it's taken this long for someone to demonstrate how effective it is. Presumably the technical challenge is the classifiers.]
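The inference-time half can be sketched with Hugging Face transformers' LogitsProcessor interface. The tag tokens are again assumptions, and this simplified version bans the closing tag unconditionally rather than allowing it once at document end:

```python
# Sketch of inference-time conditional generation: generation is started
# inside a <good> span, and the <bad> and </good> tokens are banned by
# setting their logits to -inf (probability zero after softmax).
from transformers import LogitsProcessor, LogitsProcessorList

class BanTagTokens(LogitsProcessor):
    def __init__(self, banned_token_ids):
        self.banned_token_ids = banned_token_ids

    def __call__(self, input_ids, scores):
        scores[:, self.banned_token_ids] = -float("inf")
        return scores

# Hypothetical usage, assuming <good>, </good>, <bad> were added to the
# tokenizer as special tokens during pretraining:
#   banned = tokenizer.convert_tokens_to_ids(["<bad>", "</good>"])
#   output = model.generate(
#       tokenizer("<good>" + prompt, return_tensors="pt").input_ids,
#       logits_processor=LogitsProcessorList([BanTagTokens(banned)]),
#   )
```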
Applications to Alignment
So (assuming this carries over to larger...