This is: How to Control an LLM's Behavior (why my P(DOOM) went down), published by RogerDearnaley on November 29, 2023 on LessWrong.
This is a link-post for a paper I recently read, Pretraining Language Models with Human Preferences, followed by my reactions to it.
Reading this paper has significantly reduced my near-term P(DOOM), and I'd like to explain why. Thus, this is also an alignment proposal. While I don't think what I'm proposing here is a complete solution to aligning an ASI, I think it might work well up to at least around a human-level AGI, and might even be a useful basis to build on at the ASI level (where I'd advocate adding value learning on top). It can achieve some of the simpler things that people have been hoping to get from Interpretability (and for more complex things it might also combine well with, and even simplify, Interpretability, if that can be made to work at scale). It's also simple, immediately actionable, and has a fairly low alignment tax; best of all, it also has lots of useful capabilities effects, so even a superscaler not very concerned about x-risk might well still want to implement it.
The Paper
Let's start with the paper. The authors experiment with a number of different ways one might train an LLM not to exhibit some form of undesired behavior. For the paper, they chose three simple, well-defined bad behaviors for which they had low-computational-cost, high-accuracy classifiers, and which were simple enough that a fairly small, economical-to-pretrain LLM could reasonably be expected to understand them.
They demonstrate that, unlike the common approach of first training a foundation model on the task "learn to autocomplete a large chunk of the web, which includes both good and bad behavior" and then fine-tuning/RLHF-ing it on "now learn to recognize and only do good behavior, not bad", it is a lot more effective (they estimate by around an order of magnitude) to build this behavioral training in from the start, during pretraining. So they evaluate five different methods for doing that (plus standard pretraining as a control).
The simplest behavior-training approach they try is just filtering the training set so that it doesn't contain any examples of bad behavior. For the resulting foundation model, bad behavior is then out-of-distribution (so it may, or may not, be difficult for the model to successfully extrapolate to it).
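As a minimal sketch of this filtering approach, assuming a hypothetical document-level classifier is_bad_behavior standing in for the paper's cheap, high-accuracy classifiers:

```python
# Minimal sketch of training-set filtering. `is_bad_behavior` is a
# hypothetical stand-in for a cheap, high-accuracy classifier of the
# undesired behavior (e.g. toxicity).

def filter_corpus(documents, is_bad_behavior):
    """Drop every document exhibiting the undesired behavior, so the
    resulting foundation model never sees it during pretraining."""
    return [doc for doc in documents if not is_bad_behavior(doc)]
```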
Interestingly, while that approach was fairly effective, it wasn't the best: it consistently tended to harm capabilities, and didn't even always give the best behavior. As one might expect from analogies to a similar approach to raising children, extrapolating outside the training distribution isn't reliably hard.
The clear winner instead was a slightly more complex approach: prelabel your entire training set, scanned at a sentence/line-of-code level, as good or bad using something like <good> and <bad> tags. Then at inference time, start the response generation after a <good> tag, and during inference tweak the token generation process to ban the model from generating a </good> tag (unless it's the matching end-tag at the end of the document, after an end_of_text token) or a <bad> tag (i.e. these are banned tokens, whose probability is reset to zero).
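To make the labeling step concrete, here is a minimal sketch. The tag names and the per-sentence classifier is_bad are illustrative assumptions (the paper's conditional training uses special control tokens and its own task-specific classifiers):

```python
# Sketch of sentence-level conditional-training labeling. `is_bad` is a
# hypothetical per-sentence classifier; the tag strings are illustrative
# and would be added to the tokenizer as special tokens.

def label_document(sentences, is_bad):
    """Tag maximal runs of same-label sentences, so a document becomes a
    sequence of <good>...</good> and <bad>...</bad> spans, teaching the
    model both behaviors plus which is which."""
    spans, current_label, current = [], None, []
    for sentence in sentences:
        label = "bad" if is_bad(sentence) else "good"
        if label != current_label and current:
            spans.append(f"<{current_label}>{' '.join(current)}</{current_label}>")
            current = []
        current_label = label
        current.append(sentence)
    if current:
        spans.append(f"<{current_label}>{' '.join(current)}</{current_label}>")
    return " ".join(spans)
```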
So, teach your LLM the difference between good and bad all the way through its pretraining, and then at inference time only allow it to be good. This is a ridiculously simple idea, and interestingly it works really well. [This technique is called "conditional training" and was first suggested about 5-6 years ago - it seems a little sad that it's taken this long for someone to demonstrate how effective it is. Presumably the technical challenge is the classifiers.]
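The inference-time half can be sketched with Hugging Face transformers' LogitsProcessor interface. The tag tokens are again assumptions, and this simplified version bans the closing tag unconditionally rather than allowing it once at document end:

```python
# Sketch of inference-time conditional generation: generation is started
# inside a <good> span, and the <bad> and </good> tokens are banned by
# setting their logits to -inf (probability zero after softmax).
from transformers import LogitsProcessor, LogitsProcessorList

class BanTagTokens(LogitsProcessor):
    def __init__(self, banned_token_ids):
        self.banned_token_ids = banned_token_ids

    def __call__(self, input_ids, scores):
        scores[:, self.banned_token_ids] = -float("inf")
        return scores

# Hypothetical usage, assuming <good>, </good>, <bad> were added to the
# tokenizer as special tokens during pretraining:
#   banned = tokenizer.convert_tokens_to_ids(["<bad>", "</good>"])
#   output = model.generate(
#       tokenizer("<good>" + prompt, return_tensors="pt").input_ids,
#       logits_processor=LogitsProcessorList([BanTagTokens(banned)]),
#   )
```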
Applications to Alignment
So (assuming this carries over to larger...