Reward hacking behavior can generalize across tasks, by Kei Nishimura-Gasparian, published May 28, 2024 on The AI Alignment Forum.
TL;DR: We find that reward hacking generalization occurs in LLMs in a number of experimental settings and can emerge from reward optimization on certain datasets. This suggests that when models exploit flaws in supervision during training, they can sometimes generalize to exploit flaws in supervision in out-of-distribution environments.
Abstract
Machine learning models can display reward hacking behavior, where models score highly on imperfect reward signals by acting in ways not intended by their designers. Researchers have hypothesized that sufficiently capable models trained to get high reward on a diverse set of environments could become general reward hackers. General reward hackers would use their understanding of human and automated oversight to get high reward in a variety of novel environments, even when this requires exploiting gaps in our evaluations and acting in ways we don't intend. It appears likely that model supervision will be imperfect and will incentivize some degree of reward hacking on the training data. Can models generalize from the reward hacking behavior they experience in training to reward hack more often out-of-distribution?
We present the first study of reward hacking generalization. In our experiments, we find that:
Using RL via expert iteration to optimize a scratchpad (hidden chain-of-thought) variant of GPT-3.5 Turbo on 'reward hackable' training datasets results in a 2.6x increase in the rate of reward hacking on held-out datasets (a sketch of this training loop follows the list below).
Using fine-tuning or few-shot learning to get GPT-3.5 Turbo to imitate synthetic high-reward completions to hackable and unhackable prompts leads to a 1.3x to 2.0x increase in reward hacking frequency relative to our baselines on held-out datasets.
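To make the expert-iteration setup in the first bullet concrete, here is a minimal sketch of the general loop: sample several completions per prompt, keep the highest-scoring one under the proxy reward, and fine-tune on those. The sample, reward, and finetune callables stand in for an LLM API; this is an illustration of the filter-and-fine-tune pattern, not the authors' actual training code, which may differ (for example, in how completions are filtered).

```python
# Minimal sketch of an expert-iteration (best-of-n) training loop.
# The caller supplies the model-specific pieces; nothing here is the
# authors' exact implementation.
from typing import Callable, List, Tuple


def expert_iteration(
    sample: Callable[[str, int], List[str]],            # (prompt, n) -> n scratchpad completions
    reward: Callable[[str, str], float],                 # (prompt, completion) -> proxy reward
    finetune: Callable[[List[Tuple[str, str]]], None],   # fine-tune on (prompt, completion) pairs
    prompts: List[str],
    n_samples: int = 8,
    n_rounds: int = 3,
) -> None:
    for _ in range(n_rounds):
        kept: List[Tuple[str, str]] = []
        for prompt in prompts:
            completions = sample(prompt, n_samples)
            # Keep only the highest-scoring completion under the (possibly
            # misspecified) proxy reward; a flawed reward can reinforce hacks here.
            best = max(completions, key=lambda c: reward(prompt, c))
            kept.append((prompt, best))
        # Supervised fine-tuning on the model's own high-reward outputs,
        # so only behavior the model itself generated gets reinforced.
        finetune(kept)
```

Because the model is trained only on its own highest-reward outputs, any exploit of the proxy reward that the model stumbles onto gets reinforced in later rounds.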
Our results suggest that reward hacking behavior could emerge and generalize out-of-distribution from LLM training if the reward signals we give them are sufficiently misspecified.
Figure 1: Example model completions from before and after expert iteration training: Training on datasets with misspecified rewards can make models significantly more likely to reward hack on held-out datasets. Note that the expert iteration training process only reinforces high-reward outputs generated by the model; it does not train on any external high-reward examples.
Introduction
One common failure mode of machine learning training is reward hacking, also known as specification gaming, where models game the reward signals they are given in order to achieve high reward by acting in unintended ways. Examples of reward hacking seen in real models include:
A robot hand trained to grab an object tricking its human evaluator by placing its hand between the camera and the object
A simulated creature trained to maximize jumping height exploiting a bug in the physics simulator to gain significant height
A language model trained to produce high-quality summaries exploiting flaws in the summary evaluation function ROUGE to get a high score while generating barely-readable summaries
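The ROUGE example shows how an automated metric can be gamed. The snippet below is a toy stand-in (plain unigram recall, not the real ROUGE implementation) illustrating how a barely readable, keyword-stuffed "summary" can outscore a fluent one:

```python
# Toy stand-in for a ROUGE-style metric: unigram recall against a reference.
# Copying reference words maximizes the score regardless of readability.
def unigram_recall(reference: str, candidate: str) -> float:
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens:
        return 0.0
    overlap = sum(1 for t in ref_tokens if t in cand_tokens)
    return overlap / len(ref_tokens)


reference = "the company reported record profits driven by strong cloud sales"
fluent = "profits hit a record thanks to cloud growth"
keyword_stuffed = "company company profits record cloud sales strong reported driven the by"

print(unigram_recall(reference, fluent))           # 0.3: readable but lower score
print(unigram_recall(reference, keyword_stuffed))  # 1.0: unreadable but maximal score
```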
Reward hacking is possible because of reward misspecification. For a large number of real-world tasks, it is difficult to exactly specify what behavior we want to incentivize. For example, imagine trying to hand-write a reward function that specifies exactly what it means for a model to be honest. Due to the difficulty of generating perfect reward specifications, we instead optimize our models on proxy reward signals that are easier to measure but are slightly incorrect.
In the honesty example, our proxy reward might be whether a human judges a model's statements as honest. When we optimize a model against a proxy reward signal, the model sometimes learns to take advantage of the gaps between the proxy and the behavior we actually want, scoring highly without acting as intended.
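As a toy illustration of this gap (hypothetical, not from the post): a proxy "honesty" judge that keys on surface features can assign higher reward to a confident falsehood than to a hedged true statement, so optimizing against it pushes the model toward confident-sounding claims rather than true ones.

```python
# Hypothetical toy example of a misspecified proxy reward for "honesty".
# True objective: the statement is factually correct.
# Proxy: a crude judge that rewards confident phrasing and penalizes hedging.
def true_reward(statement: str, is_factually_correct: bool) -> float:
    return 1.0 if is_factually_correct else 0.0


def proxy_reward(statement: str) -> float:
    s = statement.lower()
    score = 0.0
    if any(w in s for w in ("definitely", "certainly", "clearly")):
        score += 1.0   # confident phrasing is rewarded
    if any(w in s for w in ("maybe", "i think", "not sure")):
        score -= 0.5   # hedging is penalized
    return score


hedged_truth = "I think the answer is 42, but I'm not sure."
confident_falsehood = "The answer is definitely 37."

print(true_reward(hedged_truth, True), proxy_reward(hedged_truth))                  # 1.0 vs -0.5
print(true_reward(confident_falsehood, False), proxy_reward(confident_falsehood))   # 0.0 vs 1.0
```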