Reward hacking behavior can generalize across tasks, by Kei Nishimura-Gasparian, published May 28, 2024 on The AI Alignment Forum.
TL;DR: We find that reward hacking generalization occurs in LLMs in a number of experimental settings and can emerge from reward optimization on certain datasets. This suggests that when models exploit flaws in supervision during training, they can sometimes generalize to exploit flaws in supervision in out-of-distribution environments.
Abstract
Machine learning models can display reward hacking behavior, where models score highly on imperfect reward signals by acting in ways not intended by their designers. Researchers have hypothesized that sufficiently capable models trained to get high reward on a diverse set of environments could become general reward hackers. General reward hackers would use their understanding of human and automated oversight to get high reward in a variety of novel environments, even when this requires exploiting gaps in our evaluations and acting in ways we don't intend. It appears likely that model supervision will be imperfect and will incentivize some degree of reward hacking on the training data. Can models generalize from the reward hacking behavior they experience in training to reward hack more often out-of-distribution?
We present the first study of reward hacking generalization. In our experiments, we find that:
Using RL via expert iteration to optimize a scratchpad (hidden chain-of-thought) variant of GPT-3.5 Turbo on 'reward hackable' training datasets results in a 2.6x increase in the rate of reward hacking on held-out datasets (a sketch of this training loop follows the list below).
Using fine-tuning or few-shot learning to get GPT-3.5 Turbo to imitate synthetic high-reward completions to hackable and unhackable prompts leads to a 1.3x to 2.0x increase in reward hacking frequency relative to our baselines on held-out datasets.
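To make the expert-iteration setup in the first bullet concrete, here is a minimal sketch of the general loop: sample several completions per prompt, keep the highest-scoring one under the proxy reward, and fine-tune on those. The sample, reward, and finetune callables stand in for an LLM API; this is an illustration of the filter-and-fine-tune pattern, not the authors' actual training code, which may differ (for example, in how completions are filtered).

```python
# Minimal sketch of an expert-iteration (best-of-n) training loop.
# The caller supplies the model-specific pieces; nothing here is the
# authors' exact implementation.
from typing import Callable, List, Tuple


def expert_iteration(
    sample: Callable[[str, int], List[str]],            # (prompt, n) -> n scratchpad completions
    reward: Callable[[str, str], float],                 # (prompt, completion) -> proxy reward
    finetune: Callable[[List[Tuple[str, str]]], None],   # fine-tune on (prompt, completion) pairs
    prompts: List[str],
    n_samples: int = 8,
    n_rounds: int = 3,
) -> None:
    for _ in range(n_rounds):
        kept: List[Tuple[str, str]] = []
        for prompt in prompts:
            completions = sample(prompt, n_samples)
            # Keep only the highest-scoring completion under the (possibly
            # misspecified) proxy reward; a flawed reward can reinforce hacks here.
            best = max(completions, key=lambda c: reward(prompt, c))
            kept.append((prompt, best))
        # Supervised fine-tuning on the model's own high-reward outputs,
        # so only behavior the model itself generated gets reinforced.
        finetune(kept)
```

Because the model is trained only on its own highest-reward outputs, any exploit of the proxy reward that the model stumbles onto gets reinforced in later rounds.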
Our results suggest that reward hacking behavior could emerge and generalize out-of-distribution from LLM training if the reward signals we give them are sufficiently misspecified.
Figure 1: Example model completions from before and after expert iteration training: Training on datasets with misspecified rewards can make models significantly more likely to reward hack on held-out datasets. Note that the expert iteration training process only reinforces high-reward outputs generated by the model; it does not train on any external high-reward examples.
Introduction
One common failure mode of machine learning training is reward hacking, also known as specification gaming, where models game the reward signals they are given in order to achieve high reward by acting in unintended ways. Examples of reward hacking seen in real models include:
A robot hand trained to grab an object tricking its human evaluator by placing its hand between the camera and the object
A simulated creature trained to maximize jumping height exploiting a bug in the physics simulator to gain significant height
A language model trained to produce high-quality summaries exploiting flaws in the summary evaluation function ROUGE to get a high score while generating barely-readable summaries
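The ROUGE example shows how an automated metric can be gamed. The snippet below is a toy stand-in (plain unigram recall, not the real ROUGE implementation) illustrating how a barely readable, keyword-stuffed "summary" can outscore a fluent one:

```python
# Toy stand-in for a ROUGE-style metric: unigram recall against a reference.
# Copying reference words maximizes the score regardless of readability.
def unigram_recall(reference: str, candidate: str) -> float:
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens:
        return 0.0
    overlap = sum(1 for t in ref_tokens if t in cand_tokens)
    return overlap / len(ref_tokens)


reference = "the company reported record profits driven by strong cloud sales"
fluent = "profits hit a record thanks to cloud growth"
keyword_stuffed = "company company profits record cloud sales strong reported driven the by"

print(unigram_recall(reference, fluent))           # 0.3: readable but lower score
print(unigram_recall(reference, keyword_stuffed))  # 1.0: unreadable but maximal score
```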
Reward hacking is possible because of reward misspecification. For a large number of real-world tasks, it is difficult to exactly specify what behavior we want to incentivize. For example, imagine trying to hand-write a reward function that specifies exactly what it means for a model to be honest. Due to the difficulty of generating perfect reward specifications, we instead optimize our models on proxy reward signals that are easier to measure but are slightly incorrect.
In the honesty example, our proxy reward might be whether a human judges a model's statements as honest. When we optimize a model against a proxy reward signal, the model sometimes learns to take advantage of the gaps between the proxy and the behavior we actually want, scoring highly without acting as intended.
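As a toy illustration of this gap (hypothetical, not from the post): a proxy "honesty" judge that keys on surface features can assign higher reward to a confident falsehood than to a hedged true statement, so optimizing against it pushes the model toward confident-sounding claims rather than true ones.

```python
# Hypothetical toy example of a misspecified proxy reward for "honesty".
# True objective: the statement is factually correct.
# Proxy: a crude judge that rewards confident phrasing and penalizes hedging.
def true_reward(statement: str, is_factually_correct: bool) -> float:
    return 1.0 if is_factually_correct else 0.0


def proxy_reward(statement: str) -> float:
    s = statement.lower()
    score = 0.0
    if any(w in s for w in ("definitely", "certainly", "clearly")):
        score += 1.0   # confident phrasing is rewarded
    if any(w in s for w in ("maybe", "i think", "not sure")):
        score -= 0.5   # hedging is penalized
    return score


hedged_truth = "I think the answer is 42, but I'm not sure."
confident_falsehood = "The answer is definitely 37."

print(true_reward(hedged_truth, True), proxy_reward(hedged_truth))                  # 1.0 vs -0.5
print(true_reward(confident_falsehood, False), proxy_reward(confident_falsehood))   # 0.0 vs 1.0
```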