Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Catastrophic Goodhart in RL with KL penalty, published by Thomas Kwa on May 15, 2024 on LessWrong.
TLDR: In the last two posts, we showed that optimizing for a proxy can fail to increase true utility, but only when the error is heavy-tailed. We now show that this also happens in RLHF with a KL penalty.
This post builds on our earlier result with a more realistic setting and assumptions:
Rather than modeling optimization as conditioning on a minimum reward threshold, we study maximization of reward with a KL divergence penalty, as in RLHF.
We remove the assumption of independence between the error and utility distributions, which we think was the weakest part of the last post.
When the true utility V is light-tailed, the proxy can be maximized while keeping E[V] at the same level as the prior. We can't guarantee anything about E[V] when V is heavy-tailed; it could even go to minus infinity.
Abstract
When applying KL regularization, the trained model is regularized towards some prior policy π0. One would hope that a KL penalty can produce good outcomes even in the case of reward misspecification; that is, if the reward U is the sum of true utility V and an error term X, we would hope that optimal policies under a KL penalty achieve high V even if the magnitude of X is large.
We show that this is not always the case: when X is heavy-tailed, there are arbitrarily well-performing policies π with Eπ[V] ≈ Eπ0[V]; that is, that get no higher true utility than the prior. However, when error is light-tailed and independent of V, the optimal policy under a KL penalty results in E[V]>0, and E[V] can be made arbitrarily large. Thus, the tails of the error distribution are crucial in determining how much utility will result from optimization towards an imperfect proxy.
Intuitive explanation of catastrophic Goodhart with a KL penalty
Recall that the KL divergence between two distributions P and Q is defined as DKL(P‖Q) = Ex∼P[log(P(x)/Q(x))].
If we have two policies π, π0, we abuse notation to define DKL(π‖π0) as the KL divergence between the distributions of actions taken on the states in trajectories reached by π. That is, if Tr(π) is the distribution of trajectories taken by π, we penalize DKL(Tr(π)‖Tr(π0)).
This strongly penalizes π for taking actions the base policy π0 never takes, but does not force π to take every action the base policy takes.
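The asymmetry above can be seen numerically. Below is a small sketch (my own illustration, not from the post) of the discrete KL divergence: a policy that concentrates on a subset of the base policy's actions pays a finite penalty, while a policy that takes any action outside the base policy's support pays an infinite one.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x)).

    Points where P(x) = 0 contribute nothing (0 * log 0 = 0 by convention);
    the divergence is infinite if P puts mass where Q has none.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return float("inf")  # pi takes an action the base policy never takes
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

pi0 = [0.5, 0.5, 0.0]        # base policy over three actions
pi_narrow = [1.0, 0.0, 0.0]  # drops an action pi0 takes -> finite penalty
pi_outside = [0.5, 0.0, 0.5] # takes an action pi0 never takes -> infinite
```

Here `kl_divergence(pi_narrow, pi0)` is log 2, while `kl_divergence(pi_outside, pi0)` is infinite, matching the asymmetry described above.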
If our reward model gives reward U, then the optimal policy for RLHF with a KL penalty (coefficient β) is an exponential tilting of the prior over trajectories: π∗(τ) ∝ π0(τ)exp(U(τ)/β).
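The exponential-tilting form of the optimal policy can be checked numerically on a toy discrete outcome space (the setup below is my own illustration; the prior, reward values, and β are arbitrary assumptions):

```python
import numpy as np

def optimal_tilted_policy(pi0, U, beta):
    """Maximizer of E_pi[U] - beta * D_KL(pi || pi0):
    the exponential tilting pi*(x) proportional to pi0(x) * exp(U(x) / beta)."""
    w = pi0 * np.exp(U / beta)
    return w / w.sum()

def objective(pi, pi0, U, beta):
    """The KL-penalized objective E_pi[U] - beta * D_KL(pi || pi0)."""
    mask = pi > 0
    kl = np.sum(pi[mask] * np.log(pi[mask] / pi0[mask]))
    return float(pi @ U - beta * kl)

pi0 = np.full(5, 0.2)                    # uniform prior over 5 outcomes
U = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # toy reward
beta = 1.0
pi_star = optimal_tilted_policy(pi0, U, beta)

# No randomly drawn policy scores higher on the penalized objective.
rng = np.random.default_rng(0)
for _ in range(100):
    pi = rng.dirichlet(np.ones(5))
    assert objective(pi, pi0, U, beta) <= objective(pi_star, pi0, U, beta) + 1e-9
```

A known identity provides a further check: the optimal value equals β·log E_{π0}[exp(U/β)], which the tilted policy attains exactly.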
Suppose we have an RL environment with reward U=X+V, where X is an error term that is heavy-tailed under π0, and V is the "true utility", assumed to be light-tailed under π0. Without loss of generality, we assume that E[U(π0)]=0. If we optimize E[U(π)] − βDKL(π‖π0), there is no maximum because this expression is unbounded. In fact, for any M and any ε>0, it is possible to get E[U(π)]>M while keeping DKL(π‖π0)<ε.
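A toy numerical illustration of why the objective is unbounded (my own construction, not from the post): under a heavy-tailed prior, shifting a tiny amount of probability mass onto an extreme-reward outcome buys a huge increase in E[U] at almost no KL cost, because the prior already assigns that outcome nonzero (if small) probability.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Heavy-tailed prior over outcomes k = 1..N: pi0(k) proportional to 1/k^2,
# with reward U(k) = k (here all reward is "error" X; V is unaffected).
N = 10**6
k = np.arange(1, N + 1, dtype=float)
pi0 = 1.0 / k**2
pi0 /= pi0.sum()
U = k

# Move a small amount eps of mass onto the extreme outcome k = N.
eps = 1e-3
pi = pi0 * (1 - eps)
pi[-1] += eps

# E[U] jumps by roughly eps * N, while the KL cost stays tiny, since the
# prior's heavy tail means log(pi(N)/pi0(N)) grows only logarithmically.
print("E[U] under pi0:", pi0 @ U)
print("E[U] under pi :", pi @ U)
print("KL(pi || pi0) :", kl(pi, pi0))
```

Making N larger (or eps smaller while scaling N up) drives E[U] arbitrarily high with arbitrarily small KL, which is the unboundedness claimed above.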