Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The True Story of How GPT-2 Became Maximally Lewd, published by Writer on January 19, 2024 on LessWrong.
This video recounts an incident at OpenAI in which flipping a single minus sign led the RLHF process to train GPT-2 to output only sexually explicit continuations.
The incident is described in OpenAI's paper "Fine-Tuning Language Models from Human Preferences" under section 4.4: "Bugs can optimize for bad behavior".
The script was written by Jai, with significant input and reworking by me, Writer. You can read it below.
In 2019, one OpenAI researcher made a typo - and birthed an evil AI hell-bent on making everything as horny as possible.
This is the absurd, ridiculous, and yet true story of how it happened.
Part I: GPT
Since 2018, OpenAI has been building Generative Pre-trained Transformer models, or GPTs - language AIs with a singular focus on predicting text, trained across billions of writing samples. If you prompt a GPT model with "Once upon a ", it will predict "time" to follow. Asked for further predictions, the same GPT model might continue "there was a… brave dog named Grace", and so on - because those are the kinds of words that it expects to come next.
In this example the GPT model has essentially learned to write a fairy tale, simply as a consequence of getting very, very good at text prediction. And it was exactly these kinds of emergent capabilities that had OpenAI so excited. These models can do a lot more than fairy tales.
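To make the prediction step concrete, here is a minimal sketch of next-token prediction using the publicly released GPT-2 weights and the Hugging Face transformers library - an illustration with public tooling, not OpenAI's own code:

```python
# Minimal sketch: next-token prediction with the public "gpt2" checkpoint,
# via Hugging Face transformers (assumed tooling, not OpenAI's internal code).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("Once upon a", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# The highest-scoring token after the prompt is the model's prediction.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode(next_token_id))  # almost certainly " time"
```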
OpenAI's first GPT model, often called GPT-1, had been trained on excerpts from thousands of books. It showed so much promise that OpenAI almost immediately decided to train a much bigger model that could do more. But bigger models need more training data, and for this model, books would not be enough. No - this model would be trained on... the Internet.
OpenAI trained GPT-2 to imitate writing across eight million web pages. And in learning to predict such an overwhelming quantity and variety of writing, GPT-2 acquired some surprising capabilities. With the right prompt, it could translate documents, answer questions about a text, summarize passages, and sometimes even write like a human. It was a shockingly powerful model.
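As one illustration of this prompt-driven versatility: the GPT-2 paper elicited summaries zero-shot simply by appending "TL;DR:" to a passage. A sketch with the public checkpoint and Hugging Face transformers (assumed tooling, not OpenAI's own):

```python
# Sketch: zero-shot summarization by prompting alone. "TL;DR:" is the cue the
# GPT-2 paper used; the small public "gpt2" checkpoint gives rough results,
# but it illustrates the mechanism.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
passage = (
    "A brave dog named Grace guarded the village for many years, "
    "and the villagers loved her for it.\n\nTL;DR:"
)
print(generator(passage, max_new_tokens=25)[0]["generated_text"])
```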
In fact, it may have been too powerful. GPT-2 wouldn't hesitate to plan crimes, instruct terrorists on bomb-making, create sexually explicit content, or promote cruelty, hatred and misinformation.
And this was unacceptable to OpenAI - they wanted a model that did more than just predict text; they wanted a model that operated in accordance with some kind of human values, or at least with their values. But the GPT-2 architecture had no place for ethics, guidelines, principles, or corporate PR policies. It couldn't be bullied, reasoned with, or negotiated with. Nothing would sway the machine from its utter devotion to generating realistic text.
But OpenAI was determined to get their model under control. So they got to work... not yet realizing that this work, along with a single typo, would lead to the one thing they didn't want to happen.
Part II: Human Feedback
To align GPT-2, OpenAI used a new technique known as "Reinforcement Learning from Human Feedback", or "RLHF". We're going to outline a simplified form of RLHF here, but if you want all the juicy technical details check out the link in the description.
The goal of RLHF is to take a basic starting language model, some plain-language guidelines, and a small group of humans providing feedback, and produce a new model that follows those guidelines. We can think of this model-in-training as the "Apprentice".
The apprentice begins the training process as an exact copy of GPT-2. During training, it receives prompts and generates responses, also called "continuations". Those prompts and continuations are sent to the human evaluators, who rate them based on how well they follow the guidelines.
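To make the loop concrete, here is a toy, heavily simplified sketch of the feedback-driven update just described - not OpenAI's implementation. The tiny `apprentice` network, the `human_rating` stand-in, and the REINFORCE-style update are all illustrative assumptions; the commented line marks where a single flipped minus sign, like the bug in the paper's section 4.4, would turn "maximize approval" into "maximize disapproval".

```python
# Toy sketch of the simplified RLHF loop described above (not OpenAI's code).
# The apprentice samples continuations, a rating stands in for human feedback,
# and a REINFORCE-style update nudges the model toward higher-rated outputs.
import torch

vocab_size, hidden = 50, 32
apprentice = torch.nn.Sequential(        # stand-in for a copy of GPT-2
    torch.nn.Embedding(vocab_size, hidden),
    torch.nn.Linear(hidden, vocab_size),
)
optimizer = torch.optim.Adam(apprentice.parameters(), lr=1e-3)

def human_rating(token: int) -> float:
    """Stand-in for a human evaluator scoring a continuation."""
    return 1.0 if token < vocab_size // 2 else -1.0

for step in range(100):
    prompt = torch.randint(vocab_size, (1,))          # a random prompt token
    dist = torch.distributions.Categorical(logits=apprentice(prompt))
    continuation = dist.sample()                      # the apprentice's reply

    reward = human_rating(continuation.item())
    # The fateful sign: write `-reward` here instead, and the very same loop
    # faithfully optimizes for the lowest-rated continuations.
    loss = (-reward * dist.log_prob(continuation)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```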