Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The True Story of How GPT-2 Became Maximally Lewd, published by Writer on January 19, 2024 on LessWrong.
This video recounts an incident at OpenAI in which flipping a single minus sign led the RLHF process to train GPT-2 to output only sexually explicit continuations.
The incident is described in OpenAI's paper "Fine-Tuning Language Models from Human Preferences" under section 4.4: "Bugs can optimize for bad behavior".
The script was written by Jai, with significant input and reworking by me, Writer. You can read it below.
In 2019, one OpenAI researcher made a typo - and birthed an evil AI hell-bent on making everything as horny as possible.
This is the absurd, ridiculous, and yet true story of how it happened.
Part I: GPT
Since 2018, OpenAI has been building Generative Pre-trained Transformer models, or GPTs - language AIs with a singular focus on predicting text, trained across billions of writing samples. If you prompt a GPT model with "Once upon a ", it will predict "time" to follow. Asked for further predictions, the same GPT model might continue "there was a… brave dog named Grace", and so on - because those are the kinds of words that it expects to come next.
In this example the GPT model has essentially learned to write a fairy tale, simply as a consequence of getting very, very good at text prediction. And it was exactly these kinds of emergent capabilities that had OpenAI so excited. These models can do a lot more than fairy tales.
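To make the prediction step concrete, here is a minimal sketch of next-token prediction using the publicly released GPT-2 weights and the Hugging Face transformers library - an illustration with public tooling, not OpenAI's own code:

```python
# Minimal sketch: next-token prediction with the public "gpt2" checkpoint,
# via Hugging Face transformers (assumed tooling, not OpenAI's internal code).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("Once upon a", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# The highest-scoring token after the prompt is the model's prediction.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode(next_token_id))  # almost certainly " time"
```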
OpenAI's first GPT model, often called GPT-1, had been trained on excerpts from thousands of books. It showed so much promise that OpenAI almost immediately decided to train a much bigger model that could do more. But bigger models need more training data, and for this model, books would not be enough. No - this model would be trained on... the Internet.
OpenAI trained GPT-2 to imitate writing across eight million web pages. And in learning to predict such an overwhelming quantity and variety of writing, GPT-2 acquired some surprising capabilities. With the right prompt, it could translate documents, answer questions about a text, summarize passages, and sometimes even write like a human. It was a shockingly powerful model.
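As one illustration of this prompt-driven versatility: the GPT-2 paper elicited summaries zero-shot simply by appending "TL;DR:" to a passage. A sketch with the public checkpoint and Hugging Face transformers (assumed tooling, not OpenAI's own):

```python
# Sketch: zero-shot summarization by prompting alone. "TL;DR:" is the cue the
# GPT-2 paper used; the small public "gpt2" checkpoint gives rough results,
# but it illustrates the mechanism.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
passage = (
    "A brave dog named Grace guarded the village for many years, "
    "and the villagers loved her for it.\n\nTL;DR:"
)
print(generator(passage, max_new_tokens=25)[0]["generated_text"])
```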
In fact, it may have been too powerful. GPT-2 wouldn't hesitate to plan crimes, instruct terrorists on bomb-making, create sexually explicit content, or promote cruelty, hatred and misinformation.
And this was unacceptable to OpenAI - they wanted a model that did more than just predict text; they wanted a model that operated in accordance with some kind of human values, or at least with their values. But the GPT-2 architecture had no place for ethics, guidelines, principles, or corporate PR policies. It couldn't be bullied, reasoned with, or negotiated with. Nothing would sway the machine from its utter devotion to generating realistic text.
But OpenAI was determined to get their model under control. So they got to work... not yet realizing that this work, along with a single typo, would lead to the one thing they didn't want to happen.
Part II: Human Feedback
To align GPT-2, OpenAI used a new technique known as "Reinforcement Learning from Human Feedback", or "RLHF". We're going to outline a simplified form of RLHF here, but if you want all the juicy technical details check out the link in the description.
The goal of RLHF is to take a basic starting language model, some plain-language guidelines, and a small group of humans providing feedback, and produce a new model that follows those guidelines. We can think of this model-in-training as the "Apprentice".
The apprentice begins the training process as an exact copy of GPT-2. During training, it receives prompts and generates responses, also called "continuations". Those prompts and continuations are sent to the human evaluators, who rate them based on how well they follow the guidelines.
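To make the loop concrete, here is a toy, heavily simplified sketch of the feedback-driven update just described - not OpenAI's implementation. The tiny `apprentice` network, the `human_rating` stand-in, and the REINFORCE-style update are all illustrative assumptions; the commented line marks where a single flipped minus sign, like the bug in the paper's section 4.4, would turn "maximize approval" into "maximize disapproval".

```python
# Toy sketch of the simplified RLHF loop described above (not OpenAI's code).
# The apprentice samples continuations, a rating stands in for human feedback,
# and a REINFORCE-style update nudges the model toward higher-rated outputs.
import torch

vocab_size, hidden = 50, 32
apprentice = torch.nn.Sequential(        # stand-in for a copy of GPT-2
    torch.nn.Embedding(vocab_size, hidden),
    torch.nn.Linear(hidden, vocab_size),
)
optimizer = torch.optim.Adam(apprentice.parameters(), lr=1e-3)

def human_rating(token: int) -> float:
    """Stand-in for a human evaluator scoring a continuation."""
    return 1.0 if token < vocab_size // 2 else -1.0

for step in range(100):
    prompt = torch.randint(vocab_size, (1,))          # a random prompt token
    dist = torch.distributions.Categorical(logits=apprentice(prompt))
    continuation = dist.sample()                      # the apprentice's reply

    reward = human_rating(continuation.item())
    # The fateful sign: write `-reward` here instead, and the very same loop
    # faithfully optimizes for the lowest-rated continuations.
    loss = (-reward * dist.log_prob(continuation)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```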