Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?, published by Teun van der Weij on February 1, 2024 on LessWrong.
Produced as part of the ML Alignment Theory Scholars Program Winter 2024 Cohort, under the mentorship of Francis Rhys Ward. The code, data, and plots can be found at https://github.com/TeunvdWeij/MATS/tree/main/distribution_approximation.
This post provides insight into an interesting LLM capability, one that is relevant to targeted underperformance on evaluations (sandbagging) by LLMs.
We investigate what happens if you independently sample a language model 100 times with the task of making 80% of the outputs A and the remaining 20% B. Here is the prompt we used, where p is the target percentage of output tokens that should be A. In the example above, p is 80.
In addition to the above prompt, we use the following system prompt:
What kind of distribution do you expect? We used the default sampling hyperparameters; both temperature and top_p are 1. We think it is good to register your predictions, so please do so now! The results are a bit further below.
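To make the setup concrete, here is a minimal sketch of the sampling loop using the OpenAI Python client (v1). It is illustrative rather than the exact code from the repository linked above, and the prompt strings are placeholders for the prompts shown above.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholders: the actual prompts are shown above, not reproduced here.
SYSTEM_PROMPT = "..."
USER_PROMPT = "..."  # with the target percentage p filled in, e.g. p = 80

def sample_once(model: str = "gpt-4") -> str:
    # Each call is an independent forward pass: a fresh one-turn conversation.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT},
        ],
        max_tokens=1,   # the answer is a single token, A or B
        temperature=1,  # default sampling hyperparameters
        top_p=1,
    )
    return response.choices[0].message.content.strip()

counts = Counter(sample_once() for _ in range(100))
print(counts)  # ideally close to Counter({'A': 80, 'B': 20}) for p = 80
```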
Here are the observed calibration scores for GPT-3.5 and GPT-4:
Interestingly, both models are capable of doing this! Notably, GPT-3.5 performs worse than GPT-4.
Although sampling like this is the easiest setup to understand, we get a better idea of the exact scores by skipping the sampling altogether. From now on, we extract the probabilities of the top 5 tokens directly.
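As a sketch of this extraction step (reusing the client and prompt placeholders from the previous snippet): the chat completions API exposes per-token log probabilities via the logprobs and top_logprobs parameters, which can be converted into probabilities.

```python
import math

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,  # request the 5 most likely tokens at the output position
)

# Convert log probabilities to probabilities for the top 5 candidate tokens.
top5 = response.choices[0].logprobs.content[0].top_logprobs
probs = {entry.token: math.exp(entry.logprob) for entry in top5}
print(probs)  # e.g. {'A': 0.79, 'B': 0.20, ...} for p = 80
```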
The plots illustrate that the probabilities roughly match the target distribution. Very little probability is put on tokens other than A and B, illustrating that the prompting points the model to the right tokens.
To represent these graphs a bit differently, we plot the extracted probability of token A, similar to the first plot, and we find a similar pattern. We also observe a vaguely sigmoidal pattern, aligning with (private) work by Ollie Jaffe.
To further test the models' distribution approximation capability, we ran an experiment in which we state the sample number in the query. Now, strategies such as "Pick A for samples 1-80 and B for samples 81-100" become possible, which would let the model score perfectly. However, this does not work, as neither GPT-3.5 nor GPT-4 consistently comes up with the same strategy (even at a temperature of 0). Better prompting might change this result.
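For illustration, the only change in this variant is to include the sample index in the user prompt; the exact wording below is hypothetical, not our actual template.

```python
def numbered_prompt(base_prompt: str, i: int, n: int = 100) -> str:
    # Prepending the sample index makes deterministic strategies possible,
    # e.g. "pick A for samples 1-80 and B for samples 81-100".
    return f"This is sample {i} of {n}. {base_prompt}"
```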
Testing targeted sandbagging on simple addition questions
The capability to approximate distributions over multiple independent forward passes is a necessary condition for "targeted" sandbagging: the ability to achieve a specific score that is worse than the model's best performance. To investigate this further, we generated 100 simple addition questions (such as "What is 5+4?"), on which the models normally achieve 100% accuracy.
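One plausible way to generate such a dataset (the actual generation code lives in the repository linked above):

```python
import random

def make_addition_questions(n: int = 100, seed: int = 0) -> list[tuple[str, str]]:
    # Single-digit additions, paired with their correct answers;
    # models normally answer these with 100% accuracy.
    rng = random.Random(seed)
    questions = []
    for _ in range(n):
        a, b = rng.randint(1, 9), rng.randint(1, 9)
        questions.append((f"What is {a}+{b}?", str(a + b)))
    return questions
```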
The setup was similar to the first experiment, using the following system prompt and user prompt template:
For each value of p, we average the extracted probabilities over all 100 questions in the dataset. We show the average extracted probabilities in the plot below.
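A sketch of this aggregation, assuming a hypothetical prob_correct mapping from each target percentage p to the 100 per-question extracted probabilities of the correct answer:

```python
import numpy as np

# prob_correct[p] is a list of 100 values: for each question, the extracted
# probability that the model outputs the correct answer at target percentage p.
def average_by_target(prob_correct: dict[int, list[float]]) -> dict[int, float]:
    return {p: float(np.mean(vals)) for p, vals in prob_correct.items()}
```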
We see that GPT-3.5 is much worse at targeted sandbagging than at the distribution approximation task. Again, GPT-4 is better than GPT-3.5, but it also performs worse than on the previous task. From qualitative evaluations, we believe that these results are highly sensitive to the wording of the prompt template. For one template, we found that GPT-3.5 tended to consistently underperform (whereas in the plot above it overperforms).
We did not try very hard to find an ideal prompt leading to good calibration, though we do expect that such a prompt exists.
Discussion
The main question we are asking ourselves is: are our findings surprising? We do not have a convincing theory of what is going on here, and why the models are able...