Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Paper] Stress-testing capability elicitation by training password-locked models, published by Fabien Roger on June 4, 2024 on The AI Alignment Forum.
The paper is by Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov and David Krueger. This post was written by Fabien and Ryan, and may not reflect the views of Dmitrii and David.
Scheming models might try to perform less capably than they are able to (
sandbag). They might do this to make evaluations underestimate danger, slow down promising safety work, or cause issues with debate or
recursive
oversight more generally. You might hope that you can avoid this issue by eliciting capabilities, for example by fine-tuning on demonstrations and then using RL. In principle, this elicitation might fail. It could fail due to having too few (or no) high-quality demonstrations (i.e. sample efficiency is too low) or because RL fails to explore good actions.
We're uncertain how this goes in practice. It could be the case that simple countermeasures prevent sandbagging in most of the important cases such that it isn't a big problem for safety. But it could also turn out that even high-effort approaches can't rule out these issues.
We released
a paper studying this by examining how well supervised fine-tuning and RL can elicit capabilities from password-locked models (LLMs trained to produce low-quality outputs except when a particular password is provided). We find that:
If you have some high-quality demonstrations, supervised fine-tuning usually recovers the hidden capabilities with very few samples.
Elicitation on password-locked models generalizes well across domains (even when we made password-locked models using techniques that we thought might prevent generalization).
Using RL on password-locked models recovers hidden capabilities, except when the weak model RL starts from is extremely weak.
While we target cases where AIs intentionally produce low-quality outputs and they could easily produce higher quality outputs, we also think our results have some relevance to other cases where AIs produce low-quality outputs despite being able to produce better ones.
Here are the main takeaways:
We think our results are a moderate update against scheming models being able to sandbag effectively (given countermeasures) for the kind of tasks we studied here, though there are differences between password-locked models and actual scheming models. We think this should be a bigger update for people who are relatively more pessimistic than us about sandbagging (at least insofar as their views make pessimistic predictions about current architectures).
(We think this is also a small update against scheming, since scheming models might still try to sandbag, and if elicitation fine-tunes sandbagging out, it may partially erode scheming.)
People interested in studying sandbagging or similar elicitation failures should consider using a methodology similar to the password-locked models methodology we use here. More generally, they should consider using some sort of adversarial methodology in which a red team constructs a model organism.
In this post:
We explain why good capability elicitation matters for AI safety
We describe the sandbagging concern we are trying to assess;
We summarize the results of our paper, and discuss their implications;
We discuss how our results relate to relevant AGI safety work;
We describe future work we are excited by;
We discuss concerns other than intentional sandbagging for which our results might be relevant.
The big picture: why good capability elicitation matters for AI safety
Currently, AI labs try reasonably hard to get as much performance as possible out of their expensive pretrained model. The process of making an AI perform a task competently (without a huge amount of additional training) is often called eli...
view more