Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An Introduction to AI Sandbagging, published by Teun van der Weij on April 26, 2024 on The AI Alignment Forum.
Summary: Evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. These development and deployment decisions have important safety consequences, and therefore they require trustworthy information. One reason why evaluation results might be untrustworthy is sandbagging, which we define as strategic underperformance on an evaluation.
The strategic nature can originate from the developer (developer sandbagging) and the AI system itself (AI system sandbagging). This post is an introduction to the problem of sandbagging.
The Volkswagen emissions scandal
There are environmental regulations which require the reduction of harmful emissions from diesel vehicles, with the goal of protecting public health and the environment. Volkswagen struggled to meet these emissions standards while maintaining the desired performance and fuel efficiency of their diesel engines (Wikipedia).
Consequently, Volkswagen intentionally installed defeat devices in millions of vehicles to lower the amount of emissions only during tests, allowing the cars to emit up to 40 times the permitted levels of nitrogen oxides during normal driving conditions. The scandal showed corporate misconduct and a lack of effective oversight, and is a form of sandbagging. But clearly, cars and their manufacturers are different to AI systems and their developers. So, how does sandbagging relate to AI systems?
Sandbagging in the context of AI
Evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. Frontier labs rely on evaluations to reduce catastrophic risks, as stated in e.g. OpenAI's preparedness framework and Anthropic's responsible scaling policy. Governmental institutions like the UK AI Safety Institute and the EU AI Office also utilize evaluations to manage risks. However, sandbagging can reduce the reliability of these evaluations.
Both the developers of an AI system and the AI system itself have incentives to sandbag. For example, the developer might want to appease stakeholders, and the AI system might want to acquire more resources after an evaluation to better achieve its goals. Importantly, sandbagging is technically feasible[1] and neglected. We hope this post leads to a better understanding of the problem of sandbagging.
Definition
We define sandbagging as strategic underperformance on an evaluation.
This has three main concepts: strategic, underperformance, and evaluation (or: eval), which we discuss below in turn. Notably, we use the term 'AI system' to describe various types of AIs, such as standalone models and models augmented with tools/scaffolding (Shavit et al., 2023).
Strategic
Strategic means that the developer of the AI system and/or the AI system itself underperforms on an evaluation because it believes it will benefit.[2] If the strategy comes from the developer, we call it developer sandbagging (relating to misuse). If the strategy comes from the AI system, we call it AI system sandbagging (relating to misalignment). We use the term developer to mean any entity responsible for the AI system being evaluated.
These entities have been called developers, deployers, or providers elsewhere (1, 2, 3).
By including the term 'strategic' in the definition of sandbagging, we exclude accidental underperformance. In other words, this condition captures the fact that underperformance was intentional, which is a central condition for legal responsibility, including in the context of AI (Ashton, 2022; Halpern, 2018; Ward, 2024). Non-strategic, or accidental, sandbagging brings about other less important safety problems.
Underperformance
Underperformance occurs whe...
view more