Download - AF - An Introduction to AI Sandbagging by Teun van der Weij

Discover

Podcast Features
Your all-in-one podcasting solution.

Blog to Podcast
Turn your blog into an engaging podcast.
Livestream
High-performing audio live, without limits.

Podcast Studio
Easy-to-use audio recorder app.
Podbean AI
AI-Enhanced Audio Quality and Content Generation.

Podcast App
The best podcast player & podcast app.

Ads Marketplace
Join Ads Marketplace to earn money
through sponsorship on your podcast.

PodAds
Manage your ads with dynamic ad insertion capability.
Apple Podcasts Subscriptions Integration
Effortlessly publish and manage exclusive episodes for your
Apple Podcasts subscribers directly from Podbean.
Live Streaming
Receive livestream rewards from your audience and earn
recurring income from your Fan Club membership.

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Monetization
Advertisers
Enterprise
Pricing
Discover

The Nonlinear Library

Education

AF - An Introduction to AI Sandbagging by Teun van der Weij

2024-04-26

Download Right click and do "save link as"

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An Introduction to AI Sandbagging, published by Teun van der Weij on April 26, 2024 on The AI Alignment Forum.
Summary: Evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. These development and deployment decisions have important safety consequences, and therefore they require trustworthy information. One reason why evaluation results might be untrustworthy is sandbagging, which we define as strategic underperformance on an evaluation.
The strategic nature can originate from the developer (developer sandbagging) and the AI system itself (AI system sandbagging). This post is an introduction to the problem of sandbagging.
The Volkswagen emissions scandal
There are environmental regulations which require the reduction of harmful emissions from diesel vehicles, with the goal of protecting public health and the environment. Volkswagen struggled to meet these emissions standards while maintaining the desired performance and fuel efficiency of their diesel engines (Wikipedia).
Consequently, Volkswagen intentionally installed defeat devices in millions of vehicles to lower the amount of emissions only during tests, allowing the cars to emit up to 40 times the permitted levels of nitrogen oxides during normal driving conditions. The scandal showed corporate misconduct and a lack of effective oversight, and is a form of sandbagging. But clearly, cars and their manufacturers are different to AI systems and their developers. So, how does sandbagging relate to AI systems?
Sandbagging in the context of AI
Evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. Frontier labs rely on evaluations to reduce catastrophic risks, as stated in e.g. OpenAI's preparedness framework and Anthropic's responsible scaling policy. Governmental institutions like the UK AI Safety Institute and the EU AI Office also utilize evaluations to manage risks. However, sandbagging can reduce the reliability of these evaluations.
Both the developers of an AI system and the AI system itself have incentives to sandbag. For example, the developer might want to appease stakeholders, and the AI system might want to acquire more resources after an evaluation to better achieve its goals. Importantly, sandbagging is technically feasible[1] and neglected. We hope this post leads to a better understanding of the problem of sandbagging.
Definition
We define sandbagging as strategic underperformance on an evaluation.
This has three main concepts: strategic, underperformance, and evaluation (or: eval), which we discuss below in turn. Notably, we use the term 'AI system' to describe various types of AIs, such as standalone models and models augmented with tools/scaffolding (Shavit et al., 2023).
Strategic
Strategic means that the developer of the AI system and/or the AI system itself underperforms on an evaluation because it believes it will benefit.[2] If the strategy comes from the developer, we call it developer sandbagging (relating to misuse). If the strategy comes from the AI system, we call it AI system sandbagging (relating to misalignment). We use the term developer to mean any entity responsible for the AI system being evaluated.
These entities have been called developers, deployers, or providers elsewhere (1, 2, 3).
By including the term 'strategic' in the definition of sandbagging, we exclude accidental underperformance. In other words, this condition captures the fact that underperformance was intentional, which is a central condition for legal responsibility, including in the context of AI (Ashton, 2022; Halpern, 2018; Ward, 2024). Non-strategic, or accidental, sandbagging brings about other less important safety problems.
Underperformance
Underperformance occurs whe...