AI companies aren't really using external evaluators, by Zach Stein-Perlman, published on LessWrong on May 24, 2024.
Crossposted from my new blog: AI Lab Watch. Subscribe on Substack.
Many AI safety folks think that METR is close to the labs, with ongoing relationships that grant it access to models before they are deployed. This is incorrect. METR (then called ARC Evals) did pre-deployment evaluation for GPT-4 and Claude 2 in the first half of 2023, but it seems to have had no special access since then.[1] Other model evaluators also seem to have little access before deployment.
Frontier AI labs' pre-deployment risk assessment should involve external model evals for dangerous capabilities.[2] External evals can improve a lab's risk assessment and - if the evaluator can publish its results - provide public accountability.
The evaluator should get deeper access than users will get.
To evaluate threats from a particular deployment protocol, the evaluator should get somewhat deeper access than users will - then the evaluator's failure to elicit dangerous capabilities is stronger evidence that users won't be able to either.[3] For example, the lab could share a version of the model without safety filters or harmlessness training, and ideally allow evaluators to fine-tune the model.
To evaluate threats from model weights being stolen or released, the evaluator needs deep access, since someone with the weights has full access.
The costs of using external evaluators are unclear.
Anthropic said that collaborating with METR "requir[ed] significant science and engineering support on our end"; it has not clarified why. And even if providing deep model access or high-touch support is a hard engineering problem, I don't understand how sharing API access - including what users will receive and a no-harmlessness no-filters version - could be.
Sharing model access pre-deployment increases the risk of leaks, including of information about products (modalities, release dates), information about capabilities, and demonstrations of models misbehaving.
Independent organizations that do model evals for dangerous capabilities include METR, the UK AI Safety Institute (UK AISI), and Apollo. Only Google DeepMind says it has recently shared pre-deployment access with such an evaluator - UK AISI - and that sharing was minimal (see below).
What the labs say they're doing on external evals before deployment:
DeepMind
DeepMind shared Gemini 1.0 Ultra with unspecified external groups, apparently including UK AISI, to test for dangerous capabilities before deployment. But DeepMind didn't share deep access: it only shared a system with safety fine-tuning and safety filters, and it didn't allow evaluators to fine-tune the model. DeepMind has not shared any results of this testing.
Its Frontier Safety Framework says "We will . . . explore how to appropriately involve independent third parties in our risk assessment and mitigation processes."
Anthropic
Currently nothing
Its Responsible Scaling Policy mentions "external audits" as part of "Early Thoughts on ASL-4"
It shared Claude 2 with METR in the first half of 2023
OpenAI
Currently nothing
Its Preparedness Framework does not mention external evals before deployment. The closest thing it says is "Scorecard evaluations (and corresponding mitigations) will be audited by qualified, independent third-parties."
It shared GPT-4 with METR in the first half of 2023
It said "We think it's important that efforts like ours submit to independent audits before releasing new systems; we will talk about this in more detail later this year." That was in February 2023; I do not believe it elaborated (except to mention that it shared GPT-4 with METR).
All notable American labs joined the White House voluntary commitments, which include "external red-teaming . . . in areas ...