Following on from our recent paper, “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, I’m very excited to announce that I have started leading a new team at Anthropic, the Alignment Stress-Testing team, with Carson Denison and Monte MacDiarmid as current team members. Our mission, and our mandate from the organization, is to red-team Anthropic's alignment techniques and evaluations, empirically demonstrating ways in which Anthropic's alignment strategies could fail.
The easiest way to get a sense of what we’ll be working on is probably just to check out our “Sleeper Agents” paper, which was our first big research project. I’d also recommend Buck and Ryan's post on meta-level adversarial evaluation as a good general description of our team's scope. Very simply, our job is to try to prove to Anthropic—and the world more broadly—(if it is [...]
---
First published:
January 12th, 2024
Source:
https://www.lesswrong.com/posts/EPDSdXr8YbsDkgsDG/introducing-alignment-stress-testing-at-anthropic
---
Narrated by TYPE III AUDIO.