Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How useful is "AI Control" as a framing on AI X-Risk?, published by habryka on March 14, 2024 on LessWrong.
Redwood (where Ryan works) recently released a series of blog posts proposing "AI Control": a framing and research agenda for reducing AI risk that focuses on ensuring safety (and, secondarily, usefulness) under the conservative assumption that AIs are misaligned and actively scheming against human interests.
This is in contrast to other work on AI risk, usually called "AI Alignment", which focuses on reducing the probability that AI systems pursue goals in conflict with human values in the first place (which might include having them not pursue goals in the relevant sense at all). In other words, control aims to ensure that even if your models are actively misaligned, you'll be safe, because they are not capable of subverting your safety measures.
In this dialogue we dig into our disagreements on the degree to which this kind of work seems promising, and whether/how this reframing opens up new avenues for valuable research and engineering projects.
In the context of this dialogue, we'll use the word "scheming" in the same way as used in Joe Carlsmith's recent report: scheming is when AIs perform well (and look aligned) in training and evaluations in order to gain power later. This is also called deceptive alignment.
The Case for Control Work
Let me start by restating the very basic case for control work in my own words, and then give some high-level takes on where, on my own models, I expect things to break.
The aim of control work is to:
Get useful work out of AIs even if they are situationally aware and would like to pursue unbounded objectives in a relatively consequentialist way
Set things up so that we get to learn from any failures in the above process, both societally and technically. That is, a large fraction of the important work in the context of control is catching treacherous turns and leveraging each case for both a better technical and a better societal response
In some sense this doesn't really capture an existential success story yet. Like, cool, we can now use the AIs to make more stuff and be richer. That's nice, but it seems like, by default, at some point our control schemes will fail as the AIs get more competent, unless we do one of the following:
Use the controlled AIs to help us be less confused about AI Alignment and come up with a plan that solves the full AI X-risk problem (in some sense punting the problem to future human + controlled AI systems)
Have a specific story about a task that we want to use the controlled AIs for that we think will then allow us to control much smarter systems (maybe not literal galaxy-brains, but like smart enough that just punting the problem to future humans + controlled AI systems seems like a pretty sure bet)
Have a specific story about a task that we want to use the AIs for that makes coordinating on how careful to be with building AI much easier. This could mean leveraging the AI systems themselves to make very scary demos, or finding some better way of eliciting preferences from the world's population that allows for better coordinated action. Then humans and AIs can have much more time to figure out how to solve this problem.
So another major part of working on victory via control is to study and figure out how to use controlled AIs to do one of the three things above.
Does this seem like a non-crazy summary?
Yes.
It's worth noting that parts of your summary are applicable to various strategies which aren't just control. E.g., sometimes people talk about avoiding misalignment in human-ish level systems and then using these systems to do various useful things. (See e.g. the OpenAI superalignment plan.)
So there are kinda two components:
Control to ensure safety and usefulness ...