Download - The case for aligning narrowly superhuman models by Ajeya Cotra | Podbean

Discover

Podcast Features
Your all-in-one podcasting solution.

Blog to Podcast
Turn your blog into an engaging podcast.
Livestream
High-performing audio live, without limits.

Podcast Studio
Easy-to-use audio recorder app.
Podbean AI
AI-Enhanced Audio Quality and Content Generation.

Podcast App
The best podcast player & podcast app.

Ads Marketplace
Join Ads Marketplace to earn money
through sponsorship on your podcast.

PodAds
Manage your ads with dynamic ad insertion capability.
Apple Podcasts Subscriptions Integration
Effortlessly publish and manage exclusive episodes for your
Apple Podcasts subscribers directly from Podbean.
Live Streaming
Receive livestream rewards from your audience and earn
recurring income from your Fan Club membership.

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Monetization
Advertisers
Enterprise
Pricing
Discover

The Nonlinear Library: LessWrong Top Posts

Education

The case for aligning narrowly superhuman models by Ajeya Cotra

2021-12-12

Download Right click and do "save link as"

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: The case for aligning narrowly superhuman models, published by Ajeya Cotra on the LessWrong.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I wrote this post to get people’s takes on a type of work that seems exciting to me personally; I’m not speaking for Open Phil as a whole. Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured). We are not seeking grant applications on this topic right now.
Thanks to Daniel Dewey, Eliezer Yudkowsky, Evan Hubinger, Holden Karnofsky, Jared Kaplan, Mike Levine, Nick Beckstead, Owen Cotton-Barratt, Paul Christiano, Rob Bensinger, and Rohin Shah for comments on earlier drafts.
A genre of technical AI risk reduction work that seems exciting to me is trying to align existing models that already are, or have the potential to be, “superhuman”[1] at some particular task (which I’ll call narrowly superhuman models).[2] I don’t just mean “train these models to be more robust, reliable, interpretable, etc” (though that seems good too); I mean “figure out how to harness their full abilities so they can be as useful as possible to humans” (focusing on “fuzzy” domains where it’s intuitively non-obvious how to make that happen).
Here’s an example of what I’m thinking of: intuitively speaking, it feels like GPT-3 is “smart enough to” (say) give advice about what to do if I’m sick that’s better than advice I’d get from asking humans on Reddit or Facebook, because it’s digested a vast store of knowledge about illness symptoms and remedies. Moreover, certain ways of prompting it provide suggestive evidence that it could use this knowledge to give helpful advice. With respect to the Reddit or Facebook users I might otherwise ask, it seems like GPT-3 has the potential to be narrowly superhuman in the domain of health advice.
But GPT-3 doesn’t seem to “want” to give me the best possible health advice -- instead it “wants” to play a strange improv game riffing off the prompt I give it, pretending it’s a random internet user. So if I want to use GPT-3 to get advice about my health, there is a gap between what it’s capable of (which could even exceed humans) and what I can get it to actually provide me. I’m interested in the challenge of:
How can we get GPT-3 to give “the best health advice it can give” when humans[3] in some sense “understand less” about what to do when you’re sick than GPT-3 does? And in that regime, how can we even tell whether it’s actually “doing the best it can”?
I think there are other similar challenges we could define for existing models, especially large language models.
I’m excited about tackling this particular type of near-term challenge because it feels like a microcosm of the long-term AI alignment problem in a real, non-superficial sense. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.[4] So it seems like a promising form of practice to figure out how to get particular humans to oversee models that are more capable than them in specific ways, if this is done with an eye to developing scalable and domain-general techniques.
I’ll call this type of project aligning narrowly superhuman models. In the rest of this post, I:
Give a more detailed description of what aligning narrowly superhuman models could look like, what does and doesn’t “count”, and what future projects I think could be done in this space (more).
Explain why I think aligning narrowly superhuman models could meaningfully reduce long-term existential risk from misaligned AI (more).
Lay out the potential advantages that I think this work has over other types of AI alignment research: (a) conceptual thinking, (b) demos in small-scal...

More Episodes

Lifestyle interventions to increase longevity by RomeoStevens

2021-12-12

Is Rationalist Self-Improvement Real? by Jacob Falkovich

2021-12-12

A whirlwind tour of Ethereum financeby cata

2021-12-12

The Apologist and the Revolutionary by Scott Alexander

2021-12-12

2021-12-12

Where do your eyes go? by alkjash

2021-12-12

Confidence levels inside and outside an argument by Scott Alexander

2021-12-12

Fact Posts: How and Why by sarahconstantin

2021-12-12

Alignment Research Field Guide by abramdemski

2021-12-12

The Pavlov Strategy by sarahconstantin

2021-12-12

Tell Culture by LoganStrohl

2021-12-12

Other people are wrong vs I am right by Buck

2021-12-12

Your Cheerful Price by Eliezer Yudkowsky

2021-12-12

The Importance of Sidekicks by Swimmer963

2021-12-12

Extreme Rationality: It's Not That Great

2021-12-12

Learned Blankness by AnnaSalamon

2021-12-12

What cognitive biases feel like from the insideby chaosmage

2021-12-12

Hiring engineers and researchers to help align GPT-3 by paulfchristiano

2021-12-12

Arbital postmortem by alexei

2021-12-12

Personalized Medicine For Real by sarahconstantin

2021-12-12

←
1
2
3
4
5
6
7
8
9
10
→

012345678910111213141516171819

Get this podcast on your
phone, FREE

Download Podbean app on App Store

Download Podbean app on Google Play

Create your
podcast in
minutes

Full-featured podcast site
Unlimited storage and bandwidth
Comprehensive podcast stats
Distribute to Apple Podcasts, Spotify, and more
Make money with your podcast

It is Free

Podcast Services
MONETIZATION & MORE
KNOWLEDGE BASE
Support
Podbean

Privacy Policy
Cookie Policy
Terms of Use
Consent Preferences
Copyright © 2015-2024 Podbean.com