Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models", published by Rob Bensinger on the AI Alignment Forum.
Below, I’ve copied comments left by MIRI researchers Eliezer Yudkowsky and Evan Hubinger on March 1–3 on a draft of Ajeya Cotra’s "Case for Aligning Narrowly Superhuman Models." I've included back-and-forths with Cotra, and interjections by me and Rohin Shah.
The section divisions below correspond to the sections in Cotra's post.
0. Introduction
How can we train GPT-3 to give “the best health advice it can give” using demonstrations and/or feedback from humans who may in some sense “understand less” about what to do when you’re sick than GPT-3 does?
Eliezer Yudkowsky: I've had some related conversations with Nick Beckstead. I'd be hopeful about this line of work primarily because I think it points to a bigger problem with the inscrutable matrices of floating-point numbers, namely, we have no idea what the hell GPT-3 is thinking and cannot tell it to think anything else. GPT-3 has a great store of medical knowledge, but we do not know where that medical knowledge is; we do not know how to tell it to internally apply its medical knowledge rather than applying other cognitive patterns it has stored. If this is still the state of opacity of AGI come superhuman capabilities, we are all immediately dead. So I would be relatively more hopeful about any avenue of attack for this problem that used anything other than an end-to-end black box - anything that started to address, "Well, this system clearly has a bunch of medical knowledge internally, can we find that knowledge and cause it to actually be applied" rather than "What external forces can we apply to this solid black box to make it think more about healthcare?"
Evan Hubinger: +1 I continue to think that language model transparency research is the single most valuable current research direction within the class of standard ML research, for similar reasons to what Eliezer said above.
Ajeya Cotra: Thanks! I'm also excited about language model transparency, and would love to find ways to make it more tractable as a research statement / organizing question for a field. I'm not personally excited about the connotations of "transparency" because it evokes neuroscience-y interpretability tools, which don't feel scalable to situations where we don't get the concepts the model is using, and I'm very interested in finding slogans to keep researchers focused on the superhuman stuff.
Ajeya Cotra: I've edited the description of the challenge to emphasize human feedback less. It now reads "How can we get GPT-3 to give “the best health advice it can give” when humans in some sense “understand less” about what to do when you’re sick than GPT-3 does? And in that regime, how can we even tell/verify that it’s “doing the best it can”?"
Rob Bensinger: Nate and I tend to talk about "understandability" instead of "transparency" exactly because we don't want to sound like we're talking about normal ML transparency work.
Eliezer Yudkowsky: Other possible synonyms: Clarity, legibility, cognitive readability.
Ajeya Cotra: Thanks all -- I like the project of trying to come up with a good handle for the kind of language model transparency we're excited about (and have talked to Nick, Evan, etc. about it too), but I think I don't want to push it in this blog post right now because I haven't hit on something I believe in and I want to ship this.
In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.
Eliezer Yudkowsky: (I think you want an AGI that is superhuman in engineering domains and infrahuman in human-modeling-and-manipulation if such a ...