Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models", published by Rob Bensinger on the AI Alignment Forum.
Below, I’ve copied comments left by MIRI researchers Eliezer Yudkowsky and Evan Hubinger on March 1–3 on a draft of Ajeya Cotra’s "Case for Aligning Narrowly Superhuman Models." I've included back-and-forths with Cotra, and interjections by me and Rohin Shah.
The section divisions below correspond to the sections in Cotra's post.
0. Introduction
How can we train GPT-3 to give “the best health advice it can give” using demonstrations and/or feedback from humans who may in some sense “understand less” about what to do when you’re sick than GPT-3 does?
Eliezer Yudkowsky: I've had some related conversations with Nick Beckstead. I'd be hopeful about this line of work primarily because I think it points to a bigger problem with the inscrutable matrices of floating-point numbers, namely, we have no idea what the hell GPT-3 is thinking and cannot tell it to think anything else. GPT-3 has a great store of medical knowledge, but we do not know where that medical knowledge is; we do not know how to tell it to internally apply its medical knowledge rather than applying other cognitive patterns it has stored. If this is still the state of opacity of AGI come superhuman capabilities, we are all immediately dead. So I would be relatively more hopeful about any avenue of attack for this problem that used anything other than an end-to-end black box - anything that started to address, "Well, this system clearly has a bunch of medical knowledge internally, can we find that knowledge and cause it to actually be applied" rather than "What external forces can we apply to this solid black box to make it think more about healthcare?"
Evan Hubinger: +1 I continue to think that language model transparency research is the single most valuable current research direction within the class of standard ML research, for similar reasons to what Eliezer said above.
Ajeya Cotra: Thanks! I'm also excited about language model transparency, and would love to find ways to make it more tractable as a research statement / organizing question for a field. I'm not personally excited about the connotations of "transparency" because it evokes neuroscience-y interpretability tools, which don't feel scalable to situations where we don't get the concepts the model is using, and I'm very interested in finding slogans to keep researchers focused on the superhuman stuff.
Ajeya Cotra: I've edited the description of the challenge to emphasize human feedback less. It now reads "How can we get GPT-3 to give “the best health advice it can give” when humans in some sense “understand less” about what to do when you’re sick than GPT-3 does? And in that regime, how can we even tell/verify that it’s “doing the best it can”?"
Rob Bensinger: Nate and I tend to talk about "understandability" instead of "transparency" exactly because we don't want to sound like we're talking about normal ML transparency work.
Eliezer Yudkowsky: Other possible synonyms: Clarity, legibility, cognitive readability.
Ajeya Cotra: Thanks all -- I like the project of trying to come up with a good handle for the kind of language model transparency we're excited about (and have talked to Nick, Evan, etc. about it too), but I think I don't want to push it in this blog post right now because I haven't hit on something I believe in and I want to ship this.
In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.
Eliezer Yudkowsky: (I think you want an AGI that is superhuman in engineering domains and infrahuman in human-modeling-and-manipulation if such a ...