Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Research Agenda v0.9: Synthesising a human's preferences into a utility function, published by Stuart Armstrong on the AI Alignment Forum.
I'm now in a position where I can see a possible route to a safe/survivable/friendly Artificial Intelligence being developed. I'd give a 10+% chance of it being possible this way, and a 95% chance that some of these ideas will be very useful for other methods of alignment. So I thought I'd encode the route I'm seeing as a research agenda; this is the first public draft of it.
Clarity, rigour, and practicality: that's what this agenda needs. Writing this agenda has clarified a lot of points for me, to the extent that some of it now seems, in retrospect, just obvious and somewhat trivial - "of course that's the way you have to do X". But more clarification is needed in the areas that remain vague. And, once these are clarified enough for humans to understand, they need to be made mathematically and logically rigorous - and ultimately, cashed out into code, and tested and experimented with.
So I'd appreciate any comments that could help with these three goals, and welcome anyone interested in pursuing research along these lines over the long term.
Note: I periodically edit this document, to link it to more recent research ideas/discoveries.
0 The fundamental idea
This agenda fits itself into the broad family of Inverse Reinforcement Learning: delegating most of the task of inferring human preferences to the AI itself. Most of the task, but not all: it's been shown that humans need to build the right assumptions into the AI, or else the preference learning will fail.
To get these "right assumptions", this agenda will look into what preferences actually are, and how they may be combined together. There are hence four parts to the research agenda:
1. A way of identifying the (partial[1]) preferences of a given human $H$.
2. A way for ultimately synthesising a utility function $U_H$ that is an adequate encoding of the partial preferences of a human $H$.
3. Practical methods for estimating this $U_H$, and how one could use the definition of $U_H$ to improve other suggested methods for value-alignment.
4. Limitations and lacunas of the agenda: what is not covered. These may be avenues of future research, or issues that cannot fit into the $U_H$ paradigm.
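To make the first two parts concrete, here is a deliberately crude toy sketch in Python. It is not the synthesis method of this agenda: the `PartialPreference` class, the `synthesise_utility` function, the three-world example and the "weighted net score" rule are all assumptions invented purely for illustration. It shows only the overall shape of the pipeline: partial preferences go in, a single utility function over worlds comes out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialPreference:
    """Hypothetical toy encoding: world `better` is preferred to world `worse`."""
    better: str
    worse: str
    weight: float = 1.0  # assumed strength of the preference


def synthesise_utility(worlds, preferences):
    """Toy synthesis rule (an assumption, not the agenda's construction):
    a world's utility is the total weight of preferences favouring it,
    minus the total weight of preferences against it."""
    utility = {w: 0.0 for w in worlds}
    for p in preferences:
        utility[p.better] += p.weight
        utility[p.worse] -= p.weight
    return utility


# Three abstract worlds and three partial preferences of a human H.
worlds = ["A", "B", "C"]
prefs = [
    PartialPreference("A", "B", weight=1.0),
    PartialPreference("B", "C", weight=1.0),
    PartialPreference("A", "C", weight=0.5),
]

U_H = synthesise_utility(worlds, prefs)
ranking = sorted(worlds, key=U_H.get, reverse=True)
print(U_H)      # utilities per world
print(ranking)  # worlds ordered from most to least preferred
```

Even in this toy, the choice of aggregation rule is a design decision that changes the resulting utility function. That is one reason the synthesis process deserves careful scrutiny.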
There has been a myriad of small posts on this topic, and most will be referenced here. Most of these posts are stubs that hint at a solution, rather than spelling it out fully and rigorously.
The reason for that is to check for impossibility results ahead of time. The construction of $U_H$ is deliberately designed to be adequate, rather than elegant (indeed, the search for an elegant $U_H$ might be counterproductive and even dangerous, if genuine human preferences get sacrificed for elegance). If this approach is to work, then the safety of $U_H$ has to be robust to different decisions in the synthesis process (see Section 2.8, on avoiding disasters). Thus, initially, it seems more important to find approximate ideas that cover all possibilities, rather than having a few fully detailed sub-possibilities and several gaps.
Finally, it seems that if a sub-problem is not formally solved, we stand a much better chance of getting a good result from "hit it with lots of machine learning and hope for the best", than we would if there were huge conceptual holes in the method - a conceptual hole meaning that the relevant solution is broken in an unfixable way. Thus, I'm publishing this agenda now, where I see many implementation holes, but no large conceptual holes.
A word of warning here, though: with some justification, the original Dartmouth AI conference could also have claimed to be confident that there were no large conceptual holes in their plan of developing AI over a summer - and we know how wrong they turned out to be. With tha...