The Nonlinear Library: Alignment Forum Top Posts

Research Agenda v0.9: Synthesising a human's preferences into a utility function by Stuart Armstrong

2021-12-03

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Research Agenda v0.9: Synthesising a human's preferences into a utility function, published by Stuart Armstrong on the AI Alignment Forum.
I'm now in a position where I can see a possible route to a safe/survivable/friendly Artificial Intelligence being developed. I'd give a 10+% chance of it being possible this way, and a 95% chance that some of these ideas will be very useful for other methods of alignment. So I thought I'd encode the route I'm seeing as a research agenda; this is the first public draft of it.
Clarity, rigour, and practicality: that's what this agenda needs. Writing this agenda has clarified a lot of points for me, to the extent that some of it now seems, in retrospect, just obvious and somewhat trivial - "of course that's the way you have to do X". But more clarification is needed in the areas that remain vague. And, once these are clarified enough for humans to understand, they need to be made mathematically and logically rigorous - and ultimately, cashed out into code, and tested and experimented with.
So I'd appreciate any comments that could help with these three goals, and welcome anyone interested in pursuing research along these lines over the long term.
Note: I periodically edit this document, to link it to more recent research ideas/discoveries.
0 The fundamental idea
This agenda fits itself into the broad family of Inverse Reinforcement Learning: delegating most of the task of inferring human preferences to the AI itself. Only most of the task, since it's been shown that humans need to build the right assumptions into the AI, or else the preference learning will fail.
To get these "right assumptions", this agenda will look into what preferences actually are, and how they may be combined together. There are hence four parts to the research agenda:
1. A way of identifying the (partial[1]) preferences of a given human H.
2. A way for ultimately synthesising a utility function U_H that is an adequate encoding of the partial preferences of a human H.
3. Practical methods for estimating this U_H, and how one could use the definition of U_H to improve other suggested methods for value-alignment.
4. Limitations and lacunas of the agenda: what is not covered. These may be avenues of future research, or issues that cannot fit into the U_H paradigm.
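As a purely illustrative sketch of the second item above (synthesising a utility function U_H from partial preferences), one can picture each partial preference as a weighted score that is defined on only some outcomes, with U_H as their weighted sum. This is not the agenda's actual construction, which this section does not specify. The names PartialPreference and synthesise_utility, the example weights, and the choice of a plain weighted sum are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Optional, Sequence

Outcome = Hashable  # hypothetical stand-in for a world or outcome


@dataclass(frozen=True)
class PartialPreference:
    """Hypothetical partial preference: scores only the outcomes it covers."""
    name: str
    weight: float
    # Returns None when this preference has no opinion on the outcome.
    score: Callable[[Outcome], Optional[float]]


def synthesise_utility(
    prefs: Sequence[PartialPreference],
) -> Callable[[Outcome], float]:
    """Combine partial preferences into one utility function.

    Assumption: a weighted sum over the preferences that apply.
    Outcomes that no preference covers get utility 0.0.
    """
    def utility(outcome: Outcome) -> float:
        total = 0.0
        for pref in prefs:
            value = pref.score(outcome)
            if value is not None:
                total += pref.weight * value
        return total

    return utility


# Toy example: two partial preferences over three outcomes.
likes_tea = PartialPreference(
    name="likes tea",
    weight=2.0,
    score={"tea": 1.0, "coffee": 0.0}.get,  # silent on "water"
)
avoids_coffee = PartialPreference(
    name="avoids coffee",
    weight=1.0,
    score={"coffee": -1.0}.get,  # silent on everything else
)

u_h = synthesise_utility([likes_tea, avoids_coffee])

for outcome in ("tea", "coffee", "water"):
    print(outcome, u_h(outcome))
```

Even in this toy form, the design choices are visible: how to weight each partial preference, and what to do with outcomes that no preference covers, are decisions the synthesis process has to make, and the resulting utility function depends on them.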
There has been a myriad of small posts on this topic, and most will be referenced here. Most of these posts are stubs that hint at a solution, rather than spelling it out fully and rigorously.
The reason for that is to check for impossibility results ahead of time. The construction of U_H is deliberately designed to be adequate, rather than elegant (indeed, the search for an elegant U_H might be counterproductive and even dangerous, if genuine human preferences get sacrificed for elegance). If this approach is to work, then the safety of U_H has to be robust to different decisions in the synthesis process (see Section 2.8, on avoiding disasters). Thus, initially, it seems more important to find approximate ideas that cover all possibilities, rather than having a few fully detailed sub-possibilities and several gaps.
Finally, it seems that if a sub-problem is not formally solved, we stand a much better chance of getting a good result from "hit it with lots of machine learning and hope for the best", than we would if there were huge conceptual holes in the method - a conceptual hole meaning that the relevant solution is broken in an unfixable way. Thus, I'm publishing this agenda now, where I see many implementation holes, but no large conceptual holes.
A word of warning here, though: with some justification, the original Dartmouth AI conference could also have claimed to be confident that there were no large conceptual holes in their plan of developing AI over a summer - and we know how wrong they turned out to be. With tha...