Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Thoughts on Human Models, published by Ramana Kumar, Scott Garrabrant on the AI Alignment Forum.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Human values and preferences are hard to specify, especially in complex domains. Accordingly, much AGI safety research has focused on approaches to AGI design that refer to human values and preferences indirectly, by learning a model that is grounded in expressions of human values (via stated preferences, observed behaviour, approval, etc.) and/or real-world processes that generate expressions of those values. There are additionally approaches aimed at modelling or imitating other aspects of human cognition or behaviour without an explicit aim of capturing human preferences (but usually in service of ultimately satisfying them). Let us refer to all these models as human models.
In this post, we discuss several reasons to be cautious about AGI designs that use human models. We suggest that the AGI safety research community put more effort into developing approaches that work well in the absence of human models, alongside the approaches that rely on human models. This would be a significant addition to the current safety research landscape, especially if we focus on working out and trying concrete approaches as opposed to developing theory. We also acknowledge various reasons why avoiding human models seems difficult.
Problems with Human Models
To be clear about human models, we draw a rough distinction between our actual preferences (which may not be fully accessible to us) and procedures for evaluating our preferences. The first thing, actual preferences, is what humans actually want upon reflection. Satisfying our actual preferences is a win. The second thing, procedures for evaluating preferences, refers to various proxies for our actual preferences such as our approval, or what looks good to us (with necessarily limited information or time for thinking). Human models are in the second category; consider, as an example, a highly accurate ML model of human yes/no approval on the set of descriptions of outcomes. Our first concern, described below, is about overfitting to human approval and thereby breaking its connection to our actual preferences. (This is a case of Goodhart’s law.)
Less Independent Audits
Imagine we have built an AGI system and we want to use it to design the mass transit system for a new city. The safety problems associated with such a project are well recognised; suppose we are not completely sure we have solved them, but are confident enough to try anyway. We run the system in a sandbox on some fake city input data and examine its outputs. Then we run it on some more outlandish fake city data to assess robustness to distributional shift. The AGI’s outputs look like reasonable transit system designs and considerations, and include arguments, metrics, and other supporting evidence that they are good. Should we be satisfied and ready to run the system on the real city’s data, and to implement the resulting proposed design?
We suggest that an important factor in the answer to this question is whether the AGI system was built using human modelling or not. If it produced a solution to the transit design problem (that humans approve of) without human modelling, then we would more readily trust its outputs. If it produced a solution we approve of with human modelling, then although we expect the outputs to be in many ways about good transit system design (our actual preferences) and in many ways suited to being approved by humans, to the extent that these two targets come apart we must worry about having overfit to the human model at the expense of the good design. (Why not the other way around? Because our assessment of the sandboxed results uses human judgement, not an in...
view more