Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Learning Normativity: A Research Agenda, published by Abram Demski on the AI Alignment Forum.
(Related to Inaccessible Information, Learning the Prior, and Better Priors as a Safety Problem. Builds on several of my alternate alignment ideas.)
I want to talk about something which I'll call learning normativity. What is normativity? Roughly, normativity is correct behavior; I mean something related to the fuzzy concept humans convey with the word "should". I think it has several interesting features:
Norms are the result of a complex negotiation between humans, so they shouldn't necessarily be thought of as the result of maximizing some set of values. This distinguishes learning normativity from value learning.
A lot of information about norms is present in the empirical distribution of what people actually do, but you can't learn norms just by learning human behavior. This distinguishes it from imitation learning.
It's often possible to provide a lot of information in the form of "good/bad" feedback. This feedback should be interpreted in the spirit of approval-directed learning rather than as an RL reward signal. However, approval should not be treated as a gold standard.
Similarly, it's often possible to provide a lot of information in the form of rules, but rules are not necessarily 100% true; they are just very likely to apply in typical cases.
In general, it's possible to get very rich types of feedback, but only sparsely: humans get all sorts of feedback, including not only instruction on how to act, but also on how to think.
Any one piece of feedback is suspect. Teachers can make mistakes, instructions can be wrong, demonstrations can be imperfect, dictionaries can contain spelling errors, reward signals can be corrupt, and so on.
Example: Language Learning
A major motivating example for me is how language learning works in humans. There is clearly, to some degree, a "right way" and a "wrong way" to use a language. I'll call this correct usage.
One notable feature of language learning is that we don't always speak, or write, in correct usage. This means that a child learning language has to distinguish between mistakes (such as typos) and correct usage. (Humans do sometimes learn to imitate mistakes, but we have a notion of not doing so. This is unlike GPT systems learning to imitate the empirical distribution of human text.)
This means we're largely doing something like unsupervised learning, but with a notion of "correct"/"incorrect" data: roughly, we throw data out when it's likely to be incorrect.
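To make that "throwing out suspect data" picture concrete, here is a minimal sketch in Python. The correctness_score function, the tiny lexicon, and the 0.5 threshold are hypothetical stand-ins (not from the original post) for whatever evidence a learner uses to judge whether an observation reflects correct usage.

```python
# Minimal sketch: unsupervised learning with a notion of "correct"/"incorrect" data.
# `correctness_score` and the threshold are hypothetical placeholders; any estimate
# of how likely an observation is to be correct usage could play this role.

def correctness_score(sentence: str) -> float:
    """Placeholder estimate of how likely this sentence is to be correct usage."""
    # A real learner might use agreement across speakers, self-consistency, or
    # explicit corrections as evidence. Here we just check words against a tiny lexicon.
    known = {"the", "cat", "sat", "on", "mat", "a", "dog", "ran"}
    words = [w.lower().strip(".,") for w in sentence.split()]
    return sum(w in known for w in words) / len(words) if words else 0.0

def filter_corpus(corpus: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only data judged likely to be correct, rather than imitating everything."""
    return [s for s in corpus if correctness_score(s) >= threshold]

corpus = ["The cat sat on the mat.", "Teh cta sta no teh mta."]
training_data = filter_corpus(corpus)  # keeps only the first sentence
```

The particular scoring rule doesn't matter; the point is that the learner fits the filtered data (what is likely correct), not the raw empirical distribution of what people actually produce.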
A related point is that we are better at recognizing correct usage than we are at generating it. If we say something wrong, we can usually notice and correct it. In some sense, this means there's a foothold for intelligence amplification: we know how to generate our own training gradient.
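As a hypothetical illustration of that foothold (a sketch, not anything from the post): a recognizer that is more reliable than the generator can filter the generator's own outputs, and the accepted outputs become the next round's training data. The names below (generate_candidates, recognize_correct) are made up for the example.

```python
import random

def generate_candidates(n: int) -> list[str]:
    """Stand-in generator: produces a mix of correct and incorrect usage."""
    pool = ["the cat sat on the mat", "cat the mat on sat", "a dog ran"]
    return [random.choice(pool) for _ in range(n)]

def recognize_correct(sentence: str) -> bool:
    """Stand-in recognizer: assumed better at judging usage than the generator is at producing it."""
    return sentence != "cat the mat on sat"

def self_training_round(n: int = 10) -> list[str]:
    """Keep only the generations the recognizer accepts; train on those next round."""
    return [s for s in generate_candidates(n) if recognize_correct(s)]

new_training_data = self_training_round()
```

Because recognition outruns generation, each round of accepted outputs can pull the generator slightly closer to correct usage, which is the sense in which we can supply our own training signal.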
Another fascinating feature of language is that although native speakers are pretty good at both recognizing and generating correct usage, we don't know the rules explicitly. The whole field of linguistics is largely about trying to uncover the rules of grammar.
So it's impossible for us to teach proper English by teaching the rules. Yet, we do know some of the rules. Or, more accurately, we know a set of rules that usually apply. And those rules are somewhat useful for teaching English. (Although children have usually reached fluency before the point where they're taught explicit English grammar.)
All of these things point toward what I mean by learning normativity:
We can tell a lot about what's normative by simply observing what's common, but the two are not exactly the same thing.
A (qualified) human can usually label an example as correct or incorrect, but this is not perfect either.
We can articulate a lot about correct vs incorrect in the form of rules; but the rules which we can articulate never seem...