Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: The two-layer model of human values, and problems with synthesizing preferences, published by Kaj Sotala on the AI Alignment Forum.
I have been thinking about Stuart Armstrong's preference synthesis research agenda, and have long had the feeling that there's something off about the way that it is currently framed. In the post I try to describe why. I start by describing my current model of human values, how I interpret Stuart's implicit assumptions to conflict with it, and then talk about my confusion with regard to reconciling the two views.
The two-layer/ULM model of human values
In Player vs. Character: A Two-Level Model of Ethics, Sarah Constantin describes a model where the mind is divided, in game terms, into a "player" and a "character". The character is everything that we consciously experience, but our conscious experiences are not our true reasons for acting. As Sarah puts it:
In many games, such as Magic: The Gathering, Hearthstone, or Dungeons and Dragons, there’s a two-phase process. First, the player constructs a deck or character from a very large sample space of possibilities. This is a particular combination of strengths and weaknesses and capabilities for action, which the player thinks can be successful against other decks/characters or at winning in the game universe. The choice of deck or character often determines the strategies that deck or character can use in the second phase, which is actual gameplay. In gameplay, the character (or deck) can only use the affordances that it’s been previously set up with. This means that there are two separate places where a player needs to get things right: first, in designing a strong character/deck, and second, in executing the optimal strategies for that character/deck during gameplay. [...]
The idea is that human behavior works very much like a two-level game. [...] The player determines what we find rewarding or unrewarding. The player determines what we notice and what we overlook; things come to our attention if it suits the player’s strategy, and not otherwise. The player gives us emotions when it’s strategic to do so. The player sets up our subconscious evaluations of what is good for us and bad for us, which we experience as “liking” or “disliking.”
The character is what executing the player’s strategies feels like from the inside. If the player has decided that a task is unimportant, the character will experience “forgetting” to do it. If the player has decided that alliance with someone will be in our interests, the character will experience “liking” that person. Sometimes the player will notice and seize opportunities in a very strategic way that feels to the character like “being lucky” or “being in the right place at the right time.”
This is where confusion often sets in. People will often protest “but I did care about that thing, I just forgot” or “but I’m not that Machiavellian, I’m just doing what comes naturally.” This is true, because when we talk about ourselves and our experiences, we’re speaking “in character”, as our character. The strategy is not going on at a conscious level. In fact, I don’t believe we (characters) have direct access to the player; we can only infer what it’s doing, based on what patterns of behavior (or thought or emotion or perception) we observe in ourselves and others.
I think that this model is basically correct, and that our emotional responses, preferences, etc. are all the result of a deeper-level optimization process. This optimization process, then, is something like that described in The Brain as a Universal Learning Machine:
The universal learning hypothesis proposes that all significant mental algorithms are learned; nothing is innate except for the learning and reward machinery itself (which is somewhat complicated, involving a number of systems and m...
view more