Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Contrite Strategies and The Need For Standards, published by sarahconstantin on the AI Alignment Forum.
Epistemic Status: Confident
There’s a really interesting paper from 1996 called The Logic of Contrition, which I’ll summarize here. In it, the authors identify a strategy called “Contrite Tit For Tat”, which does better than either Pavlov or Generous Tit For Tat in Iterated Prisoner’s Dilemma.
In Contrite Tit For Tat, the player doesn’t only look at what he and the other player played on the last round, but also at another variable, the standing of the players, which can be good or bad.
If Bob defected on Alice last round but Alice was in good standing, then Bob’s standing switches to bad, and Alice defects against Bob.
If Bob defected on Alice last round but Alice was in bad standing, then Bob’s standing stays good, and Alice cooperates with Bob.
If Bob cooperated with Alice last round, Bob keeps his good standing, and Alice cooperates.
This allows two Contrite Tit For Tat players to recover quickly from accidental defections without defecting against each other forever:
D/C -> C/D -> C/C
But, unlike Pavlov, it consistently resists the “always defect” strategy:
D/C -> D/D -> D/D -> D/D.
Like TFT (Tit For Tat) and unlike Pavlov and gTFT (Generous Tit For Tat), cTFT (Contrite Tit For Tat) can invade a population of all Defectors.
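The standing rules above can be sketched as a small simulation. This is a minimal sketch, not the paper's own formulation: the function names (`ctft_move`, `update_standing`, `play`) and the `forced_a` parameter used to inject an accidental move are hypothetical, both players are assumed to start in good standing, and cooperating is assumed to restore good standing (which the recovery sequence D/C -> C/D -> C/C implies).

```python
GOOD, BAD = "good", "bad"

def ctft_move(my_standing, their_standing):
    """cTFT defects only when it is in good standing and the opponent is not."""
    if my_standing == GOOD and their_standing == BAD:
        return "D"
    return "C"

def always_defect(my_standing, their_standing):
    """Defects regardless of anyone's standing."""
    return "D"

def update_standing(move, my_standing, their_standing):
    """Standing after a round, judged from the standings held before it."""
    if move == "C":
        return GOOD  # assumption: cooperating keeps or restores good standing
    if my_standing == GOOD and their_standing == BAD:
        return GOOD  # justified defection against a player in bad standing
    return BAD       # any other defection costs the player good standing

def play(strategy_a, strategy_b, rounds, forced_a=None):
    """Play `rounds` rounds; `forced_a` maps a round index to a move that
    overrides player A's choice, modelling an accidental defection."""
    forced_a = forced_a or {}
    stand_a, stand_b = GOOD, GOOD  # assumption: everyone starts in good standing
    history = []
    for r in range(rounds):
        move_a = forced_a.get(r, strategy_a(stand_a, stand_b))
        move_b = strategy_b(stand_b, stand_a)
        history.append(f"{move_a}/{move_b}")
        # Both updates use the standings from before this round.
        stand_a, stand_b = (
            update_standing(move_a, stand_a, stand_b),
            update_standing(move_b, stand_b, stand_a),
        )
    return " -> ".join(history)

# Two cTFT players, where the first defects by accident in the first round.
print(play(ctft_move, ctft_move, 3, forced_a={0: "D"}))  # D/C -> C/D -> C/C
# An unconditional defector against cTFT.
print(play(always_defect, ctft_move, 4))  # D/C -> D/D -> D/D -> D/D
```

Under these assumptions, the two printed sequences reproduce the recovery pattern and the resistance to “always defect” described above.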
A related contrite strategy is Remorse. Remorse cooperates only if it is in bad standing, or if both players cooperated in the previous round. In other words, Remorse is more aggressive; unlike cTFT, it can attack cooperators.
Against the strategy “always cooperate”, cTFT always cooperates but Remorse alternates cooperating and defecting:
C/C -> C/D -> C/C -> C/D.
And Remorse defends effectively against defectors:
D/C -> D/D -> D/D -> D/D.
But if one Remorse accidentally defects against another, recovery is more difficult:
C/D -> D/C -> D/D -> C/D -> ...
If the Prisoner’s Dilemma is repeated a large but finite number of times, cTFT is an evolutionarily stable state in the sense that, when playing against a cTFT player, you can’t do better for yourself by deviating from what cTFT would recommend. This implies that no other strategy can successfully invade a population made up entirely of cTFT players.
Remorse can sometimes be invaded by strategies better at cooperating with themselves, while Pavlov can sometimes be invaded by Defectors, depending on the payoff matrix; but for all Prisoner’s Dilemma payoff matrices, cTFT resists invasion.
Defector and a similar strategy called Grim Trigger (if a player ever defects on you, keep defecting forever) are evolutionarily stable, but not good outcomes — they result in much lower scores for everyone in the population than TFT or its variants. By contrast, a whole population that adopts cTFT, gTFT, Pavlov, or Remorse on average gets the payoff from cooperating each round.
The bottom line is, adding “contrition” to TFT makes it quite a bit better, and allows it to keep pace with Pavlov in exploiting TFT’s, while doing better than Pavlov at exploiting Defectors.
This is no longer true if we add noise in the perception of good or bad standing; contrite strategies, like TFT, can get stuck defecting against each other if they erroneously perceive bad standing.
The moral of the story is that there’s a game-theoretic advantage to not only having reciprocity (TFT) but standards (cTFT), and in fact reciprocity alone is not enough to outperform strategies like Pavlov which don’t map well to human moral maxims.
What do I mean by standards?
There’s a difference between saying “Behavior X is better than behavior Y” and saying “Behavior Y is unacceptable.”
The concept of “unacceptable” behavior functions like the concept of “standing” in the game theory paper. If I do something “unacceptable” and you respond in some ...