
Jul 4, 2012

Prisoner's Dilemma

I stumbled across a concept called the Prisoner's Dilemma, which I'm finding extremely interesting and wanted to share.


The prisoner's dilemma is a canonical example of a game analyzed in game theory that shows why two individuals might not cooperate, even if it appears that it is in their best interest to do so. It was originally framed by Merrill Flood and Melvin Dresher working at RAND in 1950. Albert W. Tucker formalized the game with prison sentence payoffs and gave it the "prisoner's dilemma" name (Poundstone, 1992). A classic example of the prisoner's dilemma (PD) is presented as follows:

Two men are arrested, but the police do not possess enough information for a conviction. After separating the two men, the police offer both the same deal: if one testifies against his partner (defects/betrays) and the other remains silent (cooperates with his partner), the betrayer goes free and the one who remains silent receives the full one-year sentence. If both remain silent, both are sentenced to only one month in jail on a minor charge. If each 'rats out' the other, each receives a three-month sentence. Each prisoner must choose either to betray or to remain silent, and neither learns the other's decision before making his own. What should they do?

If it is supposed here that each player is concerned only with lessening his own time in jail, the game becomes a non-zero-sum game in which the two players may either assist or betray each other. Each prisoner's sole concern, then, is minimizing his own sentence. The interesting symmetry of the problem is that the logically optimal decision leads each to betray the other, even though each prisoner's individual 'prize' would be greater if both cooperated.

In the regular version of this game, cooperation is strictly dominated by betrayal, and as a result, the only equilibrium outcome of the game is for both prisoners to betray each other. Regardless of what the other prisoner chooses, a prisoner always gains a greater payoff by betraying. Because betrayal always pays better than cooperation, all purely rational, self-interested prisoners would betray each other.

In the iterated version of the game, play is repeated over and over, and consequently both prisoners continuously have an opportunity to penalize the other for a previous decision. If the number of times the game will be played is known to both players, then by backward induction the two prisoners will betray each other in every round.

In casual usage, the label "prisoner's dilemma" may be applied to situations that do not strictly match the formal criteria of the classic or iterated games: for instance, situations in which two entities could gain important benefits from cooperating or suffer from failing to do so, but find it merely difficult or expensive, not necessarily impossible, to coordinate their activities and achieve cooperation.

Strategy for the classic prisoners' dilemma

The normal game is shown below:

                                     | Prisoner B stays silent (cooperates) | Prisoner B betrays (defects)
Prisoner A stays silent (cooperates) | Each serves 1 month                  | A: 1 year; B: goes free
Prisoner A betrays (defects)         | A: goes free; B: 1 year              | Each serves 3 months

Here, regardless of what the other decides, each prisoner gets a higher pay-off by betraying the other. For example, Prisoner A can see from the payoffs above that no matter what Prisoner B chooses, he is better off 'ratting him out' (defecting) than staying silent (cooperating). As a result, Prisoner A should logically betray him. The game is symmetric, so Prisoner B should reason the same way. Since both rationally decide to defect, each receives a lower reward than if both had stayed quiet. Traditional game theory thus leaves both players worse off than if each had chosen to lessen the sentence of his accomplice at the cost of spending more time in jail himself.
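
To make the dominance argument concrete, here is a minimal Python sketch (my own illustration, not from any particular source) that encodes the matrix above in months of jail, counting the year as 12 months, and checks that Prisoner A spends less time in jail by betraying no matter what Prisoner B does:

    # Months of jail for (A's choice, B's choice) -> (A's months, B's months).
    # "silent" = cooperate with your partner, "betray" = defect.
    JAIL = {
        ("silent", "silent"): (1, 1),    # each serves 1 month
        ("silent", "betray"): (12, 0),   # A serves 1 year, B goes free
        ("betray", "silent"): (0, 12),   # A goes free, B serves 1 year
        ("betray", "betray"): (3, 3),    # each serves 3 months
    }

    # Whatever B does, A serves less time by betraying, so betrayal dominates.
    for b_choice in ("silent", "betray"):
        a_if_silent = JAIL[("silent", b_choice)][0]
        a_if_betray = JAIL[("betray", b_choice)][0]
        print(f"B plays {b_choice}: A silent -> {a_if_silent} months, A betrays -> {a_if_betray} months")
        assert a_if_betray < a_if_silent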

The iterated prisoners' dilemma

If two players play the prisoners' dilemma more than once in succession, remember their opponent's previous actions, and adjust their strategy accordingly, the game is called the iterated prisoners' dilemma.
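
As a concrete sketch of what "iterated" means here, the Python loop below (my own illustration) plays two strategy functions against each other for a fixed number of rounds, handing each one both move histories. It scores with the conventional Axelrod point values (3 each for mutual cooperation, 1 each for mutual defection, 5 and 0 when only one defects); those numbers are an assumption on my part and not the jail terms above, but the structure is the same.

    # Points for (my move, opponent's move); 'C' = cooperate, 'D' = defect.
    # Conventional tournament payoffs, assumed here purely for scoring.
    POINTS = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

    def play_iterated(strategy_a, strategy_b, rounds=200):
        """Play two strategies against each other and return their total scores."""
        history_a, history_b = [], []
        score_a = score_b = 0
        for _ in range(rounds):
            move_a = strategy_a(history_a, history_b)  # each sees its own and the opponent's past moves
            move_b = strategy_b(history_b, history_a)
            history_a.append(move_a)
            history_b.append(move_b)
            score_a += POINTS[(move_a, move_b)]
            score_b += POINTS[(move_b, move_a)]
        return score_a, score_b

    # Two trivial strategies, just to exercise the loop.
    always_cooperate = lambda own, opp: "C"
    always_defect = lambda own, opp: "D"
    print(play_iterated(always_cooperate, always_defect))  # (0, 1000) over 200 rounds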

The iterated prisoners' dilemma game is fundamental to certain theories of human cooperation and trust. On the assumption that the game can model transactions between two people requiring trust, cooperative behaviour in populations may be modeled by a multi-player, iterated, version of the game. It has, consequently, fascinated many scholars over the years. In 1975, Grofman and Pool estimated the count of scholarly articles devoted to it at over 2,000. The iterated prisoners' dilemma has also been referred to as the "Peace-War game".

If the game is played exactly N times and both players know this, then it is always game theoretically optimal to defect in all rounds. The only possible Nash equilibrium is to always defect. The proof is inductive: one might as well defect on the last turn, since the opponent will not have a chance to punish the player. Therefore, both will defect on the last turn. Thus, the player might as well defect on the second-to-last turn, since the opponent will defect on the last no matter what is done, and so on. The same applies if the game length is unknown but has a known upper limit.
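
The backward-induction step can be sketched in code: the one-shot game has a single pair of mutual best responses (both betray), and once the last round is pinned to that outcome, the second-to-last round is effectively a one-shot game too, and so on back to the first round. A rough Python illustration of that reasoning (my own sketch), reusing the jail-time payoffs from the matrix above:

    # Months of jail for (A's move, B's move); lower is better for the one serving them.
    JAIL = {("silent", "silent"): (1, 1), ("silent", "betray"): (12, 0),
            ("betray", "silent"): (0, 12), ("betray", "betray"): (3, 3)}

    def stage_equilibrium():
        """Return the move pair where neither player can reduce his own jail time
        by switching unilaterally (the one-shot Nash equilibrium)."""
        moves = ("silent", "betray")
        for a in moves:
            for b in moves:
                a_best = all(JAIL[(a, b)][0] <= JAIL[(alt, b)][0] for alt in moves)
                b_best = all(JAIL[(a, b)][1] <= JAIL[(a, alt)][1] for alt in moves)
                if a_best and b_best:
                    return (a, b)

    N = 10
    # The last round cannot be punished afterwards, so it must be the stage
    # equilibrium; given that, round N-1 is effectively one-shot too, and so on.
    plan = [stage_equilibrium() for _ in range(N)]
    print(plan)  # ('betray', 'betray') in every one of the N rounds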

Unlike the standard prisoners' dilemma, in the iterated prisoners' dilemma the defection strategy is counter-intuitive and fails badly to predict the behavior of human players. Within standard economic theory, though, this is the only correct answer. The superrational strategy in the iterated prisoners' dilemma with fixed N is to cooperate against a superrational opponent, and in the limit of large N, experimental results on strategies agree with the superrational version, not the game-theoretic rational one.

For cooperation to emerge between game-theoretically rational players, the total number of rounds N must be random, or at least unknown to the players. In this case "always defect" may no longer be a strictly dominant strategy, only a Nash equilibrium. Among the results shown by Robert Aumann in a 1959 paper, rational players repeatedly interacting in indefinitely long games can sustain the cooperative outcome.

Strategy for the iterated prisoners' dilemma

Interest in the iterated prisoners' dilemma (IPD) was kindled by Robert Axelrod in his book The Evolution of Cooperation (1984). In it he reports on a tournament he organized around the N-step prisoners' dilemma (with N fixed), in which participants choose their strategy again and again and have memory of their previous encounters. Axelrod invited academic colleagues from all over the world to devise computer strategies to compete in an IPD tournament. The programs that were entered varied widely in algorithmic complexity, initial hostility, capacity for forgiveness, and so forth.

Axelrod discovered that when these encounters were repeated over a long period of time with many players, each with different strategies, greedy strategies tended to do very poorly in the long run while more altruistic strategies did better, as judged purely by self-interest. He used this to show a possible mechanism for the evolution of altruistic behaviour from mechanisms that are initially purely selfish, by natural selection.

The best deterministic strategy was found to be tit for tat, which Anatol Rapoport developed and entered into the tournament. It was the simplest of any program entered, containing only four lines of BASIC, and won the contest. The strategy is simply to cooperate on the first iteration of the game; after that, the player does what his or her opponent did on the previous move. Depending on the situation, a slightly better strategy can be "tit for tat with forgiveness." When the opponent defects, on the next move, the player sometimes cooperates anyway, with a small probability (around 1–5%). This allows for occasional recovery from getting trapped in a cycle of defections. The exact probability depends on the line-up of opponents.
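
Both strategies are easy to write down. The sketch below is my own Python rendition (rather than the original four lines of BASIC), with the forgiveness probability as a parameter in the 1–5% range mentioned above; it uses the same 'C'/'D' move symbols and history arguments as the loop sketched earlier.

    import random

    def tit_for_tat(own_history, opp_history):
        """Cooperate on the first move, then copy the opponent's previous move."""
        if not opp_history:
            return "C"
        return opp_history[-1]

    def tit_for_tat_with_forgiveness(own_history, opp_history, forgive_prob=0.03):
        """Like tit for tat, but after an opponent defection, cooperate anyway with a
        small probability, to break out of cycles of mutual retaliation."""
        if not opp_history:
            return "C"
        if opp_history[-1] == "D" and random.random() < forgive_prob:
            return "C"
        return opp_history[-1]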

By analysing the top-scoring strategies, Axelrod stated several conditions necessary for a strategy to be successful.

Nice
The most important condition is that the strategy must be "nice", that is, it will not defect before its opponent does (this is sometimes referred to as an "optimistic" algorithm). Almost all of the top-scoring strategies were nice; so even a purely selfish strategy should not "cheat" on its opponent first, for purely self-interested reasons.
Retaliating
However, Axelrod contended, the successful strategy must not be a blind optimist. It must sometimes retaliate. An example of a non-retaliating strategy is Always Cooperate. This is a very bad choice, as "nasty" strategies will ruthlessly exploit such players.
Forgiving
Successful strategies must also be forgiving. Though players will retaliate, they will once again fall back to cooperating if the opponent does not continue to defect. This stops long runs of revenge and counter-revenge, maximizing points.
Non-envious
The last quality is being non-envious, that is not striving to score more than the opponent (note that a "nice" strategy can never score more than the opponent).

The optimal (points-maximizing) strategy for the one-time PD game is simply defection; as explained above, this is true whatever the composition of opponents may be. However, in the iterated-PD game the optimal strategy depends upon the strategies of likely opponents, and how they will react to defections and cooperations. For example, consider a population where everyone defects every time, except for a single individual following the tit for tat strategy. That individual is at a slight disadvantage because of the loss on the first turn. In such a population, the optimal strategy for that individual is to defect every time. In a population with a certain percentage of always-defectors and the rest being tit for tat players, the optimal strategy for an individual depends on the percentage, and on the length of the game.
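
A back-of-the-envelope calculation (my own, again using the conventional 3/1/5/0 point payoffs rather than jail terms) shows that dependence. Against another tit for tat player, a tit for tat player cooperates for all L rounds; against an always-defector it is exploited once and then both defect; an always-defector exploits tit for tat once and then settles into mutual defection:

    def expected_scores(p, L):
        """Expected per-match score of tit for tat and of always defect in a
        population with fraction p of always-defectors, over L rounds."""
        tft = (1 - p) * (3 * L) + p * (0 + 1 * (L - 1))
        alld = (1 - p) * (5 + 1 * (L - 1)) + p * (1 * L)
        return tft, alld

    for p, L in ((0.5, 100), (0.9, 3)):
        tft, alld = expected_scores(p, L)
        print(f"p={p}, L={L}: tit for tat {tft:.1f}, always defect {alld:.1f}")
    # With long matches and a fair share of tit for tat players, tit for tat wins;
    # with short matches or a population that is almost entirely defectors,
    # always defect comes out ahead.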

A strategy called Pavlov (an example of Win-Stay, Lose-Switch) cooperates at the first iteration and whenever the player and co-player did the same thing at the previous iteration; Pavlov defects when the player and co-player did different things at the previous iteration. For a certain range of parameters, Pavlov beats all other strategies by giving preferential treatment to co-players which resemble Pavlov.
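
In the same style as the strategies above, a Pavlov sketch (assuming each strategy function sees both histories) could look like this:

    def pavlov(own_history, opp_history):
        """Win-Stay, Lose-Switch: cooperate on the first move; afterwards cooperate
        exactly when both players made the same move last round, otherwise defect."""
        if not own_history:
            return "C"
        return "C" if own_history[-1] == opp_history[-1] else "D"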

Deriving the optimal strategy is generally done in two ways:

Bayesian Nash Equilibrium: If the statistical distribution of opposing strategies can be determined (e.g. 50% tit for tat, 50% always cooperate) an optimal counter-strategy can be derived analytically.

Monte Carlo simulations of populations have been made, where individuals with low scores die off, and those with high scores reproduce (a genetic algorithm for finding an optimal strategy). The mix of algorithms in the final population generally depends on the mix in the initial population. The introduction of mutation (random variation during reproduction) lessens the dependency on the initial population; empirical experiments with such systems tend to produce tit for tat players (see for instance Chess 1988), but there is no analytic proof that this will always occur.
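
A toy version of such a simulation is sketched below (my own illustration, with the conventional 3/1/5/0 point payoffs and a small fixed menu of strategies rather than freely mutating genomes): each generation, every individual plays everyone in the population, and the next generation is drawn in proportion to those scores, with occasional random "mutation".

    import random

    POINTS = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

    def tit_for_tat(own, opp):
        return "C" if not opp else opp[-1]

    def always_cooperate(own, opp):
        return "C"

    def always_defect(own, opp):
        return "D"

    STRATEGIES = [tit_for_tat, always_cooperate, always_defect]

    def match_score(s1, s2, rounds=30):
        """Score earned by s1 in an iterated match against s2."""
        h1, h2, score = [], [], 0
        for _ in range(rounds):
            m1, m2 = s1(h1, h2), s2(h2, h1)
            h1.append(m1)
            h2.append(m2)
            score += POINTS[(m1, m2)]
        return score

    def evolve(generations=30, pop_size=30, mutation_rate=0.01):
        population = [random.choice(STRATEGIES) for _ in range(pop_size)]
        for _ in range(generations):
            # Fitness = total score against the whole population.
            fitness = [sum(match_score(s, t) for t in population) for s in population]
            # Reproduce in proportion to fitness, with occasional random mutation.
            population = [
                random.choice(STRATEGIES) if random.random() < mutation_rate
                else random.choices(population, weights=fitness, k=1)[0]
                for _ in range(pop_size)
            ]
        return population

    final = evolve()
    for s in STRATEGIES:
        print(s.__name__, final.count(s))  # tit_for_tat usually ends up most common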

Although tit for tat is considered to be the most robust basic strategy, a team from Southampton University in England (led by Professor Nicholas Jennings and consisting of Rajdeep Dash, Sarvapali Ramchurn, Alex Rogers, and Perukrishnen Vytelingum) introduced a new strategy at the 20th-anniversary iterated prisoners' dilemma competition, which proved to be more successful than tit for tat. This strategy relied on cooperation between programs to achieve the highest number of points for a single program. The university submitted 60 programs to the competition, designed to recognize each other through a series of five to ten moves at the start. Once this recognition was made, one program would always cooperate and the other would always defect, assuring the maximum number of points for the defector. If a program realized that it was playing a non-Southampton player, it would continuously defect in an attempt to minimize the score of the competing program. As a result, this strategy ended up taking the top three positions in the competition, as well as a number of positions towards the bottom.
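
The actual Southampton code wasn't published as far as I know, but the recognition trick can be sketched along these lines (a hypothetical illustration, using the same 'C'/'D' conventions as above): each colluding program opens with a fixed move signature; if the opponent's opening matches, the pair splits into a "star" that always defects and a "feeder" that always cooperates, and otherwise the program simply defects to drag the outsider's score down. (A real implementation would also have to worry about an outsider playing the signature by coincidence.)

    # Hypothetical sketch of the recognition idea, not the actual Southampton code.
    SIGNATURE = ["C", "D", "D", "C", "D"]  # arbitrary five-move handshake

    def make_colluder(role):
        """role is 'star' (gets fed points) or 'feeder' (sacrifices itself)."""
        def strategy(own_history, opp_history):
            turn = len(own_history)
            if turn < len(SIGNATURE):
                return SIGNATURE[turn]  # still performing the handshake
            if opp_history[:len(SIGNATURE)] == SIGNATURE:
                return "D" if role == "star" else "C"  # teammate: feed the star
            return "D"  # outsider: defect to minimize its score
        return strategy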

This strategy takes advantage of the fact that multiple entries were allowed in this particular competition, and that the performance of a team was measured by that of the highest-scoring player (meaning that the use of self-sacrificing players was a form of minmaxing). In a competition where one has control of only a single player, tit for tat is certainly a better strategy. Because of this new rule, this competition also has little theoretical significance when analysing single agent strategies as compared to Axelrod's seminal tournament. However, it provided the framework for analysing how to achieve cooperative strategies in multi-agent frameworks, especially in the presence of noise. In fact, long before this new-rules tournament was played, Richard Dawkins in his book The Selfish Gene pointed out the possibility of such strategies winning if multiple entries were allowed, but remarked that most probably Axelrod would not have allowed them if they had been submitted. It also relies on circumventing rules about the prisoners' dilemma in that there is no communication allowed between the two players. When the Southampton programs engage in an opening "ten move dance" to recognize one another, this only reinforces just how valuable communication can be in shifting the balance of the game.

Summary


The “dilemma” faced is that, whatever the other does, each is better off confessing than remaining silent. But the outcome obtained when both confess is worse for each than the outcome they would have obtained had both remained silent. A common view is that the puzzle illustrates a conflict between individual and group rationality. A group whose members pursue rational self-interest may all end up worse off than a group whose members act contrary to rational self-interest. More generally, if the payoffs are not assumed to represent self-interest, a group whose members rationally pursue any goals may all meet less success than if they had not rationally pursued their goals individually.

A closely related view is that the prisoner's dilemma game and its multi-player generalizations model familiar situations in which it is difficult to get rational, selfish agents to cooperate for their common good. Much of the contemporary literature has focused on identifying conditions under which players would or should make the “cooperative” move corresponding to remaining silent.

A slightly different interpretation takes the game to represent a choice between selfish behavior and socially desirable altruism. The move corresponding to confession benefits the actor, no matter what the other does, while the move corresponding to silence benefits the other player no matter what that player does. Benefiting oneself is not always wrong, of course, and benefiting others at the expense of oneself is not always morally required, but in the prisoner's dilemma game both players prefer the outcome with the altruistic moves to that with the selfish moves. This observation has led David Gauthier and others to take the Prisoner's Dilemma to say something important about the nature of morality.