Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OthelloGPT learned a bag of heuristics, published by jylin04 on July 2, 2024 on The AI Alignment Forum.
Work performed as a part of Neel Nanda's MATS 6.0 (Summer 2024) training program.
TLDR
This is an interim report on reverse-engineering Othello-GPT, an 8-layer transformer trained to take sequences of Othello moves and predict legal moves. We find evidence that Othello-GPT learns to compute the board state using many independent decision rules that are localized to small parts of the board.
Though we cannot rule out that it also learns a single succinct algorithm in addition to these rules, our best guess is that Othello-GPT's learned algorithm is just a bag of independent heuristics.
Board state reconstruction
1. Direct attribution to linear probes indicates that the internal board representation is frequently up- and down-weighted during a forward pass.
2. Case study of a decision rule:
1. MLP neuron L1N421 represents the decision rule: if the move A4 was just played AND B4 is occupied AND C4 is occupied, update B4, C4, and D4 to "theirs". This rule does not generalize to translations across the board.
2. Another neuron, L0N377, participates in the implementation of this rule by checking whether B4 is occupied, and inhibiting the activation of L1N421 if it is not.
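The decision rule above can be sketched as a simple lookup on the last move and two fixed squares. This is an illustrative reconstruction, not the authors' code; the function name and board encoding are hypothetical.

```python
def l1n421_rule(last_move, board):
    """Hypothetical sketch of the decision rule attributed to neuron L1N421.

    last_move: the move just played, e.g. "A4".
    board: dict mapping square names to "blank", "mine", or "yours".
    Returns the squares that the rule would update to "theirs".
    """
    # The rule is hard-coded to this exact configuration; per the report,
    # it does not generalize to translated versions of the same pattern.
    if last_move == "A4" and board["B4"] != "blank" and board["C4"] != "blank":
        return ["B4", "C4", "D4"]
    return []

# Example: rule fires only when A4 was just played and B4, C4 are occupied.
print(l1n421_rule("A4", {"B4": "mine", "C4": "yours"}))   # fires
print(l1n421_rule("A4", {"B4": "blank", "C4": "yours"}))  # inhibited (B4 blank)
```

The second call mirrors the role of L0N377: when B4 is blank, the rule is suppressed.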
Legal move prediction
1. A subset of neurons in mid to late MLP layers classify board configurations that are sufficient to make a certain move legal with an F1-score above 0.99. These neurons have high direct attribution to the logit for that move, and are causally relevant for legal move prediction.
2. Logit lens suggests that legal move predictions gradually solidify during a forward pass.
3. Some MLP neurons systematically activate at certain times in the game, regardless of the moves played so far. We hypothesize that these neurons encode heuristics about moves that are more probable in specific phases (early/mid/late) of the game.
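The F1-score claim in point 1 treats each neuron as a binary classifier: the neuron "predicts" that a move is legal whenever its activation exceeds a threshold. A minimal sketch of that evaluation (the helper name and the zero threshold are assumptions for illustration):

```python
def neuron_f1(activations, labels, threshold=0.0):
    """Score a neuron as a binary classifier of board configurations.

    activations: per-example neuron activations.
    labels: True where the board configuration makes the move legal.
    Returns the F1 score of "activation > threshold" as a predictor.
    """
    preds = [a > threshold for a in activations]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum(not p and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A neuron that fires exactly on the legal configurations scores 1.0.
print(neuron_f1([2.1, 1.3, -0.5, -1.7], [True, True, False, False]))
```

An F1 above 0.99 under this kind of evaluation is what the report means by a neuron "classifying" a board configuration.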
Review of Othello-GPT
Othello-GPT is a transformer with 25M parameters trained on sequences of random legal moves in the board game Othello as inputs[1] to predict legal moves[2].
How it does this is a black box that we don't understand. Its claim to fame is that it supposedly
1. Learns an internal representation of the board state;
2. Uses it to predict legal moves
which, if true, splits the black box in two[3].
The evidence for the first claim is that linear probes work. Namely, for each square of the ground-truth game board, if we train a linear classifier to take the model's activations at layer 6 as input and predict logits for whether that square is blank, "mine" (i.e. belonging to the player whose move it currently is), or "yours", the probes achieve high accuracy on games not seen during training.
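At inference time, such a probe is just an affine map from the residual stream to three class logits per square. A minimal sketch, with assumed dimensions and randomly initialized weights standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_classes = 512, 3  # classes: blank / mine / yours (d_model assumed)

# In practice W and b are trained (e.g. by logistic regression) on layer-6
# activations from many games; random values here are placeholders.
W = rng.normal(size=(d_model, n_classes))
b = np.zeros(n_classes)

def probe_square(activation):
    """Predict one square's state from one residual-stream activation."""
    logits = activation @ W + b
    return ["blank", "mine", "yours"][int(np.argmax(logits))]

print(probe_square(rng.normal(size=d_model)))
```

One probe of this shape is trained per board square; "the probes work" means these 64 classifiers generalize to held-out games.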
The evidence for the second claim is that if we edit the residual stream until the probe's outputs change, the model's own output at the end of layer 7 becomes consistent with legal moves that are accessible from the new board state.
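One common way to perform such an edit is to push the residual stream along a probe direction until the probe reports the desired square state. The sketch below shows this style of intervention; the function name and target logit value are assumptions, not the original intervention code.

```python
import numpy as np

def flip_square(resid, probe_dir, target_logit=5.0):
    """Edit a residual-stream vector so the probe reads a new square state.

    resid: residual-stream activation at the probed layer.
    probe_dir: probe direction for the target class (e.g. "yours" - "mine").
    Moves resid along probe_dir until its projection equals target_logit.
    """
    current = resid @ probe_dir
    step = (target_logit - current) / (probe_dir @ probe_dir)
    return resid + step * probe_dir

# After the edit, the probe's logit along probe_dir equals the target.
resid = np.zeros(4)
probe_dir = np.array([1.0, 0.0, 0.0, 0.0])
print(flip_square(resid, probe_dir) @ probe_dir)
```

Checking that the model's layer-7 outputs become legal moves for the *edited* board state is what licenses the claim that the representation is causally used.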
However, we don't yet understand what's going on in the remaining black boxes. In particular, although it would be interesting if Othello-GPT emergently learned to implement them via algorithms with relatively short description lengths, the evidence so far doesn't rule out the possibility that they could be implemented via a bag of heuristics instead.
Project goal
Our goal in this project was simply to figure out what's going on in the remaining black boxes.
1. What's going on in box #1 - how does the model compute the board representation?
1. How does the model decide if a cell is blank or not blank?
2. How does the model decide if a cell is "mine" or "yours"?
2. What's going on in box #2 - how does the model use the board representation to pick legal moves?
Results on box #1: Board reconstruction
A circuit for how the model computes if a cell is blank or not blank...