Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How well do truth probes generalise?, published by mishajw on February 25, 2024 on LessWrong.
Representation engineering (RepEng) has emerged as a promising research avenue for model interpretability and control. Recent papers have proposed methods for discovering truth in models with unlabeled data, guiding generation by modifying representations, and building LLM lie detectors. RepEng asks the question: If we treat representations as the central unit, how much power do we have over a model's behaviour?
Most techniques use linear probes to monitor and control representations. An important question is whether the probes generalise. If we train a probe on truths and lies about the locations of cities, will it generalise to truths and lies about Amazon review sentiment? This report focuses on truth because of its relevance to safety, and to keep the scope of the work narrow.
Generalisation is important. Humans typically have one generalised notion of "truth", and it would be enormously convenient if language models also had just one[1]. This would result in extremely robust model insights: every time the model "lies", this is reflected in its "truth vector", so we could detect intentional lies perfectly, and perhaps even steer away from them.
We find that truth probes generalise surprisingly well, with 36% of methodologies recovering >80% of the accuracy on out-of-distribution datasets compared with training directly on those datasets. The best probe recovers 92% accuracy.
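For concreteness, here's a minimal sketch of the "recovered accuracy" metric as I use it above (the helper is illustrative, not taken from the repo):

```python
def recovered_accuracy(acc_ood: float, acc_id: float) -> float:
    """Fraction of in-distribution accuracy recovered out-of-distribution.

    acc_ood: accuracy on dataset B of a probe trained on dataset A.
    acc_id:  accuracy on dataset B of a probe trained on B itself.
    """
    return acc_ood / acc_id

# A probe scoring 74% on a dataset whose in-distribution probe scores 80%
# recovers 92.5% of the accuracy.
assert abs(recovered_accuracy(0.74, 0.80) - 0.925) < 1e-9
```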
Thanks to Hoagy Cunningham for feedback and advice. Thanks to LISA for hosting me while I did a lot of this work. Code is available at mishajw/repeng, along with steps for reproducing datasets and plots.
Methods
We run all experiments on Llama-2-13b-chat, for parity with the source papers. Each probe is trained on 400 questions and evaluated on 2000 different questions, although numbers may be lower for smaller datasets.
What makes a probe?
A probe is created using a training dataset, a probe algorithm, and a layer.
We pass the training dataset through the model, extracting activations[2] just after a given layer. We then run some statistics over the activations (the exact technique varies significantly; this is the probe algorithm), and this produces a linear probe.
Probe algorithms and datasets are listed below.
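As one concrete example, here's a minimal sketch of a simple supervised probe algorithm, difference of class means (in the spirit of GoT's mass-mean probing; the papers' versions differ in details):

```python
import numpy as np

def fit_mass_mean_probe(activations: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Fit a linear probe direction as the difference of class means.

    activations: (n_examples, hidden_size), taken just after one layer.
    labels: (n_examples,) booleans, True for true statements.
    """
    true_mean = activations[labels].mean(axis=0)
    false_mean = activations[~labels].mean(axis=0)
    return true_mean - false_mean  # the probe vector v
```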
A probe allows us to take the activations and produce a scalar value, where larger values represent "truth" and smaller values represent "lies". The probe is always linear: it's defined by a vector v, and we use it by calculating the dot product with the activations a, giving vᵀa. In most cases, we can avoid picking a threshold to distinguish between truth and lies (see appendix for details).
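Applying the probe is just a dot product. The paired comparison below is one way to score without picking a threshold; it's an illustration, not necessarily the exact scheme described in the appendix:

```python
import numpy as np

def probe_score(v: np.ndarray, a: np.ndarray) -> float:
    """vᵀa: larger values read as "truth", smaller as "lies"."""
    return float(v @ a)

def paired_accuracy(v: np.ndarray, true_acts: np.ndarray, false_acts: np.ndarray) -> float:
    """Threshold-free accuracy on paired statements: the true version of
    each statement should score higher than the false version."""
    return float(np.mean(true_acts @ v > false_acts @ v))
```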
We always take the activations from the last token position in the prompt. For the majority of the datasets, the factuality of the text is only revealed at the last token, for example when the answer is stated as true/false or A/B/C/D.
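Concretely, extraction looks something like this with Hugging Face transformers (a sketch; the repo's actual code may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)

@torch.no_grad()
def last_token_activations(prompt: str, layer: int) -> torch.Tensor:
    """Residual-stream activations just after `layer`, at the last token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer L is at index L.
    return outputs.hidden_states[layer][0, -1, :]
```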
For this report, we've replicated the probe algorithms and datasets from three papers:
Discovering Latent Knowledge in Language Models Without Supervision (DLK).
Representation Engineering: A Top-Down Approach to AI Transparency (RepE).
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets (GoT).
We also borrow a lot of terminology from Eliciting Latent Knowledge from Quirky Language Models (QLM), which offers another great comparison between probe algorithms.
Probe algorithms
The DLK, RepE, GoT, and QLM papers describe eight probe algorithms. For each algorithm, we can ask whether it's supervised and whether it uses grouped data.
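To make the grouped/unsupervised corner concrete: given contrast pairs but no labels, one option is to take the top principal component of the pair differences. This sketch is loosely in the spirit of the PCA-based methods in RepE; the eight algorithms differ in details:

```python
import numpy as np

def fit_contrast_pca_probe(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Unsupervised, grouped: probe direction from contrast pairs.

    pos_acts, neg_acts: (n_pairs, hidden_size) activations for the two
    versions of each statement (e.g. "... is true" vs "... is false").
    No labels are used: we don't know which version is the true one.
    """
    diffs = pos_acts - neg_acts
    diffs = diffs - diffs.mean(axis=0)  # centre before PCA
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]  # top principal component: direction of greatest variance
```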
Supervised algorithms use the true/false labels to discover probes. This should allow better performance when truth isn't salient in the activations. However, using supervised data encourages the probes to ...