Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: What's So Bad About Ad-Hoc Mathematical Definitions?, published by johnswentworth on LessWrong.
Suppose it’s the early twentieth century, and we’re trying to quantify the concept of “information”. Specifically, we want to measure “how much information” one variable contains about another - for instance, how much information a noisy measurement of the temperature of an engine contains about the actual engine temperature.
Along comes Karl Pearson, and suggests using his "correlation coefficient" (specifically the square of the correlation coefficient, ρ²_XY). As a measure of information, this has some sensible properties:
If there’s no information, then ρ²_XY is zero.
If ρ²_XY is one, then there’s perfect information - one variable tells us everything there is to know about the other.
It’s symmetric: the amount of information which X tells us about Y equals the amount of information which Y tells us about X.
As an added bonus, it’s mathematically simple to calculate, estimate, and manipulate. Sure, it’s not very “principled”, but it seems like a good-enough measure to work with.
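To make the proxy concrete, here is a minimal sketch (mine, not from the original post) of computing ρ²_XY with numpy for a noisy engine-temperature measurement; the variable names and noise levels are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(90.0, 5.0, size=100_000)     # true engine temperature
y = x + rng.normal(0.0, 1.0, size=100_000)  # noisy measurement of it

# Squared correlation coefficient as an ad-hoc "information" measure
rho = np.corrcoef(x, y)[0, 1]
print(f"rho^2_XY = {rho**2:.3f}")  # close to 1: the measurement tells us a lot
```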
Karl Pearson. He'd make a solid movie villain; I get sort of a Tywin Lannister vibe.
Now an engineer from Bell Telephone shows up with a real-world problem: they’ve been contracted to create secure communications for the military. They want to ensure that externally-visible data Y contains no information about secret message X, so they need a way to measure “how much information” one variable contains about another. What a perfect use-case! We advise them to design their system so that X and Y have zero correlation.
A few years later, Bell Telephone gets a visit from a very unhappy colonel. Apparently the enemy has been reading their messages. Zero correlation was not enough to keep the secret messages secret.
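To see how zero correlation can coexist with total leakage, here is an illustrative sketch (my example, not Bell's actual system): take the secret X symmetric around zero and let the visible data be Y = X². The correlation vanishes, yet Y pins X down to a sign.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)  # secret message, symmetric around zero
y = x**2                      # externally-visible data

rho = np.corrcoef(x, y)[0, 1]
print(f"rho^2_XY = {rho**2:.6f}")  # ~0: "no information", says the proxy
# Yet y determines x up to sign: an adversary reading y recovers |x| exactly.
```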
Now, Bell could patch over this problem. For instance, they could pick a bunch of functions like X², sin(Y), e^(X²), X⁻¹, etc, and require that those also be uncorrelated. With enough functions, and a wide enough variety, that might be enough. But it’s going to get very complicated very quickly, with all these new design constraints piling up.
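A sketch of what that patch might look like in practice (the particular battery of functions here is hypothetical): compute the correlation for each transformed pair and require all of them to be near zero. Against the Y = X² leak above, some of these checks do fire - but every added function is another constraint the design must satisfy.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x**2  # the leak from the previous example

# A hypothetical battery of transformed pairs that must all be uncorrelated
checks = {
    "corr(X^2, Y)":   (x**2, y),
    "corr(X, sin Y)": (x, np.sin(y)),
    "corr(e^X, Y)":   (np.exp(x), y),
}
for name, (a, b) in checks.items():
    print(f"{name} = {np.corrcoef(a, b)[0, 1]:+.3f}")
# corr(X^2, Y) comes out at +1.000 -- this check catches the leak that the
# plain correlation coefficient missed.
```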
Fortunately, off in a corner of Bell Labs, one of their researchers already has an alternative solution. Claude Shannon suggests quantifying “how much information” X contains about Y using his “mutual information” metric I(X;Y). This has a bunch of sensible properties, but the main argument is that I(X;Y) is exactly the difference between the average number of bits one needs to send in a message in order to communicate the value of X, and the average number of bits one needs to send to communicate X if the receiving party already knows Y. It’s the number of bits saved by knowing Y. By imagining different things as the “message” and thinking about how hard it is to guess X after knowing Y, we can intuitively predict that this metric will apply to lots of different situations, including Bell’s secret message problem.
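That "bits saved" interpretation is just I(X;Y) = H(X) − H(X|Y). Here is a rough sketch of estimating it for discrete variables from samples (my own helper function, not Shannon's notation or anything from the post), applied to the zero-correlation leak from before:

```python
import numpy as np
from collections import Counter

def mutual_information_bits(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for discrete samples."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * np.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

rng = np.random.default_rng(0)
x = rng.integers(-2, 3, size=100_000)  # secret, uniform on {-2, ..., 2}
y = x**2                               # visible data: zero correlation with x

print(f"rho^2_XY ~ {np.corrcoef(x, y)[0, 1]**2:.5f}")          # ~0
print(f"I(X;Y)   ~ {mutual_information_bits(x, y):.2f} bits")  # ~1.52 bits
```

The proxy reports "no information" while mutual information correctly reports about a bit and a half leaking per symbol.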
Claude Shannon. Note the electronics in the background; this guy is my kind of theorist. No ivory tower for him.
Shannon advises the engineers to design their system so that X and Y have zero mutual information. And now, the enemy can’t read their messages quite so easily.
Proxies vs Definitions
In this story, what does the correlation coefficient do “wrong” which mutual information does “right”? What’s the generalizable lesson here?
The immediate difference is that correlation is a proxy for amount of information, while mutual information is a true definition/metric. When we apply optimization pressure to a proxy, it breaks down - that’s Goodhart’s Law. In this case, the optimization pressure is a literal adversary trying to read our secret messages. The optimizer finds the corner cases where our proxy no longer perfectly captures...