Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: What's So Bad About Ad-Hoc Mathematical Definitions?, published by johnswentworth on LessWrong.
Suppose it’s the early twentieth century, and we’re trying to quantify the concept of “information”. Specifically, we want to measure “how much information” one variable contains about another - for instance, how much information a noisy measurement of the temperature of an engine contains about the actual engine temperature.
Along comes Karl Pearson, and suggests using his "correlation coefficient" (specifically the square of the correlation coefficient, ρ²_XY). As a measure of information, this has some sensible properties:
If there’s no information, then ρ²_XY is zero.
If ρ²_XY is one, then there’s perfect information - one variable tells us everything there is to know about the other.
It’s symmetric: the amount of information which X tells us about Y equals the amount of information which Y tells us about X.
As an added bonus, it’s mathematically simple to calculate, estimate, and manipulate. Sure, it’s not very “principled”, but it seems like a good-enough measure to work with.
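To make the proxy concrete, here is a minimal sketch (mine, not from the original post) of computing ρ²_XY with numpy for a noisy engine-temperature measurement; the variable names and noise levels are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(90.0, 5.0, size=100_000)     # true engine temperature
y = x + rng.normal(0.0, 1.0, size=100_000)  # noisy measurement of it

# Squared correlation coefficient as an ad-hoc "information" measure
rho = np.corrcoef(x, y)[0, 1]
print(f"rho^2_XY = {rho**2:.3f}")  # close to 1: the measurement tells us a lot
```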
Karl Pearson. He'd make a solid movie villain; I get sort of a Tywin Lannister vibe.
Now an engineer from Bell Telephone shows up with a real-world problem: they’ve been contracted to create secure communications for the military. They want to ensure that externally-visible data Y contains no information about secret message X, so they need a way to measure “how much information” one variable contains about another. What a perfect use-case! We advise them to design their system so that X and Y have zero correlation.
A few years later, Bell Telephone gets a visit from a very unhappy colonel. Apparently the enemy has been reading their messages. Zero correlation was not enough to keep the secret messages secret.
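To see how zero correlation can coexist with total leakage, here is an illustrative sketch (my example, not Bell's actual system): take the secret X symmetric around zero and let the visible data be Y = X². The correlation vanishes, yet Y pins X down to a sign.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)  # secret message, symmetric around zero
y = x**2                      # externally-visible data

rho = np.corrcoef(x, y)[0, 1]
print(f"rho^2_XY = {rho**2:.6f}")  # ~0: "no information", says the proxy
# Yet y determines x up to sign: an adversary reading y recovers |x| exactly.
```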
Now, Bell could patch over this problem. For instance, they could pick a bunch of functions like X², sin(Y), e^(X²), X⁻¹, etc, and require that those also be uncorrelated. With enough functions, and a wide enough variety, that might be enough. But it’s going to get very complicated very quickly, with all these new design constraints piling up.
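A sketch of what that patch might look like in practice (the particular battery of functions here is hypothetical): compute the correlation for each transformed pair and require all of them to be near zero. Against the Y = X² leak above, some of these checks do fire - but every added function is another constraint the design must satisfy.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x**2  # the leak from the previous example

# A hypothetical battery of transformed pairs that must all be uncorrelated
checks = {
    "corr(X^2, Y)":   (x**2, y),
    "corr(X, sin Y)": (x, np.sin(y)),
    "corr(e^X, Y)":   (np.exp(x), y),
}
for name, (a, b) in checks.items():
    print(f"{name} = {np.corrcoef(a, b)[0, 1]:+.3f}")
# corr(X^2, Y) comes out at +1.000 -- this check catches the leak that the
# plain correlation coefficient missed.
```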
Fortunately, off in a corner of Bell Labs, one of their researchers already has an alternative solution. Claude Shannon suggests quantifying “how much information” X contains about Y using his “mutual information” metric I(X;Y). This has a bunch of sensible properties, but the main argument is that I(X;Y) is exactly the difference between the average number of bits one needs to send in a message in order to communicate the value of X, and the average number of bits one needs to send to communicate X if the receiving party already knows Y. It’s the number of bits saved by knowing Y. By imagining different things as the “message” and thinking about how hard it is to guess X after knowing Y, we can intuitively predict that this metric will apply to lots of different situations, including Bell’s secret message problem.
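That "bits saved" interpretation is just I(X;Y) = H(X) − H(X|Y). Here is a rough sketch of estimating it for discrete variables from samples (my own helper function, not Shannon's notation or anything from the post), applied to the zero-correlation leak from before:

```python
import numpy as np
from collections import Counter

def mutual_information_bits(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for discrete samples."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * np.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

rng = np.random.default_rng(0)
x = rng.integers(-2, 3, size=100_000)  # secret, uniform on {-2, ..., 2}
y = x**2                               # visible data: zero correlation with x

print(f"rho^2_XY ~ {np.corrcoef(x, y)[0, 1]**2:.5f}")          # ~0
print(f"I(X;Y)   ~ {mutual_information_bits(x, y):.2f} bits")  # ~1.52 bits
```

The proxy reports "no information" while mutual information correctly reports about a bit and a half leaking per symbol.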
Claude Shannon. Note the electronics in the background; this guy is my kind of theorist. No ivory tower for him.
Shannon advises the engineers to design their system so that X and Y have zero mutual information. And now, the enemy can’t read their messages quite so easily.
Proxies vs Definitions
In this story, what does the correlation coefficient do “wrong” which mutual information does “right”? What’s the generalizable lesson here?
The immediate difference is that correlation is a proxy for amount of information, while mutual information is a true definition/metric. When we apply optimization pressure to a proxy, it breaks down - that’s Goodhart’s Law. In this case, the optimization pressure is a literal adversary trying to read our secret messages. The optimizer finds the corner cases where our proxy no longer perfectly captures...