Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Updating the Lottery Ticket Hypothesis, published by johnswentworth on the AI Alignment Forum.
Epistemic status: not confident enough to bet against someone who’s likely to understand this stuff.
The lottery ticket hypothesis of neural network learning (as aptly described by Daniel Kokotajlo) roughly says:
When the network is randomly initialized, there is a sub-network that is already decent at the task. Then, when training happens, that sub-network is reinforced and all other sub-networks are dampened so as to not interfere.
This is a very simple, intuitive, and useful picture to have in mind, and the original paper presents interesting evidence for at least some form of the hypothesis. Unfortunately, the strongest forms of the hypothesis do not seem plausible - e.g. I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization. Modern neural networks are big, but not that big. (See this comment for some clarification of this claim.)
Meanwhile, a cluster of research has shown that large neural networks approximate certain Bayesian models, involving phrases like “neural tangent kernel (NTK)” or “Gaussian process (GP)”. Mingard et al. show that these models explain the large majority of the good performance we see from large neural networks in practice. This view also implies a version of the lottery ticket hypothesis, but it has different implications for what the “lottery tickets” are. They’re not subcircuits of the initial net, but rather subcircuits of the parameter tangent space of the initial net.
This post will sketch out what that means.
Let’s start with the jargon: what’s the “parameter tangent space” of a neural net? Think of the network as a function $f$ with two kinds of inputs: parameters $\theta$, and data inputs $x$. During training, we try to adjust the parameters so that the function sends each data input $x_n$ to the corresponding data output $y_n$ - i.e. find $\theta$ for which $y_n \approx f(x_n, \theta)$, for all $n$. Each data point gives an equation which $\theta$ must satisfy, in order for that data input to be exactly mapped to its target output. If our initial parameters $\theta_0$ happen to be close enough to a solution to those equations, then we can (approximately) solve this using a linear approximation: we look for $\Delta\theta$ such that

$$y_n \approx f(x_n, \theta_0) + \Delta\theta \cdot \frac{df}{d\theta}(x_n, \theta_0)$$
The right-hand side of that equation is essentially the parameter tangent space. More precisely, (what I’m calling) the parameter tangent space at $\theta_0$ is the set of functions $F(x)$ of the form

$$F(x) = f(x, \theta_0) + \Delta\theta \cdot \frac{df}{d\theta}(x, \theta_0)$$

for some $\Delta\theta$. In other words: the parameter tangent space is the set of functions which can be written as linear approximations (with respect to the parameters) of the network.
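To make “set of functions” concrete: each choice of $\Delta\theta$ picks out one function $F$ in the tangent space, and the space is affine in $\Delta\theta$ (offsets from $f(\cdot, \theta_0)$ add linearly). A pure-Python sketch, reusing the same made-up two-parameter model as an example:

```python
import math

def f(x, theta):
    """A tiny hypothetical 'network': f(x, theta) = a * tanh(b * x)."""
    a, b = theta
    return a * math.tanh(b * x)

def grad_theta(x, theta, eps=1e-6):
    """Finite-difference gradient of f with respect to the parameters at theta."""
    base = f(x, theta)
    g = []
    for i in range(len(theta)):
        bumped = list(theta)
        bumped[i] += eps
        g.append((f(x, bumped) - base) / eps)
    return g

def tangent_fn(theta0, dtheta):
    """One element of the parameter tangent space at theta0:
    F(x) = f(x, theta0) + dtheta . df/dtheta(x, theta0)."""
    def F(x):
        return f(x, theta0) + sum(d * g for d, g in zip(dtheta, grad_theta(x, theta0)))
    return F

theta0 = [0.5, -1.2]
F1 = tangent_fn(theta0, [0.01, 0.0])
F2 = tangent_fn(theta0, [0.0, -0.02])
F12 = tangent_fn(theta0, [0.01, -0.02])

# Affine structure: the offset of F12 from f(., theta0) is the sum of the offsets of F1 and F2.
x = 0.7
base = f(x, theta0)
gap = abs((F12(x) - base) - ((F1(x) - base) + (F2(x) - base)))
print(gap)  # essentially zero
```

Note that each $F$ is linear in $\Delta\theta$ but still generally nonlinear in the input $x$, since the gradient $\frac{df}{d\theta}(x, \theta_0)$ varies with $x$.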
The main empirical finding which led to the NTK/GP/Mingard et al picture of neural nets is that, in practice, that linear approximation works quite well. As neural networks get large, their parameters change by only a very small amount during training, so the overall $\Delta\theta$ found during training is actually nearly a solution to the linearly-approximated equations.
Major upshot of all this: the space-of-models “searched over” during training is approximately just the parameter tangent space.
At initialization, we randomly choose $\theta_0$, and that determines the parameter tangent space - that’s our set of “lottery tickets”. The SGD training process then solves the equations - it picks out the lottery tickets which perfectly match the data. In practice, there will be many such lottery tickets - many solutions to the equations - because modern nets are extremely overparameterized. SGD effectively picks one of them at random (that’s one of the main results of the Mingard et al work).