Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Deep Learning" Is Function Approximation, published by Zack M Davis on March 21, 2024 on LessWrong.
A Surprising Development in the Study of Multi-layer Parameterized Graphical Function Approximators
As a programmer and epistemology enthusiast, I've been studying some statistical modeling techniques lately! It's been boodles of fun, and might even prove useful in a future dayjob if I decide to pivot my career away from the backend web development roles I've taken in the past.
More specifically, I've mostly been focused on multi-layer parameterized graphical function approximators, which map inputs to outputs via a sequence of affine transformations composed with nonlinear "activation" functions.
(Some authors call these "deep neural networks" for some reason, but I like my name better.)
It's a curve-fitting technique: by setting the multiplicative factors and additive terms appropriately, multi-layer parameterized graphical function approximators can approximate any function. For a popular choice of "activation" rule which takes the maximum of the input and zero, the curve is specifically a piecewise-linear function.
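To make that concrete, here is a minimal sketch (my own illustration; the layer width, parameter names, and random initialization are arbitrary assumptions) of a two-layer approximator built from affine transformations composed with the max-with-zero "activation" rule:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)  # the "take the maximum of the input and zero" rule

def f(x, theta):
    """Map input x to an output via affine transformations and nonlinearities."""
    W1, b1, W2, b2 = theta
    hidden = relu(W1 @ x + b1)  # first affine transformation, then the nonlinearity
    return W2 @ hidden + b2     # final affine transformation

# With this activation rule, f(x, theta) is a piecewise-linear function of x.
rng = np.random.default_rng(0)
theta = (rng.normal(size=(8, 1)), rng.normal(size=8),
         rng.normal(size=(1, 8)), rng.normal(size=1))
print(f(np.array([0.5]), theta))
```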
We iteratively improve the approximation f(x,θ) by adjusting the parameters θ in the direction opposite the derivative of some error metric on the current approximation's fit to some example input-output pairs (x,y), which some authors call "gradient descent" for some reason. (The mean squared error (f(x,θ) − y)² is a popular choice for the error metric, as is the negative log likelihood −log P(y|f(x,θ)). Some authors call these "loss functions" for some reason.)
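Here is a minimal sketch of that improvement loop (again my own illustration, with made-up example pairs, a hand-picked learning rate, and the simplest possible approximator, a single affine map f(x,θ) = w·x + b), using the mean squared error as the error metric:

```python
import numpy as np

def fit(xs, ys, steps=1000, lr=0.05):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        preds = w * xs + b
        # Mean squared error: mean over examples of (f(x, θ) − y)².
        # Its derivatives with respect to w and b:
        dw = (2 / n) * np.sum((preds - ys) * xs)
        db = (2 / n) * np.sum(preds - ys)
        # Adjust the parameters opposite the direction of the derivative.
        w -= lr * dw
        b -= lr * db
    return w, b

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs + 1.0       # example input-output pairs ("training" data)
print(fit(xs, ys))        # ≈ (2.0, 1.0)
```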
Basically, the big empirical surprise of the previous decade is that given a lot of desired input-output pairs (x,y) and the proper engineering know-how, you can use large amounts of computing power to find parameters θ to fit a function approximator that "generalizes" well - meaning that if you compute ŷ = f(x,θ) for some x that wasn't in any of your original example input-output pairs (which some authors call "training" data for some reason), it turns out that ŷ is usually pretty similar to the y you would have used in an example (x,y) pair.
It wasn't obvious beforehand that this would work! You'd expect that if your function approximator has more parameters than you have example input-output pairs, it would overfit, implementing a complicated function that reproduced the example input-output pairs but outputted crazy nonsense for other choices of x - the more expressive function approximator proving useless for the lack of evidence to pin down the correct approximation.
And that is what we see for function approximators with only slightly more parameters than example input-output pairs, but for sufficiently large function approximators, the trend reverses and "generalization" improves - the more expressive function approximator proving useful after all, as it admits algorithmically simpler functions that fit the example pairs.
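The discussion above doesn't come with code, but a toy analogue of the over-parameterized regime is easy to play with: fit far more random-feature parameters than example pairs, take the minimum-norm coefficients that reproduce the examples exactly, and check the predictions at inputs that weren't in the examples. (The random ReLU features, the sine target, and the use of a minimum-norm linear fit as a stand-in for what iterative training actually finds are all simplifying assumptions on my part.)

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.sin                              # the "true" function to approximate

x_train = rng.uniform(-3, 3, size=20)        # 20 example input-output pairs
y_train = target(x_train)

# Random ReLU features: 500 parameters for 20 examples (heavily over-parameterized).
W = rng.normal(size=(500, 1))
b = rng.normal(size=500)
def features(x):
    return np.maximum(np.outer(x, W.ravel()) + b, 0.0)

# lstsq returns the minimum-norm coefficients that reproduce the examples exactly
# (a crude analogue of the "algorithmically simpler" interpolating functions).
theta, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)

x_test = rng.uniform(-3, 3, size=5)          # inputs not among the examples
print(np.c_[target(x_test), features(x_test) @ theta])  # true values vs. predictions
```

In this linear-in-features setting the minimum-norm interpolant tends to stay smooth between the examples rather than outputting crazy nonsense, which is the flavor of the reversal described above, though it is only an analogue of what gradient descent finds for large multi-layer approximators.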
The other week I was talking about this to an acquaintance who seemed puzzled by my explanation. "What are the preconditions for this intuition about neural networks as function approximators?" they asked. (I paraphrase only slightly.) "I would assume this is true under specific conditions," they continued, "but I don't think we should expect such niceness to hold under capability increases. Why should we expect this to carry forward?"
I don't know where this person was getting their information, but this made zero sense to me. I mean, okay, when you increase the number of parameters in your function approximator, it gets better at representing more complicated functions, which I guess you could describe as "capability increases"?
But multi-layer parameterized graphical function approximators created by iteratively using the derivative of some error metric to improve the quality ...