Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Deep Learning" Is Function Approximation, published by Zack M Davis on March 21, 2024 on LessWrong.
A Surprising Development in the Study of Multi-layer Parameterized Graphical Function Approximators
As a programmer and epistemology enthusiast, I've been studying some statistical modeling techniques lately! It's been boodles of fun, and might even prove useful in a future dayjob if I decide to pivot my career away from the backend web development roles I've taken in the past.
More specifically, I've mostly been focused on multi-layer parameterized graphical function approximators, which map inputs to outputs via a sequence of affine transformations composed with nonlinear "activation" functions.
(Some authors call these "deep neural networks" for some reason, but I like my name better.)
It's a curve-fitting technique: by setting the multiplicative factors and additive terms appropriately, multi-layer parameterized graphical function approximators can approximate any function. For a popular choice of "activation" rule which takes the maximum of the input and zero, the curve is specifically a piecewise-linear function.
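To make that concrete, here is a minimal sketch (my own illustration; the layer width, parameter names, and random initialization are arbitrary assumptions) of a two-layer approximator built from affine transformations composed with the max-with-zero "activation" rule:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)  # the "take the maximum of the input and zero" rule

def f(x, theta):
    """Map input x to an output via affine transformations and nonlinearities."""
    W1, b1, W2, b2 = theta
    hidden = relu(W1 @ x + b1)  # first affine transformation, then the nonlinearity
    return W2 @ hidden + b2     # final affine transformation

# With this activation rule, f(x, theta) is a piecewise-linear function of x.
rng = np.random.default_rng(0)
theta = (rng.normal(size=(8, 1)), rng.normal(size=8),
         rng.normal(size=(1, 8)), rng.normal(size=1))
print(f(np.array([0.5]), theta))
```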
We iteratively improve the approximation f(x,θ) by adjusting the parameters θ in the direction opposite the derivative of some error metric on the current approximation's fit to some example input-output pairs (x,y), which some authors call "gradient descent" for some reason. (The mean squared error (f(x,θ) − y)² is a popular choice for the error metric, as is the negative log likelihood −log P(y|f(x,θ)). Some authors call these "loss functions" for some reason.)
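Here is a minimal sketch of that improvement loop (again my own illustration, with made-up example pairs, a hand-picked learning rate, and the simplest possible approximator, a single affine map f(x,θ) = w·x + b), using the mean squared error as the error metric:

```python
import numpy as np

def fit(xs, ys, steps=1000, lr=0.05):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        preds = w * xs + b
        # Mean squared error: mean over examples of (f(x, θ) − y)².
        # Its derivatives with respect to w and b:
        dw = (2 / n) * np.sum((preds - ys) * xs)
        db = (2 / n) * np.sum(preds - ys)
        # Adjust the parameters opposite the direction of the derivative.
        w -= lr * dw
        b -= lr * db
    return w, b

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs + 1.0       # example input-output pairs ("training" data)
print(fit(xs, ys))        # ≈ (2.0, 1.0)
```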
Basically, the big empirical surprise of the previous decade is that given a lot of desired input-output pairs (x,y) and the proper engineering know-how, you can use large amounts of computing power to find parameters θ to fit a function approximator that "generalizes" well - meaning that if you compute ŷ = f(x,θ) for some x that wasn't in any of your original example input-output pairs (which some authors call "training" data for some reason), it turns out that ŷ is usually pretty similar to the y you would have used in an example (x,y) pair.
It wasn't obvious beforehand that this would work! You'd expect that if your function approximator has more parameters than you have example input-output pairs, it would overfit, implementing a complicated function that reproduced the example input-output pairs but outputted crazy nonsense for other choices of x - the more expressive function approximator proving useless for the lack of evidence to pin down the correct approximation.
And that is what we see for function approximators with only slightly more parameters than example input-output pairs, but for sufficiently large function approximators, the trend reverses and "generalization" improves - the more expressive function approximator proving useful after all, as it admits algorithmically simpler functions that fit the example pairs.
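The discussion above doesn't come with code, but a toy analogue of the over-parameterized regime is easy to play with: fit far more random-feature parameters than example pairs, take the minimum-norm coefficients that reproduce the examples exactly, and check the predictions at inputs that weren't in the examples. (The random ReLU features, the sine target, and the use of a minimum-norm linear fit as a stand-in for what iterative training actually finds are all simplifying assumptions on my part.)

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.sin                              # the "true" function to approximate

x_train = rng.uniform(-3, 3, size=20)        # 20 example input-output pairs
y_train = target(x_train)

# Random ReLU features: 500 parameters for 20 examples (heavily over-parameterized).
W = rng.normal(size=(500, 1))
b = rng.normal(size=500)
def features(x):
    return np.maximum(np.outer(x, W.ravel()) + b, 0.0)

# lstsq returns the minimum-norm coefficients that reproduce the examples exactly
# (a crude analogue of the "algorithmically simpler" interpolating functions).
theta, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)

x_test = rng.uniform(-3, 3, size=5)          # inputs not among the examples
print(np.c_[target(x_test), features(x_test) @ theta])  # true values vs. predictions
```

In this linear-in-features setting the minimum-norm interpolant tends to stay smooth between the examples rather than outputting crazy nonsense, which is the flavor of the reversal described above, though it is only an analogue of what gradient descent finds for large multi-layer approximators.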
The other week I was talking about this to an acquaintance who seemed puzzled by my explanation. "What are the preconditions for this intuition about neural networks as function approximators?" they asked. (I paraphrase only slightly.) "I would assume this is true under specific conditions," they continued, "but I don't think we should expect such niceness to hold under capability increases. Why should we expect this to carry forward?"
I don't know where this person was getting their information, but this made zero sense to me. I mean, okay, when you increase the number of parameters in your function approximator, it gets better at representing more complicated functions, which I guess you could describe as "capability increases"?
But multi-layer parameterized graphical function approximators created by iteratively using the derivative of some error metric to improve the quality ...