Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Simple versus Short: Higher-order degeneracy and error-correction, published by Daniel Murfet on March 11, 2024 on LessWrong.
TLDR: The simplicity bias in Bayesian statistics is not just a bias towards short description length.
The folklore relating the simplicity bias in Bayesian statistics to description length is incomplete: while it is true that the fewer parameters you use the better, the true complexity measure which appears in the mathematical theory of Bayesian statistics (that is, singular learning theory) is more exotic. The content of this complexity measure remains quite mysterious, but in this note we point out that in a particular setting it includes a bias towards runtime error-correction.
This suggests caution when reasoning about the role of inductive biases in neural network training.
Acknowledgements. Thanks to Jesse Hoogland, Liam Carroll, Rumi Salazar and Simon Pepin Lehalleur for comments.
1. Background
1.1 Relevance to Deep Learning
Consider the problem of solving an ordinary differential equation. A constructive proof involves actually writing down a solution, or an algorithm that in finite time will produce a solution. The Picard-Lindelöf theorem proves that a solution to a broad class of initial value problems exists, but the proof is not constructive: it sets up a contraction mapping on a complete metric space and appeals to the Banach fixed point theorem.
While the Picard-Lindelöf theorem uniquely characterises the solution as the fixed point of a contraction mapping, and gives an iterative process for approximating the solution, it does not construct the solution. However a construction is not necessary for many of the applications of Picard-Lindelöf (in differential geometry, topology and many parts of analysis).
This mode of reasoning about mathematical objects, where it suffices to have characterised[1] them by (universal) properties, is pervasive in modern mathematics (in the above example, the characterising property is the differential equation, or its associated contraction mapping).
However this may seem quite alien to a computer scientist or programmer, who for historical reasons tend to think that there is only one mode of reasoning about mathematical objects, and that is centred on the study of a construction.
In an era where programs are increasingly the product of gradient descent rather than human construction, this attitude is untenable. We may have to accept a mode of reasoning about learned programs, based on understanding the nature of the problems to which they are a solution and the iterative processes that produce them. To understand the implicit algorithms learned by neural networks, it may be necessary from this perspective to understand
the computational structures latent in the data distribution, and
the inductive biases of neural network training.
We do not currently have a good understanding of these matters. If we understood these inductive biases better, it could conceivably help us in the context of AI alignment to answer questions like "how likely is deceptive alignment", "how likely is consequentialism", and "what goals are instrumentally convergent"?
This note is about the inductive biases of the Bayesian learning process (conditioned on more samples, the posterior increasingly localises around true parameters). Since Bayesian statistics is both fundamental and theoretically tractable, this seems potentially useful for understanding the inductive biases of neural network training. However it is worth noting that the relation between these is not understood at present.
1.2 Singular Learning Theory
The asymptotic expansion of the Bayesian free energy, or "free energy formula'', proven by Watanabe in Singular Learning Theory (SLT) introduces the learning coefficient λ as a measure of complexity that balances ...
view more