Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: ProLU: A Pareto Improvement for Sparse Autoencoders, published by Glen M. Taggart on April 23, 2024 on The AI Alignment Forum.
Abstract
This paper presents ProLU, an alternative to ReLU as the activation function in sparse autoencoders. It yields a Pareto improvement over both the standard sparse autoencoder architecture and sparse autoencoders trained with the Sqrt(L1) penalty.
Introduction
SAE Context and Terminology
Learnable parameters of a sparse autoencoder:
W_{enc} : encoder weights
W_{dec} : decoder weights
b_{enc} : encoder bias
b_{dec} : decoder bias
Training
Notation: Encoder/Decoder
Let
encode(x) = ReLU((x - b_{dec}) W_{enc} + b_{enc})
decode(a) = a W_{dec} + b_{dec}
so that the full computation done by an SAE can be expressed as
SAE(x) = decode(encode(x))
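To make the shapes and parameters concrete, here is a minimal PyTorch sketch of this encoder/decoder; the class name and the dimensions d_model and d_sae are illustrative choices, not taken from the original post.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Minimal sketch: encode(x) = ReLU((x - b_dec) W_enc + b_enc), decode(a) = a W_dec + b_dec
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)  # encoder weights
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)  # decoder weights
        self.b_enc = nn.Parameter(torch.zeros(d_sae))                  # encoder bias
        self.b_dec = nn.Parameter(torch.zeros(d_model))                # decoder bias

    def encode(self, x):
        # x: (batch, d_model) -> feature activations a: (batch, d_sae)
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, a):
        return a @ self.W_dec + self.b_dec

    def forward(self, x):
        return self.decode(self.encode(x))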
An SAE is trained with gradient descent on the objective
L(x) = ||x - SAE(x)||_2^2 + λ P(a), with a = encode(x),
where λ is the sparsity penalty coefficient (often called the "L1 coefficient") and P is the sparsity penalty function, used to encourage sparsity.
P is commonly the L1 norm, ||a||_1, but recently the L_{1/2} penalty, ||a||_{1/2}^{1/2}, has been shown to produce a Pareto improvement on the L0 and CE metrics.
Sqrt(L1) SAEs
There has been other work producing Pareto improvements to SAEs by taking P(a) = ||a||_{1/2}^{1/2} as the penalty function. We will use this as a further baseline to compare against when assessing our models.
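As a concrete illustration of this objective, the sketch below implements the reconstruction loss plus a λ-weighted sparsity penalty, with both the L1 and the Sqrt(L1) penalties as options; the exact normalization (mean over the batch) is an assumption, and the function names are illustrative.

import torch

def l1_penalty(a):
    # P(a) = ||a||_1, summed over features, averaged over the batch
    return a.abs().sum(dim=-1).mean()

def sqrt_l1_penalty(a):
    # P(a) = ||a||_{1/2}^{1/2} = sum_i sqrt(|a_i|)
    return a.abs().sqrt().sum(dim=-1).mean()

def sae_loss(sae, x, lam, penalty=l1_penalty):
    a = sae.encode(x)
    x_hat = sae.decode(a)
    recon = ((x_hat - x) ** 2).sum(dim=-1).mean()  # squared reconstruction error
    return recon + lam * penalty(a)                # sparsity-penalized objective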
Motivation: Inconsistent Scaling in Sparse Autoencoders
Due to the affine translation, sparse autoencoder features with nonzero encoder biases only perfectly reconstruct feature magnitudes at a single point.
This poses difficulties if activation magnitudes for a fixed feature tend to vary over a wide range. This potential problem motivates the concept of scale consistency:
A scale-consistent response curve
The bias maintains its role in noise suppression, but no longer translates activation magnitudes when the feature is active.
The lack of gradients for the encoder bias term poses a challenge for learning with gradient descent. This paper formalizes an activation function that gives SAEs this scale-consistent response curve, motivates and proposes two plausible synthetic gradients for it, and compares scale-consistent models trained with each synthetic gradient to standard SAEs and to SAEs trained with the Sqrt(L1) penalty.
Scale Consistency Desiderata
Notation: Centered Submodule
The use of the decoder bias can be viewed as performing centering on the inputs to a centered SAE and then reversing the centering on the outputs:
SAE(x) = SAE_{cent}(x - b_{dec}) + b_{dec}
where
SAE_{cent}(x) = ReLU(x W_{enc} + b_{enc}) W_{dec}
Notation: Specified Feature
Let W^i_{enc} and W^i_{dec} denote the encoder and decoder weights, and b^i_{enc} the encoder bias, for the i-th feature. Then, let
SAE^i(x) = SAE^i_{cent}(x - b_{dec}) + b_{dec}
where
SAE^i_{cent}(x) = ReLU(x W^i_{enc} + b^i_{enc}) W^i_{dec}
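Note that, by these definitions, the centered reconstruction decomposes into a sum of per-feature contributions, SAE_{cent}(x) = Σ_i SAE^i_{cent}(x), since each feature's activation scales its own row of W_{dec}.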
Conditional Linearity
Noise Suppression Threshold
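Informally: when a feature is active, its reconstruction SAE^i_{cent}(x) should scale linearly with the feature's pre-activation (conditional linearity), and the feature should output zero whenever that pre-activation falls below a threshold set by the bias (noise suppression).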
Methods
Proportional ReLU (ProLU)
We define the Proportional ReLU (ProLU) as an activation of two arguments, a pre-activation term m and a bias term b, in which b gates whether a unit is active but does not translate the magnitude of its output.
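One piecewise form consistent with these desiderata, sketched here under the assumption that a unit is active exactly when both its pre-activation and its biased pre-activation are positive, is:
ProLU(m, b)_i = m_i if m_i > 0 and m_i + b_i > 0
ProLU(m, b)_i = 0 otherwise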
Backprop with ProLU:
To use ProLU in SGD-optimized models, we first address the lack of gradients wrt. the b term.
ReLU gradients:
For comparison and later use, we will first consider ReLU: its partial derivatives are well defined at all points other than x_i = 0:
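∂ReLU(x)_i / ∂x_i = 1 if x_i > 0, and 0 if x_i < 0 (undefined at x_i = 0), with ∂ReLU(x)_i / ∂x_j = 0 for j ≠ i.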
Gradients of ProLU:
Partials of ProLU wrt. m are similarly well defined:
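Under the piecewise form sketched above, ∂ProLU(m, b)_i / ∂m_i = 1 where the unit is active (m_i > 0 and m_i + b_i > 0) and 0 where it is inactive, undefined only on the boundary.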
However, they are not well defined wrt. b: because b only gates a unit rather than shifting its magnitude, the output is locally constant in b away from the threshold, so the derivative is zero wherever it exists and undefined at the threshold itself, providing no useful learning signal. We must therefore synthesize these gradients.
Notation: Synthetic Gradients
Let \hat{∂}f/∂x denote the synthetic partial derivative of f wrt. x, and \hat{∇}f the synthetic gradient of f, used for backpropagation as a stand-in for the true gradient.
Different synthetic gradient types
We train two classes of ProLU with different synthetic gradients. These are distinguished by their subscript:
ProLU_{ReLU}
ProLU_{STE}
They are identical in output, but have different synthetic gradients, i.e. ProLU_{ReLU}(m, b) = ProLU_{STE}(m, b) for all inputs, while \hat{∇}ProLU_{ReLU} ≠ \hat{∇}ProLU_{STE}.
ReLU-Like Gradients: ProLU_{ReLU}
The first synthetic gradient is very similar to the gradient for ReLU: we retain the true gradient wrt. m, and define the synthetic gradient wrt. b analogously to ReLU's gradient wrt. its bias.
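The following PyTorch sketch shows one way this could be implemented as a custom autograd function. It assumes the piecewise ProLU form given earlier and takes the synthetic ∂/∂b to be 1 on active units, mirroring ReLU's gradient wrt. its bias; that choice is an assumption for illustration, not a formula quoted from the post.

import torch

class ProLUReLUGrad(torch.autograd.Function):
    # Sketch of ProLU with ReLU-like synthetic gradients (assumed form).

    @staticmethod
    def forward(ctx, m, b):
        # Assumed gate: active where m > 0 and m + b > 0
        gate = ((m > 0) & ((m + b) > 0)).to(m.dtype)
        ctx.save_for_backward(gate)
        ctx.b_shape = b.shape
        return m * gate

    @staticmethod
    def backward(ctx, grad_out):
        (gate,) = ctx.saved_tensors
        grad_m = grad_out * gate  # true gradient wrt. m is retained
        grad_b = grad_out * gate  # synthetic gradient wrt. b: 1 on active units (assumption)
        # Sum out broadcast batch dimensions so grad_b matches b's shape (e.g. a (d_sae,) bias)
        while grad_b.dim() > len(ctx.b_shape):
            grad_b = grad_b.sum(dim=0)
        return grad_m, grad_b

In the encoder, m would play the role of (x - b_{dec}) W_{enc} and b the role of b_{enc}, so feature activations become a = ProLUReLUGrad.apply(m, b) in place of ReLU(m + b).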
Thresh STE Derived Gradients: ProLU_{STE}
The second class of ProLU uses synthetic gradients derived from a straight-through estimator (STE) of its thresholding operation.