Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: the scaling "inconsistency": OpenAI's new insight, published by nostalgebraist on the AI Alignment Forum.
I’ve now read the new OpenAI scaling laws paper. Also, yesterday I attended a fun and informative lecture/discussion with one of the authors.
While the topic is on my mind, I should probably jot down some of my thoughts.
This post is mostly about what the new paper says about the “inconsistency” brought up in their previous paper.
The new paper has a new argument on this topic, which is intuitive and appealing, and suggests that the current scaling trend will indeed “switch over” soon to a new one where dataset size, not model size, is the active constraint on performance. Most of this post is an attempt to explain and better understand this argument.
The new paper is mainly about extending the scaling laws from their earlier paper to new modalities.
In that paper, they found scaling laws for transformers trained autoregressively on text data. The new paper finds the same patterns in the scaling behavior of transformers trained autoregressively on images, math problems, etc.
So the laws aren’t telling us something about the distribution of text data, but about something more fundamental. That’s cool.
They also have a new, very intuitive hypothesis for what’s going on with the “scaling inconsistency” they described in the previous paper – the one I made a big deal about at the time. So that’s the part I’m most excited to discuss.
I’m going to give a long explanation of it, way longer than the relevant part of their paper. Some of this is original to me, all errors are mine, all the usual caveats.
1. L(C) and L(D)
To recap: the “inconsistency” is between two scaling laws:
The law for the best you can do, given a fixed compute budget.
This is L(C), sometimes called L(C_min). L is the loss (lower = better), C is your compute budget.
The law for the best you can do, given a fixed dataset size.
This is L(D), where D is the number of examples (say, tokens) in the dataset.
Once you reach a certain level of compute, these two laws contradict each other.
I’ll take some time to unpack that here, as it’s not immediately obvious the two can even be compared to one another – one is a function of compute, the other of data.
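To make the comparison more concrete, here is a minimal sketch of the two laws as power-law fits, in Python. The exponents below are roughly in the range reported in the earlier paper; the reference constants C_REF and D_REF are placeholders chosen for illustration, not the paper's actual fitted values.

    # Illustrative power-law forms of the two scaling laws.
    # Exponents are approximate; reference scales are placeholders.

    ALPHA_C = 0.050   # compute exponent (approximate)
    ALPHA_D = 0.095   # data exponent (approximate)
    C_REF = 3e8       # reference compute scale (placeholder, arbitrary units)
    D_REF = 5e13      # reference dataset size (placeholder, tokens)

    def loss_from_compute(c):
        """L(C): best loss achievable with compute budget c, spent optimally."""
        return (C_REF / c) ** ALPHA_C

    def loss_from_data(d):
        """L(D): best loss achievable when training to convergence on d examples."""
        return (D_REF / d) ** ALPHA_D

Both curves fall as their argument grows, but over different variables – which is why relating them requires a bridge from compute to data, the subject of the next section.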
2. C sets E, and E bounds D
Budget tradeoffs
Given a compute budget C, you can derive the optimal way to spend it on different things. Roughly, you are trading off between two ways to spend compute (see the sketch after this list):
Use C to buy “N”: Training a bigger model – “N” here is model size
Use C to buy “S”: Training for more steps “S” (gradient updates)
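To see the tradeoff numerically: the earlier paper uses the rough estimate that training compute is about C ≈ 6·N·B·S floating-point operations – roughly six FLOPs per parameter per processed example, counting the forward and backward passes. A minimal sketch, treating that factor of six as the rough approximation it is:

    def training_flops(n_params, batch_size, steps):
        """Rough training compute: ~6 FLOPs per parameter per processed example,
        so C is approximately 6 * N * B * S."""
        return 6 * n_params * batch_size * steps

    def max_steps(compute_budget, n_params, batch_size):
        """Largest step count S affordable at a given model size N and batch size B."""
        return compute_budget // (6 * n_params * batch_size)

Holding C fixed, doubling N roughly halves the number of steps (and hence examples) you can afford – exactly the tradeoff the L(C) law optimizes over.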
The relationship between S (steps) and D (dataset size) is a little subtle, for several reasons.
From step count to update count
For one thing, each single “step” is an update on the information from more than one data point. Specifically, a step updates on “B” different points – B is the batch size.
So the total number of data points processed during training is B times S. The papers sometimes call this quantity “E” (number of examples), so I’ll call it that too.
From update count to data count
Now, when you train an ML model, you usually update on each data point more than once. Typically, you’ll do one pass over the full dataset (updating on each point as you go along), then you’ll go back and do a second full pass, and then a third, etc. These passes are called “epochs.”
If you’re doing things this way, then for every point in the data, you get (number of epochs) updates out of it. So
E = (number of epochs) × D.
Some training routines don’t visit every point the exact same number of times – there’s nothing forcing you to do that. Still, for any training procedure, we can look at the quantity E / D.
This would be the number of epochs, if you're doing epochs. For a generic training routine, you can think of E / D as the "effective number of epochs": how many times, on average, each data point gets used for an update.
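Putting the bookkeeping together, here is a toy example with made-up numbers:

    batch_size = 512          # B: examples per gradient update
    steps = 1_000_000         # S: number of gradient updates
    dataset_size = 2.56e8     # D: number of examples in the dataset

    E = batch_size * steps                # total examples processed: 5.12e8
    effective_epochs = E / dataset_size   # E / D = 2.0: each point used about twice on average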