Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: the scaling "inconsistency": OpenAI's new insight, published by nostalgebraist on the AI Alignment Forum.
I’ve now read the new OpenAI scaling laws paper. Also, yesterday I attended a fun and informative lecture/discussion with one of the authors.
While the topic is on my mind, I should probably jot down some of my thoughts.
This post is mostly about what the new paper says about the “inconsistency” brought up in their previous paper.
The new paper has a new argument on this topic, which is intuitive and appealing, and suggests that the current scaling trend will indeed “switch over” soon to a new one where dataset size, not model size, is the active constraint on performance. Most of this post is an attempt to explain and better understand this argument.
The new paper is mainly about extending the scaling laws from their earlier paper to new modalities.
In that paper, they found scaling laws for transformers trained autoregressively on text data. The new paper finds the same patterns in the scaling behavior of transformers trained autoregressively on images, math problems, etc.
So the laws aren’t telling us something about the distribution of text data, but about something more fundamental. That’s cool.
They also have a new, very intuitive hypothesis for what’s going on with the “scaling inconsistency” they described in the previous paper – the one I made a big deal about at the time. So that’s the part I’m most excited to discuss.
I’m going to give a long explanation of it, way longer than the relevant part of their paper. Some of this is original to me, all errors are mine, all the usual caveats.
1. L(C) and L(D)
To recap: the “inconsistency” is between two scaling laws:
The law for the best you can do, given a fixed compute budget.
This is L(C), sometimes called L(C_min). L is the loss (lower = better), C is your compute budget.
The law for the best you can do, given a fixed dataset size.
This is L(D), where D is the number of examples (say, tokens) in the dataset.
Once you reach a certain level of compute, these two laws contradict each other.
I’ll take some time to unpack that here, as it’s not immediately obvious the two can even be compared to one another – one is a function of compute, the other of data.
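To make the comparison more concrete, here is a minimal sketch of the two laws as power-law fits, in Python. The exponents below are roughly in the range reported in the earlier paper; the reference constants C_REF and D_REF are placeholders chosen for illustration, not the paper's actual fitted values.

    # Illustrative power-law forms of the two scaling laws.
    # Exponents are approximate; reference scales are placeholders.

    ALPHA_C = 0.050   # compute exponent (approximate)
    ALPHA_D = 0.095   # data exponent (approximate)
    C_REF = 3e8       # reference compute scale (placeholder, arbitrary units)
    D_REF = 5e13      # reference dataset size (placeholder, tokens)

    def loss_from_compute(c):
        """L(C): best loss achievable with compute budget c, spent optimally."""
        return (C_REF / c) ** ALPHA_C

    def loss_from_data(d):
        """L(D): best loss achievable when training to convergence on d examples."""
        return (D_REF / d) ** ALPHA_D

Both curves fall as their argument grows, but over different variables – which is why relating them requires a bridge from compute to data, the subject of the next section.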
2. C sets E, and E bounds D
Budget tradeoffs
Given a compute budget C, you can derive the optimal way to spend it on different things. Roughly, you are trading off between two ways to spend compute (see the sketch after this list):
Use C to buy “N”: Training a bigger model – “N” here is model size
Use C to buy “S”: Training for more steps “S” (gradient updates)
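To see the tradeoff numerically: the earlier paper uses the rough estimate that training compute is about C ≈ 6·N·B·S floating-point operations – roughly six FLOPs per parameter per processed example, counting the forward and backward passes. A minimal sketch, treating that factor of six as the rough approximation it is:

    def training_flops(n_params, batch_size, steps):
        """Rough training compute: ~6 FLOPs per parameter per processed example,
        so C is approximately 6 * N * B * S."""
        return 6 * n_params * batch_size * steps

    def max_steps(compute_budget, n_params, batch_size):
        """Largest step count S affordable at a given model size N and batch size B."""
        return compute_budget // (6 * n_params * batch_size)

Holding C fixed, doubling N roughly halves the number of steps (and hence examples) you can afford – exactly the tradeoff the L(C) law optimizes over.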
The relationship between S (steps) and D (dataset size) is a little subtle, for several reasons.
From step count to update count
For one thing, each single “step” is an update on the information from more than one data point. Specifically, a step updates on “B” different points – B is the batch size.
So the total number of data points processed during training is B times S. The papers sometimes call this quantity “E” (number of examples), so I’ll call it that too.
From update count to data count
Now, when you train an ML model, you usually update on each data point more than once. Typically, you’ll do one pass over the full dataset (updating on each point as you go along), then you’ll go back and do a second full pass, and then a third, etc. These passes are called “epochs.”
If you’re doing things this way, then for every point in the data, you get (number of epochs) updates out of it. So
E = (number of epochs) × D.
Some training routines don’t visit every point the exact same number of times – there’s nothing forcing you to do that. Still, for any training procedure, we can look at the quantity E / D.
This would be the number of epochs, if you're doing epochs. For a generic training routine, you can think of E / D as the "effective number of epochs": how many times, on average, each data point gets used for an update.
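Putting the bookkeeping together, here is a toy example with made-up numbers:

    batch_size = 512          # B: examples per gradient update
    steps = 1_000_000         # S: number of gradient updates
    dataset_size = 2.56e8     # D: number of examples in the dataset

    E = batch_size * steps                # total examples processed: 5.12e8
    effective_epochs = E / dataset_size   # E / D = 2.0: each point used about twice on average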