Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Data Wall is Important, published by JustisMills on June 10, 2024 on LessWrong.
Modern AI is trained on a huge fraction of the internet, especially at the cutting edge, with the best models trained on close to all the high quality data we've got.[1] And data is really important! You can scale up compute, you can make algorithms more efficient, or you can add infrastructure around a model to make it more useful, but on the margin, great datasets are king. And, naively, we're about to run out of fresh data to use.
It's rumored that the top firms are looking for ways to get around the data wall. One possible approach is having LLMs create their own data to train on, for which there is kinda-sorta a precedent from, e.g. modern chess AIs learning by playing games against themselves.[2] Or just finding ways to make AI dramatically more sample efficient with the data we've already got: the existence of human brains proves that this is, theoretically, possible.[3]
But all we have, right now, are rumors. I'm not even personally aware of rumors that any lab has cracked the problem: certainly, nobody has come out and said so in public! There's a lot of insinuation that the data wall is not so formidable, but no hard proof. And if the data wall is a hard blocker, it could be very hard to get AI systems much stronger than they are now.
If the data wall stands, what would we make of today's rumors? There's certainly an optimistic mood about progress coming from AI company CEOs, and a steady trickle of not-quite-leaks that exciting stuff is going on behind the scenes, and to stay tuned. But there are at least two competing explanations for all this:
1. Top companies are already using the world's smartest human minds to crack the data wall, and have all but succeeded.
2. Top companies need to keep releasing impressive stuff to keep the money flowing, so they declare, both internally and externally, that their current hurdles are surmountable.
There's lots of precedent for number two! You may have heard of startups hard coding a feature and then scrambling to actually implement it when there's interest.
And race dynamics make this even more likely: if OpenAI projects cool confidence that it's almost over the data wall, and Anthropic doesn't, then where will all the investors, customers, and high profile corporate deals go? There also could be an echo chamber effect, where one firm acting like the data wall's not a big deal makes other firms take their word for it.
I don't know what a world with a strong data wall looks like in five years. I bet it still looks pretty different than today! Just improving GPT-4 level models around the edges, giving them better tools and scaffolding, should be enough to spur massive economic activity and, in the absence of government intervention, job market changes. We can't unscramble the egg. But the "just trust the straight line on the graph" argument is ignoring that one of the determinants of that line is running out.
There's a world where the line is stronger than that particular constraint, and a new treasure trove of data appears in time. But there's also a world where it isn't, and we're near the inflection of an S-curve.
Rumors and projected confidence can't tell us which world we're in.
[1] For good analysis of this, search for the heading "The data wall" here.
[2] But don't take this parallel too far! Chess AI (or AI playing any other game) has a signal of "victory" that it can seek out - it can preferentially choose moves that systematically lead to the "my side won the game" outcome. But the core of an LLM is a text predictor: "winning" for it is correctly guessing what comes next in human-created text.
What does self-play look like there? Merely making up fake human-created text has the obvious issue of amplifying any weaknesses the AI has ...