Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Copyright Confrontation #1, published by Zvi on January 4, 2024 on LessWrong.
Lawsuits and legal issues over copyright continued to get a lot of attention this week, so I'm gathering those topics into their own post. The 'virtual #0' post is the relevant section from last week's roundup.
Four Core Claims
Who will win the case? Which of the New York Times's complaints will be convincing?
Different people have different theories of the case.
Part of that is that there are four distinct allegations NYT is throwing at the wall.
Arvind Narayanan: A thread on some misconceptions about the NYT lawsuit against OpenAI. Morality aside, the legal issues are far from clear cut. Gen AI makes an end run around copyright and IMO this can't be fully resolved by the courts alone.
As I currently understand it, NYT alleges that OpenAI engaged in 4 types of unauthorized copying of its articles:
The training dataset
The LLMs themselves encode copies in their parameters
Output of memorized articles in response to queries
Output of articles using browsing plugin
Key Claim: The Training Dataset Contains Copyrighted Material
Which, of course, it does.
The training dataset is the straightforward baseline battle royale. The main event.
The real issue is the use of NYT data for training without compensation … Unfortunately, these stand on far murkier legal ground, and several lawsuits along these lines have already been dismissed.
It is unclear how well current copyright law can deal with the labor appropriation inherent to the way generative AI is being built today. Note that *people* could always do the things gen AI does, and it was never a problem.
We have a problem now because those things are being done (1) in an automated way (2) at a billionfold greater scale (3) by companies that have vastly more power in the market than artists, writers, publishers, etc.
Bingo. That's the real issue. Can you train an LLM or other AI on other people's copyrighted data without their permission? If you do, do you owe compensation?
A lot of people are confident in very different answers to this question, both in terms of the positive questions of what the law says and what society will do, and also the normative question of what society should decide.
Daniel Jeffries, for example, is very confident that this is not how any of this works. We all learn, he points out, for free. Why should a computer system have to pay?
Do we all learn for free? We still need access to the copyrighted works. In the case of The New York Times, they impose a paywall. If you want to learn from NYT, you have to pay. Of course you can get around this in practice in various ways, but any systematic use of those workarounds would obviously not be legal, even if such use is often effectively tolerated. The price is set on the assumption that the subscription is for one person or family unit.
Why does it seem so odd to think that if an AI also wanted access, it too would need a subscription? And that the cost might well not be the same as for a person, although saying 'OpenAI must buy one (1) ongoing NYT subscription retroactive to their founding' would be a hilarious verdict?
Scale matters. Scale changes things. What is fine at small scale might not be fine at large scale. Both as a matter of practicality, and as a matter of law and its enforcement.
Many of us have, at some point, written public descriptions of a game of professional football without the express written consent of the National Football League. And yet, they tell us every game:
NFL: This telecast is copyrighted by the NFL for the private use of our audience. Any other use of this telecast or any pictures, descriptions, or accounts of the game without the NFL's consent is prohibited.
Why do they spend valuable air time on this, despite the disdain it creates? Because they do not wan...