The AI copyright question has moved from theoretical to existential — and the courts are finally giving answers. The NYT v. OpenAI case survived a motion to dismiss. Getty v. Stability AI is heading to trial. Thomson Reuters v. Ross Intelligence established that copying legal headnotes for AI training isn't fair use. Concord Music v. Anthropic is testing whether AI-generated lyrics constitute copyright infringement. The "move fast and train on everything" era is over. The question now is where exactly the line falls between fair use and infringement.
Here's the emerging framework: training AI on copyrighted data isn't automatically illegal, but it isn't automatically fair use either. The courts are applying a fact-specific analysis that turns on whether the AI's output competes with the original work's market. When it does — an AI generating news articles that substitute for newspaper subscriptions, or AI-generated images that replace stock photography — courts are finding infringement. When it doesn't — an AI trained on medical literature producing diagnostic recommendations — fair use arguments have legs.
The Major Cases and Where They Stand
NYT v. OpenAI (S.D.N.Y.): the New York Times sued OpenAI and Microsoft for using millions of NYT articles to train GPT models. The court denied OpenAI's motion to dismiss, finding that the NYT plausibly alleged that ChatGPT's outputs compete with NYT content. Key ruling: the court rejected OpenAI's argument that AI training is inherently transformative. The case is in discovery as of April 2026, with trial expected in late 2026 or early 2027. This is the bellwether case — its outcome will shape the entire landscape.
Getty Images v. Stability AI (D. Del.): Getty sued Stability AI for training Stable Diffusion on 12 million Getty images without license. The case tests whether image generation AI that produces outputs competing with the plaintiff's stock photography business constitutes infringement. Getty's strongest evidence: Stable Diffusion outputs that include garbled Getty watermarks, demonstrating direct copying. Trial is expected in 2026.
Thomson Reuters v. Ross Intelligence (D. Del.): decided on summary judgment in February 2025. The court found that Ross Intelligence's use of Westlaw headnotes to train its legal AI was not fair use. The ruling focused on the market harm factor — Ross's AI competed directly with Westlaw's research services. This is the first merits ruling on AI training data copyright, and it favors rights holders.
Concord Music v. Anthropic (M.D. Tenn.): music publishers allege that Claude generates copyrighted song lyrics when prompted, which they argue proves that copyrighted lyrics were in the training data. The case tests a narrower question: even if training is arguably fair use, does the AI's ability to reproduce copyrighted content on demand constitute infringement? Settlement discussions have been reported, but there was no resolution as of April 2026.
Bartz v. Meta (N.D. Cal.): a class action by book authors alleging that Meta trained Llama models on pirated book datasets (Books3). The case targets the use of clearly pirated content in training data — the weakest fair use position for AI companies. The court certified the class in 2025.
The Fair Use Framework Applied to AI
Courts are applying the traditional four-factor fair use test (17 U.S.C. § 107), but the AI context is reshaping how each factor operates:
Factor 1 — Purpose and character of use: is AI training "transformative"? The Supreme Court's Andy Warhol Foundation v. Goldsmith (2023) decision narrowed the transformative use doctrine, and courts are applying that narrowing to AI. The emerging rule: training an AI model is not inherently transformative. The analysis turns on whether the AI's output serves a different purpose than the original work. An AI trained on news articles that generates news summaries serves the same purpose — not transformative. An AI trained on medical papers that provides diagnostic assistance may serve a different purpose — potentially transformative.
Factor 2 — Nature of the copyrighted work: creative works (novels, photographs, music) receive stronger protection than factual works (news reporting, academic papers, legal headnotes). This factor doesn't change significantly in the AI context — it's applied conventionally.
Factor 3 — Amount used: AI training typically uses entire works, which weighs against fair use. AI companies argue that no single work is "substantially" reproduced in the model's outputs, but courts have noted that the sheer scale of the copying (billions of works, each ingested in full) cuts the other way.
Factor 4 — Market effect: this is the decisive factor. When AI outputs compete with the market for the original works — AI-generated images replacing stock photography, AI-generated articles replacing journalism — courts are finding substantial market harm. When AI outputs don't compete — AI trained on medical literature providing clinical decision support — the market harm is less clear. Thomson Reuters v. Ross established that market competition between the AI's outputs and the training data source is the key inquiry.
The Piracy-Licensing Bright Line
A clear legal line is emerging that practitioners and AI companies can rely on: using pirated or unlicensed content in AI training data creates nearly indefensible copyright liability, while properly licensed training data creates strong fair use arguments.
The piracy side: the Bartz v. Meta case involves the Books3 dataset — a collection of pirated books compiled from the Bibliotik torrent tracker. Meta hasn't seriously argued that using pirated books was fair use. The case is proceeding on damages, not liability. AI companies that trained on clearly pirated datasets (and most early models did) face significant retroactive liability.
The licensing side: major AI companies have pivoted to licensing agreements. OpenAI signed deals with the Associated Press, Axel Springer, and multiple publishers. Google has licensing agreements with Reddit, Stack Overflow, and news publishers. Anthropic has been most cautious, building training processes designed to reduce copyright exposure. These licensing agreements serve as both legal protection and practical acknowledgment that the "train on everything" approach was legally unsustainable.
The middle ground: content that's publicly available but not pirated — websites, public databases, government documents, Creative Commons content — remains the most legally contested training data category. Courts haven't resolved whether web scraping for AI training constitutes fair use, and the answer likely depends on the specific content type and competitive relationship.
Practical advice for AI companies: license what you can, document your data provenance, and prepare to defend the fair use argument for content that can't be practically licensed. For rights holders: register your copyrights (statutory damages require timely registration), document market harm from AI-generated competing content, and consider opt-out mechanisms through robots.txt and AI training exclusion headers.
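For rights holders implementing the robots.txt opt-out, a minimal sketch looks like the following. The crawler tokens shown (GPTBot for OpenAI, Google-Extended for Google's AI training controls, CCBot for Common Crawl, ClaudeBot for Anthropic) are ones their operators have publicly documented, but the list changes over time, compliance is voluntary, and robots.txt carries no independent copyright-law force — verify current token names before relying on this.

```
# robots.txt — block known AI training crawlers while leaving search indexing alone.
# Token names are those published by each operator; confirm they are current.

User-agent: GPTBot            # OpenAI training crawler
Disallow: /

User-agent: Google-Extended   # Google AI training control token (not a separate crawler)
Disallow: /

User-agent: CCBot             # Common Crawl, a frequent training-data source
Disallow: /

User-agent: ClaudeBot         # Anthropic crawler
Disallow: /

# Ordinary search crawlers (e.g. Googlebot) are unaffected by the rules above.
```

A robots.txt rule signals intent and supports a documented opt-out record; rights holders seeking EU-style machine-readable reservation should pair it with TDM reservation headers as well.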
International Landscape: EU, UK, and Japan Diverge
The international approach to AI training data copyright varies dramatically, creating compliance complexity for AI companies operating globally.
EU: the AI Act and the Digital Single Market Directive create a framework where text and data mining for AI training is permitted for research purposes but requires rights-holder opt-out mechanisms for commercial use. Rights holders who implement machine-readable opt-out signals (robots.txt, TDM reservation headers) can effectively prevent their content from being used in commercial AI training. The EU approach favors rights holders with clear opt-out infrastructure.
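In practice, the machine-readable reservation signal is commonly expressed under the W3C TDM Reservation Protocol (TDMRep), a Community Group report that rights holders have adopted for Article 4 opt-outs. A minimal sketch, assuming a site at example.com — field names and the path-pattern syntax follow the TDMRep report and should be verified against the current text:

```
# Option 1: HTTP response headers on served content
tdm-reservation: 1
tdm-policy: https://example.com/tdm-policy.json   # optional pointer to licensing terms

# Option 2: a site-wide file at /.well-known/tdmrep.json
[
  {
    "location": "*",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/tdm-policy.json"
  }
]
```

Whether a given crawler honors these signals is a compliance question for the crawler's operator; under the DSM Directive, it is the express reservation itself that removes the commercial text-and-data-mining exception.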
UK: proposed legislation would have created a broad text and data mining exception for AI training, but the proposal was withdrawn after creator backlash. The UK currently lacks a specific AI training exception, meaning standard copyright law applies. Training on copyrighted works without license likely requires a fair dealing analysis that's even narrower than U.S. fair use.
Japan: the most AI-friendly jurisdiction. Japan's 2018 copyright amendment (Article 30-4 of the Copyright Act) permits the use of copyrighted works for "information analysis," including AI training, even for commercial purposes, without rights-holder consent. This makes Japan an attractive jurisdiction for AI training operations, though deploying Japanese-trained models in other jurisdictions still subjects their outputs to local copyright law.
For practitioners advising AI companies: the jurisdictional divergence means training data compliance must be analyzed by jurisdiction. A training approach that's legal in Japan may infringe in the EU. Content licensed under U.S. agreements may not satisfy EU opt-out requirements. Build jurisdiction-specific compliance protocols.
What This Means for Legal Practitioners
For IP attorneys: AI copyright is the fastest-growing area of intellectual property practice. Content creators, publishers, and media companies need counsel on rights enforcement, licensing strategy, and litigation. AI companies need counsel on training data compliance, fair use analysis, and licensing negotiations. Both sides are hiring aggressively.
For corporate counsel: if your company uses AI tools, ask your vendors about their training data provenance. Are the models trained on licensed data? Is there indemnification for copyright claims arising from AI outputs? The Thomson Reuters v. Ross ruling means using AI tools trained on infringing data creates potential downstream liability for the deployer, not just the developer.
For litigation attorneys: AI-generated content in litigation work product raises copyright questions. If Claude drafts a brief that includes language substantially similar to copyrighted material it was trained on, who's liable? Current thinking: the attorney bears responsibility for reviewing AI-generated work product, which includes checking for copyright issues in AI-drafted content. This is one more reason AI outputs require human review.
For transactional attorneys: AI licensing agreements are a new contract category with unique provisions — training data rights, output ownership, model weight licensing, fine-tuning rights, and indemnification structures that don't map to traditional software licensing. Firms building expertise in AI contract drafting are capturing significant deal flow.
The Bottom Line: The AI copyright landscape in 2026 is resolving around a market-competition standard: when AI outputs compete with the market for training data, courts are finding infringement; when they don't, fair use arguments survive. The piracy-licensing bright line is clear — license what you can, don't train on pirated content. The major cases (NYT v. OpenAI, Getty v. Stability AI) will deliver definitive rulings in late 2026 or 2027. Until then, the Thomson Reuters v. Ross precedent and the emerging market-harm framework are the best guides for practitioners.
AI-Assisted Research. This piece was researched and written with AI assistance, reviewed and edited by Manu Ayala. For deeper takes and the perspective behind the research, follow me on LinkedIn or email me directly.
