
Fired Over Fair Use: The Bombshell AI-Training Report That Cost the Copyright Chief Her Job

Why the May 2025 report is a bombshell—and where it still misses the mark on Retrieval‑Augmented Generation (RAG)

On Saturday, May 10, 2025, President Trump abruptly fired Register of Copyrights Shira Perlmutter—an unprecedented move widely interpreted as retaliation for the Office’s newly released generative‑AI training report. The dismissal underscores just how explosive the findings are and sets the stage for a legal and political showdown over AI and copyright.

Introduction

In a 113‑page, jargon‑free tour de force, the U.S. Copyright Office finally weighs in on one of the thorniest questions in tech policy: Does training generative‑AI models on copyrighted works infringe copyright, and if so, can developers hide behind the doctrine of fair use?

Spoiler: the Office lands mostly on the side of copyright owners, an outcome that puts it at odds with the Trump‑era instinct to hold off on regulating AI. Its analysis of the issue, and of the ten thousand public comments submitted to the Office, is meticulous and, on balance, persuasive. Yet its treatment of retrieval-augmented generation (RAG), the notion that users should be able to ground a general-purpose model in their own documents, raises serious practical concerns.

The stakes are high, and the consequences are often described in existential terms. Some warn that requiring AI companies to license copyrighted works would throttle a transformative technology, because it is not practically possible to obtain licenses for the volume and diversity of content necessary to power cutting-edge systems. Others fear that unlicensed training will corrode the creative ecosystem, with artists' entire bodies of work used against their will to produce content that competes with them in the marketplace. The public interest requires striking an effective balance, allowing technological innovation to flourish while maintaining a thriving creative community.

Below is a Q&A that distills the report and offers some commentary on what it gets right, what it means for the AI ecosystem, and where its RAG analysis may need a reality check.

Q&A

Q1. What did the Copyright Office study, and why should we care?

A. The Office examined every step of modern AI training—from mass web‑scraping and dataset curation to fine‑tuning and deployment. It also considered retrieval‑augmented generation (RAG) and whether model weights themselves can infringe copyright. The stakes could not be higher: billions have been invested in models built on unlicensed data, while creators worry about their life’s work being used to produce competing content.

Q2. Does training on copyrighted works create a prima‑facie infringement?

A. Yes. The Office finds that each stage of the pipeline—scraping, storing, processing, and even distributing weights that “memorize” protectable expression—implicates the reproduction right. In short, you are copying the works, no matter how fleeting the cache.

Q3. When is that copying likely to be fair use?

A. The Office refuses a bright‑line rule; instead, it walks through the classic four‑factor fair use test, which asks whether an otherwise infringing use is nonetheless excusable.

  1. Purpose & character – Non‑commercial research that never reveals expressive text looks “strongly transformative.” Commercial systems that spit out competing content do not.

  2. Nature of the work – Most training data is highly creative; this factor usually hurts the developer.

  3. Amount & substantiality – Copying entire works cuts against fair use unless the system can reliably prevent leakage (à la Google Books snippets). The effectiveness of these guardrails, not just their existence, matters greatly; a sketch of one such guardrail follows this answer.

  4. Market effect – Courts must weigh both verbatim leaks and economic substitution. Where licensing is “reasonably available,” unlicensed use “weighs heavily against fair use.”

Bottom line: context matters. A research lab that carefully filters outputs may skate by; a commercial content mill that ignores takedown requests probably won’t.
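To make factor 3 concrete, here is a minimal sketch of the kind of output guardrail the Office seems to have in mind: an n-gram filter that flags responses reproducing long verbatim spans of indexed source text. The corpus, the 8-token window, and the helper names are illustrative assumptions, not anything the report prescribes.

```python
# Minimal sketch of a verbatim-leak guardrail (illustrative assumptions only).

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of n-token shingles in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(corpus: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Index every n-gram that appears in any protected work."""
    index: set[tuple[str, ...]] = set()
    for doc in corpus:
        index |= ngrams(doc, n)
    return index

def leaks_verbatim(output: str, index: set[tuple[str, ...]], n: int = 8) -> bool:
    """True if the model output shares a long verbatim span with the corpus."""
    return not ngrams(output, n).isdisjoint(index)

# Usage: refuse, truncate, or cite any output that trips the filter.
index = build_index(["the quick brown fox jumps over the lazy dog at dawn"])
print(leaks_verbatim("it is true that the quick brown fox jumps over the lazy dog", index))  # True
print(leaks_verbatim("a fast auburn fox leapt over a sleepy hound", index))                  # False
```

A production system would tokenize more carefully and decide what to do on a match, but even this toy version shows why effectiveness, not mere existence, is the question courts will ask.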

Q4. Which kinds of AI training are most at risk?

A. Higher risk: large‑scale pre‑training or fine‑tuning on expressive, pirated works aimed at generating similar outputs.

Lower risk: non‑expressive uses like search embeddings or tightly‑guarded academic models that block text regurgitation.

Factors that move you along this spectrum include the source of your data, the purpose of your model, and the guardrails you deploy.
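To illustrate the lower-risk, non-expressive end of that spectrum, here is a minimal semantic-search sketch: works are reduced to embedding vectors and compared by similarity, and no expressive text is ever emitted. The `embed` function is a hypothetical stand-in for whatever encoder a real system would use.

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a learned embedding model.
    A real system would call an encoder here; this deterministic,
    hash-seeded toy exists only so the sketch runs end to end."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def search(query: str, docs: list[str], k: int = 3) -> list[int]:
    """Return the indices of the k documents most similar to the query.
    Only vectors and document IDs are exposed, never the protected text."""
    q = embed(query)
    sims = [float(q @ embed(d)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)[:k]
```

Because nothing expressive ever leaves the system, this is the use the Office treats as "strongly transformative."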

Q5. What about Retrieval‑Augmented Generation (RAG)?

A. Here's where the report raises eyebrows. The Office treats RAG as requiring separate consideration from pre-training. It views RAG as "less likely to be transformative where the purpose is to generate outputs that summarize or provide abridged versions of retrieved copyrighted works, such as news articles, as opposed to hyperlinks."

Why that's a problem: RAG is increasingly how enterprises get GPT-like models to be accurate, up to date, and domain-specific. If every ingestion of a PDF manual or contract is presumptively infringing, the practical utility of AI in the workplace plummets. The Office's stance risks fixating on public-facing RAG services while failing to distinguish them adequately from private, internal uses.

Suggested middle ground: The report could have more clearly differentiated between private, intra-organizational RAG (with minimal market harm) and public-facing services that redistribute copyrighted text. By not fully acknowledging this distinction, the Office risks stifling a technique that, for many use cases, is closer to search than to publication.
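For readers unfamiliar with the mechanics, here is a bare-bones sketch of the RAG pattern the report is wrestling with: retrieve the passages most relevant to a question from a document store, then ground the model's answer in them. The `embed` and `llm` callables are assumed stand-ins (for example, the toy encoder sketched above), and the prompt format is arbitrary; none of this comes from the report itself.

```python
# Bare-bones RAG loop (illustrative assumptions throughout).
# `embed` and `llm` stand in for real models; the prompt format is arbitrary.

def retrieve(question: str, chunks: list[str], embed, k: int = 3) -> list[str]:
    """Rank stored document chunks by similarity to the question."""
    q = embed(question)
    scored = sorted(chunks, key=lambda c: float(q @ embed(c)), reverse=True)
    return scored[:k]

def answer(question: str, chunks: list[str], embed, llm) -> str:
    """Ground the model in retrieved text instead of its training data."""
    context = "\n\n".join(retrieve(question, chunks, embed))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```

Whether `chunks` holds a company's own contracts (private, internal use) or scraped news articles served to the public is precisely the distinction the Office's analysis underweights.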

Q6. Does the developer’s intent or knowledge of pirated data matter?

A. Infringement is strict liability—intent does not matter. But for fair‑use factor 1, willful or reckless use of pirated data weighs against the developer. Good‑faith acquisition and robust opt‑out mechanisms help tilt the balance back.
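As one concrete example of a "robust opt-out mechanism," a training crawler can honor robots.txt before fetching anything, which Python's standard library supports directly. The report does not mandate this particular mechanism; it is simply the most common opt-out signal today, and the user-agent string below is made up.

```python
from urllib import robotparser
from urllib.parse import urlsplit

def may_crawl(url: str, user_agent: str = "example-training-bot") -> bool:
    """Check the site's robots.txt before collecting training data.
    The user-agent string is a made-up example."""
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

# Usage: skip any page whose publisher has opted out.
# if may_crawl("https://example.com/article"): fetch_and_store(...)
```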

Q7. Do outputs need to match the training data for liability?

A. No. The Office is clear: training can infringe even if no later output copies the original, and outputs can infringe even if the training was licensed. They are separate acts.

Q8. What does this mean for developers, creators, and policymakers?

  • Developers should (1) secure lawful access to data, (2) minimize memorization, (3) block verbatim output, and (4) license where feasible.

  • Creators gain leverage: the Office views large‑scale, unlicensed training as presumptively infringing, nudging the market toward voluntary deals.

  • Policymakers face a dilemma: let the market evolve or impose statutory licenses. The Office opts for wait‑and‑see, signaling that compulsory schemes are premature.

Closing Thoughts

The report is, in many ways, a bombshell: it validates many copyright owners' concerns and puts AI companies on notice that "training" is not a magic fair-use wand. Yet its approach to RAG could benefit from further refinement to accommodate how modern AI actually works in enterprise contexts.

Perhaps most surprisingly, the report's ultimate recommendation is decidedly market-friendly: "The Office recommends allowing the licensing market to continue to develop without government intervention." This hands-off approach, which would typically align with Trump-era deregulatory instincts, makes the President's unprecedented firing of Perlmutter all the more puzzling. Rather than embracing the report's free-market solution, Trump appears to be putting his thumb on the scale, raising questions about whose interests are really being served in this high-stakes copyright battle.

Still, the Office's high-resolution tour of AI training is invaluable reading for courts, companies, and creatives alike, whether you cheer or jeer its conclusions.
