TRAINING AI ON COPYRIGHTED WORK: WHAT DOES THE US COPYRIGHT OFFICE HAVE TO SAY? (12.05.25)

Authored by Ms. Vanshika Jain

The increasing utilisation of AI systems capable of producing text, imagery, audio, and video has reignited longstanding debates about copyright law’s ability to keep pace with technological innovation. In response to the challenges AI poses to the intellectual property principles of authorship and ownership, the U.S. Copyright Office released a pre‑publication report. Drawing on over 10,000 public comments and congressional inquiries, the report frames a complex debate over when AI developers must seek authorization, how fair use applies, and what licensing mechanisms might emerge.

 

THE TECHNICAL AND COPYRIGHT FOUNDATION

At its core, modern AI training entails the systematic ingestion and processing of vast “corpora” of data ranging from public‑domain texts and user‑contributed content to subscription‑only articles and proprietary visual or audio works. This multi‑stage pipeline not only underpins the generative capabilities of today’s large‑scale neural networks but also implicates multiple aspects of the reproduction right under U.S. copyright law.

1. Data Collection & Curation

  • Acquisition Methods: Developers often rely on automated web crawlers to harvest publicly accessible material, license bulk collections from publishers, or ingest millions of user‑generated posts via APIs. Each act of copying—even if for analysis—constitutes reproduction, since the work is fixed in a new medium or storage device.
  • Filtering and Deduplication: Proprietary filtering algorithms remove low‑quality, corrupted, or duplicate files to optimize training efficiency. However, even “temporary” duplicates stored in cache during preprocessing can trigger infringement concerns under the Transient Copying Doctrine (a simplified sketch of this step follows this list).
  • Metadata Tagging and Weighting: Data is often annotated with contextual metadata—author, date, genre—and re‑weighted to prioritize under‑represented domains (e.g., non‑English texts), further embedding subjective editorial decisions into the curated dataset.
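To make the intermediate copying concrete, here is a minimal Python sketch of a filtering‑and‑deduplication pass. It is an illustration only, assuming a simple directory of text files: the length threshold, hashing choice, and helper names are placeholders rather than any developer’s actual pipeline.

```python
# Minimal sketch of dataset curation: quality filtering plus hash-based
# deduplication. All names and thresholds are illustrative assumptions.
import hashlib
from pathlib import Path

MIN_LENGTH = 200  # assumed quality threshold: discard very short documents


def curate(raw_dir: str) -> list[str]:
    """Return filtered, de-duplicated documents found under raw_dir."""
    seen_hashes: set[str] = set()
    curated: list[str] = []
    for path in Path(raw_dir).glob("**/*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")  # reproduction #1: loaded into memory
        if len(text) < MIN_LENGTH:            # quality filter drops low-value files
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:             # deduplication keeps one copy of identical works
            continue
        seen_hashes.add(digest)
        curated.append(text)                  # reproduction #2: the retained, curated set
    return curated
```

Even this toy pass copies every work at least once into memory and retains the survivors in a new collection, which is exactly the kind of intermediate reproduction the report asks stakeholders to account for.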


2. Model Training

  • Parameter Adjustment (“Weights”): Training involves iteratively adjusting millions or billions of internal parameters so that the model’s mathematical representation encodes statistical relationships rather than verbatim text. Yet each training cycle entails transient reproductions: as input tokens or image patches are loaded into memory, they are “fixed” momentarily in the model’s buffer, satisfying the “copy” element of infringement.
  • Checkpointing and Fine‑Tuning: Intermediate models (“checkpoints”) are often saved to disk for later fine‑tuning on specialized sub‑corpora. These snapshots, if retained indefinitely, represent additional copies of the underlying training material and may require clearance (a simplified training‑loop sketch follows this list).
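A minimal training loop, sketched here with the widely used PyTorch library, shows where these transient and persistent copies arise. The toy model, random tensors, and checkpoint file names are assumptions for illustration, not the report’s example or any real system.

```python
# Illustrative training loop: batches pass transiently through memory,
# while checkpoints of the learned weights are written to disk.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(128, 128)                      # stand-in for a large generative model
optimizer = torch.optim.Adam(model.parameters())
dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 128))
loader = DataLoader(dataset, batch_size=32)

for epoch in range(3):
    for inputs, targets in loader:               # each batch is a transient in-memory copy of training data
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), targets)
        loss.backward()                          # gradients computed from the buffered batch
        optimizer.step()                         # weights updated: statistical traces, not verbatim text
    # persistent artifact: a checkpoint of the model weights saved for later fine-tuning
    torch.save(model.state_dict(), f"checkpoint_epoch_{epoch}.pt")
```

The checkpoint itself stores parameters rather than the works, but the concern noted above is that such snapshots, and any cached training data retained alongside them for fine‑tuning, can persist indefinitely.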

3. Retrieval‑Augmented Generation (RAG)

  • Hybrid Architecture: RAG systems combine a generative backbone with an external indexed database that can be queried at inference time. When prompted, the model retrieves and incorporates verbatim passages from the index, blending generation with exact replication.
  • Indexing Rights: Creating and maintaining a searchable index of copyrighted texts effectively duplicates the works and may exceed the scope of a fair use defense if used for commercial outputs that mirror the retrieved content.
  • Attribution and Transparency: Some architectures log provenance metadata tracking which source documents contributed to each output, enabling authorship attribution; yet the very act of logging implies storage of copies for future reference (a simplified retrieval sketch follows this list).
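The retrieval half of such a system can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the IndexedDoc record, the naive word‑overlap ranking, and the omitted generation step stand in for a real vector index and language model, but they show how verbatim passages and their provenance persist inside the index.

```python
# Illustrative retrieval step of a RAG pipeline with provenance logging.
from dataclasses import dataclass


@dataclass
class IndexedDoc:
    doc_id: str
    source: str   # provenance metadata recording where the passage came from
    text: str     # verbatim copy of the source passage held in the index


def retrieve(query: str, index: list[IndexedDoc], k: int = 2) -> list[IndexedDoc]:
    """Rank indexed passages by naive word overlap with the query."""
    q_words = set(query.lower().split())
    return sorted(index,
                  key=lambda d: len(q_words & set(d.text.lower().split())),
                  reverse=True)[:k]


def answer(query: str, index: list[IndexedDoc]) -> tuple[str, list[str]]:
    hits = retrieve(query, index)
    context = "\n".join(h.text for h in hits)    # exact passages handed to the generator
    provenance = [h.source for h in hits]        # logged sources imply stored copies
    # the generation step is omitted; a real system would condition a model on `context`
    return f"[model output conditioned on]:\n{context}", provenance
```

Both the searchable index and the provenance log hold copies of the underlying texts, which is why indexing rights are flagged above as a separate clearance question.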

Each of these technical steps intersects with the exclusive right of reproduction codified at 17 U.S.C. § 106. Whether developers must clear rights depends on whether these acts qualify for an exception, most prominently fair use, or whether a licensing regime (voluntary or statutory) is necessary to legitimize large‑scale AI training. The U.S. Copyright Office report thus urges stakeholders to examine not only the end‑product (the model’s outputs) but also the intermediate copying that undergirds generative AI.

 

POINTS OF POTENTIAL INFRINGEMENT

The pre‑publication draft is a trove of rigorous analysis. It translates machine‑learning jargon (neural‑net “weights,” token‑by‑token generation, retrieval‑augmented architectures) into legal concepts, mapping each phase of AI training onto the reproduction right.

The report quickly captures attention by spotlighting three “critical touchpoints” in AI development that may trigger infringement claims:

  • The mass copying of training data,
  • Transient reproductions during model optimization, and
  • Model outputs that echo specific copyrighted passages.

FAIR USE AS THE DEFAULT US FRAMEWORK

The report weighs the four fair‑use factors with surgical precision, examining:

  1. Purpose and Transformativeness. Does ingesting copyrighted text as statistical data create something “new,” or merely replicate the original author’s labor?
  2. Nature of the Source Material. How should courts balance the higher creative value of novels and films against the empirical needs of AI research?
  3. Amount and Necessity. Is wholesale copying of entire works “reasonable” when partial or abstracted datasets might suffice?
  4. Market Harm. Will AI‑powered substitutes undermine licensing revenues or spawn new licensing markets that enrich creators?

By unpacking each factor with real‑world vignettes, from a language model trained on bestselling novels to an image generator fed millions of professional photographs, the report anticipates how courts may navigate these disputes.

 

LICENSING AS A WAY FORWARD

To mitigate fair use’s unpredictability, the report outlines three complementary licensing pathways:

  • Voluntary Collective Licensing: Rights holders form consortia that grant blanket permissions under uniform terms. This “one‑stop shop” reduces individual negotiations and pools revenues, but only covers participating creators and requires robust governance to allocate fees equitably.
  • Statutory Compulsory Licenses: Legislation mandates AI‑training rights in exchange for preset royalties, with an opt‑out registry for dissenting authors. This model delivers legal certainty and includes orphan works, but must align with international treaty obligations and entails significant administrative overhead.
  • Extended Collective Licensing (ECL): Building on voluntary agreements, ECL automatically extends deals to all works in defined categories unless specifically excluded. It combines wide coverage with negotiated terms but may face resistance from creators wary of reduced individual control and poses cross‑border enforcement challenges.

Each approach balances inclusivity, certainty, and administrative feasibility differently—suggesting that a hybrid mix of collective market mechanisms and targeted statutory measures may ultimately best serve both AI innovators and content creators.

 

COMPARATIVE PERSPECTIVE: USA V EU

In the United States, the Copyright Office relies chiefly on an expansive, case‑by‑case fair use framework, emphasizing a nuanced balancing of the four statutory factors to accommodate both cutting‑edge AI innovation and the rights of authors. By contrast, the European Union generally treats the use of copyrighted works for AI training under the Text and Data Mining (TDM) exceptions in the 2019 DSM Directive (Articles 3–4), which permit computational analysis of lawfully accessed works, with a mandatory exception for scientific research and a broader exception subject to an opt‑out by rights holders. Although these TDM exceptions broadly cover AI training, stakeholders have criticized their limits, particularly the research‑oriented carve‑out and the EU AI Act’s failure to explicitly address generative AI’s commercial scale. The EU approach offers greater legal certainty through a statutory exception but may constrain commercial deployment without further legislative refinement, whereas the U.S. approach allows more flexibility (and uncertainty) via fair use determinations.

IMPLICATIONS FOR INDIA 

As India advances its digital infrastructure and AI capabilities, the lessons from both the U.S. and EU models provide valuable direction. The U.S. approach, anchored in flexible, case-by-case fair use assessments, offers room for judicial discretion but lacks the legal certainty developers and rights holders may desire. On the other hand, the EU’s Text and Data Mining (TDM) exception provides a statutory foundation with clear opt-out provisions, though it is largely research-focused and may not fully address commercial AI use cases. In India, where copyright remains governed by the Copyright Act, 1957, the absence of specific provisions on data mining or AI training highlights a regulatory gap. Moving forward, India could explore a hybrid framework: one that permits AI training under a statutory exception for lawful, non-commercial uses, while also enabling voluntary collective licensing models for commercial applications. Such a model would provide clarity, protect the rights of Indian creators, and encourage innovation by AI developers, especially within India’s emerging digital economy and startup ecosystem.