In a legal dispute highlighting the contentious relationship between AI companies and traditional media, OpenAI has been accused of erasing key evidence, reportedly by accident, in a copyright infringement lawsuit brought by The New York Times and Daily News. The lawsuit claims that OpenAI unlawfully used copyrighted articles from these publications, without permission, to train the AI models behind products such as ChatGPT. The case, filed in late 2023, has taken a dramatic turn, underscoring the growing tension between artificial intelligence innovation and intellectual property rights.
The Alleged Data Erasure Incident
As part of the discovery phase of the lawsuit, OpenAI had agreed to grant The New York Times access to its AI training datasets. This process involved setting up virtual machines where the plaintiffs' legal team could search for evidence of their copyrighted content. Beginning November 1, lawyers and technical experts spent over 150 hours combing through the training data.
However, on November 14, The New York Times alleged that search data from one of the virtual machines had been deleted by OpenAI engineers. Despite efforts to recover the lost data, the restored files were missing critical information such as file names and folder structures. According to the plaintiffs' legal team, this rendered the recovered data practically useless for determining how The New York Times' articles might have been incorporated into OpenAI's models.
In a letter filed with a U.S. district court on November 20, the plaintiffs' lawyers stated, “News plaintiffs have been forced to recreate their work from scratch using significant person-hours and computer processing time.” They added that while they believed the deletion was unintentional, OpenAI was in the best position to search its own datasets for evidence of copyright infringement.
OpenAI’s Response
OpenAI has pushed back against the allegations, emphasizing that the incident was a technical error rather than a deliberate act. In a statement, OpenAI spokesperson Jason Deutrom said, “We disagree with the characterizations made and will file our response soon.” Internal emails submitted to the court describe the issue as a “glitch” rather than misconduct.
The Stakes of the Lawsuit
This lawsuit is among several legal challenges OpenAI and other AI companies face over the use of copyrighted materials in training datasets. The New York Times alleges that OpenAI’s practice of training AI models on publicly available content—including its articles—constitutes copyright infringement. OpenAI, on the other hand, argues that such use falls under the legal principle of “fair use.”
The outcome of this case could set a significant legal precedent for the AI industry, determining how copyrighted material can be utilized in training generative AI models. With the proliferation of AI-powered tools, including ChatGPT, these cases carry profound implications for the boundaries of intellectual property law in the digital age.
A Growing Tension Between Media and AI
This is not the first instance of friction between publishers and AI companies. Daily News joined The New York Times as a plaintiff, and lawsuits from other publishers and authors echo similar concerns. These legal battles come at a time when AI companies like OpenAI and Google are facing mounting pressure to clarify how their models are trained.
To mitigate these challenges, OpenAI has begun striking content licensing deals with major publishers, including Reuters, Financial Times, and Axel Springer (the parent company of Business Insider and Politico). These agreements allow AI companies to legally access copyrighted content for training purposes, offering a potential blueprint for resolving disputes with other publishers.
Challenges in Discovery and Evidence Collection
The lawsuit’s discovery phase has been particularly contentious. OpenAI’s training data has never been fully disclosed to the public, making it a sensitive and critical element of the case. The company created a “sandbox” of virtual machines to provide limited access to its datasets for The New York Times. However, technical issues have plagued the process, with the plaintiffs alleging that “severe and repeated technical issues” have hindered their ability to search the data effectively.
In a recent filing, The New York Times called on OpenAI to take greater responsibility for examining its own datasets. The plaintiffs also requested additional evidence, including internal Slack messages, text messages, and emails from OpenAI executives and former employees, including Brad Lightcap and Ilya Sutskever.
Microsoft’s Role in the Lawsuit
Microsoft, a major investor in OpenAI, has also been implicated in the case. The New York Times has asked the court to compel Microsoft and OpenAI to provide further documentation, including communications and materials related to their use of generative AI. In turn, Microsoft has demanded documents from The New York Times detailing its own use of AI technologies, potentially as a defense strategy.
What Lies Ahead
The case underscores the broader challenges faced by the AI industry as it navigates the legal and ethical complexities of using publicly available data. While OpenAI has taken steps to address these concerns through licensing deals, the lawsuit demonstrates the potential risks of operating in a legal gray area.
For publishers like The New York Times, the stakes are equally high. Beyond the immediate question of copyright infringement, the case raises broader questions about the value of journalism in an era where AI tools can generate text with unprecedented sophistication.
As the lawsuit moves forward, its outcome could have far-reaching implications for the future of AI development and intellectual property law. Whether OpenAI’s use of copyrighted material is ultimately deemed fair use or a violation of copyright, the case is likely to shape the legal landscape for years to come.
Sources:
https://www.wired.com/story/new-york-times-openai-erased-potential-lawsuit-evidence/
https://www.ccn.com/news/technology/openai-accused-accidentally-erasing-evidence/