OpenAI, currently facing lawsuits from The New York Times and Daily News for allegedly using their content without permission to train its AI models, recently encountered a significant data mishap. The plaintiffs' legal teams claim that OpenAI engineers accidentally deleted critical data that could have been relevant to the case.
Earlier this fall, OpenAI agreed to provide two virtual machines so that legal teams for The Times and the Daily News could search its AI training datasets for their copyrighted content. Virtual machines are software-based computers that run inside another computer's operating system, often used for testing, backing up data, and running applications. Since November 1, the plaintiffs' attorneys and hired experts have spent over 150 hours searching the provided data.
However, on November 14, OpenAI engineers unintentionally erased all search data stored on one of the virtual machines. Although the company recovered most of the data, the folder structure and file names were irretrievably lost. As a result, the recovered data cannot show where or how the plaintiffs' articles might have been used to build OpenAI’s models.
In a letter filed in the U.S. District Court for the Southern District of New York, the plaintiffs’ counsel expressed frustration over the incident, stating that the lost data forced them to redo an entire week’s worth of work at a significant cost in time and computational resources.
The plaintiffs acknowledged that the deletion appeared accidental but argued that the incident shows OpenAI, with its proprietary tools, is better positioned than the plaintiffs to search its own datasets.
OpenAI has declined to comment on the incident.
The broader legal case centers on whether OpenAI’s use of publicly available content, such as news articles, constitutes fair use. OpenAI asserts that training its models, like GPT-4, on publicly accessible data does not require permission or payment, even when the models are used commercially. Despite this stance, OpenAI has entered into licensing agreements with several publishers, including the Associated Press, Business Insider owner Axel Springer, and the Financial Times. The terms of these deals remain confidential, though reports suggest some involve significant payments: another partner, Dotdash Meredith, is reportedly paid $16 million annually.
OpenAI has neither confirmed nor denied using specific copyrighted works to train its models. The lawsuits highlight ongoing tensions between AI developers and content creators over the use of intellectual property in AI training.