Recently, The New York Times and the Daily News filed a copyright lawsuit against OpenAI, accusing it of using their works to train AI models without authorization. Attention has since centered on the revelation that OpenAI engineers accidentally deleted data that may be crucial evidence in the case, a misstep that has drawn widespread concern. The incident not only affects the progress of the trial but also exposes the risks and ethical questions around data handling in the training of large language models. This article traces the ins and outs of the episode and considers its implications for the artificial intelligence industry.
Recently, The New York Times and the Daily News jointly sued OpenAI, accusing it of using their works to train artificial intelligence models without authorization.
The case took a new turn when the plaintiffs' legal team pointed out in a recent court filing that, while processing the relevant data, OpenAI engineers accidentally deleted material that could have an important bearing on the outcome.
OpenAI reportedly agreed this fall to provide two virtual machines so that the plaintiffs' legal teams could search its training data for copyrighted content. A virtual machine is a software-emulated computer that runs inside a host operating system, commonly used for testing, data backup, and running applications in isolation. Since November 1, counsel for The New York Times and the Daily News and their retained experts have spent more than 150 hours combing through OpenAI's training data.
However, on November 14, OpenAI engineers accidentally erased the search data stored on one of the virtual machines. According to a letter from the plaintiffs' attorneys, OpenAI attempted to recover the lost data and largely succeeded, but because the folder structure and file names were "irretrievable," the recovered data cannot be used to determine which of the plaintiffs' articles were used to train OpenAI's models.
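To make the dispute concrete, here is a minimal, hypothetical sketch of the kind of verbatim-match search the plaintiffs' experts might run over a training-data dump. The directory name, file layout, and excerpt list are invented for illustration; the actual tooling used on OpenAI's virtual machines has not been disclosed. Note how each match is tied to a file path, which is why the loss of folder structure and file names rendered the recovered data useless.

```python
# Hypothetical sketch of a verbatim-match search over a training-data dump.
# Paths, filenames, and the excerpt list are placeholders; the real tooling
# used on OpenAI's virtual machines has not been disclosed.
from pathlib import Path

# Known passages from the plaintiffs' articles (placeholder text).
KNOWN_EXCERPTS = [
    "example sentence from a New York Times article",
    "example sentence from a Daily News article",
]

def find_matches(data_dir: str) -> list[tuple[str, str]]:
    """Return (file path, excerpt) pairs where an excerpt appears verbatim."""
    hits = []
    for path in Path(data_dir).rglob("*.txt"):
        text = path.read_text(errors="ignore").lower()
        for excerpt in KNOWN_EXCERPTS:
            if excerpt.lower() in text:
                # The file path is the evidence tying an article to the
                # training set; losing folder structure and file names
                # destroys that link even if the file contents survive.
                hits.append((str(path), excerpt))
    return hits

if __name__ == "__main__":
    for path, excerpt in find_matches("training_data"):
        print(f"{path}: contains '{excerpt}'")
```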
Legal counsel for the plaintiffs noted that they have no reason to believe the deletion was intentional, but said the incident demonstrates that OpenAI is "in the best position to search its own datasets for potentially infringing content." In other words, the plaintiffs argue that OpenAI should use its own tools to locate that content, since it can do so far more efficiently.
OpenAI has maintained in this case, and in others like it, that training models on publicly available data constitutes fair use. On that view, OpenAI owes no royalties for the works it ingests, even though it profits from the resulting models.
It is worth noting that OpenAI has signed licensing agreements with a growing number of news publishers, including the Associated Press, Business Insider, and the Financial Times, though it has not disclosed the specific terms of these agreements. Content partner Dotdash reportedly receives at least US$16 million in annual compensation.
Throughout the legal dispute, OpenAI has neither confirmed nor denied using any specific copyrighted work for AI training without permission.
Highlights:
OpenAI has been accused of mistakenly deleting potentially important evidence in a copyright lawsuit.
Lawyers for the plaintiffs spent significant time and labor searching the data, work the deletion has jeopardized.
OpenAI maintains that the use of publicly available data to train its models is fair use.
This incident highlights the complexity of sourcing and copyright questions surrounding AI training data, and raises concerns about data security and evidence handling. Whether OpenAI's conduct constitutes infringement, and where the boundaries of "fair use" lie, are questions that will require further debate. The final outcome of the case is likely to have a lasting impact on the development of the artificial intelligence industry.