Meta is facing a closely watched copyright infringement lawsuit in which the plaintiffs accuse CEO Mark Zuckerberg of personally approving the use of pirated e-books and articles to train the company's Llama AI models. The case has put Meta in the spotlight and drawn industry-wide attention to how technology giants handle copyrighted material when training AI models. It is one of many lawsuits accusing tech companies of training AI models on copyrighted works without authorization.
According to the latest filings submitted to the U.S. District Court for the Northern District of California, the plaintiffs cite testimony Meta gave late last year, which states explicitly that Zuckerberg approved the use of a dataset called LibGen to train the Llama models. LibGen, which describes itself as a "link aggregator", provides access to a large number of copyrighted academic publications. Although the site has been sued and ordered to shut down for copyright infringement, it continues to offer works from major publishers such as Cengage Learning and McGraw Hill, and the plaintiffs allege that Meta drew on this material for training.
The filing further states that Meta employees internally described LibGen as a "known pirated data set" and recognized that using it could weaken the company's negotiating position with regulators. More striking still, Meta engineer Nikolay Bashlykov is accused of writing a script specifically to strip copyright information from LibGen e-books, including words such as "copyright" and "acknowledgement". Meta is also accused of removing copyright markers and source metadata from scientific journal articles. The plaintiffs allege that both steps were meant to conceal the infringement.
The most controversial allegation in the case is that Meta downloaded LibGen content via torrenting and thereby helped distribute the pirated documents. Torrenting is a peer-to-peer method of distributing files in which users upload the content to others while they download it. The plaintiffs' lawyers argue that, by taking part in torrenting, Meta effectively committed a further form of copyright infringement. Although Meta engineers voiced reservations, believing the practice to be illegal, Meta went ahead with the support of Ahmad Al-Dahle, its head of generative AI.
The allegations are consistent with a New York Times report from last April suggesting that Meta had cut corners when collecting data for AI training. Meta reportedly hired contractors in Africa to compile summaries of books and considered acquiring the publisher Simon & Schuster. Meta executives, however, concluded that negotiating copyright licenses would take too long, and the fair use doctrine became their main defense, an attitude that has raised doubts about the business ethics of technology companies.
The case has not yet concluded, and it concerns only Meta's earliest Llama models. Although courts dismissed several AI-related copyright lawsuits in 2023 on the grounds that the plaintiffs had failed to prove infringement, the allegations in this case may still have a significant impact on Meta. In an order issued on Wednesday, presiding judge Vince Chhabria rejected Meta's request to redact large portions of the filing, saying the request was clearly intended to avoid negative publicity rather than to protect sensitive business information. The ruling was a significant setback for Meta.
The case not only poses a serious challenge to Meta but has also triggered wide discussion of how technology companies may use copyrighted works to train AI models. On the question of where fair use ends and copyright protection begins in particular, it may become an important reference point for similar cases in the future. As AI technology develops rapidly, striking a balance between innovation and copyright protection will be a major issue for technology companies and the legal community alike.