Zuckerberg knew that Meta used pirated library data to train AI

Author: Eve Cole | Updated: 2025-01-26 14:32:01

Meta has drawn intense controversy over the source of the training data for its latest AI model, Llama 3. According to disclosed documents, Meta used the pirated e-book site Library Genesis (LibGen) to train Llama 3, a move that raised widespread concerns about copyright and data ownership. Although Meta employees had raised concerns about the risks of using LibGen, including potential legal exposure and negative publicity, CEO Mark Zuckerberg still approved the decision, highlighting the aggressive tactics of big technology companies in the AI race and their indifference to copyright protection.

Documents recently disclosed by Meta in a copyright class action lawsuit show that the company used a pirated e-book library called Library Genesis (LibGen) to train its latest AI model, Llama 3, a revelation that has attracted widespread attention. The documents show that Meta engineers discussed the potential risks of using LibGen, a "shadow library," especially amid growing concerns about copyright and data ownership. Despite the potential fallout and the risk of negative publicity, Meta CEO Mark Zuckerberg approved the decision.


At the court's request, records of confidential internal Meta conversations about the use of the LibGen dataset were unsealed. The documents show that Meta executives made clear in discussions with the AI research team that LibGen was a dataset "we know to be pirated," yet agreed to use it to improve the performance of Llama 3. In an email, Sony Theakanath, Meta's director of product management, noted that although the decision to use LibGen carried a public-relations risk, other AI companies were using similar data, which led Meta's team to feel it was not alone in taking this path.

More worryingly, Meta staff also discussed how to process and filter LibGen text to remove copyright markers such as ISBNs and copyright notices, which suggests that Meta may have been trying to conceal its use of unauthorized content. An internal memo described the materials provided by LibGen as "high quality and long-format, making them ideal for learning particularly specialized subjects."

In addition, Meta employees noted in emails that torrenting directly from the company's IP addresses might be inappropriate, and they expressed concerns about the practice. However, with Zuckerberg "pushing from the top" to use the LibGen dataset, Meta's determination to win the AI race is plain to see. The incident has renewed attention to, and doubts about, how large technology companies handle copyright.

The outcome of this copyright lawsuit may have important implications for other similar ongoing cases, particularly those concerning the use of creative works such as images, music, and literature. As technology companies' demand for original content continues to grow, the rights of the creators of that content are likely to become a focus of attention.

This incident not only exposed Meta's irresponsible attitude toward copyright, but also prompted deeper reflection on the ethical and legal issues in AI development. Going forward, balancing technological development with intellectual property protection will be an important challenge, one that requires joint effort inside and outside the industry.