The editor of Downcodes will help you understand the "alternative uses" of academic papers! In recent years, the sources of training data for AI models have attracted widespread attention: many academic publishers are "packaging and selling" research papers to technology companies to improve the capabilities of AI systems. These deals involve huge sums and have triggered heated discussion in the academic community about intellectual property, author rights, and the ethics of AI development. This article delves into the mechanisms, impacts, and future trends behind this phenomenon.
Have you ever considered that your research paper may have been used to train AI? Yes, many academic publishers are "packaging and selling" their content to technology companies that develop AI models. Unsurprisingly, this move has caused quite a stir in the research community, especially when the authors themselves know nothing about it. Experts say that if your paper isn't already being used by a large language model (LLM), there's a good chance it will be in the near future.
Recently, the British academic publisher Taylor & Francis reached a US$10 million deal with Microsoft, allowing the technology giant to use its research content to improve the capabilities of its AI systems. As early as June, the American publisher Wiley disclosed that it had earned US$23 million from a deal allowing a company to use its content to train generative AI models.
If a paper is available online, whether open access or behind a paywall, it's likely that it has been fed into some large language model. "Once a paper is used to train a model, it cannot be removed after the model is trained," said Lucy Lu Wang, an AI researcher at the University of Washington.
Large language models require vast amounts of training data, much of which is scraped from the Internet. By analyzing billions of snippets of language, these models learn to generate fluent text. Academic papers are a particularly valuable "treasure trove" for LLM developers because of their high information density and length; such data helps models reason better about scientific questions.
The trend of purchasing high-quality data sets is on the rise, and many well-known media outlets and platforms have begun cooperating with AI developers to sell their content. Given that, without such agreements, many works would simply be scraped silently, this kind of cooperation is only likely to become more common.
Some AI developers, such as the Large-scale Artificial Intelligence Open Network (LAION), choose to keep their data sets open, but many companies developing generative AI keep their training data secret, so little is known about what went into it. Experts believe that open platforms such as arXiv and databases such as PubMed are popular targets for AI companies to crawl.
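To illustrate how readily such open repositories can be harvested, here is a minimal sketch that pulls paper metadata and abstracts from arXiv's public query API; the search term is an arbitrary example, and a real crawler would of course operate at far larger scale.

```python
# Minimal sketch: arXiv exposes a public Atom-feed API, so titles and
# abstracts can be fetched programmatically with the standard library alone.
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the feed
url = (
    "http://export.arxiv.org/api/query"
    "?search_query=all:transformer&start=0&max_results=5"
)

with urllib.request.urlopen(url) as response:
    feed = ET.fromstring(response.read())

# Each <entry> carries a title and a full abstract: ready-made training text.
for entry in feed.iter(f"{ATOM}entry"):
    title = entry.find(f"{ATOM}title").text.strip()
    abstract = entry.find(f"{ATOM}summary").text.strip()
    print(title, "->", abstract[:80], "...")
```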
Proving that a given paper appears in the training set of a particular LLM is not simple. Researchers can prompt a model with unusual sentences from the paper and test whether its output reproduces the original text; a match is strong evidence of use, but the absence of one does not prove the paper was left out, because developers can tune models to avoid outputting training data verbatim.
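As a concrete illustration of this probing idea, here is a minimal sketch using the open GPT-2 model via the Hugging Face transformers library. The prompt and expected continuation are hypothetical placeholders standing in for a distinctive sentence from a paper; in practice, researchers use more rigorous membership-inference methods (see the second reference below).

```python
# Minimal verbatim-memorization probe, assuming access to an open model.
# GPT-2 is used purely for illustration; the sample sentence is hypothetical.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def completion_matches(prefix: str, expected: str, max_new_tokens: int = 20) -> bool:
    """Greedy-decode a continuation of `prefix` and check whether the model
    reproduces the paper's actual next words."""
    inputs = tokenizer(prefix, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # deterministic decoding, so a match is meaningful
            pad_token_id=tokenizer.eos_token_id,
        )
    continuation = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
    return continuation.strip().startswith(expected.strip())

# Hypothetical distinctive sentence, split into a prompt and the words that
# should follow it in the original paper.
prefix = "We introduce a novel catalytic pathway in which"
expected = "the ruthenium complex mediates"
print(completion_matches(prefix, expected))
```

A positive result here suggests memorization, but as noted above, a negative result is inconclusive: models can be adjusted to suppress verbatim regurgitation even when the text was in the training set.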
Even if it could be proven that an LLM used a specific text, what then? Publishers argue that unauthorized use of copyrighted text constitutes infringement, but others counter that an LLM does not copy the text; rather, it generates new text by analyzing the information it contains.
There is currently a copyright lawsuit underway in the United States that could become a landmark case. The New York Times is suing Microsoft and ChatGPT developer OpenAI, accusing them of using its news content to train models without permission.
Many scholars welcome the inclusion of their work in LLM training data, especially when these models can improve the accuracy of research. However, not every researcher is taking this in stride, and many feel their jobs are threatened.
In general, individual authors currently have little say in publishers' sales decisions, and there are no clear mechanisms for allocating credit or deciding whether a published article may be used. Some researchers have expressed frustration: "We would like the help of AI models, but we also want a fair mechanism, and we have not yet found one."
References:
https://www.nature.com/articles/d41586-024-02599-9
https://arxiv.org/pdf/2112.03570
The future direction of AI and academic publishing remains unclear: copyright rules, data privacy, and mechanisms for protecting authors' rights and interests all need further development. This is not merely a contest between publishers and technology companies; it is a question that bears on the sustainable development of academic research and the ethics of AI technology, and it demands the attention and effort of society as a whole.