Harvard University is releasing a data set of nearly one million public domain books, aiming to promote fair competition in artificial intelligence and advance the development of AI technology. The project is led by Harvard's Institutional Data Initiative and funded by Microsoft and OpenAI. The data set ranges from classic literature to specialized academic works, giving small AI companies and individual researchers a valuable resource and narrowing the data gap between them and big tech companies. The move also suggests a new approach to sourcing AI training data, and an attempt to find a sustainable path as copyright issues grow increasingly complex.
Harvard University recently announced plans to release a data set consisting of nearly 1 million public domain books that anyone can use to train large language models and other artificial intelligence tools.
The project is led by Harvard University's newly established Institutional Data Initiative and was completed with funding from Microsoft and OpenAI. The data set consists of books scanned as part of the Google Books project, covering classic works by authors such as Shakespeare, Dickens, and Dante, as well as obscure Czech mathematics textbooks and Welsh dictionaries.
The data set is roughly five times the size of the Books3 data set and aims to level the playing field in artificial intelligence, giving the public, especially small AI companies and individual researchers, access to the kind of high-quality data that usually only big tech companies can collect. Greg Leppert said the collection was rigorously selected and its content carefully curated.
Microsoft Vice President Burton Davis emphasized that Microsoft's goal in supporting the project is to create an "accessible data pool" for startups and ensure that this data is managed in the "public interest." Tom Rubin, OpenAI’s director of intellectual property, also said the company was pleased to support the project.
As lawsuits over the use of copyrighted data in AI continue to mount, projects like Harvard's public domain data set are becoming an important source of AI training data. Although it is not yet clear exactly how the data set will be released, it is expected to give companies a large amount of high-quality data while avoiding copyright issues.
The Institutional Data Initiative's work goes beyond books: it is working with the Boston Public Library to scan millions of public domain newspaper articles and plans similar collaborations with more partners in the future. Harvard is also in talks with Google about how to distribute the data set publicly.
This project will join several similar initiatives that also promise to provide high-quality AI training materials without copyright risks. In the future, as more public domain datasets become available, AI companies will have more options to train their models while reducing copyright-related legal risks.
Harvard's move not only provides high-quality data for artificial intelligence research, but also points to a new way of addressing the copyright problems surrounding AI training data. It is expected to encourage healthy development and fair competition in the field, and if the project succeeds, it could have a significant impact on the industry as a whole.