Harvard Releases Large Free AI Training Dataset Funded by OpenAI and Microsoft


In addition to the collection of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from various newspapers now in the public domain, and says it is open to similar collaborations. Exactly how the books dataset will be released has not been settled. The Institutional Data Initiative has asked Google to work together on public distribution, but the details are still being discussed. In a statement, Kent Walker, Google’s president of global affairs, said the company was “proud to support” the project.

However the IDI dataset is ultimately released, it will join a number of similar projects, startups, and initiatives that promise to give companies access to substantial, high-quality AI training materials without the risk of copyright infringement. Companies like Calliope Networks and ProRata have emerged to issue licenses and design payment schemes meant to compensate creators and copyright holders for providing AI training data.

There are also some new public-domain projects. Last year, the French AI startup Pleias rolled out its own public-domain dataset, the Common Corpus, which contains about 3 to 4 million books and periodical collections, according to project director Pierre-Carl Langlais. Supported by the French Ministry of Culture, the Common Corpus has been downloaded more than 60,000 times this month alone on the open source AI platform Hugging Face. Last week, Pleias announced that it was releasing its first language models trained on the dataset, which Langlais told WIRED were the first models “trained exclusively on open data and compliant with the (EU) AI Act.”

Efforts are underway to create similar image datasets. An AI startup released its own this summer, called Source.Plus, which contains public-domain images from Wikimedia Commons as well as various museums and archives. Several major cultural institutions, such as the Metropolitan Museum of Art in New York, have long made their archives accessible to the public as standalone projects.

Ed Newton-Rex, a former executive at Stability AI who now runs a nonprofit that certifies ethically trained AI tools, says that the rise of these datasets shows there is no need to steal copyrighted materials to create sophisticated, high-performing AI models. OpenAI has previously told lawmakers in the United Kingdom that it would be “impossible” to create products like ChatGPT without using copyrighted works. Large public-domain datasets like this, Newton-Rex says, also undermine the “necessity” argument some AI companies use to justify training their models on copyrighted works.

But he still doubts whether IDI and projects like it will really change how AI is trained. He says these datasets will have a positive effect only if they are used, perhaps in conjunction with other licensed data, in place of scraped copyrighted work. If they are simply added to the mix, as one part of a dataset that also includes the unlicensed work of the world's creators, they will mainly benefit AI companies.

Updated 12/12/24 11:18am ET: This article has been updated with comments from Google.


