Summary

  • Wikipedia has partnered with Kaggle, the Google-owned data science community platform, to offer a dataset of English and French Wikipedia content in a bid to deter AI developers from scraping the site.
  • The dataset is structured and machine-readable. It includes research summaries, image links and article sections, but excludes references and non-written elements such as audio files.
  • Wikimedia hopes the alternative dataset will make it easier for AI developers to access content for modelling, benchmarking and analysis, and relieve pressure on its servers from AI bots scraping data.
  • Wikimedia already has content-sharing agreements with Google and the Internet Archive, but the Kaggle partnership makes the content more accessible to smaller companies and independent data scientists.

By Jess Weatherbed
