AI Is Tearing Wikipedia Apart

Article by Claire Woodcock: “As generative artificial intelligence continues to permeate all aspects of culture, the people who steward Wikipedia are divided on how best to proceed. 

During a recent community call, it became apparent that there is a community split over whether or not to use large language models to generate content. While some people expressed that tools like Open AI’s ChatGPT could help with generating and summarizing articles, others remained wary. 

The concern is that machine-generated content has to be balanced with a lot of human review and would overwhelm lesser-known wikis with bad content. While AI generators are useful for writing believable, human-like text, they are also prone to including erroneous information, and even citing sources and academic papers which don’t exist. This often results in text summaries which seem accurate, but on closer inspection are revealed to be completely fabricated

“The risk for Wikipedia is people could be lowering the quality by throwing in stuff that they haven’t checked,” Bruckman added. “I don’t think there’s anything wrong with using it as a first draft, but every point has to be verified.” 

The Wikimedia Foundation, the nonprofit organization behind the website, is looking into building tools to make it easier for volunteers to identify bot-generated content. Meanwhile, Wikipedia is working to draft a policy that lays out the limits to how volunteers can use large language models to create content.

The current draft policy notes that anyone unfamiliar with the risks of large language models should avoid using them to create Wikipedia content, because it can open the Wikimedia Foundation up to libel suits and copyright violations—both of which the nonprofit gets protections from but the Wikipedia volunteers do not. These large language models also contain implicit biases, which often result in content skewed against marginalized and underrepresented groups of people

The community is also divided on whether large language models should be allowed to train on Wikipedia content. While open access is a cornerstone of Wikipedia’s design principles, some worry the unrestricted scraping of internet data allows AI companies like OpenAI to exploit the open web to create closed commercial datasets for their models. This is especially a problem if the Wikipedia content itself is AI-generated, creating a feedback loop of potentially biased information, if left unchecked…(More)”.