Data Commons: The Missing Infrastructure for Public Interest Artificial Intelligence


Article by Stefaan Verhulst, Burton Davis and Andrew Schroeder: “Artificial intelligence is celebrated as the defining technology of our time. From ChatGPT to Copilot and beyond, generative AI systems are reshaping how we work, learn, and govern. But behind the headline-grabbing breakthroughs lies a fundamental problem: The data these systems depend on to produce useful results that serve the public interest is increasingly out of reach.

Without access to diverse, high-quality datasets, AI models risk reinforcing bias, deepening inequality, and returning less accurate, more imprecise results. Yet, access to data remains fragmented, siloed, and increasingly enclosed. What was once open—government records, scientific research, public media—is now locked away by proprietary terms, outdated policies, or simple neglect. We are entering a data winter just as AI’s influence over public life is heating up.

This isn’t just a technical glitch. It’s a structural failure. What we urgently need is new infrastructure: data commons.

A data commons is a shared pool of data resources—responsibly governed, managed using participatory approaches, and made available for reuse in the public interest. Done correctly, commons can ensure that communities and other networks have a say in how their data is used, that public interest organizations can access the data they need, and that the benefits of AI can be applied to meet societal challenges.

Commons offer a practical response to the paradox of data scarcity amid abundance. By pooling datasets across organizations—governments, universities, libraries, and more—they match data supply with real-world demand, making it easier to build AI that responds to public needs.

We’re already seeing early signs of what this future might look like. Projects like Common Corpus, MLCommons, and Harvard’s Institutional Data Initiative show how diverse institutions can collaborate to make data both accessible and accountable. These initiatives emphasize open standards, participatory governance, and responsible reuse. They challenge the idea that data must be either locked up or left unprotected, offering a third way rooted in shared value and public purpose.

But the pace of progress isn’t matching the urgency of the moment. While policymakers debate AI regulation, they often ignore the infrastructure that makes public interest applications possible in the first place. Without better access to high-quality, responsibly governed data, AI for the common good will remain more aspiration than reality.

That’s why we’re launching The New Commons Challenge—a call to action for universities, libraries, civil society, and technologists to build data ecosystems that fuel public-interest AI…(More)”.

Entering the Vortex


Essay by Nils Gilman: “A strange and unsettling weather pattern is forming over the landscape of scholarly research. For decades, the climate of academic inquiry was shaped by a prevailing high-pressure system, a consensus grounded in the vision articulated by Vannevar Bush in “Science: The Endless Frontier” (1945). That era was characterized by robust federal investment, a faith in the university as the engine of basic research, and a compact that traded public funding for scientific autonomy and the promise of long-term societal benefit. It was a climate conducive to the slow, deliberate, and often unpredictable growth of knowledge, nurtured by a diverse ecosystem of human researchers — the vital “seed stock” of intellectual discovery.

But that high-pressure system is collapsing. A brutal, unyielding cold front of academic defunding has swept across the nation, a consequence of shifting political priorities, populist resentment, and a calculated assault on the university as an institution perceived as hostile to certain political agendas. This is not merely a belt-tightening exercise; it is, for all intents and purposes, the dismantling of Vannevar Bush’s Compact, the end of the era of “big government”-funded Wissenschaft. Funding streams for basic research are dwindling, grant applications face increasingly long odds, and the financial precarity of academic careers deters the brightest minds. The human capital necessary for sustained, fundamental inquiry is beginning to wither.

Simultaneously, a warm, moisture-laden airmass is rapidly advancing: the astonishing rise of AI-based research tools. Powered by vast datasets and sophisticated algorithms, these tools promise to revolutionize every stage of the research process – from literature review and data analysis to hypothesis generation and the drafting of scholarly texts. As a recent New Yorker piece on AI and the humanities suggests, these AI engines can already generate deep research and coherent texts on virtually any subject, seemingly within moments. They offer the prospect of unprecedented efficiency, speed, and scale in the production of scholarly output.

The collision of these two epochal weather systems — the brutal cold front of academic defunding and the warm, expansive airmass of AI-based research tools — is creating an atmospheric instability unlike anything the world of scholarship has ever witnessed. Along the front where these forces meet, a series of powerful and unpredictable tornados are beginning to touch down, reshaping the terrain of knowledge production in real-time…(More)”.

Our new AI strategy puts Wikipedia’s humans first


Blog by Chris Albon and Leila Zia: “Not too long ago, we were asked when we’re going to replace Wikipedia’s human-curated knowledge with AI. 

The answer? We’re not.

The community of volunteers behind Wikipedia is the most important and unique element of Wikipedia’s success. For nearly 25 years, Wikipedia editors have researched, deliberated, discussed, built consensus, and collaboratively written the largest encyclopedia humankind has ever seen. Their care and commitment to reliable encyclopedic knowledge is something AI cannot replace. 

That is why our new AI strategy doubles down on the volunteers behind Wikipedia.

We will use AI to build features that remove technical barriers to allow the humans at the core of Wikipedia to spend their valuable time on what they want to accomplish, and not on how to technically achieve it. Our investments will be focused on specific areas where generative AI excels, all in the service of creating unique opportunities that will boost Wikipedia’s volunteers: 

  • Supporting Wikipedia’s moderators and patrollers with AI-assisted workflows that automate tedious tasks in support of knowledge integrity; 
  • Giving Wikipedia’s editors time back by improving the discoverability of information on Wikipedia to leave more time for human deliberation, judgment, and consensus building; 
  • Helping editors share local perspectives or context by automating the translation and adaptation of common topics;
  • Scaling the onboarding of new Wikipedia volunteers with guided mentorship. 

You can read the Wikimedia Foundation’s new AI strategy over on Meta-Wiki…(More)”.

Real-time prices, real results: comparing crowdsourcing, AI, and traditional data collection


Article by Julius Adewopo, Bo Andree, Zacharey Carmichael, Steve Penson, Kamwoo Lee: “Timely, high-quality food price data is essential for shock responsive decision-making. However, in many low- and middle-income countries, such data is often delayed, limited in geographic coverage, or unavailable due to operational constraints. Traditional price monitoring, which relies on structured surveys conducted by trained enumerators, is often constrained by challenges related to cost, frequency, and reach.

To help overcome these limitations, the World Bank launched the Real-Time Prices (RTP) data platform. This effort provides monthly price data using a machine learning framework. The models combine survey results with predictions derived from observations in nearby markets and related commodities. This approach helps fill gaps in local price data across a basket of goods, enabling real-time monitoring of inflation dynamics even when survey data is incomplete or irregular.

In parallel, new approaches—such as citizen-submitted (crowdsourced) data—are being explored to complement conventional data collection methods. These crowdsourced data were recently published in a Nature Scientific Data paper. While the adoption of these innovations is accelerating, maintaining trust requires rigorous validation.

A newly published study in PLOS compares the two emerging methods with the traditional, enumerator-led gold standard, providing new evidence that both crowdsourced and AI-imputed prices can serve as credible, timely alternatives to traditional ground-truth data collection—especially in contexts where conventional methods face limitations…(More)”.

These Startups Are Building Advanced AI Models Without Data Centers


Article by Will Knight: “Researchers have trained a new kind of large language model (LLM) using GPUs dotted across the world and fed private as well as public data—a move that suggests that the dominant way of building artificial intelligence could be disrupted.

Flower AI and Vana, two startups pursuing unconventional approaches to building AI, worked together to create the new model, called Collective-1.

Flower created techniques that allow training to be spread across hundreds of computers connected over the internet. The company’s technology is already used by some firms to train AI models without needing to pool compute resources or data. Vana provided sources of data including private messages from X, Reddit, and Telegram.

Collective-1 is small by modern standards, with 7 billion parameters—values that combine to give the model its abilities—compared to hundreds of billions for today’s most advanced models, such as those that power programs like ChatGPT, Claude, and Gemini.

Nic Lane, a computer scientist at the University of Cambridge and cofounder of Flower AI, says that the distributed approach promises to scale far beyond the size of Collective-1. Lane adds that Flower AI is partway through training a model with 30 billion parameters using conventional data, and plans to train another model with 100 billion parameters—close to the size offered by industry leaders—later this year. “It could really change the way everyone thinks about AI, so we’re chasing this pretty hard,” Lane says. He says the startup is also incorporating images and audio into training to create multimodal models.

Distributed model-building could also unsettle the power dynamics that have shaped the AI industry…(More)”.

Digital Public Infrastructure Could Make a Better Internet


Essay by Akash Kapur: “…The advent of AI has intensified geopolitical rivalries, and with them the risks of fragmentation, exclusion, and hyper-concentration that are already so prevalent. The prospects of a “Splinternet” have never appeared more real. The old dream of a global digital commons seems increasingly quaint; we are living amid what Yanis Varoufakis, the former Greek finance minister, calls “technofeudalism.”

DPI suggests it doesn’t have to be this way. The approach’s emphasis on loosening chokeholds, fostering collaboration, and reclaiming space from monopolies represents an effort to recuperate some of the internet’s original promise. At its most aspirational, DPI offers the potential for a new digital social contract: a rebalancing of public and private interests, a reorientation of the network so that it advances broad social goals even while fostering entrepreneurship and innovation. How fitting it would be if this new model were to emerge not from the entrenched powers that have so long guided the network, but from a handful of nations long confined to the periphery—now determined to take their seats at the table of global technology…(More)”.

Understanding and Addressing Misinformation About Science


Report by National Academies of Sciences, Engineering, and Medicine: “Our current information ecosystem makes it easier for misinformation about science to spread and harder for people to figure out what is scientifically accurate. Proactive solutions are needed to address misinformation about science, an issue of public concern given its potential to cause harm at individual, community, and societal levels. Improving access to high-quality scientific information can fill information voids that exist for topics of interest to people, reducing the likelihood of exposure to and uptake of misinformation about science. Misinformation is commonly perceived as a matter of bad actors maliciously misleading the public, but misinformation about science arises both intentionally and inadvertently and from a wide range of sources…(More)”.

Bad Public Policy: Malignity, Volatility and the Inherent Vices of Policymaking


Book: “Policy studies assume the existence of baseline parameters – such as honest governments doing their best to create public value, publics responding in good faith, and both parties relying on a policy-making process which aligns with the public interest. In such circumstances, policy goals are expected to be produced through mechanisms in which the public can articulate its preferences and policy-makers are expected to listen to what has been said in determining their governments’ courses of action. While these conditions are found in some governments, there is evidence from around the world that much policy-making occurs without these pre-conditions and processes. Unlike situations which produce what can be thought of as ‘good’ public policy, ‘bad’ public policy is a more common outcome. How this happens and what makes for bad public policy are the subjects of this Element…(More)”.

AI action plan database


A project by the Institute for Progress: “In January 2025, President Trump tasked the Office of Science and Technology Policy with creating an AI Action Plan to promote American AI Leadership. The government requested input from the public, and received 10,068 submissions. The database below summarizes specific recommendations from these submissions. … We used AI to extract recommendations from each submission, and to tag them with relevant information. Click on a recommendation to learn more about it. See our analysis of common themes and ideas across these recommendations…(More)”.

Updating purpose limitation for AI: a normative approach from law and philosophy 


Paper by Rainer Mühlhoff and Hannah Ruschemeier: “The purpose limitation principle goes beyond the protection of the individual data subjects: it aims to ensure transparency, fairness and its exception for privileged purposes. However, in the current reality of powerful AI models, purpose limitation is often impossible to enforce and is thus structurally undermined. This paper addresses a critical regulatory gap in EU digital legislation: the risk of secondary use of trained models and anonymised training datasets. Anonymised training data, as well as AI models trained from this data, pose the threat of being freely reused in potentially harmful contexts such as insurance risk scoring and automated job applicant screening. We propose shifting the focus of purpose limitation from data processing to AI model regulation. This approach mandates that those training AI models define the intended purpose and restrict the use of the model solely to this stated purpose…(More)”.