The Overlooked Importance of Data Reuse in AI Infrastructure


Essay by Oxford Insights and The Data Tank: “Employing data stewards and embedding responsible data reuse principles in the programme or ecosystem and within participating organisations is one of the pathways forward. Data stewards are proactive agents responsible for catalysing collaboration, tackling these challenges and embedding data reuse practices in their organisations. 

The role of Chief Data Officer has become more common in government agencies in recent years, and we suggest the same needs to happen with the role of the Chief Data Steward. Chief Data Officers are mostly focused on internal data management and have a technical orientation. With the changes in the data governance landscape, this profession needs to be reimagined and iterated upon. Embedded in both the demand and the supply sides of data, data stewards are proactive agents empowered to create public value by reusing data and data expertise. They are tasked with identifying opportunities for productive cross-sectoral collaboration, and with proactively requesting or enabling functional access to data, insights, and expertise.

One exception comes from New Zealand. The UN has released a report on the role of data stewards and National Statistical Offices (NSOs) in the new data ecosystem, which provides many use cases that can be adopted by governments seeking to establish such a role. In New Zealand, an appointed Government Chief Data Steward is in charge of setting the strategic direction for the government’s data management, with a particular focus on data reuse.

Data stewards can play an important role in organisations leading data reuse programmes. They would be responsible for responding to the participation challenges introduced above.

A Data Steward’s role includes attracting participation for data reuse programmes by:

  • Demonstrating and communicating the value proposition of data reuse and collaborations, by engaging in partnerships and steering data reuse and sharing among data commons, cooperatives, or collaborative infrastructures; and
  • Developing responsible data lifecycle governance, and communicating insights to raise awareness and build trust among stakeholders.

A Data Steward’s role includes maintaining and scaling participation for data reuse programmes by:

  • Maintaining trust by engaging with wider stakeholders and establishing clear engagement methodologies. For example, by securing a social license, data stewards ensure that the principle of digital self-determination is embedded in data reuse processes;
  • Fostering sustainable partnerships and collaborations around data by developing business cases for data sharing and reuse, and measuring impact to build the societal case for data collaboration; and
  • Innovating in the sector by turning data into decision intelligence to ensure that insights derived from data are more effectively integrated into decision-making processes…(More)”.

From Answer-Giving to Question-Asking: Inverting the Socratic Method in the Age of AI


Blog by Anthea Roberts: “…If questioning is indeed becoming a premier cognitive skill in the AI age, how should education and professional development evolve? Here are some possibilities:

  1. Assessment Through Iterative Questioning: Rather than evaluating students solely on their answers, we might assess their ability to engage in sustained, productive questioning—their skill at probing, following up, identifying inconsistencies, and refining inquiries over multiple rounds. Can they navigate a complex problem through a series of well-crafted questions? Can they identify when an AI response contains subtle errors or omissions that require further exploration?
  2. Prompt Literacy as Core Curriculum: Just as reading and writing are foundational literacies, the ability to effectively prompt and question AI systems may become a basic skill taught from early education onward. This would include teaching students how to refine queries, test assumptions, and evaluate AI responses critically—recognizing that AI systems still hallucinate, contain biases from their training data, and have uneven performance across different domains.
  3. Socratic AI Interfaces: Future AI interfaces might be designed explicitly to encourage Socratic dialogue rather than one-sided Q&A. Instead of simply answering queries, these systems might respond with clarifying questions of their own: “It sounds like you’re asking about X—can you tell me more about your specific interest in this area?” This would model the kind of iterative exchange that characterizes productive human-human dialogue…(More)”.
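To make the clarify-first pattern concrete, here is a minimal sketch (not from the original post) of a wrapper that withholds an answer until it has asked at least one clarifying question. The names `socratic_turn`, `model`, and the prompts are illustrative assumptions; `model` stands in for any text-in, text-out LLM call, and no specific API is implied.

```python
from typing import Callable, List

def socratic_turn(model: Callable[[str], str],
                  user_query: str,
                  clarifications: List[str]) -> str:
    """Ask a clarifying question first; answer only once some context is gathered."""
    if not clarifications:
        # First round: do not answer yet; elicit the user's goal and constraints.
        return model(
            "The user asked: " + user_query + "\n"
            "Do not answer yet. Ask one short clarifying question about "
            "their specific interest, goal, or constraints."
        )
    # Later rounds: answer, but surface remaining assumptions so the user can keep probing.
    context = "\n".join("- " + c for c in clarifications)
    return model(
        "Question: " + user_query + "\n"
        "Clarifications provided so far:\n" + context + "\n"
        "Answer concisely and list any assumptions you are still making."
    )

# Example wiring with a trivial stand-in model:
echo_model = lambda prompt: "(model output for: " + prompt.splitlines()[0] + ")"
print(socratic_turn(echo_model, "How should I regulate AI?", []))
print(socratic_turn(echo_model, "How should I regulate AI?",
                    ["I'm a city-level policymaker worried about procurement."]))
```

The design choice worth noting is simply that the first turn is forbidden from answering: the iterative exchange the post describes becomes the default behaviour rather than something the user must request.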

The Future of Health Is Preventive — If We Get Data Governance Right


Article by Stefaan Verhulst: “After a long gestation period of three years, the European Health Data Space (EHDS) is now coming into effect across the European Union, potentially ushering in a new era of health data access, interoperability, and innovation. As this ambitious initiative enters the implementation phase, it brings with it the opportunity to fundamentally reshape how health systems across Europe operate. More generally, the EHDS contains important lessons (and some cautions) for the rest of the world, suggesting how a fragmented, reactive model of healthcare may transition to one that is more integrated, proactive, and prevention-oriented.

For too long, health systems–in the EU and around the world–have been built around treating diseases rather than preventing them. Now, we have an opportunity to change that paradigm. Data, and especially the advent of AI, give us the tools to predict and intervene before illness takes hold. Data offers the potential for a system that prioritizes prevention–one where individuals receive personalized guidance to stay healthy, policymakers access real-time evidence to address risks before they escalate, and epidemics are predicted weeks in advance, enabling proactive, rapid, and highly effective responses.

But to make AI-powered preventive health care a reality, and to make the EHDS a success, we need a new data governance approach, one that would include two key components:

  • The ability to reuse data collected for other purposes (e.g., mobility, retail sales, workplace trends) to improve health outcomes.
  • The ability to integrate different data sources–clinical records and electronic health records (EHRs), but also environmental, social, and economic data–to build a complete picture of health risks.

In what follows, we outline some critical aspects of this new governance framework, including responsible data access and reuse (so-called secondary use), moving beyond traditional consent models to a social license for reuse, data stewardship, and the need to prioritize high-impact applications. We conclude with some specific recommendations for the EHDS, built from the preceding general discussion about the role of AI and data in preventive health…(More)”.

Unlocking Public Value with Non-Traditional Data: Recent Use Cases and Emerging Trends


Article by Adam Zable and Stefaan Verhulst: “Non-Traditional Data (NTD)—digitally captured, mediated, or observed data such as mobile phone records, online transactions, or satellite imagery—is reshaping how we identify, understand, and respond to public interest challenges. As part of the Third Wave of Open Data, these often privately held datasets are being responsibly re-used through new governance models and cross-sector collaboration to generate public value at scale.

In our previous post, we shared emerging case studies across health, urban planning, the environment, and more. Several months later, the momentum has not only continued but diversified. New projects reaffirm NTD’s potential—especially when linked with traditional data, embedded in interdisciplinary research, and deployed in ways that are privacy-aware and impact-focused.

This update profiles recent initiatives that push the boundaries of what NTD can do. Together, they highlight the evolving domains where this type of data is helping to surface hidden inequities, improve decision-making, and build more responsive systems:

  • Financial Inclusion
  • Public Health and Well-Being
  • Socioeconomic Analysis
  • Transportation and Urban Mobility
  • Data Systems and Governance
  • Economic and Labor Dynamics
  • Digital Behavior and Communication…(More)”.

Data Localization: A Global Threat to Human Rights Online


Article by Freedom House: “From Pakistan to Zambia, governments around the world are increasingly proposing and passing data localization legislation. These laws, which refer to the rules governing the storage and transfer of electronic data across jurisdictions, are often justified as addressing concerns such as user privacy, cybersecurity, national security, and monopolistic market practices. Notwithstanding these laudable goals, data localization initiatives cause more harm than good, especially in legal environments with poor rule of law.

Data localization requirements can take many different forms. A government may require all companies collecting and processing certain types of data about local users to store the data on servers located in the country. Authorities may also restrict the foreign transfer of certain types of data or allow it only under narrow circumstances, such as after obtaining the explicit consent of users, receiving a license or permit from a public authority, or conducting a privacy assessment of the country to which the data will be transferred.

While data localization can have significant economic and security implications, the focus of this piece—in line with that of the Global Network Initiative and Freedom House—is on its potential human rights impacts, which are varied. Freedom House’s research shows that the rise in data localization policies worldwide is contributing to the global decline of internet freedom. Without robust transparency and accountability frameworks embedded into these provisions, digital rights are often put on the line. As these types of legislation continue to pop up globally, the need for rights-respecting solutions and norms for cross-border data flows is greater than ever…(More)”.

Engaging Youth on Responsible Data Reuse: 5 Lessons Learnt from a Multi-Country Experiment


Article by Elena Murray, Moiz Shaikh and Stefaan G. Verhulst: “Young people seeking essential services — like mental health care, education, or public benefits — are often asked to share personal data in order to access the service, without having any say in how it is being collected, shared or used, or why. If young people distrust how their data is being used, they may avoid services or withhold important information, fearing misuse. This can unintentionally widen the very gaps these services aim to close.

To build trust, service providers and policymakers must involve young people in co-designing how their data is collected and used. Understanding their concerns, values, and expectations is key to developing data practices that reflect their needs. Empowering young people to develop the conditions for data reuse and to design solutions to their concerns enables digital self-determination.

The question is then: what does meaningful engagement actually look like — and how can we get it right?

To answer that question, we engaged four partners in four different countries and conducted:

  • 1000 hours of youth participation, involving more than 70 young people.
  • 12 youth engagement events.
  • Six expert talks and mentorship sessions.

These activities were undertaken as part of the NextGenData project, a year-long global collaboration supported by the Botnar Foundation, which piloted a methodology for youth engagement on responsible data reuse in Moldova, Tanzania, India, and Kyrgyzstan.

A key outcome of our work was a youth engagement methodology, which we recently launched. Below, we reflect on what we learnt — and how we can apply these learnings to ensure that the future of data-driven services both serves the needs of, and is guided by, young people.

Lessons Learnt:…(More)”

[Figure: A Cycle for Youth Engagement on Data — NextGenData Project. The cycle: Foster Data Literacy, Develop Real-World Use Cases, Align with Local Realities, Optimise Participation, Implement Scalable Methodologies.]

Beyond data egoism: let’s embrace data altruism


Blog by Frank Hamerlinck: “When it comes to data sharing, there’s often a gap between ambition and reality. Many organizations recognize the potential of data collaboration, yet when it comes down to sharing their own data, hesitation kicks in. The concern? Costs, risks, and unclear returns. At the same time, there’s strong enthusiasm for accessing data.

This is the paradox we need to break. Because if data egoism rules, real innovation is out of reach, making the need for data altruism more urgent than ever.

…More and more leaders recognize that unlocking data is essential to staying competitive on a global scale, and they understand that we must do so while upholding our European values. However, the real challenge lies in translating this growing willingness into concrete action. Many acknowledge its importance in principle, but few are ready to take the first step. And that’s a challenge we need to address – not just as organizations but as a society…

To break down barriers and accelerate data-driven innovation, we’re launching the FTI Data Catalog – a step toward making data sharing easier, more transparent, and more impactful.

The catalog provides a structured, accessible overview of available datasets, from location data and financial data to well-being data. It allows organizations to discover, understand, and responsibly leverage data with ease. Whether you’re looking for insights to fuel innovation, enhance decision-making, drive new partnerships or unlock new value from your own data, the catalog is built to support open and secure data exchange.
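As a purely illustrative aside, a catalog entry of this kind typically needs to carry enough metadata for a would-be reuser to judge fitness and conditions before requesting access. The sketch below is a hypothetical schema, not the actual FTI Data Catalog format; every field name and value is an invented example.

```python
# Hypothetical sketch of a minimal data-catalog entry; not the FTI schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    dataset_id: str                 # stable identifier
    title: str                      # human-readable name
    description: str                # what the data covers and how it was produced
    domain: str                     # e.g. "location", "financial", "well-being"
    steward_contact: str            # who to ask about access and conditions
    licence: str                    # terms under which reuse is permitted
    access_conditions: List[str] = field(default_factory=list)  # e.g. aggregation only

entry = CatalogEntry(
    dataset_id="example-mobility-001",
    title="Aggregated footfall counts (illustrative)",
    description="Hourly, area-level counts derived from anonymised signals.",
    domain="location",
    steward_contact="data-steward@example.org",
    licence="restricted reuse, purpose-bound",
    access_conditions=["aggregated outputs only", "no re-identification attempts"],
)
print(entry.title)
```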

Feeling curious? Explore the catalog

By making data more accessible, we’re laying the foundation for a culture of collaboration. The road to data altruism is long, but it’s one worth walking. The future belongs to those who dare to share!…(More)”.

How crawlers impact the operations of the Wikimedia projects


Article by the Wikimedia Foundation: “Since the beginning of 2024, the demand for the content created by the Wikimedia volunteer community – especially for the 144 million images, videos, and other files on Wikimedia Commons – has grown significantly. In this post, we’ll discuss the reasons for this trend and its impact.

The Wikimedia projects are the largest collection of open knowledge in the world. Our sites are an invaluable destination for humans searching for information, and for all kinds of businesses that access our content automatically as a core input to their products. Most notably, the content has been a critical component of search engine results, which in turn has brought users back to our sites. But with the rise of AI, the dynamic is changing: We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion has happened largely without the attribution that is key to driving new users to participate in the movement, and it is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.

When Jimmy Carter died in December 2024, his page on English Wikipedia saw more than 2.8 million views over the course of a day. This was relatively high, but manageable. At the same time, quite a few users played a 1.5-hour-long video of Carter’s 1980 presidential debate with Ronald Reagan. This caused a surge in network traffic, doubling its normal rate. As a consequence, for about one hour a small number of Wikimedia’s connections to the Internet filled up entirely, causing slow page load times for some users. The sudden traffic surge alerted our Site Reliability team, who were swiftly able to address this by changing the paths our internet connections go through to reduce the congestion. But still, this should not have caused any issues, as the Foundation is well equipped to handle high traffic spikes during exceptional events. So what happened?…

Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs…(More)”.

Web 3.0 Requires Data Integrity


Article by Bruce Schneier and Davi Ottenheimer: “If you’ve ever taken a computer security class, you’ve probably learned about the three legs of computer security—confidentiality, integrity, and availability—known as the CIA triad. When we talk about a system being secure, that’s what we’re referring to. All are important, but to different degrees in different contexts. In a world populated by artificial intelligence (AI) systems and AI agents, integrity will be paramount.

What is data integrity? It’s ensuring that no one can modify data—that’s the security angle—but it’s much more than that. It encompasses accuracy, completeness, and quality of data—all over both time and space. It’s preventing accidental data loss; the “undo” button is a primitive integrity measure. It’s also making sure that data is accurate when it’s collected—that it comes from a trustworthy source, that nothing important is missing, and that it doesn’t change as it moves from format to format. The ability to restart your computer is another integrity measure.
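One narrow slice of integrity, detecting that a record changed between collection and use even after a harmless re-serialisation, can be illustrated with a content hash. This is a minimal sketch under stated assumptions, not drawn from the article; the record fields are invented, and a hash alone says nothing about whether the data was accurate or trustworthy when it was first collected.

```python
# Minimal sketch: a content fingerprint that survives re-serialisation but
# flags any modification of the underlying values. Field names are invented.
import hashlib
import json

def fingerprint(record: dict) -> str:
    """Return a stable SHA-256 digest of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

original = {"patient_id": "A17", "systolic_bp": 128, "unit": "mmHg"}
stored_digest = fingerprint(original)          # recorded at collection time

received = {"unit": "mmHg", "patient_id": "A17", "systolic_bp": 128}
assert fingerprint(received) == stored_digest  # same content, different key order

tampered = dict(received, systolic_bp=118)
assert fingerprint(tampered) != stored_digest  # any modification is detectable
```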

The CIA triad has evolved with the Internet. The first iteration of the Web—Web 1.0 of the 1990s and early 2000s—prioritized availability. This era saw organizations and individuals rush to digitize their content, creating what has become an unprecedented repository of human knowledge. Organizations worldwide established their digital presence, leading to massive digitization projects where quantity took precedence over quality. The emphasis on making information available overshadowed other concerns.

As Web technologies matured, the focus shifted to protecting the vast amounts of data flowing through online systems. This is Web 2.0: the Internet of today. Interactive features and user-generated content transformed the Web from a read-only medium to a participatory platform. The increase in personal data, and the emergence of interactive platforms for e-commerce, social media, and online everything demanded both data protection and user privacy. Confidentiality became paramount.

We stand at the threshold of a new Web paradigm: Web 3.0. This is a distributed, decentralized, intelligent Web. Peer-to-peer social-networking systems promise to break the tech monopolies’ control on how we interact with each other. Tim Berners-Lee’s open W3C protocol, Solid, represents a fundamental shift in how we think about data ownership and control. A future filled with AI agents requires verifiable, trustworthy personal data and computation. In this world, data integrity takes center stage…(More)”.

Should AGI-preppers embrace DOGE?


Blog by Henry Farrell: “…AGI-prepping is reshaping our politics. Wildly ambitious claims for AGI have not only shaped America’s grand strategy, but are plausibly among the justifying reasons for DOGE.

After the announcement of DOGE, but before it properly got going, I talked to someone who was not formally affiliated, but was very definitely DOGE-adjacent. I put it to this individual that tearing out the decision-making capacities of government would not be good for America’s ability to do things in the world. Their response (paraphrased slightly) was: so what? We’ll have AGI by late 2026. And indeed, one of DOGE’s major ambitions, as described in a new article in WIRED, appears to have been to pull as much government information as possible into a large model that could then provide useful information across the totality of government.

The point – which I don’t think is understood nearly widely enough – is that radical institutional revolutions such as DOGE follow naturally from the AGI-prepper framework. If AGI is right around the corner, we don’t need to have a massive federal government apparatus, organizing funding for science via the National Science Foundation and the National Institutes of Health. After all, in Amodei and Pottinger’s prediction:

By 2027, AI developed by frontier labs will likely be smarter than Nobel Prize winners across most fields of science and engineering. … It will be able to … complete complex tasks that would take people months or years, such as designing new weapons or curing diseases.

Who needs expensive and cumbersome bureaucratic institutions for organizing funding scientists in a near future where a “country of geniuses [will be] contained in a data center,” ready to solve whatever problems we ask them to? Indeed, if these bottled geniuses are cognitively superior to humans across most or all tasks, why do we need human expertise at all, beyond describing and explaining human wants? From this perspective, most human based institutions are obsolescing assets that need to be ripped out, and DOGE is only the barest of beginnings…(More)”.