Stefaan Verhulst

Article by Erika Tyagi et al: “In this post, we offer a behind-the-scenes look at our approach and the lessons we’ve learned, with the aim of helping other organizations and leaders build the foundations for data systems that unlock insights and improve outcomes.

  1. Start privacy and governance conversations early

Successfully managing risks requires treating privacy, governance, and disclosure protections as first-order design choices—not afterthoughts. Starting these conversations early helps align goals, clarify decisionmaking authority, and design processes that scale, especially in multiagency or cross-jurisdictional projects with significant legal and operational constraints. These early conversations can enable trust and efficiency down the line.

Urban’s work with the DC Education Research Collaborative demonstrates the importance of early governance. From the start, the collaborative was structured as a research-practice partnership, bringing together education agencies, researchers, and community stakeholders through formal governance bodies. These included a cross-sector advisory committee and a research council of academic and analytic partners.

Governance and disclosure processes were established early and collaboratively, allowing agencies to securely and easily share data once with the collaborative, where a central team generates consistent, research-ready datasets that are then used by researchers across different organizations and teams. This approach, combined with regular meetings and feedback loops, reduces administrative burden, provides predictable data access, and keeps use aligned with the collaborative’s shared values.

  2. Design for change

Data systems must evolve as new data sources, users, and technologies emerge. The Education Data Portal, launched in 2018, was intentionally designed with this flexibility in mind. It uses an API-first, metadata-rich design to harmonize datasets over time and across sources and pairs those data with detailed, programmatically accessible documentation. This approach allows us to add new data sources and build downstream tools, such as programming libraries and interactive dashboards, without requiring changes to the underlying system.

Because the portal’s core architecture is modular and scalable, it’s remained resilient while expanding to support new use cases and audiences. Today, it’s evolving in response to federal policy shifts and advances in artificial intelligence. We are incorporating nonfederal data from states, for example, and integrating with initiatives like Google’s Data Commons to broaden access and usability. While Urban could not have anticipated these specific developments in 2018, the decision to prioritize an API-first design and curated metadata has enabled us to adapt the portal to new datasets, users, and tools without reengineering its foundation…(More)”.
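
To make the API-first, metadata-rich pattern described in the excerpt above more concrete, here is a minimal sketch of a downstream client. It is not the Education Data Portal's documented API: the base URL, endpoint paths, and response fields are illustrative assumptions, meant only to show how a tool can discover available variables from machine-readable metadata and fetch records without depending on how the underlying system stores the data.

```python
"""Minimal sketch of a client for an API-first data portal.

The base URL, endpoint paths, and response structure are hypothetical
placeholders, not the Education Data Portal's actual API.
"""
import requests

BASE_URL = "https://example-data-portal.org/api/v1"  # hypothetical portal


def get_metadata(dataset: str) -> dict:
    """Fetch the machine-readable metadata that documents a dataset."""
    resp = requests.get(f"{BASE_URL}/metadata/{dataset}", timeout=30)
    resp.raise_for_status()
    return resp.json()


def get_records(dataset: str, **filters) -> list[dict]:
    """Fetch records, passing filters as query parameters."""
    resp = requests.get(f"{BASE_URL}/{dataset}", params=filters, timeout=30)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    # A dashboard or programming library built on the portal only needs the
    # metadata to learn which variables and years are available.
    meta = get_metadata("school-directory")
    print("Available variables:", [v["name"] for v in meta.get("variables", [])])

    records = get_records("school-directory", year=2018)
    print(f"Retrieved {len(records)} records for 2018")
```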

Five Lessons for Building Sustainable Data Systems to Support Policy Insights

Article by Indranil Ghosh: “Iran is systematically crippling Starlink, the satellite internet service said to be almost impossible to jam.

Military-grade GPS jammers deployed since January 8 have cut satellite internet performance by as much as 80% in parts of the country, according to Amir Rashidi, director of digital rights at the Miaan Group, a U.S.-based nonprofit focused on Iranian internet censorship and digital rights.

“The level of violence by the government is unlike anything I have ever witnessed,” Rashidi wrote on LinkedIn. “The Islamic Republic is killing to survive.”

The U.S.-based Human Rights Activists News Agency reports at least 572 people have been killed and more than 10,600 arrested since protests erupted on December 28. Iran Human Rights, based in Norway, said the real toll could be far higher. Iranian Nobel laureate Shirin Ebadi warned of a potential “massacre under the cover of a sweeping communications blackout.”

The nationwide internet shutdown, which started on January 8, has disconnected 85 million Iranians from the outside world. Cloudflare, a major internet infrastructure company, recorded a 98.5% collapse in Iranian internet traffic within 30 minutes of the shutdown starting. NetBlocks, an internet monitoring group, confirmed non-satellite connectivity dropped below 2% of normal levels.

Iran has cut internet access 17 times since 2018, according to the Internet Society, a nonprofit that advocates for an open internet. Mohammed Soliman, a technology analyst at the Middle East Institute, a Washington-based think tank, said years of sanctions have left the government with near-total control over internet infrastructure…(More)”.

Iran crippled Starlink and why the rest of the world should worry

Article by Noam Angrist, Amanda Beatty, Claire Cullen & Tendekai Mukoyi Nkwane: “Many nonprofits in low- and middle-income countries face a critical mismatch: urgent social problems demand rapid program iteration, yet organizations often wait years for externally-produced evaluation results. When they do conduct rigorous evaluations, these are typically one-off studies that rarely keep pace with evolving implementation contexts or inform real-time decisions.

This tension between problem urgency and evidence generation speed is familiar to many implementers. After our organization, Youth Impact, ran an initial Randomized Controlled Trial (RCT) in Botswana on an HIV and teen pregnancy prevention program, we faced new questions relevant for government scale-up. The RCT showed near-peer educators effectively changed risky teen behavior while other messengers like public school teachers did not, but government partners needed ongoing answers about cost-effectiveness, implementation variations, and program adaptations. Waiting years between evaluation cycles meant missing the window to influence program design and consequential government reforms.

We needed an approach that maintained rigorous standards but operated at implementation speed. The technology sector offered a model: Microsoft alone runs approximately 100,000 A/B tests each year to continuously optimize products. A famous Gmail experiment, testing different advertising link colors, generated $200 million annually for Google and showed how small, rigorously tested variations can have outsized impact.

While social impact programs present unique complexities, we have found that a similar underlying approach can translate well to the social sector. Iterative A/B testing uses randomization to compare multiple program variations to answer questions about efficiency and cost-effectiveness, in addition to questions about general effectiveness (as in a traditional RCT). A/B testing also produces causal evidence in weeks or months, instead of years as in traditional randomized trials. Iterative A/B testing has a critical role to play in unlocking social impact: causal evidence delivered rapidly enough to optimize programs during implementation and scale-up…(More)”.
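
The mechanics of a single iteration can be sketched in a few lines of code. The example below is illustrative rather than Youth Impact's actual pipeline: the sample size, variant labels, and outcome rates are made-up placeholders. It randomizes participants across two program variants, simulates a binary outcome, and reports the difference in proportions with a normal-approximation 95% confidence interval, the kind of quick causal read-out that lets a team adjust the program before the next cycle.

```python
"""Minimal sketch of one iteration of an A/B test for a program variant.

Variant names, sample size, and outcome rates are hypothetical placeholders.
"""
import numpy as np

rng = np.random.default_rng(seed=42)

# 1. Randomize participants to variant A or B with equal probability.
n = 2000
assignment = rng.choice(["A", "B"], size=n)

# 2. Simulated binary outcomes (e.g., completed a follow-up session).
#    In practice these come from routine program monitoring data.
true_rates = {"A": 0.30, "B": 0.36}
outcome = np.array([rng.random() < true_rates[a] for a in assignment])

# 3. Compare the two arms: difference in proportions with a 95% CI
#    (normal approximation).
p_a = outcome[assignment == "A"].mean()
p_b = outcome[assignment == "B"].mean()
n_a = (assignment == "A").sum()
n_b = (assignment == "B").sum()

diff = p_b - p_a
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"Variant A rate: {p_a:.3f}  Variant B rate: {p_b:.3f}")
print(f"Estimated effect of B vs. A: {diff:+.3f} "
      f"(95% CI {ci_low:+.3f} to {ci_high:+.3f})")
```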

Iterative A/B Testing for Social Impact: Rigorous, Rapid, Regular 

Paper by Emilija Gagrčin et al.: “Platformisation and the growing adoption of AI-driven systems have intensified pervasive data extraction and appropriation that bring distinct harms for both individuals and societies at large. Yet, little is known about how distinct harm perceptions shape citizens’ preferences for different control mechanisms. Based on survey data from six EU countries (N=2,889), we examine differences in perceptions of personal vs. societal harm and their implications for individual control preferences and support for regulation. We find a surprising inverse relationship between perceived personal harm and desire for individual control: when citizens perceive greater personal harm, they become less inclined to seek individual data control, suggesting privacy resignation. Conversely, perceived societal harm positively relates to both individual and regulatory control preferences, underscoring citizens’ view of these mechanisms as complementary, particularly when they perceive harms to democracy. For policymakers, the findings suggest that regulators should treat both dimensions as related but distinct inputs when designing interventions and address the conditions that generate both individual and collective harms. Specifically, regulatory frameworks with an overreliance on individual control mechanisms (like consent requirements) may be insufficient or even counterproductive when citizens already perceive data harms…(More)”.

Perceived personal and societal data harms shape users’ data control preferences

Article by Yiran Wang et al: “Public health decisions increasingly rely on large-scale data and emerging technologies such as artificial intelligence and mobile health. However, many populations—including those in rural areas, with disabilities, experiencing homelessness, or living in low- and middle-income regions of the world—remain underrepresented in health datasets, leading to biased findings and suboptimal health outcomes for certain subgroups. Addressing data inequities is critical to ensuring that technological and digital advances improve health outcomes for all.

This article proposes 10 core concepts to improve data equity throughout the operational arc of data science research and practice in public health. The framework integrates computer science principles such as fairness, transparency, and privacy protection with best practices in public health data science that focus on mitigating information and selection biases, learning causality, and ensuring generalizability. These concepts are applied together throughout the data life cycle, from study design to data collection, analysis, and interpretation, to policy translation, offering a structured approach for evaluating whether data practices adequately represent and serve all populations.

Data equity is a foundational requirement for producing trustworthy inference and actionable evidence. When data equity is built into public health research from the start, technological and digital advances are more likely to improve health outcomes for everyone rather than widening existing health gaps. These 10 core concepts can be used to operationalize data equity in public health. Although data equity is an essential first step, it does not automatically guarantee information, learning, or decision equity. Advancing data equity must be accompanied by parallel efforts in information theory and structural changes that promote informed decision-making…(More)”.
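
One of the questions the excerpt raises, whether a dataset adequately represents the populations it is meant to serve, can be checked with a simple descriptive audit. The sketch below is not the authors' framework; the subgroup labels, benchmark shares, and tiny example dataset are hypothetical, and it only compares observed subgroup shares against a population benchmark to flag underrepresentation.

```python
"""Minimal sketch of a subgroup-representation audit for a study dataset.

Subgroup labels, benchmark shares, and the example data are hypothetical.
"""
import pandas as pd

# Hypothetical study dataset with a subgroup column.
study = pd.DataFrame({
    "participant_id": range(1, 11),
    "subgroup": ["urban", "urban", "urban", "urban", "urban",
                 "urban", "urban", "rural", "rural", "unhoused"],
})

# Hypothetical population benchmark shares (e.g., from a census).
benchmark = {"urban": 0.60, "rural": 0.35, "unhoused": 0.05}

observed = study["subgroup"].value_counts(normalize=True)

report = pd.DataFrame({
    "observed_share": observed,
    "benchmark_share": pd.Series(benchmark),
}).fillna(0.0)

# A representation ratio below 1 flags groups whose underrepresentation
# could bias downstream findings.
report["representation_ratio"] = (
    report["observed_share"] / report["benchmark_share"]
)
print(report.round(2))
```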

Ten Core Concepts for Ensuring Data Equity in Public Health

Paper by Kalena Cortes, Brian Holzman, Melissa D. Gentry & Miranda I. Lambert: “This study examines how digital incentives influence survey participation and engagement in a large randomized controlled trial of parents across seven Texas school districts. We test how incentive amount and information about vendor options affect response behavior and explore differences by language background. Incentivized parents were more likely to start and complete surveys and claim gift cards, though Spanish-speaking parents exhibited distinct patterns—greater completion rates but lower redemption rates, often selecting essential-goods vendors. Increasing incentive value and providing advance information both improved engagement. Findings inform the design of equitable, effective digital incentive strategies for diverse populations…(More)”.

Digital Incentives in Surveys: Response Rates and Sociodemographic Effects in a Large-Scale Parental Nudge Intervention

World Bank Report: “Text and voice messages have emerged as a low-cost and popular tool for nudging recipients to change behavior. This paper presents findings from a randomized controlled trial designed to evaluate the impact of an information campaign using text and voice messages implemented in Punjab, Pakistan during the COVID-19-induced school closures. This campaign sought to increase study time and provide academic support while schools were closed and to encourage reenrollment when they opened, to reduce the number of dropouts. The campaign targeted girls enrolled in grades 5 to 7. Messages were sent out by a government institution, and the campaign lasted from October 2020 until November 2021, when schools had permanently re-opened. Households were randomized across three treatment groups and a control group that did not receive any messages. The first treatment group received gender-specific messages that explicitly referenced daughters in their households, and the second treatment group received gender-neutral messages. A third group was cross-randomized across the first two treatment arms and received academic support messages (practice math problems and solutions). The results show that the messages increased reenrollment by 6.0 percentage points approximately three months after the intervention finished. Gender-neutral messages (+8.9 percentage points) showed a larger effect on enrollment than gender-specific messages (+4.3 percentage points), although the difference is not statistically significant. The message program also increased learning outcomes by 0.2 standard deviations for Urdu and 0.2 standard deviations for math. The paper finds a small positive effect on the intensive margin of remote learning and an (equivalent) small negative effect on the intensive margin of outside tutoring. In line with similar studies on pandemic remediation efforts, the paper finds no effect of the academic support intervention on learning. The findings suggest that increased school enrollment played a role in supporting the observed increase in learning outcomes…(More)”.
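
The assignment structure described in the abstract, two message arms plus a control with academic support cross-randomized among message recipients, can be sketched as follows. This is an illustrative reconstruction rather than the study's actual randomization protocol; the household count, arm labels, and equal-probability draws are placeholders.

```python
"""Minimal sketch of a cross-randomized assignment of households.

Group labels, sizes, and assignment probabilities are illustrative only.
"""
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)
n_households = 12

households = pd.DataFrame({"household_id": range(1, n_households + 1)})

# First randomization: message type (two treatment arms plus a pure control).
households["message_arm"] = rng.choice(
    ["gender_specific", "gender_neutral", "control"], size=n_households
)

# Cross-randomization: among message recipients only, assign academic
# support messages (the control group receives nothing).
receives_messages = households["message_arm"] != "control"
households["academic_support"] = False
households.loc[receives_messages, "academic_support"] = rng.choice(
    [True, False], size=receives_messages.sum()
)

print(households)
print(households.groupby(["message_arm", "academic_support"]).size())
```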

Nudging at Scale: Evidence from a Government Text Messaging Campaign during School Shutdowns in Punjab, Pakistan

Article by Tony Curzon Price: “Is 2026 the year that data collectives – unions, trusts, mutuals and clubs – tilt the balance of power in cyberspace away from mega-platforms and towards the citizen?

Last year, tech boss Sam Altman enabled ChatGPT to better remember past conversations in some jurisdictions, meaning that the AI might soon know us better than anyone else. In response to this sort of shift in power, we saw the creation of the First International Data Union (FIDU) to ensure that the data, knowledge and intimacy that Altman wants for ChatGPT would remain under members’ control and be managed according to their values.

Generative AI is causing a major overhaul of humanity’s life in cyberspace. There aren’t many examples of this sort of change – the web itself, Web 2.0 platforms, social media and mobile. The arrival of generative AI is upturning a decades-old equilibrium. ChatGPT has been the fastest-growing consumer application in history. It is displacing Google search in many lives. Open source models, especially from China, suggest that there are no natural moats in the technology, which means businesses can easily be overtaken by competitors with similar ideas.

Since the 2010s, many citizens and countries have become uncomfortable with how mega platforms have shaped the web. Scholars have pointed to these changes as important contributors to the deterioration of the mental health of children, the economic growth crisis and even falling global average IQs.

With the pieces of the cyberspace puzzle thrown into the air, citizens and governments do not want what happens next to be a repeat of what came before. Yet governments have discovered that their traditional policy tools against market power, like antitrust, are largely ineffective. Moreover, with the United States pushing back against tighter regulation abroad, even direct regulation by non-US states is proving difficult.

With other avenues of control largely defanged, this might be the moment for data unions. Data mutualisation promises to harness the collective power of citizens, providing a direct challenge to platforms…(More)”.

Data unions: people-powered data control

Article by Simon Ilyushchenko: “The Italian aphorism traduttore, traditore – the translator is a traitor – encapsulates a deep-seated suspicion about the act of translation: that to carry meaning from one language to another is always, to some degree, a corruption.

The writer and semiotician Umberto Eco took this charge seriously. In Experiences in Translation, Eco treats translation as an interpretive act – negotiation, compromise, loss. Every translation is an imperfect reproduction of the original. Every translator, in choosing what to preserve, chooses what to betray.

This is the situation confronting anyone who works with geospatial data – human or AI.

In 2019, Colombian researchers studied the relationship between armed conflict and forest cover in their country. Using the Global Forest Change dataset – a widely respected product derived from satellite imagery – they found something striking: if analysis is not done carefully, armed conflict appeared to be correlated with increases in forest cover.

One might infer, perversely, that violence was somehow good for forests. The authors’ interpretation of the ground data was the opposite.

Here is the mechanism they propose: armed conflict destabilized the rule of law, which enabled the rapid clearing of native forests for oil palm plantations. These plantations are monocultures – ecological deserts compared to the biodiverse forests they replaced. But to a satellite sensor, a mature oil palm plantation can read as ‘forest’. It has trees. The canopy closes. The pixels are green.

And even this example gets messy fast. The relationship between Colombian conflict and forest cover has generated substantial literature – but no consensus. Ganzenmüller et al. (2022) identified seven distinct categories of deforestation dynamics across Colombian municipalities; the same peace agreement drove opposite outcomes in different regions. Bodini et al. (2024), using loop analysis to model the socio-ecological system, found that causal pathways connecting violence, coca, cattle, and deforestation were so intertwined that their models for left-wing guerrilla dynamics showed “very low agreement with observed correlations.” The data didn’t fit a simple narrative – any simple narrative…(More)”.
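
The classification point in the excerpt, that a sensor can read a mature oil palm plantation as ‘forest’, comes down to simple greenness-based rules being unable to distinguish types of tree cover. The toy sketch below uses hypothetical NDVI values and an illustrative threshold (not the Global Forest Change methodology) to show how both land covers end up with the same label.

```python
"""Toy illustration of why a greenness threshold conflates tree covers.

NDVI values and the cutoff are hypothetical placeholders, not values from
the Global Forest Change dataset or any real classification pipeline.
"""
# Hypothetical mean NDVI (vegetation greenness) for three land covers.
ndvi_by_cover = {
    "native_forest": 0.85,
    "oil_palm_plantation": 0.80,
    "recently_cleared_land": 0.30,
}

FOREST_NDVI_THRESHOLD = 0.6  # illustrative cutoff, not a standard value

for cover, ndvi in ndvi_by_cover.items():
    label = "forest" if ndvi >= FOREST_NDVI_THRESHOLD else "non-forest"
    print(f"{cover:>24}: NDVI={ndvi:.2f} -> classified as {label}")
```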

To translate is to betray: On the Inevitable Betrayals of Geospatial Data

Paper by Arianna Zuanazzi, Michael P. Milham & Gregory Kiar: “Modern brain science is inherently multidisciplinary, requiring the integration of neuroimaging, psychology, behavioral science, genetics, computational neuroscience and artificial intelligence (to name a few) to advance our understanding of the brain. Critical challenges in the field of brain health — including clinical psychology, cognitive and brain sciences, and digital mental health — include the great heterogeneity of human data, small sample sizes and the subjectivity or limited reproducibility of measured constructs. Large-scale, multi-site and multimodal open science initiatives can represent a solution to these challenges (for example, see refs.); however, they often struggle with balancing data quality while maximizing sample size and ensuring that the resulting data are findable, accessible, interoperable and reusable (FAIR). Furthermore, large-scale high-dimensional multimodal datasets demand advanced analytic approaches beyond conventional statistical models, requiring the expertise and interdisciplinary collaboration of the broader scientific community…

Data science competitions (such as Kaggle, DrivenData, CodaBench and AIcrowd) offer a powerful mechanism to bridge disciplines, solve complex problems and crowdsource novel solutions, as they bring individuals from around the world together to solve real-world problems. For more than 20 years (for example, see refs.), such competitions have been hosted by companies, organizations and research institutions to answer scientific questions, advance methods and techniques, extract valuable insights from data, promote organizations’ missions and foster collaboration with stakeholders. Every stage of a data science competition offers opportunities to promote big data exploration, advance analytic innovation and strengthen community engagement (Fig. 1). To translate these opportunities into actionable steps, we have shared our Data Science Competition Organizer Checklist at https://doi.org/10.17605/osf.io/hnx9b; this offers practical guidance for designing and implementing data science competitions in the brain health domain…(More)”

How data science competitions accelerate brain health discovery
