Kickstarting Collaborative, AI-Ready Datasets in the Life Sciences with Government-funded Projects


Article by Erika DeBenedictis, Ben Andrew & Pete Kelly: “In the age of Artificial Intelligence (AI), large, high-quality datasets are needed to move the field of life science forward. However, the research community lacks strategies to incentivize collaboration on high-quality data acquisition and sharing. The government should fund collaborative roadmapping, certification, collection, and sharing of large, high-quality datasets in life science. In such a system, nonprofit research organizations engage scientific communities to identify key types of data that would be valuable for building predictive models, and define quality control (QC) and open science standards for collection of that data. Projects are designed to develop automated methods for data collection, certify data providers, and facilitate data collection in consultation with researchers throughout various scientific communities. Hosting of the resulting open data is subsidized and protected by security measures. This system would provide crucial incentives for the life science community to identify and amass large, high-quality open datasets that will immensely benefit researchers…(More)”.

Trust but Verify: A Guide to Conducting Due Diligence When Leveraging Non-Traditional Data in the Public Interest


New Report by Sara Marcucci, Andrew J. Zahuranec, and Stefaan Verhulst: “In an increasingly data-driven world, organizations across sectors are recognizing the potential of non-traditional data—data generated from sources outside conventional databases, such as social media, satellite imagery, and mobile usage—to provide insights into societal trends and challenges. When harnessed thoughtfully, this data can improve decision-making and bolster public interest projects in areas as varied as disaster response, healthcare, and environmental protection. However, with these new data streams come heightened ethical, legal, and operational risks that organizations need to manage responsibly. That’s where due diligence comes in, helping to ensure that data initiatives are beneficial and ethical.

The report, Trust but Verify: A Guide to Conducting Due Diligence When Leveraging Non-Traditional Data in the Public Interest, co-authored by Sara Marcucci, Andrew J. Zahuranec, and Stefaan Verhulst, offers a comprehensive framework to guide organizations in responsible data partnerships. Whether you’re a public agency or a private enterprise, this report provides a six-step process to ensure due diligence and maintain accountability, integrity, and trust in data initiatives…(More) (Blog)”.

Innovating with Non-Traditional Data: Recent Use Cases for Unlocking Public Value


Article by Stefaan Verhulst and Adam Zable: “Non-Traditional Data (NTD): “data that is digitally captured (e.g. mobile phone records), mediated (e.g. social media), or observed (e.g. satellite imagery), using new instrumentation mechanisms, often privately held.”

Digitalization and the resulting datafication have introduced a new category of data that, when re-used responsibly, can complement traditional data in addressing public interest questions—from public health to environmental conservation. Unlocking these often privately held datasets through data collaboratives is a key focus of what we have called The Third Wave of Open Data.

To help bridge this gap, we have curated below recent examples of the use of NTD for research and decision-making that were published in the past few months. They are organized into five categories:

  • Health and Well-being;
  • Humanitarian Aid;
  • Environment and Climate;
  • Urban Systems and Mobility; and
  • Economic and Labor Dynamics…(More)”.

The Emergence of National Data Initiatives: Comparing proposals and initiatives in the United Kingdom, Germany, and the United States


Article by Stefaan Verhulst and Roshni Singh: “Governments are increasingly recognizing data as a pivotal asset for driving economic growth, enhancing public service delivery, and fostering research and innovation. This recognition has intensified as policymakers acknowledge that data serves as the foundational element of artificial intelligence (AI) and that advancing AI sovereignty necessitates a robust data ecosystem. However, substantial portions of generated data remain inaccessible or underutilized. In response, several nations are initiating or exploring the launch of comprehensive national data strategies designed to consolidate, manage, and utilize data more effectively and at scale. As these initiatives evolve, discernible patterns in their objectives, governance structures, data-sharing mechanisms, and stakeholder engagement frameworks reveal both shared principles and country-specific approaches.

This blog starts some initial research on the emergence of national data initiatives by examining three of them, in the United Kingdom, Germany, and the United States, and exploring their strategic orientations and broader implications…(More)”.

Garden city: A synthetic dataset and sandbox environment for analysis of pre-processing algorithms for GPS human mobility data


Paper by Thomas H. Li and Francisco Barreras: “Human mobility datasets have seen increasing adoption in the past decade, enabling diverse applications that leverage the high precision of measured trajectories relative to other human mobility datasets. However, there are concerns about whether the high sparsity in some commercial datasets can introduce errors due to lack of robustness in processing algorithms, which could compromise the validity of downstream results. The scarcity of “ground-truth” data makes it particularly challenging to evaluate and calibrate these algorithms. To overcome these limitations and allow for an intermediate form of validation of common processing algorithms, we propose a synthetic trajectory simulator and sandbox environment meant to replicate the features of commercial datasets that could cause errors in such algorithms, and which can be used to compare algorithm outputs with “ground-truth” synthetic trajectories and mobility diaries. Our code is open-source and is publicly available alongside tutorial notebooks and sample datasets generated with it…(More)”.
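The core idea of the paper — generate a fully known “ground-truth” trajectory, then degrade it to mimic a sparse commercial feed, so pre-processing algorithms can be scored against the truth — can be illustrated with a minimal sketch. This is not the Garden city simulator or its API; the random-walk model, drop rate, and noise level below are illustrative assumptions only.

```python
import random

def simulate_trajectory(n_steps=200, step_std=10.0, seed=0):
    """Ground-truth trajectory: a simple 2-D random walk (metres)."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    traj = [(0, x, y)]
    for t in range(1, n_steps):
        x += rng.gauss(0, step_std)
        y += rng.gauss(0, step_std)
        traj.append((t, x, y))
    return traj

def sparsify(traj, keep_prob=0.1, noise_std=15.0, seed=1):
    """Mimic a sparse commercial feed: drop most pings, jitter the rest."""
    rng = random.Random(seed)
    observed = []
    for t, x, y in traj:
        if rng.random() < keep_prob:
            observed.append((t, x + rng.gauss(0, noise_std),
                                y + rng.gauss(0, noise_std)))
    return observed

truth = simulate_trajectory()
sparse = sparsify(truth)
print(len(truth), len(sparse))  # the observed feed keeps only ~10% of pings
```

Any candidate pre-processing algorithm (stop detection, trip segmentation, home-location inference) can then be run on `sparse` and compared point-by-point against `truth`, which is the form of validation the sandbox enables.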

National biodiversity data infrastructures: ten essential functions for science, policy, and practice 


Paper by Anton Güntsch et al: “Today, at the international level, powerful data portals are available to biodiversity researchers and policymakers, offering increasingly robust computing and network capacities and capable data services for internationally agreed-on standards. These accelerate individual and complex workflows to map data-driven research processes or even to make them possible for the first time. At the national level, however, and alongside these international developments, national infrastructures are needed to take on tasks that cannot be easily funded or addressed internationally. To avoid gaps, as well as redundancies in the research landscape, national tasks and responsibilities must be clearly defined to align efforts with core priorities. In the present article, we outline 10 essential functions of national biodiversity data infrastructures. They serve as key providers, facilitators, mediators, and platforms for effective biodiversity data management, integration, and analysis that require national efforts to foster biodiversity science, policy, and practice…(More)”.

Access, Signal, Action: Data Stewardship Lessons from Valencia’s Floods


Article by Marta Poblet, Stefaan Verhulst, and Anna Colom: “Valencia has a rich history in water management, a legacy shaped by both triumphs and tragedies. This connection to water is embedded in the city’s identity, yet modern floods test its resilience in new ways.

During the recent floods, Valencians experienced a troubling paradox. In today’s connected world, digital information flows through traditional and social media, weather apps, and government alert systems designed to warn us of danger and guide rapid responses. Despite this abundance of data, a tragedy unfolded last month in Valencia. This raises a crucial question: how can we ensure access to the right data, filter it for critical signals, and transform those signals into timely, effective action?

Data stewardship becomes essential in this process.

In particular, the devastating floods in Valencia underscore the importance of:

  • having access to data to strengthen the signal (first mile challenges)
  • separating signal from noise
  • translating signal into action (last mile challenges)…(More)”.

Beached Plastic Debris Index; a modern index for detecting plastics on beaches


Paper by Jenna Guffogg et al: “Plastic pollution on shorelines poses a significant threat to coastal ecosystems, underscoring the urgent need for scalable detection methods to facilitate debris removal. In this study, the Beached Plastic Debris Index (BPDI) was developed to detect plastic accumulation on beaches using shortwave infrared spectral features. To validate the BPDI, plastic targets with varying sub-pixel covers were placed on a sand spit and captured using WorldView-3 satellite imagery. The performance of the BPDI was analysed in comparison with the Normalized Difference Plastic Index (NDPI), the Plastic Index (PI), and two hydrocarbon indices (HI, HC). The BPDI successfully distinguished the plastic targets from sand, water, and vegetation, outperforming the other indices and identifying pixels with <30% plastic cover. The robustness of the BPDI suggests its potential as an effective tool for mapping plastic debris accumulations along coastlines…(More)”.
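The exact BPDI formulation is given in the paper; as a rough illustration of how a shortwave-infrared index responds to sub-pixel plastic cover, here is a generic normalized-difference sketch over a linear two-endmember mixture. The band choice, the reflectance values for `PLASTIC` and `SAND`, and the index form are hypothetical assumptions for illustration, not the authors' values.

```python
def mixed_reflectance(f_plastic, r_plastic, r_background):
    """Linear sub-pixel mixing of plastic and background reflectance."""
    return tuple(f_plastic * p + (1 - f_plastic) * b
                 for p, b in zip(r_plastic, r_background))

def nd_index(b1, b2):
    """Generic normalized-difference index over two SWIR bands."""
    return (b1 - b2) / (b1 + b2)

# Hypothetical SWIR reflectances (band_a, band_b); illustrative only.
PLASTIC = (0.55, 0.25)   # strong contrast across the two bands
SAND    = (0.40, 0.38)   # nearly flat background

for f in (0.0, 0.1, 0.3, 0.5):
    b1, b2 = mixed_reflectance(f, PLASTIC, SAND)
    print(f"cover={f:.1f}  index={nd_index(b1, b2):.3f}")
```

Because the index rises monotonically with plastic fraction under this mixing model, a threshold on it can flag pixels well below full plastic cover — the same reasoning that lets the BPDI pick out pixels with under 30% cover.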

Once Upon a Crime: Towards Crime Prediction from Demographics and Mobile Data


Paper by Andrey Bogomolov, Bruno Lepri, Jacopo Staiano, Nuria Oliver, Fabio Pianesi, and Alex Pentland: “In this paper, we present a novel approach to predict crime in a geographic space from multiple data sources, in particular mobile phone and demographic data. The main contribution of the proposed approach lies in using aggregated and anonymized human behavioral data derived from mobile network activity to tackle the crime prediction problem. While previous research efforts have used either background historical knowledge or offenders’ profiling, our findings support the hypothesis that aggregated human behavioral data captured from the mobile network infrastructure, in combination with basic demographic information, can be used to predict crime. In our experimental results with real crime data from London we obtain an accuracy of almost 70% when predicting whether a specific area in the city will be a crime hotspot or not. Moreover, we provide a discussion of the implications of our findings for data-driven crime analysis…(More)”.
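The setup the paper describes — aggregate behavioral and demographic features per geographic cell, then train a binary classifier to label each cell hotspot or not and report accuracy — can be sketched minimally as follows. The synthetic features, the hand-rolled logistic regression, and the ~400-cell toy dataset are assumptions for illustration; they are not the paper's features, model, or data.

```python
import math
import random

rng = random.Random(42)

# Synthetic grid cells: (mobile-activity volume, demographic score) -> hotspot?
# Purely illustrative; the paper's real features come from mobile-network
# and demographic data and are not reproduced here.
def make_cell():
    activity = rng.gauss(0, 1)
    demo = rng.gauss(0, 1)
    logit = 1.5 * activity + 0.8 * demo + rng.gauss(0, 0.5)
    return (activity, demo), 1 if logit > 0 else 0

data = [make_cell() for _ in range(400)]
train_set, test_set = data[:300], data[300:]

# Logistic regression fitted by plain batch gradient descent.
w, b, lr = [0.0, 0.0], 0.0, 0.1
for _ in range(500):
    gw, gb = [0.0, 0.0], 0.0
    for (x1, x2), y in train_set:
        p = 1 / (1 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
        gw[0] += (p - y) * x1
        gw[1] += (p - y) * x2
        gb += p - y
    w[0] -= lr * gw[0] / len(train_set)
    w[1] -= lr * gw[1] / len(train_set)
    b -= lr * gb / len(train_set)

correct = sum(
    (1 / (1 + math.exp(-(w[0] * x1 + w[1] * x2 + b))) > 0.5) == (y == 1)
    for (x1, x2), y in test_set
)
accuracy = correct / len(test_set)
print("held-out hotspot accuracy:", accuracy)
```

The point of the sketch is the evaluation shape — train on some cells, report held-out classification accuracy on the rest — which is how a headline figure like the paper's "almost 70%" is produced.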

Unlocking Green Deal Data: Innovative Approaches for Data Governance and Sharing in Europe


JRC Report: “Drawing upon the ambitious policy and legal framework outlined in the European Strategy for Data (2020) and the establishment of common European data spaces, this Science for Policy report explores innovative approaches for unlocking relevant data to achieve the objectives of the European Green Deal.

The report focuses on the governance and sharing of Green Deal data, analysing a variety of topics related to the implementation of new regulatory instruments, namely the Data Governance Act and the Data Act, as well as the roles of various actors in the data ecosystem. It provides an overview of the current incentives and disincentives for data sharing and explores the existing landscape of Data Intermediaries and Data Altruism Organizations. Additionally, it offers insights from a private sector perspective and outlines key data governance and sharing practices concerning Citizen-Generated Data (CGD).

The main conclusions build upon the concept of “Systemic Data Justice,” which emphasizes equity, accountability, and fair representation to foster stronger connections between the supply and demand of data for a more effective and sustainable data economy. Five policy recommendations outline a set of main implications and actionable points for the revision of the INSPIRE Directive (2007) within the context of the common European Green Deal data space, and toward a more sustainable and fair data ecosystem. However, the relevance of these recommendations extends beyond Green Deal data alone, as they outline key elements to ensure that any data ecosystem is both just and impact-oriented…(More)”.