Opening industry data: The private sector’s role in addressing societal challenges


Paper by Jennifer Hansen and Yiu-Shing Pang: “This commentary explores the potential of private companies to advance scientific progress and solve social challenges through opening and sharing their data. Open data can accelerate scientific discoveries, foster collaboration, and promote long-term business success. However, concerns regarding data privacy and security can hinder data sharing. Companies have options to mitigate the challenges through developing data governance mechanisms, collaborating with stakeholders, communicating the benefits, and creating incentives for data sharing, among others. Ultimately, open data has immense potential to drive positive social impact and business value, and companies can explore solutions for their specific circumstances and tailor them to their specific needs…(More)”.

Fighting poverty with synthetic data


Article by Jack Gisby, Anna Kiknadze, Thomas Mitterling, and Isabell Roitner-Fransecky: “If you have ever used a smartwatch or other wearable tech to track your steps, heart rate, or sleep, you are part of the “quantified self” movement. You are voluntarily submitting millions of intimate data points for collection and analysis. The Economist highlighted the benefits of good quality personal health and wellness data—increased physical activity, more efficient healthcare, and constant monitoring of chronic conditions. However, not everyone is enthusiastic about this trend. Many fear corporations will use the data to discriminate against the poor and vulnerable. For example, insurance firms could exclude patients based on preconditions obtained from personal data sharing.

Can we strike a balance between protecting the privacy of individuals and gathering valuable information? This blog explores applying a synthetic populations approach in New York City,  a city with an established reputation for using big data approaches to support urban management, including for welfare provisions and targeted policy interventions.

To better understand poverty rates at the census tract level, World Data Lab, with the support of the Sloan Foundation, generated a synthetic population based on the borough of Brooklyn. Synthetic populations rely on a combination of microdata and summary statistics:

  • Microdata consists of personal information at the individual level. In the U.S., such data is available at the Public Use Microdata Area (PUMA) level. PUMA are geographic areas partitioning the state, containing no fewer than 100,000 people each. However, due to privacy concerns, microdata is unavailable at the more granular census tract level. Microdata consists of both household and individual-level information, including last year’s household income, the household size, the number of rooms, and the age, sex, and educational attainment of each individual living in the household.
  • Summary statistics are based on populations rather than individuals and are available at the census tract level, given that there are fewer privacy concerns. Census tracts are small statistical subdivisions of a county, averaging about 4,000 inhabitants. In New York City, a census tract roughly equals a building block. Similar to microdata, summary statistics are available for individuals and households. On the census tract level, we know the total population, the corresponding demographic breakdown, the number of households within different income brackets, the number of households by number of rooms, and other similar variables…(More)”.

Local Data Spaces: Leveraging trusted research environments for secure location-based policy research


Paper by Jacob L. Macdonald, Mark A. Green, Maurizio Gibin, Simon Leech, Alex Singleton and Paul Longely: “This work explores the use of Trusted Research Environments for the secure analysis of sensitive, record-level data on local coronavirus disease-2019 (COVID-19) inequalities and economic vulnerabilities. The Local Data Spaces (LDS) project was a targeted rapid response and cross-disciplinary collaborative initiative using the Office for National Statistics’ Secure Research Service for localized comparison and analysis of health and economic outcomes over the course of the COVID-19 pandemic. Embedded researchers worked on co-producing a range of locally focused insights and reports built on secure secondary data and made appropriately open and available to the public and all local stakeholders for wider use. With secure infrastructure and overall data governance practices in place, accredited researchers were able to access a wealth of detailed data and resources to facilitate more targeted local policy analysis. Working with data within such infrastructure as part of a larger research project involved advanced planning and coordination to be efficient. As new and novel granular data resources become securely available (e.g., record-level administrative digital health records or consumer data), a range of local policy insights can be gained across issues of public health or local economic vitality. Many of these new forms of data however often come with a large degree of sensitivity around issues of personal identifiability and how the data is used for public-facing research and require secure and responsible use. Learning to work appropriately with secure data and research environments can open up many avenues for collaboration and analysis…(More)”

Opportunities and Challenges in Reusing Public Genomics Data


Introduction to Special Issue by Mahmoud Ahmed and Deok Ryong Kim: “Genomics data is accumulating in public repositories at an ever-increasing rate. Large consortia and individual labs continue to probe animal and plant tissue and cell cultures, generating vast amounts of data using established and novel technologies. The human genome project kickstarted the era of systems biology (1, 2). Ambitious projects followed to characterize non-coding regions, variations across species, and between populations (3, 4, 5). The cost reduction allowed individual labs to generate numerous smaller high-throughput datasets (6, 7, 8, 9). As a result, the scientific community should consider strategies to overcome the challenges and maximize the opportunities to use these resources for research and the public good. In this collection, we will elicit opinions and perspectives from researchers in the field on the opportunities and challenges of reusing public genomics data. The articles in this research topic converge on the need for data sharing while acknowledging the challenges that come with it. Two articles defined and highlighted the distinction between data and metadata. The characteristic of each should be considered when designing optimal sharing strategies. One article focuses on the specific issues surrounding the sharing of genomics interval data, and another on balancing the need for protecting pediatric rights and the sharing benefits.

The definition of what counts as data is itself a moving target. As technology advances, data can be produced in more ways and from novel sources. Events of recent years have highlighted this fact. “The pandemic has underscored the urgent need to recognize health data as a global public good with mechanisms to facilitate rapid data sharing and governance,” Schwalbe and colleagues (2020). The challenges facing these mechanisms could be technical, economic, legal, or political. Defining what data is and its type, therefore, is necessary to overcome these barriers because “the mechanisms to facilitate data sharing are often specific to data types.” Unlike genomics data, which has established platforms, sharing clinical data “remains in a nascent phase.” The article by Patrinos and colleagues (2022) considers the strong ethical imperative for protecting pediatric data while acknowledging the need not to overprotections. The authors discuss a model of consent for pediatric research that can balance the need to protect participants and generate health benefits.

Xue et al. (2023) focus on reusing genomic interval data. Identifying and retrieving the relevant data can be difficult, given the state of the repositories and the size of these data. Similarly, integrating interval data in reference genomes can be hard. The author calls for standardized formats for the data and the metadata to facilitate reuse.

Sheffield and colleagues (2023) highlight the distinction between data and metadata. Metadata describes the characteristics of the sample, experiment, and analysis. The nature of this information differs from that of the primary data in size, source, and ways of use. Therefore, an optimal strategy should consider these specific attributes for sharing metadata. Challenges specifics to sharing metadata include the need for standardized terms and formats, making it portable and easier to find.

We go beyond the reuse issue to highlight two other aspects that might increase the utility of available public data in Ahmed et al. (2023). These are curation and integration…(More)”.

From the Economic Graph to Economic Insights: Building the Infrastructure for Delivering Labor Market Insights from LinkedIn Data


Blog by Patrick Driscoll and Akash Kaura: “LinkedIn’s vision is to create economic opportunity for every member of the global workforce. Since its inception in 2015, the Economic Graph Research and Insights (EGRI) team has worked to make this vision a reality by generating labor market insights such as:

In this post, we’ll describe how the EGRI Data Foundations team (Team Asimov) leverages LinkedIn’s cutting-edge data infrastructure tools such as Unified Metrics PlatformPinot, and Datahub to ensure we can deliver data and insights robustly, securely, and at scale to a myriad of partners. We will illustrate this through a case study of how we built the pipeline for our most well-known and oft-cited flagship metric: the LinkedIn Hiring Rate…(More)”.

WHO Launches Global Infectious Disease Surveillance Network


Article by Shania Kennedy: “The World Health Organization (WHO) launched the International Pathogen Surveillance Network (IPSN), a public health network to prevent and detect infectious disease threats before they become epidemics or pandemics.

IPSN will rely on insights generated from pathogen genomics, which helps analyze the genetic material of viruses, bacteria, and other disease-causing micro-organisms to determine how they spread and how infectious or deadly they may be.

Using these data, researchers can identify and track diseases to improve outbreak prevention, response, and treatments.

“The goal of this new network is ambitious, but it can also play a vital role in health security: to give every country access to pathogen genomic sequencing and analytics as part of its public health system,” said WHO Director-General Tedros Adhanom Ghebreyesus, PhD, in the press release.  “As was so clearly demonstrated to us during the COVID-19 pandemic, the world is stronger when it stands together to fight shared health threats.”

Genomics capacity worldwide was scaled up during the pandemic, but the press release indicates that many countries still lack effective tools and systems for public health data collection and analysis. This lack of resources and funding could slow the development of a strong global health surveillance infrastructure, which IPSN aims to help address.

The network will bring together experts in genomics and data analytics to optimize routine disease surveillance, including for COVID-19. According to the press release, pathogen genomics-based analyses of the SARS-COV-2 virus helped speed the development of effective vaccines and the identification of more transmissible virus variants…(More)”.

Crime, inequality and public health: a survey of emerging trends in urban data science


Paper by Massimiliano Luca, Gian Maria Campedelli, Simone Centellegher, Michele Tizzoni, and Bruno Lepri: “Urban agglomerations are constantly and rapidly evolving ecosystems, with globalization and increasing urbanization posing new challenges in sustainable urban development well summarized in the United Nations’ Sustainable Development Goals (SDGs). The advent of the digital age generated by modern alternative data sources provides new tools to tackle these challenges with spatio-temporal scales that were previously unavailable with census statistics. In this review, we present how new digital data sources are employed to provide data-driven insights to study and track (i) urban crime and public safety; (ii) socioeconomic inequalities and segregation; and (iii) public health, with a particular focus on the city scale…(More)”.

Can Mobility of Care Be Identified From Transit Fare Card Data? A Case Study In Washington D.C.


Paper by Daniela Shuman, et al: “Studies in the literature have found significant differences in travel behavior by gender on public transit that are largely attributable to household and care responsibilities falling disproportionately on women. While the majority of studies have relied on survey and qualitative data to assess “mobility of care”, we propose a novel data-driven workflow utilizing transit fare card transactions, name-based gender inference, and geospatial analysis to identify mobility of care trip making. We find that the share of women travelers trip-chaining in the direct vicinity of mobility of care places of interest is 10% – 15% higher than men….(More)”.

How a small news site built an innovative data project to visualise the impact of climate change on Uruguay’s capital


Interview by Marina Adami: “La ciudad sumergida (The submerged city), an investigation produced by Uruguayan science and technology news site Amenaza Roboto, is one of the winners of this year’s Sigma Awards for data journalism. The project uses maps of the country’s capital, Montevideo, to create impressive visualisations of the impact sea level rises are predicted to have on the city and its infrastructure. The project is a first of its kind for Uruguay, a small South American country in which data journalism is still a novelty. It is also a good example of a way news outlets can investigate and communicate the disastrous effects of climate change in local communities. 

I spoke to Miguel Dobrich, a journalist, educator and digital entrepreneur who worked on the project together with colleagues Gabriel FaríasNatalie Aubet and Nahuel Lamas, to find out what lessons other outlets can take from this project and from Amenaza Roboto’s experiments with analysing public data, collaborating with scientists, and keeping the focus on their communities….(More)”

Global Data Stewardship


On-line Course by Stefaan G. Verhulst: “Creating a systematic and sustainable data access program is critical for data stewardship. What you do with your data, how you reuse it, and how you make it available to the general public can help others reimagine what’s possible for data sharing and cross-sector data collaboration. In this course, instructor Stefaan Verhulst shows you how to develop and manage data reuse initiatives as a competent and responsible global data steward.

Following the insights of current research and practical, real-world examples, learn about the growing importance of data stewardship, data supply, and data demand to understand the value proposition and societal case for data reuse. Get tips on designing and implementing data collaboration models, governance framework, and infrastructure, as well as best practices for measuring, sunsetting, and supporting data reuse initiatives. Upon completing this course, you’ll be ready to start pushing your new skill set and continue your data stewardship learning journey….(More)”