The State of Open Humanitarian Data


Report by Centre for Humanitarian Data: “The goal of this report is to increase awareness of the data available for humanitarian response activities and to highlight what is missing, as measured through OCHA’s Humanitarian Data Exchange (HDX) platform. We want to recognize the valuable and long-standing contributions of data-sharing organizations. We also want to be more targeted in our outreach on what data is required to understand crises so that new actors might be compelled to join the platform. Data is not an end in itself but a critical ingredient to the analysis that informs decision making. With nearly 168 million people in need of humanitarian assistance in 2020 — the highest figure in decades — there is no time, or data, to lose…(More)”.

Making Public Transit Fairer to Women Demands Way More Data


Flavie Halais at Wired: “Public transportation is sexist. This may be unintentional or implicit, but it’s also easy to see. Women around the world do more care and domestic work than men, and their resulting mobility habits are hobbled by most transport systems. The demands of running errands and caring for children and other family members mean repeatedly getting on and off the bus, and paying more fares as a result. Strollers and shopping bags make travel cumbersome. A 2018 study of New Yorkers found women were harassed on the subway far more frequently than men were, and as a result paid more money to avoid transit in favor of taxis and ride-hail….

What is not measured is not known, and the world of transit data is still largely blind to women and other vulnerable populations. Getting that data, though, isn’t easy. Traditional sources like national censuses and user surveys provide reliable information that serves as the basis for policies and decision-making. But surveys are costly to run, and it can take years for a government to go through the process of adding a question to its national census.

Before pouring resources into costly data collection to find answers about women’s transport needs, cities could first turn to the trove of unconventional gender-disaggregated data that’s already produced. They include data exhaust, or the trail of data we leave behind as a result of our interactions with digital products and services like mobile phones, credit cards, and social media. Last year, researchers in Santiago, Chile, released a report based on their parsing of anonymized call detail records of female mobile phone users, to extract location information and analyze their mobility patterns. They found that women tended to travel to fewer locations than men, and within smaller geographical areas. When researchers cross-referenced location information with census data, they found a higher gender gap among lower-income residents, as poorer women made even shorter trips. And when using data from the local transit agency, they saw that living close to a public transit stop increased mobility for both men and women, but didn’t close the gender gap for poorer residents.

To encourage private companies to share such info, Stefaan Verhulst advocates for data collaboratives, flexible partnerships between data providers and researchers. Verhulst is the head of research and development at GovLab, a research center at New York University that contributed to the research in Santiago. And that’s how GovLab and its local research partner, Universidad del Desarrollo, got access to the phone records owned by the Chilean phone company, Telefónica. Data collaboratives can enhance access to private data without exposing companies to competition or privacy concerns. “We need to find ways to access data according to different shades of openness,” Verhulst says….(More)”.
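
The two headline measures in a study like the Santiago analysis are simple to state: how many distinct places a person visits, and how widely their movements spread around their center of activity (the "radius of gyration"). The Python sketch below shows one plausible way to compute both from anonymized call detail records; the input format and field names are assumptions for illustration, not the researchers' actual pipeline.

```python
import math
from collections import defaultdict

def mobility_metrics(records):
    """Distinct-location count and radius of gyration per (anonymized) user.

    `records` is assumed to be an iterable of (user_id, lat, lon) tuples,
    one per phone interaction observed at a cell tower.
    """
    positions = defaultdict(list)
    for user_id, lat, lon in records:
        positions[user_id].append((lat, lon))

    metrics = {}
    for user_id, points in positions.items():
        # Center of mass of the user's observed positions.
        clat = sum(p[0] for p in points) / len(points)
        clon = sum(p[1] for p in points) / len(points)
        # Radius of gyration: root-mean-square distance from that center,
        # a standard proxy for the size of a person's activity space
        # (left in degrees here; real studies convert to kilometers).
        rg = math.sqrt(sum((p[0] - clat) ** 2 + (p[1] - clon) ** 2
                           for p in points) / len(points))
        metrics[user_id] = {"distinct_locations": len(set(points)),
                            "radius_of_gyration": rg}
    return metrics
```

Comparing the distributions of these two numbers across gender and income groups, joined against census data, is what turns "women travel to fewer places, within smaller areas" into a measurable claim.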

The promise and perils of big gender data


Essay by Bapu Vaitla, Stefaan Verhulst, Linus Bengtsson, Marta C. González, Rebecca Furst-Nichols & Emily Courey Pryor in the Special Issue on Big Data of Nature Medicine: “Women and girls are legally and socially marginalized in many countries. As a result, policymakers neglect key gendered issues such as informal labor markets, domestic violence, and mental health [1]. The scientific community can help push such topics onto policy agendas, but science itself is riven by inequality: women are underrepresented in academia, and gendered research is rarely a priority of funding agencies.

However, the critical importance of better gender data for societal well-being is clear. Mental health is a particularly striking example. Estimates from the Global Burden of Disease database suggest that depressive and anxiety disorders are the second leading cause of morbidity among females between 10 and 63 years of age [2]. But little is known about the risk factors that contribute to mental illness among specific groups of women and girls, the challenges of seeking care for depression and anxiety, or the long-term consequences of undiagnosed and untreated illness. A lack of data similarly impedes policy action on domestic and intimate-partner violence, early marriage, and sexual harassment, among many other topics.

‘Big data’ can help fill that gap. The massive amounts of information passively generated by electronic devices represent a rich portrait of human life, capturing where people go, the decisions they make, and how they respond to changes in their socio-economic environment. For example, mobile-phone data allow better understanding of health-seeking behavior as well as the dynamics of infectious-disease transmission [3]. Social-media platforms generate the world’s largest database of thoughts and emotions—information that, if leveraged responsibly, can be used to infer gendered patterns of mental health [4]. Remote sensors, especially satellites, can be used in conjunction with traditional data sources to increase the spatial and temporal granularity of data on women’s economic activity and health status [5].

But the risk of gendered algorithmic bias is a serious obstacle to the responsible use of big data. Data are not value free; they reproduce the conscious and unconscious attitudes held by researchers, programmers, and institutions. Consider, for example, the training datasets on which the interpretation of big data depends. Training datasets establish the association between two or more directly observed phenomena of interest—for example, the mental health of a platform user (typically collected through a diagnostic survey) and the semantic content of the user’s social-media posts. These associations are then used to develop algorithms that interpret big data streams. In the example here, the (directly unobserved) mental health of a large population of social-media users would be inferred from their observed posts….(More)”.
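
To make the training-dataset pattern described in the excerpt concrete, here is a minimal sketch, assuming scikit-learn: survey-derived labels are paired with users' post text to fit a classifier, which is then applied to people whose mental health was never directly observed. The posts and labels are hypothetical stand-ins for consented, de-identified study data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Directly observed phenomena: post text plus a diagnostic-survey result
# (1 = screened positive for depression, 0 = did not). Hypothetical data.
posts = [
    "can't sleep again, everything feels pointless lately",
    "great run this morning, signing up for the 10k",
    "skipped class all week, no energy to see anyone",
    "made dinner for friends, best evening in ages",
]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, labels)

# The fitted model is then used to infer the (directly unobserved)
# mental health of a much larger population from posts alone, which is
# precisely where any gender skew in the training sample propagates
# into the downstream inferences.
print(model.predict(["haven't left the house in days"]))
```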

Data as infrastructure? A study of data sharing legal regimes


Paper by Charlotte Ducuing: “The article discusses the concept of infrastructure in the digital environment, through a study of three data sharing legal regimes: the Public Sector Information Directive (PSI Directive), the discussions on in-vehicle data governance and the freshly adopted data sharing legal regime in the Electricity Directive.

While aiming to contribute to the scholarship on data governance, the article deliberately focuses on network industries. Characterised by the existence of physical infrastructure, they have a special relationship to digitisation and ‘platformisation’ and are exposed to specific risks. Adopting an explanatory methodology, the article shows that these regimes are based on two close but distinct sources of inspiration, which remain intertwined and unclear. By targeting entities deemed ‘monopolist’ with regard to the data they create and hold, data sharing obligations are inspired by competition law and especially the essential facility doctrine. On the other hand, beneficiaries appear to include both operators in related markets needing data to conduct their business (except for the PSI Directive), and third parties at large to foster innovation. The latter rationale illustrates what is called here a purposive view of data as infrastructure. The underlying understanding of ‘raw’ data (management) as infrastructure for all to use may run counter to the ability of the regulated entities to get a fair remuneration for ‘their’ data.

Finally, the article pleads for more granularity when mandating data sharing obligations depending upon the purpose. Shifting away from a ‘one-size-fits-all’ solution, the regulation of data could also extend to the ensuing context-specific data governance regime, subject to further research…(More)”.

Paging Dr. Google: How the Tech Giant Is Laying Claim to Health Data


Wall Street Journal: “Roughly a year ago, Google offered health-data company Cerner Corp. an unusually rich proposal.

Cerner was interviewing Silicon Valley giants to pick a storage provider for 250 million health records, one of the largest collections of U.S. patient data. Google dispatched former chief executive Eric Schmidt to personally pitch Cerner over several phone calls and offered around $250 million in discounts and incentives, people familiar with the matter say. 

Google had a bigger goal in pushing for the deal than dollars and cents: a way to expand its effort to collect, analyze and aggregate health data on millions of Americans. Google representatives were vague in answering questions about how Cerner’s data would be used, making the health-care company’s executives wary, the people say. Eventually, Cerner struck a storage deal with Amazon.com Inc. instead.

The failed Cerner deal reveals an emerging challenge to Google’s move into health care: gaining the trust of health care partners and the public. So far, that has hardly slowed the search giant.

Google has struck partnerships with some of the country’s largest hospital systems and most-renowned health-care providers, many of them vast in scope and few of their details previously reported. In just a few years, the company has achieved the ability to view or analyze tens of millions of patient health records in at least three-quarters of U.S. states, according to a Wall Street Journal analysis of contractual agreements. 

In certain instances, the deals allow Google to access personally identifiable health information without the knowledge of patients or doctors. The company can review complete health records, including names, dates of birth, medications and other ailments, according to people familiar with the deals.

The prospect of tech giants’ amassing huge troves of health records has raised concerns among lawmakers, patients and doctors, who fear such intimate data could be used without individuals’ knowledge or permission, or in ways they might not anticipate. 

Google is developing a search tool, similar to its flagship search engine, in which patient information is stored, collated and analyzed by the company’s engineers, on its own servers. The portal is designed for use by doctors and nurses, and eventually perhaps patients themselves, though some Google staffers would have access sooner. 

Google executives and some health systems say that detailed data sharing has the potential to improve health outcomes. Large troves of data help fuel algorithms Google is creating to detect lung cancer, eye disease and kidney injuries. Hospital executives have long sought better electronic record systems to reduce error rates and cut down on paperwork….

Legally, the information gathered by Google can be used for purposes beyond diagnosing illnesses, under laws enacted during the dial-up era. U.S. federal privacy laws make it possible for health-care providers, with little or no input from patients, to share data with certain outside companies. That applies to partners, like Google, with significant presences outside health care. The company says its intentions in health are unconnected with its advertising business, which depends largely on data it has collected on users of its many services, including email and maps.

Medical information is perhaps the last bounty of personal data yet to be scooped up by technology companies. The health data-gathering efforts of other tech giants such as Amazon and International Business Machines Corp. face skepticism from physician and patient advocates. But Google’s push in particular has set off alarm bells in the industry, including over privacy concerns. U.S. senators, as well as health-industry executives, are questioning Google’s expansion and its potential for commercializing personal data….(More)”.

Global Fishing Watch: Pooling Data and Expertise to Combat Illegal Fishing


Data Collaborative Case Study by Michelle Winowatan, Andrew Young, and Stefaan Verhulst: “Global Fishing Watch, originally set up through a collaboration between Oceana, SkyTruth and Google, is an independent nonprofit organization dedicated to advancing responsible stewardship of our oceans through increased transparency in fishing activity and scientific research. Using big data processing and machine learning, Global Fishing Watch visualizes, tracks, and shares data about global fishing activity in near-real time and for free via their public map. To date, the platform tracks approximately 65,000 commercial fishing vessels globally. These insights have been used in a number of academic publications, ocean advocacy efforts, and law enforcement activities.

Data Collaborative Model: Based on the typology of data collaborative practice areas, Global Fishing Watch is an example of the data pooling model of data collaboration, specifically a public data pool. Public data pools co-mingle data assets from multiple data holders — including governments and companies — and make those shared assets available on the web. This approach enabled the data stewards and stakeholders involved in Global Fishing Watch to bring together multiple data streams from both public- and private-sector entities in a single location. This single point of access provides the public and relevant authorities with user-friendly access to actionable, previously fragmented data that can drive efforts to address compliance in fisheries and illegal fishing around the world.

Data Stewardship Approach: Global Fishing Watch also provides a clear illustration of the importance of data stewards. For instance, representatives from Google Earth Outreach, one of the data holders, played an important stewardship role in seeking to connect and coordinate with SkyTruth and Oceana, two important nonprofit environmental actors who were working separately prior to this initiative. The brokering of this partnership helped to bring relevant data assets from the public and private sectors to bear in support of institutional efforts to address the stubborn challenge of illegal fishing.

Read the full case study here.”
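
Global Fishing Watch's models learn to recognize fishing behavior from labeled AIS vessel tracks. As rough intuition for what such a classifier keys on, the Python sketch below flags the slow, meandering movement typical of many fishing gears; it is a deliberately simplified heuristic, not the organization's actual model, and the `AISPoint` structure is an assumption for this sketch.

```python
from dataclasses import dataclass

@dataclass
class AISPoint:
    timestamp: float    # Unix seconds, from the AIS broadcast
    speed_knots: float  # speed over ground
    course_deg: float   # course over ground, 0-360

def likely_fishing(track: list[AISPoint]) -> bool:
    """Heuristic flag for probable fishing activity on a track segment."""
    if len(track) < 2:
        return False
    avg_speed = sum(p.speed_knots for p in track) / len(track)
    # Mean absolute course change between consecutive position fixes,
    # taking the shorter way around the compass.
    turns = [min(abs(b.course_deg - a.course_deg),
                 360 - abs(b.course_deg - a.course_deg))
             for a, b in zip(track, track[1:])]
    avg_turn = sum(turns) / len(turns)
    # Trawling and longlining tend to show low speed with frequent
    # turning, unlike straight-line transit at cruising speed.
    return avg_speed < 5.0 and avg_turn > 20.0
```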

Trusted smart statistics: Motivations and principles


Paper by Fabio Ricciato et al.: “In this contribution we outline the concept of Trusted Smart Statistics as the natural evolution of official statistics in the new datafied world. Traditional data sources, namely survey and administrative data, represent nowadays a valuable but small portion of the global data stock, much of which is held in the private sector. The availability of new data sources is only one aspect of the global change that concerns official statistics. Other aspects, more subtle but no less important, include the changes in perceptions, expectations, behaviours and relations between the stakeholders. The environment around official statistics has changed: statistical offices are no longer data monopolists, but one prominent species among many others in a larger (and complex) ecosystem. What was established in the traditional world of legacy data sources (in terms of regulations, technologies, practices, etc.) is no longer guaranteed to be sufficient with new data sources.

Trusted Smart Statistics is not about replacing existing sources and processes, but augmenting them with new ones. Such augmentation however will not be only incremental: the path towards Trusted Smart Statistics is not about tweaking some components of the legacy system but about building an entirely new system that will coexist with the legacy one. In this position paper we outline some key design principles for the new Trusted Smart Statistics system. Taken collectively they picture a system where the smart and trust aspects enable and reinforce each other. A system that is more extrovert towards external stakeholders (citizens, private companies, public authorities) with whom Statistical Offices will be sharing computation, control, code, logs and of course final statistics, without necessarily sharing the raw input data….(More)”.
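
The phrase "sharing computation ... without necessarily sharing the raw input data" is the crux. As a toy illustration, under schematic assumptions rather than the architecture the paper specifies, the query below could be pushed by a statistical office to a data holder, run inside the holder's environment, and return only a suppressed aggregate:

```python
def run_at_data_holder(records, min_count=50):
    """Runs inside the data holder's environment; raw records never leave.

    `records` is assumed to be a list of dicts like
    {"user": ..., "events_last_week": int}, held by, say, a mobile
    network operator. Both names are illustrative.
    """
    total = len(records)
    # Suppress small cells so the published figure cannot single out
    # individuals.
    if total < min_count:
        return None
    active = sum(1 for r in records if r["events_last_week"] > 0)
    return {"population": total, "share_active": active / total}

# Only this aggregate, together with the code itself and execution logs
# for auditability, crosses the trust boundary back to the statistical
# office.
```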

Towards adaptive governance in big data health research: implementing regulatory principles


Chapter by Alessandro Blasimme and Effy Vayena: “While data-enabled health care systems are in their infancy, biomedical research is rapidly adopting the big data paradigm. Digital epidemiology, for example, already employs data generated outside the public health care system – that is, data generated without the intent of using them for epidemiological research – to understand and prevent patterns of diseases in populations (Salathé 2018). Precision medicine – pooling together genomic, environmental and lifestyle data – also represents a prominent example of how data integration can drive both fundamental and translational research in important medical domains such as oncology (D. C. Collins et al. 2017). All of this requires the collection, storage, analysis and distribution of massive amounts of personal information as well as the use of state-of-the-art data analytics tools to uncover health- and disease-related patterns.

The realization of the potential of big data in health evokes a necessary commitment to a sense of “continuity” articulated in three distinct ways: a) from data generation to use (as in the data-enabled learning health care); b) from research to clinical practice, e.g. the discovery of new mutations in the context of diagnostics; c) from health data strictly speaking (Vayena and Gasser 2016), e.g. clinical records, to data less clearly so, e.g. tweets used in digital epidemiology. These continuities face the challenge of regulatory and governance approaches that were designed for clear data taxonomies, for a less blurred boundary between research and clinical practice, and for rules that focused mostly on data generation and less on their eventual and multiple uses.

The result is significant uncertainty about how responsible use of such large amounts of sensitive personal data could be fostered. In this chapter we focus on the uncertainties surrounding the use of biomedical big data in the context of health research. Are new criteria needed to review biomedical big data research projects? Do current mechanisms, such as informed consent, offer sufficient protection to research participants’ autonomy and privacy in this new context? Do existing oversight mechanisms ensure transparency and accountability in data access and sharing? What monitoring tools are available to assess how personal data are used over time? Is the equitable distribution of benefits accruing from such data uses considered, or can it be ensured? How is the public being involved – if at all – with decisions about creating and using large data repositories for research purposes? What is the role that IT (information technology) players, and especially big ones, acquire in research? And what regulatory instruments do we have to ensure that such players do not undermine the independence of research?…(More)”.

Official Statistics 4.0: Verified Facts for People in the 21st Century


Book by Walter J. Radermacher: “This book explores official statistics and their social function in modern societies. Digitisation and globalisation are creating completely new opportunities and risks, a context in which facts (can) play an enormously important part if they are produced with a quality that makes them credible and purpose-specific. In order for this to actually happen, official statistics must continue to actively pursue the modernisation of their working methods.

This book is not about the technical and methodological challenges associated with digitisation and globalisation; rather, it focuses on statistical sociology, which scientifically deals with the peculiarities and pitfalls of governing-by-numbers, and assigns statistics a suitable position in the future informational ecosystem. Further, the book provides a comprehensive overview of modern issues in official statistics, embodied in a historical and conceptual framework that endows it with different and innovative perspectives. Central to this work is the quality of statistical information provided by official statistics. The implementation of the UN Sustainable Development Goals in the form of indicators is another driving force in the search for answers, and is addressed here….(More)”

Accelerating Medicines Partnership (AMP): Improving Drug Research Efficiency through Biomarker Data Sharing


Data Collaborative Case Study by Michelle Winowatan, Andrew Young, and Stefaan Verhulst: “Accelerating Medicines Partnership (AMP) is a cross-sector data-sharing partnership in the United States between the National Institutes of Health (NIH), the Food and Drug Administration (FDA), multiple biopharmaceutical and life science companies, and non-profit organizations. It seeks to improve the efficiency of developing new diagnostics and treatments for several types of disease. To achieve this goal, the partnership created a pre-competitive collaborative ecosystem where the biomedical community can pool data and resources that are relevant to the prioritized disease areas. A key component of the partnership is making biomarker data available to the medical research community through online portals.

Data Collaboratives Model: Based on our typology of data collaborative models, AMP is an example of the data pooling model of data collaboration, specifically a public data pool. Public data pools co-mingle data assets from multiple data holders — in this case pharmaceutical companies — and make those shared assets available on the web. Pools often limit contributions to approved partners (as public data pools are not crowdsourcing efforts), but access to the shared assets is open, enabling independent re-uses.

Data Stewardship Approach: Data stewardship is built into the partnership through the establishment of an executive committee, which governs the entire partnership, and a steering committee for each disease area, which governs each of the sub-projects within AMP. These committees consist of representatives from the institutional partners involved in AMP and perform data steward functions, including enabling inter-institutional engagement and intra-institutional coordination, auditing data and assessing value and risk, communicating findings, and nurturing the collaboration toward sustainability….(Full Case Study)”.