The National Cancer Institute Cancer Moonshot Public Access and Data Sharing Policy—Initial assessment and implications


Paper by Tammy M. Frisby and Jorge L. Contreras: “Since 2013, federal research-funding agencies have been required to develop and implement broad data sharing policies. Yet agencies today continue to grapple with the mechanisms necessary to enable the sharing of a wide range of data types, from genomic and other -omics data to clinical and pharmacological data to survey and qualitative data. In 2016, the National Cancer Institute (NCI) launched the ambitious $1.8 billion Cancer Moonshot Program, which included a new Public Access and Data Sharing (PADS) Policy applicable to funding applications submitted on or after October 1, 2017. The PADS Policy encourages the immediate public release of published research results and data and requires all Cancer Moonshot grant applicants to submit a PADS plan describing how they will meet these goals. We reviewed the PADS plans submitted with approximately half of all funded Cancer Moonshot grant applications in fiscal year 2018, and found that a majority did not address one or more elements required by the PADS Policy. Many such plans made no reference to the PADS Policy at all, and several referenced obsolete or outdated National Institutes of Health (NIH) policies instead. We believe that these omissions arose from a combination of insufficient education and outreach by NCI concerning its PADS Policy, both to potential grant applicants and among NCI’s program staff and external grant reviewers. We recommend that other research funding agencies heed these findings as they develop and roll out new data sharing policies….(More)”.

The Computermen


Podcast Episode by Jill Lepore: “In 1966, just as the foundations of the Internet were being imagined, the federal government considered building a National Data Center. It would be a centralized federal facility to hold computer records from each federal agency, in the same way that the Library of Congress holds books and the National Archives holds manuscripts. Proponents argued that it would help regulate and compile the vast quantities of data the government was collecting. Quickly, though, fears about privacy, government conspiracies, and government ineptitude buried the idea. But now, that National Data Center looks like a missed opportunity to create rules about data and privacy before the Internet took off. And in the absence of government action, corporations have made those rules themselves….(More)”.

Best Practices to Cover Ad Information Used for Research, Public Health, Law Enforcement & Other Uses


Press Release: “The Network Advertising Initiative (NAI) released privacy Best Practices for its members to follow if they use data collected for Tailored Advertising or Ad Delivery and Reporting for non-marketing purposes, such as sharing with research institutions, public health agencies, or law enforcement entities.

“Ad tech companies have data that can be a powerful resource for the public good if they follow this set of best practices for consumer privacy,” said Leigh Freund, NAI President and CEO. “During the COVID-19 pandemic, we’ve seen the opportunity for substantial public health benefits from sharing aggregate and de-identified location data.”

The NAI Code of Conduct – the industry’s premier self-regulatory framework for privacy, transparency, and consumer choice – covers data collected and used for Tailored Advertising or Ad Delivery and Reporting. The NAI Code has long addressed certain non-marketing uses of data collected for Tailored Advertising and Ad Delivery and Reporting by prohibiting any eligibility uses of such data, including uses for credit, insurance, healthcare, and employment decisions.

The NAI has always firmly believed that data collected for advertising purposes should not have a negative effect on consumers in their daily lives. However, over the past year, novel data uses have been introduced, especially during the recent health crisis. In the case of opted-in data such as Precise Location Information, a company may determine a user would benefit from more detailed disclosure in a just-in-time notice about non-marketing uses of the data being collected….(More)”.
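The aggregate, de-identified location sharing highlighted in the release can be illustrated with a short sketch. The following Python snippet is a hypothetical illustration, not NAI's actual methodology: the grid-cell size and the minimum-user threshold are invented values. It snaps precise coordinates to coarse cells and suppresses any cell with too few distinct users before counts are shared.

```python
# Minimal sketch (not NAI's methodology): aggregate precise location points
# into coarse grid cells and suppress cells with too few distinct users
# before sharing counts with a third party.
from collections import defaultdict

MIN_USERS_PER_CELL = 10   # suppression threshold; value is illustrative
CELL_SIZE_DEG = 0.01      # roughly 1 km grid cells; value is illustrative

def to_cell(lat, lon):
    """Snap a precise coordinate to a coarse grid-cell identifier."""
    return (round(lat / CELL_SIZE_DEG), round(lon / CELL_SIZE_DEG))

def aggregate(points):
    """points: iterable of (user_id, lat, lon). Returns shareable cell counts."""
    users_per_cell = defaultdict(set)
    for user_id, lat, lon in points:
        users_per_cell[to_cell(lat, lon)].add(user_id)
    # Only cells meeting the minimum-user threshold are released.
    return {cell: len(users) for cell, users in users_per_cell.items()
            if len(users) >= MIN_USERS_PER_CELL}

# Example: 12 users in one cell (released), 2 in another (suppressed).
sample = [(f"u{i}", 40.4401, -79.9959) for i in range(12)] + \
         [("u12", 40.5001, -80.0001), ("u13", 40.5002, -80.0002)]
print(aggregate(sample))
```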

How Facebook, Twitter and other data troves are revolutionizing social science


Heidi Ledford at Nature: “Elizaveta Sivak spent nearly a decade training as a sociologist. Then, in the middle of a research project, she realized that she needed to head back to school.

Sivak studies families and childhood at the National Research University Higher School of Economics in Moscow. In 2015, she studied the movements of adolescents by asking them in a series of interviews to recount ten places that they had visited in the past five days. A year later, she had analysed the data and was feeling frustrated by the narrowness of relying on individual interviews, when a colleague pointed her to a paper analysing data from the Copenhagen Networks Study, a ground-breaking project that tracked the social-media contacts, demographics and location of about 1,000 students, with five-minute resolution, over five months [1]. She knew then that her field was about to change. “I realized that these new kinds of data will revolutionize social science forever,” she says. “And I thought that it’s really cool.”

With that, Sivak decided to learn how to program, and join the revolution. Now, she and other computational social scientists are exploring massive and unruly data sets, extracting meaning from society’s digital imprint. They are tracking people’s online activities; exploring digitized books and historical documents; interpreting data from wearable sensors that record a person’s every step and contact; conducting online surveys and experiments that collect millions of data points; and probing databases that are so large that they will yield secrets about society only with the help of sophisticated data analysis.

Over the past decade, researchers have used such techniques to pick apart topics that social scientists have chased for more than a century: from the psychological underpinnings of human morality, to the influence of misinformation, to the factors that make some artists more successful than others. One study uncovered widespread racism in algorithms that inform health-care decisions [2]; another used mobile-phone data to map impoverished regions in Rwanda [3].

“The biggest achievement is a shift in thinking about digital behavioural data as an interesting and useful source”, says Markus Strohmaier, a computational social scientist at the GESIS Leibniz Institute for the Social Sciences in Cologne, Germany.

Not everyone has embraced that shift. Some social scientists are concerned that the computer scientists flooding into the field with ambitions as big as their data sets are not sufficiently familiar with previous research. Another complaint is that some computational researchers look only at patterns and do not consider the causes, or that they draw weighty conclusions from incomplete and messy data — often gained from social-media platforms and other sources that are lacking in data hygiene.

The barbs fly both ways. Some computational social scientists who hail from fields such as physics and engineering argue that many social-science theories are too nebulous or poorly defined to be tested.

This all amounts to “a power struggle within the social-science camp”, says Marc Keuschnigg, an analytical sociologist at Linköping University in Norrköping, Sweden. “Who in the end succeeds will claim the label of the social sciences.”

But the two camps are starting to merge. “The intersection of computational social science with traditional social science is growing,” says Keuschnigg, pointing to the boom in shared journals, conferences and study programmes. “The mutual respect is growing, also.”…(More)”.

Gender gaps in urban mobility


Paper by Laetitia Gauvin, Michele Tizzoni, Simone Piaggesi, Andrew Young, Natalia Adler, Stefaan Verhulst, Leo Ferres & Ciro Cattuto in Humanities and Social Sciences Communications: “Mobile phone data have been extensively used to study urban mobility. However, studies based on gender-disaggregated large-scale data are still lacking, limiting our understanding of gendered aspects of urban mobility and our ability to design policies for gender equality. Here we study urban mobility from a gendered perspective, combining commercial and open datasets for the city of Santiago, Chile.

We analyze call detail records for a large cohort of anonymized mobile phone users and reveal a gender gap in mobility: women visit fewer unique locations than men, and distribute their time less equally among such locations. Mapping this mobility gap over administrative divisions, we observe that a wider gap is associated with lower income and lack of public and private transportation options. Our results uncover a complex interplay between gendered mobility patterns, socio-economic factors and urban affordances, calling for further research and providing insights for policymakers and urban planners….(More)”.
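To make the reported mobility gap concrete: one common way to quantify it from call detail records is to count the distinct locations each user appears at and to compute the entropy of their visit distribution (higher entropy means time is spread more evenly across places). The sketch below is a toy illustration under assumed column names (user_id, gender, tower_id), not the authors' pipeline.

```python
# Minimal sketch: per-user mobility diversity from call-detail-record-like data.
# Column names (user_id, gender, tower_id) are illustrative, not from the paper.
from collections import Counter
import math

records = [
    # (user_id, gender, tower_id) -- one row per call/observation
    ("u1", "F", "A"), ("u1", "F", "A"), ("u1", "F", "B"),
    ("u2", "M", "A"), ("u2", "M", "B"), ("u2", "M", "C"), ("u2", "M", "D"),
]

def mobility_stats(rows):
    """Return {user_id: (gender, n_unique_locations, visit_entropy)}."""
    per_user = {}
    for user, gender, tower in rows:
        per_user.setdefault(user, (gender, Counter()))[1][tower] += 1
    stats = {}
    for user, (gender, counts) in per_user.items():
        total = sum(counts.values())
        # Shannon entropy of the visit distribution: higher = time spread
        # more evenly across locations.
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        stats[user] = (gender, len(counts), entropy)
    return stats

for user, (gender, n_loc, ent) in mobility_stats(records).items():
    print(user, gender, n_loc, round(ent, 3))
```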

Why local data is the key to successful place making


Blog by Sally Kerr: “The COVID emergency has brought many challenges that were unimaginable a few months ago. The first priorities were safety and health, but when lockdown started one of the early issues was accessing and sharing local data to help everyone deal with and live through the emergency. Communities grappled with the scarcity of local data, finding it difficult to source for some services, food deliveries and goods. This was not a new issue, but the pandemic brought it into sharp relief.

Local data use covers a broad spectrum. People moving to a new area want information about the environment — schools, amenities, transport, crime rates and local health. For residents, continuing knowledge of business opening hours, events, local issues, council plans and roadworks remains important, not only for everyday living but to help understand issues and future plans that will change their environment. Really local data (hyperlocal data) is either fragmented or unavailable, making it difficult for local people to stay informed, whilst larger data sets about an area (e.g. population, school performance) are not always easy to understand or use. They sit in silos owned by different sectors, on disparate websites, usually collated for professional or research use.

Third sector organisations in a community will gather data relevant to their work such as contacts and event numbers but may not source wider data sets about the area, such as demographics, to improve their work. Using this data could strengthen future grant applications by validating their work. For Government or Health bodies carrying out place making community projects, there is a reliance on their own or national data sources supplemented with qualitative data snapshots. Their dependence on tried and tested sources is due to time and resource pressures but means there is no time to gather that rich seam of local data that profiles individual needs.

Imagine a future community where local data is collected and managed together for both official organisations and the community itself, with shared aims and varied uses. Current and relevant data would be accessible and easy to understand, provided in formats that suit the user — from data scientist to school child. A curated data hub would help citizens learn data skills and carry out collaborative projects on anything from air quality to local biodiversity, managing the data and offering increased insight and useful validation for wider decision making. Costs would be reduced as duplication and wasted effort were cut….(More)”.

Laying the Foundation for Effective Partnerships: An Examination of Data Sharing Agreements


Paper by Hayden Dahmm: “In the midst of the COVID-19 pandemic, data has never been more salient. COVID has generated new data demands and increased cross-sector data collaboration. Yet, these data collaborations require careful planning and evaluation of risks and opportunities, especially when sharing sensitive data. Data sharing agreements (DSAs) are written agreements that establish the terms for how data are shared between parties and are important for establishing accountability and trust.

However, negotiating DSAs is often time consuming, and collaborators lacking legal or financial capacity are disadvantaged. Contracts for Data Collaboration (C4DC) is a joint initiative between SDSN TReNDS, NYU’s GovLab, the World Economic Forum, and the University of Washington, working to strengthen trust and transparency of data collaboratives. The partners have created an online library of DSAs which represents a selection of data applications and contexts.

This report introduces C4DC and its DSA library. We demonstrate how the library can support the data community to strengthen future data collaborations by showcasing various DSA applications and key considerations. First, we explain our method of analyzing the agreements and consider how six major issues are addressed by different agreements in the library. Key issues discussed include data use, access, breaches, proprietary issues, publicization of the analysis, and deletion of data upon termination of the agreement. For each of these issues, we describe approaches illustrated with examples from the library. While our analysis suggests some pertinent issues are regularly not addressed in DSAs, we have identified common areas of practice that may be helpful for entities negotiating partnership agreements to consider in the future….(More)”.

Sector-Specific (Data-) Access Regimes of Competitors


Paper by Jörg Hoffmann: “The expected economic and social benefits of data access and sharing are enormous. And yet, particularly in the B2B context, data sharing of privately held data between companies has not taken off at efficient scale. This has already led to the adoption of sector-specific data governance and access regimes. Two of these regimes are enshrined in the PSD2, which introduced an access-to-account rule and a data portability rule for specific account information for third-party payment providers.

This paper analyses these sector-specific access and portability regimes and identifies regulatory shortcomings that should be addressed and can serve as guidance for further data access regulation. It first develops regulatory guidelines built around the multiple regulatory dimensions of data and the potential adverse effects that may be created by overly broad data access regimes.

In this regard the paper assesses the role of factual data exclusivity for data-driven innovation incentives for undertakings, the role of industrial policy-driven market regulation within the principle of a free market economy, the impact of data sharing on consumer sovereignty and choice, and ultimately data-induced distortions of competition. It develops the findings by taking recourse to basic IP and information economics and the EU competition law case law pertaining to refusal-to-supply cases, the rise of ‘surveillance capitalism’ and to current competition policy considerations with regard to the envisioned preventive competition control regime tackling data-rich ‘undertakings of paramount importance for competition across markets’ in Germany. This is then followed by an analysis of the PSD2 access and portability regimes in light of the regulatory principles….(More)”.

How data analysis helped Mozambique stem a cholera outbreak


Andrew Jack at the Financial Times: “When Mozambique was hit by two cyclones in rapid succession last year — causing death and destruction from a natural disaster on a scale not seen in Africa for a generation — government officials added an unusual recruit to their relief efforts. Apart from the usual humanitarian and health agencies, the National Health Institute also turned to Zenysis, a Silicon Valley start-up.

As the UN and non-governmental organisations helped to rebuild lives and tackle outbreaks of disease including cholera, Zenysis began gathering and analysing large volumes of disparate data. “When we arrived, there were 400 new cases of cholera a day and they were doubling every 24 hours,” says Jonathan Stambolis, the company’s chief executive. “None of the data was shared [between agencies]. Our software harmonised and integrated fragmented sources to produce a coherent picture of the outbreak, the health system’s ability to respond and the resources available.

“Three and a half weeks later, they were able to get infections down to zero in most affected provinces,” he adds. The government attributed that achievement to the availability of high-quality data to brief the public and international partners.

“They co-ordinated the response in a way that drove infections down,” he says. Zenysis formed part of a “virtual control room”, integrating information to help decision makers understand what was happening in the worst hit areas, identify sources of water contamination and where to prioritise cholera vaccinations.

It supported an “mAlert system”, which integrated health surveillance data into a single platform for analysis. The output was daily reports distilled from data issued by health facilities and accommodation centres in affected areas, disease monitoring and surveillance from laboratory testing….(More)”.
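The harmonisation step described in the excerpt, pulling fragmented facility and camp reports into one daily picture per province, can be sketched in a few lines. The snippet below is a simplified, hypothetical illustration (field names and figures are invented), not Zenysis's software.

```python
# Minimal sketch (not Zenysis's software): harmonize case reports that
# arrive from different sources with different field names into one
# daily count of cholera cases per province.
from collections import defaultdict

# Field names and sample rows are illustrative only.
facility_feed = [
    {"date": "2019-04-01", "province": "Sofala", "cholera_cases": 120},
    {"date": "2019-04-01", "province": "Sofala", "cholera_cases": 95},
]
camp_feed = [
    {"report_date": "2019-04-01", "region": "Sofala", "suspected_cases": 60},
]

def normalize(row, date_key, place_key, count_key):
    """Map a source-specific row onto a common schema."""
    return {"date": row[date_key], "province": row[place_key],
            "cases": int(row[count_key])}

def harmonize(feeds):
    """feeds: list of (rows, date_key, place_key, count_key). Returns daily totals."""
    totals = defaultdict(int)
    for rows, date_key, place_key, count_key in feeds:
        for row in rows:
            rec = normalize(row, date_key, place_key, count_key)
            totals[(rec["date"], rec["province"])] += rec["cases"]
    return dict(totals)

print(harmonize([
    (facility_feed, "date", "province", "cholera_cases"),
    (camp_feed, "report_date", "region", "suspected_cases"),
]))
# -> {('2019-04-01', 'Sofala'): 275}
```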

Practical Synthetic Data Generation: Balancing Privacy and the Broad Availability of Data


Book by Khaled El Emam, Lucy Mosquera, and Richard Hoptroff: “Building and testing machine learning models requires access to large and diverse data. But where can you find usable datasets without running into privacy issues? This practical book introduces techniques for generating synthetic data—fake data generated from real data—so you can perform secondary analysis to do research, understand customer behaviors, develop new products, or generate new revenue.

Data scientists will learn how synthetic data generation provides a way to make such data broadly available for secondary purposes while addressing many privacy concerns. Analysts will learn the principles and steps for generating synthetic data from real datasets. And business leaders will see how synthetic data can help accelerate time to a product or solution.

This book describes:

  • Steps for generating synthetic data using multivariate normal distributions
  • Methods for distribution fitting covering different goodness-of-fit metrics
  • How to replicate the simple structure of original data
  • An approach for modeling data structure to consider complex relationships
  • Multiple approaches and metrics you can use to assess data utility
  • How analysis performed on real data can be replicated with synthetic data
  • Privacy implications of synthetic data and methods to assess identity disclosure…(More)”.
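As a concrete illustration of the first bullet above, the sketch below fits a multivariate normal to a numeric dataset, samples synthetic rows from it, and compares marginal means and correlations as a crude utility check. It is a minimal example under simplifying assumptions, not code from the book.

```python
# Minimal sketch (not from the book): fit a multivariate normal to numeric
# data and draw synthetic rows from it, then compare marginal statistics.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for "real" data: 500 rows, 3 correlated numeric columns.
true_cov = np.array([[1.0, 0.6, 0.2],
                     [0.6, 1.5, 0.4],
                     [0.2, 0.4, 2.0]])
real = rng.multivariate_normal(mean=[10.0, 50.0, 3.0], cov=true_cov, size=500)

# Step 1: estimate the joint distribution from the real data.
mu_hat = real.mean(axis=0)
cov_hat = np.cov(real, rowvar=False)

# Step 2: generate synthetic rows from the fitted distribution.
synthetic = rng.multivariate_normal(mean=mu_hat, cov=cov_hat, size=500)

# Step 3: crude utility check -- do marginal means and correlations match?
print("real means     :", real.mean(axis=0).round(2))
print("synthetic means:", synthetic.mean(axis=0).round(2))
print("real corr      :\n", np.corrcoef(real, rowvar=False).round(2))
print("synthetic corr :\n", np.corrcoef(synthetic, rowvar=False).round(2))
```

Real datasets rarely look multivariate normal, which is why the book goes on to cover distribution fitting, goodness-of-fit metrics, and richer models of data structure and utility assessment.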