Sector-Specific (Data-) Access Regimes of Competitors


Paper by Jörg Hoffmann: “The expected economic and social benefits of data access and sharing are enormous. And yet, particularly in the B2B context, data sharing of privately held data between companies has not taken off at efficient scale. This has already led to the adoption of sector-specific data governance and access regimes. Two of these regimes are enshrined in the PSD2, which introduced an access-to-account rule and a data portability rule for specific account information for third-party payment providers.

This paper analyses these sector-specific access and portability regimes and identifies regulatory shortcomings that should be addressed and can serve as guidance for further data access regulation. It first develops regulatory guidelines built around the multiple regulatory dimensions of data and the potential adverse effects that overly broad data access regimes may create.

In this regard, the paper assesses the role of factual data exclusivity for data-driven innovation incentives for undertakings, the role of industrial policy-driven market regulation within the principle of a free market economy, the impact of data sharing on consumer sovereignty and choice, and ultimately data-induced distortions of competition. It develops the findings by taking recourse to basic IP and information economics and the EU competition law case law pertaining to refusal-to-supply cases, the rise of ‘surveillance capitalism’ and to current competition policy considerations with regard to the envisioned preventive competition control regime tackling data-rich ‘undertakings of paramount importance for competition across markets’ in Germany. This is then followed by an analysis of the PSD2 access and portability regimes in light of the regulatory principles…(More)”.

How data analysis helped Mozambique stem a cholera outbreak


Andrew Jack at the Financial Times: “When Mozambique was hit by two cyclones in rapid succession last year — causing death and destruction from a natural disaster on a scale not seen in Africa for a generation — government officials added an unusual recruit to their relief efforts. Apart from the usual humanitarian and health agencies, the National Health Institute also turned to Zenysis, a Silicon Valley start-up.

As the UN and non-governmental organisations helped to rebuild lives and tackle outbreaks of disease including cholera, Zenysis began gathering and analysing large volumes of disparate data. “When we arrived, there were 400 new cases of cholera a day and they were doubling every 24 hours,” says Jonathan Stambolis, the company’s chief executive. “None of the data was shared [between agencies]. Our software harmonised and integrated fragmented sources to produce a coherent picture of the outbreak, the health system’s ability to respond and the resources available.

“Three and a half weeks later, they were able to get infections down to zero in most affected provinces,” he adds. The government attributed that achievement to the availability of high-quality data to brief the public and international partners.

“They co-ordinated the response in a way that drove infections down,” he says. Zenysis formed part of a “virtual control room”, integrating information to help decision makers understand what was happening in the worst hit areas, identify sources of water contamination and where to prioritise cholera vaccinations.

It supported an “mAlert system”, which integrated health surveillance data into a single platform for analysis. The output was daily reports distilled from data issued by health facilities and accommodation centres in affected areas, as well as disease monitoring and surveillance data from laboratory testing…(More)”.

Practical Synthetic Data Generation: Balancing Privacy and the Broad Availability of Data


Book by Khaled El Emam, Lucy Mosquera, and Richard Hoptroff: “Building and testing machine learning models requires access to large and diverse data. But where can you find usable datasets without running into privacy issues? This practical book introduces techniques for generating synthetic data—fake data generated from real data—so you can perform secondary analysis to do research, understand customer behaviors, develop new products, or generate new revenue.

Data scientists will learn how synthetic data generation provides a way to make such data broadly available for secondary purposes while addressing many privacy concerns. Analysts will learn the principles and steps for generating synthetic data from real datasets. And business leaders will see how synthetic data can help accelerate time to a product or solution.

This book describes:

  • Steps for generating synthetic data using multivariate normal distributions
  • Methods for distribution fitting covering different goodness-of-fit metrics
  • How to replicate the simple structure of original data
  • An approach for modeling data structure to consider complex relationships
  • Multiple approaches and metrics you can use to assess data utility
  • How analysis performed on real data can be replicated with synthetic data
  • Privacy implications of synthetic data and methods to assess identity disclosure…(More)”.
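The first step the book lists, generating synthetic data from a fitted multivariate normal distribution, can be illustrated with a minimal sketch. This is not code from the book; the dataset and its columns are invented for the example. The idea is to estimate a mean vector and covariance matrix from real records, then sample new records from the fitted distribution, preserving the aggregate structure without reproducing any individual row.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "real" dataset: 1,000 records of 3 correlated numeric
# columns (say age, income, monthly spend), stood in for here by a draw
# from a known distribution so the example is self-contained.
real = rng.multivariate_normal(
    mean=[40.0, 55_000.0, 1_200.0],
    cov=[[90.0, 30_000.0, 500.0],
         [30_000.0, 2.5e8, 90_000.0],
         [500.0, 90_000.0, 40_000.0]],
    size=1_000,
)

# Fit: estimate the mean vector and covariance matrix from the real data.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# Generate: sample synthetic records from the fitted distribution.
# No real record is copied; only aggregate structure carries over.
synthetic = rng.multivariate_normal(mu, sigma, size=1_000)

# A simple utility check of the kind the book's later chapters formalize:
# the synthetic data should preserve the correlation structure.
print(np.round(np.corrcoef(real, rowvar=False), 2))
print(np.round(np.corrcoef(synthetic, rowvar=False), 2))
```

A multivariate normal captures only linear relationships between columns; the book's later items (distribution fitting, modeling complex structure, utility metrics) address exactly the cases where this simple approach falls short.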

Using Data for COVID-19 Requires New and Innovative Governance Approaches


Stefaan G. Verhulst and Andrew Zahuranec at Data & Policy blog: “There has been a rapid increase in the number of data-driven projects and tools released to contain the spread of COVID-19. Over the last three months, governments, tech companies, civic groups, and international agencies have launched hundreds of initiatives. These efforts range from simple visualizations of public health data to complex analyses of travel patterns.

When designed responsibly, data-driven initiatives could provide the public and their leaders the ability to be more effective in addressing the virus. The Atlantic and the New York Times have both published work that relies on innovative data use. These and other examples, detailed in our #Data4COVID19 repository, can fill vital gaps in our understanding and allow us to better respond to and recover from the crisis.

But data is not without risk. Collecting, processing, analyzing, and using any type of data, no matter how good the intentions of its users, can lead to harmful ends. Vulnerable groups can be excluded. Analysis can be biased. Data use can reveal sensitive information about people and locations. In addressing all these hazards, organizations need to be intentional in how they work throughout the data lifecycle.

Decision Provenance: Documenting decisions and decision makers across the Data Life Cycle

Unfortunately, the individuals and teams responsible for making these design decisions at each critical point of the data lifecycle are rarely identified or recognized by all those interacting with these data systems.

The lack of visibility into the origins of these decisions can negatively impact professional accountability and limit the ability of actors to identify the optimal intervention points for mitigating data risks and to avoid missed uses of potentially impactful data. Tracking decision provenance is essential.

As Jatinder Singh, Jennifer Cobbe, and Chris Norval of the University of Cambridge explain, decision provenance refers to tracking and recording decisions about the collection, processing, sharing, analyzing, and use of data. It involves instituting mechanisms to force individuals to explain how and why they acted. It is about using documentation to provide transparency and oversight in the decision-making process for everyone inside and outside an organization.

Toward that end, The GovLab at NYU Tandon developed the Decision Provenance Mapping. We designed this tool for designated data stewards tasked with coordinating the responsible use of data across organizational priorities and departments….(More)”
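To make the idea concrete, decision provenance as Singh, Cobbe, and Norval describe it amounts to an append-only record of who decided what, at which stage of the data lifecycle, and why. The sketch below is illustrative only; the class and field names are hypothetical, not taken from the Cambridge work or The GovLab's tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DecisionRecord:
    """One provenance entry: a decision, its maker, and its rationale."""
    stage: str           # lifecycle stage: "collection", "processing", "sharing", "analysis", "use"
    decision: str        # what was decided
    decision_maker: str  # individual or team accountable for the decision
    rationale: str       # why it was decided, for later transparency and oversight
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

class ProvenanceLog:
    """Append-only log; supports oversight queries by lifecycle stage."""
    def __init__(self) -> None:
        self._records: list[DecisionRecord] = []

    def record(self, rec: DecisionRecord) -> None:
        self._records.append(rec)

    def by_stage(self, stage: str) -> list[DecisionRecord]:
        return [r for r in self._records if r.stage == stage]

log = ProvenanceLog()
log.record(DecisionRecord(
    stage="sharing",
    decision="Share aggregated case counts with partner agencies",
    decision_maker="Data steward, epidemiology team",
    rationale="Coordinated outbreak response; no individual-level data leaves the system",
))
print(len(log.by_stage("sharing")))  # prints 1
```

Even a log this simple answers the two questions the post raises: who made a given decision, and where in the lifecycle it sits, which is what makes intervention points findable after the fact.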

Unlock the Hidden Value of Your Data


Stefaan G. Verhulst at the Harvard Business Review: “Twenty years ago, Kevin Rivette and David Kline wrote a book about the hidden value contained within companies’ underutilized patents. These patents, Rivette and Kline argued, represented “Rembrandts in the Attic” (the title of their book). Patents, the authors suggested, shouldn’t be seen merely as passive properties, but as strategic assets — a “new currency” that could be deployed in the quest for competition, brand reputation, and advances in research and development.

We are still living in the knowledge economy, and organizations are still trying to figure out how to unlock under-utilized assets. But the currency has changed: Today’s Rembrandts in the attic are data.

It is widely accepted now that the vast amounts of data that companies generate represent a tremendous repository of potential value. This value is monetary, and also social; it contains terrific potential to impact the public good. But do organizations — and do we as a society — know how to unlock this value? Do we know how to find the insights hidden in our digital attics and use them to improve society and peoples’ lives?

In what follows, I outline four steps that could help organizations maximize their data assets for public good. If there is an overarching theme, it is about the value of re-using data. Recent years have seen a growing open data movement, in which previously siloed government datasets have been made accessible to outside groups. Despite occasional trepidation on the part of data holders, research has consistently shown that such initiatives can be value-enhancing for both data holders and society. The same is true for private sector data assets. Better and more transparent reuse of data is arguably the single most important measure we can take to unleash this dual potential.

To help maximize data for the public good, we need to:

  • Develop methodologies to measure the value of data...
  • Develop structures to incentivize collaboration. ….
  • Encourage data collaboratives. 
  • Identify and nurture data stewards. …(More)”

Removing the pump handle: Stewarding data at times of public health emergency


Reema Patel at Significance: “There is a saying, incorrectly attributed to Mark Twain, that states: “History never repeats itself, but it rhymes”. Seeking to understand the implications of the current crisis for the effective use of data, I’ve drawn on the nineteenth-century cholera outbreak in London’s Soho to identify some “rhyming patterns” that might inform our approaches to data use and governance at this time of public health crisis.

Where better to begin than with the work of Victorian pioneer John Snow? In 1854, Snow’s use of a dot map to illustrate clusters of cholera cases around public water pumps, and of statistics to establish the connection between the quality of water sources and cholera outbreaks, led to a breakthrough in public health interventions – and, famously, the removal of the handle of a water pump in Broad Street.

Data is vital

We owe a lot to Snow, especially now. His example teaches us that data has a central role to play in saving lives, and that the effective use of (and access to) data is critical for enabling timely responses to public health emergencies.

Take, for instance, transport app CityMapper’s rapid redeployment of its aggregated transport data. In the early days of the Covid-19 pandemic, this formed part of an analysis of compliance with social distancing restrictions across a range of European cities. There is also the US-based health weather map, which uses anonymised and aggregated data to visualise fever, specifically influenza-like illnesses. This data helped model early indications of where, and how quickly, Covid-19 was spreading….

Ethics and human rights still matter

As the current crisis evolves, many have expressed concern that the pandemic will be used to justify the rapid roll out of surveillance technologies that do not meet ethical and human rights standards, and that this will be done in the name of the “public good”. Examples of these technologies include symptom- and contact-tracing applications. Privacy experts are also increasingly concerned that governments will be trading off more personal data than is necessary or proportionate to respond to the public health crisis.

Many ethical and human rights considerations (including those listed at the bottom of this piece) are at risk of being overlooked at this time of emergency, and governments would be wise not to press ahead regardless, ignoring legitimate concerns about rights and standards. Instead, policymakers should begin to address these concerns by asking how we can prepare (now and in future) to establish clear and trusted boundaries for the use of data (personal and non-personal) in such crises.

Democratic states in Europe and the US have not, in recent memory, prioritised infrastructures and systems for a crisis of this scale – and this has contributed to our current predicament. Contrast this with Singapore, which suffered outbreaks of SARS and H1N1, and channelled this experience into implementing pandemic preparedness measures.

We cannot undo the past, but we can begin planning and preparing constructively for the future, and that means strengthening global coordination and finding mechanisms to share learning internationally. Getting the right data infrastructure in place has a central role to play in addressing ethical and human rights concerns around the use of data….(More)”.

The Law and Policy of Government Access to Private Sector Data (‘B2G Data Sharing’)


Paper by Heiko Richter: “The tremendous rate of technological advancement in recent years has fostered a policy debate about improving the state’s access to privately held data (‘B2G data sharing’). Access to such ‘data of general interest’ can significantly improve social welfare and serve the common good. At the same time, expanding the state’s access to privately held data poses risks. This chapter inquires into the potential and limits of mandatory access rules, which would oblige private undertakings to grant access to data for specific purposes that lie in the public interest. The article discusses the key questions that access rules should address and develops general principles for designing and implementing such rules. It puts particular emphasis on the opportunities and limitations for the implementation of horizontal B2G access frameworks. Finally, the chapter outlines concrete recommendations for legislative reforms…(More)”.

Viruses Cross Borders. To Fight Them, Countries Must Let Medical Data Flow, Too


Nigel Cory at ITIF: “If nations could regulate viruses the way many regulate data, there would be no global pandemics. But the sad reality is that, in the midst of the worst global pandemic in living memory, many nations make it unnecessarily complicated and costly, if not illegal, for health data to cross their borders. In so doing, they are hindering critically needed medical progress.

In the COVID-19 crisis, data analytics powered by artificial intelligence (AI) is critical to identifying the exact nature of the pandemic and developing effective treatments. The technology can produce powerful insights and innovations, but only if researchers can aggregate and analyze data from populations around the globe. And that requires data to move across borders as part of international research efforts by private firms, universities, and other research institutions. Yet, some countries, most notably China, are stopping health and genomic data at their borders.

Indeed, despite the significant benefits to companies, citizens, and economies that arise from the ability to easily share data across borders, dozens of countries—across every stage of development—have erected barriers to cross-border data flows. These data-residency requirements strictly confine data within a country’s borders, a concept known as “data localization,” and many countries have especially strict requirements for health data.

China is a noteworthy offender, having created a new digital iron curtain that requires data localization for a range of data types, including health data, as part of its so-called “cyber sovereignty” strategy. A May 2019 State Council regulation required genomic data to be stored and processed locally by Chinese firms—and foreign organizations are prohibited from doing so. This is in service of China’s mercantilist strategy to advance its domestic life sciences industry. While there has been collaboration between U.S. and Chinese medical researchers on COVID-19, including on clinical trials for potential treatments, these restrictions mean that it won’t involve the transfer, aggregation, and analysis of Chinese personal data, which otherwise might help find a treatment or vaccine. If China truly wanted to make amends for blocking critical information during the early stages of the outbreak in Wuhan, then it should abolish this restriction and allow genomic and other health data to cross its borders.

But China is not alone in limiting data flows. Russia requires all personal data, health-related or not, to be stored locally. India’s draft data protection bill permits the government to classify any sensitive personal data as critical personal data and mandate that it be stored and processed only within the country. This would be consistent with recent debates and decisions to require localization for payments data and other types of data. And despite its leading role in pushing for the free flow of data as part of new digital trade agreements, Australia requires genomic and other data attached to personal electronic health records to be stored and processed only within its borders.

Countries also enact de facto barriers to health and genomic data transfers by making it harder and more expensive, if not impractical, for firms to transfer it overseas than to store it locally. For example, South Korea and Turkey require firms to get explicit consent from people to transfer sensitive data like genomic data overseas. Doing this for hundreds or thousands of people adds considerable costs and complexity.

And the European Union’s General Data Protection Regulation encourages data localization as firms feel pressured to store and process personal data within the EU given the restrictions it places on data transfers to many countries. This is in addition to the renewed push for local data storage and processing under the EU’s new data strategy.

Countries rationalize these steps on the basis that health data, particularly genomic data, is sensitive. But requiring health data to be stored locally does little to increase privacy or data security. The confidentiality of data does not depend on which country the information is stored in, only on the measures used to store it securely, such as via encryption, and the policies and procedures the firms follow in storing or analyzing the data. For example, if a nation has limits on the use of genomics data, then domestic organizations using that data face the same restrictions, whether they store the data in the country or outside of it. And if they share the data with other organizations, they must require those organizations, regardless of where they are located, to abide by the home government’s rules.

As such, policymakers need to stop treating health data differently when it comes to cross-border movement, and instead build technical, legal, and ethical protections into both domestic and international data-governance mechanisms, which together allow the responsible sharing and transfer of health and genomic data.

This is clearly possible—and needed. In February 2020, leading health researchers called for an international code of conduct for genomic data following the end of their first-of-its-kind international data-driven research project. The project used a purpose-built cloud service that stored 800 terabytes of genomic data on 2,658 cancer genomes across 13 data centers on three continents. The collaboration and use of cloud computing were transformational in enabling large-scale genomic analysis….(More)”.

A data sharing method in the open web environment: Data sharing in hydrology


Paper by Jin Wang et al: “Data sharing plays a fundamental role in providing data resources for geographic modeling and simulation. Although there are many successful cases of data sharing through the web, current practices for sharing data mostly focus on data publication using metadata at the file level, which requires identifying, restructuring, and synthesizing raw data files for further usage. In hydrology, because the same hydrological information is often stored in data files with different formats, modelers must identify the required information from multisource data sets and then customize data requirements for their applications. However, these data customization tasks are difficult to repeat, which leads to repetitive labor. This paper presents a data sharing method that provides a solution for data manipulation based on a structured data description model rather than raw data files. With the structured data description model, multisource hydrological data can be accessed and processed in a unified way and published as data services using a designed data server. This study also proposes a data configuration manager to customize data requirements through an interactive programming tool, which can help in using the data services. In addition, a component-based data viewer is developed for the visualization of multisource data in a sharable visualization scheme. A case study that involves sharing and applying hydrological data is designed to examine the applicability and feasibility of the proposed data sharing method…(More)”.
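The core move in the abstract, sharing a structured description of the data rather than raw files, can be sketched in a few lines. This is not the paper's actual model or API; all names here are hypothetical, and the stub parser stands in for real format-specific readers (CSV, NetCDF, gauge logs, etc.). The point is that consumers program against canonical variable names and units, not file formats.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VariableDescription:
    """Canonical name and unit for one hydrological variable."""
    name: str   # e.g. "discharge"
    unit: str   # e.g. "m3/s"

@dataclass
class DatasetDescription:
    """Structured description of one source: its format, its variables,
    and a reader that maps a raw file to canonical variable names."""
    source_format: str                  # e.g. "csv", "netcdf", "gauge-log"
    variables: list[VariableDescription]
    reader: Callable[[str], dict]

def read_unified(desc: DatasetDescription, path: str) -> dict:
    """Access any described source the same way, whatever its raw format."""
    raw = desc.reader(path)
    return {v.name: raw[v.name] for v in desc.variables}

# A CSV gauge file and a NetCDF model output could each get their own
# description; downstream code then never touches format details.
csv_desc = DatasetDescription(
    source_format="csv",
    variables=[VariableDescription("discharge", "m3/s")],
    reader=lambda path: {"discharge": [12.0, 13.5, 11.8]},  # stub parser
)
print(read_unified(csv_desc, "gauge_01.csv"))  # prints {'discharge': [12.0, 13.5, 11.8]}
```

Because the description (not the file) is what gets shared, the customization step the paper calls repetitive labor becomes a reusable object: write the reader once per format, then every modeler consumes the same unified interface.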

Responsible Data Toolkit


Andrew Young at The GovLab: “The GovLab and UNICEF, as part of the Responsible Data for Children initiative (RD4C), are pleased to share a set of user-friendly tools to support organizations and practitioners seeking to operationalize the RD4C Principles. These principles—Purpose-Driven, People-Centric, Participatory, Protective of Children’s Rights, Proportional, Professionally Accountable, and Prevention of Harms Across the Data Lifecycle—are especially important in the current moment, as actors around the world are taking a data-driven approach to the fight against COVID-19.

The initial components of the RD4C Toolkit are:

The RD4C Data Ecosystem Mapping Tool intends to help users identify the systems generating data about children and the key components of those systems. After using this tool, users will be positioned to understand the breadth of data they generate and hold about children; assess data systems’ redundancies or gaps; identify opportunities for responsible data use; and achieve other insights.

The RD4C Decision Provenance Mapping methodology provides a way for actors designing or assessing data investments for children to identify key decision points and determine which internal and external parties influence those decision points. This distillation can help users to pinpoint any gaps and develop strategies for improving decision-making processes and advancing more professionally accountable data practices.

The RD4C Opportunity and Risk Diagnostic provides organizations with a way to take stock of the RD4C principles and how they might be realized as an organization reviews a data project or system. The tool’s high-level questions and prompts are intended to help users identify areas in need of attention and to strategize next steps for ensuring more responsible handling of data for and about children across their organization.

Finally, the Data for Children Collaborative with UNICEF developed an Ethical Assessment that “forms part of [their] safe data ecosystem, alongside data management and data protection policies and practices.” The tool reflects the RD4C Principles and aims to “provide an opportunity for project teams to reflect on the material consequences of their actions, and how their work will have real impacts on children’s lives.

RD4C launched in October 2019 with the release of the RD4C Synthesis Report, Selected Readings, and the RD4C Principles. Last month we published The RD4C Case Studies, which analyze data systems deployed in diverse country environments, with a focus on their alignment with the RD4C Principles. The case studies are: Romania’s The Aurora Project, Childline Kenya, and Afghanistan’s Nutrition Online Database.

To learn more about Responsible Data for Children, visit rd4c.org or contact rd4c [at] thegovlab.org. To join the RD4C conversation and be alerted to future releases, subscribe at this link.”