Synthetic data: innovation for public good


Blog Post by Catrin Cheung: “What is synthetic data, and how can it be used for public good? ….Synthetic data are artificially generated data that have the look and structure of real data, but do not contain any information on individuals. They also contain more general characteristics that are used to find patterns in the data.

They are modelled on real data, but designed in a way which safeguards the legal, ethical and confidentiality requirements of the original data. Given their resemblance to the original data, synthetic data are useful in a range of situations, for example when data is sensitive or missing. They are used widely as teaching materials, to test code or mathematical models, or as training data for machine learning models….

There’s currently a wealth of research emerging from the health sector, as the nature of data published is often sensitive. Public Health England have synthesised cancer data which can be freely accessed online. NHS Scotland are making advances in cutting-edge machine learning methods such as Variational Autoencoders and Generative Adversarial Networks (GANs).

There is growing interest in this area of research, and its influence extends beyond the statistical community. While the Data Science Campus have also used GANs to generate synthetic data in their latest research, the power of GANs is not limited to data generation. They can be trained to construct features almost identical to our own across imagery, music, speech and text. In fact, GANs have been used to create a painting of Edmond de Belamy, which sold for $432,500 in 2018!

Within the ONS, a pilot to create synthetic versions of securely held Labour Force Survey data has been carried out using a package in R called “synthpop”. This synthetic dataset can be shared with approved researchers to debug code, prior to analysis of data held in the Secure Research Service….

Although much progress has been made in this field, one challenge that persists is guaranteeing the accuracy of synthetic data. We must ensure that the statistical properties of synthetic data match those of the original data.

Additional features, such as the presence of non-numerical data, add to this difficult task. For example, if something is listed as “animal” and can take the possible values “dog”, “cat” or “elephant”, it is difficult to convert this information into a format suitable for precise calculations. Furthermore, given that datasets have different characteristics, there is no straightforward solution that can be applied to all types of data….particular focus was also placed on the use of synthetic data in the field of privacy, following from the challenges and opportunities identified by the National Statistician’s Quality Review of privacy and data confidentiality methods published in December 2018….(More)”.
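
As an illustration of the “animal” example above, the sketch below shows one common way to turn a categorical column into numbers (one-hot encoding) and then compare category proportions between an original and a synthetic column. It is a minimal sketch and not the ONS method: the function names, the two small lists and the choice of the largest proportion gap as a check are all assumptions made for illustration.

```python
from collections import Counter

# Possible values of the categorical variable "animal" (taken from the example in the text).
CATEGORIES = ["dog", "cat", "elephant"]


def one_hot(values, categories):
    """Map each categorical value to a 0/1 indicator vector over `categories`."""
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows


def proportions(values, categories):
    """Return the share of records falling in each category."""
    counts = Counter(values)
    total = len(values)
    return {cat: counts[cat] / total for cat in categories}


# Hypothetical data, invented purely for illustration.
original = ["dog", "cat", "dog", "elephant"]
synthetic = ["cat", "dog", "dog", "dog"]

# Numeric representation of the original column.
encoded = one_hot(original, CATEGORIES)

# A simple (assumed) fidelity check: the largest gap in category shares.
orig_share = proportions(original, CATEGORIES)
synth_share = proportions(synthetic, CATEGORIES)
max_gap = max(abs(orig_share[cat] - synth_share[cat]) for cat in CATEGORIES)

print(encoded)
print(max_gap)
```

A small gap would suggest the synthetic column reproduces the original category mix; in this invented example the synthetic data contain no “elephant” records at all, which is the kind of mismatch such a check is meant to surface. Real evaluations would typically look at joint distributions as well, not only single columns.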

e-Democracy: Toward a New Model of (Inter)active Society


Book by Alfredo M. Ronchi: “This book explores the main elements of e-Democracy, the term normally used to describe the implementation of democratic government processes by electronic means. It provides insights into the main technological and human issues regarding governance, government, participation, inclusion, empowerment, procurement and, last but not least, ethical and privacy issues. Its main aim is to bridge the gap between technological solutions, their successful implementation, and the fruitful utilization of the main set of e-Services totally or partially delivered by governments or non-government organizations.


Today, various parameters actively influence e-Services’ success or failure: cultural aspects, organisational issues, bureaucracy and workflows, infrastructure and technology in general, user habits, literacy, capacity or merely interaction design. This includes having a significant population of citizens who are willing and able to adopt and use online services; as well as developing the managerial and technical capability to implement applications that meet citizens’ needs. This book helps readers understand the mutual dependencies involved; further, a selection of success stories and failures, duly commented on, enables readers to identify the right approach to innovation in governmental e-Services….(More)”

Credit denial in the age of AI


Paper by Aaron Klein: “Banks have been in the business of deciding who is eligible for credit for centuries. But in the age of artificial intelligence (AI), machine learning (ML), and big data, digital technologies have the potential to transform credit allocation in positive as well as negative directions. Given the mix of possible societal ramifications, policymakers must consider what practices are and are not permissible and what legal and regulatory structures are necessary to protect consumers against unfair or discriminatory lending practices.

In this paper, I review the history of credit and the risks of discriminatory practices. I discuss how AI alters the dynamics of credit denials and what policymakers and banking officials can do to safeguard consumer lending. AI has the potential to alter credit practices in transformative ways and it is important to ensure that this happens in a safe and prudent manner….(More)”.

Characterizing the cultural niches of North American birds


Justin G. Schuetz and Alison Johnston at PNAS: “Efforts to mitigate the current biodiversity crisis require a better understanding of how and why humans value other species. We use Internet query data and citizen science data to characterize public interest in 621 bird species across the United States. We estimate the relative popularity of different birds by quantifying how frequently people use Google to search for species, relative to the rates at which they are encountered in the environment.

In intraspecific analyses, we also quantify the degree to which Google searches are limited to, or extend beyond, the places in which people encounter each species. The resulting metrics of popularity and geographic specificity of interest allow us to define aspects of relationships between people and birds within a cultural niche space. We then estimate the influence of species traits and socially constructed labels on niche positions to assess the importance of observations and ideas in shaping public interest in birds.

Our analyses show clear effects of migratory strategy, color, degree of association with bird feeders, and, especially, body size on niche position. They also indicate that cultural labels, including “endangered,” “introduced,” and, especially, “team mascot,” are strongly associated with the magnitude and geographic specificity of public interest in birds. Our results provide a framework for exploring complex relationships between humans and other species and enable more informed decision-making across diverse bird conservation strategies and goals….(More)”.

Predictive Big Data Analytics using the UK Biobank Data


Paper by Ivo D Dinov et al: “The UK Biobank is a rich national health resource that provides enormous opportunities for international researchers to examine, model, and analyze census-like multisource healthcare data. The archive presents several challenges related to aggregation and harmonization of complex data elements, feature heterogeneity and salience, and health analytics. Using 7,614 imaging, clinical, and phenotypic features of 9,914 subjects we performed deep computed phenotyping using unsupervised clustering and derived two distinct sub-cohorts. Using parametric and nonparametric tests, we determined the top 20 most salient features contributing to the cluster separation. Our approach generated decision rules to predict the presence and progression of depression or other mental illnesses by jointly representing and modeling the significant clinical and demographic variables along with the derived salient neuroimaging features. We reported consistency and reliability measures of the derived computed phenotypes and the top salient imaging biomarkers that contributed to the unsupervised clustering. This clinical decision support system identified and utilized holistically the most critical biomarkers for predicting mental health, e.g., depression. External validation of this technique on different populations may lead to reducing healthcare expenses and improving the processes of diagnosis, forecasting, and tracking of normal and pathological aging….(More)”.

Tracking Phones, Google Is a Dragnet for the Police


Jennifer Valentino-DeVries at the New York Times: “….The warrants, which draw on an enormous Google database employees call Sensorvault, turn the business of tracking cellphone users’ locations into a digital dragnet for law enforcement. In an era of ubiquitous data gathering by tech companies, it is just the latest example of how personal information — where you go, who your friends are, what you read, eat and watch, and when you do it — is being used for purposes many people never expected. As privacy concerns have mounted among consumers, policymakers and regulators, tech companies have come under intensifying scrutiny over their data collection practices.

The Arizona case demonstrates the promise and perils of the new investigative technique, whose use has risen sharply in the past six months, according to Google employees familiar with the requests. It can help solve crimes. But it can also snare innocent people.

Technology companies have for years responded to court orders for specific users’ information. The new warrants go further, suggesting possible suspects and witnesses in the absence of other clues. Often, Google employees said, the company responds to a single warrant with location information on dozens or hundreds of devices.

Law enforcement officials described the method as exciting, but cautioned that it was just one tool….

The technique illustrates a phenomenon privacy advocates have long referred to as the “if you build it, they will come” principle — anytime a technology company creates a system that could be used in surveillance, law enforcement inevitably comes knocking. Sensorvault, according to Google employees, includes detailed location records involving at least hundreds of millions of devices worldwide and dating back nearly a decade….(More)”.

Access to Algorithms


Paper by Hannah Bloch-Wehba: “Federal, state, and local governments increasingly depend on automated systems — often procured from the private sector — to make key decisions about civil rights and civil liberties. When individuals affected by these decisions seek access to information about the algorithmic methodologies that produced them, governments frequently assert that this information is proprietary and cannot be disclosed. 

Recognizing that opaque algorithmic governance poses a threat to civil rights and liberties, scholars have called for a renewed focus on transparency and accountability for automated decision making. But scholars have neglected a critical avenue for promoting public accountability and transparency for automated decision making: the law of access to government records and proceedings. This Article fills this gap in the literature, recognizing that the Freedom of Information Act, its state equivalents, and the First Amendment provide unappreciated legal support for algorithmic transparency.

The law of access performs three critical functions in promoting algorithmic accountability and transparency. First, by enabling any individual to challenge algorithmic opacity in government records and proceedings, the law of access can relieve some of the burden otherwise borne by parties who are often poor and under-resourced. Second, access law calls into question government’s procurement of algorithmic decision making technologies from private vendors, subject to contracts that include sweeping protections for trade secrets and intellectual property rights. Finally, the law of access can promote an urgently needed public debate on algorithmic governance in the public sector….(More)”.

Statistics Estonia to coordinate data governance


Article by Miriam van der Sangen at CBS: “In 2018, Statistics Estonia launched a new strategy for the period 2018-2022. This strategy addresses the organisation’s aim to produce statistics more quickly while minimising the response burden on both businesses and citizens. Another element in the strategy is addressing the high expectations in Estonian society regarding the use of data. ‘We aim to transform Statistics Estonia into a national data agency,’ says Director General Mägi. ‘This means our role as a producer of official statistics will be enlarged by data governance responsibilities in the public sector. Taking on such responsibilities requires a clear vision of the whole public data ecosystem and also agreement to establish data stewards in most public sector institutions.’…

the Estonian Parliament passed new legislation that effectively expanded the number of official tasks for Statistics Estonia. Mägi elaborates: ‘Most importantly, we shall be responsible for coordinating data governance. The detailed requirements and conditions of data governance will be specified further in the coming period.’ Under the new Act, Statistics Estonia will also have more possibilities to share data with other parties….

Statistics Estonia is fully committed to producing statistics which are based on big data. Mägi explains: ‘At the moment, we are actively working on two big data projects. One project involves the use of smart electricity meters. In this project, we are looking into ways to visualise business and household electricity consumption information. The second project involves web scraping of prices and enterprise characteristics. This project is still in an initial phase, but we can already see that the use of web scraping can improve the efficiency of our production process. We are aiming to extend the web scraping project by also identifying e-commerce and innovation activities of enterprises.’

Yet another ambitious goal for Statistics Estonia lies in the field of data science. ‘Similarly to Statistics Netherlands, we established experimental statistics and data mining activities years ago. Last year, we developed a so-called think-tank service, providing insights from data into all aspects of our lives. Think of birth, education, employment, et cetera. Our key clients are the various ministries, municipalities and the private sector. The main aim in the coming years is to speed up service time thanks to visualisations and data lake solutions.’ …(More)”.

Black Wave: How Networks and Governance Shaped Japan’s 3/11 Disasters


Book by Daniel Aldrich: “Despite the devastation caused by the magnitude 9.0 earthquake and 60-foot tsunami that struck Japan in 2011, some 96% of those living and working in the most disaster-stricken region of Tōhoku made it through. Smaller earthquakes and tsunamis have killed far more people in nearby China and India. What accounts for the exceptionally high survival rate? And why is it that some towns and cities in the Tōhoku region have built back more quickly than others?

Black Wave illuminates two critical factors that had a direct influence on why survival rates varied so much across the Tōhoku region following the 3/11 disasters and why the rebuilding process has also not moved in lockstep across the region. Individuals and communities with stronger networks and better governance, Daniel P. Aldrich shows, had higher survival rates and accelerated recoveries. Less connected communities with fewer such ties faced harder recovery processes and lower survival rates. Beyond the individual and neighborhood levels of survival and recovery, the rebuilding process has varied greatly, as some towns and cities have sought to work independently on rebuilding plans, ignoring recommendations from the national governments and moving quickly to institute their own visions, while others have followed the guidelines offered by Tokyo-based bureaucrats for economic development and rebuilding….(More)”.

Big Data Applications in Governance and Policy


Introduction to Special Issue of Politics and Governance by Sarah Giest and Reuben Ng: “Recent literature has been trying to grasp the extent to which big data applications affect the governance and policymaking of countries and regions (Boyd & Crawford, 2012; Giest, 2017; Höchtl, Parycek, & Schöllhammer, 2015; Poel, Meyer, & Schroeder, 2018). The discussion includes the comparison to e-government and evidence-based policymaking developments that existed long before the idea of big data entered the policy realm. The theoretical extent of this discussion, however, lacks some of the more practical consequences that come with the active use of data-driven applications. In fact, much of the work focuses on the input side of policymaking, looking at which data and technology enter the policy process; however, very little is dedicated to the output side.

In short, how has big data shaped data governance and policymaking? The contributions to this thematic issue shed light on this question by looking at a range of factors, such as campaigning in the US election (Trish, 2018) or local government data projects (Durrant, Barnett, & Rempel, 2018). The goal is to unpack the mixture of big data applications and existing policy processes in order to understand whether these new tools and applications enhance or hinder policymaking….(More)”.