Safeguards for human studies can’t cope with big data


Nathaniel Raymond at Nature: “One of the primary documents aiming to protect human research participants was published in the US Federal Register 40 years ago this week. The Belmont Report was commissioned by Congress in the wake of the notorious Tuskegee syphilis study, in which researchers withheld treatment from African American men for years and observed how the disease caused blindness, heart disease, dementia and, in some cases, death.

The Belmont Report lays out core principles now generally required for human research to be considered ethical. Although technically governing only US federally supported research, its influence reverberates across academia and industry globally. Before academics with US government funding can begin research involving humans, their institutional review boards (IRBs) must determine that the studies comply with regulation largely derived from a document that was written more than a decade before the World Wide Web and nearly a quarter of a century before Facebook.

It is past time for a Belmont 2.0. We should not be asking those tasked with protecting human participants to single-handedly identify and contend with the implications of the digital revolution. Technological progress, including machine learning, data analytics and artificial intelligence, has altered the potential risks of research in ways that the authors of the first Belmont report could not have predicted. For example, Muslim cab drivers can be identified from patterns indicating that they stop to pray; the Ugandan government can try to identify gay men from their social-media habits; and researchers can monitor and influence individuals’ behaviour online without enrolling them in a study.

Consider the 2014 Facebook ‘emotional contagion study’, which manipulated users’ exposure to emotional content to evaluate effects on mood. That project, a collaboration with academic researchers, led the US Department of Health and Human Services to launch a long rule-making process that tweaked some regulations governing IRBs.

A broader fix is needed. Right now, data science overlooks risks to human participants by default….(More)”.

Data Cultures, Culture as Data


Introduction to Special Issue of Cultural Analytics by Amelia Acker and Tanya Clement: “Data have become pervasive in research in the humanities and the social sciences. New areas, objects, and situations for study have developed; and new methods for working with data are shepherded by new epistemologies and (potential) paradigm shifts. But data didn’t just happen to us. We have happened to data. In every field, scholars are drawing boundaries between data and humans as if making meaning with data is innocent work. But these boundaries are never innocent. Questions are emerging about the relationships of culture to data—urgent questions that focus on the codification (or code-ification) of social and cultural bias and the erosion of human agency, subjectivity, and identity.

For this special issue of Cultural Analytics we invited submissions to respond to these concerns as they relate to the proximity and distance between the creation of data and its collection; the nature of data as object or content; modes and contexts of data circulation, dissemination and preservation; histories and imaginary data futures; data expertise; data and technological progressivism; the cultivation and standardization of data; and the cultures, communities, and consciousness of data production. The contributions we received ranged in type from research or theory articles to data reviews and opinion pieces responding to the theme of “data cultures”. Each contribution asks questions we should all be asking: What is the role we play in the data cultures/culture as data we form around sociomaterial practices? How can we better understand how these practices effect, and affect, the materialization of subjects, objects, and the relations between them? How can we engage our data culture(s) in practical, critical, and generative ways? As Karen Barad writes, “We are responsible for the world in which we live not because it is an arbitrary construction of our choosing, but because it is sedimented out of particular practices that we have a role in shaping.”[1] Ultimately, our contributors are focused on this central concern: where is our agency in the responsibility of shaping data cultures? What role can scholarship play in better understanding our culture as data?…(More)”.

Digital Health Data And Information Sharing: A New Frontier For Health Care Competition?


Paper by Lucia Savage, Martin Gaynor and Julie Adler-Milstein: “There are obvious benefits to having patients’ health information flow across health providers. Providers will have more complete information about patients’ health and treatment histories, allowing them to make better treatment recommendations, and avoid unnecessary and duplicative testing or treatment. This should result in better and more efficient treatment, and better health outcomes. Moreover, the federal government has provided substantial incentives for the exchange of health information. Since 2009, the federal government has spent more than $40 billion to ensure that most physicians and hospitals use electronic health records, and to incentivize the use of electronic health information and health information exchange (the enabling statute is the Health Information Technology for Economic and Clinical Health (HITECH) Act), and in 2016 authorized substantial fines for failing to share appropriate information.

Yet, in spite of these incentives and the clear benefits to patients, the exchange of health information remains limited. There is evidence that this limited exchange is due in part to providers and platforms attempting to retain, rather than share, information (“information blocking”). In this article we examine legal and business reasons why health information may not be flowing. In particular, we discuss incentives providers and platforms can have for information blocking as a means to maintain or enhance their market position and thwart competition. Finally, we recommend steps to better understand whether the absence of information exchange is due to information blocking that harms competition and consumers….(More)”

Synthetic data: innovation for public good


Blog Post by Catrin Cheung: “What is synthetic data, and how can it be used for public good? ….Synthetic data are artificially generated data that have the look and structure of real data, but do not contain any information on real individuals. They do, however, preserve the more general characteristics of the real data, which is what makes it possible to find patterns in them.

They are modelled on real data, but designed in a way which safeguards the legal, ethical and confidentiality requirements of the original data. Given their resemblance to the original data, synthetic data are useful in a range of situations, for example when data are sensitive or missing. They are used widely as teaching materials, to test code or mathematical models, or as training data for machine learning models….
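
To make the idea concrete, here is a minimal sketch of the simplest form of that modelling: fit a distribution to real numeric data, then sample new rows from it. This is our illustration, not code from the ONS post; it assumes purely numeric columns, uses numpy and pandas, and the column names are invented for the example.

```python
import numpy as np
import pandas as pd

def synthesize_numeric(real: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Sample n synthetic rows from a multivariate normal fitted to real data."""
    rng = np.random.default_rng(seed)
    mean = real.mean().to_numpy()
    cov = real.cov().to_numpy()
    return pd.DataFrame(rng.multivariate_normal(mean, cov, size=n),
                        columns=real.columns)

# Toy "real" data: correlated age and income for 500 invented individuals.
rng = np.random.default_rng(1)
age = rng.normal(40, 10, 500)
income = 800 * age + rng.normal(0, 5000, 500)
real = pd.DataFrame({"age": age, "income": income})

synthetic = synthesize_numeric(real, n=500)
print(synthetic.head())  # plausible rows, but no row describes a real person
```

Production pipelines layer disclosure-control checks on top of this, but the basic shape is the same: estimate, then sample.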

There’s currently a wealth of research emerging from the health sector, as the nature of data published is often sensitive. Public Health England have synthesised cancer data which can be freely accessed online. NHS Scotland are making advances in cutting-edge machine learning methods such as Variational Auto Encoders and Generative Adversarial Networks (GANs).

There is growing interest in this area of research, and its influence extends beyond the statistical community. The Data Science Campus have also used GANs to generate synthetic data in their latest research, but the technique’s power is not limited to data generation: GANs can be trained to produce imagery, music, speech and text almost indistinguishable from human-made examples. In fact, a GAN was used to create the painting Edmond de Belamy, which sold for $432,500 in 2018!
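
For readers curious what “training a GAN” involves, the sketch below is a deliberately toy version, assuming PyTorch, that learns to mimic a one-dimensional distribution: a generator proposes fake samples, a discriminator learns to tell them from real ones, and each improves against the other. It is our illustration of the general technique, not the Data Science Campus code.

```python
import torch
import torch.nn as nn

# Generator: noise -> fake sample. Discriminator: sample -> probability "real".
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real_data = torch.randn(1000, 1) * 2 + 5  # stand-in for a real distribution

for step in range(2000):
    batch = real_data[torch.randint(0, 1000, (64,))]
    fake = G(torch.randn(64, 8))

    # Discriminator learns to score real batches 1 and fakes 0.
    loss_d = bce(D(batch), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator learns to make the discriminator score its fakes as real.
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

print(G(torch.randn(5, 8)).detach().squeeze())  # draws should cluster around 5
```

The same mechanism, scaled up to high-dimensional data, is what produces synthetic images, records or text.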

Within the ONS, a pilot to create synthetic versions of securely held Labour Force Survey data has been carried out using a package in R called “synthpop”. This synthetic dataset can be shared with approved researchers to debug code prior to analysis of the data held in the Secure Research Service….

Although much progress has been made in this field, one challenge that persists is guaranteeing the accuracy of synthetic data. We must ensure that the statistical properties of the synthetic data match those of the original data.
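
In practice that check can start very simply: compare moments and correlations of the two datasets and flag large gaps. A small sketch of such a comparison, again our own illustration on invented data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
real = pd.DataFrame(
    rng.multivariate_normal([40, 32000], [[100, 8000], [8000, 4e7]], size=500),
    columns=["age", "income"])
synthetic = pd.DataFrame(
    rng.multivariate_normal(real.mean().to_numpy(), real.cov().to_numpy(), size=500),
    columns=real.columns)

# First and second moments should be close; identical values would suggest
# the "synthetic" data is just a copy, which defeats the purpose.
report = pd.DataFrame({
    "real_mean": real.mean(), "synth_mean": synthetic.mean(),
    "real_std": real.std(), "synth_std": synthetic.std()})
print(report)
print("max correlation gap:",
      (real.corr() - synthetic.corr()).abs().to_numpy().max())
```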

Additional features, such as the presence of non-numerical data, add to this difficult task. For example, if something is listed as “animal” and can take the possible values “dog”, “cat” or “elephant”, it is difficult to convert this information into a format suitable for precise calculations. Furthermore, given that datasets have different characteristics, there is no straightforward solution that can be applied to all types of data…. Particular focus was also placed on the use of synthetic data in the field of privacy, following from the challenges and opportunities identified by the National Statistician’s Quality Review of privacy and data confidentiality methods published in December 2018….(More)”.
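
A standard workaround for the “animal” example above is to recode each category as its own 0/1 indicator column (one-hot encoding), which numerical methods and synthesizers can then handle. A minimal pandas illustration:

```python
import pandas as pd

df = pd.DataFrame({"animal": ["dog", "cat", "elephant", "dog"]})

# One indicator column per category; each row has exactly one set.
encoded = pd.get_dummies(df, columns=["animal"])
print(encoded.columns.tolist())  # ['animal_cat', 'animal_dog', 'animal_elephant']
```

The harder problem the post points to remains, though: indicator columns multiply quickly, and the right encoding differs from dataset to dataset.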

Credit denial in the age of AI


Paper by Aaron Klein: “Banks have been in the business of deciding who is eligible for credit for centuries. But in the age of artificial intelligence (AI), machine learning (ML), and big data, digital technologies have the potential to transform credit allocation in positive as well as negative directions. Given the mix of possible societal ramifications, policymakers must consider what practices are and are not permissible and what legal and regulatory structures are necessary to protect consumers against unfair or discriminatory lending practices.

In this paper, I review the history of credit and the risks of discriminatory practices. I discuss how AI alters the dynamics of credit denials and what policymakers and banking officials can do to safeguard consumer lending. AI has the potential to alter credit practices in transformative ways, and it is important to ensure that this happens in a safe and prudent manner….(More)”.

Digital Data for Development


LinkedIn: “The World Bank Group and LinkedIn share a commitment to helping workers around the world access opportunities that make good use of their talents and skills. The two organizations have come together to identify new ways that data from LinkedIn can help inform policymakers who seek to boost employment and grow their economies.

This site offers data and automated visuals of industries where LinkedIn data is comprehensive enough to provide an emerging picture. The data complements a wealth of official sources and can offer a more real-time view in some areas, particularly for new, rapidly changing digital and technology industries.

The data shared in the first phase of this collaboration focuses on 100+ countries with at least 100,000 LinkedIn members each, distributed across 148 industries and 50,000 skills categories. In the near term, it will help World Bank Group teams and government partners pinpoint ways that developing countries could stimulate growth and expand opportunity, especially as disruptive technologies reshape the economic landscape. As LinkedIn’s membership and digital platforms continue to grow in developing countries, this collaboration will assess the possibility of expanding the sectors and countries covered in the next annual update.

This site offers downloadable data, visualizations, and an expanding body of insights and joint research from the World Bank Group and LinkedIn. The data is being made accessible as a public good, though it will be most useful for policy analysts, economists, and researchers….(More)”.

Predictive Big Data Analytics using the UK Biobank Data


Paper by Ivo D Dinov et al: “The UK Biobank is a rich national health resource that provides enormous opportunities for international researchers to examine, model, and analyze census-like multisource healthcare data. The archive presents several challenges related to aggregation and harmonization of complex data elements, feature heterogeneity and salience, and health analytics. Using 7,614 imaging, clinical, and phenotypic features of 9,914 subjects, we performed deep computed phenotyping using unsupervised clustering and derived two distinct sub-cohorts. Using parametric and nonparametric tests, we determined the top 20 most salient features contributing to the cluster separation. Our approach generated decision rules to predict the presence and progression of depression or other mental illnesses by jointly representing and modeling the significant clinical and demographic variables along with the derived salient neuroimaging features. We reported consistency and reliability measures of the derived computed phenotypes and the top salient imaging biomarkers that contributed to the unsupervised clustering. This clinical decision support system identified and utilized holistically the most critical biomarkers for predicting mental health, e.g., depression. External validation of this technique on different populations may lead to reducing healthcare expenses and improving the processes of diagnosis, forecasting, and tracking of normal and pathological aging….(More)”.
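
For readers who want the shape of that workflow, here is a heavily simplified sketch of the general approach the abstract describes: cluster subjects without labels, then test each feature for how strongly it separates the clusters. It runs on invented data with scikit-learn and scipy and is our illustration, not the authors’ pipeline.

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in for a subjects-by-features matrix (e.g., imaging plus clinical
# measures): two latent sub-cohorts, shifted on every feature.
X = np.vstack([rng.normal(0.0, 1.0, (500, 50)), rng.normal(0.8, 1.0, (500, 50))])

# Step 1: derive sub-cohorts by unsupervised clustering.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

# Step 2: rank features by how strongly they separate the two clusters,
# here with a nonparametric Mann-Whitney U test per feature.
pvals = np.array([stats.mannwhitneyu(X[labels == 0, j], X[labels == 1, j]).pvalue
                  for j in range(X.shape[1])])
top20 = np.argsort(pvals)[:20]
print("most salient feature indices:", top20)
```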

Tracking Phones, Google Is a Dragnet for the Police


Jennifer Valentino-DeVries at the New York Times: “….The warrants, which draw on an enormous Google database employees call Sensorvault, turn the business of tracking cellphone users’ locations into a digital dragnet for law enforcement. In an era of ubiquitous data gathering by tech companies, it is just the latest example of how personal information — where you go, who your friends are, what you read, eat and watch, and when you do it — is being used for purposes many people never expected. As privacy concerns have mounted among consumers, policymakers and regulators, tech companies have come under intensifying scrutiny over their data collection practices.

The Arizona case demonstrates the promise and perils of the new investigative technique, whose use has risen sharply in the past six months, according to Google employees familiar with the requests. It can help solve crimes. But it can also snare innocent people.

Technology companies have for years responded to court orders for specific users’ information. The new warrants go further, suggesting possible suspects and witnesses in the absence of other clues. Often, Google employees said, the company responds to a single warrant with location information on dozens or hundreds of devices.

Law enforcement officials described the method as exciting, but cautioned that it was just one tool….

The technique illustrates a phenomenon privacy advocates have long referred to as the “if you build it, they will come” principle — anytime a technology company creates a system that could be used in surveillance, law enforcement inevitably comes knocking. Sensorvault, according to Google employees, includes detailed location records involving at least hundreds of millions of devices worldwide and dating back nearly a decade….(More)”.

Access to Algorithms


Paper by Hannah Bloch-Wehba: “Federal, state, and local governments increasingly depend on automated systems — often procured from the private sector — to make key decisions about civil rights and civil liberties. When individuals affected by these decisions seek access to information about the algorithmic methodologies that produced them, governments frequently assert that this information is proprietary and cannot be disclosed. 

Recognizing that opaque algorithmic governance poses a threat to civil rights and liberties, scholars have called for a renewed focus on transparency and accountability for automated decision making. But scholars have neglected a critical avenue for promoting public accountability and transparency for automated decision making: the law of access to government records and proceedings. This Article fills this gap in the literature, recognizing that the Freedom of Information Act, its state equivalents, and the First Amendment provide unappreciated legal support for algorithmic transparency.

The law of access performs three critical functions in promoting algorithmic accountability and transparency. First, by enabling any individual to challenge algorithmic opacity in government records and proceedings, the law of access can relieve some of the burden otherwise borne by parties who are often poor and under-resourced. Second, access law calls into question government’s procurement of algorithmic decision making technologies from private vendors, subject to contracts that include sweeping protections for trade secrets and intellectual property rights. Finally, the law of access can promote an urgently needed public debate on algorithmic governance in the public sector….(More)”.

Statistics Estonia to coordinate data governance


Article by Miriam van der Sangen at CBS: “In 2018, Statistics Estonia launched a new strategy for the period 2018-2022. This strategy addresses the organisation’s aim to produce statistics more quickly while minimising the response burden on both businesses and citizens. Another element in the strategy is addressing the high expectations in Estonian society regarding the use of data. ‘We aim to transform Statistics Estonia into a national data agency,’ says Director General Mägi. ‘This means our role as a producer of official statistics will be enlarged by data governance responsibilities in the public sector. Taking on such responsibilities requires a clear vision of the whole public data ecosystem and also agreement to establish data stewards in most public sector institutions.’…

…the Estonian Parliament passed new legislation that effectively expanded the number of official tasks for Statistics Estonia. Mägi elaborates: ‘Most importantly, we shall be responsible for coordinating data governance. The detailed requirements and conditions of data governance will be specified further in the coming period.’ Under the new Act, Statistics Estonia will also have more possibilities to share data with other parties….

Statistics Estonia is fully committed to producing statistics which are based on big data. Mägi explains: ‘At the moment, we are actively working on two big data projects. One project involves the use of smart electricity meters. In this project, we are looking into ways to visualise business and household electricity consumption information. The second project involves web scraping of prices and enterprise characteristics. This project is still in an initial phase, but we can already see that the use of web scraping can improve the efficiency of our production process. We are aiming to extend the web scraping project by also identifying e-commerce and innovation activities of enterprises.’
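
For a sense of what the core mechanic of that second project looks like, here is a generic price-scraping sketch. The URL and the page structure (CSS classes) are hypothetical, and the libraries (requests, BeautifulSoup) are our choice; Statistics Estonia’s actual pipeline is not described in the interview.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical target: a retailer listing page where each product sits in a
# <div class="product"> with <span class="name"> and <span class="price"> children.
URL = "https://example.com/products"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

prices = []
for item in soup.select("div.product"):
    name = item.select_one("span.name").get_text(strip=True)
    price = item.select_one("span.price").get_text(strip=True)
    prices.append((name, price))

print(prices)  # rows like ("milk 1l", "0.89"), ready for a price-statistics pipeline
```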

Yet another ambitious goal for Statistics Estonia lies in the field of data science. ‘Similarly to Statistics Netherlands, we established experimental statistics and data mining activities years ago. Last year, we developed a so-called think-tank service, providing insights from data into all aspects of our lives. Think of birth, education, employment, et cetera. Our key clients are the various ministries, municipalities and the private sector. The main aim in the coming years is to speed up service time thanks to visualisations and data lake solutions.’ …(More)”.