Paging Dr. Google: How the Tech Giant Is Laying Claim to Health Data


Wall Street Journal: “Roughly a year ago, Google offered health-data company Cerner Corp. an unusually rich proposal.

Cerner was interviewing Silicon Valley giants to pick a storage provider for 250 million health records, one of the largest collections of U.S. patient data. Google dispatched former chief executive Eric Schmidt to personally pitch Cerner over several phone calls and offered around $250 million in discounts and incentives, people familiar with the matter say. 

Google had a bigger goal in pushing for the deal than dollars and cents: a way to expand its effort to collect, analyze and aggregate health data on millions of Americans. Google representatives were vague in answering questions about how Cerner’s data would be used, making the health-care company’s executives wary, the people say. Eventually, Cerner struck a storage deal with Amazon.com Inc. instead.

The failed Cerner deal reveals an emerging challenge to Google’s move into health care: gaining the trust of health care partners and the public. So far, that has hardly slowed the search giant.

Google has struck partnerships with some of the country’s largest hospital systems and most-renowned health-care providers, many of them vast in scope and few of their details previously reported. In just a few years, the company has achieved the ability to view or analyze tens of millions of patient health records in at least three-quarters of U.S. states, according to a Wall Street Journal analysis of contractual agreements. 

In certain instances, the deals allow Google to access personally identifiable health information without the knowledge of patients or doctors. The company can review complete health records, including names, dates of birth, medications and other ailments, according to people familiar with the deals.

The prospect of tech giants’ amassing huge troves of health records has raised concerns among lawmakers, patients and doctors, who fear such intimate data could be used without individuals’ knowledge or permission, or in ways they might not anticipate. 

Google is developing a search tool, similar to its flagship search engine, in which patient information is stored, collated and analyzed by the company’s engineers, on its own servers. The portal is designed for use by doctors and nurses, and eventually perhaps patients themselves, though some Google staffers would have access sooner. 

Google executives and some health systems say that detailed data sharing has the potential to improve health outcomes. Large troves of data help fuel algorithms Google is creating to detect lung cancer, eye disease and kidney injuries. Hospital executives have long sought better electronic record systems to reduce error rates and cut down on paperwork….

Legally, the information gathered by Google can be used for purposes beyond diagnosing illnesses, under laws enacted during the dial-up era. U.S. federal privacy laws make it possible for health-care providers, with little or no input from patients, to share data with certain outside companies. That applies to partners, like Google, with significant presences outside health care. The company says its intentions in health are unconnected with its advertising business, which depends largely on data it has collected on users of its many services, including email and maps.

Medical information is perhaps the last bounty of personal data yet to be scooped up by technology companies. The health data-gathering efforts of other tech giants such as Amazon and International Business Machines Corp. face skepticism from physician and patient advocates. But Google’s push in particular has set off alarm bells in the industry, including over privacy concerns. U.S. senators, as well as health-industry executives, are questioning Google’s expansion and its potential for commercializing personal data….(More)”.

Trusted smart statistics: Motivations and principles


Paper by Fabio Ricciato et al.: “In this contribution we outline the concept of Trusted Smart Statistics as the natural evolution of official statistics in the new datafied world. Traditional data sources, namely survey and administrative data, represent nowadays a valuable but small portion of the global data stock, much thereof being held in the private sector. The availability of new data sources is only one aspect of the global change that concerns official statistics. Other aspects, more subtle but not less important, include the changes in perceptions, expectations, behaviours and relations between the stakeholders. The environment around official statistics has changed: statistical offices are not any more data monopolists, but one prominent species among many others in a larger (and complex) ecosystem. What was established in the traditional world of legacy data sources (in terms of regulations, technologies, practices, etc.) is not guaranteed to be sufficient any more with new data sources.

Trusted Smart Statistics is not about replacing existing sources and processes, but augmenting them with new ones. Such augmentation however will not be only incremental: the path towards Trusted Smart Statistics is not about tweaking some components of the legacy system but about building an entirely new system that will coexist with the legacy one. In this position paper we outline some key design principles for the new Trusted Smart Statistics system. Taken collectively they picture a system where the smart and trust aspects enable and reinforce each other. A system that is more extrovert towards external stakeholders (citizens, private companies, public authorities) with whom Statistical Offices will be sharing computation, control, code, logs and of course final statistics, without necessarily sharing the raw input data….(More)”.

Towards adaptive governance in big data health research: implementing regulatory principles


Chapter by Alessandro Blasimme and Effy Vayena: “While data-enabled health care systems are in their infancy, biomedical research is rapidly adopting the big data paradigm. Digital epidemiology, for example, already employs data generated outside the public health care system – that is, data generated without the intent of using them for epidemiological research – to understand and prevent patterns of diseases in populations (Salathé 2018). Precision medicine – pooling together genomic, environmental and lifestyle data – also represents a prominent example of how data integration can drive both fundamental and translational research in important medical domains such as oncology (D. C. Collins et al. 2017). All of this requires the collection, storage, analysis and distribution of massive amounts of personal information as well as the use of state-of-the-art data analytics tools to uncover health- and disease-related patterns.


The realization of the potential of big data in health evokes a necessary commitment to a sense of “continuity” articulated in three distinct ways: a) from data generation to use (as in the data enabled learning health care); b) from research to clinical practice e.g. discovery of new mutations in the context of diagnostics; c) from strictly speaking health data (Vayena and Gasser 2016) e.g. clinical records, to less so e.g. tweets used in digital epidemiology. These continuities face the challenge of regulatory and governance approaches that were designed for clear data taxonomies, for a less blurred boundary between research and clinical practice, and for rules that focused mostly on data generation and less on their eventual and multiple uses.

The result is significant uncertainty about how responsible use of such large amounts of sensitive personal data could be fostered. In this chapter we focus on the uncertainties surrounding the use of biomedical big data in the context of health research. Are new criteria needed to review biomedical big data research projects? Do current mechanisms, such as informed consent, offer sufficient protection to research participants’ autonomy and privacy in this new context? Do existing oversight mechanisms ensure transparency and accountability in data access and sharing? What monitoring tools are available to assess how personal data are used over time? Is the equitable distribution of benefits accruing from such data uses considered, or can it be ensured? How is the public being involved – if at all – with decisions about creating and using large data repositories for research purposes? What is the role that IT (information technology) players, and especially big ones, acquire in research? And what regulatory instruments do we have to ensure that such players do not undermine the independence of research?…(More)”.

Responsible data sharing in a big data-driven translational research platform: lessons learned


Paper by S. Kalkman et al.: “The sharing of clinical research data is increasingly viewed as a moral duty [1]. Particularly in the context of making clinical trial data widely available, editors of international medical journals have labeled data sharing a highly efficient way to advance scientific knowledge [2,3,4]. The combination of even larger datasets into so-called “Big Data” is considered to offer even greater benefits for science, medicine and society [5]. Several international consortia have now promised to build grand-scale, Big Data-driven translational research platforms to generate better scientific evidence regarding disease etiology, diagnosis, treatment and prognosis across various disease areas [6,7,8].

Despite anticipated benefits, large-scale sharing of health data is charged with ethical questions. Stakeholders have been urged to consider how to manage privacy and confidentiality issues, ensure valid informed consent, and determine who gets to decide about data access [9]. More fundamentally, new data sharing activities prompt questions about social justice and public trust [10]. To balance potential benefits and ethical considerations, data sharing platforms require guidance for the processes of interaction and decision-making. In the European Union (EU), legal norms specified for the sharing of personal data for health research, most notably those set out in the General Data Protection Regulation (GDPR) (EU 2016/679), remain open to interpretation and offer limited practical guidance to researchers [11,12,13]. Striking in this regard is that the GDPR itself stresses the importance of adherence to ethical standards, when broad consent is put forward as a legal basis for the processing of personal data. For example, Recital 33 of the GDPR states that data subjects should be allowed to give “consent to certain areas of scientific research when in keeping with recognised ethical standards for scientific research” [14]. In fact, the GDPR actually encourages data controllers to establish self-regulating mechanisms, such as a code of conduct. To foster responsible and sustainable data sharing in translational research platforms, ethical guidance and governance is therefore necessary. Here, we define governance as ‘the processes of interaction and decision-making among the different stakeholders that are involved in a collective problem that lead to the creation, reinforcement, or reproduction of social norms and institutions’…(More)”.

A Matter of Trust: Higher Education Institutions as Information Fiduciaries in an Age of Educational Data Mining and Learning Analytics


Paper by Kyle M. L. Jones, Alan Rubel and Ellen LeClere: “Higher education institutions are mining and analyzing student data to effect educational, political, and managerial outcomes. Done under the banner of “learning analytics,” this work can—and often does—surface sensitive data and information about, inter alia, a student’s demographics, academic performance, offline and online movements, physical fitness, mental wellbeing, and social network. With these data, institutions and third parties are able to describe student life, predict future behaviors, and intervene to address academic or other barriers to student success (however defined). Learning analytics, consequently, raise serious issues concerning student privacy, autonomy, and the appropriate flow of student data.

We argue that issues around privacy lead to valid questions about the degree to which students should trust their institution to use learning analytics data and other artifacts (algorithms, predictive scores) with their interests in mind. We argue that higher education institutions are paradigms of information fiduciaries. As such, colleges and universities have a special responsibility to their students. In this article, we use the information fiduciary concept to analyze cases when learning analytics violate an institution’s responsibility to its students….(More)”.

Biased Algorithms Are Easier to Fix Than Biased People


Sendhil Mullainathan in The New York Times: “In one study published 15 years ago, two people applied for a job. Their résumés were about as similar as two résumés can be. One person was named Jamal, the other Brendan.

In a study published this year, two patients sought medical care. Both were grappling with diabetes and high blood pressure. One patient was black, the other was white.

Both studies documented racial injustice: In the first, the applicant with a black-sounding name got fewer job interviews. In the second, the black patient received worse care.

But they differed in one crucial respect. In the first, hiring managers made biased decisions. In the second, the culprit was a computer program.

As a co-author of both studies, I see them as a lesson in contrasts. Side by side, they show the stark differences between two types of bias: human and algorithmic.

Marianne Bertrand, an economist at the University of Chicago, and I conducted the first study: We responded to actual job listings with fictitious résumés, half of which were randomly assigned a distinctively black name.

The study was: “Are Emily and Greg more employable than Lakisha and Jamal?”

The answer: Yes, and by a lot. Simply having a white name increased callbacks for job interviews by 50 percent.

I published the other study in the journal “Science” in late October with my co-authors: Ziad Obermeyer, a professor of health policy at University of California at Berkeley; Brian Powers, a clinical fellow at Brigham and Women’s Hospital; and Christine Vogeli, a professor of medicine at Harvard Medical School. We focused on an algorithm that is widely used in allocating health care services, and has affected roughly a hundred million people in the United States.

To better target care and provide help, health care systems are turning to voluminous data and elaborately constructed algorithms to identify the sickest patients.

We found these algorithms have a built-in racial bias. At similar levels of sickness, black patients were deemed to be at lower risk than white patients. The magnitude of the distortion was immense: Eliminating the algorithmic bias would more than double the number of black patients who would receive extra help. The problem lay in a subtle engineering choice: to measure “sickness,” they used the most readily available data, health care expenditures. But because society spends less on black patients than equally sick white ones, the algorithm understated the black patients’ true needs.

One difference between these studies is the work needed to uncover bias…(More)”.

One Nation Tracked: An investigation into the smartphone tracking industry


Stuart A. Thompson and Charlie Warzel at the New York Times: “…For brands, following someone’s precise movements is key to understanding the “customer journey” — every step of the process from seeing an ad to buying a product. It’s the Holy Grail of advertising, one marketer said, the complete picture that connects all of our interests and online activity with our real-world actions.

Pointillist location data also has some clear benefits to society. Researchers can use the raw data to provide key insights for transportation studies and government planners. The City Council of Portland, Ore., unanimously approved a deal to study traffic and transit by monitoring millions of cellphones. Unicef announced a plan to use aggregated mobile location data to study epidemics, natural disasters and demographics.

For individual consumers, the value of constant tracking is less tangible. And the lack of transparency from the advertising and tech industries raises still more concerns.

Does a coupon app need to sell second-by-second location data to other companies to be profitable? Does that really justify allowing companies to track millions and potentially expose our private lives?

Data companies say users consent to tracking when they agree to share their location. But those consent screens rarely make clear how the data is being packaged and sold. If companies were clearer about what they were doing with the data, would anyone agree to share it?

What about data collected years ago, before hacks and leaks made privacy a forefront issue? Should it still be used, or should it be deleted for good?

If it’s possible that data stored securely today can easily be hacked, leaked or stolen, is this kind of data worth that risk?

Is all of this surveillance and risk worth it merely so that we can be served slightly more relevant ads? Or so that hedge fund managers can get richer?

The companies profiting from our every move can’t be expected to voluntarily limit their practices. Congress has to step in to protect Americans’ needs as consumers and rights as citizens.

Until then, one thing is certain: We are living in the world’s most advanced surveillance system. This system wasn’t created deliberately. It was built through the interplay of technological advance and the profit motive. It was built to make money. The greatest trick technology companies ever played was persuading society to surveil itself….(More)”.

Why the Global South should nationalise its data


Ulises Ali Mejias at Al Jazeera: “The recent coup in Bolivia reminds us that poor countries rich in resources continue to be plagued by the legacy of colonialism. Anything that stands in the way of a foreign corporation’s ability to extract cheap resources must be removed.

Today, apart from minerals and fossil fuels, corporations are after another precious resource: Personal data. As with natural resources, data too has become the target of extractive corporate practices.

As sociologist Nick Couldry and I argue in our book, The Costs of Connection: How Data is Colonizing Human Life and Appropriating It for Capitalism, there is a new form of colonialism emerging in the world: data colonialism. By this, we mean a new resource-grab whereby human life itself has become a direct input into economic production in the form of extracted data.

We acknowledge that this term is controversial, given the extreme physical violence and structures of racism that historical colonialism employed. However, our point is not to say that data colonialism is the same as historical colonialism, but rather to suggest that it shares the same core function: extraction, exploitation, and dispossession.

Like classical colonialism, data colonialism violently reconfigures human relations to economic production. Things like land, water, and other natural resources were valued by native people in the precolonial era, but not in the same way that colonisers (and later, capitalists) came to value them: as private property. Likewise, we are experiencing a situation in which things that were once primarily outside the economic realm – things like our most intimate social interactions with friends and family, or our medical records – have now been commodified and made part of an economic cycle of data extraction that benefits a few corporations.

So what could countries in the Global South do to avoid the dangers of data colonialism?…(More)”.

Assessing employer intent when AI hiring tools are biased


Report by Caitlin Chin at Brookings: “When it comes to gender stereotypes in occupational roles, artificial intelligence (AI) has the potential to either mitigate historical bias or heighten it. In the case of the Word2vec model, AI appears to do both.

Word2vec is a publicly available algorithmic model built on millions of words scraped from online Google News articles, which computer scientists commonly use to analyze word associations. In 2016, Microsoft and Boston University researchers revealed that the model picked up gender stereotypes existing in online news sources—and furthermore, that these biased word associations were overwhelmingly job related. Upon discovering this problem, the researchers neutralized the biased word correlations in their specific algorithm, writing that “in a small way debiased word embeddings can hopefully contribute to reducing gender bias in society.”

Their study draws attention to a broader issue with artificial intelligence: Because algorithms often emulate the training datasets that they are built upon, biased input datasets could generate flawed outputs. Because many contemporary employers utilize predictive algorithms to scan resumes, direct targeted advertising, or even conduct face- or voice-recognition-based interviews, it is crucial to consider whether popular hiring tools might be susceptible to the same cultural biases that the researchers discovered in Word2vec.

In this paper, I discuss how hiring is a multi-layered and opaque process and how it will become more difficult to assess employer intent as recruitment processes move online. Because intent is a critical aspect of employment discrimination law, I ultimately suggest four ways upon which to include it in the discussion surrounding algorithmic bias….(More)”.

This report from The Brookings Institution’s Artificial Intelligence and Emerging Technology (AIET) Initiative is part of “AI and Bias,” a series that explores ways to mitigate possible biases and create a pathway toward greater fairness in AI and emerging technologies.

Responsible Operations: Data Science, Machine Learning, and AI in Libraries


OCLC Research Position Paper by Thomas Padilla: “Despite greater awareness, significant gaps persist between concept and operationalization in libraries at the level of workflows (managing bias in probabilistic description), policies (community engagement vis-à-vis the development of machine-actionable collections), positions (developing staff who can utilize, develop, critique, and/or promote services influenced by data science, machine learning, and AI), collections (development of “gold standard” training data), and infrastructure (development of systems that make use of these technologies and methods). Shifting from awareness to operationalization will require holistic organizational commitment to responsible operations. The viability of responsible operations depends on organizational incentives and protections that promote constructive dissent…(More)”.