The Why of the World


Book review by Tim Maudlin of The Book of Why: The New Science of Cause and Effect by Judea Pearl and Dana Mackenzie: “Correlation is not causation.” Though true and important, the warning has hardened into the familiarity of a cliché. Stock examples of so-called spurious correlations are now a dime a dozen. As one example goes, a Pacific island tribe believed flea infestations to be good for one’s health because they observed that healthy people had fleas while sick people did not. The correlation is real and robust, but fleas do not cause health, of course: they merely indicate it. Fleas on a fevered body abandon ship and seek a healthier host. One should not seek out and encourage fleas in the quest to ward off sickness.

The rub lies in another observation: that the evidence for causation seems to lie entirely in correlations. But for seeing correlations, we would have no clue about causation. The only reason we discovered that smoking causes lung cancer, for example, is that we observed correlations in that particular circumstance. And thus a puzzle arises: if causation cannot be reduced to correlation, how can correlation serve as evidence of causation?
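
A toy simulation makes the puzzle concrete (a hypothetical sketch, not an example from the book): when a hidden common cause, here underlying health, drives both flea presence and freedom from sickness, the two variables correlate strongly even though neither causes the other.

```python
import random

# Hypothetical sketch of the flea example: a latent common cause
# (underlying health) drives both flea presence and freedom from
# sickness, manufacturing a correlation with no causation behind it.
random.seed(0)

def simulate(n=100_000):
    pairs = []
    for _ in range(n):
        healthy = random.random() < 0.5                       # latent common cause
        fleas = random.random() < (0.8 if healthy else 0.1)   # fleas prefer healthy hosts
        sick = random.random() < (0.05 if healthy else 0.7)   # health wards off sickness
        pairs.append((fleas, sick))
    return pairs

pairs = simulate()
with_fleas = [s for f, s in pairs if f]
without_fleas = [s for f, s in pairs if not f]
print(f"P(sick | fleas)    = {sum(with_fleas) / len(with_fleas):.2f}")        # ~0.12
print(f"P(sick | no fleas) = {sum(without_fleas) / len(without_fleas):.2f}")  # ~0.58
```

The sick are rarer among the flea-ridden, yet encouraging fleas would accomplish nothing: the correlation is real, the causation absent.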

The Book of Why, co-authored by the computer scientist Judea Pearl and the science writer Dana Mackenzie, sets out to give a new answer to this old question, which has been around—in some form or another, posed by scientists and philosophers alike—at least since the Enlightenment. In 2011 Pearl won the Turing Award, computer science’s highest honor, for “fundamental contributions to artificial intelligence through the development of a calculus of probabilistic and causal reasoning,” and this book aims to explain what all that means for a general audience, updating his more technical book on the same subject, Causality, published nearly two decades ago. Written in the first person, the new volume mixes theory, history, and memoir, detailing both the technical tools of causal reasoning Pearl has developed and the tortuous path by which he arrived at them—all along bucking a scientific establishment that, in his telling, had long ago contented itself with data-crunching analysis of correlations at the expense of investigation of causes. There are nuggets of wisdom and cautionary tales in both these aspects of the book, the scientific as well as the sociological…(More)”.

Sharenthood: Why We Should Think before We Talk about Our Kids Online


Book by Leah Plunkett: “Our children’s first digital footprints are made before they can walk—even before they are born—as parents use fertility apps to aid conception, post ultrasound images, and share their baby’s hospital mug shot. Then, in rapid succession come terabytes of baby pictures stored in the cloud, digital baby monitors with built-in artificial intelligence, and real-time updates from daycare. When school starts, there are cafeteria cards that catalog food purchases, bus passes that track when kids are on and off the bus, electronic health records in the nurse’s office, and a school surveillance system that has eyes everywhere. Unwittingly, parents, teachers, and other trusted adults are compiling digital dossiers for children that could be available to everyone—friends, employers, law enforcement—forever. In this incisive book, Leah Plunkett examines the implications of “sharenthood”—adults’ excessive digital sharing of children’s data. She outlines the mistakes adults make with kids’ private information, the risks that result, and the legal system that enables “sharenting.”

Plunkett describes various modes of sharenting—including “commercial sharenting,” efforts by parents to use their families’ private experiences to make money—and unpacks the faulty assumptions made by our legal system about children, parents, and privacy. She proposes a “thought compass” to guide adults in their decision making about children’s digital data: play, forget, connect, and respect. Enshrining every false step and bad choice, Plunkett argues, can rob children of their chance to explore and learn lessons. The Internet needs to forget. We need to remember….(More)”.

Is Privacy and Personal Data Set to Become the New Intellectual Property?


Paper by Leon Trakman, Robert Walters, and Bruno Zeller: “A pressing concern today is whether the rationale underlying the protection of personal data is itself a meaningful foundation for according intellectual property (IP) rights in personal data to data subjects. In particular, are there technological attributes of the collection, use and processing of personal data on the Internet, and of global access to that data, that provide a strong justification to extend IP rights to data subjects? A central issue in making that determination is whether data subjects need the protection of such rights in a technological revolution in which they are increasingly exposed to the use and abuse of their personal data. A further question is how IP law can provide them with the requisite protection of their private space, or whether other means of protecting personal data, such as through general contract rights, render IP protections redundant, or at least, less necessary. This paper maintains that lawmakers often fail to distinguish between general property and IP protection of personal data; that IP protection encompasses important attributes of both property and contract law; and that laws that implement IP protection in light of its sui generis attributes are more fitting means of protecting personal data than the alternatives. The paper demonstrates that providing IP rights in personal data goes some way toward strengthening data subjects’ control and protection over their personal data and strengthening data protection law more generally. It also argues for greater harmonization of IP law across jurisdictions to ensure that the protection of personal data becomes more coherent and internationally sustainable….(More)”.

How to Build Artificial Intelligence We Can Trust


Gary Marcus and Ernest Davis at the New York Times: “Artificial intelligence has a trust problem. We are relying on A.I. more and more, but it hasn’t yet earned our confidence.

Tesla cars driving in Autopilot mode, for example, have a troubling history of crashing into stopped vehicles. Amazon’s facial recognition system works great much of the time, but when asked to compare the faces of all 535 members of Congress with 25,000 public arrest photos, it found 28 matches, when in reality there were none. A computer program designed to vet job applicants for Amazon was discovered to systematically discriminate against women. Every month new weaknesses in A.I. are uncovered.

The problem is not that today’s A.I. needs to get better at what it does. The problem is that today’s A.I. needs to try to do something completely different.

In particular, we need to stop building computer systems that merely get better and better at detecting statistical patterns in data sets — often using an approach known as deep learning — and start building computer systems that from the moment of their assembly innately grasp three basic concepts: time, space and causality….

We face a choice. We can stick with today’s approach to A.I. and greatly restrict what the machines are allowed to do (lest we end up with autonomous-vehicle crashes and machines that perpetuate bias rather than reduce it). Or we can shift our approach to A.I. in the hope of developing machines that have a rich enough conceptual understanding of the world that we need not fear their operation. Anything else would be too risky….(More)”.

How Should Scientists’ Access To Health Databanks Be Managed?


Richard Harris at NPR: “More than a million Americans have donated genetic information and medical data for research projects. But how that information gets used varies a lot, depending on the philosophy of the organizations that have gathered the data.

Some hold the data close, while others are working to make the data available to as many researchers as possible — figuring science will progress faster that way. But scientific openness can be constrained by both practical and commercial considerations.

Three major projects in the United States illustrate these differing philosophies.

VA scientists spearhead research on veterans database

The first project involves three-quarters of a million veterans, mostly men over age 60. Every day, 400 to 500 blood samples show up in a modern lab in the basement of the Veterans Affairs hospital in Boston. Luis Selva, the center’s associate director, explains that robots extract DNA from the samples and then the genetic material is sent out for analysis….

Intermountain Healthcare teams with deCODE genetics

Our second example involves what is largely an extended family: descendants of settlers in Utah, primarily from the Church of Jesus Christ of Latter-day Saints. This year, Intermountain Healthcare in Utah announced that it was going to sequence the complete DNA of half a million of its patients, resulting in what the health system says will be the world’s largest collection of complete genomes….

NIH’s All of Us aims to diversify and democratize research

Our third and final example is an effort by the National Institutes of Health to recruit a million Americans for a long-term study of health, behavior and genetics. Its philosophy sharply contrasts with that of Intermountain Healthcare.

“We do have a very strong goal around diversity, in making sure that the participants in the All of Us research program reflect the vast diversity of the United States,” says Stephanie Devaney, the program’s deputy director….(More)”.

Raw data won’t solve our problems — asking the right questions will


Stefaan G. Verhulst in apolitical: “If I had only one hour to save the world, I would spend fifty-five minutes defining the questions, and only five minutes finding the answers,” is a famous aphorism attributed to Albert Einstein.

Behind this quote is an important insight about human nature: Too often, we leap to answers without first pausing to examine our questions. We tout solutions without considering whether we are addressing real or relevant challenges or priorities. We advocate fixes for problems, or for aspects of society, that may not be broken at all.

This misordering of priorities is especially acute — and represents a missed opportunity — in our era of big data. Today’s data has enormous potential to solve important public challenges.

However, policymakers often fail to invest in defining the questions that matter, focusing mainly on the supply side of the data equation (“What data do we have or must have access to?”) rather than the demand side (“What is the core question and what data do we really need to answer it?” or “What data can or should we actually use to solve those problems that matter?”).

As such, data initiatives often provide marginal insights while at the same time generating unnecessary privacy risks by accessing and exploring data that may not in fact be needed at all in order to address the root of our most important societal problems.

A new science of questions

So what are the truly vexing questions that deserve attention and investment today? Toward what end should we strategically seek to leverage data and AI?

The truth is that policymakers and other stakeholders currently don’t have a good way of defining questions or identifying priorities, nor a clear framework to help us leverage the potential of data and data science toward the public good.

This is a situation we seek to remedy at The GovLab, an action research center based at New York University.

Our most recent project, the 100 Questions Initiative, seeks to begin developing a new science and practice of questions — one that identifies the most urgent questions in a participatory manner. Launched last month, the project aims to develop a process that takes advantage of distributed and diverse expertise on a range of given topics or domains so as to identify and prioritize those questions that are high impact, novel and feasible.

Because we live in an age of data and much of our work focuses on the promises and perils of data, we seek to identify the 100 most pressing problems confronting the world that could be addressed by greater use of existing, often inaccessible, datasets through data collaboratives – new forms of cross-disciplinary collaboration beyond public-private partnerships focused on leveraging data for good….(More)”.

How Tulsa is Preserving Privacy and Sharing Data for Social Good


Data Across Sectors for Health: “Data sharing between organizations addressing social risk factors has the potential to amplify impact by increasing direct service capacity and efficiency. Unfortunately, the risks of and restrictions on sharing personal data often limit this potential, and adherence to regulations such as HIPAA and FERPA can make data sharing a significant challenge.

DASH CIC-START awardee Restore Hope Ministries worked with Asemio to utilize technology that allows for the analysis of personally identifiable information while preserving clients’ privacy. The collaboration shared their findings in a new white paper that describes the process of using multi-party computation technology to answer questions that can aid service providers in exploring the barriers that underserved populations may be facing. The first question they asked: what is the overlap of populations served by two distinct organizations? The results of the overlap analysis confirmed that a significant opportunity exists to increase access to services for a subset of individuals through better outreach…(More)”
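
The white paper’s protocol isn’t reproduced in this excerpt, but the overlap question maps onto a standard multi-party computation primitive: private set intersection. The sketch below illustrates the idea with a toy Diffie-Hellman-style exchange; it is not Asemio’s implementation, and its invented identifiers and parameters are nowhere near production grade.

```python
import hashlib
import secrets

# Toy Diffie-Hellman-style private set intersection (PSI), for
# illustration only. NOT Asemio's implementation: the "clients" are
# invented and the parameters are unsuitable for real deployments.

P = 2**521 - 1  # a Mersenne prime; real systems use vetted groups/curves

def hash_to_group(record: str) -> int:
    """Map a client identifier into the multiplicative group mod P."""
    return int.from_bytes(hashlib.sha256(record.encode()).digest(), "big")

def blind(records, key):
    """A party raises each hashed record to its secret exponent mod P."""
    return {pow(hash_to_group(r), key, P) for r in records}

def reblind(values, key):
    """The counterparty applies its exponent on top; order is irrelevant."""
    return {pow(v, key, P) for v in values}

# Each organization's client list stays on its own side.
org_a_clients = {"alice", "bob", "carol"}
org_b_clients = {"bob", "carol", "dan", "erin"}

key_a = secrets.randbelow(P - 2) + 1  # org A's secret exponent
key_b = secrets.randbelow(P - 2) + 1  # org B's secret exponent

# The parties exchange blinded sets and re-blind what they receive:
# H(x)^(key_a * key_b) is identical no matter which key was applied first.
a_doubly_blinded = reblind(blind(org_a_clients, key_a), key_b)
b_doubly_blinded = reblind(blind(org_b_clients, key_b), key_a)

# Comparing doubly blinded values reveals only the size of the overlap.
print("overlap size:", len(a_doubly_blinded & b_doubly_blinded))  # -> 2
```

Because each side ever sees only blinded values, neither learns which names the other holds; matching values expose the size of the overlap and nothing more.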

What statistics can and can’t tell us about ourselves


Hannah Fry at The New Yorker: “Harold Eddleston, a seventy-seven-year-old from Greater Manchester, was still reeling from a cancer diagnosis he had been given that week when, on a Saturday morning in February, 1998, he received the worst possible news. He would have to face the future alone: his beloved wife had died unexpectedly, from a heart attack.

Eddleston’s daughter, concerned for his health, called their family doctor, a well-respected local man named Harold Shipman. He came to the house, sat with her father, held his hand, and spoke to him tenderly. Pushed for a prognosis as he left, Shipman replied portentously, “I wouldn’t buy him any Easter eggs.” By Wednesday, Eddleston was dead; Dr. Shipman had murdered him.

Harold Shipman was one of the most prolific serial killers in history. In a twenty-three-year career as a mild-mannered and well-liked family doctor, he injected at least two hundred and fifteen of his patients with lethal doses of opiates. He was finally arrested in September, 1998, six months after Eddleston’s death.

David Spiegelhalter, the author of an important and comprehensive new book, “The Art of Statistics” (Basic), was one of the statisticians tasked by the ensuing public inquiry to establish whether the mortality rate of Shipman’s patients should have aroused suspicion earlier. Then a biostatistician at Cambridge, Spiegelhalter found that Shipman’s excess mortality—the number of his older patients who had died in the course of his career over the number that would be expected of an average doctor’s—was a hundred and seventy-four women and forty-nine men at the time of his arrest. The total closely matched the number of victims confirmed by the inquiry….
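
The arithmetic at the core of that finding is simple to state, even if estimating the expectation carefully is not: excess mortality is observed deaths minus the deaths an average doctor with a comparable patient roster would be expected to see. A sketch with invented numbers (not the inquiry’s figures):

```python
# Back-of-the-envelope sketch of an excess-mortality calculation.
# The numbers below are invented for illustration; the inquiry's actual
# figures and methods were far more carefully derived.

def excess_mortality(observed_deaths: int, patients: int,
                     baseline_rate: float) -> float:
    """Observed deaths minus those expected of an average practice."""
    expected = patients * baseline_rate
    return observed_deaths - expected

# Hypothetical: 210 deaths among 2,500 comparable patients, where an
# average practice with that patient profile would expect a 5% rate.
print(excess_mortality(observed_deaths=210, patients=2500,
                       baseline_rate=0.05))  # -> 85.0 deaths above expectation
```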

In 1825, the French Ministry of Justice ordered the creation of a national collection of crime records. It seems to have been the first of its kind anywhere in the world—the statistics of every arrest and conviction in the country, broken down by region, assembled and ready for analysis. It’s the kind of data set we take for granted now, but at the time it was extraordinarily novel. This was an early instance of Big Data—the first time that mathematical analysis had been applied in earnest to the messy and unpredictable realm of human behavior.

Or maybe not so unpredictable. In the early eighteen-thirties, a Belgian astronomer and mathematician named Adolphe Quetelet analyzed the numbers and discovered a remarkable pattern. The crime records were startlingly consistent. Year after year, irrespective of the actions of courts and prisons, the number of murders, rapes, and robberies reached almost exactly the same total. There is a “terrifying exactitude with which crimes reproduce themselves,” Quetelet said. “We know in advance how many individuals will dirty their hands with the blood of others. How many will be forgers, how many poisoners.”

To Quetelet, the evidence suggested that there was something deeper to discover. He developed the idea of a “Social Physics,” and began to explore the possibility that human lives, like planets, had an underlying mechanistic trajectory. There’s something unsettling in the idea that, amid the vagaries of choice, chance, and circumstance, mathematics can tell us something about what it is to be human. Yet Quetelet’s overarching findings still stand: at some level, human life can be quantified and predicted. We can now forecast, with remarkable accuracy, the number of women in Germany who will choose to have a baby each year, the number of car accidents in Canada, the number of plane crashes across the Southern Hemisphere, even the number of people who will visit a New York City emergency room on a Friday evening….(More)”

The 9 Pitfalls of Data Science


Book by Gary Smith and Jay Cordes: “Data science has never had more influence on the world. Large companies are now seeing the benefit of employing data scientists to interpret the vast amounts of data that now exist. However, the field is so new and evolving so rapidly that the analysis produced can be haphazard at best.

The 9 Pitfalls of Data Science shows us real-world examples of what can go wrong. Written to be an entertaining read, this invaluable guide investigates the all too common mistakes of data scientists – who can be plagued by lazy thinking, whims, hunches, and prejudices – and indicates how they have been at the root of many disasters, including the Great Recession. 

Gary Smith and Jay Cordes emphasise how scientific rigor and critical thinking skills are indispensable in this age of Big Data, as machines often find meaningless patterns that can lead to dangerous false conclusions. The 9 Pitfalls of Data Science is loaded with entertaining tales of both successful and misguided approaches to interpreting data: grand successes and epic failures alike. These cautionary tales will not only help data scientists be more effective, but also help the public distinguish between good and bad data science….(More)”.

Study finds Big Data eliminates confidentiality in court judgements


Swissinfo: “Swiss researchers have found that algorithms that mine large swaths of data can eliminate anonymity in federal court rulings. This could have major ramifications for transparency and privacy protection.

This is the result of a study by the University of Zurich’s Institute of Law, published in the legal journal “Jusletter” and shared by Swiss public television SRF on Monday.

The study relied on a “web scraping technique”, or mining of large swaths of data. The researchers created a database of all decisions of the Supreme Court available online from 2000 to 2018 – a total of 122,218 decisions. Decisions from the Federal Administrative Court and the Federal Office of Public Health were also added.

Using an algorithm and manual searches for connections between data, the researchers were able to de-anonymise (that is, reveal the identities in) 84% of the judgments in less than an hour.

In this specific study, the researchers were able to identify the pharma companies and medicines hidden in the documents of the complaints filed in court.  
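
The excerpt doesn’t spell out the algorithm, but the general mechanism behind such studies is record linkage on quasi-identifiers: details left in the anonymised text are joined against public datasets until only one identity fits. A toy illustration, with invented data and field names rather than the study’s actual sources:

```python
# Toy sketch of quasi-identifier linkage, the general technique behind
# such de-anonymisation studies. All data and field names are invented;
# the study's actual algorithm and sources are described in the paper.

# Anonymised rulings: party names redacted, but case facts remain.
rulings = [
    {"case": "A_123/2015", "drug_class": "statin", "approval_year": 2012},
    {"case": "B_456/2017", "drug_class": "biologic", "approval_year": 2016},
]

# A public registry (e.g., drug approvals) with identities intact.
registry = [
    {"drug": "Zetorin", "company": "Alpha Pharma AG",
     "drug_class": "statin", "approval_year": 2012},
    {"drug": "Biovex", "company": "Beta Biotech SA",
     "drug_class": "biologic", "approval_year": 2016},
]

# Join on the quasi-identifiers the two datasets share. When the
# combination is rare enough, it pins down a unique identity.
for ruling in rulings:
    matches = [r for r in registry
               if r["drug_class"] == ruling["drug_class"]
               and r["approval_year"] == ruling["approval_year"]]
    if len(matches) == 1:  # unique combination => re-identified
        m = matches[0]
        print(f'{ruling["case"]}: likely {m["company"]} ({m["drug"]})')
```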

Study authors say that this could have far-reaching consequences for transparency and privacy. One of the study’s co-authors, Kerstin Noëlle Vokinger, professor of law at the University of Zurich, explains that “with today’s technological possibilities, anonymisation is no longer guaranteed in certain areas”. The researchers say the technique could be applied to any publicly available database.

Vokinger added there is a need to balance necessary transparency while safeguarding the personal rights of individuals.

Adrian Lobsiger, the Swiss Federal Data Protection Commissioner, told SRF that this confirms his view that facts may need to be treated as personal data in the age of technology….(More)”.