What statistics can and can’t tell us about ourselves


Hannah Fry at The New Yorker: “Harold Eddleston, a seventy-seven-year-old from Greater Manchester, was still reeling from a cancer diagnosis he had been given that week when, on a Saturday morning in February, 1998, he received the worst possible news. He would have to face the future alone: his beloved wife had died unexpectedly, from a heart attack.

Eddleston’s daughter, concerned for his health, called their family doctor, a well-respected local man named Harold Shipman. He came to the house, sat with her father, held his hand, and spoke to him tenderly. Pushed for a prognosis as he left, Shipman replied portentously, “I wouldn’t buy him any Easter eggs.” By Wednesday, Eddleston was dead; Dr. Shipman had murdered him.

Harold Shipman was one of the most prolific serial killers in history. In a twenty-three-year career as a mild-mannered and well-liked family doctor, he injected at least two hundred and fifteen of his patients with lethal doses of opiates. He was finally arrested in September, 1998, six months after Eddleston’s death.

David Spiegelhalter, the author of an important and comprehensive new book, “The Art of Statistics” (Basic), was one of the statisticians tasked by the ensuing public inquiry to establish whether the mortality rate of Shipman’s patients should have aroused suspicion earlier. Then a biostatistician at Cambridge, Spiegelhalter found that Shipman’s excess mortality—the number of his older patients who had died in the course of his career over the number that would be expected of an average doctor’s—was a hundred and seventy-four women and forty-nine men at the time of his arrest. The total closely matched the number of victims confirmed by the inquiry….

In 1825, the French Ministry of Justice ordered the creation of a national collection of crime records. It seems to have been the first of its kind anywhere in the world—the statistics of every arrest and conviction in the country, broken down by region, assembled and ready for analysis. It’s the kind of data set we take for granted now, but at the time it was extraordinarily novel. This was an early instance of Big Data—the first time that mathematical analysis had been applied in earnest to the messy and unpredictable realm of human behavior.

Or maybe not so unpredictable. In the early eighteen-thirties, a Belgian astronomer and mathematician named Adolphe Quetelet analyzed the numbers and discovered a remarkable pattern. The crime records were startlingly consistent. Year after year, irrespective of the actions of courts and prisons, the number of murders, rapes, and robberies reached almost exactly the same total. There is a “terrifying exactitude with which crimes reproduce themselves,” Quetelet said. “We know in advance how many individuals will dirty their hands with the blood of others. How many will be forgers, how many poisoners.”

To Quetelet, the evidence suggested that there was something deeper to discover. He developed the idea of a “Social Physics,” and began to explore the possibility that human lives, like planets, had an underlying mechanistic trajectory. There’s something unsettling in the idea that, amid the vagaries of choice, chance, and circumstance, mathematics can tell us something about what it is to be human. Yet Quetelet’s overarching findings still stand: at some level, human life can be quantified and predicted. We can now forecast, with remarkable accuracy, the number of women in Germany who will choose to have a baby each year, the number of car accidents in Canada, the number of plane crashes across the Southern Hemisphere, even the number of people who will visit a New York City emergency room on a Friday evening….(More)”

Index: The Data Universe 2019


By Michelle Winowatan, Andrew J. Zahuranec, Andrew Young, Stefaan Verhulst, Max Jun Kim

The Living Library Index – inspired by the Harper’s Index – provides important statistics and highlights global trends in governance innovation. This installment focuses on the data universe.

Please share any additional, illustrative statistics on data, or other issues at the nexus of technology and governance, with us at [email protected]

Internet Traffic:

  • Percentage of the world’s population that uses the internet: 51.2% (3.9 billion people) – 2018
  • Number of search processed worldwide by Google every year: at least 2 trillion – 2016
  • Website traffic worldwide generated through mobile phones: 52.2% – 2018
  • The total number of mobile subscriptions in the first quarter of 2019: 7.9 billion (addition of 44 million in quarter) – 2019
  • Amount of mobile data traffic worldwide: nearly 30 billion GB – 2018
  • Data category with highest traffic worldwide: video (60%) – 2018
  • Global average of data traffic per smartphone per month: 5.6 GB – 2018
    • North America: 7 GB – 2018
    • Latin America: 3.1 GB – 2018
    • Western Europe: 6.7 GB – 2018
    • Central and Eastern Europe: 4.5 GB – 2018
    • North East Asia: 7.1 GB – 2018
    • Southeast Asia and Oceania: 3.6 GB – 2018
    • India, Nepal, and Bhutan: 9.8 GB – 2018
    • Middle East and Africa: 3.0 GB – 2018
  • Time between the creation of each new bitcoin block: 9.27 minutes – 2019

Streaming Services:

  • Total hours of video streamed by Netflix users every minute: 97,222 – 2017
  • Hours of YouTube watched per day: over 1 billion – 2018
  • Number of tracks uploaded to Spotify every day: Over 20,000 – 2019
  • Number of Spotify’s monthly active users: 232 million – 2019
  • Spotify’s total subscribers: 108 million – 2019
  • Spotify’s hours of content listened: 17 billion – 2019
  • Total number of songs on Spotify’s catalog: over 30 million – 2019
  • Apple Music’s total subscribers: 60 million – 2019
  • Total number of songs on Apple Music’s catalog: 45 million – 2019

Social Media:

Calls and Messaging:

Retail/Financial Transaction:

  • Number of packages shipped by Amazon in a year: 5 billion – 2017
  • Total value of payments processed by Venmo in a year: USD 62 billion – 2019
  • Based on an independent analysis of public transactions on Venmo in 2017:
  • Based on a non-representative survey of 2,436 US consumers between the ages of 21 and 72 on P2P platforms:
    • The average volume of transactions handled by Venmo: USD 64.2 billion – 2019
    • The average volume of transactions handled by Zelle: USD 122.0 billion – 2019
    • The average volume of transactions handled by PayPal: USD 141.8 billion – 2019 
    • Platform with the highest percent adoption among all consumers: PayPal (48%) – 2019 

Internet of Things:

Sources:

The hidden assumptions in public engagement: A case study of engaging on ethics in government data analysis


Paper by Emily S. Rempel, Julie Barnett and Hannah Durrant: “This study examines the hidden assumptions around running public-engagement exercises in government. We study an example of public engagement on the ethics of combining and analysing data in national government – often called data science ethics. We study hidden assumptions, drawing on hidden curriculum theories in education research, as it allows us to identify conscious and unconscious underlying processes related to conducting public engagement that may impact results. Through participation in the 2016 Public Dialogue for Data Science Ethics in the UK, four key themes were identified that exposed underlying public engagement norms. First, that organizers had constructed a strong imagined public as neither overly critical nor supportive, which they used to find and engage participants. Second, that official aims of the engagement, such as including publics in developing ethical data regulations, were overshadowed by underlying meta-objectives, such as counteracting public fears. Third, that advisory group members, organizers and publics understood the term ‘engagement’ in varying ways, from creating interest to public inclusion. And finally, that stakeholder interests, particularly government hopes for a positive report, influenced what was written in the final report. Reflection on these underlying mechanisms, such as the development of meta-objectives that seek to benefit government and technical stakeholders rather than publics, suggests that the practice of public engagement can, in fact, shut down opportunities for meaningful public dialogue….(More)”.

Sharing Private Data for Public Good


Stefaan G. Verhulst at Project Syndicate: “After Hurricane Katrina struck New Orleans in 2005, the direct-mail marketing company Valassis shared its database with emergency agencies and volunteers to help improve aid delivery. In Santiago, Chile, analysts from Universidad del Desarrollo, ISI Foundation, UNICEF, and the GovLab collaborated with Telefónica, the city’s largest mobile operator, to study gender-based mobility patterns in order to design a more equitable transportation policy. And as part of the Yale University Open Data Access project, health-care companies Johnson & Johnson, Medtronic, and SI-BONE give researchers access to previously walled-off data from 333 clinical trials, opening the door to possible new innovations in medicine.

These are just three examples of “data collaboratives,” an emerging form of partnership in which participants exchange data for the public good. Such tie-ups typically involve public bodies using data from corporations and other private-sector entities to benefit society. But data collaboratives can help companies, too – pharmaceutical firms share data on biomarkers to accelerate their own drug-research efforts, for example. Data-sharing initiatives also have huge potential to improve artificial intelligence (AI). But they must be designed responsibly and take data-privacy concerns into account.

Understanding the societal and business case for data collaboratives, as well as the forms they can take, is critical to gaining a deeper appreciation the potential and limitations of such ventures. The GovLab has identified over 150 data collaboratives spanning continents and sectors; they include companies such as Air FranceZillow, and Facebook. Our research suggests that such partnerships can create value in three main ways….(More)”.

Companies Collect a Lot of Data, But How Much Do They Actually Use?


Article by Priceonomics Data Studio: “For all the talk of how data is the new oil and the most valuable resource of any enterprise, there is a deep dark secret companies are reluctant to share — most of the data collected by businesses simply goes unused.

This unknown and unused data, known as dark data comprises more than half the data collected by companies. Given that some estimates indicate that 7.5 septillion (7,700,000,000,000,000,000,000) gigabytes of data are generated every single day, not using  most of it is a considerable issue.

In this article, we’ll look at this dark data. Just how much of it is created by companies, what are the reasons this data isn’t being analyzed, and what are the costs and implications of companies not using the majority of the data they collect.  

Before diving into the analysis, it’s worth spending a moment clarifying what we mean by the term “dark data.” Gartner defines dark data as:

“The information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing). 

To learn more about this phenomenon, Splunk commissioned a global survey of 1,300+ business leaders to better understand how much data they collect, and how much is dark. Respondents were from IT and business roles, and were located in Australia, China, France, Germany, Japan, the United States, and the United Kingdom. across various industries. For the report, Splunk defines dark data as: “all the unknown and untapped data across an organization, generated by systems, devices and interactions.”

While the costs of storing data has decreased overtime, the cost of saving septillions of gigabytes of wasted data is still significant. What’s more, during this time the strategic importance of data has increased as companies have found more and more uses for it. Given the cost of storage and the value of data, why does so much of it go unused?

The following chart shows the reasons why dark data isn’t currently being harnessed:

By a large margin, the number one reason given for not using dark data is that companies lack a tool to capture or analyze the data. Companies accumulate data from server logs, GPS networks, security tools, call records, web traffic and more. Companies track everything from digital transactions to the temperature of their server rooms to the contents of retail shelves. Most of this data lies in separate systems, is unstructured, and cannot be connected or analyzed.

Second, the data captured just isn’t good enough. You might have important customer information about a transaction, but it’s missing location or other important metadata because that information sits somewhere else or was never captured in useable format.

Additionally, dark data exists because there is simply too much data out there and a lot of is unstructured. The larger the dataset (or the less structured it is), the more sophisticated the tool required for analysis. Additionally, these kinds of datasets often time require analysis by individuals with significant data science expertise who are often is short supply

The implications of the prevalence are vast. As a result of the data deluge, companies often don’t know where all the sensitive data is stored and can’t be confident they are complying with consumer data protection measures like GDPR. …(More)”.

How does Finland use health and social data for the public benefit?


Karolina Mackiewicz at ICT & Health: “…Better innovation opportunities, quicker access to comprehensive ready-combined data, smoother permit procedures needed for research – those are some of the benefits for society, academia or business announced by the Ministry of Social Affairs and Health of Finland when the Act on the Secondary Use of Health and Social Data was introduced.

It came into force on 1st of May 2019. According to the Finnish Innovation Fund SITRA, which was involved in the development of the legislation and carried out the pilot projects, it’s a ‘groundbreaking’ piece of legislation. It’ not only effectively introduces a one-stop-shop for data but it’s also one of the first, if not the first, implementations of the GDPR (the EU’s General Data Protection Regulation) for the secondary use of data in Europe. 

The aim of the Act is “to facilitate the effective and safe processing and access to the personal social and health data for steering, supervision, research, statistics and development in the health and social sector”. A second objective is to guarantee an individual’s legitimate expectations as well as their rights and freedoms when processing personal data. In other words, the Ministry of Health promises that the Act will help eliminate the administrative burden in access to the data by the researchers and innovative businesses while respecting the privacy of individuals and providing conditions for the ethically sustainable way of using data….(More)”.

Bringing machine learning to the masses


Matthew Hutson at Science: “Artificial intelligence (AI) used to be the specialized domain of data scientists and computer programmers. But companies such as Wolfram Research, which makes Mathematica, are trying to democratize the field, so scientists without AI skills can harness the technology for recognizing patterns in big data. In some cases, they don’t need to code at all. Insights are just a drag-and-drop away. One of the latest systems is software called Ludwig, first made open-source by Uber in February and updated last week. Uber used Ludwig for projects such as predicting food delivery times before releasing it publicly. At least a dozen startups are using it, plus big companies such as Apple, IBM, and Nvidia. And scientists: Tobias Boothe, a biologist at the Max Planck Institute of Molecular Cell Biology and Genetics in Dresden, Germany, uses it to visually distinguish thousands of species of flatworms, a difficult task even for experts. To train Ludwig, he just uploads images and labels….(More)”.

Trust in Contemporary Society


Book edited by Masamichi Sasaki: “… deals with conceptual, theoretical and social interaction analyses, historical data on societies, national surveys or cross-national comparative studies, and methodological issues related to trust. The authors are from a variety of disciplines: psychology, sociology, political science, organizational studies, history, and philosophy, and from Britain, the United States, the Czech Republic, the Netherlands, Australia, Germany, and Japan. They bring their vast knowledge from different historical and cultural backgrounds to illuminate contemporary issues of trust and distrust. The socio-cultural perspective of trust is important and increasingly acknowledged as central to trust research. Accordingly, future directions for comparative trust research are also discussed….(More)”.

How we can place a value on health care data


Report by E&Y: “Unlocking the power of health care data to fuel innovation in medical research and improve patient care is at the heart of today’s health care revolution. When curated or consolidated into a single longitudinal dataset, patient-level records will trace a complete story of a patient’s demographics, health, wellness, diagnosis, treatments, medical procedures and outcomes. Health care providers need to recognize patient data for what it is: a valuable intangible asset desired by multiple stakeholders, a treasure trove of information.

Among the universe of providers holding significant data assets, the United Kingdom’s National Health Service (NHS) is the single largest integrated health care provider in the world. Its patient records cover the entire UK population from birth to death.

We estimate that the 55 million patient records held by the NHS today may have an indicative market value of several billion pounds to a commercial organization. We estimate also that the value of the curated NHS dataset could be as much as £5bn per annum and deliver around £4.6bn of benefit to patients per annum, in potential operational savings for the NHS, enhanced patient outcomes and generation of wider economic benefits to the UK….(More)”.

Applying design science in public policy and administration research


Paper by Sjoerd Romme and Albert Meijer: “There is increasing debate about the role that public policy research can play in identifying solutions to complex policy challenges. Most studies focus on describing and explaining how governance systems operate. However, some scholars argue that because current institutions are often not up to the task, researchers need to rethink this ‘bystander’ approach and engage in experimentation and interventions that can help to change and improve governance systems.

This paper contributes to this discourse by developing a design science framework that integrates retrospective research (scientific validation) and prospective research (creative design). It illustrates the merits and challenges of doing this through two case studies in the Netherlands and concludes that a design science framework provides a way of integrating traditional validation-oriented research with intervention-oriented design approaches. We argue that working at the interface between them will create new opportunities for these complementary modes of public policy research to achieve impact….(More)”