The future of statistics and data science


Paper by Sofia C. Olhede and Patrick J. Wolfe in Statistics & Probability Letters: “The Danish physicist Niels Bohr is said to have remarked: “Prediction is very difficult, especially about the future”. Predicting the future of statistics in the era of big data is not so very different from prediction about anything else. Ever since we started to collect data to predict cycles of the moon, seasons, and hence future agriculture yields, humankind has worked to infer information from indirect observations for the purpose of making predictions.

Even while acknowledging the momentous difficulty in making predictions about the future, a few topics stand out clearly as lying at the current and future intersection of statistics and data science. Not all of these topics are of a strictly technical nature, but all have technical repercussions for our field. How might these repercussions shape the still relatively young field of statistics? And what can sound statistical theory and methods bring to our understanding of the foundations of data science? In this article we discuss these issues and explore how new open questions motivated by data science may in turn necessitate new statistical theory and methods now and in the future.

Together, the ubiquity of sensing devices, the low cost of data storage, and the commoditization of computing have led to a volume and variety of modern data sets that would have been unthinkable even a decade ago. We see four important implications for statistics.

First, many modern data sets are related in some way to human behavior. Data might have been collected by interacting with human beings, or personal or private information traceable back to a given set of individuals might have been handled at some stage. Mathematical or theoretical statistics traditionally does not concern itself with the finer points of human behavior, and indeed many of us have only had limited training in the rules and regulations that pertain to data derived from human subjects. Yet inevitably in a data-rich world, our technical developments cannot be divorced from the types of data sets we can collect and analyze, and how we can handle and store them.

Second, the importance of data to our economies and civil societies means that the future of regulation will look not only to protect our privacy, and how we store information about ourselves, but also to include what we are allowed to do with that data. For example, as we collect high-dimensional vectors about many family units across time and space in a given region or country, privacy will be limited by that high-dimensional space, but our wish to control what we do with data will go beyond that….

Third, the growing complexity of algorithms is matched by an increasing variety and complexity of data. Data sets now come in a variety of forms that can be highly unstructured, including images, text, sound, and various other new forms. These different types of observations have to be understood together, resulting in multimodal data, in which a single phenomenon or event is observed through different types of measurement devices. Rather than having one phenomenon corresponding to single scalar values, a much more complex object is typically recorded. This could be a three-dimensional shape, for example in medical imaging, or multiple types of recordings such as functional magnetic resonance imaging and simultaneous electroencephalography in neuroscience. Data science therefore challenges us to describe these more complex structures, modeling them in terms of their intrinsic patterns.

Finally, the types of data sets we now face are far from satisfying the classical statistical assumptions of identically distributed and independent observations. Observations are often “found” or repurposed from other sampling mechanisms, rather than necessarily resulting from designed experiments….

 Our field will either meet these challenges and become increasingly ubiquitous, or risk rapidly becoming irrelevant to the future of data science and artificial intelligence….(More)”.

What if technology could help improve conversations online?


Introduction to “Perspective”: “Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions….Perspective is an API that makes it easier to host better conversations. The API uses machine learning models to score the perceived impact a comment might have on a conversation. Developers and publishers can use this score to give realtime feedback to commenters or help moderators do their job, or allow readers to more easily find relevant information, as illustrated in two experiments below. We’ll be releasing more machine learning models later in the year, but our first model identifies whether a comment could be perceived as “toxic” to a discussion….(More)”.

Can Crowdsourcing and Collaboration Improve the Future of Human Health?


Ben Wiegand at Scientific American: “The process of medical research has been likened to searching for a needle in a haystack. With the continued acceleration of novel science and health care technologies in areas like artificial intelligence, digital therapeutics and the human microbiome we have tremendous opportunity to search the haystack in new and exciting ways. Applying these high-tech advances to today’s most pressing health issues increases our ability to address the root cause of disease, intervene earlier and change the trajectory of human health.

Global crowdsourcing forums, like the Johnson & Johnson Innovation QuickFire Challenges, can be incredibly valuable tools for searching the “haystack.” An initiative of JLABS—the no-strings-attached incubators of Johnson & Johnson Innovation—these contests spur scientific diversity through crowdsourcing, inspiring and attracting fresh thinking. They seek to stimulate the global innovation ecosystem through funding, mentorship and access to resources that can kick-start breakthrough ideas.

Our most recent challenge, the Next-Gen Baby Box QuickFire Challenge, focused on updating the 80-year-old “Finnish baby box,” a free, government-issued maternity supply kit for new parents containing such essentials as baby clothing, bath and sleep supplies packaged in a sleep-safe cardboard box. Since it first launched, the baby box has, together with increased use of maternal healthcare services early in pregnancy, helped to significantly reduce the Finnish infant mortality rate from 65 in every 1,000 live births in the 1930s to 2.5 per 1,000 today—one of the lowest rates in the world.

Partnering with Finnish innovation and government groups, we set out to see if updating this popular early parenting tool with the power of personalized health technology might one day impact Finland’s unparalleled high rate of type 1 diabetes. We issued the call globally to help create “the Baby Box of the future” as part of the Janssen and Johnson & Johnson Innovation vision to create a world without disease by accelerating science and delivering novel solutions to prevent, intercept and cure disease. The contest brought together entrepreneurs, researchers and innovators to focus on ideas with the potential to promote child health, detect childhood disease earlier and facilitate healthy parenting.

Incentive challenges like this award participants who have most effectively met a predefined objective or task. It’s a concept that emerged well before our time—as far back as the 18th century—from Napoleon’s Food Preservation Prize, meant to find a way to keep troops fed during battle, to the Longitude Prize for improved marine navigation.

Research shows that prize-based challenges that attract talent across a wide range of disciplines can generate greater risk-taking and yield more dramatic solutions….(More)”.

An AI That Reads Privacy Policies So That You Don’t Have To


Andy Greenberg at Wired: “…Today, researchers at Switzerland’s Federal Institute of Technology at Lausanne (EPFL), the University of Wisconsin and the University of Michigan announced the release of Polisis—short for “privacy policy analysis”—a new website and browser extension that uses their machine-learning-trained app to automatically read and make sense of any online service’s privacy policy, so you don’t have to.

In about 30 seconds, Polisis can read a privacy policy it’s never seen before and extract a readable summary, displayed in a graphic flow chart, of what kind of data a service collects, where that data could be sent, and whether a user can opt out of that collection or sharing. Polisis’ creators have also built a chat interface they call Pribot that’s designed to answer questions about any privacy policy, intended as a sort of privacy-focused paralegal advisor. Together, the researchers hope those tools can unlock the secrets of how tech firms use your data that have long been hidden in plain sight….

Polisis isn’t actually the first attempt to use machine learning to pull human-readable information out of privacy policies. Both Carnegie Mellon University and Columbia have made their own attempts at similar projects in recent years, points out NYU Law Professor Florencia Marotta-Wurgler, who has focused her own research on user interactions with terms of service contracts online. (One of her own studies showed that only .07 percent of users actually click on a terms of service link before clicking “agree.”) The Usable Privacy Policy Project, a collaboration that includes both Columbia and CMU, released its own automated tool to annotate privacy policies just last month. But Marotta-Wurgler notes that Polisis’ visual and chat-bot interfaces haven’t been tried before, and says the latest project is also more detailed in how it defines different kinds of data. “The granularity is really nice,” Marotta-Wurgler says. “It’s a way of communicating this information that’s more interactive.”…(More)”.

Artificial Intelligence and Foreign Policy


Paper by Ben ScottStefan Heumann and Philppe Lorenz: “The plot-lines of the development of Artificial Intelligence (AI) are debated and contested. But it is safe to predict that it will become one of the central technologies of the 21st century. It is fashionable these days to speak about data as the new oil. But if we want to “refine” the vast quantities of data we are collecting today and make sense of it, we will need potent AI. The consequences of the AI revolution could not be more far reaching. Value chains will be turned upside down, labor markets will get disrupted and economic power will shift to those who control this new technology. And as AI is deeply embedded in the connectivity of the Internet, the challenge of AI is global in nature. Therefore it is striking that AI is almost absent from the foreign policy agenda.

This paper seeks to provide a foundation for planning a foreign policy strategy that responds effectively to the emerging power of AI in international affairs. The developments in AI are so dynamic and the implications so wide-ranging that ministries need to begin engaging immediately. That means starting with the assets and resources at hand while planning for more significant changes in the future. Many of the tools of traditional diplomacy can be adapted to this new field. While the existing toolkit can get us started, this pragmatic approach does not preclude thinking about more drastic changes that the technological changes might require for our foreign policy institutions and instruments.

The paper approaches this challenge, drawing on the existing foreign policy toolbox and reflecting on the past lessons of adapting this toolbox to the Internet revolution. The paper goes on to make suggestions on how the tools could be applied to the international challenges that the AI revolution will bring about. The toolbox includes policy making, public diplomacy, bilateral and multilateral engagement, actions through international and treaty organizations, convenings and partnerships, grant-making and information-gathering and analysis. The analysis of the international challenges of the AI transformation are divided into three topical areas. Each of the three sections includes concrete suggestions how instruments from the tool box could be applied to address the challenges AI will bring about in international affairs….(More)“.

Artificial intelligence and privacy


Report by the The Norwegian Data Protection Authority (DPA): “…If people cannot trust that information about them is being handled properly, it may limit their willingness to share information – for example with their doctor, or on social media. If we find ourselves in a situation in which sections of the population refuse to share information because they feel that their personal integrity is being violated, we will be faced with major challenges to our freedom of speech and to people’s trust in the authorities.

A refusal to share personal information will also represent a considerable challenge with regard to the commercial use of such data in sectors such as the media, retail trade and finance services.

About the report

This report elaborates on the legal opinions and the technologies described in the 2014 report «Big Data – privacy principles under pressure». In this report we will provide greater technical detail in describing artificial intelligence (AI), while also taking a closer look at four relevant AI challenges associated with the data protection principles embodied in the GDPR:

  • Fairness and discrimination
  • Purpose limitation
  • Data minimisation
  • Transparency and the right to information

This represents a selection of data protection concerns that in our opinion are most relevance for the use of AI today.

The target group for this report consists of people who work with, or who for other reasons are interested in, artificial intelligence. We hope that engineers, social scientists, lawyers and other specialists will find this report useful….(More) (Download Report)”.

Earth Observation Open Science and Innovation


Open Access book edited by Pierre-Philippe Mathieu and Christoph Aubrecht: “Over  the  past  decades,  rapid developments in digital and sensing technologies, such  as the Cloud, Web and Internet of Things, have dramatically changed the way we live and work. The digital transformation is revolutionizing our ability to monitor our planet and transforming the  way we access, process and exploit Earth Observation data from satellites.

This book reviews these megatrends and their implications for the Earth Observation community as well as the wider data economy. It provides insight into new paradigms of Open Science and Innovation applied to space data, which are characterized by openness, access to large volume of complex data, wide availability of new community tools, new techniques for big data analytics such as Artificial Intelligence, unprecedented level of computing power, and new types of collaboration among researchers, innovators, entrepreneurs and citizen scientists. In addition, this book aims to provide readers with some reflections on the future of Earth Observation, highlighting through a series of use cases not just the new opportunities created by the New Space revolution, but also the new challenges that must be addressed in order to make the most of the large volume of complex and diverse data delivered by the new generation of satellites….(More)”.

How AI Could Help the Public Sector


Emma Martinho-Truswell in the Harvard Business Review: “A public school teacher grading papers faster is a small example of the wide-ranging benefits that artificial intelligence could bring to the public sector. A.I could be used to make government agencies more efficient, to improve the job satisfaction of public servants, and to increase the quality of services offered. Talent and motivation are wasted doing routine tasks when they could be doing more creative ones.

Applications of artificial intelligence to the public sector are broad and growing, with early experiments taking place around the world. In addition to education, public servants are using AI to help them make welfare payments and immigration decisions, detect fraud, plan new infrastructure projects, answer citizen queries, adjudicate bail hearings, triage health care cases, and establish drone paths.  The decisions we are making now will shape the impact of artificial intelligence on these and other government functions. Which tasks will be handed over to machines? And how should governments spend the labor time saved by artificial intelligence?

So far, the most promising applications of artificial intelligence use machine learning, in which a computer program learns and improves its own answers to a question by creating and iterating algorithms from a collection of data. This data is often in enormous quantities and from many sources, and a machine learning algorithm can find new connections among data that humans might not have expected. IBM’s Watson, for example, is a treatment recommendation-bot, sometimes finding treatments that human doctors might not have considered or known about.

Machine learning program may be better, cheaper, faster, or more accurate than humans at tasks that involve lots of data, complicated calculations, or repetitive tasks with clear rules. Those in public service, and in many other big organizations, may recognize part of their job in that description. The very fact that government workers are often following a set of rules — a policy or set of procedures — already presents many opportunities for automation.

To be useful, a machine learning program does not need to be better than a human in every case. In my work, we expect that much of the “low hanging fruit” of government use of machine learning will be as a first line of analysis or decision-making. Human judgment will then be critical to interpret results, manage harder cases, or hear appeals.

When the work of public servants can be done in less time, a government might reduce its staff numbers, and return money saved to taxpayers — and I am sure that some governments will pursue that option. But it’s not necessarily the one I would recommend. Governments could instead choose to invest in the quality of its services. They can re-employ workers’ time towards more rewarding work that requires lateral thinking, empathy, and creativity — all things at which humans continue to outperform even the most sophisticated AI program….(More)”.

After Big Data: The Coming Age of “Big Indicators”


Andrew Zolli at the Stanford Social Innovation Review: “Consider, for a moment, some of the most pernicious challenges facing humanity today: the increasing prevalence of natural disasters; the systemic overfishing of the world’s oceans; the clear-cutting of primeval forests; the maddening persistence of poverty; and above all, the accelerating effects of global climate change.

Each item in this dark litany inflicts suffering on the world in its own, awful way. Yet as a group, they share some common characteristics. Each problem is messy, with lots of moving parts. Each is riddled with perverse incentives, which can lead local actors to behave in a way that is not in the common interest. Each is opaque, with dynamics that are only partially understood, even by experts; each can, as a result, often be made worse by seemingly rational and well-intentioned interventions. When things do go wrong, each has consequences that diverge dramatically from our day-to-day experiences, making their full effects hard to imagine, predict, and rehearse. And each is global in scale, raising questions about who has the legal obligation to act—and creating incentives for leaders to disavow responsibility (and sometimes even question the legitimacy of the problem itself).

With dynamics like these, it’s little wonder systems theorists label these kinds of problems “wicked” or even “super wicked.” It’s even less surprising that these challenges remain, by and large, externalities to the global system—inadequately measured, perennially underinvested in, and poorly accounted for—until their consequences spill disastrously and expensively into view.

For real progress to occur, we’ve got to move these externalities into the global system, so that we can fully assess their costs, and so that we can sufficiently incentivize and reward stakeholders for addressing them and penalize them if they don’t. And that’s going to require a revolution in measurement, reporting, and financial instrumentation—the mechanisms by which we connect global problems with the resources required to address them at scale.

Thankfully, just such a revolution is under way.

It’s a complex story with several moving parts, but it begins with important new technical developments in three critical areas of technology: remote sensing and big data, artificial intelligence, and cloud computing.

Remote sensing and big data allow us to collect unprecedented streams of observations about our planet and our impacts upon it, and dramatic advances in AI enable us to extract the deeper meaning and patterns contained in those vast data streams. The rise of the cloud empowers anyone with an Internet connection to access and interact with these insights, at a fraction of the traditional cost.

In the years to come, these technologies will shift much of the current conversation focused on big data to one focused on “big indicators”—highly detailed, continuously produced, global indicators that track change in the health of the Earth’s most important systems, in real time. Big indicators will form an important mechanism for guiding human action, allow us to track the impact of our collective actions and interventions as never before, enable better and more timely decisions, transform reporting, and empower new kinds of policy and financing instruments. In short, they will reshape how we tackle a number of global problems, and everyone—especially nonprofits, NGOs, and actors within the social and environmental sectors—will play a role in shaping and using them….(More)”.

Improving refugee integration through data-driven algorithmic assignment


Kirk Bansak, et al in Science Magazine: “Developed democracies are settling an increased number of refugees, many of whom face challenges integrating into host societies. We developed a flexible data-driven algorithm that assigns refugees across resettlement locations to improve integration outcomes. The algorithm uses a combination of supervised machine learning and optimal matching to discover and leverage synergies between refugee characteristics and resettlement sites.

The algorithm was tested on historical registry data from two countries with different assignment regimes and refugee populations, the United States and Switzerland. Our approach led to gains of roughly 40 to 70%, on average, in refugees’ employment outcomes relative to current assignment practices. This approach can provide governments with a practical and cost-efficient policy tool that can be immediately implemented within existing institutional structures….(More)”.