The future of statistics and data science


Paper by Sofia C. Olhede and Patrick J. Wolfe in Statistics & Probability Letters: “The Danish physicist Niels Bohr is said to have remarked: ‘Prediction is very difficult, especially about the future.’ Predicting the future of statistics in the era of big data is not so very different from prediction about anything else. Ever since we started to collect data to predict cycles of the moon, seasons, and hence future agricultural yields, humankind has worked to infer information from indirect observations for the purpose of making predictions.

Even while acknowledging the momentous difficulty in making predictions about the future, a few topics stand out clearly as lying at the current and future intersection of statistics and data science. Not all of these topics are of a strictly technical nature, but all have technical repercussions for our field. How might these repercussions shape the still relatively young field of statistics? And what can sound statistical theory and methods bring to our understanding of the foundations of data science? In this article we discuss these issues and explore how new open questions motivated by data science may in turn necessitate new statistical theory and methods now and in the future.

Together, the ubiquity of sensing devices, the low cost of data storage, and the commoditization of computing have led to a volume and variety of modern data sets that would have been unthinkable even a decade ago. We see four important implications for statistics.

First, many modern data sets are related in some way to human behavior. Data might have been collected by interacting with human beings, or personal or private information traceable back to a given set of individuals might have been handled at some stage. Mathematical or theoretical statistics traditionally does not concern itself with the finer points of human behavior, and indeed many of us have only had limited training in the rules and regulations that pertain to data derived from human subjects. Yet inevitably in a data-rich world, our technical developments cannot be divorced from the types of data sets we can collect and analyze, and how we can handle and store them.

Second, the importance of data to our economies and civil societies means that the future of regulation will look not only to protect our privacy, and how we store information about ourselves, but also to include what we are allowed to do with that data. For example, as we collect high-dimensional vectors about many family units across time and space in a given region or country, privacy will be limited by that high-dimensional space, but our wish to control what we do with data will go beyond that….

Third, the growing complexity of algorithms is matched by an increasing variety and complexity of data. Data sets now come in a variety of forms that can be highly unstructured, including images, text, sound, and various other new forms. These different types of observations have to be understood together, resulting in multimodal data, in which a single phenomenon or event is observed through different types of measurement devices. Rather than having one phenomenon corresponding to single scalar values, a much more complex object is typically recorded. This could be a three-dimensional shape, for example in medical imaging, or multiple types of recordings such as functional magnetic resonance imaging and simultaneous electroencephalography in neuroscience. Data science therefore challenges us to describe these more complex structures, modeling them in terms of their intrinsic patterns.

Finally, the types of data sets we now face are far from satisfying the classical statistical assumptions of independent and identically distributed observations. Observations are often “found” or repurposed from other sampling mechanisms, rather than necessarily resulting from designed experiments….

Our field will either meet these challenges and become increasingly ubiquitous, or risk rapidly becoming irrelevant to the future of data science and artificial intelligence….(More)”.

What if technology could help improve conversations online?


Introduction to “Perspective”: “Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions….Perspective is an API that makes it easier to host better conversations. The API uses machine learning models to score the perceived impact a comment might have on a conversation. Developers and publishers can use this score to give real-time feedback to commenters, help moderators do their job, or allow readers to more easily find relevant information, as illustrated in two experiments below. We’ll be releasing more machine learning models later in the year, but our first model identifies whether a comment could be perceived as “toxic” to a discussion….(More)”.
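The scoring workflow described above amounts to a simple REST call. The sketch below follows the publicly documented Comment Analyzer request shape (the endpoint, the `comment.text` and `requestedAttributes` fields, and the `TOXICITY` attribute name), but treat those details as assumptions to check against the current API reference rather than a definitive client:

```python
import json
import urllib.request

# Public Comment Analyzer endpoint; a developer API key is required in practice.
API_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def build_analyze_request(comment_text, attributes=("TOXICITY",)):
    """Assemble the JSON body for a comment-scoring request."""
    return {
        "comment": {"text": comment_text},
        "languages": ["en"],
        "requestedAttributes": {attr: {} for attr in attributes},
    }

def summary_score(response, attribute="TOXICITY"):
    """Extract the overall 0..1 score for one attribute from a response."""
    return response["attributeScores"][attribute]["summaryScore"]["value"]

def score_comment(comment_text, api_key):
    """POST the request and return the perceived-toxicity score."""
    body = json.dumps(build_analyze_request(comment_text)).encode("utf-8")
    req = urllib.request.Request(
        API_URL + "?key=" + api_key,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return summary_score(json.load(resp))
```

A moderation queue could then flag comments whose score exceeds a site-specific threshold, leaving the final call to a human moderator rather than to the model.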

Who Killed Albert Einstein? From Open Data to Murder Mystery Games


Gabriella A. B. Barros et al. at arXiv: “This paper presents a framework for generating adventure games from open data. Focusing on the murder mystery type of adventure games, the generator is able to transform open data from Wikipedia articles, OpenStreetMap and images from Wikimedia Commons into WikiMysteries. Every WikiMystery game revolves around the murder of a person with a Wikipedia article, and populates the game with suspects who must be arrested by the player if guilty of the murder or absolved if innocent. Starting from only one person as the victim, an extensive generative pipeline finds suspects, their alibis, and paths connecting them from open data, and transforms that data into cities, buildings, non-player characters, locks, keys, and dialog options. The paper describes in detail each generative step, provides a specific playthrough of one WikiMystery where Albert Einstein is murdered, and evaluates the outcomes of games generated for the 100 most influential people of the 20th century….(More)”.
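The first step of such a pipeline, turning one victim into a cast of candidate suspects, can be sketched with a query against the public MediaWiki API, which lists the pages linked from the victim's article. The parameter names below follow the standard `action=query&prop=links` form; using outgoing links as a suspect pool is our simplification of the paper's pipeline, not its actual method:

```python
import urllib.parse

WIKI_API = "https://en.wikipedia.org/w/api.php"

def build_links_query(victim_title, limit=50):
    """Parameters for listing article-namespace pages linked from the victim's page."""
    return {
        "action": "query",
        "format": "json",
        "titles": victim_title,
        "prop": "links",
        "plnamespace": 0,  # namespace 0 = ordinary articles only
        "pllimit": limit,
    }

def links_query_url(victim_title, limit=50):
    """Full GET URL; fetching and parsing the JSON is left to the caller."""
    return WIKI_API + "?" + urllib.parse.urlencode(build_links_query(victim_title, limit))
```

Each linked article would then need further filtering (is it a person? were they alive at the right time?) before becoming a suspect, which is where the bulk of the generative work described in the paper happens.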

Data journalism and the ethics of publishing Twitter data


Matthew L. Williams at Data Driven Journalism: “Collecting and publishing data from social media sites such as Twitter are everyday practices for the data journalist. Recent findings from Cardiff University’s Social Data Science Lab question the practice of publishing Twitter content without seeking some form of informed consent from users beforehand. Researchers found that tweets collected around certain topics, such as those related to terrorism, political votes, changes in the law and health problems, create datasets that might contain sensitive content, such as extreme political opinion, grossly offensive comments, overly personal revelations and threats to life (both to oneself and to others). Handling these data in the process of analysis (such as classifying content as hateful and potentially illegal) and reporting has brought the ethics of using social media in social research and journalism into sharp focus.

Ethics is an issue that is becoming increasingly salient in research and journalism using social media data. The digital revolution has outpaced parallel developments in research governance and agreed good practice. Codes of ethical conduct that were written in the mid twentieth century are being relied upon to guide the collection, analysis and representation of digital data in the twenty-first century. Social media is particularly ethically challenging because of the open availability of the data (particularly from Twitter). Many platforms’ terms of service specifically state that users’ data that are public will be made available to third parties, and by accepting these terms users legally consent to this. However, researchers and data journalists must interpret and engage with these commercially motivated terms of service through a more reflexive lens, which implies a context-sensitive approach rather than a focus on the legally permissible uses of these data.

Social media researchers and data journalists have experimented with data from a range of sources, including Facebook, YouTube, Flickr, Tumblr and Twitter to name a few. Twitter is by far the most studied of all these networks. This is because Twitter differs from other networks, such as Facebook, that are organised around groups of ‘friends’, in that it is more ‘open’ and the data (in part) are freely available to researchers. This makes Twitter a more public digital space that promotes the free exchange of opinions and ideas. Twitter has become the primary space for online citizens to publicly express their reaction to events of national significance, and also the primary source of data for social science research into digital publics.

The Twitter streaming API provides three levels of data access: the free random 1% that provides ~5M tweets daily, and the random 10% and 100% (chargeable, or free to academic researchers upon request). Datasets on social interactions of this scale, speed and ease of access have been hitherto unrealisable in the social sciences and journalism, and have led to a flood of journal articles and news pieces, many of which include tweets with full text content and author identity without informed consent. This is presumably because of Twitter’s ‘open’ nature, which leads to the assumption that ‘these are public data’ and that using them does not require the rigour and scrutiny of ethical oversight. Even when these data are scrutinised, journalists can fall back on the ‘public data’ argument, given the lack of a framework to evaluate the potential harms to users. The Social Data Science Lab takes a more ethically reflexive approach to the use of social media data in social research, and carefully considers users’ perceptions, online context and the role of algorithms in estimating potentially sensitive user characteristics.

A recent Lab survey into users’ perceptions of the use of their social media posts found the following:

  • 94% were aware that social media companies had Terms of Service
  • 65% had read the Terms of Service in whole or in part
  • 76% knew that when accepting Terms of Service they were giving permission for some of their information to be accessed by third parties
  • 80% agreed that if their social media information is used in a publication they would expect to be asked for consent
  • 90% agreed that if their tweets were used without their consent they should be anonymized…(More)”.
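One practical consequence of that last finding: before tweets are published or shared, stable pseudonyms can replace account identifiers. The sketch below is our own illustration rather than the Lab's procedure; it salts and hashes the user ID so that the same author keeps one pseudonym across a dataset without being directly identifiable:

```python
import hashlib

def anonymize_tweet(tweet, salt):
    """Replace identifying fields with a salted pseudonym before publication.

    `salt` should be a project-specific secret, never published alongside
    the data, so the pseudonyms cannot be recomputed from public user IDs.
    """
    digest = hashlib.sha256((salt + str(tweet["user_id"])).encode("utf-8"))
    return {
        "author": digest.hexdigest()[:12],  # stable per-author pseudonym
        "created_at": tweet["created_at"],
        "text": tweet["text"],
    }
```

Hashing handles is only a first step: verbatim tweet text remains searchable on the platform itself, so quoting full text can still re-identify the author, which is precisely why the survey respondents' expectation of consent matters.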

Spanning Today’s Chasms: Seven Steps to Building Trusted Data Intermediaries


James Shulman at the Mellon Foundation: “In 2001, when hundreds of individual colleges and universities were scrambling to scan their slide libraries, The Andrew W. Mellon Foundation created a new organization, Artstor, to assemble a massive library of digital images from disparate sources to support teaching and research in the arts and humanities.

Rather than encouraging—or paying for—each school to scan its own slide of the Mona Lisa, the Mellon Foundation created an intermediary organization that would balance the interests of those who created, photographed and cared for art works, such as artists and museums, and those who wanted to use such images for the admirable calling of teaching and studying history and culture.  This organization would reach across the gap that separated these two communities and would respect and balance the interests of both sides, while helping each accomplish their missions.  At the same time that Napster was using technology to facilitate the un-balanced transfer of digital content from creators to users, the Mellon Foundation set up a new institution aimed at respecting the interests of one side of the market and supporting the socially desirable work of the other.

As the internet has enabled the sharing of data across the world, new intermediaries have emerged as entire platforms. A networked world needs such bridges—think Etsy or eBay sitting between sellers and buyers, or Facebook sitting between advertisers and users. While intermediaries that match sellers and buyers of goods provide a marketplace bridging the two sides, aggregators of data work in admittedly more shadowy territory.

In the many realms that market forces won’t support, however, a great deal of public good can be done by aggregating and managing access to datasets that might otherwise continue to live in isolation. Whether due to institutional sociology that favors local solutions, the technical challenges associated with merging heterogeneous databases built with different data models, intellectual property limitations, or privacy concerns, datasets are built and maintained by independent groups that—if networked—could be used to further each other’s work.

Think of those studying coral reefs, or those studying labor practices in developing markets, or child welfare offices seeking to call upon court records in different states, or medical researchers working in different sub-disciplines but on essentially the same disease.  What intermediary invests in joining these datasets?  Many people assume that computers can simply “talk” to each other and share data intuitively, but without targeted investment in connecting them, they can’t.  Unlike modern databases that are now often designed with the cloud in mind, decades of locally created databases churn away in isolation, at great opportunity cost to us all.

Art history research is an unusually vivid example. Most people can understand that if you want to study Caravaggio, you don’t want to hunt and peck across hundreds of museums, books, photo archives, libraries, churches, and private collections.  You want all that content in one place—exactly what Mellon sought to achieve by creating Artstor.

What did we learn in creating Artstor that might be distilled as lessons for others taking on an aggregation project to serve the public good?….(More)”.

Just and Unjust Leaks: When to Spill Secrets


At Foreign Affairs: “All governments, all political parties, and all politicians keep secrets and tell lies. Some lie more than others, and those differences are important, but the practice is general. And some lies and secrets may be justified, whereas others may not. Citizens, therefore, need to know the difference between just and unjust secrets and between just and unjust deception before they can decide when it may be justifiable for someone to reveal the secrets or expose the lies—when leaking confidential information, releasing classified documents, or blowing the whistle on misconduct may be in the public interest or, better, in the interest of democratic government.

Revealing official secrets and lies involves a form of moral risk-taking: whistleblowers may act out of a sense of duty or conscience, but the morality of their actions can be judged only by their fellow citizens, and only after the fact. This is often a difficult judgment to make—and has probably become more difficult in the Trump era.

LIES AND DAMNED LIES

A quick word about language: “leaker” and “whistleblower” are overlapping terms, but they aren’t synonyms. A leaker, in this context, anonymously reveals information that might embarrass officials or open up the government’s internal workings to unwanted public scrutiny. In Washington, good reporters cultivate sources inside every presidential administration and every Congress and hope for leaks. A whistleblower reveals what she believes to be immoral or illegal official conduct to her bureaucratic superiors or to the public. Certain sorts of whistleblowing, relating chiefly to mismanagement and corruption, are protected by law; leakers are not protected, nor are whistleblowers who reveal state secrets…(More)”.

Facebook’s next project: American inequality


Nancy Scola at Politico: “Facebook CEO Mark Zuckerberg is quietly cracking open his company’s vast trove of user data for a study on economic inequality in the U.S. — the latest sign of his efforts to reckon with divisions in American society that the social network is accused of making worse.

The study, which hasn’t previously been reported, is mining the social connections among Facebook’s American users to shed light on the growing income disparity in the U.S., where the top 1 percent of households is said to control 40 percent of the country’s wealth. Facebook is an incomparably rich source of information for that kind of research: By one estimate, about three of five American adults use the social network….

Facebook confirmed the broad contours of its partnership with economist Raj Chetty but declined to elaborate on the substance of the study. Chetty, in a brief interview following a January speech in Washington, said he and his collaborators — who include researchers from Stanford and New York University — have been working on the inequality study for at least six months.

“We’re using social networks, and measuring interactions there, to understand the role of social capital much better than we’ve been able to,” he said.

Researchers say they see Facebook’s enormous cache of data as a remarkable resource, offering an unprecedentedly detailed and sweeping look at American society. That store of information contains both details that a user might tell Facebook — their age, hometown, schooling, family relationships — and insights that the company has picked up along the way, such as the interest groups they’ve joined and the geographic distribution of whom they call a “friend.”

It’s all the more significant, researchers say, when you consider that Facebook’s user base — about 239 million monthly users in the U.S. and Canada at last count — cuts across just about every demographic group.

And all that information, say researchers, lets them make inferences about users’ wealth. Facebook itself recently patented a way of estimating someone’s socioeconomic status using factors ranging from their stated hobbies to how many internet-connected devices they own.
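As a purely hypothetical illustration of how such inference works, a handful of weighted profile features can already sort users into rough tiers. The feature names and weights below are invented for this example and bear no relation to Facebook's patented model:

```python
# Invented signals for illustration only -- not the patent's actual features.
EXPENSIVE_HOBBY_SIGNALS = {"sailing", "golf", "skiing"}

def toy_ses_score(profile):
    """Weighted sum of device count and 'expensive hobby' matches."""
    hobbies = set(profile.get("hobbies", []))
    return (0.5 * profile.get("connected_devices", 0)
            + 1.0 * len(hobbies & EXPENSIVE_HOBBY_SIGNALS))
```

A real system would learn such weights from labelled data rather than hand-pick them; the point is only that innocuous-looking profile fields can combine into a sensitive estimate.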

A Facebook spokesman addressed the potential privacy implications of the study’s access to user data, saying, “We conduct research at Facebook responsibly, which includes making sure we protect people’s information.” The spokesman added that Facebook follows an “enhanced” review process for research projects, adopted in 2014 after a controversy over a study that manipulated some people’s news feeds to see if it made them happier or sadder.

According to a Stanford University source familiar with Chetty’s study, the Facebook account data used in the research has been stripped of any details that could be used to identify users. The source added that academics involved in the study have gone through security screenings that include background checks, and can access the Facebook data only in secure facilities….(More)”.

When citizens set the budget: lessons from ancient Greece


In The Conversation: “Today, elected representatives take the tough decisions about public finances behind closed doors. In doing so, democratic politicians rely on the advice of financial bureaucrats who often cater to the political needs of the elected government. Politicians rarely ask voters what they think of budget options. They are no better at explaining the reasons for a budget. Explanations are usually no more than vacuous phrases, such as “jobs and growth” or “on the move”. They never explain the difficult trade-offs that go into a budget, nor their overall financial reasoning.

This reluctance to explain public finances was all too evident during the global financial crisis.

In Australia, Britain and France, centre-left governments borrowed huge sums in order to maintain private demand and, in one case, to support private banks. In each country these policies did much to minimise the crisis’s human costs.

Yet, in the elections that followed, the centre-left politicians who had introduced these policies refused to justify them properly. They feared that voters would not tolerate robust discussion about public finances. Without a justification for their generally good policies, each of these governments was defeated by centre-right opponents.

In most democracies there is the same underlying problem: elected representatives do not believe that voters can tolerate the financial truth. They assume that democracy is not good at managing public finances, and that it can balance the budget only by leaving voters in the dark.

For decades, we, independently, have studied democracy today and in the ancient past. We have learned that this assumption is dead wrong. There are more and more examples of how involving ordinary voters results in better budgets.

In 1989, councils in poor Brazilian towns began to involve residents in setting budgets. This participatory budgeting soon spread throughout South America. It has now been successfully tried in Germany, Spain, Italy, Portugal, Sweden, the United States, Poland and Australia, and some pilot projects were set up in France too. Participatory budgeting is based on the clear principle that those who will be most affected by a tough budget should be involved in setting it.

In spite of such successful democratic experiments, elected representatives still shy away from involving ordinary voters in setting budgets. This is very different from what happened in ancient Athens 2,500 years ago….

In Athenian democracy ordinary citizens actually set the budget. This ancient Greek state had a solid budget, in spite of, or, we would say, because of the involvement of the citizens in taking tough budget decisions….(More)”.

Can Crowdsourcing and Collaboration Improve the Future of Human Health?


Ben Wiegand at Scientific American: “The process of medical research has been likened to searching for a needle in a haystack. With the continued acceleration of novel science and health care technologies in areas like artificial intelligence, digital therapeutics and the human microbiome we have tremendous opportunity to search the haystack in new and exciting ways. Applying these high-tech advances to today’s most pressing health issues increases our ability to address the root cause of disease, intervene earlier and change the trajectory of human health.

Global crowdsourcing forums, like the Johnson & Johnson Innovation QuickFire Challenges, can be incredibly valuable tools for searching the “haystack.” An initiative of JLABS—the no-strings-attached incubators of Johnson & Johnson Innovation—these contests spur scientific diversity through crowdsourcing, inspiring and attracting fresh thinking. They seek to stimulate the global innovation ecosystem through funding, mentorship and access to resources that can kick-start breakthrough ideas.

Our most recent challenge, the Next-Gen Baby Box QuickFire Challenge, focused on updating the 80-year-old “Finnish baby box,” a free, government-issued maternity supply kit for new parents containing such essentials as baby clothing, bath and sleep supplies packaged in a sleep-safe cardboard box. Since it first launched, the baby box has, together with increased use of maternal healthcare services early in pregnancy, helped to significantly reduce the Finnish infant mortality rate from 65 in every 1,000 live births in the 1930s to 2.5 per 1,000 today—one of the lowest rates in the world.

Partnering with Finnish innovation and government groups, we set out to see if updating this popular early parenting tool with the power of personalized health technology might one day impact Finland’s unparalleled high rate of type 1 diabetes. We issued the call globally to help create “the Baby Box of the future” as part of the Janssen and Johnson & Johnson Innovation vision to create a world without disease by accelerating science and delivering novel solutions to prevent, intercept and cure disease. The contest brought together entrepreneurs, researchers and innovators to focus on ideas with the potential to promote child health, detect childhood disease earlier and facilitate healthy parenting.

Incentive challenges like this one reward participants who most effectively meet a predefined objective or task. It’s a concept that emerged well before our time—as far back as the 18th century—from Napoleon’s Food Preservation Prize, meant to find a way to keep troops fed during battle, to the Longitude Prize for improved marine navigation.

Research shows that prize-based challenges that attract talent across a wide range of disciplines can generate greater risk-taking and yield more dramatic solutions….(More)”.

The Social Media Threat to Society and Security


George Soros at Project Syndicate: “It takes significant effort to assert and defend what John Stuart Mill called the freedom of mind. And there is a real chance that, once lost, those who grow up in the digital age – in which the power to command and shape people’s attention is increasingly concentrated in the hands of a few companies – will have difficulty regaining it.

The current moment in world history is a painful one. Open societies are in crisis, and various forms of dictatorships and mafia states, exemplified by Vladimir Putin’s Russia, are on the rise. In the United States, President Donald Trump would like to establish his own mafia-style state but cannot, because the Constitution, other institutions, and a vibrant civil society won’t allow it….

The rise and monopolistic behavior of the giant American Internet platform companies is contributing mightily to the US government’s impotence. These companies have often played an innovative and liberating role. But as Facebook and Google have grown ever more powerful, they have become obstacles to innovation, and have caused a variety of problems of which we are only now beginning to become aware…

Social media companies’ true customers are their advertisers. But a new business model is gradually emerging, based not only on advertising but also on selling products and services directly to users. They exploit the data they control, bundle the services they offer, and use discriminatory pricing to keep more of the benefits that they would otherwise have to share with consumers. This enhances their profitability even further, but the bundling of services and discriminatory pricing undermine the efficiency of the market economy.

Social media companies deceive their users by manipulating their attention, directing it toward their own commercial purposes, and deliberately engineering addiction to the services they provide. This can be very harmful, particularly for adolescents.

There is a similarity between Internet platforms and gambling companies. Casinos have developed techniques to hook customers to the point that they gamble away all of their money, even money they don’t have.

Something similar – and potentially irreversible – is happening to human attention in our digital age. This is not a matter of mere distraction or addiction; social media companies are actually inducing people to surrender their autonomy. And this power to shape people’s attention is increasingly concentrated in the hands of a few companies.

This would have far-reaching political consequences. People without the freedom of mind can be easily manipulated. This danger does not loom only in the future; it already played an important role in the 2016 US presidential election.

There is an even more alarming prospect on the horizon: an alliance between authoritarian states and large, data-rich IT monopolies, bringing together nascent systems of corporate surveillance with already-developed systems of state-sponsored surveillance. This may well result in a web of totalitarian control the likes of which not even George Orwell could have imagined….(More)”.