On the privacy-conscientious use of mobile phone data


Yves-Alexandre de Montjoye et al in Nature: “The breadcrumbs we leave behind when using our mobile phones—who somebody calls, for how long, and from where—contain unprecedented insights about us and our societies. Researchers have compared the recent availability of large-scale behavioral datasets, such as the ones generated by mobile phones, to the invention of the microscope, giving rise to the new field of computational social science.

With mobile phone penetration rates reaching 90% and under-resourced national statistical agencies, the data generated by our phones—traditional Call Detail Records (CDR) but also high-frequency x-Detail Record (xDR)—have the potential to become a primary data source to tackle crucial humanitarian questions in low- and middle-income countries. For instance, they have already been used to monitor population displacement after disasters, to provide real-time traffic information, and to improve our understanding of the dynamics of infectious diseases. These data are also used by governmental and industry practitioners in high-income countries.

While there is little doubt on the potential of mobile phone data for good, these data contain intimate details of our lives: rich information about our whereabouts, social life, preferences, and potentially even finances. A BCG study showed, e.g., that 60% of Americans consider location data and phone number history—both available in mobile phone data—as “private”.

Historically and legally, the balance between the societal value of statistical data (in aggregate) and the protection of privacy of individuals has been achieved through data anonymization. While hundreds of different anonymization algorithms exist, most of them are variations and improvements of the seminal k-anonymity algorithm introduced in 1998. Recent studies have, however, shown that pseudonymization and standard de-identification are not sufficient to prevent users from being re-identified in mobile phone data. Four data points—approximate places and times where an individual was present—have been shown to be enough to uniquely re-identify them 95% of the time in a mobile phone dataset of 1.5 million people. Furthermore, re-identification estimations using unicity—a metric to evaluate the risk of re-identification in large-scale datasets—and attempts at k-anonymizing mobile phone data ruled out de-identification as sufficient to truly anonymize the data. This was echoed in the recent report of the [US] President’s Council of Advisors on Science and Technology on Big Data Privacy, which considers de-identification to be useful as an “added safeguard, but [emphasized that] it is not robust against near-term future re-identification methods”.
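To make the unicity idea concrete, here is a minimal Python sketch of how such an estimate could be computed on a toy dataset. It is an illustration under stated assumptions, not the procedure used in the cited studies: the function name `estimate_unicity`, the representation of a trace as a set of `(place, hour)` points, and the sampling scheme are all hypothetical.

```python
import random


def estimate_unicity(traces, p=4, n_samples=1000, seed=0):
    """Estimate the share of users singled out by p randomly chosen points.

    traces: dict mapping a user id to a set of (place, hour) points.
    All names and the data layout here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Only users with at least p points can contribute a sample.
    eligible = [u for u, pts in traces.items() if len(pts) >= p]
    if not eligible:
        return 0.0
    unique = 0
    for _ in range(n_samples):
        user = rng.choice(eligible)
        # Pretend an attacker knows p points of this user's trace.
        known = set(rng.sample(sorted(traces[user]), p))
        # Count every trace in the dataset that contains all known points.
        matches = sum(1 for pts in traces.values() if known <= pts)
        unique += matches == 1
    return unique / n_samples
```

A returned value close to 1 would mean that almost every sampled user is the only one compatible with the known points, which is the situation the excerpt describes for four points in a dataset of 1.5 million people.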

The limits of the historical de-identification framework to adequately balance risks and benefits in the use of mobile phone data are a major hindrance to their use by researchers, development practitioners, humanitarian workers, and companies. This became particularly clear at the height of the Ebola crisis, when qualified researchers (including some of us) were prevented from accessing relevant mobile phone data on time despite efforts by mobile phone operators, the GSMA, and UN agencies, with privacy being cited as one of the main concerns.

These privacy concerns are, in our opinion, due to the failures of the traditional de-identification model and the lack of a modern and agreed upon framework for the privacy-conscientious use of mobile phone data by third-parties especially in the context of the EU General Data Protection Regulation (GDPR). Such frameworks have been developed for the anonymous use of other sensitive data such as census, household survey, and tax data. The positive societal impact of making these data accessible and the technical means available to protect people’s identity have been considered and a trade-off, albeit far from perfect, has been agreed on and implemented. This has allowed the data to be used in aggregate for the benefit of society. Such thinking and an agreed upon set of models has been missing so far for mobile phone data. This has left data protection authorities, mobile phone operators, and data users with little guidance on technically sound yet reasonable models for the privacy-conscientious use of mobile phone data. This has often resulted in suboptimal tradeoffs if any.

In this paper, we propose four models for the privacy-conscientious use of mobile phone data (Fig. 1). All of these models 1) focus on a use of mobile phone data in which only statistical, aggregate information is ultimately needed by a third-party and, while this needs to be confirmed on a per-country basis, 2) are designed to fall under the legal umbrella of “anonymous use of the data”. Examples of cases in which only statistical aggregated information is ultimately needed by the third-party are discussed below. They would include, e.g., disaster management, mobility analysis, or the training of AI algorithms in which only aggregate information on people’s mobility is ultimately needed by agencies, and exclude cases in which individual-level identifiable information is needed such as targeted advertising or loans based on behavioral data.

Figure 1: Matrix of the four models for the privacy-conscientious use of mobile phone data.

First, it is important to insist that none of these models is a silver bullet…(More)”.

Towards matching user mobility traces in large-scale datasets


Paper by Daniel Kondor, Behrooz Hashemian, Yves-Alexandre de Montjoye and Carlo Ratti: “The problem of unicity and reidentifiability of records in large-scale databases has been studied in different contexts and approaches, with focus on preserving privacy or matching records from different data sources. With an increasing number of service providers nowadays routinely collecting location traces of their users on unprecedented scales, there is a pronounced interest in the possibility of matching records and datasets based on spatial trajectories. Extending previous work on reidentifiability of spatial data and trajectory matching, we present the first large-scale analysis of user matchability in real mobility datasets on realistic scales, i.e. among two datasets that consist of several million people’s mobility traces, coming from a mobile network operator and transportation smart card usage. We extract the relevant statistical properties which influence the matching process and analyze their impact on the matchability of users. We show that for individuals with typical activity in the transportation system (those making 3-4 trips per day on average), a matching algorithm based on the co-occurrence of their activities is expected to achieve a 16.8% success only after a one-week long observation of their mobility traces, and over 55% after four weeks. We show that the main determinant of matchability is the expected number of co-occurring records in the two datasets. Finally, we discuss different scenarios in terms of data collection frequency and give estimates of matchability over time. We show that with higher frequency data collection becoming more common, we can expect much higher success rates in even shorter intervals….(More)”.
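The co-occurrence idea mentioned in the abstract can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the authors' algorithm: it assumes that both datasets have already been reduced to `(user, place, time_bin)` records on a shared grid of places and time bins, and the function name `match_by_cooccurrence` is hypothetical.

```python
from collections import Counter, defaultdict


def match_by_cooccurrence(records_a, records_b):
    """For each user in dataset A, pick the user in B with most co-occurrences.

    records_*: iterables of (user, place, time_bin) tuples. The shared
    place/time grid is an assumption of this sketch.
    """
    # Index dataset B by (place, time_bin) for constant-time lookups.
    index_b = defaultdict(set)
    for user, place, t in records_b:
        index_b[(place, t)].add(user)

    # Count how often each A user shares a (place, time_bin) with each B user.
    counts = defaultdict(Counter)
    for user, place, t in set(records_a):
        for candidate in index_b.get((place, t), ()):
            counts[user][candidate] += 1

    matches = {}
    for user, counter in counts.items():
        (candidate, n), *rest = counter.most_common(2)
        # Leave the user unmatched when the top two candidates tie.
        if rest and rest[0][1] == n:
            continue
        matches[user] = (candidate, n)
    return matches
```

In this toy form, the longer the observation period, the more co-occurring records accumulate per user, which is consistent with the abstract's point that the expected number of co-occurring records is the main determinant of matchability.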

New methods help identify what drives sensitive or socially unacceptable behaviors


Mary Guiden at Physorg: “Conservation scientists and statisticians at Colorado State University have teamed up to solve a key problem for the study of sensitive behaviors like poaching, harassment, bribery, and drug use.

Sensitive behaviors—defined as socially unacceptable or not compliant with rules and regulations—are notoriously hard to study, researchers say, because people often do not want to answer direct questions about them.

To overcome this challenge, scientists have developed indirect questioning approaches that protect responders’ identities. However, these methods also make it difficult to predict which sectors of a population are more likely to participate in sensitive behaviors, and which factors, such as knowledge of laws, education, or income, influence the probability that an individual will engage in a sensitive behavior.

Assistant Professor Jennifer Solomon and Associate Professor Michael Gavin of the Department of Human Dimensions of Natural Resources at CSU, and Abu Conteh from MacEwan University in Alberta, Canada, have teamed up with Professor Jay Breidt and doctoral student Meng Cao in the CSU Department of Statistics to develop a new method to solve the problem.

The study, “Understanding the drivers of sensitive behavior using Poisson regression from quantitative randomized response technique data,” was published recently in PLOS One.

Conteh, who, as a doctoral student, worked with Gavin in New Zealand, used a specific technique, known as quantitative randomized response, to elicit confidential answers to questions on behaviors related to non-compliance with natural resource regulations from a protected area in Sierra Leone.

In this technique, the researcher conducting interviews has a large container of ping-pong balls, some with numbers and some without. The interviewer asks the respondent to pick a ball at random, without revealing it to the interviewer. If the ball has a number, the respondent tells the interviewer the number. If the ball does not have a number, the respondent reveals how many times he illegally hunted animals in a given time period….
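The ball-drawing procedure lends itself to a short simulation. The Python sketch below is an assumption-laden illustration, not the study's Poisson regression method: it only shows why the population mean can be recovered even though no single answer is revealing. Because a reported number is either a ball value (with probability `p_numbered`) or the true count, the observed mean equals `p_numbered * mean(ball_numbers) + (1 - p_numbered) * true_mean`, which can be solved for `true_mean`. The function names and parameters are hypothetical.

```python
import random
from statistics import mean


def simulate_responses(true_counts, ball_numbers, p_numbered, seed=0):
    """Simulate what an interviewer hears under quantitative randomized response."""
    rng = random.Random(seed)
    responses = []
    for count in true_counts:
        if rng.random() < p_numbered:
            # Numbered ball: the respondent reads out the ball's number.
            responses.append(rng.choice(ball_numbers))
        else:
            # Blank ball: the respondent reports the true count.
            responses.append(count)
    return responses


def estimate_true_mean(responses, ball_numbers, p_numbered):
    """Invert the mixture: observed = p * mean(balls) + (1 - p) * true mean."""
    return (mean(responses) - p_numbered * mean(ball_numbers)) / (1 - p_numbered)
```

The interviewer never learns which kind of ball was drawn, so an individual answer cannot be tied to actual behavior, while the aggregate estimate remains usable.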

Armed with the new computer program, the scientists found that people from rural communities with less access to jobs in urban centers were more likely to hunt in the reserve. People in communities with a greater proportion of people displaced by Sierra Leone’s 10-year civil war were also more likely to hunt illegally….

The researchers said that collaborating across disciplines was and is key to addressing complex problems like this one. It is commonplace for people to be noncompliant with rules and regulations and equally important for social scientists to analyze these behaviors….(More)”

Better Data for Doing Good: Responsible Use of Big Data and Artificial Intelligence


Report by the World Bank: “Describes opportunities for harnessing the value of big data and artificial intelligence (AI) for social good and how new families of AI algorithms now make it possible to obtain actionable insights automatically and at scale. Beyond internet business or commercial applications, multiple examples already exist of how big data and AI can help achieve shared development objectives, such as the 2030 Agenda for Sustainable Development and the Sustainable Development Goals (SDGs). But ethical frameworks in line with increased uptake of these new technologies remain necessary—not only concerning data privacy but also relating to the impact and consequences of using data and algorithms. Public recognition has grown concerning AI’s potential to create both opportunities for societal benefit and risks to human rights. Development calls for seizing the opportunity to shape future use as a force for good, while at the same time ensuring the technologies address inequalities and avoid widening the digital divide….(More)”.

What difference does data make? Data management and social change


Paper by Morgan E. Currie and Joan M. Donovan: “The purpose of this paper is to expand on emergent data activism literature to draw distinctions between different types of data management practices undertaken by groups of data activists.

The authors offer three case studies that illuminate the data management strategies of these groups. Each group discussed in the case studies is devoted to representing a contentious political issue through data, but their data management practices differ in meaningful ways. The project Making Sense produces their own data on pollution in Kosovo. Fatal Encounters collects “missing data” on police homicides in the USA. The Environmental Data Governance Initiative hopes to keep vulnerable US data on climate change and environmental injustices in the public domain.

In analysing our three case studies, the authors surface how temporal dimensions, geographic scale and sociotechnical politics influence their differing data management strategies….(More)”.

Recalculating GDP for the Facebook age


Gillian Tett at the Financial Times: “How big is the impact of Facebook on our lives? That question has caused plenty of hand-wringing this year, as revelations have tumbled out about the political influence of Big Tech companies.

Economists are attempting to look at this question too — but in a different way. They have been quietly trying to calculate the impact of Facebook on gross domestic product data, ie to measure what our social-media addiction is doing to economic output….

Kevin Fox, an Australian economist, thinks there is. Working with four other economists, including Erik Brynjolfsson, a professor at MIT, he recently surveyed consumers to see what they would “pay” for Facebook in monetary terms, concluding conservatively that this was about $42 a month. Extrapolating this to the wider economy, he then calculated that the “value” of the social-media platform is equivalent to 0.11 per cent of US GDP. That might not sound transformational. But this week Fox presented the group’s findings at an IMF conference on the digital economy in Washington DC and argued that if Facebook activity had been counted as output in the GDP data, it would have raised the annual average US growth rate from 1.83 per cent to 1.91 per cent between 2003 and 2017. The number would rise further if you included other platforms – researchers believe that “maps” and WhatsApp are particularly important – or other services. Take photographs.

Back in 2000, as the group points out, about 80 billion photos were taken each year at a cost of 50 cents a picture in camera and processing fees. This was recorded in GDP. Today, 1.6 trillion photos are taken each year, mostly on smartphones, for “free”, and excluded from that GDP data. What would happen if that was measured too, along with other types of digital services?

The bad news is that there is no consensus among economists on this point, and the debate is still at a very early stage. … A separate paper from Charles Hulten and Leonard Nakamura, economists at the University of Maryland and Philadelphia Fed respectively, explained another idea: a measurement known as “EGDP” or “Expanded GDP”, which incorporates “welfare” contributions from digital services. “The changes wrought by the digital revolution require changes to official statistics,” they said.

Yet another paper from Nakamura, co-written with Diane Coyle of Cambridge University, argued that we should also reconfigure the data to measure how we “spend” our time, rather than “just” how we spend our money. “To recapture welfare in the age of digitalisation, we need shadow prices, particularly of time,” they said. Meanwhile, US government number-crunchers have been trying to measure the value of “free” open-source software, such as R, Python, Julia and JavaScript, concluding that if captured in statistics these would be worth about $3bn a year. Another team of government statisticians has been trying to value the data held by companies – this estimates, using one method, that Amazon’s data is currently worth $125bn, with a 35 per cent annual growth rate, while Google’s is worth $48bn, growing at 22 per cent each year. It is unlikely that these numbers – and methodologies – will become mainstream any time soon….(More)”.

Big Data Ethics and Politics: Toward New Understandings


Introductory paper by Wenhong Chen and Anabel Quan-Haase for a Special Issue of the Social Science Computer Review: “The hype around big data does not seem to abate nor do the scandals. Privacy breaches in the collection, use, and sharing of big data have affected all the major tech players, be it Facebook, Google, Apple, or Uber, and go beyond the corporate world including governments, municipalities, and educational and health institutions. What has come to light is that enabled by the rapid growth of social media and mobile apps, various stakeholders collect and use large amounts of data, disregarding the ethics and politics.

As big data touch on many realms of daily life and have profound impacts in the social world, the scrutiny around big data practice becomes increasingly relevant. This special issue investigates the ethics and politics of big data using a wide range of theoretical and methodological approaches. Together, the articles provide new understandings of the many dimensions of big data ethics and politics, showing it is important to understand and increase awareness of the biases and limitations inherent in big data analysis and practices….(More)”

Data-Driven Development


Report by the World Bank: “…Decisions based on data can greatly improve people’s lives. Data can uncover patterns, unexpected relationships and market trends, making it possible to address previously intractable problems and leverage hidden opportunities. For example, tracking genes associated with certain types of cancer to improve treatment, or using commuter travel patterns to devise public transportation that is affordable and accessible for users, as well as profitable for operators.

Data is clearly a precious commodity, and the report points out that people should have greater control over the use of their personal data. Broadly speaking, there are three possible answers to the question “Who controls our data?”: firms, governments, or users. No global consensus yet exists on the extent to which private firms that mine data about individuals should be free to use the data for profit and to improve services.

Users’ willingness to share data in return for benefits and free services – such as virtually unrestricted use of social media platforms – varies widely by country. In addition, early internet adopters, who grew up with the internet and are now age 30–40, are the most willing to share (GfK 2017).

Are you willing to share your data? (source: GfK 2017)

On the other hand, data can worsen the digital divide – the data poor, who leave no digital trail because they have limited access, are most at risk from exclusion from services, opportunities and rights, as are those who lack a digital ID, for instance.

Firms and Data

For private sector firms, particularly those in developing countries, the report suggests how they might expand their markets and improve their competitive edge. Companies are already developing new markets and making profits by analyzing data to better understand their customers. This is transforming conventional business models. For years, telecommunications has been funded by users paying for phone calls. Today, advertisers paying for users’ data and attention are funding the internet, social media, and other platforms, such as apps, reversing the value flow.

Governments and Data

For governments and development professionals, the report provides guidance on how they might use data more creatively to help tackle key global challenges, such as eliminating extreme poverty, promoting shared prosperity, or mitigating the effects of climate change. The first step is developing appropriate guidelines for data sharing and use, and for anonymizing personal data. Governments are already beginning to use the huge quantities of data they hold to enhance service delivery, though they still have far to go to catch up with the commercial giants, the report finds.

Data for Development

The Information and Communications for Development report analyses how the data revolution is changing the behavior of governments, individuals, and firms and how these changes affect economic, social, and cultural development. This is a topic of growing importance that cannot be ignored, and the report aims to stimulate wider debate on the unique challenges and opportunities of data for development. It will be useful for policy makers, but also for anyone concerned about how their personal data is used and how the data revolution might affect their future job prospects….(More)”.

Artificial Intelligence: Risks to Privacy and Democracy


Karl Manheim and Lyric Kaplan at Yale Journal of Law and Technology: “A “Democracy Index” is published annually by the Economist. For 2017, it reported that half of the world’s countries scored lower than the previous year. This included the United States, which was demoted from “full democracy” to “flawed democracy.” The principal factor was “erosion of confidence in government and public institutions.” Interference by Russia and voter manipulation by Cambridge Analytica in the 2016 presidential election played a large part in that public disaffection.

Threats of these kinds will continue, fueled by growing deployment of artificial intelligence (AI) tools to manipulate the preconditions and levers of democracy. Equally destructive is AI’s threat to decisional and informational privacy. AI is the engine behind Big Data Analytics and the Internet of Things. While conferring some consumer benefit, their principal function at present is to capture personal information, create detailed behavioral profiles and sell us goods and agendas. Privacy, anonymity and autonomy are the main casualties of AI’s ability to manipulate choices in economic and political decisions.

The way forward requires greater attention to these risks at the national level, and attendant regulation. In its absence, technology giants, all of whom are heavily investing in and profiting from AI, will dominate not only the public discourse, but also the future of our core values and democratic institutions….(More)”.

Better “nowcasting” can reveal what weather is about to hit within 500 meters


MIT Technology Review: “Weather forecasting is impressively accurate given how changeable and chaotic Earth’s climate can be. It’s not unusual to get 10-day forecasts with a reasonable level of accuracy.

But there is still much to be done. One challenge for meteorologists is to improve their “nowcasting,” the ability to forecast weather in the next six hours or so at a spatial resolution of a square kilometer or less.

In areas where the weather can change rapidly, that is difficult. And there is much at stake. Agricultural activity is increasingly dependent on nowcasting, and the safety of many sporting events depends on it too. Then there is the risk that sudden rainfall could lead to flash flooding, a growing problem in many areas because of climate change and urbanization. That has implications for infrastructure, such as sewage management, and for safety, since this kind of flooding can kill.

So meteorologists would dearly love to have a better way to make their nowcasts.

Enter Blandine Bianchi from EPFL in Lausanne, Switzerland, and a few colleagues, who have developed a method for combining meteorological data from several sources to produce nowcasts with improved accuracy. Their work has the potential to change the utility of this kind of forecasting for everyone from farmers and gardeners to emergency services and sewage engineers.

Current forecasting is limited by the data and the scale on which it is gathered and processed. For example, satellite data has a spatial resolution of 50 to 100 km and allows the tracking and forecasting of large cloud cells over a time scale of six to nine hours. By contrast, radar data is updated every five minutes, with a spatial resolution of about a kilometer, and leads to predictions on the time scale of one to three hours. Another source of data is the microwave links used by telecommunications companies, which are degraded by rainfall….(More)”