Data protection in a big data society. Ideas for a future regulation


Paper by Alessandro Mantelero and Giuseppe Vaciago at Digital Investigation: “Big data society has changed the traditional forms of data analysis and created a new predictive approach to knowledge and investigation. In this light, it is necessary to consider the impact of this new paradigm on the traditional notion of data protection and its regulation.

Focussing on the individual and communal dimension of data use, encompassing digital investigations, the authors outline the challenges that big data poses for individual information self-determination, reasonable suspicion and collective interests. The article therefore suggests some innovative proposals that may update the existing data protection legal framework and contribute to making it responsive to the present algorithmic society….(More)”

The promise and perils of predictive policing based on big data


H. V. Jagadish in the Conversation: “Police departments, like everyone else, would like to be more effective while spending less. Given the tremendous attention to big data in recent years, and the value it has provided in fields ranging from astronomy to medicine, it should be no surprise that police departments are using data analysis to inform deployment of scarce resources. Enter the era of what is called “predictive policing.”

Some form of predictive policing is likely now in force in a city near you. Memphis was an early adopter. Cities from Minneapolis to Miami have embraced predictive policing. Time magazine named predictive policing (with particular reference to the city of Santa Cruz) one of the 50 best inventions of 2011. New York City Police Commissioner William Bratton recently said that predictive policing is “the wave of the future.”

The term “predictive policing” suggests that the police can anticipate a crime and be there to stop it before it happens and/or apprehend the culprits right away. As the Los Angeles Times points out, it depends on “sophisticated computer analysis of information about previous crimes, to predict where and when crimes will occur.”

At a very basic level, it’s easy for anyone to read a crime map and identify neighborhoods with higher crime rates. It’s also easy to recognize that burglars tend to target businesses at night, when they are unoccupied, and to target homes during the day, when residents are away at work. The challenge is to take a combination of dozens of such factors to determine where crimes are more likely to happen and who is more likely to commit them. Predictive policing algorithms are getting increasingly good at such analysis. Indeed, such was the premise of the movie Minority Report, in which the police can arrest and convict murderers before they commit their crime.
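
To make the idea of combining many such factors concrete, here is a toy risk-scoring sketch in Python; the factors, weights, and logistic form are illustrative assumptions, not the method any police department actually uses:

```python
# Toy illustration of combining several crime-related factors into a relative
# risk score for a city block. Factors, weights, and data are invented; real
# predictive-policing systems are trained on historical crime records.
from math import exp

WEIGHTS = {                      # assumed learned weights, one per factor
    "burglaries_last_month": 0.8,
    "is_commercial": 0.5,
    "is_night": 0.6,
    "distance_to_station_km": 0.2,
}
BIAS = -4.0                      # assumed intercept

def burglary_risk(block_features):
    """Return a probability-like score between 0 and 1 for one block."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in block_features.items())
    return 1 / (1 + exp(-z))     # logistic squashing of the weighted sum

print(burglary_risk({"burglaries_last_month": 3, "is_commercial": 1,
                     "is_night": 1, "distance_to_station_km": 2.5}))
```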

Predicting a crime with certainty is something that science fiction can have a field day with. But as a data scientist, I can assure you that in reality we can come nowhere close to certainty, even with advanced technology. To begin with, predictions can be only as good as the input data, and quite often these input data have errors.

But even with perfect, error-free input data and unbiased processing, ultimately what the algorithms are determining are correlations. Even if we have perfect knowledge of your troubled childhood, your socializing with gang members, your lack of steady employment, your wacko posts on social media and your recent gun purchases, all that the best algorithm can do is to say it is likely, but not certain, that you will commit a violent crime. After all, to believe such predictions as guaranteed is to deny free will….

What data can do is give us probabilities, rather than certainty. Good data coupled with good analysis can give us very good estimates of probability. If you sum probabilities over many instances, you can usually get a robust estimate of the total.

For example, data analysis can provide a probability that a particular house will be broken into on a particular day based on historical records for similar houses in that neighborhood on similar days. An insurance company may add this up over all days in a year to decide how much to charge for insuring that house….(More)”
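
A minimal sketch of that summing-of-probabilities step (all probabilities and cost figures below are invented for illustration, not taken from the article):

```python
# Sum daily break-in probabilities to get an expected annual count,
# then price a premium. All numbers are illustrative assumptions.
daily_burglary_prob = {
    "weekday": 0.0004,   # assumed probability for a typical weekday
    "weekend": 0.0007,   # assumed probability for a weekend day
}

# 261 weekdays + 104 weekend days = 365 days
expected_breakins = 261 * daily_burglary_prob["weekday"] + 104 * daily_burglary_prob["weekend"]

avg_claim_cost = 8_000   # assumed average loss per burglary, in dollars
overhead_margin = 1.3    # assumed loading for expenses and profit

annual_premium = expected_breakins * avg_claim_cost * overhead_margin
print(f"Expected break-ins per year: {expected_breakins:.3f}")
print(f"Suggested annual premium:    ${annual_premium:.2f}")
```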

Questioning Smart Urbanism: Is Data-Driven Governance a Panacea?


At the Chicago Policy Review: “In the era of data explosion, urban planners are increasingly relying on real-time, streaming data generated by “smart” devices to assist with city management. “Smart cities,” referring to cities that implement pervasive and ubiquitous computing in urban planning, are widely discussed in academia, business, and government. These cities are characterized not only by their use of technology but also by their innovation-driven economies and collaborative, data-driven city governance. Smart urbanism can seem like an effective strategy to create more efficient, sustainable, productive, and open cities. However, there are emerging concerns about the potential risks in the long-term development of smart cities, including the presumed political neutrality of big data, technocratic governance, technological lock-ins, data and network security, and privacy risks.

In a study entitled, “The Real-Time City? Big Data and Smart Urbanism,” Rob Kitchin provides a critical reflection on the potential negative effects of data-driven city governance on social development—a topic he claims deserves greater governmental, academic, and social attention.

In contrast to traditional datasets that rely on samples or are aggregated to a coarse scale, “big data” is huge in volume, high in velocity, and diverse in variety. Since the early 2000s, there has been explosive growth in data volume due to the rapid development and implementation of technology infrastructure, including networks, information management, and data storage. Big data can be generated from directed, automated, and volunteered sources. Automated data generation is of particular interest to urban planners. One example Kitchin cites is urban sensor networks, which allow city governments to monitor the movements and statuses of individuals, materials, and structures throughout the urban environment by analyzing real-time data.

With the huge amount of streaming data collected by smart infrastructure, many city governments use real-time analysis to manage different aspects of city operations. There has been a recent trend in centralizing data streams into a single hub, integrating all kinds of surveillance and analytics. These one-stop data centers make it easier for analysts to cross-reference data, spot patterns, identify problems, and allocate resources. The data are also often accessible by field workers via operations platforms. In London and some other cities, real-time data are visualized on “city dashboards” and communicated to citizens, providing convenient access to city information.

However, the real-time city is not a flawless solution to all the problems faced by city managers. The primary concern is the politics of big, urban data. Although raw data are often perceived as neutral and objective, no data are free of bias; the collection of data is a subjective process that can be shaped by various confounding factors. The presentation of data can also be manipulated to answer a specific question or enact a particular political vision….(More)”

Build digital democracy


Dirk Helbing & Evangelos Pournaras in Nature: “Fridges, coffee machines, toothbrushes, phones and smart devices are all now equipped with communicating sensors. In ten years, 150 billion ‘things’ will connect with each other and with billions of people. The ‘Internet of Things’ will generate data volumes that double every 12 hours rather than every 12 months, as is the case now.

Blinded by information, we need ‘digital sunglasses’. Whoever builds the filters to monetize this information determines what we see — Google and Facebook, for example. Many choices that people consider their own are already determined by algorithms. Such remote control weakens responsible, self-determined decision-making and thus society too.

The European Court of Justice’s ruling on 6 October that countries and companies must comply with European data-protection laws when transferring data outside the European Union demonstrates that a new digital paradigm is overdue. To ensure that no government, company or person with sole control of digital filters can manipulate our decisions, we need information systems that are transparent, trustworthy and user-controlled. Each of us must be able to choose, modify and build our own tools for winnowing information.

With this in mind, our research team at the Swiss Federal Institute of Technology in Zurich (ETH Zurich), alongside international partners, has started to create a distributed, privacy-preserving ‘digital nervous system’ called Nervousnet. Nervousnet uses the sensor networks that make up the Internet of Things, including those in smartphones, to measure the world around us and to build a collective ‘data commons’. The many challenges ahead will be best solved using an open, participatory platform, an approach that has proved successful for projects such as Wikipedia and the open-source operating system Linux.

A wise king?

The science of human decision-making is far from understood. Yet our habits, routines and social interactions are surprisingly predictable. Our behaviour is increasingly steered by personalized advertisements and search results, recommendation systems and emotion-tracking technologies. Thousands of pieces of metadata have been collected about every one of us (see go.nature.com/stoqsu). Companies and governments can increasingly manipulate our decisions, behaviour and feelings [1].

Many policymakers believe that personal data may be used to ‘nudge’ people to make healthier and more environmentally friendly decisions. Yet the same technology may also promote nationalism, fuel hate against minorities or skew election outcomes [2] if ethical scrutiny, transparency and democratic control are lacking — as they are in most private companies and institutions that use ‘big data’. The combination of nudging with big data about everyone’s behaviour, feelings and interests (‘big nudging’, if you will) could eventually create close to totalitarian power.

Countries have long experimented with using data to run their societies. In the 1970s, Chilean President Salvador Allende created computer networks to optimize industrial productivity [3]. Today, Singapore considers itself a data-driven ‘social laboratory’ [4] and other countries seem keen to copy this model.

The Chinese government has begun rating the behaviour of its citizens [5]. Loans, jobs and travel visas will depend on an individual’s ‘citizen score’, their web history and political opinion. Meanwhile, Baidu — the Chinese equivalent of Google — is joining forces with the military for the ‘China brain project’, using ‘deep learning’ artificial-intelligence algorithms to predict the behaviour of people on the basis of their Internet activity [6].

The intentions may be good: it is hoped that big data can improve governance by overcoming irrationality and partisan interests. But the situation also evokes the warning of the eighteenth-century philosopher Immanuel Kant, that the “sovereign acting … to make the people happy according to his notions … becomes a despot”. It is for this reason that the US Declaration of Independence emphasizes the pursuit of happiness of individuals.

Ruling like a ‘benevolent dictator’ or ‘wise king’ cannot work because there is no way to determine a single metric or goal that a leader should maximize. Should it be gross domestic product per capita or sustainability, power or peace, average life span or happiness, or something else?

Better is pluralism. It hedges risks, promotes innovation, collective intelligence and well-being. Approaching complex problems from varied perspectives also helps people to cope with rare and extreme events that are costly for society — such as natural disasters, blackouts or financial meltdowns.

Centralized, top-down control of data has various flaws. First, it will inevitably become corrupted or hacked by extremists or criminals. Second, owing to limitations in data-transmission rates and processing power, top-down solutions often fail to address local needs. Third, manipulating the search for information and intervening in individual choices undermines ‘collective intelligence’ [7]. Fourth, personalized information creates ‘filter bubbles’ [8]. People are exposed less to other opinions, which can increase polarization and conflict [9].

Fifth, reducing pluralism is as bad as losing biodiversity, because our economies and societies are like ecosystems with millions of interdependencies. Historically, a reduction in diversity has often led to political instability, collapse or war. Finally, altering the cultural cues that guide people’s decisions disrupts everyday decision-making, which undermines rather than bolsters social stability and order.

Big data should be used to solve the world’s problems, not for illegitimate manipulation. But the assumption that ‘more data equals more knowledge, power and success’ does not hold. Although we have never had so much information, we face ever more global threats, including climate change, unstable peace and socio-economic fragility, and political satisfaction is low worldwide. About 50% of today’s jobs will be lost in the next two decades as computers and robots take over tasks. But will we see the macroeconomic benefits that would justify such large-scale ‘creative destruction’? And how can we reinvent half of our economy?

The digital revolution will mainly benefit countries that achieve a ‘win–win–win’ situation for business, politics and citizens alike [10]. To mobilize the ideas, skills and resources of all, we must build information systems capable of bringing diverse knowledge and ideas together. Online deliberation platforms and reconfigurable networks of smart human minds and artificially intelligent systems can now be used to produce collective intelligence that can cope with the diverse and complex challenges surrounding us….(More)” See the Nervousnet project.

The Transformation of Human Rights Fact-Finding


Book edited by Philip Alston and Sarah Knuckey: “Fact-finding is at the heart of human rights advocacy, and is often at the center of international controversies about alleged government abuses. In recent years, human rights fact-finding has greatly proliferated and become more sophisticated and complex, while also being subjected to stronger scrutiny from governments. Nevertheless, despite the prominence of fact-finding, it remains strikingly under-studied and under-theorized. Too little has been done to bring forth the assumptions, methodologies, and techniques of this rapidly developing field, or to open human rights fact-finding to critical and constructive scrutiny.

The Transformation of Human Rights Fact-Finding offers a multidisciplinary approach to the study of fact-finding with rigorous and critical analysis of the field of practice, while providing a range of accounts of what actually happens. It deepens the study and practice of human rights investigations, and fosters fact-finding as a discretely studied topic, while mapping crucial transformations in the field. The contributions to this book are the result of a major international conference organized by New York University Law School’s Center for Human Rights and Global Justice. Engaging the expertise and experience of the editors and contributing authors, it offers a broad approach encompassing contemporary issues and analysis across the human rights spectrum in law, international relations, and critical theory. This book addresses the major areas of human rights fact-finding such as victim and witness issues; fact-finding for advocacy, enforcement, and litigation; the role of interdisciplinary expertise and methodologies; crowdsourcing, social media, and big data; and international guidelines for fact-finding….(More)”

Privacy in a Digital, Networked World: Technologies, Implications and Solutions


Book edited by Sherali Zeadally and Mohamad Badra: “This comprehensive textbook/reference presents a focused review of the state of the art in privacy research, encompassing a range of diverse topics. The first book of its kind designed specifically to cater to courses on privacy, this authoritative volume provides technical, legal, and ethical perspectives on privacy issues from a global selection of renowned experts. Features: examines privacy issues relating to databases, P2P networks, big data technologies, social networks, and digital information networks; describes the challenges of addressing privacy concerns in various areas; reviews topics of privacy in electronic health systems, smart grid technology, vehicular ad-hoc networks, mobile devices, location-based systems, and crowdsourcing platforms; investigates approaches for protecting privacy in cloud applications; discusses the regulation of personal information disclosure and the privacy of individuals; presents the tools and the evidence to better understand consumers’ privacy behaviors….(More)”

New flu tracker uses Google search data better than Google


At Ars Technica: “With big data comes big noise. Google learned this lesson the hard way with its now kaput Google Flu Trends. The online tracker, which used Internet search data to predict real-life flu outbreaks, emerged amid fanfare in 2008. Then it met a quiet death this August after repeatedly coughing up bad estimates.

But big Internet data isn’t out of the disease tracking scene yet.

With hubris firmly in check, a team of Harvard researchers has come up with a way to tame the unruly data, combine it with other data sets, and continually calibrate it to track flu outbreaks with less error. Their new model, published Monday in the Proceedings of the National Academy of Sciences, outperforms Google Flu Trends and other models with at least double the accuracy. If the model holds up in coming flu seasons, it could reinstate some optimism in using big data to monitor disease and herald a wave of more accurate second-generation models.

Big data has a lot of potential, Samuel Kou, a statistics professor at Harvard University and coauthor on the new study, told Ars. It’s just a question of using the right analytics, he said.

Kou and his colleagues built on Google’s flu tracking model for their new version, called ARGO (AutoRegression with GOogle search data). Google Flu Trends basically relied on trends in Internet search terms, such as headache and chills, to estimate the number of flu cases. Those search terms were correlated with flu outbreak data collected by the Centers for Disease Control and Prevention. The CDC’s data relies on clinical reports from around the country. But compiling and analyzing that data can be slow, leading to a lag time of one to three weeks. The Google data, on the other hand, offered near real-time tracking for health experts to manage and prepare for outbreaks.

At first Google’s tracker appeared to be pretty good, matching the CDC’s late-breaking data somewhat closely. But two notable stumbles led to its ultimate downfall: an underestimate of the 2009 H1N1 swine flu outbreak and an alarming overestimate (almost double real numbers) of the 2012-2013 flu season’s cases….For ARGO, he and colleagues took the trend data and then designed a model that could self-correct for changes in how people search. The model has a two-year sliding window in which it re-calibrates current search term trends with the CDC’s historical flu data (the gold standard for flu data). They also made sure to exclude winter search terms, such as March Madness and the Oscars, so they didn’t get accidentally correlated with seasonal flu trends. Last, they incorporated data on the historical seasonality of flu.
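
A rough sketch of how such a sliding-window recalibration might be coded is shown below. This is not the authors’ ARGO implementation; the window length, feature layout, and use of ordinary least squares are assumptions for illustration only:

```python
# Sketch of a sliding-window "autoregression + search terms" model in the
# spirit of ARGO. Feature names, window length, and data are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

def rolling_flu_forecast(cdc_history, search_terms, window_weeks=104):
    """Re-fit the model each week on the trailing two-year window.

    cdc_history:  1-D array of weekly CDC influenza-like-illness rates
    search_terms: 2-D array (weeks x terms) of search-query frequencies
    """
    predictions = []
    for t in range(window_weeks, len(cdc_history) - 1):
        # Training window: the most recent `window_weeks` observations.
        X_train = np.hstack([
            cdc_history[t - window_weeks:t].reshape(-1, 1),   # previous week's CDC value
            search_terms[t - window_weeks + 1:t + 1],         # same-week search frequencies
        ])
        y_train = cdc_history[t - window_weeks + 1:t + 1]

        model = LinearRegression().fit(X_train, y_train)

        # Predict next week from this week's CDC value and next week's searches
        # (search data are available in real time; CDC data arrive with a lag).
        X_next = np.hstack([[cdc_history[t]], search_terms[t + 1]]).reshape(1, -1)
        predictions.append(model.predict(X_next)[0])
    return np.array(predictions)
```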

The result was a model that significantly outperformed the Google Flu Trends estimates for the period from March 29, 2009 to July 11, 2015. ARGO also beat out other models, including one based on current and historical CDC data….(More)”

See also Proceedings of the National Academy of Sciences, 2015. DOI: 10.1073/pnas.1515373112

How smartphones are solving one of China’s biggest mysteries


Ana Swanson at the Washington Post: “For decades, China has been engaged in a building boom of a scale that is hard to wrap your mind around. In the last three decades, 260 million people have moved from the countryside to Chinese cities — equivalent to around 80 percent of the population of the U.S. To make room for all of those people, the size of China’s built-up urban areas nearly quintupled between 1984 and 2010.

Much of that development has benefited people’s lives, but some has not. In a breathless rush to boost growth and development, some urban areas have built vast, unused real estate projects — China’s infamous “ghost cities.” These eerie, shining developments are complete except for one thing: people to live in them.

China’s ghost cities have sparked a lot of debate over the last few years. Some argue that the developments are evidence of the waste in top-down planning, or the result of too much cheap funding for businesses. Some blame the lack of other good places for average people to invest their money, or the desire of local officials to make a quick buck — land sales generate a lot of revenue for China’s local governments.

Others say the idea of ghost cities has been overblown. They espouse a “build it and they will come” philosophy, pointing out that, with time, some ghost cities fill up and turn into vibrant communities.

It’s been hard to evaluate these claims, since most of the research on ghost cities has been anecdotal. Even the most rigorous research methods leave a lot to be desired — for example, investment research firms sending poor junior employees out to remote locations to count how many lights are turned on in buildings at night.

Now new research from Baidu, one of China’s biggest technology companies, provides one of the first systematic looks at Chinese ghost cities. Researchers from Baidu’s Big Data Lab and Peking University in Beijing used the kind of location data gathered by mobile phones and GPS receivers to track how people moved in and out of suspected ghost cities, in real time and on a national scale, over a period of six months. You can see the interactive project here.

Google has been blocked in China for years, and Baidu dominates the market in terms of search, mobile maps and other offerings. That gave the researchers a huge database to work with — 770 million users, a hefty chunk of China’s 1.36 billion people.

To identify potential ghost cities, the researchers created an algorithm that identifies urban areas with a relatively spare population. They define a ghost city as an urban region with a population of fewer than 5,000 people per square kilometer – about half the density recommended by the Chinese Ministry of Housing and Urban-Rural Development….(More)”
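
A simplified sketch of that density-threshold step is below. The grid size, input format, and handling of the user-to-population scaling are assumptions; the published method is more involved:

```python
# Sketch of the density-threshold idea: bin inferred home locations into
# square grid cells and flag cells below 5,000 people per square kilometre.
# In practice, counts would be scaled from Baidu users to total population
# and the check restricted to areas already classified as built-up/urban.
from collections import Counter

GHOST_THRESHOLD = 5000  # people per square km, per the article

def flag_sparse_cells(home_locations, cell_km=1.0):
    """home_locations: iterable of (x_km, y_km) inferred home positions."""
    counts = Counter(
        (int(x // cell_km), int(y // cell_km)) for x, y in home_locations
    )
    cell_area = cell_km * cell_km
    return {cell: n / cell_area for cell, n in counts.items()
            if n / cell_area < GHOST_THRESHOLD}
```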

Mobile data: Made to measure


Neil Savage in Nature: “For decades, doctors around the world have been using a simple test to measure the cardiovascular health of patients. They ask them to walk on a hard, flat surface and see how much distance they cover in six minutes. This test has been used to predict the survival rates of lung transplant candidates, to measure the progression of muscular dystrophy, and to assess overall cardiovascular fitness.

The walk test has been studied in many trials, but even the biggest rarely top a thousand participants. Yet when Euan Ashley launched a cardiovascular study in March 2015, he collected test results from 6,000 people in the first two weeks. “That’s a remarkable number,” says Ashley, a geneticist who heads Stanford University’s Center for Inherited Cardiovascular Disease. “We’re used to dealing with a few hundred patients, if we’re lucky.”

Numbers on that scale, he hopes, will tell him a lot more about the relationship between physical activity and heart health. The reason they can be achieved is that millions of people now have smartphones and fitness trackers with sensors that can record all sorts of physical activity. Health researchers are studying such devices to figure out what sort of data they can collect, how reliable those data are, and what they might learn when they analyse measurements of all sorts of day-to-day activities from many tens of thousands of people and apply big-data algorithms to the readings.

By July, more than 40,000 people in the United States had signed up to participate in Ashley’s study, which uses an iPhone application called MyHeart Counts. He expects the numbers to surge as the app becomes more widely available around the world. The study — designed by scientists, approved by institutional review boards, and requiring informed consent — asks participants to answer questions about their health and risk factors, and to use their phone’s motion sensors to collect data about their activities for seven days. They also do a six-minute walk test, and the phone measures the distance they cover. If their own doctors have ordered blood tests, users can enter information such as cholesterol or glucose measurements. Every three months, the app checks back to update their data.
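
A hypothetical sketch of how a phone might estimate the walk-test distance from GPS fixes follows; real apps likely fuse accelerometer and pedometer data, so this is only an illustration:

```python
# Sketch of estimating six-minute-walk distance from timestamped GPS fixes.
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two GPS fixes."""
    R = 6_371_000  # mean Earth radius in metres
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

def six_minute_walk_distance(fixes):
    """fixes: list of (timestamp_s, lat, lon) covering the test, in time order."""
    start_time = fixes[0][0]
    total = 0.0
    for (t0, la0, lo0), (t1, la1, lo1) in zip(fixes, fixes[1:]):
        if t1 - start_time > 360:          # stop accumulating after six minutes
            break
        total += haversine_m(la0, lo0, la1, lo1)
    return total
```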

Physicians know that physical activity is a strong predictor of long-term heart health, Ashley says. But it is less clear what kind of activity is best, or whether different groups of people do better with different types of exercise. MyHeart Counts may open a window on such questions. “We can start to look at subgroups and find differences,” he says.

It is the volume of the data that makes such studies possible. In traditional studies, there may not be enough data to find statistically significant results for such subgroups. And rare events may not occur in the smaller samples, or may produce a signal so weak that it is lost in statistical noise. Big data can overcome those problems, and if the data set is big enough, small errors can be smoothed out. “You can take pretty noisy data, but if you have enough of it, you can find a signal,” Ashley says….(More)”.
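
A toy demonstration of that point: averaging over ever more noisy observations recovers an underlying signal whose size is far below the noise (the signal and noise levels below are invented):

```python
# The error of the sample mean shrinks roughly as 1/sqrt(N), so with enough
# noisy observations the underlying signal emerges.
import random

TRUE_SIGNAL = 0.30   # assumed true underlying value
NOISE_SD = 1.0       # measurement noise much larger than the signal

for n in (100, 10_000, 1_000_000):
    samples = [random.gauss(TRUE_SIGNAL, NOISE_SD) for _ in range(n)]
    estimate = sum(samples) / n
    print(f"N={n:>9,}  estimate={estimate:.4f}  error={abs(estimate - TRUE_SIGNAL):.4f}")
```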

How big data and The Sims are helping us to build the cities of the future


The Next Web: “By 2050, the United Nations predicts that around 66 percent of the world’s population will be living in urban areas. It is expected that the greatest expansion will take place in developing regions such as Africa and Asia. Cities in these parts will be challenged to meet the needs of their residents, and provide sufficient housing, energy, waste disposal, healthcare, transportation, education and employment.

So, understanding how cities will grow – and how we can make them smarter and more sustainable along the way – is a high priority among researchers and governments the world over. We need to get to grips with the inner mechanisms of cities, if we’re to engineer them for the future. Fortunately, there are tools to help us do this. And even better, using them is a bit like playing SimCity….

Cities are complex systems. Increasingly, scientists studying cities have gone from thinking about “cities as machines” to approaching “cities as organisms”. Viewing cities as complex, adaptive organisms – similar to natural systems like termite mounds or slime mould colonies – allows us to gain unique insights into their inner workings. …So, if cities are like organisms, it follows that we should examine them from the bottom up, and seek to understand how unexpected large-scale phenomena emerge from individual-level interactions. Specifically, we can simulate how the behaviour of individual “agents” – whether they are people, households, or organisations – affects the urban environment, using a set of techniques known as “agent-based modelling”….These days, increases in computing power and the proliferation of big data give agent-based modelling unprecedented power and scope. One of the most exciting developments is the potential to incorporate people’s thoughts and behaviours. In doing so, we can begin to model the impacts of people’s choices on present circumstances, and the future.

For example, we might want to know how changes to the road layout might affect crime rates in certain areas. By modelling the activities of individuals who might try to commit a crime, we can see how altering the urban environment influences how people move around the city, the types of houses that they become aware of, and consequently which places have the greatest risk of becoming the targets of burglary.
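
A minimal agent-based sketch of that burglary example appears below; the street graph, random-walk movement rule, and exposure measure are invented for illustration and are not the researchers’ actual model:

```python
# Agents walk a street network; houses they pass become "known" targets, so a
# change to the road layout changes which houses accumulate burglary risk.
import random

def simulate_awareness(street_graph, homes_on_street, n_agents=100, steps=200, seed=1):
    """street_graph: dict mapping a street node to a list of connected neighbours.
    homes_on_street: dict mapping a street node to a list of house IDs."""
    rng = random.Random(seed)
    exposure = {}  # house ID -> number of times a simulated offender passed it
    for _ in range(n_agents):
        node = rng.choice(list(street_graph))
        for _ in range(steps):
            node = rng.choice(street_graph[node])   # random walk over the layout
            for house in homes_on_street.get(node, []):
                exposure[house] = exposure.get(house, 0) + 1
    # In this toy model, the most-exposed houses are at greatest risk.
    return sorted(exposure.items(), key=lambda kv: kv[1], reverse=True)
```

Re-running such a simulation with an altered `street_graph` would show, in this toy setting, how a new road layout shifts exposure between neighbourhoods.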

To fully realise the goal of simulating cities in this way, models need a huge amount of data. For example, to model the daily flow of people around a city, we need to know what kinds of things people spend their time doing, where they do them, who they do them with, and what drives their behaviour.

Without good-quality, high-resolution data, we have no way of knowing whether our models are producing realistic results. Big data could offer researchers a wealth of information to meet these twin needs. The kinds of data that are exciting urban modellers include:

  • Electronic travel cards that tell us how people move around a city.
  • Twitter messages that provide insight into what people are doing and thinking.
  • The density of mobile telephones that hint at the presence of crowds.
  • Loyalty and credit-card transactions to understand consumer behaviour.
  • Participatory mapping of hitherto unknown urban spaces, such as Open Street Map.

These data can often be refined to the level of a single person. As a result, models of urban phenomena no longer need to rely on assumptions about the population as a whole – they can be tailored to capture the diversity of a city full of individuals, who often think and behave differently from one another….(More)