Sophie Bushwick at Scientific American: “The world produces roughly 2.5 quintillion bytes of digital data per day, adding to a sea of information that includes intimate details about many individuals’ health and habits. To protect privacy, data brokers must anonymize such records before sharing them with researchers and marketers. But a new study finds it is relatively easy to reidentify a person from a supposedly anonymized data set—even when that set is incomplete.
Massive data repositories can reveal trends that teach medical researchers about disease, demonstrate issues such as the effects of income inequality, coach artificial intelligence into humanlike behavior and, of course, aim advertising more efficiently. To shield people who—wittingly or not—contribute personal information to these digital storehouses, most brokers send their data through a process of deidentification. This procedure involves removing obvious markers, including names and social security numbers, and sometimes taking other precautions, such as introducing random “noise” data to the collection or replacing specific details with general ones (for example, swapping a birth date of “March 7, 1990” for “January–April 1990”). The brokers then release or sell a portion of this information.
“Data anonymization is basically how, for the past 25 years, we’ve been using data for statistical purposes and research while preserving people’s privacy,” says Yves-Alexandre de Montjoye, an assistant professor of computational privacy at Imperial College London and co-author of the new study, published this week in Nature Communications. Many commonly used anonymization techniques, however, originated in the 1990s, before the Internet’s rapid development made it possible to collect such an enormous amount of detail about things such as an individual’s health, finances, and shopping and browsing habits. This discrepancy has made it relatively easy to connect an anonymous line of data to a specific person: if a private detective is searching for someone in New York City and knows the subject is male, is 30 to 35 years old and has diabetes, the sleuth would not be able to deduce the man’s name—but could likely do so quite easily if he or she also knows the target’s birthday, number of children, zip code, employer and car model…”
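The detective example comes down to counting: each extra attribute the searcher knows shrinks the set of records that still match. The sketch below illustrates that idea on a handful of made-up records. It is not the method of the study quoted above; the field names, the values, and the helper functions `matching` and `unique_fraction` are all assumptions invented for illustration.

```python
from collections import Counter

# Hypothetical, invented records with names already removed ("deidentified").
RECORDS = [
    {"sex": "M", "age_band": "30-35", "condition": "diabetes", "zip": "10001", "children": 2},
    {"sex": "M", "age_band": "30-35", "condition": "diabetes", "zip": "10001", "children": 0},
    {"sex": "M", "age_band": "30-35", "condition": "diabetes", "zip": "10027", "children": 2},
    {"sex": "F", "age_band": "30-35", "condition": "diabetes", "zip": "10001", "children": 2},
    {"sex": "M", "age_band": "40-45", "condition": "asthma", "zip": "10001", "children": 1},
]


def matching(records, known):
    """Return the records consistent with every attribute the searcher knows."""
    return [r for r in records if all(r[k] == v for k, v in known.items())]


def unique_fraction(records, attributes):
    """Fraction of records whose combination of the given attributes occurs once."""
    combos = Counter(tuple(r[a] for a in attributes) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[a] for a in attributes)] == 1)
    return unique / len(records)


# What the detective knows at first, and after learning two more details.
few = {"sex": "M", "age_band": "30-35", "condition": "diabetes"}
many = {**few, "zip": "10001", "children": 2}

print(len(matching(RECORDS, few)))   # 3 candidates remain
print(len(matching(RECORDS, many)))  # 1 candidate remains
print(unique_fraction(RECORDS, ["sex", "age_band", "condition"]))
print(unique_fraction(RECORDS, ["sex", "age_band", "condition", "zip", "children"]))
```

With three known attributes, three of the toy records remain candidates. Adding zip code and number of children leaves exactly one, even though no name was ever stored.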