Government wants access to personal data while it pushes privacy


Sara Fischer and Scott Rosenberg at Axios: “Over the past two years, the U.S. government has tried to rein in how major tech companies use the personal data they’ve gathered on their customers. At the same time, government agencies are themselves seeking to harness those troves of data.

Why it matters: Tech platforms use personal information to target ads, whereas the government can use it to prevent and solve crimes, deliver benefits to citizens — or (illegally) target political dissent.

Driving the news: A new report from the Wall Street Journal details the ways in which family DNA testing sites like FamilyTreeDNA are pressured by the FBI to hand over customer data to help solve criminal cases using DNA.

  • The trend has privacy experts worried about the potential implications of the government having access to large pools of genetic data, even though many people whose data is included never agreed to its use for that purpose.

The FBI has particular interest in data from genetic and social media sites, because it could help solve crimes and protect the public.

  • For example, the FBI is “soliciting proposals from outside vendors for a contract to pull vast quantities of public data” from Facebook, Twitter Inc. and other social media companies, the Wall Street Journal reports.
  • The request is meant to help the agency surveil social behavior to “mitigate multifaceted threats, while ensuring all privacy and civil liberties compliance requirements are met.”
  • Meanwhile, the Trump administration has also urged social media platforms to cooperate with the government in efforts to flag individual users as potential mass shooters.

Other agencies have their eyes on big data troves as well.

  • Earlier this year, settlement talks between Facebook and the Department of Housing and Urban Development broke down over an advertising discrimination lawsuit when, according to a Facebook spokesperson, HUD “insisted on access to sensitive information — like user data — without adequate safeguards.”
  • HUD presumably wanted access to the data to ensure advertising discrimination wasn’t occurring on the platform, but it’s unclear whether the agency needed user data to be able to support that investigation….(More)”.

Investigators Use New Strategy to Combat Opioid Crisis: Data Analytics


Byron Tau and Aruna Viswanatha in the Wall Street Journal: “When federal investigators got a tip in 2015 that a health center in Houston was distributing millions of doses of opioid painkillers, they tried a new approach: look at the numbers.

State and federal prescription and medical billing data showed a pattern of overprescription, giving authorities enough ammunition to send an undercover Drug Enforcement Administration agent. She found a crowded waiting room and armed security guards. After a 91-second appointment with the sole doctor, the agent paid $270 at the cash-only clinic and walked out with 100 10mg pills of the powerful opioid hydrocodone.

The subsequent prosecution of the doctor and the clinic owner, who were sentenced last year to 35 years in prison, laid the groundwork for a new data-driven Justice Department strategy to help target one of the worst public-health crises in the country. Prosecutors expanded the pilot program from Houston to the hard-hit Appalachian region in early 2019. Within months, the effort resulted in the indictments of dozens of doctors, nurses, pharmacists and others. Two-thirds of them had been identified through analyzing the data, a Justice Department official said. A quarter of defendants were expected to plead guilty, according to the Justice Department, and additional indictments through the program are expected in the coming weeks.

“These are doctors behaving like drug dealers,” said Brian Benczkowski, head of the Justice Department’s criminal division who oversaw the expansion.

“They’ve been operating as though nobody could see them for a long period of time. Now we have the data,” Mr. Benczkowski said.

The Justice Department’s fraud section has been using data analytics in health-care prosecutions for several years—combing through Medicare and Medicaid billing data for evidence of fraud, and deploying the strategy in cities around the country that saw outlier billings. In 2018, the health-care fraud unit charged more than 300 people with fraud totaling more than $2 billion, according to the Justice Department.

But using the data to combat the opioid crisis, which is ravaging communities across the country, is a new development for the department, which has made tackling the epidemic a key priority in the Trump administration….(More)”.
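The outlier-billing analysis described above can be illustrated with a minimal sketch, using hypothetical field names, synthetic figures and a simple peer-comparison z-score. This is an illustration only, not the Justice Department's actual tooling:

```python
import pandas as pd

# Synthetic billing records; real analyses draw on state and federal
# prescription and Medicare/Medicaid billing data. All names are hypothetical.
claims = pd.DataFrame({
    "prescriber_id": ["P1", "P2", "P3", "P4", "P5", "P6",
                      "P7", "P8", "P9", "P10", "P11", "P12"],
    "region": ["Houston"] * 6 + ["Appalachia"] * 6,
    "opioid_claims": [900, 1100, 1000, 950, 1200, 14500,
                      800, 1100, 900, 1050, 1000, 9800],
})

# Compare each prescriber against regional peers with a z-score.
stats = claims.groupby("region")["opioid_claims"].agg(["mean", "std"])
claims = claims.join(stats, on="region")
claims["z"] = (claims["opioid_claims"] - claims["mean"]) / claims["std"]

# Flag prescribers far above the regional norm: a lead for investigators,
# not proof of wrongdoing.
print(claims[claims["z"] > 1.5][["prescriber_id", "region", "opioid_claims", "z"]])
```

Real programs combine many more signals than a single claim count, but the underlying idea is the same: surface billing profiles that deviate sharply from peers and hand them to human investigators.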

Smart Governance for Cities: Perspectives and Experiences


Book edited by Nuno Vasco Moreira Lopes: “This book provides theoretical perspectives and practical experiences on smart governance for smart cities. It presents a balanced linkage between research, policies and practices in this area. The authors discuss the sustainability challenges raised by rapid urbanization, challenges with smart governance models in various countries, and a new governance paradigm seen as an approach capable of overcoming social, economic and environmental sustainability problems. The authors include case studies on transformation, adaptation and transfers, as well as country, regional and municipal contextualization. Also included are best practices on monitoring and evaluating smart governance and impact assessment. The book features contributions from researchers, academics, and practitioners in the field. 

  • Analyzes smart governance for cities from a variety of perspectives and a variety of sectors – both in theory and in practice
  • Features information on the linkage between United Nations Sustainable Development Goals and smart governance
  • Covers the connection between research, policies and practice in smart governance for smart cities…(More)”.

Fostering an Enabling Policy and Regulatory Environment in APEC for Data-Utilizing Businesses


APEC: “The objectives of this study are to better understand: 1) how firms from different sectors use data in their business models; and, considering the significant increase in data-related policies and regulations enacted by governments across the world, 2) how such policies and regulations are affecting their use of data and hence their business models. The study also tries: 3) to identify some of the middle-ground approaches that would enable governments to achieve public policy objectives, such as data security and privacy, while also promoting the growth of data-utilizing businesses. Thirty-nine firms from 12 economies participated in this project, drawn from a diverse group of industries, including aviation, logistics, shipping, payment services, encryption services, and manufacturing. The synthesis report can be found in Chapter 1, while the case study chapters can be found in Chapters 2 to 10….(More)”.

Sharing Private Data for Public Good


Stefaan G. Verhulst at Project Syndicate: “After Hurricane Katrina struck New Orleans in 2005, the direct-mail marketing company Valassis shared its database with emergency agencies and volunteers to help improve aid delivery. In Santiago, Chile, analysts from Universidad del Desarrollo, ISI Foundation, UNICEF, and the GovLab collaborated with Telefónica, the city’s largest mobile operator, to study gender-based mobility patterns in order to design a more equitable transportation policy. And as part of the Yale University Open Data Access project, health-care companies Johnson & Johnson, Medtronic, and SI-BONE give researchers access to previously walled-off data from 333 clinical trials, opening the door to possible new innovations in medicine.

These are just three examples of “data collaboratives,” an emerging form of partnership in which participants exchange data for the public good. Such tie-ups typically involve public bodies using data from corporations and other private-sector entities to benefit society. But data collaboratives can help companies, too – pharmaceutical firms share data on biomarkers to accelerate their own drug-research efforts, for example. Data-sharing initiatives also have huge potential to improve artificial intelligence (AI). But they must be designed responsibly and take data-privacy concerns into account.

Understanding the societal and business case for data collaboratives, as well as the forms they can take, is critical to gaining a deeper appreciation of the potential and limitations of such ventures. The GovLab has identified over 150 data collaboratives spanning continents and sectors; they include companies such as Air France, Zillow, and Facebook. Our research suggests that such partnerships can create value in three main ways….(More)”.

The Ethics of Hiding Your Data From the Machines


Molly Wood at Wired: “…But now that data is being used to train artificial intelligence, and the insights those future algorithms create could quite literally save lives.

So while targeted advertising is an easy villain, data-hogging artificial intelligence is a dangerously nuanced and highly sympathetic bad guy, like Erik Killmonger in Black Panther. And it won’t be easy to hate.

I recently met with a company that wants to do a sincerely good thing. They’ve created a sensor that pregnant women can wear, and it measures their contractions. It can reliably predict when women are going into labor, which can help reduce preterm births and C-sections. It can get women into care sooner, which can reduce both maternal and infant mortality.

All of this is an unquestionable good.

And this little device is also collecting a treasure trove of information about pregnancy and labor that is feeding into clinical research that could upend maternal care as we know it. Did you know that the way most obstetricians learn to track a woman’s progress through labor is based on a single study from the 1950s, involving 500 women, all of whom were white?…

To save the lives of pregnant women and their babies, researchers and doctors, and yes, startup CEOs and even artificial intelligence algorithms need data. To cure cancer, or at least offer personalized treatments that have a much higher possibility of saving lives, those same entities will need data….

And for we consumers, well, a blanket refusal to offer up our data to the AI gods isn’t necessarily the good choice either. I don’t want to be the person who refuses to contribute my genetic data via 23andMe to a massive research study that could, and I actually believe this is possible, lead to cures and treatments for diseases like Parkinson’s and Alzheimer’s and who knows what else.

I also think I deserve a realistic assessment of the potential for harm to find its way back to me, because I didn’t think through or wasn’t told all the potential implications of that choice—like how, let’s be honest, we all felt a little stung when we realized the 23andMe research would be through a partnership with drugmaker (and reliable drug price-hiker) GlaxoSmithKline. Drug companies, like targeted ads, are easy villains—even though this partnership actually could produce a Parkinson’s drug. But do we know what GSK’s privacy policy looks like? That deal was a level of sharing we didn’t necessarily expect….(More)”.

Companies Collect a Lot of Data, But How Much Do They Actually Use?


Article by Priceonomics Data Studio: “For all the talk of how data is the new oil and the most valuable resource of any enterprise, there is a deep dark secret companies are reluctant to share — most of the data collected by businesses simply goes unused.

This unknown and unused data, known as dark data, comprises more than half the data collected by companies. Given that some estimates indicate that 7.5 septillion (7,700,000,000,000,000,000,000) gigabytes of data are generated every single day, not using most of it is a considerable issue.

In this article, we’ll look at this dark data: just how much of it companies create, why it isn’t being analyzed, and what the costs and implications are of companies not using the majority of the data they collect.

Before diving into the analysis, it’s worth spending a moment clarifying what we mean by the term “dark data.” Gartner defines dark data as:

“The information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”

To learn more about this phenomenon, Splunk commissioned a global survey of 1,300+ business leaders to better understand how much data they collect, and how much of it is dark. Respondents held IT and business roles across various industries, and were located in Australia, China, France, Germany, Japan, the United States, and the United Kingdom. For the report, Splunk defines dark data as: “all the unknown and untapped data across an organization, generated by systems, devices and interactions.”

While the cost of storing data has decreased over time, the cost of saving septillions of gigabytes of wasted data is still significant. What’s more, during this time the strategic importance of data has increased as companies have found more and more uses for it. Given the cost of storage and the value of data, why does so much of it go unused?

The following chart shows the reasons why dark data isn’t currently being harnessed:

By a large margin, the number one reason given for not using dark data is that companies lack a tool to capture or analyze the data. Companies accumulate data from server logs, GPS networks, security tools, call records, web traffic and more. Companies track everything from digital transactions to the temperature of their server rooms to the contents of retail shelves. Most of this data lies in separate systems, is unstructured, and cannot be connected or analyzed.

Second, the data captured just isn’t good enough. You might have important customer information about a transaction, but it’s missing location or other important metadata because that information sits somewhere else or was never captured in a usable format.

Additionally, dark data exists because there is simply too much data out there and a lot of it is unstructured. The larger the dataset (or the less structured it is), the more sophisticated the tool required for analysis. These kinds of datasets also often require analysis by individuals with significant data science expertise, who are in short supply.

The implications of the prevalence of dark data are vast. As a result of the data deluge, companies often don’t know where all their sensitive data is stored and can’t be confident they are complying with consumer data protection measures like GDPR. …(More)”.

Exploring the Smart City Indexes and the Role of Macro Factors for Measuring Cities Smartness


María Verónica Alderete in Social Indicators Research: “The main objective of this paper is to discuss the key factors involved in the definition of smart city indexes. Although recent literature has explored the smart city subject, it remains an open question whether macro ICT factors should also be considered when assessing the technological innovation of a city. To achieve this goal, the paper first provides a literature review of the smart city, including an analysis of the smart city concept together with a theoretical framework based on the knowledge society and the Quintuple Helix innovation model. Secondly, the study analyzes some smart city cases in developed and developing countries. Thirdly, it describes, criticizes and compares some well-known smart city indexes. Lastly, the empirical literature is explored to detect whether there are studies proposing changes to smart city indexes or methodologies that take macro-level variables into account. The analysis shows that cities at the top of the index rankings are from developed countries, while most cities at the bottom are from developing or less developed countries. The paper therefore argues that the ICT development of smart cities depends both on the cities’ own characteristics and features and on macro-technological factors. It also finds that few papers on the subject include macro or country factors, and that most of these are literature reviews or case studies; there is a lack of studies discussing the indexes’ methodologies. This paper provides some guidelines to build one….(More)”.
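As a rough illustration of what building such an index involves, and of where a macro, country-level ICT factor could enter, the sketch below combines min-max-normalized indicators under arbitrary weights. All indicator names, values and weights are hypothetical and are not taken from the paper:

```python
import pandas as pd

# Hypothetical city-level indicators plus one macro (country-level) ICT factor.
cities = pd.DataFrame({
    "city": ["City A", "City B", "City C"],
    "open_data_portals": [12, 3, 7],       # city-level indicator
    "egov_services": [45, 10, 25],         # city-level indicator
    "country_ict_index": [8.1, 4.2, 6.5],  # macro factor (national ICT development)
}).set_index("city")

# Min-max normalize each indicator to [0, 1] so they can be combined.
normalized = (cities - cities.min()) / (cities.max() - cities.min())

# Arbitrary illustrative weights; a real index would need to justify these choices.
weights = {"open_data_portals": 0.35, "egov_services": 0.35, "country_ict_index": 0.30}
cities["smartness_score"] = sum(normalized[col] * w for col, w in weights.items())

print(cities["smartness_score"].sort_values(ascending=False))
```

The weighting and normalization steps are exactly where methodological choices, and the inclusion or exclusion of macro factors, change how cities rank.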

Stop the Open Data Bus, We Want to Get Off


Paper by Chris Culnane, Benjamin I. P. Rubinstein, and Vanessa Teague: “The subject of this report is the re-identification of individuals in the Myki public transport dataset released as part of the Melbourne Datathon 2018. We demonstrate the ease with which we were able to re-identify ourselves, our co-travellers, and complete strangers; our analysis raises concerns about the nature and granularity of the data released, in particular the ability to identify vulnerable or sensitive groups…..

This work highlights how a large number of passengers could be re-identified in the 2018 Myki data release, with detailed discussion of specific people. The implications of re-identification are potentially serious: ex-partners, one-time acquaintances, or other parties can determine places of home, work, times of travel, co-travelling patterns—presenting risk to vulnerable groups in particular…

In 2018 the Victorian Government released a large passenger-centric transport dataset to a data science competition—the 2018 Melbourne Datathon. Access to the data was unrestricted, with a URL provided on the datathon’s website to download the complete dataset from an Amazon S3 Bucket. Over 190 teams continued to analyse the data through the 2-month competition period. The data consisted of touch on and touch off events for the Myki smart card ticketing system used throughout the state of Victoria, Australia. With such data, contestants would be able to apply retrospective analyses to an entire public transport system, explore the suitability of predictive models, etc.

The Myki ticketing system is used across Victorian public transport: on trains, buses and trams. The dataset was a longitudinal dataset, consisting of touch on and touch off events from Week 27 in 2015 through to Week 26 in 2018. Each event contained a card identifier (cardId; not the actual card number), the card type, the time of the touch on or off, and various location information, for example a stop ID or route ID, along with other fields which we omit here for brevity. Events could be indexed by the cardId and as such, all the events associated with a single card could be retrieved. There are a total of 15,184,336 cards in the dataset—more than twice the 2018 population of Victoria. It appears that all touch on and off events for metropolitan trains and trams have been included, though other forms of transport such as intercity trains and some buses are absent. In total there are nearly 2 billion touch on and off events in the dataset.

No information was provided as to the de-identification that was performed on the dataset. Our analysis indicates that little to no de-identification took place on the bulk of the data, as will become evident in Section 3. The exception is the cardId, which appears to have been mapped in some way from the Myki Card Number. The exact mapping has not been discovered, although concerns remain as to its security effectiveness….(More)”.
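The re-identification strategy the authors describe, matching a few known trips against the event log until only one cardId remains, can be sketched as follows (hypothetical column names and synthetic events, not the paper's code or data):

```python
import pandas as pd

# Toy event log loosely mirroring the described schema: one row per
# touch-on/touch-off event, keyed by a pseudonymous card identifier.
events = pd.DataFrame({
    "card_id": [101, 101, 101, 202, 202, 303, 303, 303],
    "stop_id": ["Flinders St", "Richmond", "Flinders St", "Flinders St",
                "Southern Cross", "Richmond", "Flinders St", "Richmond"],
    "date":    ["2018-03-05", "2018-03-05", "2018-03-06", "2018-03-05",
                "2018-03-05", "2018-03-06", "2018-03-06", "2018-03-07"],
    "hour":    [8, 17, 8, 9, 18, 8, 17, 8],
})

# Trips an adversary might already know about a target, e.g. from travelling
# together once or from a social media post.
known_trips = [("Flinders St", "2018-03-05", 8), ("Flinders St", "2018-03-06", 8)]

# Intersect the sets of cards matching each known trip.
candidates = None
for stop, date, hour in known_trips:
    matching = set(events[(events["stop_id"] == stop) &
                          (events["date"] == date) &
                          (events["hour"] == hour)]["card_id"])
    candidates = matching if candidates is None else candidates & matching

print(candidates)  # {101}: two coarse observations isolate a single card
```

At the scale of the real release, with millions of cards and billions of events, the same intersection shrinks quickly; once a single cardId is isolated, that card's entire travel history becomes visible, which is the essence of the risk the report describes.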

Datafication and accountability in public health


Introduction to a special issue of Social Studies of Science by Klaus Hoeyer, Susanne Bauer, and Martyn Pickersgill: “In recent years and across many nations, public health has become subject to forms of governance that are said to be aimed at establishing accountability. In this introduction to a special issue, From Person to Population and Back: Exploring Accountability in Public Health, we suggest opening up accountability assemblages by asking a series of ostensibly simple questions that inevitably yield complicated answers: What is counted? What counts? And to whom, how and why does it count? Addressing such questions involves staying attentive to the technologies and infrastructures through which data come into being and are made available for multiple political agendas. Through a discussion of public health, accountability and datafication we present three key themes that unite the various papers as well as illustrate their diversity….(More)”.