Synthetic data offers advanced privacy for the Census Bureau, business


Kate Kaye at IAPP: “In the early 2000s, internet accessibility made risks of exposing individuals from population demographic data more likely than ever. So, the U.S. Census Bureau turned to an emerging privacy approach: synthetic data.

Some argue the algorithmic techniques used to develop privacy-secure synthetic datasets go beyond traditional deidentification methods. Today, along with the Census Bureau, clinical researchers, autonomous vehicle system developers and banks use these fake datasets that mimic statistically valid data.

In many cases, synthetic data is built from existing data by filtering it through machine learning models. Real data representing real individuals flows in, and fake data mimicking individuals with corresponding characteristics flows out.

When data scientists at the Census Bureau began exploring synthetic data methods, adoption of the internet had made deidentified, open-source data on U.S. residents, their households and businesses more accessible than in the past.

Especially concerning, census-block-level information was now widely available. Because in rural areas, a census block could represent data associated with as few as one house, simply stripping names, addresses and phone numbers from that information might not be enough to prevent exposure of individuals.

“There was pretty widespread angst” among statisticians, said John Abowd, the bureau’s associate director for research and methodology and chief scientist. The hand-wringing led to a “gradual awakening” that prompted the agency to begin developing synthetic data methods, he said.

Synthetic data built from the real data preserves privacy while providing information that is still relevant for research purposes, Abowd said: “The basic idea is to try to get a model that accurately produces an image of the confidential data.”

The plan for the 2020 census is to produce a synthetic image of that original data. The bureau also produces On the Map, a web-based mapping and reporting application that provides synthetic data showing where workers are employed and where they live along with reports on age, earnings, industry distributions, race, ethnicity, educational attainment and sex.

Of course, the real census data is still locked away, too, Abowd said: “We have a copy and the national archives have a copy of the confidential microdata.”…(More)”.
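
The "real data in, fake data out" flow described in the excerpt can be illustrated with a deliberately small sketch. It is not the Census Bureau's method: the field names, the toy records and the choice of model (the empirical joint distribution of a few categorical fields) are all assumptions made for illustration. Sampling from observed frequencies can still reproduce rare combinations, so a model this simple carries no privacy guarantee of its own.

```python
import random
from collections import Counter


def fit_joint_model(records, fields):
    """Estimate the empirical joint distribution of the chosen categorical fields."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    total = sum(counts.values())
    return {combo: n / total for combo, n in counts.items()}


def sample_synthetic(model, fields, n, seed=0):
    """Draw n synthetic records from the fitted distribution."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    combos = list(model)
    weights = [model[c] for c in combos]
    draws = rng.choices(combos, weights=weights, k=n)
    return [dict(zip(fields, combo)) for combo in draws]


# Hypothetical confidential records (all values invented for this example).
real = [
    {"name": "A. Smith", "age_band": "30-39", "sex": "F", "industry": "health"},
    {"name": "B. Jones", "age_band": "30-39", "sex": "M", "industry": "retail"},
    {"name": "C. Brown", "age_band": "40-49", "sex": "F", "industry": "health"},
    {"name": "D. White", "age_band": "40-49", "sex": "F", "industry": "health"},
]

# Direct identifiers such as "name" are left out of the model entirely.
fields = ["age_band", "sex", "industry"]
model = fit_joint_model(real, fields)
synthetic = sample_synthetic(model, fields, n=1000)
```

The synthetic records carry only the modeled fields, and their combinations occur at roughly the frequencies seen in the real data, which is what keeps them usable for research.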

Scraping the Web for Public Health Gains: Ethical Considerations from a ‘Big Data’ Research Project on HIV and Incarceration


Stuart Rennie, Mara Buchbinder, Eric Juengst, Lauren Brinkley-Rubinstein, and David L Rosen at Public Health Ethics: “Web scraping involves using computer programs for automated extraction and organization of data from the Web for the purpose of further data analysis and use. It is frequently used by commercial companies, but also has become a valuable tool in epidemiological research and public health planning. In this paper, we explore ethical issues in a project that “scrapes” public websites of U.S. county jails as part of an effort to develop a comprehensive database (including individual-level jail incarcerations, court records and confidential HIV records) to enhance HIV surveillance and improve continuity of care for incarcerated populations. We argue that the well-known framework of Emanuel et al. (2000) provides only partial ethical guidance for the activities we describe, which lie at a complex intersection of public health research and public health practice. We suggest some ethical considerations from the ethics of public health practice to help fill gaps in this relatively unexplored area….(More)”.

How Taiwan Used Big Data, Transparency and a Central Command to Protect Its People from Coronavirus


Article by Beth Duff-Brown: “…So what steps did Taiwan take to protect its people? And could those steps be replicated here at home?

Stanford Health Policy’s Jason Wang, MD, PhD, an associate professor of pediatrics at Stanford Medicine who also has a PhD in policy analysis, credits his native Taiwan with using new technology and a robust pandemic prevention plan put into place at the 2003 SARS outbreak.

“The Taiwan government established the National Health Command Center (NHCC) after SARS and it’s become part of a disaster management center that focuses on large-outbreak responses and acts as the operational command point for direct communications,” said Wang, a pediatrician and the director of the Center for Policy, Outcomes, and Prevention at Stanford. The NHCC also established the Central Epidemic Command Center, which was activated in early January.

“And Taiwan rapidly produced and implemented a list of at least 124 action items in the past five weeks to protect public health,” Wang said. “The policies and actions go beyond border control because they recognized that that wasn’t enough.”

Wang outlines the measures Taiwan took in the last six weeks in an article published Tuesday in the Journal of the American Medical Association.

“Given the continual spread of COVID-19 around the world, understanding the action items that were implemented quickly in Taiwan, and the effectiveness of these actions in preventing a large-scale epidemic, may be instructive for other countries,” Wang and his co-authors wrote.

Within the last five weeks, Wang said, the Taiwan epidemic command center rapidly implemented those 124 action items, including border control from the air and sea, case identification using new data and technology, quarantine of suspicious cases, educating the public while fighting misinformation, negotiating with other countries — and formulating policies for schools and businesses to follow.

Big Data Analytics

The authors note that Taiwan integrated its national health insurance database with its immigration and customs database to begin the creation of big data for analytics. That allowed them case identification by generating real-time alerts during a clinical visit based on travel history and clinical symptoms.

Taipei also used Quick Response (QR) code scanning and online reporting of travel history and health symptoms to classify travelers’ infectious risks based on flight origin and travel history in the last 14 days. People who had not traveled to high-risk areas were sent a health declaration border pass via SMS for faster immigration clearance; those who had traveled to high-risk areas were quarantined at home and tracked through their mobile phones to ensure that they stayed home during the incubation period.

The country also instituted a toll-free hotline for citizens to report suspicious symptoms in themselves or others. As the disease progressed, the government called on major cities to establish their own hotlines so that the main hotline would not become jammed….(More)”.
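
The traveler triage described in the excerpt (risk classified from travel history in the last 14 days, with low-risk travelers receiving an SMS border pass and high-risk travelers quarantined at home) can be sketched as a simple rule. This is only an illustration of the logic as reported, not Taiwan's actual system: the region codes, the `Trip` record and the function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical placeholders; the real list of high-risk areas is not given in the article.
HIGH_RISK_AREAS = {"REGION_A", "REGION_B"}
LOOKBACK_DAYS = 14  # the window named in the article


@dataclass
class Trip:
    region: str
    departed_on: date  # date the traveler left that region


def classify_traveler(trips, arrival_date):
    """Return the handling path for a traveler based on recent travel history."""
    window_start = arrival_date - timedelta(days=LOOKBACK_DAYS)
    recent_high_risk = any(
        t.region in HIGH_RISK_AREAS and window_start <= t.departed_on <= arrival_date
        for t in trips
    )
    return "home_quarantine" if recent_high_risk else "sms_border_pass"


arrival = date(2020, 3, 1)
recent = classify_traveler([Trip("REGION_A", date(2020, 2, 20))], arrival)
stale = classify_traveler([Trip("REGION_A", date(2020, 1, 1))], arrival)
low_risk = classify_traveler([Trip("REGION_C", date(2020, 2, 25))], arrival)
```

Here `recent` falls inside the 14-day window and maps to home quarantine, while `stale` and `low_risk` map to the faster border pass.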

Facebook Ads as a Demographic Tool to Measure the Urban-Rural Divide


Paper by Daniele Rama, Yelena Mejova, Michele Tizzoni, Kyriaki Kalimeri, and Ingmar Weber: “In the global move toward urbanization, making sure the people remaining in rural areas are not left behind in terms of development and policy considerations is a priority for governments worldwide. However, it is increasingly challenging to track important statistics concerning this sparse, geographically dispersed population, resulting in a lack of reliable, up-to-date data. In this study, we examine the usefulness of the Facebook Advertising platform, which offers a digital “census” of over two billion of its users, in measuring potential rural-urban inequalities.

We focus on Italy, a country where about 30% of the population lives in rural areas. First, we show that the population statistics that Facebook produces suffer from instability across time and incomplete coverage of sparsely populated municipalities. To overcome such limitation, we propose an alternative methodology for estimating Facebook Ads audiences that nearly triples the coverage of the rural municipalities from 19% to 55% and makes feasible fine-grained sub-population analysis. Using official national census data, we evaluate our approach and confirm known significant urban-rural divides in terms of educational attainment and income. Extending the analysis to Facebook-specific user “interests” and behaviors, we provide further insights on the divide, for instance, finding that rural areas show a higher interest in gambling. Notably, we find that the most predictive features of income in rural areas differ from those for urban centres, suggesting researchers need to consider a broader range of attributes when examining rural wellbeing. The findings of this study illustrate the necessity of improving existing tools and methodologies to include under-represented populations in digital demographic studies — the failure to do so could result in misleading observations, conclusions, and most importantly, policies….(More)”.

How big data is dividing the public in China’s coronavirus fight – green, yellow, red


Article by Viola Zhou: “On Valentine’s Day, a 36-year-old lawyer Matt Ma in the eastern Chinese province of Zhejiang discovered he had been coded “red”. The colour, displayed in a payment app on his smartphone, indicated that he needed to be quarantined at home even though he had no symptoms of the dangerous coronavirus.

Without a green light from the system, Ma could not travel from his ancestral hometown of Lishui to his new home city of Hangzhou, which is now surrounded by checkpoints set up to contain the epidemic.

Ma is one of the millions of people whose movements are being choreographed by the government through software that feeds on troves of data and issues orders that effectively dictate whether they must stay in or can go to work. Their experience represents a slice of China’s desperate attempt to stop the coronavirus by using a mixed bag of cutting-edge technologies and old-fashioned surveillance. It was also a rare real-world test of the use of technology on a large scale to halt the spread of communicable diseases.

“This kind of massive use of technology is unprecedented,” said Christos Lynteris, a medical anthropologist at the University of St Andrews who has studied epidemics in China.

But Hangzhou’s experiment has also revealed the pitfalls of applying opaque formulas to a large population.

In the city’s case, there are reports of people being marked incorrectly, falling victim to an algorithm that is, by the government’s own admission, not perfect….(More)”.

Who will benefit most from the data economy?


Special Report by The Economist: “The data economy is a work in progress. Its economics still have to be worked out; its infrastructure and its businesses need to be fully built; geopolitical arrangements must be found. But there is one final major tension: between the wealth the data economy will create and how it will be distributed. The data economy—or the “second economy”, as Brian Arthur of the Santa Fe Institute terms it—will make the world a more productive place no matter what, he predicts. But who gets what and how is less clear. “We will move from an economy where the main challenge is to produce more and more efficiently,” says Mr Arthur, “to one where distribution of the wealth produced becomes the biggest issue.”

The data economy as it exists today is already very unequal. It is dominated by a few big platforms. In the most recent quarter, Amazon, Apple, Alphabet, Microsoft and Facebook made a combined profit of $55bn, more than the next five most valuable American tech firms over the past 12 months. This corporate inequality is largely the result of network effects—economic forces that mean size begets size. A firm that can collect a lot of data, for instance, can make better use of artificial intelligence and attract more users, who in turn supply more data. Such firms can also recruit the best data scientists and have the cash to buy the best AI startups.

It is also becoming clear that, as the data economy expands, these sorts of dynamics will increasingly apply to non-tech companies and even countries. In many sectors, the race to become a dominant data platform is on. This is the mission of Compass, a startup, in residential property. It is one goal of Tesla in self-driving cars. And Apple and Google hope to repeat the trick in health care. As for countries, America and China account for 90% of the market capitalisation of the world’s 70 largest platforms (see chart), Africa and Latin America for just 1%. Economies on both continents risk “becoming mere providers of raw data…while having to pay for the digital intelligence produced,” the United Nations Conference on Trade and Development recently warned.

Yet it is the skewed distribution of income between capital and labour that may turn out to be the most pressing problem of the data economy. As it grows, more labour will migrate into the mirror worlds, just as other economic activity will. It is not only that people will do more digitally, but they will perform actual “data work”: generating the digital information needed to train and improve AI services. This can mean simply moving about online and providing feedback, as most people already do. But it will increasingly include more active tasks, such as labelling pictures, driving data-gathering vehicles and perhaps, one day, putting one’s digital twin through its paces. This is the reason why some say AI should actually be called “collective intelligence”: it takes in a lot of human input—something big tech firms hate to admit….(More)”.

How Philanthropy Can Help Lead on Data Justice


Louise Lief at Stanford Social Innovation Review: “Today, data governs almost every aspect of our lives, shaping the opportunities we have, how we perceive reality and understand problems, and even what we believe to be possible. Philanthropy is particularly data driven, relying on it to inform decision-making, define problems, and measure impact. But what happens when data design and collection methods are flawed, lack context, or contain critical omissions and misdirected questions? With bad data, data-driven strategies can misdiagnose problems and worsen inequities with interventions that don’t reflect what is needed.

Data justice begins by asking who controls the narrative. Who decides what data is collected and for which purpose? Who interprets what it means for a community? Who governs it? In recent years, affected communities, social justice philanthropists, and academics have all begun looking deeper into the relationship between data and social justice in our increasingly data-driven world. But philanthropy can play a game-changing role in developing practices of data justice to more accurately reflect the lived experience of communities being studied. Simply incorporating data justice principles into everyday foundation practice—and requiring it of grantees—would be transformative: It would not only revitalize research, strengthen communities, influence policy, and accelerate social change, it would also help address deficiencies in current government data sets.

When Data Is Flawed

Some of the most pioneering work on data justice has been done by Native American communities, who have suffered more than most from problems with bad data. A 2017 analysis of American Indian data challenges—funded by the W.K. Kellogg Foundation and the Morris K. Udall and Stewart L. Udall Foundation—documented how much data on Native American communities is of poor quality, inaccurate, inadequate, inconsistent, irrelevant, and/or inaccessible. The National Congress of American Indians even described American Native communities as “The Asterisk Nation,” because in many government data sets they are represented only by an asterisk denoting sampling errors instead of data points.

Where it concerns Native Americans, data is often not standardized and different government databases identify tribal members at least seven different ways using different criteria; federal and state statistics often misclassify race and ethnicity; and some data collection methods don’t allow tribes to count tribal citizens living off the reservation. For over a decade the Department of the Interior’s Bureau of Indian Affairs has struggled to capture the data it needs for a crucial labor force report it is legally required to produce; methodology errors and reporting problems have been so extensive that at times it prevented the report from even being published. But when the Department of the Interior changed several reporting requirements in 2014 and combined data submitted by tribes with US Census data, it only compounded the problem, making historical comparisons more difficult. Moreover, Native Americans have charged that the Census Bureau significantly undercounts both the American Indian population and key indicators like joblessness….(More)”.

Self-interest and data protection drive the adoption and moral acceptability of big data technologies: A conjoint analysis approach


Paper by Rabia I. Kodapanakkal, Mark J. Brandt, Christoph Kogler, and Ilja van Beest: “Big data technologies have both benefits and costs which can influence their adoption and moral acceptability. Prior studies look at people’s evaluations in isolation without pitting costs and benefits against each other. We address this limitation with a conjoint experiment (N = 979), using six domains (criminal investigations, crime prevention, citizen scores, healthcare, banking, and employment), where we simultaneously test the relative influence of four factors: the status quo, outcome favorability, data sharing, and data protection on decisions to adopt and perceptions of moral acceptability of the technologies.

We present two key findings. (1) People adopt technologies more often when data is protected and when outcomes are favorable. They place equal or more importance on data protection in all domains except healthcare where outcome favorability has the strongest influence. (2) Data protection is the strongest driver of moral acceptability in all domains except healthcare, where the strongest driver is outcome favorability. Additionally, sharing data lowers preference for all technologies, but has a relatively smaller influence. People do not show a status quo bias in the adoption of technologies. When evaluating moral acceptability, people show a status quo bias but this is driven by the citizen scores domain. Differences across domains arise from differences in magnitude of the effects but the effects are in the same direction. Taken together, these results highlight that people are not always primarily driven by self-interest and do place importance on potential privacy violations. They also challenge the assumption that people generally prefer the status quo….(More)”.

Big data in official statistics


Paper by Barteld Braaksma and Kees Zeelenberg: “In this paper, we describe and discuss opportunities for big data in official statistics. Big data come in high volume, high velocity and high variety. Their high volume may lead to better accuracy and more details, their high velocity may lead to more frequent and more timely statistical estimates, and their high variety may give opportunities for statistics in new areas. But there are also many challenges: there are uncontrolled changes in sources that threaten continuity and comparability, and data that refer only indirectly to phenomena of statistical interest.

Furthermore, big data may be highly volatile and selective: the coverage of the population to which they refer may change from day to day, leading to inexplicable jumps in time-series. And very often, the individual observations in these big data sets lack variables that allow them to be linked to other datasets or population frames. This severely limits the possibilities for correction of selectivity and volatility. Also, with the advance of big data and open data, there is much more scope for disclosure of individual data, and this poses new problems for statistical institutes. So, big data may be regarded as so-called nonprobability samples. The use of such sources in official statistics requires other approaches than the traditional one based on surveys and censuses.

A first approach is to accept the big data just for what they are: an imperfect, yet very timely, indicator of developments in society. In a sense, this is what national statistical institutes (NSIs) often do: we collect data that have been assembled by the respondents and the reason why, and even just the fact that they have been assembled is very much the same reason why they are interesting for society and thus for an NSI to collect. In short, we might argue: these data exist and that’s why they are interesting.

A second approach is to use formal models and extract information from these data. In recent years, many new methods for dealing with big data have been developed by mathematical and applied statisticians. New methods like machine-learning techniques can be considered alongside more traditional methods like Bayesian techniques. National statistical institutes have always been reluctant to use models, apart from specific cases like small-area estimates. Based on experience at Statistics Netherlands, we argue that NSIs should not be afraid to use models, provided that their use is documented and made transparent to users. On the other hand, in official statistics, models should not be used for all kinds of purposes….(More)”.
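
The excerpt's point about selectivity, and about correcting it when a linking variable is available, can be made concrete with post-stratification, one common adjustment for nonprobability samples. The paper excerpt does not name this technique, so treat it as an illustrative assumption. The strata, values and population shares below are invented, and the sketch presumes each observation carries a stratum label that matches a known population frame, which the authors note is often missing.

```python
def poststratified_mean(sample, population_shares):
    """Reweight stratum means by known population shares.

    sample: list of (stratum, value) pairs from the big-data source.
    population_shares: dict mapping stratum -> share of the target population.
    """
    sums, counts = {}, {}
    for stratum, value in sample:
        sums[stratum] = sums.get(stratum, 0.0) + value
        counts[stratum] = counts.get(stratum, 0) + 1
    missing = set(population_shares) - set(counts)
    if missing:
        # A stratum with no observations cannot be corrected by reweighting.
        raise ValueError(f"no observations for strata: {sorted(missing)}")
    return sum(
        share * sums[s] / counts[s] for s, share in population_shares.items()
    )


# Invented example: the source over-represents "young" users (8 of 10 records).
sample = [("young", 10.0)] * 8 + [("old", 20.0)] * 2
shares = {"young": 0.5, "old": 0.5}  # assumed known from a population register

naive = sum(v for _, v in sample) / len(sample)
adjusted = poststratified_mean(sample, shares)
```

The naive mean follows the skewed sample, while the adjusted estimate weights each group by its share of the population. The correction is only as good as the assumption that, within a stratum, the people in the source resemble those outside it.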

Hospitals Give Tech Giants Access to Detailed Medical Records


Melanie Evans at the Wall Street Journal: “Hospitals have granted Microsoft Corp., International Business Machines and Amazon.com Inc. the ability to access identifiable patient information under deals to crunch millions of health records, the latest examples of hospitals’ growing influence in the data economy.

The breadth of access wasn’t always spelled out by hospitals and tech giants when the deals were struck.

The scope of data sharing in these and other recently reported agreements reveals a powerful new role that hospitals play—as brokers to technology companies racing into the $3 trillion health-care sector. Rapid digitization of health records and privacy laws enabling companies to swap patient data have positioned hospitals as a primary arbiter of how such sensitive data is shared. 

“Hospitals are massive containers of patient data,” said Lisa Bari, a consultant and former lead for health information technology for the Centers for Medicare and Medicaid Services Innovation Center. 

Hospitals can share patient data as long as they follow federal privacy laws, which contain limited consumer protections, she said. “The data belongs to whoever has it.”…

Digitizing patients’ medical histories, laboratory results and diagnoses has created a booming market in which tech giants are looking to store and crunch data, with potential for groundbreaking discoveries and lucrative products.

There is no indication of wrongdoing in the deals. Officials at the companies and hospitals say they have safeguards to protect patients. Hospitals control data, with privacy training and close tracking of tech employees with access, they said. Health data can’t be combined independently with other data by tech companies….(More)”.