Social Research in Times of Big Data: The Challenges of New Data Worlds and the Need for a Sociology of Social Research


Paper by Rainer Diaz-Bone et al: “The phenomenon of big data does not only deeply affect current societies but also poses crucial challenges to social research. This article argues for moving towards a sociology of social research in order to characterize the new qualities of big data and its deficiencies. We draw on the neopragmatist approach of economics of convention (EC) as a conceptual basis for such a sociological perspective.

This framework suggests investigating processes of quantification in their interplay with orders of justifications and logics of evaluation. Methodological issues such as the question of the “quality of big data” must accordingly be discussed in their deep entanglement with epistemic values, institutional forms, and historical contexts and as necessarily implying political issues such as who controls and has access to data infrastructures. On this conceptual basis, the article uses the example of health to discuss the challenges of big data analysis for social research.

Phenomena such as the rise of new and massive privately owned data infrastructures, the economic valuation of huge amounts of connected data, or the “quantified self” movement are presented as indications of a profound transformation compared to established forms of doing social research. Methodological and epistemological, but also institutional and political, strategies are presented to face the risk of being “outperformed” and “replaced” by big data analysis as it is already practiced in big US and Chinese Internet enterprises. In conclusion, we argue that the sketched developments have important implications both for research practices and methods teaching in the era of big data…(More)”.

The European data market


European Commission: “It was the first European Data Market study (SMART 2013/0063) contracted by the European Commission in 2013 that made a first attempt to provide facts and figures on the size and trends of the EU data economy by developing a European data market monitoring tool.

The final report of the updated European Data Market (EDM) study (SMART 2016/0063) now presents in detail the results of the final round of measurement of the updated European Data Market Monitoring Tool contracted for the 2017-2020 period.

Designed along a modular structure, as a first pillar of the study, the European Data Market Monitoring Tool is built around a core set of quantitative indicators to provide a series of assessments of the emerging market of data at present, i.e. for the years 2018 through 2020, and with projections to 2025.

The key areas covered by the indicators measured in this report are:

  • The data professionals and the balance between demand and supply of data skills;
  • The data companies and their revenues;
  • The data user companies and their spending for data technologies;
  • The market of digital products and services (“Data market”);
  • The data economy and its impacts on the European economy;
  • Forecast scenarios of all the indicators, based on alternative market trajectories.

Additionally, as a second major work stream, the study also presents a series of descriptive stories providing a complementary view to the one offered by the Monitoring Tool (for example, “How Big Data is driving AI” or “The Secondary Use of Health Data and Data-driven Innovation in the European Healthcare Industry”), adding fresh, real-life information around the quantitative indicators. By focusing on specific issues and aspects of the data market, the stories offer an initial, indicative “catalogue” of good practices of what is happening in the data economy today in Europe and what is likely to affect the development of the EU data economy in the medium term.

Finally, as a third work stream of the study, a landscaping exercise on the EU data ecosystem was carried out together with some community building activities to bring stakeholders together from all segments of the data value chain. The map containing the results of the landscaping of the EU data economy as well as reports from the webinars organised by the study are available on the www.datalandscape.eu website….(More)”.

Social Distancing and Social Capital: Why U.S. Counties Respond Differently to Covid-19


NBER Paper by Wenzhi Ding et al: “Since social distancing is the primary strategy for slowing the spread of many diseases, understanding why U.S. counties respond differently to COVID-19 is critical for designing effective public policies. Using daily data from about 45 million mobile phones to measure social distancing, we examine how counties responded to both local COVID-19 cases and statewide shelter-in-place orders. We find that social distancing increases more in response to cases and official orders in counties where individuals historically (1) engaged less in community activities and (2) demonstrated greater willingness to incur individual costs to contribute to social objectives. Our work highlights the importance of these two features of social capital, community engagement and individual commitment to societal institutions, in formulating public health policies….(More)”

Are there laws of history?


Amanda Rees at AEON: “…If big data could enable us to turn big history into mathematics rather than narratives, would that make it easier to operationalise our past? Some scientists certainly think so.

In February 2010, Peter Turchin, an ecologist from the University of Connecticut, predicted that 2020 would see a sharp increase in political volatility for Western democracies. Turchin was responding critically to the optimistic speculations of scientific progress in the journal Nature: the United States, he said, was coming to the peak of another instability spike (regularly occurring every 50 years or so), while the world economy was reaching the point of a ‘Kondratiev wave’ dip, that is, a steep downturn in a growth-driven supercycle. Along with a number of ‘seemingly disparate’ social pointers, all indications were that serious problems were looming. In the decade since that prediction, the entrenched, often vicious, social, economic and political divisions that have increasingly characterised North American and European society, have made Turchin’s ‘quantitative historical analysis’ seem remarkably prophetic.

A couple of years earlier, in July 2008, Turchin had made a series of trenchant claims about the nature and future of history. Totting up in excess of ‘200 explanations’ proposed to account for the fall of the Roman empire, he was appalled that historians were unable to agree ‘which explanations are plausible and which should be rejected’. The situation, he maintained, was ‘as risible as if, in physics, phlogiston theory and thermodynamics coexisted on equal terms’. Why, Turchin wanted to know, were the efforts in medicine and environmental science to produce healthy bodies and ecologies not mirrored by interventions to create stable societies? Surely it was time ‘for history to become an analytical, and even a predictive, science’. Knowing that historians were themselves unlikely to adopt such analytical approaches to the past, he proposed a new discipline: ‘theoretical historical social science’ or ‘cliodynamics’ – the science of history.

Like C P Snow 60 years before him, Turchin wanted to challenge the boundary between the sciences and humanities – even as periodic attempts to apply the theories of natural science to human behaviour (sociobiology, for example) or to subject natural sciences to the methodological scrutiny of the social sciences (science wars, anyone?) have frequently resulted in hostile turf wars. So what are the prospects for Turchin’s efforts to create a more desirable future society by developing a science of history?…

In 2010, Cliodynamics, the flagship journal for this new discipline, appeared, with its very first article (by the American sociologist Randall Collins) focusing on modelling victory and defeat in battle in relation to material resources and organisational morale. In a move that paralleled Comte’s earlier argument regarding the successive stages of scientific complexity (from physics, through chemistry and biology, to sociology), Turchin passionately rejected the idea that complexity made human societies unsuitable for quantitative analysis, arguing that it was precisely that complexity which made mathematics essential. Weather predictions were once considered unreliable because of the sheer complexity of managing the necessary data. But improvements in technology (satellites, computers) mean that it’s now possible to describe mathematically, and therefore to model, interactions between the system’s various parts – and therefore to know when it’s wise to carry an umbrella. With equal force, Turchin insisted that the cliodynamic approach was not deterministic. It would not predict the future, but instead lay out for governments and political leaders the likely consequences of competing policy choices.

Crucially, and again on the back of the abundantly available and cheap computer power, cliodynamics benefited from the surge in interest in the digital humanities. Existing archives were being digitised, uploaded and made searchable: every day, it seemed, more data were being presented in a format that encouraged quantification and enabled mathematical analysis – including the Old Bailey’s online database, of which Wolf had fallen foul. At the same time, cliodynamicists were repositioning themselves. Four years after its initial launch, the subtitle of their flagship journal was renamed, from The Journal of Theoretical and Mathematical History to The Journal of Quantitative History and Cultural Evolution. As Turchin’s editorial stated, this move was intended to position cliodynamics within a broader evolutionary analysis; paraphrasing the Russian-American geneticist Theodosius Dobzhansky, he claimed that ‘nothing in human history makes sense except in the light of cultural evolution’. Given Turchin’s ecological background, this evolutionary approach to history is unsurprising. But given the historical outcomes of making politics biological, it is potentially worrying….

Mathematical, data-driven, quantitative models of human experience that aim at detachment, objectivity and the capacity to develop and test hypotheses need to be balanced by explicitly fictional, qualitative and imaginary efforts to create and project a lived future that enable their audiences to empathically ground themselves in the hopes and fears of what might be to come. Both, after all, are unequivocally doing the same thing: using history and historical experience to anticipate the global future so that we might – should we so wish – avoid civilisation’s collapse. That said, the question of who ‘we’ are does, always, remain open….(More)”.

‘For good measure’: data gaps in a big data world


Paper by Sarah Giest & Annemarie Samuels: “Policy and data scientists have paid ample attention to the amount of data being collected and the challenge for policymakers to make use of it. However, far less attention has been paid to the quality and coverage of this data, specifically as it pertains to minority groups. The paper makes the argument that while there is seemingly more data to draw on for policymakers, the quality of the data in combination with potential known or unknown data gaps limits government’s ability to create inclusive policies. In this context, the paper defines primary, secondary, and unknown data gaps that cover scenarios of knowingly or unknowingly missing data and how that is potentially compensated through alternative measures.

Based on the review of the literature from various fields and a variety of examples highlighted throughout the paper, we conclude that the big data movement combined with more sophisticated methods in recent years has opened up new opportunities for government to use existing data in different ways as well as fill data gaps through innovative techniques. Focusing specifically on the representativeness of such data, however, shows that data gaps affect the economic opportunities, social mobility, and democratic participation of marginalized groups. The big data movement in policy may thus create new forms of inequality that are harder to detect and whose impact is more difficult to predict….(More)”.

How data privacy leader Apple found itself in a data ethics catastrophe


Article by Daniel Wu and Mike Loukides: “…Apple learned a critical lesson from this experience. User buy-in cannot end with compliance with rules. It requires ethics, constantly asking how to protect, fight for, and empower users, regardless of what the law says. These strategies contribute to perceptions of trust.

Trust has to be earned, is easily lost, and is difficult to regain….

In our more global, diverse, and rapidly changing world, ethics may be embodied by the “platinum rule”: Do unto others as they would want done to them. One established field of ethics, bioethics, offers four principles that are related to the platinum rule: nonmaleficence, justice, autonomy, and beneficence.

For organizations that want to be guided by ethics, regardless of what the law says, these principles are essential tools for a purpose-driven mission: protecting (nonmaleficence), fighting for (justice), and empowering users and employees (autonomy and beneficence).

An ethics leader protects users and workers in its operations by using governance best practices. 

Before creating the product, it understands both the qualitative and quantitative contexts of key stakeholders, especially those who will be most impacted, identifying their needs and fears. When creating the product, it uses data protection by design, working with cross-functional roles like legal and privacy engineers to embed ethical principles into the lifecycle of the product and formalize data-sharing agreements. Before launching, it audits the product thoroughly and conducts scenario planning to understand potential ethical mishaps, such as perceived or real gender bias or human rights violations in its supply chain. After launching, its terms of service and collection methods are highly readable and enable even disaffected users to resolve issues delightfully.

Ethics leaders also fight for users and workers, who can be forgotten. These leaders may champion enforceable consumer protections in the first place, before a crisis erupts. With social movements, leaders fight powerful actors preying on vulnerable communities or the public at large, and critically examine and ameliorate their own participation in systemic violence. As a result, instead of making last-minute heroic efforts to change compromised operations, they have been iterating all along.

Finally, ethics leaders empower their users and workers. With diverse communities and employees, they co-create new products that help improve basic needs and enable more, including the vulnerable, to increase their autonomy and their economic mobility. These entrepreneurial efforts validate new revenue streams and relationships while incubating next-generation workers who self-govern and push the company’s mission forward. Employees voice their values and diversify their relationships. Alison Taylor, the Executive Director of Ethical Systems, argues that internal processes should “improve [workers’] reasoning and creativity, instead of short-circuiting them.” Enabling this is a culture of psychological safety and training to engage kindly with divergent ideas.

These purpose-led strategies boost employee performance and retention, drive deep customer loyalty, and carve legacies.

To be clear, Apple may be implementing at least some of these strategies already—but perhaps not uniformly or transparently. For instance, Apple has implemented some provisions of the European Union’s General Data Protection Regulation for all US residents—not just EU and CA residents—including the ability to access and edit data. This expensive move, which goes beyond strict legal requirements, was implemented even without public pressure.

But ethics strategies have major limitations leaders must address

As demonstrated by the waves of ethical “principles” released by Fortune 500 companies and commissions, ethics programs can be murky, dominated by a white, male, and Western interpretation.

Furthermore, focusing purely on ethics gives companies an easy way to “free ride” off social goodwill, but ultimately stay unaccountable, given the lack of external oversight over ethics programs. When companies substitute unaccountable data ethics principles for thoughtful engagement with the enforceable data regulation principles, users will be harmed.

Long-term, without the ability to wave a $100 million fine with clear-cut requirements and lawyers trained to advocate for them internally, ethics leaders may face barriers to buy-in. Unlike their sales, marketing, or compliance counterparts, ethics programs do not directly add revenue or reduce costs. In recessions, these “soft” programs may be the first on the chopping block.

As a result of these factors, we will likely see a surge in ethics-washing: well-intentioned companies that talk ethics, but don’t walk it. More will view these efforts as PR-driven ethics stunts, which don’t deeply engage with actual ethical issues. If harmful business models do not change, ethics leaders will be fighting a losing battle….(More)”.

Synthetic data offers advanced privacy for the Census Bureau, business


Kate Kaye at IAPP: “In the early 2000s, internet accessibility made risks of exposing individuals from population demographic data more likely than ever. So, the U.S. Census Bureau turned to an emerging privacy approach: synthetic data.

Some argue the algorithmic techniques used to develop privacy-secure synthetic datasets go beyond traditional deidentification methods. Today, along with the Census Bureau, clinical researchers, autonomous vehicle system developers and banks use these fake datasets that mimic statistically valid data.

In many cases, synthetic data is built from existing data by filtering it through machine learning models. Real data representing real individuals flows in, and fake data mimicking individuals with corresponding characteristics flows out.

When data scientists at the Census Bureau began exploring synthetic data methods, adoption of the internet had made deidentified, open-source data on U.S. residents, their households and businesses more accessible than in the past.

Especially concerning, census-block-level information was now widely available. Because in rural areas, a census block could represent data associated with as few as one house, simply stripping names, addresses and phone numbers from that information might not be enough to prevent exposure of individuals.

“There was pretty widespread angst” among statisticians, said John Abowd, the bureau’s associate director for research and methodology and chief scientist. The hand-wringing led to a “gradual awakening” that prompted the agency to begin developing synthetic data methods, he said.

Synthetic data built from the real data preserves privacy while providing information that is still relevant for research purposes, Abowd said: “The basic idea is to try to get a model that accurately produces an image of the confidential data.”

The plan for the 2020 census is to produce a synthetic image of that original data. The bureau also produces On the Map, a web-based mapping and reporting application that provides synthetic data showing where workers are employed and where they live along with reports on age, earnings, industry distributions, race, ethnicity, educational attainment and sex.

Of course, the real census data is still locked away, too, Abowd said: “We have a copy and the national archives have a copy of the confidential microdata.”…(More)”.
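The "real data in, fake data out" idea described in this piece can be illustrated with a minimal Python sketch. This is not the Census Bureau's method: it assumes a deliberately naive model (an independent normal distribution per column), and the records, column names, and function names are all hypothetical. Real synthetic-data pipelines use far richer models and formal privacy checks.

```python
import random
import statistics


def fit_model(records):
    """Fit a naive model: mean and standard deviation for each numeric column."""
    columns = list(records[0].keys())
    model = {}
    for col in columns:
        values = [r[col] for r in records]
        model[col] = (statistics.mean(values), statistics.stdev(values))
    return model


def sample_synthetic(model, n, seed=0):
    """Draw n fake records from the fitted per-column distributions."""
    rng = random.Random(seed)
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in model.items()}
        for _ in range(n)
    ]


# Hypothetical confidential records (made-up values, not census data).
real = [
    {"age": 34, "earnings": 41000},
    {"age": 51, "earnings": 58000},
    {"age": 27, "earnings": 32000},
    {"age": 45, "earnings": 67000},
    {"age": 62, "earnings": 49000},
    {"age": 38, "earnings": 53000},
]

model = fit_model(real)
synthetic = sample_synthetic(model, n=1000)
```

Because each column is sampled independently, this sketch keeps per-column averages but discards relationships between columns (for example, between age and earnings), which is one reason production systems need more sophisticated models.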

Scraping the Web for Public Health Gains: Ethical Considerations from a ‘Big Data’ Research Project on HIV and Incarceration


Stuart Rennie, Mara Buchbinder, Eric Juengst, Lauren Brinkley-Rubinstein, and David L Rosen at Public Health Ethics: “Web scraping involves using computer programs for automated extraction and organization of data from the Web for the purpose of further data analysis and use. It is frequently used by commercial companies, but also has become a valuable tool in epidemiological research and public health planning. In this paper, we explore ethical issues in a project that “scrapes” public websites of U.S. county jails as part of an effort to develop a comprehensive database (including individual-level jail incarcerations, court records and confidential HIV records) to enhance HIV surveillance and improve continuity of care for incarcerated populations. We argue that the well-known framework of Emanuel et al. (2000) provides only partial ethical guidance for the activities we describe, which lie at a complex intersection of public health research and public health practice. We suggest some ethical considerations from the ethics of public health practice to help fill gaps in this relatively unexplored area….(More)”.
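As a rough illustration of the "automated extraction and organization of data" that the abstract describes, the following Python sketch parses a table out of an HTML page using only the standard library. It is not the authors' pipeline: the HTML snippet, column names, and booking identifiers are invented placeholders, and no network request is made.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self._in_cell = False
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


# Hypothetical page content standing in for a public roster page.
SAMPLE_HTML = """
<table>
  <tr><th>Booking ID</th><th>Booking date</th></tr>
  <tr><td>A-001</td><td>2020-01-15</td></tr>
  <tr><td>A-002</td><td>2020-01-17</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
print(parser.rows)
```

The technical step is short; the ethical questions the paper raises concern what happens afterwards, when such extracted rows are linked to other individual-level records.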

How Taiwan Used Big Data, Transparency and a Central Command to Protect Its People from Coronavirus


Article by Beth Duff-Brown: “…So what steps did Taiwan take to protect its people? And could those steps be replicated here at home?

Stanford Health Policy’s Jason Wang, MD, PhD, an associate professor of pediatrics at Stanford Medicine whose PhD is in policy analysis, credits his native Taiwan with using new technology and a robust pandemic prevention plan put into place after the 2003 SARS outbreak.

“The Taiwan government established the National Health Command Center (NHCC) after SARS and it’s become part of a disaster management center that focuses on large-outbreak responses and acts as the operational command point for direct communications,” said Wang, a pediatrician and the director of the Center for Policy, Outcomes, and Prevention at Stanford. The NHCC also established the Central Epidemic Command Center, which was activated in early January.

“And Taiwan rapidly produced and implemented a list of at least 124 action items in the past five weeks to protect public health,” Wang said. “The policies and actions go beyond border control because they recognized that that wasn’t enough.”

Wang outlines the measures Taiwan took in the last six weeks in an article published Tuesday in the Journal of the American Medical Association.

“Given the continual spread of COVID-19 around the world, understanding the action items that were implemented quickly in Taiwan, and the effectiveness of these actions in preventing a large-scale epidemic, may be instructive for other countries,” Wang and his co-authors wrote.

Within the last five weeks, Wang said, the Taiwan epidemic command center rapidly implemented those 124 action items, including border control from the air and sea, case identification using new data and technology, quarantine of suspicious cases, educating the public while fighting misinformation, negotiating with other countries — and formulating policies for schools and businesses to follow.

Big Data Analytics

The authors note that Taiwan integrated its national health insurance database with its immigration and customs database to begin the creation of big data for analytics. That enabled case identification by generating real-time alerts during a clinical visit based on travel history and clinical symptoms.

Taipei also used Quick Response (QR) code scanning and online reporting of travel history and health symptoms to classify travelers’ infectious risks based on flight origin and travel history in the last 14 days. People who had not traveled to high-risk areas were sent a health declaration border pass via SMS for faster immigration clearance; those who had traveled to high-risk areas were quarantined at home and tracked through their mobile phones to ensure that they stayed home during the incubation period.

The country also instituted a toll-free hotline for citizens to report suspicious symptoms in themselves or others. As the disease progressed, the government called on major cities to establish their own hotlines so that the main hotline would not become jammed….(More)”.
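The traveler triage rule described under "Big Data Analytics" (classify by travel history in the last 14 days, then either send an SMS border pass or require home quarantine) can be sketched in Python. This is only an illustration of the logic as summarized in the article, not Taiwan's actual system: the region names, the function name, and the returned labels are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical list of high-risk areas; the real list is not given in the article.
HIGH_RISK_AREAS = {"Region A", "Region B"}
LOOKBACK = timedelta(days=14)


def classify_traveler(trips, today):
    """Classify a traveler from a list of (area, date_left_area) tuples.

    Returns "home_quarantine" if any high-risk area was visited within
    the last 14 days, otherwise "sms_border_pass".
    """
    for area, left_on in trips:
        elapsed = today - left_on
        if area in HIGH_RISK_AREAS and timedelta(0) <= elapsed <= LOOKBACK:
            return "home_quarantine"
    return "sms_border_pass"


today = date(2020, 3, 1)
recent_high_risk = classify_traveler([("Region A", date(2020, 2, 20))], today)
old_high_risk = classify_traveler([("Region A", date(2020, 1, 1))], today)
low_risk = classify_traveler([("Region C", date(2020, 2, 28))], today)
```

In this sketch a visit to a high-risk area more than 14 days ago does not trigger quarantine, which mirrors the 14-day window mentioned in the article.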

Facebook Ads as a Demographic Tool to Measure the Urban-Rural Divide


Paper by Daniele Rama, Yelena Mejova, Michele Tizzoni, Kyriaki Kalimeri, and Ingmar Weber: “In the global move toward urbanization, making sure the people remaining in rural areas are not left behind in terms of development and policy considerations is a priority for governments worldwide. However, it is increasingly challenging to track important statistics concerning this sparse, geographically dispersed population, resulting in a lack of reliable, up-to-date data. In this study, we examine the usefulness of the Facebook Advertising platform, which offers a digital “census” of over two billion of its users, in measuring potential rural-urban inequalities.

We focus on Italy, a country where about 30% of the population lives in rural areas. First, we show that the population statistics that Facebook produces suffer from instability across time and incomplete coverage of sparsely populated municipalities. To overcome this limitation, we propose an alternative methodology for estimating Facebook Ads audiences that nearly triples the coverage of the rural municipalities from 19% to 55% and makes feasible fine-grained sub-population analysis. Using official national census data, we evaluate our approach and confirm known significant urban-rural divides in terms of educational attainment and income. Extending the analysis to Facebook-specific user “interests” and behaviors, we provide further insights on the divide, for instance, finding that rural areas show a higher interest in gambling. Notably, we find that the most predictive features of income in rural areas differ from those for urban centres, suggesting researchers need to consider a broader range of attributes when examining rural wellbeing. The findings of this study illustrate the necessity of improving existing tools and methodologies to include under-represented populations in digital demographic studies; the failure to do so could result in misleading observations, conclusions, and most importantly, policies….(More)”.
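To make the "coverage" figure in this abstract concrete, here is a small Python sketch of how such a share could be computed, assuming coverage means the fraction of municipalities for which a usable audience estimate exists. The municipality identifiers and values are hypothetical; only the 19% and 55% figures come from the abstract.

```python
def coverage(estimates):
    """Share of municipalities with a usable (non-missing) audience estimate."""
    usable = sum(1 for value in estimates.values() if value is not None)
    return usable / len(estimates)


# Hypothetical estimates; None marks a municipality with no usable estimate.
baseline = {"m1": 1200, "m2": None, "m3": None, "m4": None, "m5": None}
improved = {"m1": 1200, "m2": 300, "m3": 450, "m4": None, "m5": None}

toy_ratio = coverage(improved) / coverage(baseline)

# Figures reported in the abstract: coverage rises from 19% to 55%.
reported_ratio = 0.55 / 0.19
print(round(reported_ratio, 2))
```

The reported ratio is a little under 3, which is consistent with the abstract's wording that coverage "nearly triples".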