Beyond Randomized Controlled Trials


Iqbal Dhaliwal, John Floretta & Sam Friedlander at SSIR: “…In its post-Nobel phase, one of J-PAL’s priorities is to unleash the treasure troves of big digital data in the hands of governments, nonprofits, and private firms. Primary data collection is by far the most time-, money-, and labor-intensive component of the vast majority of experiments that evaluate social policies. Randomized evaluations have been constrained by simple numbers: Some questions are just too big or expensive to answer. Leveraging administrative data has the potential to dramatically expand the types of questions we can ask and the experiments we can run, as well as implement quicker, less expensive, larger, and more reliable RCTs, an invaluable opportunity to scale up evidence-informed policymaking massively without dramatically increasing evaluation budgets.

Although administrative data hasn’t always been of the highest quality, recent advances have significantly increased the reliability and accuracy of GPS coordinates, biometrics, and digital methods of collection. But despite good intentions, many implementers—governments, businesses, and big NGOs—aren’t currently using the data they already collect on program participants and outcomes to improve anti-poverty programs and policies. This may be because they aren’t aware of its potential, don’t have the in-house technical capacity necessary to create use and privacy guidelines or analyze the data, or don’t have established partnerships with researchers who can collaborate to design innovative programs and run rigorous experiments to determine which are the most impactful. 

At J-PAL, we are leveraging this opportunity through a new global research initiative we are calling the “Innovations in Data and Experiments for Action” Initiative (IDEA). IDEA supports implementers to make their administrative data accessible, analyze it to improve decision-making, and partner with researchers in using this data to design innovative programs, evaluate impact through RCTs, and scale up successful ideas. IDEA will also build the capacity of governments and NGOs to conduct these types of activities with their own data in the future….(More)”.

Car Data Facts


About: “Welcome to CarDataFacts.eu! This website provides a fact-based overview on everything related to the sharing of vehicle-generated data with third parties. Through a series of educational infographics, this website answers the most common questions about access to car data in a clear and simple way.

CarDataFacts.eu also addresses consumer concerns about sharing data in a safe and a secure way, as well as explaining some of the complex and technical terminology surrounding the debate.

CarDataFacts.eu is brought to you by ACEA, the European Automobile Manufacturers’ Association, which represents the 15 Europe-based car, van, truck and bus makers….(More)”.

Invest 5% of research funds in ensuring data are reusable


Barend Mons at Nature: “It is irresponsible to support research but not data stewardship…

Many of the world’s hardest problems can be tackled only with data-intensive, computer-assisted research. And I’d speculate that the vast majority of research data are never published. Huge sums of taxpayer funds go to waste because such data cannot be reused. Policies for data reuse are falling into place, but fixing the situation will require more resources than the scientific community is willing to face.

In 2013, I was part of a group of Dutch experts from many disciplines that called on our national science funder to support data stewardship. Seven years later, policies that I helped to draft are starting to be put into practice. These require data created by machines and humans to meet the FAIR principles (that is, they are findable, accessible, interoperable and reusable). I now direct an international Global Open FAIR office tasked with helping communities to implement the guidelines, and I am convinced that doing so will require a large cadre of professionals, about one for every 20 researchers.

Even when data are shared, the metadata, expertise, technologies and infrastructure necessary for reuse are lacking. Most published data sets are scattered into ‘supplemental files’ that are often impossible for machines or even humans to find. These and other sloppy data practices keep researchers from building on each other’s work. In cases of disease outbreaks, for instance, this might even cost lives….(More)”.

Tesco Grocery 1.0, a large-scale dataset of grocery purchases in London


Paper by Luca Maria Aiello, Daniele Quercia, Rossano Schifanella & Lucia Del Prete: “We present the Tesco Grocery 1.0 dataset: a record of 420 M food items purchased by 1.6 M fidelity card owners who shopped at the 411 Tesco stores in Greater London over the course of the entire year of 2015, aggregated at the level of census areas to preserve anonymity. For each area, we report the number of transactions and nutritional properties of the typical food item bought including the average caloric intake and the composition of nutrients.

The set of global trade international numbers (barcodes) for each food type is also included. To establish data validity we: i) compare food purchase volumes to population from census to assess representativeness, and ii) match nutrient and energy intake to official statistics of food-related illnesses to appraise the extent to which the dataset is ecologically valid. Given its unprecedented scale and geographic granularity, the data can be used to link food purchases to a number of geographically-salient indicators, which enables studies on health outcomes, cultural aspects, and economic factors….(More)”.
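The representativeness check described above (comparing purchase volumes with census population across areas) can be illustrated with a simple correlation. This is only a minimal sketch: the excerpt does not say which statistic the authors used, so Pearson correlation is an assumption here, and the function name `pearson` and the example figures are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long series of per-area values."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-area figures, made up purely for illustration.
census_population = [8200, 10400, 7600, 12100]
items_purchased = [410000, 530000, 350000, 640000]
representativeness = pearson(census_population, items_purchased)
```

A value close to 1 would suggest that purchase volumes track population across areas; a low value would suggest that card holders are unevenly spread relative to residents.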

Monitoring of the Venezuelan exodus through Facebook’s advertising platform


Paper by Palotti et al: “Venezuela is going through the worst economical, political and social crisis in its modern history. Basic products like food or medicine are scarce and hyperinflation is combined with economic depression. This situation is creating an unprecedented refugee and migrant crisis in the region. Governments and international agencies have not been able to consistently leverage reliable information using traditional methods. Therefore, to organize and deploy any kind of humanitarian response, it is crucial to evaluate new methodologies to measure the number and location of Venezuelan refugees and migrants across Latin America.

In this paper, we propose to use Facebook’s advertising platform as an additional data source for monitoring the ongoing crisis. We estimate and validate national and sub-national numbers of refugees and migrants and break-down their socio-economic profiles to further understand the complexity of the phenomenon. Although limitations exist, we believe that the presented methodology can be of value for real-time assessment of refugee and migrant crises world-wide….(More)”.

Experts say privately held data available in the European Union should be used better and more


European Commission: “Data can solve problems from traffic jams to disaster relief, but European countries are not yet using this data to its full potential, experts say in a report released today. More secure and regular data sharing across the EU could help public administrations use private sector data for the public good.

In order to increase Business-to-Government (B2G) data sharing, the experts advise to make data sharing in the EU easier by taking policy, legal and investment measures in three main areas:

  1. Governance of B2G data sharing across the EU: such as putting in place national governance structures, setting up a recognised function (‘data stewards’) in public and private organisations, and exploring the creation of a cross-EU regulatory framework.
  2. Transparency, citizen engagement and ethics: such as making B2G data sharing more citizen-centric, developing ethical guidelines, and investing in training and education.
  3. Operational models, structures and technical tools: such as creating incentives for companies to share data, carrying out studies on the benefits of B2G data sharing, and providing support to develop the technical infrastructure through the Horizon Europe and Digital Europe programmes.

They also revised the principles on private sector data sharing in B2G contexts and included new principles on accountability and on fair and ethical data use, which should guide B2G data sharing for the public interest. Examples of successful B2G data sharing partnerships in the EU include an open forest data system in Finland to help manage the ecosystem, mapping of EU fishing activities using ship tracking data, and genome sequencing data of breast cancer patients to identify new personalised treatments. …

The High-Level Expert Group on Business-to-Government Data Sharing was set up in autumn 2018 and includes members from a broad range of interests and sectors. The recommendations presented today in its final report feed into the European strategy for data and can be used as input for other possible future Commission initiatives on Business-to-Government data sharing….(More)”.

New privacy-protected Facebook data for independent research on social media’s impact on democracy


Chaya Nayak at Facebook: “In 2018, Facebook began an initiative to support independent academic research on social media’s role in elections and democracy. This first-of-its-kind project seeks to provide researchers access to privacy-preserving data sets in order to support research on these important topics.

Today, we are announcing that we have substantially increased the amount of data we’re providing to 60 academic researchers across 17 labs and 30 universities around the world. This release delivers on the commitment we made in July 2018 to share a data set that enables researchers to study information and misinformation on Facebook, while also ensuring that we protect the privacy of our users.

This new data release supplants data we released in the fall of 2019. That 2019 data set consisted of links that had been shared publicly on Facebook by at least 100 unique Facebook users. It included information about share counts, ratings by Facebook’s third-party fact-checkers, and user reporting on spam, hate speech, and false news associated with those links. We have expanded the data set to now include more than 38 million unique links with new aggregated information to help academic researchers analyze how many people saw these links on Facebook and how they interacted with that content – including views, clicks, shares, likes, and other reactions. We’ve also aggregated these shares by age, gender, country, and month. And, we have expanded the time frame covered by the data from January 2017 – February 2019 to January 2017 – August 2019.

With this data, researchers will be able to understand important aspects of how social media shapes our world. They’ll be able to make progress on the research questions they proposed, such as “how to characterize mainstream and non-mainstream online news sources in social media” and “studying polarization, misinformation, and manipulation across multiple platforms and the larger information ecosystem.”

In addition to the data set of URLs, researchers will continue to have access to CrowdTangle and Facebook’s Ad Library API to augment their analyses. Per the original plan for this project, outside of a limited review to ensure that no confidential or user data is inadvertently released, these researchers will be able to publish their findings without approval from Facebook.

We are sharing this data with researchers while continuing to prioritize the privacy of people who use our services. This new data set, like the data we released before it, is protected by a method known as differential privacy. Researchers have access to data tables from which they can learn about aggregated groups, but where they cannot identify any individual user. As Harvard University’s Privacy Tools project puts it:

“The guarantee of a differentially private algorithm is that its behavior hardly changes when a single individual joins or leaves the dataset — anything the algorithm might output on a database containing some individual’s information is almost as likely to have come from a database without that individual’s information. … This gives a formal guarantee that individual-level information about participants in the database is not leaked.” …(More)”
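To make the guarantee quoted above concrete, here is a minimal sketch of one textbook way to release a count with differential privacy, the Laplace mechanism. It is an illustration only: the excerpt does not describe the mechanism or parameters Facebook used, and the function name `private_count` and the `epsilon` values are assumptions.

```python
import random


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return true_count plus Laplace noise calibrated to a privacy budget epsilon.

    Assumes a counting query: adding or removing one individual changes the
    count by at most 1, so the sensitivity is 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    sensitivity = 1.0
    scale = sensitivity / epsilon
    # The difference of two independent exponential draws with mean `scale`
    # follows a zero-mean Laplace distribution with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(42)  # fixed seed so the sketch is reproducible
noisy_views = private_count(1200, epsilon=0.5, rng=rng)
```

A smaller `epsilon` adds more noise and gives stronger privacy. Because the noise has zero mean, aggregates over many released values stay informative while any single individual's presence is masked.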

This emoji could mean your suicide risk is high, according to AI


Rebecca Ruiz at Mashable: “Since its founding in 2013, the free mental health support service Crisis Text Line has focused on using data and technology to better aid those who reach out for help. 

Unlike helplines that offer assistance based on the order in which users dialed, texted, or messaged, Crisis Text Line has an algorithm that determines who is in most urgent need of counseling. The nonprofit is particularly interested in learning which emoji and words texters use when their suicide risk is high, so as to quickly connect them with a counselor. Crisis Text Line just released new insights about those patterns. 

Based on its analysis of 129 million messages processed between 2013 and the end of 2019, the nonprofit found that the pill emoji, or 💊, was 4.4 times more likely to end in a life-threatening situation than the word suicide. 

Other words that indicate imminent danger include 800mg, acetaminophen, excedrin, and antifreeze; those are two to three times more likely than the word suicide to involve an active rescue of the texter. The loudly crying emoji face, or 😭, is similarly high-risk. In general, the words that trigger the greatest alarm suggest the texter has a method or plan to attempt suicide or may be in the process of taking their own life. …(More)”.
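A figure such as "4.4 times more likely" can be read as a ratio of two rates. The sketch below shows one plausible way to compute such a ratio. It is not Crisis Text Line's actual algorithm, which the excerpt does not describe; the data layout, the function names and the toy conversations are all hypothetical.

```python
from typing import List, Set, Tuple

# (tokens seen in the conversation, whether it ended in an active rescue)
Conversation = Tuple[Set[str], bool]


def rescue_rate(conversations: List[Conversation], token: str) -> float:
    """Share of conversations containing `token` that ended in an active rescue."""
    outcomes = [rescued for tokens, rescued in conversations if token in tokens]
    if not outcomes:
        raise ValueError(f"no conversation contains {token!r}")
    return sum(outcomes) / len(outcomes)


def relative_likelihood(
    conversations: List[Conversation], token: str, reference: str = "suicide"
) -> float:
    """How many times more often `token` precedes a rescue than `reference` does."""
    base = rescue_rate(conversations, reference)
    if base == 0:
        raise ValueError("reference rate is zero, so the ratio is undefined")
    return rescue_rate(conversations, token) / base


# Toy data, invented for illustration only.
toy = [
    ({"pill"}, True), ({"pill"}, True), ({"pill"}, False), ({"pill"}, False),
    ({"suicide"}, True), ({"suicide"}, False), ({"suicide"}, False), ({"suicide"}, False),
]
ratio = relative_likelihood(toy, "pill")
```

In this toy set, half of the "pill" conversations and a quarter of the "suicide" conversations end in a rescue, so the ratio is 2.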

Our personal health history is too valuable to be harvested by the tech giants


Eerke Boiten at The Guardian: “…It is clear that the black box society does not only feed on internet surveillance information. Databases collected by public bodies are becoming more and more part of the dark data economy. Last month, it emerged that a data broker in receipt of the UK’s national pupil database had shared its access with gambling companies. This is likely to be the tip of the iceberg; even where initial recipients of shared data might be checked and vetted, it is much harder to oversee who the data is passed on to from there.

Health data, the rich population-wide information held within the NHS, is another such example. Pharmaceutical companies and internet giants have been eyeing the NHS’s extensive databases for commercial exploitation for many years. Google infamously claimed it could save 100,000 lives if only it had free rein with all our health data. If there really is such value hidden in NHS data, do we really want Google to extract it to sell it to us? Google still holds health data that its subsidiary DeepMind Health obtained illegally from the NHS in 2016.

Although many health data-sharing schemes, such as in the NHS’s register of approved data releases, are said to be “anonymised”, this offers a limited guarantee against abuse.

There is just too much information included in health data that points to other aspects of patients’ lives and existence. If recipients of anonymised health data want to use it to re-identify individuals, they will often be able to do so by combining it, for example, with publicly available information. That this would be illegal under UK data protection law is a small consolation as it would be extremely hard to detect.

It is clear that providing access to public organisations’ data for research purposes can serve the greater good and it is unrealistic to expect bodies such as the NHS to keep this all in-house.

However, there are other methods by which to do this, beyond the sharing of anonymised databases. CeLSIUS, for example, a physical facility where researchers can interrogate data under tightly controlled conditions for specific registered purposes, holds UK census information over many years.

These arrangements prevent abuse, such as through deanonymisation, do not have the problem of shared data being passed on to third parties and ensure complete transparency of the use of the data. Online analogues of such set-ups do not yet exist, but that is where the future of safe and transparent access to sensitive data lies….(More)”.

Self-interest and data protection drive the adoption and moral acceptability of big data technologies: A conjoint analysis approach


Paper by Rabia I. Kodapanakkal, Mark J. Brandt, Christoph Kogler, and Ilja van Beest: “Big data technologies have both benefits and costs which can influence their adoption and moral acceptability. Prior studies look at people’s evaluations in isolation without pitting costs and benefits against each other. We address this limitation with a conjoint experiment (N = 979), using six domains (criminal investigations, crime prevention, citizen scores, healthcare, banking, and employment), where we simultaneously test the relative influence of four factors: the status quo, outcome favorability, data sharing, and data protection on decisions to adopt and perceptions of moral acceptability of the technologies.

We present two key findings. (1) People adopt technologies more often when data is protected and when outcomes are favorable. They place equal or more importance on data protection in all domains except healthcare where outcome favorability has the strongest influence. (2) Data protection is the strongest driver of moral acceptability in all domains except healthcare, where the strongest driver is outcome favorability. Additionally, sharing data lowers preference for all technologies, but has a relatively smaller influence. People do not show a status quo bias in the adoption of technologies. When evaluating moral acceptability, people show a status quo bias but this is driven by the citizen scores domain. Differences across domains arise from differences in magnitude of the effects but the effects are in the same direction. Taken together, these results highlight that people are not always primarily driven by self-interest and do place importance on potential privacy violations. They also challenge the assumption that people generally prefer the status quo….(More)”.