Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing


Book by Ron Kohavi, Diane Tang, and Ya Xu: “Getting numbers is easy; getting numbers you can trust is hard. This practical guide by experimentation leaders at Google, LinkedIn, and Microsoft will teach you how to accelerate innovation using trustworthy online controlled experiments, or A/B tests. Based on practical experiences at companies that each run more than 20,000 controlled experiments a year, the authors share examples, pitfalls, and advice for students and industry professionals getting started with experiments, plus deeper dives into advanced topics for practitioners who want to improve the way they make data-driven decisions.

Learn how to use the scientific method to evaluate hypotheses using controlled experiments. Define key metrics and ideally an Overall Evaluation Criterion. Test for trustworthiness of the results and alert experimenters to violated assumptions. Build a scalable platform that lowers the marginal cost of experiments close to zero. Avoid pitfalls like carryover effects and Twyman’s law. Understand how statistical issues play out in practice….(More)”.
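
The excerpt above mentions testing results for trustworthiness without naming a specific check. One widely used check of that kind is a sample ratio mismatch (SRM) test, which asks whether the observed split of users between control and treatment is consistent with the split the experiment was designed to produce. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 50/50 design split, and the 0.001 alert threshold are all hypothetical choices made for illustration.

```python
import math


def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the observed split.

    expected_control_share is the designed share of users in control
    (assumed 0.5 here; a real experiment may use another split).
    """
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_users - expected_control) ** 2 / expected_control
        + (treatment_users - expected_treatment) ** 2 / expected_treatment
    )
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_sample_ratio_mismatch(control_users, treatment_users,
                              expected_control_share=0.5, alpha=0.001):
    """Flag the experiment when the split is very unlikely under the design.

    alpha is an assumed alert threshold, not a value taken from the book.
    """
    return srm_p_value(control_users, treatment_users, expected_control_share) < alpha


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(has_sample_ratio_mismatch(5000, 5050))
    print(has_sample_ratio_mismatch(50000, 48000))
```

A flagged mismatch does not say what went wrong; it only signals that the assignment or logging may be broken, so the experiment's metric results should not be trusted until the cause is found.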

A Closer Look at Location Data: Privacy and Pandemics


Assessment by Stacey Gray: “In light of COVID-19, there is heightened global interest in harnessing location data held by major tech companies to track individuals affected by the virus, better understand the effectiveness of social distancing, or send alerts to individuals who might be affected based on their previous proximity to known cases. Governments around the world are considering whether and how to use mobile location data to help contain the virus: Israel’s government passed emergency regulations to address the crisis using cell phone location data; the European Commission requested that mobile carriers provide anonymized and aggregate mobile location data; and South Korea has created a publicly available map of location data from individuals who have tested positive. 

Public health agencies and epidemiologists have long been interested in analyzing device location data to track diseases. In general, the movement of devices effectively mirrors movement of people (with some exceptions discussed below). However, its use comes with a range of ethical and privacy concerns. 

In order to help policymakers address these concerns, we provide below a brief explainer guide of the basics: (1) what is location data, (2) who holds it, and (3) how is it collected? Finally we discuss some preliminary ethical and privacy considerations for processing location data. Researchers and agencies should consider: how and in what context location data was collected; the fact and reasoning behind location data being classified as legally “sensitive” in most jurisdictions; challenges to effective “anonymization”; representativeness of the location dataset (taking into account potential bias and lack of inclusion of low-income and elderly subpopulations who do not own phones); and the unique importance of purpose limitation, or not re-using location data for other civil or law enforcement purposes after the pandemic is over….(More)”.

Human migration: the big data perspective


Alina Sîrbu et al. at the International Journal of Data Science and Analytics: “How can big data help to understand the migration phenomenon? In this paper, we try to answer this question through an analysis of various phases of migration, comparing traditional and novel data sources and models at each phase. We concentrate on three phases of migration, at each phase describing the state of the art and recent developments and ideas. The first phase includes the journey, and we study migration flows and stocks, providing examples where big data can have an impact. The second phase discusses the stay, i.e. migrant integration in the destination country. We explore various data sets and models that can be used to quantify and understand migrant integration, with the final aim of providing the basis for the construction of a novel multi-level integration index. The last phase is related to the effects of migration on the source countries and the return of migrants….(More)”.

The Law and Economics of Online Republication


Paper by Ronen Perry: “Jerry publishes unlawful content about Newman on Facebook, Elaine shares Jerry’s post, the share automatically turns into a tweet because her Facebook and Twitter accounts are linked, and George immediately retweets it. Should Elaine and George be liable for these republications? The question is neither theoretical nor idiosyncratic. On occasion, it reaches the headlines, as when Jennifer Lawrence’s representatives announced she would sue every person involved in the dissemination, through various online platforms, of her illegally obtained nude pictures. Yet this is only the tip of the iceberg. Numerous potentially offensive items are reposted daily, their exposure expands in widening circles, and they sometimes “go viral.”

This Article is the first to provide a law and economics analysis of the question of liability for online republication. Its main thesis is that liability for republication generates a specter of multiple defendants which might dilute the originator’s liability and undermine its deterrent effect. The Article concludes that, subject to several exceptions and methodological caveats, only the originator should be liable. This seems to be the American rule, as enunciated in Batzel v. Smith and Barrett v. Rosenthal. It stands in stark contrast to the prevalent rules in other Western jurisdictions and has been challenged by scholars on various grounds since its very inception.

The Article unfolds in three Parts. Part I presents the legal framework. It first discusses the rules applicable to republication of self-created content, focusing on the emergence of the single publication rule and its natural extension to online republication. It then turns to republication of third-party content. American law makes a clear-cut distinction between offline republication which gives rise to a new cause of action against the republisher (subject to a few limited exceptions), and online republication which enjoys an almost absolute immunity under § 230 of the Communications Decency Act. Other Western jurisdictions employ more generous republisher liability regimes, which usually require endorsement, a knowing expansion of exposure or repetition.

Part II offers an economic justification for the American model. Law and economics literature has shown that attributing liability for constant indivisible harm to multiple injurers, where each could have single-handedly prevented that harm (“alternative care” settings), leads to dilution of liability. Online republication scenarios often involve multiple tortfeasors. However, they differ from previously analyzed phenomena because they are not alternative care situations, and because the harm—increased by the conduct of each tortfeasor—is not constant and indivisible. Part II argues that neither feature precludes the dilution argument. It explains that the impact of the multiplicity of injurers in the online republication context on liability and deterrence provides a general justification for the American rule. This rule’s relatively low administrative costs afford additional support.

Part III considers the possible limits of the theoretical argument. It maintains that exceptions to the exclusive originator liability rule should be recognized when the originator is unidentifiable or judgment-proof, and when either the republisher’s identity or the republication’s audience was unforeseeable. It also explains that the rule does not preclude liability for positive endorsement with a substantial addition, which constitutes a new original publication, or for the dissemination of illegally obtained content, which is an independent wrong. Lastly, Part III addresses possible challenges to the main argument’s underlying assumptions, namely that liability dilution is a real risk and that it is undesirable….(More)”.

A controlled trial for reproducibility


Marc P. Raphael, Paul E. Sheehan & Gary J. Vora at Nature: “In 2016, the US Defense Advanced Research Projects Agency (DARPA) told eight research groups that their proposals had made it through the review gauntlet and would soon get a few million dollars from its Biological Technologies Office (BTO). Along with congratulations, the teams received a reminder that their award came with an unusual requirement — an independent shadow team of scientists tasked with reproducing their results.

Thus began an intense, multi-year controlled trial in reproducibility. Each shadow team consists of three to five researchers, who visit the ‘performer’ team’s laboratory and often host visits themselves. Between 3% and 8% of the programme’s total funds go to this independent validation and verification (IV&V) work. But DARPA has the flexibility and resources for such herculean efforts to assess essential techniques. In one unusual instance, an IV&V laboratory needed a sophisticated US$200,000 microscopy and microfluidic set-up to make an accurate assessment.

These costs are high, but we think they are an essential investment to avoid wasting taxpayers’ money and to advance fundamental research towards beneficial applications. Here, we outline what we’ve learnt from implementing this programme, and how it could be applied more broadly….(More)”.

COVID-19 is creating a democratic deficit – here’s how to reduce it


Article by Matt Ryan: “As parliaments around the country move to scale down operations and defer sittings as part of containing COVID-19, people are beginning to ring the accountability alarm bells….

The good news is that we can learn from those parliaments and politicians around the world who have already been trialling new ways of working that go beyond traditional sittings. Leveraging simple and widely available technologies, they are involving more people with more diverse backgrounds in their processes with less reliance on those people being physically present.

Select Committees in the UK Parliament, for example, have used online “evidence checks” to scrutinise the basis for policy. These one-month exercises use targeted outreach and social media strategies to invite comments from knowledgeable stakeholders and members of the public about the rigour of evidence on which a government department’s policy is based. Evidence for departmental policy is summarised in a two-page document and comments publicly displayed in a web forum that resembles a readers’ comments section in an online news article.

In Taiwan, a participatory governance process pioneered by civic rights activists at the behest of a government minister combines large-scale online participation with smaller in-person gatherings to build a “rough consensus” on legislative proposals related to the digital economy before they are introduced. Known as vTaiwan, the process has led to 26 pieces of national legislation dealing with issues such as Uber, telemedicine and online alcohol sales, and has involved 200,000 people.

The government of Mexico City has raised the stakes even higher, involving more than 400,000 people in a process to draft a new constitution. It included a novel partnership between Change.org and the city mayor that enabled residents to create petition-backed proposals which, once they reached a certain threshold of support, bound the mayor to include them in the draft he submitted to a special constitutional assembly.

Processes like these can also offer relief for politicians and parliamentary officials managing the strain of examining an ever-increasing number of issues of greater complexity with limited personnel and budget. Evidence checks provide access to a wider pool of experts who can bolster existing research capacity. vTaiwan helps to find workable ways forward in industries being rapidly transformed by digital technologies. By “crowdsourcing” the city’s constitution, Mexico City’s mayor retained the trust of residents while undertaking reform at a grand scale….(More)”.

Ask a Scientist


NYU Press Release: “Unreliable tips on how to protect oneself from the novel coronavirus and fake news about the COVID-19 pandemic are spreading as quickly as the virus itself.

The Governance Lab (The GovLab) at the New York University Tandon School of Engineering has collaborated with the Federation of American Scientists (FAS) and the State of New Jersey Office of Innovation to launch a free, interactive tool aimed at cutting through the noise and presenting clear, scientist-led, and evidence-based information and advice to the public.

Available in English and Spanish, “Ask a Scientist,” allows users to find answers to a wide range of commonly asked questions about the virus, the severity of the outbreak, best methods of prevention, and steps to take in the event you fall ill. All posted content is obtained from the World Health Organization, the Centers for Disease Control and Prevention, and other rigorously verified sources.

“Ask a Scientist” features a free, interactive tool allowing users to submit questions to a team of FAS researchers and a crowdsourced network of vetted science experts. In English and Spanish, the site also includes top articles and the latest information, and answers to a wide range of commonly asked questions about the COVID-19 epidemic, the severity of the outbreak, best methods of prevention, and steps to take in the event you fall ill.

If users do not find an answer to their specific questions, they have the option of submitting them to a team of FAS researchers and a crowdsourced network of vetted science experts led by the National Science Policy Network. Users can expect an answer within an hour, although that timeframe is expected to shorten as the network increases in size. Every answer is reviewed to ensure accuracy and timeliness, then added to the knowledge base for the benefit of others….(More)”.

Why we need responsible data for children


Andrew Young and Stefaan Verhulst at The Conversation: “…Without question, the increased use of data poses unique risks for and responsibilities to children. While practitioners may have well-intended purposes to leverage data for and about children, the data systems used are often designed with (consenting) adults in mind without a focus on the unique needs and vulnerabilities of children. This can lead to the collection of inaccurate and unreliable data as well as the inappropriate and potentially harmful use of data for and about children….

Research undertaken in the context of the RD4C initiative uncovered the following trends and realities. These issues make clear why we need a dedicated data responsibility approach for children.

  • Today’s children are the first generation growing up at a time of rapid datafication where almost all aspects of their lives, both on and off-line, are turned into data points. An entire generation of young people is being datafied – often starting even before birth. Every year the average child will have more data collected about them in their lifetime than would a similar child born any year prior. The potential uses of such large volumes of data and the impact on children’s lives are unpredictable, and the data could potentially be used against them.
  • Children typically do not have full agency to make decisions about their participation in programs or services which may generate and record personal data. Children may also lack the understanding to assess a decision’s purported risks and benefits. Privacy terms and conditions are often barely understood by educated adults, let alone children. As a result, there is a higher duty of care for children’s data.
  • Disaggregating data according to socio-demographic characteristics can improve service delivery and assist with policy development. However, it also creates risks for group privacy. Children can be identified, exposing them to possible harms. Disaggregated data for groups such as child-headed households and children experiencing gender-based violence can put vulnerable communities and children at risk. Data about children’s location itself can be risky, especially if they have some additional vulnerability that could expose them to harm.
  • Mishandling data can cause children to lose trust in institutions that deliver essential services including vaccines, medicine, and nutrition supplies. For organizations dealing with child well-being, these retreats can have severe consequences. Distrust can cause families and children to refuse health, education, child protection and other public services. Such privacy protective behavior can impact children throughout the course of their lifetime, and potentially exacerbate existing inequities and vulnerabilities.
  • As volumes of collected and stored data increase, obligations and protections traditionally put in place for children may be difficult or impossible to uphold. The interests of children are not always prioritized when organizations define their legitimate interest to access or share personal information of children. The immediate benefit of a service provided does not always justify the risk or harm that might be caused by it in the future. Data analysis may be undertaken by people who do not have expertise in the area of child rights, as opposed to traditional research where practitioners are specifically educated in child subject research. Similarly, service providers collecting children’s data are not always specially trained to handle it, as international standards recommend.
  • Recent events around the world reveal the promise and pitfalls of algorithmic decision-making. While it can expedite certain processes, algorithms and their inferences can possess biases that can have adverse effects on people, for example those seeking medical care and attempting to secure jobs. The danger posed by algorithmic bias is especially pronounced for children and other vulnerable populations. These groups often lack the awareness or resources necessary to respond to instances of bias or to rectify any misconceptions or inaccuracies in their data.
  • Many of the children served by child welfare organizations have suffered trauma. Whether that trauma is physical, social, or emotional in nature, repeatedly making children register for services or provide confidential personal information can amount to revictimization – re-exposing them to traumas or instigating unwarranted feelings of shame and guilt.

These trends and realities make clear the need for new approaches for maximizing the value of data to improve children’s lives, while mitigating the risks posed by our increasingly datafied society….(More)”.

Data Collaboratives in Response to COVID19


Living Repository: “This document is part of a call for action to build a responsible infrastructure for data-driven pandemic response. 

It serves as a living repository for data collaboratives seeking to address the spread of COVID-19 and its secondary effects. 

> You can find ongoing data collaborative projects here.

> Requests for data and expertise that might lead to data collaboratives can be found here.

> Data competitions, challenges, and calls for proposals, which can lead to useful tools to combat COVID-19, can be found here.

The repository aims to include projects that show a commitment to privacy protection, data responsibility, and overall user well-being. 

It will be updated regularly as we receive projects and proposals or otherwise become aware of them. 

HELP US MAKE THIS REPOSITORY BETTER:  Individuals are encouraged to edit the repo and/or suggest additions to this document if a project is not currently listed.

See full Living Repository here.

Location Surveillance to Counter COVID-19: Efficacy Is What Matters


Susan Landau at Lawfare: “…Some government officials believe that the location information that phones can provide will be useful in the current crisis. After all, if cellphone location information can be used to track terrorists and discover who robbed a bank, perhaps it can be used to determine whether you rubbed shoulders yesterday with someone who today was diagnosed as having COVID-19, the respiratory disease that the novel coronavirus causes. But such thinking ignores the reality of how phone-tracking technology works.

Let’s look at the details of what we can glean from cellphone location information. Cell towers track which phones are in their locale—but that is a very rough measure, useful perhaps for tracking bank robbers, but not for the six-foot proximity one wants in order to determine who might have been infected by the coronavirus.

Finer precision comes from GPS signals, but these can only work outside. That means the location information supplied by your phone—if your phone and that of another person are both on—can tell you if you both went into the same subway stop around the same time. But it won’t tell you whether you rode the same subway car. And the location information from your phone isn’t fully precise. So not only can’t it reveal if, for example, you were in the same aisle in the supermarket as the ill person, but sometimes it will make errors about whether you made it into the store, as opposed to just sitting on a bench outside. What’s more, many people won’t have the location information available because GPS drains the battery, so they’ll shut it off when they’re not using it. Their phones don’t have the location information—and neither do the providers, at least not at the granularity to determine coronavirus exposure.

GPS is not the only way that cellphones can collect location information. Various other ways exist, including through the WiFi network to which a phone is connected. But while two individuals using the same WiFi network are likely to be close together inside a building, the WiFi data would typically not be able to determine whether they were in that important six-foot proximity range.

Other devices can also get within that range, including Bluetooth beacons. These are used within stores, seeking to determine precisely what people are—and aren’t—buying; they track people’s locations indoors within inches. But like WiFi, they’re not ubiquitous, so their ability to track exposure will be limited.

If the apps lead to the government’s dogging people’s whereabouts at work, school, in the supermarket and at church, will people still be willing to download the tracking apps that get them discounts when they’re passing the beer aisle? China follows this kind of surveillance model, but such a surveillance-state solution is highly unlikely to be acceptable in the United States. Yet anything less is unlikely to pinpoint individuals exposed to the virus.

South Korea took a different route. In precisely tracking coronavirus exposure, the country used additional digital records, including documentation of medical and pharmacy visits, history of credit card transactions, and CCTV videos, to determine where potentially exposed people had been—then followed up with interviews not just of infected people but also of their acquaintances, to determine where they had traveled.

Validating such records is labor intensive. And for the United States, it may not be the best use of resources at this time. There’s an even more critical reason that the Korean solution won’t work for the U.S.: South Korea was able to test exposed people. The U.S. can’t do this. Currently the country has a critical shortage of test kits; patients who are not ill enough to be hospitalized are not being tested. The shortage of test kits is sufficiently acute that in New York City, the current epicenter of the pandemic, the rule is, “unless you are hospitalized and a diagnosis will impact your care, you will not be tested.” With this in mind, moving to the South Korean model of tracking potentially exposed individuals won’t change the advice from federal and state governments that everyone should engage in social distancing—but employing such tracking would divert government resources and thus be counterproductive.

Currently, phone tracking in the United States is not efficacious. It cannot be unless all people are required to carry such location-tracking devices at all times; have location tracking on; and other forms of information tracking, including much wider use of CCTV cameras, Bluetooth beacons, and the like, are also in use. There are societies like this. But so far, even in the current crisis, no one is seriously contemplating the U.S. heading in that direction….(More)”.
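
The precision argument in the excerpt above can be made concrete with a little arithmetic. The sketch below is an illustration under stated assumptions, not a method described in the article: it computes the distance between two reported positions with the standard haversine formula and then asks whether six-foot proximity can actually be decided once each position's error radius is taken into account. The function names, the coordinates, and the 5-metre error radius are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres
SIX_FEET_M = 1.8288         # six feet expressed in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def proximity_verdict(distance_m, error_a_m, error_b_m, threshold_m=SIX_FEET_M):
    """Classify a pair of readings as 'within', 'apart', or 'undetermined'.

    error_a_m and error_b_m are assumed worst-case error radii for the two
    readings; the true separation can differ from distance_m by up to their sum.
    """
    combined_error = error_a_m + error_b_m
    if distance_m + combined_error <= threshold_m:
        return "within"
    if distance_m - combined_error > threshold_m:
        return "apart"
    return "undetermined"


if __name__ == "__main__":
    # Two hypothetical readings a short distance apart, each with an
    # assumed 5-metre error radius.
    d = haversine_m(40.75000, -73.99000, 40.75001, -73.99000)
    print(round(d, 2), proximity_verdict(d, 5.0, 5.0))
```

With error radii of several metres, most nearby pairs fall into the "undetermined" band, which is the practical meaning of the article's point that such location data cannot resolve six-foot proximity.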