Facial Recognition Software Requires Checks and Balances


David Eaves and Naeha Rashid in Policy Options: “A few weeks ago, members of the Nexus traveller identification program were notified that Canadian Border Services is upgrading its automated system from iris scanners to facial recognition technology. This is meant to simplify identification and increase efficiency without compromising security. But it also raises profound questions concerning how we discuss and develop public policies around such technology – questions that may not be receiving sufficiently open debate in the rush toward promised greater security.

Analogous to the U.S. Customs and Border Protection (CBP) program Global Entry, Nexus is a joint Canada-US border control system designed for low-risk, pre-approved travellers. Nexus does provide a public good, and there are valid reasons to improve surveillance at airports. Even before 9/11, border surveillance was an accepted annoyance and since then, checkpoint operations have become more vigilant and complex in response to the public demand for safety.

Nexus is one of the first North American government-sponsored services to adopt facial recognition, and as such it could be a pilot program that other services will follow. Left unchecked, the technology will likely become ubiquitous at North American border crossings within the next decade, and it will probably be adopted by governments to solve domestic policy challenges.

Facial recognition software is imperfect and has documented bias, but it will continue to improve and become superior to humans in identifying individuals. Given this, questions arise: What policies guide the use of this technology? What policies should inform future government use? In our headlong rush toward enhanced security, we risk replicating the justifications used by the private sector in its attempt to balance effectiveness, efficiency and privacy.

One key question involves citizens’ capacity to consent. Previously, Nexus members submitted to fingerprint and retinal scans – biometric markers that are relatively unique and enable government to verify identity at the border. Facial recognition technology uses visual data and seeks, analyzes, and stores identifying facial information in a database, which is then used for comparison with new images and video….(More)”.

Tesco Grocery 1.0, a large-scale dataset of grocery purchases in London


Paper by Luca Maria Aiello, Daniele Quercia, Rossano Schifanella & Lucia Del Prete: “We present the Tesco Grocery 1.0 dataset: a record of 420 M food items purchased by 1.6 M fidelity card owners who shopped at the 411 Tesco stores in Greater London over the course of the entire year of 2015, aggregated at the level of census areas to preserve anonymity. For each area, we report the number of transactions and nutritional properties of the typical food item bought including the average caloric intake and the composition of nutrients.

The set of global trade international numbers (barcodes) for each food type is also included. To establish data validity we: i) compare food purchase volumes to population from census to assess representativeness, and ii) match nutrient and energy intake to official statistics of food-related illnesses to appraise the extent to which the dataset is ecologically valid. Given its unprecedented scale and geographic granularity, the data can be used to link food purchases to a number of geographically-salient indicators, which enables studies on health outcomes, cultural aspects, and economic factors….(More)”.
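
As a rough illustration of the representativeness check described above (a sketch, not the authors' code), one might compare per-area transaction volumes with census population; the file names and columns ("area_id", "num_transactions", "population") below are assumed for illustration only.

```python
# A minimal sketch, assuming area-level CSV exports with hypothetical
# columns "area_id", "num_transactions" and "population".
import pandas as pd

purchases = pd.read_csv("tesco_grocery_areas.csv")   # hypothetical file name
census = pd.read_csv("census_population.csv")        # hypothetical file name

merged = purchases.merge(census, on="area_id")

# Representativeness check: do areas with more residents record more purchases?
corr = merged["num_transactions"].corr(merged["population"], method="spearman")
print(f"Spearman correlation, transactions vs. population: {corr:.2f}")
```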

How big data is dividing the public in China’s coronavirus fight – green, yellow, red


Article by Viola Zhou: “On Valentine’s Day, Matt Ma, a 36-year-old lawyer in the eastern Chinese province of Zhejiang, discovered he had been coded “red”. The colour, displayed in a payment app on his smartphone, indicated that he needed to be quarantined at home even though he had no symptoms of the dangerous coronavirus.

Without a green light from the system, Ma could not travel from his ancestral hometown of Lishui to his new home city of Hangzhou, which is now surrounded by checkpoints set up to contain the epidemic.

Ma is one of the millions of people whose movements are being choreographed by the government through software that feeds on troves of data and issues orders that effectively dictate whether they must stay in or can go to work. Their experience represents a slice of China’s desperate attempt to stop the coronavirus by using a mixed bag of cutting-edge technologies and old-fashioned surveillance. It was also a rare real-world test of the use of technology on a large scale to halt the spread of communicable diseases.

“This kind of massive use of technology is unprecedented,” said Christos Lynteris, a medical anthropologist at the University of St Andrews who has studied epidemics in China.

But Hangzhou’s experiment has also revealed the pitfalls of applying opaque formulas to a large population.

In the city’s case, there are reports of people being marked incorrectly, falling victim to an algorithm that is, by the government’s own admission, not perfect….(More)”.

Accelerating AI with synthetic data


Essay by Khaled El Emam: “The application of artificial intelligence and machine learning to solve today’s problems requires access to large amounts of data. One of the key obstacles faced by analysts is access to this data (for example, these issues were reflected in reports from the Government Accountability Office and the McKinsey Global Institute).

Synthetic data can help solve this data problem in a privacy-preserving manner.

What is synthetic data?

Data synthesis is an emerging privacy-enhancing technology that can enable access to realistic data, which is information that may be synthetic, but has the properties of an original dataset. It also simultaneously ensures that such information can be used and disclosed with reduced obligations under contemporary privacy statutes. Synthetic data retains the statistical properties of the original data. Therefore, there are an increasing number of use cases where it would serve as a proxy for real data.

Synthetic data is created by taking an original (real) dataset and then building a model to characterize the distributions and relationships in that data — this is called the “synthesizer.” The synthesizer is typically an artificial neural network or other machine learning technique that learns these (original) data characteristics. Once that model is created, it can be used to generate synthetic data. The data is generated from the model and does not have a 1:1 mapping to real data, meaning that the likelihood of mapping the synthetic records to real individuals would be very small — it is not considered personal information.
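
As a toy illustration of the fit-then-sample idea described above (only an illustration; the essay refers to neural-network synthesizers, whereas this sketch substitutes a simple Gaussian mixture model and simulated "real" data), consider:

```python
# Toy sketch: fit a "synthesizer" to real data, then sample new records from it.
# A Gaussian mixture stands in for the machine-learning models the essay mentions;
# the "real" data here is simulated for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Pretend real dataset: two correlated numeric columns (e.g., age and income).
real = rng.multivariate_normal(
    mean=[40, 60_000],
    cov=[[100, 30_000], [30_000, 4e8]],
    size=5_000,
)

synthesizer = GaussianMixture(n_components=5, random_state=0).fit(real)
synthetic, _ = synthesizer.sample(5_000)  # generated rows, no 1:1 link to real rows

# The synthetic data should preserve broad statistical properties of the original.
print(real.mean(axis=0), synthetic.mean(axis=0))
```

In practice the synthesizer would be a richer model fit to the full original dataset, but the generate-from-a-fitted-model step is the same.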

Many different types of data can be synthesized, including images, video, audio, text and structured data. The main focus in this article is on the synthesis of structured data.

Even though data can be generated in this manner, that does not mean it cannot be personal information. If the synthesizer is overfit to real data, then the generated data will replicate the original real data. Therefore, the synthesizer has to be constructed in a way that avoids such overfitting. A formal privacy assurance should also be performed on the synthesized data to validate that there is a weak mapping between synthetic records and individuals….(More)”.
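
One common way to probe that weak-mapping property (offered here as a sketch, not as the author's procedure) is a distance-to-closest-record check: if many synthetic rows lie unusually close to real rows, the synthesizer may have memorized its training data.

```python
# Sketch of a distance-to-closest-record (DCR) check, assuming `real` and
# `synthetic` are numeric arrays on a comparable (e.g., standardized) scale.
import numpy as np
from scipy.spatial import cKDTree

def distance_to_closest_record(real: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """For each synthetic row, return the distance to its nearest real row."""
    tree = cKDTree(real)
    distances, _ = tree.query(synthetic, k=1)
    return distances

# Usage (continuing the sketch above): flag exact or near-exact copies.
# dcr = distance_to_closest_record(real, synthetic)
# print("Share of near-duplicate synthetic rows:", np.mean(dcr < 1e-6))
```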

Monitoring of the Venezuelan exodus through Facebook’s advertising platform


Paper by Palotti et al: “Venezuela is going through the worst economic, political and social crisis in its modern history. Basic products like food or medicine are scarce, and hyperinflation is combined with economic depression. This situation is creating an unprecedented refugee and migrant crisis in the region. Governments and international agencies have not been able to consistently leverage reliable information using traditional methods. Therefore, to organize and deploy any kind of humanitarian response, it is crucial to evaluate new methodologies to measure the number and location of Venezuelan refugees and migrants across Latin America.

In this paper, we propose to use Facebook’s advertising platform as an additional data source for monitoring the ongoing crisis. We estimate and validate national and sub-national numbers of refugees and migrants and break down their socio-economic profiles to further understand the complexity of the phenomenon. Although limitations exist, we believe that the presented methodology can be of value for real-time assessment of refugee and migrant crises worldwide….(More)”.
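
A minimal sketch of the estimation step, under the assumption that advertising-audience estimates (for example, the platform's counts of users classified as living abroad from Venezuela) have already been exported to a per-region table; the file and column names below are hypothetical, and this is not the authors' code.

```python
# Hypothetical audience-estimate table with columns "region",
# "expats_from_venezuela" and "total_audience".
import pandas as pd

audience = pd.read_csv("ad_audience_estimates.csv")  # hypothetical export

# Sub-national share of the platform's audience classified as Venezuelan expats.
audience["migrant_share"] = (
    audience["expats_from_venezuela"] / audience["total_audience"]
)
print(audience.sort_values("migrant_share", ascending=False).head())
```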

Crowdsourcing data to mitigate epidemics


Gabriel M Leung and Kathy Leung at The Lancet: “Coronavirus disease 2019 (COVID-19) has spread with unprecedented speed and scale since the first zoonotic event that introduced the causative virus—severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)—into humans, probably during November, 2019, according to phylogenetic analyses suggesting the most recent common ancestor of the sequenced genomes emerged between Oct 23 and Dec 16, 2019. The reported cumulative number of confirmed patients worldwide already exceeds 70 000 in almost 30 countries and territories as of Feb 19, 2020, although the actual number of infections is likely to far outnumber this case count.

During any novel emerging epidemic, let alone one with such magnitude and speed of global spread, a first task is to put together a line list of suspected, probable, and confirmed individuals on the basis of working criteria of the respective case definitions. This line list would allow for quick preliminary assessment of epidemic growth and potential for spread, evidence-based determination of the period of quarantine and isolation, and monitoring of efficiency of detection of potential cases. Frequent refreshing of the line list would further enable real-time updates as more clinical, epidemiological, and virological (including genetic) knowledge becomes available as the outbreak progresses….
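
As a small sketch of what such a line list looks like in practice (illustrative column names, not a standard schema), and of the kind of quick growth assessment it enables:

```python
# Illustrative line list: one row per individual, with a working case
# classification and key dates.
import pandas as pd

line_list = pd.DataFrame(
    {
        "case_id": [1, 2, 3],
        "status": ["confirmed", "probable", "suspected"],
        "symptom_onset": pd.to_datetime(["2020-02-01", "2020-02-03", "2020-02-03"]),
        "report_date": pd.to_datetime(["2020-02-04", "2020-02-05", "2020-02-06"]),
    }
)

# Quick preliminary assessment of epidemic growth: confirmed cases by onset date.
confirmed = line_list[line_list["status"] == "confirmed"]
print(confirmed.groupby("symptom_onset").size())
```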

We surveyed different and varied sources of possible line lists for COVID-19 (appendix pp 1–4). A bottleneck remains in carefully collating as much relevant data as possible, sifting through and verifying these data, extracting intelligence to forecast and inform outbreak strategies, and thereafter repeating this process in iterative cycles to monitor and evaluate progress. A possible methodological breakthrough would be to develop and validate algorithms for automated bots to search through cyberspaces of all sorts, by text mining and natural language processing (in languages not limited to English) to expedite these processes. In this era of smartphones and their accompanying applications, the authorities are required to combat not only the epidemic per se, but perhaps an even more sinister outbreak of fake news and false rumours, a so-called infodemic…(More)”.

The Economic Impact of Open Data: Opportunities for value creation in Europe


Press Release: “The European Data Portal publishes its study “The Economic Impact of Open Data: Opportunities for value creation in Europe”, which researches the value created by open data in Europe. It is the second study by the European Data Portal, following the 2015 report. The open data market size is estimated at €184 billion and forecast to reach between €199.51 billion and €334.21 billion by 2025. The report additionally considers how this market size is distributed along different sectors and how many people are employed due to open data. The efficiency gains from open data, such as potential lives saved, time saved, environmental benefits, and improvement of language services, as well as associated potential cost savings, are explored and quantified where possible. Finally, the report also considers examples and insights from open data re-use in organisations. The key findings of the report are summarised below:

  1. The specification and implementation of high-value datasets as part of the new Open Data Directive is a promising opportunity to address quality & quantity demands of open data.
  2. Addressing quality & quantity demands is important, yet not enough to reach the full potential of open data.
  3. Open data re-users have to be aware and capable of understanding and leveraging the potential.
  4. Open data value creation is part of the wider challenge of skill and process transformation: a lengthy process whose change and impact are not always easy to observe and measure.
  5. Sector-specific initiatives and collaboration in and across private and public sector foster value creation.
  6. Combining open data with personal, shared, or crowdsourced data is vital for the realisation of further growth of the open data market.
  7. For different challenges, we must explore and improve multiple approaches to data re-use that are ethical, sustainable, and fit-for-purpose….(More)”.

Mapping Wikipedia


Michael Mandiberg at The Atlantic: “Wikipedia matters. In a time of extreme political polarization, algorithmically enforced filter bubbles, and fact patterns dismissed as fake news, Wikipedia has become one of the few places where we can meet to write a shared reality. We treat it like a utility, and the U.S. and U.K. trust it about as much as the news.

But we know very little about who is writing the world’s encyclopedia. We do know that just because anyone can edit, doesn’t mean that everyone does: The site’s editors are disproportionately cis white men from the global North. We also know that, as with most of the internet, a small number of the editors do a large amount of the editing. But that’s basically it: In the interest of improving retention, the Wikimedia Foundation’s own research focuses on the motivations of people who do edit, not on those who don’t. The media, meanwhile, frequently focus on Wikipedia’s personality stories, even when covering the bigger questions. And Wikipedia’s own culture pushes back against granular data harvesting: The Wikimedia Foundation’s strong data-privacy rules guarantee users’ anonymity and limit the modes and duration of their own use of editor data.

But as part of my research in producing Print Wikipedia, I discovered a data set that can offer an entry point into the geography of Wikipedia’s contributors. Every time anyone edits Wikipedia, the software records the text added or removed, the time of the edit, and the username of the editor. (This edit history is part of Wikipedia’s ethos of radical transparency: Everyone is anonymous, and you can see what everyone is doing.) When an editor isn’t logged in with a username, the software records that user’s IP address. I parsed all of the 884 million edits to English Wikipedia to collect and geolocate the 43 million IP addresses that have edited English Wikipedia. I also counted 8.6 million username editors who have made at least one edit to an article.
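
A simplified sketch of the first step in that pipeline is distinguishing anonymous IP-address edits from username edits in the recorded metadata; the `edits` records below are hypothetical, and geolocating the IPs against an external database is omitted.

```python
# Separate anonymous edits (recorded as IP addresses) from username edits.
import ipaddress

edits = [  # hypothetical edit-metadata records
    {"user": "203.0.113.7", "timestamp": "2015-06-01T12:00:00Z"},
    {"user": "ExampleEditor", "timestamp": "2015-06-01T12:05:00Z"},
]

ip_edits, username_edits = [], []
for edit in edits:
    try:
        ipaddress.ip_address(edit["user"])  # raises ValueError for non-IP usernames
        ip_edits.append(edit)
    except ValueError:
        username_edits.append(edit)

print(len(ip_edits), "anonymous (IP) edits;", len(username_edits), "username edits")
```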

The result is a set of maps that offer, for the first time, insight into where the millions of volunteer editors who build and maintain English Wikipedia’s 5 million pages are—and, maybe more important, where they aren’t….

Like the Enlightenment itself, the modern encyclopedia has a history entwined with colonialism. Encyclopédie aimed to collect and disseminate all the world’s knowledge—but in the end, it could not escape the biases of its colonial context. Likewise, Napoleon’s Description de l’Égypte augmented an imperial military campaign with a purportedly objective study of the nation, which was itself an additional form of conquest. If Wikipedia wants to break from the past and truly live up to its goal to compile the sum of all human knowledge, it requires the whole world’s participation….(More)”.

Wisdom or Madness? Comparing Crowds with Expert Evaluation in Funding the Arts


Paper by Ethan R. Mollick and Ramana Nanda: “In fields as diverse as technology entrepreneurship and the arts, crowds of interested stakeholders are increasingly responsible for deciding which innovations to fund, a privilege that was previously reserved for a few experts, such as venture capitalists and grant‐making bodies. Little is known about the degree to which the crowd differs from experts in judging which ideas to fund, and, indeed, whether the crowd is even rational in making funding decisions. Drawing on a panel of national experts and comprehensive data from the largest crowdfunding site, we examine funding decisions for proposed theater projects, a category where expert and crowd preferences might be expected to differ greatly.

We instead find significant agreement between the funding decisions of crowds and experts. Where crowds and experts disagree, it is far more likely to be a case where the crowd is willing to fund projects that experts may not. Examining the outcomes of these projects, we find no quantitative or qualitative differences between projects funded by the crowd alone, and those that were selected by both the crowd and experts. Our findings suggest that crowdfunding can play an important role in complementing expert decisions, particularly in sectors where the crowds are end users, by allowing projects the option to receive multiple evaluations and thereby lowering the incidence of “false negatives.”…(More)”.

Can Technology Support Democracy?


Essay by Douglas Schuler: “The utopian optimism about democracy and the internet has given way to disillusionment. At the same time, given the complexity of today’s wicked problems, the need for democracy is critical. Unfortunately, democracy is under attack around the world, and there are ominous signs of its retreat.

How does democracy fare when digital technology is added to the picture? Weaving technology and democracy together is risky, and technologists who begin any digital project with the conviction that technology can and will solve “problems” of democracy are likely to be disappointed. Technology can be a boon to democracy if it is informed technology.

The goal in writing this essay was to encourage people to help develop and cultivate a rich democratic sphere. Democracy has great potential that it rarely achieves. It is radical, critical, complex, and fragile. It takes different forms in different contexts. These forms are complex and the solutionism promoted by the computer industry and others is not appropriate in the case of democracies. The primary aim of technology in the service of democracy is not merely to make it easier or more convenient but to improve society’s civic intelligence, its ability to address the problems it faces effectively and equitably….(More)”.