How Can We Overcome the Challenge of Biased and Incomplete Data?


Knowledge@Wharton: “Data analytics and artificial intelligence are transforming our lives. Be it in health care, in banking and financial services, or in times of humanitarian crises — data determine the way decisions are made. But often, the way data is collected and measured can result in biased and incomplete information, and this can significantly impact outcomes.  

In a conversation with Knowledge@Wharton at the SWIFT Institute Conference on the Impact of Artificial Intelligence and Machine Learning in the Financial Services Industry, Alexandra Olteanu, a post-doctoral researcher at Microsoft Research, U.S. and Canada, discussed the ethical and human considerations in data collection and artificial intelligence, and how we can work towards removing these biases….

….Knowledge@Wharton: Bias is a big issue when you’re dealing with humanitarian crises, because it can influence who gets help and who doesn’t. When you translate that into the business world, especially in financial services, what implications do you see for algorithmic bias? What might be some of the consequences?

Olteanu: A good example is a new law in New York state under which insurance companies can now use social media to set the level of your premiums. But they could in fact end up using incomplete information. For instance, you might be buying your vegetables from the supermarket or a farmer’s market, but these retailers might not be tracking you on social media. So nobody knows that you are eating vegetables. On the other hand, a bakery that you visit might post something when you buy from there. Based on this, the insurance companies may conclude that you only eat cookies all the time. This shows how incomplete data can affect you….(More)”.

Commission publishes guidance on free flow of non-personal data


European Commission: “The guidance fulfils an obligation under the Regulation on the free flow of non-personal data (FFD Regulation), which requires the Commission to publish guidance on the interaction between this Regulation and the General Data Protection Regulation (GDPR), especially as regards datasets composed of both personal and non-personal data. It aims to help users – in particular small and medium-sized enterprises – understand the interaction between the two regulations.

In line with the existing GDPR documents, prepared by the European Data Protection Board, this guidance document aims to clarify which rules apply when processing personal and non-personal data. It gives a useful overview of the central concepts of the free flow of personal and non-personal data within the EU, while explaining the relation between the two Regulations in practical terms and with concrete examples….

Non-personal data are distinct from personal data, as laid down in the GDPR. Non-personal data can be categorised by origin, namely:

  • data which originally did not relate to an identified or identifiable natural person, such as data on weather conditions generated by sensors installed on wind turbines, or data on maintenance needs for industrial machines; or
  • data which was initially personal data, but later made anonymous.

While the guidance gives further examples of non-personal data, it also explains the concepts of personal, anonymised and pseudonymised data to provide a better understanding, and describes the boundary between personal and non-personal data.

What are mixed datasets?

In most real-life situations, a dataset is very likely to be composed of both personal and non-personal data. This is often referred to as a “mixed dataset”. Mixed datasets represent the majority of datasets used in the data economy and are commonly gathered thanks to technological developments such as the Internet of Things (i.e. digitally connecting objects), artificial intelligence and technologies enabling big data analytics.

Examples of mixed datasets include a company’s tax records, which mention the name and telephone number of its managing director; a company’s knowledge of IT problems and their solutions, based on individual incident reports; or a research institution’s anonymised statistical data together with the raw data initially collected, such as the replies of individual respondents to statistical survey questions….(More)”.
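To make the distinction concrete, here is a minimal sketch of how a single mixed record might be split into its personal and non-personal parts, with the personal part pseudonymised. The field names and the hashing step are illustrative assumptions rather than anything prescribed by the Commission’s guidance, and pseudonymised data still counts as personal data under the GDPR.

```python
import hashlib

# A hypothetical "mixed dataset" record: machine-generated sensor data
# (non-personal) stored alongside details of a responsible manager (personal).
record = {
    "turbine_id": "WT-042",
    "wind_speed_ms": 11.3,             # non-personal: sensor reading
    "maintenance_due": "2019-09-01",   # non-personal: maintenance schedule
    "manager_name": "Jane Doe",        # personal: relates to an identifiable person
    "manager_phone": "+32 2 555 0100", # personal
}

PERSONAL_FIELDS = {"manager_name", "manager_phone"}

def split_mixed_record(rec):
    """Split one record into its personal and non-personal parts."""
    personal = {k: v for k, v in rec.items() if k in PERSONAL_FIELDS}
    non_personal = {k: v for k, v in rec.items() if k not in PERSONAL_FIELDS}
    return personal, non_personal

def pseudonymise(personal, salt="example-salt"):
    """Replace personal values with salted hashes.

    Pseudonymised data remains personal data under the GDPR, because
    re-identification is still possible with additional information.
    """
    return {k: hashlib.sha256((salt + str(v)).encode()).hexdigest()[:16]
            for k, v in personal.items()}

personal, non_personal = split_mixed_record(record)
print("GDPR applies to:", pseudonymise(personal))
print("FFD Regulation covers:", non_personal)
```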

MegaPixels


About: “…MegaPixels is an art and research project first launched in 2017 for an installation at Tactical Technology Collective’s GlassRoom about face recognition datasets. In 2018 MegaPixels was extended to cover pedestrian analysis datasets for a commission by Elevate Arts festival in Austria. Since then MegaPixels has evolved into a large-scale interrogation of hundreds of publicly-available face and person analysis datasets, the first of which launched on this site in April 2019.

MegaPixels aims to provide a critical perspective on machine learning image datasets, one that might otherwise escape academia and the industry-funded artificial intelligence think tanks that are often supported by several of the same technology companies that created the datasets presented on this site.

MegaPixels is an independent project, designed as a public resource for educators, students, journalists, and researchers. Each dataset presented on this site undergoes a thorough review of its images, intent, and funding sources. Though the goals are similar to publishing an academic paper, MegaPixels is a website-first research project, with an academic publication to follow.

One of the main focuses of the dataset investigations presented on this site is to uncover where funding originated. Because of our emphasis on other researchers’ funding sources, it is important that we are transparent about our own….(More)”.

Privacy and Identity in a Networked Society: Refining Privacy Impact Assessment


Book by Stefan Strauß: “This book offers an analysis of privacy impacts resulting from and reinforced by technology and discusses fundamental risks and challenges of protecting privacy in the digital age.

Privacy is among the most endangered “species” in our networked society: personal information is processed for various purposes beyond our control. Ultimately, this affects the natural interplay between privacy, personal identity and identification. This book investigates that interplay from a systemic, socio-technical perspective by combining research from the social and computer sciences. It sheds light on the basic functions of privacy, their relation to identity, and how they alter with digital identification practices. The analysis reveals a general privacy control dilemma of (digital) identification shaped by several interrelated socio-political, economic and technical factors. Uncontrolled increases in the identification modalities inherent to digital technology reinforce this dilemma and benefit surveillance practices, thereby complicating the detection of privacy risks and the creation of appropriate safeguards.

Easing this problem requires a novel approach to privacy impact assessment (PIA), and this book proposes an alternative PIA framework which, at its core, comprises a basic typology of (personally and technically) identifiable information. This approach contributes to the theoretical and practical understanding of privacy impacts and thus, to the development of more effective protection standards….(More)”.

Social Media Monitoring: How the Department of Homeland Security Uses Digital Data in the Name of National Security


Report by the Brennan Center for Justice: “The Department of Homeland Security (DHS) is rapidly expanding its collection of social media information and using it to evaluate the security risks posed by foreign and American travelers. This year marks a major expansion. The visa applications vetted by DHS will include social media handles that the State Department is set to collect from some 15 million travelers per year. Social media can provide a vast trove of information about individuals, including their personal preferences, political and religious views, physical and mental health, and the identity of their friends and family. But it is susceptible to misinterpretation, and wholesale monitoring of social media creates serious risks to privacy and free speech. Moreover, despite the rush to implement these programs, there is scant evidence that they actually meet the goals for which they are deployed…(More)”

Data Protection and Digital Agency for Refugees


Paper by Dragana Kaurin: “For the millions of refugees fleeing conflict and persecution every year, access to information about their rights and control over their personal data are crucial for their ability to assess risk and navigate the asylum process. While asylum seekers are required to provide significant amounts of personal information on their journey to safety, they are rarely fully informed of their data rights by UN agencies or local border control and law enforcement staff tasked with obtaining and processing their personal information. Despite recent improvements in data protection mechanisms in the European Union, refugees’ informed consent for the collection and use of their personal data is rarely sought. Using examples drawn from interviews with refugees who have arrived in Europe since 2013, and an analysis of the impacts of the 2016 EU-Turkey deal on migration, this paper analyzes how the vast amount of data collected from refugees is gathered, stored and shared today, and considers the additional risks this collection process poses to an already vulnerable population navigating a perilous information-decision gap….(More)”.

San Francisco becomes the first US city to ban facial recognition by government agencies


Colin Lecher at The Verge: “In a first for a city in the United States, San Francisco has voted to ban its government agencies from using facial recognition technology.

The city’s Board of Supervisors voted eight to one to approve the proposal, set to take effect in a month, that would bar city agencies, including law enforcement, from using the tool. The ordinance would also require city agencies to get board approval for their use of surveillance technology, and set up audits of surveillance tech already in use. Other cities have approved similar transparency measures.

The plan, called the Stop Secret Surveillance Ordinance, was spearheaded by Supervisor Aaron Peskin. In a statement read ahead of the vote, Peskin said it was “an ordinance about having accountability around surveillance technology.”

“This is not an anti-technology policy,” he said, stressing that many tools used by law enforcement are still important to the city’s security. Still, he added, facial recognition is “uniquely dangerous and oppressive.”

The ban comes amid a broader debate over facial recognition, which can be used to rapidly identify people and has triggered new questions about civil liberties. Experts have raised specific concerns about the tools, as studies have demonstrated instances of troubling bias and error rates.

Microsoft, which offers facial recognition tools, has called for some form of regulation for the technology — but how, exactly, to regulate the tool has been contested. Proposals have ranged from light regulation to full moratoriums. Legislation has largely stalled, however.

San Francisco’s decision will inevitably be used as an example as the debate continues and other cities and states decide whether and how to regulate facial recognition. Civil liberties groups like the ACLU of Northern California have already thrown their support behind the San Francisco plan, while law enforcement in the area has pushed back….(More)”.

The Pathologies of Digital Consent


Paper by Neil M. Richards and Woodrow Hartzog: “Consent permeates both our law and our lives — especially in the digital context. Consent is the foundation of the relationships we have with search engines, social networks, commercial web sites, and any one of the dozens of other digitally mediated businesses we interact with regularly. We are frequently asked to consent to terms of service, privacy notices, the use of cookies, and so many other commercial practices. Consent is important, but it’s possible to have too much of a good thing. As a number of scholars have documented, while consent models permeate the digital consumer landscape, the practical conditions of these agreements fall far short of the gold standard of knowing and voluntary consent. Yet as scholars, advocates, and consumers, we lack a common vocabulary for talking about the different ways in which digital consents can be flawed.

This article offers four contributions to improve our understanding of consent in the digital world. First, we offer a conceptual vocabulary of “the pathologies of consent” — a framework for talking about different kinds of defects that consent models can suffer, such as unwitting consent, coerced consent, and incapacitated consent. Second, we offer three conditions for when consent will be most valid in the digital context: when choice is infrequent, when the potential harms resulting from that choice are vivid and easy to imagine, and where we have the correct incentives to choose consciously and seriously. The further we fall from these conditions, the more a particular consent will be pathological and thus suspect. Third, we argue that our theory of consent pathologies sheds light on the so-called “privacy paradox” — the notion that there is a gap between what consumers say about wanting privacy and what they actually do in practice. Understanding the “privacy paradox” in terms of consent pathologies shows how consumers are not hypocrites who say one thing but do another. On the contrary, the pathologies of consent reveal how consumers can be nudged and manipulated by powerful companies against their actual interests, and that this process is easier when consumer protection law falls far from the gold standard. In light of these findings, we offer a fourth contribution — the theory of consumer trust we have suggested in prior work and which we further elaborate here as an alternative to our over-reliance on consent and its many pathologies….(More)”.

How AI could save lives without spilling medical secrets


Will Knight at MIT Technology Review: “The potential for artificial intelligence to transform health care is huge, but there’s a big catch.

AI algorithms will need vast amounts of medical data on which to train before machine learning can deliver powerful new ways to spot and understand the cause of disease. That means imagery, genomic information, or electronic health records—all potentially very sensitive information.

That’s why researchers are working on ways to let AI learn from large amounts of medical data while making it very hard for that data to leak.

One promising approach is now getting its first big test at Stanford Medical School in California. Patients there can choose to contribute their medical data to an AI system that can be trained to diagnose eye disease without ever actually accessing their personal details.

Participants submit ophthalmology test results and health record data through an app. The information is used to train a machine-learning model to identify signs of eye disease in the images. But the data is protected by technology developed by Oasis Labs, a startup spun out of UC Berkeley, which guarantees that the information cannot be leaked or misused. The startup was granted permission by regulators to start the trial last week.

The sensitivity of private patient data is a looming problem. AI algorithms trained on data from different hospitals could potentially diagnose illness, prevent disease, and extend lives. But in many countries medical records cannot easily be shared and fed to these algorithms for legal reasons. Research on using AI to spot disease in medical images or data usually involves relatively small data sets, which greatly limits the technology’s promise….

Oasis stores the private patient data on a secure chip, designed in collaboration with other researchers at Berkeley. The data remains within the Oasis cloud; outsiders are able to run algorithms on the data, and receive the results, without its ever leaving the system. A smart contract (software that runs on top of a blockchain) is triggered when a request to access the data is received. This software logs how the data was used and also checks to make sure the machine-learning computation was carried out correctly….(More)”.
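As a rough illustration of that access pattern, here is a minimal sketch in which raw records never leave a protected store: callers submit a computation, receive only its result, and every request is appended to an audit log. The class, the record fields, and the hash-based log are assumptions made for illustration; this is not the Oasis Labs API, and a real deployment would rely on secure hardware and an actual smart contract rather than an in-memory list.

```python
import datetime
import hashlib
import json

class ProtectedDataStore:
    """Simulation of the pattern described above: data stays put, callers
    submit computations and get back only results, and every request is
    logged (standing in for smart-contract logging on a blockchain)."""

    def __init__(self, records):
        self._records = records   # private patient data, never returned directly
        self.audit_log = []       # append-only record of how the data was used

    def run(self, requester, computation):
        entry = {
            "requester": requester,
            "computation": computation.__name__,
            "timestamp": datetime.datetime.utcnow().isoformat(),
        }
        result = computation(self._records)   # runs inside the protected store
        # Log a fingerprint of the result so the computation can be audited later.
        entry["result_hash"] = hashlib.sha256(
            json.dumps(result, sort_keys=True, default=str).encode()
        ).hexdigest()
        self.audit_log.append(entry)
        return result                         # only the aggregate result leaves

# Hypothetical, made-up records standing in for ophthalmology test results.
store = ProtectedDataStore([
    {"patient_id": 1, "retina_score": 0.82, "diagnosis": "healthy"},
    {"patient_id": 2, "retina_score": 0.31, "diagnosis": "diabetic retinopathy"},
])

def disease_rate(records):
    """An example computation, reduced from model training to a simple aggregate."""
    flagged = sum(1 for r in records if r["diagnosis"] != "healthy")
    return {"disease_rate": flagged / len(records)}

print(store.run("external_researcher", disease_rate))
print(store.audit_log)
```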

Ethics of identity in the time of big data


Paper by James Brusseau in First Monday: “Compartmentalizing our distinct personal identities is increasingly difficult in big data reality. Pictures of the person we were on past vacations resurface in employers’ Google searches; LinkedIn, which exhibits our income level, is increasingly used as a dating web site. Whether on vacation, at work, or seeking romance, our digital selves stream together.

One result is that a perennial ethical question about personal identity has spilled out of philosophy departments and into the real world. Ought we possess one, unified identity that coherently integrates the various aspects of our lives, or, incarnate deeply distinct selves suited to different occasions and contexts? At bottom, are we one, or many?

The question is not only palpable today, but also urgent because if a decision is not made by us, the forces of big data and surveillance capitalism will make it for us by compelling unity. Speaking in favor of the big data tendency, Facebook’s Mark Zuckerberg promotes the ethics of an integrated identity, a single version of selfhood maintained across diverse contexts and human relationships.

This essay goes in the other direction by sketching two ethical frameworks arranged to defend our compartmentalized identities, which amounts to promoting the dis-integration of our selves. One framework connects with natural law, the other with language, and both aim to create a sense of selfhood that breaks away from its own past, and from the unifying powers of big data technology….(More)”.