The language we use to describe data can also help us fix its problems


Luke Stark & Anna Lauren Hoffmann at Quartz: “Data is, apparently, everything.

It’s the “new oil” that fuels online business. It comes in floods or tsunamis. We access it via “streams” or “fire hoses.” We scrape it, mine it, bank it, and clean it. (Or, if you prefer your buzzphrases with a dash of ageism and implicit misogyny, big data is like “teenage sex,” while working with it is “the sexiest job” of the century.)

These data metaphors can seem like empty cliches, but at their core they’re efforts to come to grips with the continuing onslaught of connected devices and the huge amounts of data they generate.

In a recent article, we—an algorithmic-fairness researcher at Microsoft and a data-ethics scholar at the University of Washington—push this connection one step further. More than simply helping us wrap our collective heads around data-fueled technological change, we set out to learn what these metaphors can teach us about the real-life ethics of collecting and handling data today.

Instead of only drawing from the norms and commitments of computer science, information science, and statistics, what if we looked at the ethics of the professions evoked by our data metaphors instead?…(More)”.

Developing Artificially Intelligent Justice


Paper by Richard M. Re and Alicia Solow-Niederman: “Artificial intelligence, or AI, promises to assist, modify, and replace human decision-making, including in court. AI already supports many aspects of how judges decide cases, and the prospect of “robot judges” suddenly seems plausible—even imminent. This Article argues that AI adjudication will profoundly affect the adjudicatory values held by legal actors as well as the public at large. The impact is likely to be greatest in areas, including criminal justice and appellate decision-making, where “equitable justice,” or discretionary moral judgment, is frequently considered paramount. By offering efficiency and at least an appearance of impartiality, AI adjudication will both foster and benefit from a turn toward “codified justice,” an adjudicatory paradigm that favors standardization above discretion. Further, AI adjudication will generate a range of concerns relating to its tendency to make the legal system more incomprehensible, data-based, alienating, and disillusioning. And potential responses, such as crafting a division of labor between human and AI adjudicators, each pose their own challenges. The single most promising response is for the government to play a greater role in structuring the emerging market for AI justice, but auspicious reform proposals would borrow several interrelated approaches. Similar dynamics will likely extend to other aspects of government, such that choices about how to incorporate AI in the judiciary will inform the future path of AI development more broadly….(More)”.

The Education Data Collaborative: A new kind of partnership.


About: “Whether we work within schools or as part of the broader ecosystem of parent-teacher associations, and philanthropic, nonprofit, and volunteer organizations, we need data to guide decisions about investing our time and resources.

This data is typically expensive to gather, often unvalidated (e.g. self-reported), and commonly available only to those who collect or report it. It can even be hard to ask for data when it’s not clear what’s available. At the same time, information – in the form of discrete research, report-card style PDFs, or static websites – is everywhere. The result is that many already resource-thin organizations that could be collaborating around strategies to help kids advance spend a lot of time in isolation collecting and searching for data.

In the past decade, we’ve seen solid progress in addressing part of the problem: the emergence of connected longitudinal data systems (LDS). These warehouses and linked databases contain data that can help us understand how students progress over time. No personally identifiable information (or PII) is shared, yet the data can reveal where interventions are most needed. Because these systems are typically designed for researchers and policy professionals, they are rarely accessible to the educators, parents, and partners – arts, sports, academic enrichment (e.g. STEM), mentoring, and family support programs – that play such important roles in helping young people learn and succeed…
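To make the PII point concrete, here is a minimal sketch (not from the Collaborative itself; every name, column, and figure is invented for illustration) of how student-level records might be reduced to shareable cohort-level indicators:

```python
import pandas as pd

# Hypothetical student-level table; in practice this never leaves the LDS.
students = pd.DataFrame({
    "student_id": range(12),
    "name": list("ABCDEFGHIJKL"),                    # direct identifiers (PII)
    "school": ["North"] * 6 + ["South"] * 6,
    "year": [2018] * 3 + [2019] * 3 + [2018] * 3 + [2019] * 3,
    "met_benchmark": [1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1],
})

# Drop direct identifiers, then aggregate to the cohort level.
cohorts = (
    students.drop(columns=["student_id", "name"])
    .groupby(["school", "year"])
    .agg(n_students=("met_benchmark", "size"),
         pct_met=("met_benchmark", "mean"))
    .reset_index()
)

# Suppress small cells so no individual can be singled out (k = 3 here).
cohorts = cohorts[cohorts["n_students"] >= 3]
print(cohorts)
```

Only the bottom table, counts and benchmark rates by school and year, would be shared with partner organizations; the row-level records stay behind.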

“We need open tools for the ecosystem – parents, volunteers, non-profit organizations and the foundations and agencies that support them. These partners can realize significant benefit from the same kind of data policy makers and education leaders hold in their LDS.


That’s why we’re launching the Education Data Collaborative. Working together, we can build tools that help us use data to improve the design, efficacy, and impact of programs and interventions and find new ways to work with public education systems to achieve great things for kids. …Data collaboratives, data trusts, and other kinds of multi-sector data partnerships are among the most important civic innovations to emerge in the past decade….(More)”

Open Data Retrospective


Laura Bacon at Luminate: “Our global philanthropic organisation – previously the Government & Citizen Engagement (GCE) initiative at Omidyar Network, now Luminate – has been active in the open data space for over a decade. In that time, we have invested more than $50m in organisations and platforms that are working to advance open data’s potential, including Open Data Institute, IMCO, Open Knowledge, ITS Rio, Sunlight, GovLab, Web Foundation, Open Data Charter, and Open Government Partnership.

Ahead of our transition from GCE to Luminate last year, we wanted to take a step back and assess the field in order to cultivate a richer understanding of the evolution of open data—including its critical developments, drivers of change, and influential actors[1]. This research would help inform our own strategy and provide valuable insight that we can share with the broader open data ecosystem. 

First, what is open data? Open data is data that can be freely used, shared, and built upon by anyone, anywhere, for any purpose. At its best, open government data can empower citizens, improve governments, create opportunities, and help solve public problems. Have you used a transport app to find out when the next bus will arrive? Or a weather app to look up a forecast? When using a real estate website to buy or rent a home, have you also reviewed its proximity to health, education, and recreational facilities or checked out neighborhood crime rates? If so, your life has been impacted by open data.

The Open Data Retrospective

We commissioned Dalberg, a global strategic advisory firm, to conduct an Open Data Retrospective to explore: ‘how and why did the open data field evolve globally over the past decade?’ as well as ‘where is the field today?’ With the concurrent release of the report “The State of Open Data” – led by IDRC and the Open Data for Development initiative – we thought this would be a great time to make public the report we’d commissioned.

You can see Dalberg’s open data report here, and its affiliated data here. Please note, this presentation is a modification of the report. Several sections and slides have been removed for brevity and/or confidentiality. Therefore, some details about particular organisations and strategies are not included in this deck.

Evolution and impact

Dalberg’s report covers the trajectory of the open data field and characterises it as: inception (pre-2008), systematisation (2009-2010), expansion (2011-2015), and reevaluation (2016-2018). This characterisation varies by region and sector, but generally captures the evolution of the open data movement….(More)”.

Datafication, development and marginalised urban communities: an applied data justice framework


Paper by Richard Heeks et al.: “The role of data within international development is rapidly expanding. However, the recency of this phenomenon means analysis has been lagging, particularly analysis of the broader impacts of real-world initiatives. Addressing this gap through a focus on data’s increasing presence in urban development, this paper makes two contributions. First – drawing from the emerging literature on ‘data justice’ – it presents an explicit, systematic and comprehensive new framework that can be used for analysis of datafication. Second, it applies the framework to four mapping initiatives in cities of the global South. These initiatives capture and visualise new data about marginalised communities: residents living in slums and other informal settlements about whom data has traditionally been lacking. Analysing across procedural, rights, instrumental and structural dimensions, it finds these initiatives deliver real incremental gains for their target communities. But it is external actors and wealthier communities that gain more, thus increasing relative inequality….(More)”.

The Age of Digital Interdependence


Report of the High-level Panel on Digital Cooperation: “The immense power and value of data in the modern economy can and must be harnessed to meet the SDGs, but this will require new models of collaboration. The Panel discussed potential pooling of data in areas such as health, agriculture and the environment to enable scientists and thought leaders to use data and artificial intelligence to better understand issues and find new ways to make progress on the SDGs. Such data commons would require criteria for establishing relevance to the SDGs, standards for interoperability, rules on access and safeguards to ensure privacy and security.

Anonymised data – information that is rendered anonymous in such a way that the data subject is not or no longer identifiable – about progress toward the SDGs is generally less sensitive and controversial than the use of personal data of the kind companies such as Facebook, Twitter or Google may collect to drive their business models, or facial and gait data that could be used for surveillance. However, personal data can also serve development goals, if handled with proper oversight to ensure its security and privacy.

For example, individual health data is extremely sensitive – but many people’s health data, taken together, can allow researchers to map disease outbreaks, compare the effectiveness of treatments and improve understanding of conditions. Aggregated data from individual patient cases was crucial to containing the Ebola outbreak in West Africa. Private and public sector healthcare providers around the world are now using various forms of electronic medical records. These help individual patients by making it easier to personalise health services, but the public health benefits require these records to be interoperable.
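A minimal sketch of what “interoperable” means in practice, assuming two hypothetical provider export formats (every field name here is invented, not drawn from the report):

```python
from collections import Counter

# Two hypothetical providers exporting the same facts under different schemas.
record_a = {"pt_dob": "1980-04-02", "dx_code": "ICD10:E11", "visit": "2019-03-01"}
record_b = {"birth_date": "1975-11-20", "diagnosis": "E11", "seen_on": "2019-03-04"}

# Field mappings into one shared schema.
MAP_A = {"pt_dob": "birth_date", "dx_code": "diagnosis", "visit": "visit_date"}
MAP_B = {"birth_date": "birth_date", "diagnosis": "diagnosis", "seen_on": "visit_date"}

def normalize(record: dict, mapping: dict) -> dict:
    """Rename fields to the shared schema and strip coding-system prefixes."""
    out = {mapping[k]: v for k, v in record.items()}
    out["diagnosis"] = out["diagnosis"].split(":")[-1]  # "ICD10:E11" -> "E11"
    return out

# Once records share a schema, aggregate counts (not individual cases)
# can be pooled across providers for public-health analysis.
pooled = [normalize(record_a, MAP_A), normalize(record_b, MAP_B)]
print(Counter(r["diagnosis"] for r in pooled))  # Counter({'E11': 2})
```

Without the shared schema, the two records could not be counted together at all; with it, only the aggregate tally needs to leave each provider.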

There is scope to launch collaborative projects to test the interoperability of data, standards and safeguards across the globe. The World Health Assembly’s consideration of a global strategy for digital health in 2020 presents an opportunity to launch such projects, which could initially be aimed at global health challenges such as Alzheimer’s and hypertension.

Improved digital cooperation on a data-driven approach to public health has the potential to lower costs, build new partnerships among hospitals, technology companies, insurance providers and research institutes and support the shift from treating diseases to improving wellness. Appropriate safeguards are needed to ensure the focus remains on improving health care outcomes. With testing, experience and necessary protective measures as well as guidelines for the responsible use of data, similar cooperation could emerge in many other fields related to the SDGs, from education to urban planning to agriculture…(More)”.

The Ethics of Big Data Applications in the Consumer Sector


Paper by Markus Christen et al.: “Business applications relying on processing of large amounts of heterogeneous data (Big Data) are considered to be key drivers of innovation in the digital economy. However, these applications also pose ethical issues that may undermine the credibility of data-driven businesses. In our contribution, we discuss ethical problems that are associated with Big Data such as: How are core values like autonomy, privacy, and solidarity affected in a Big Data world? Are some data a public good? Or: Are we obliged to divulge personal data to a certain degree in order to make the society more secure or more efficient?

We answer those questions by first outlining the ethical topics that are discussed in the scientific literature and the lay media using a bibliometric approach. Second, referring to the results of expert interviews and workshops with practitioners, we identify core norms and values affected by Big Data applications—autonomy, equality, fairness, freedom, privacy, property-rights, solidarity, and transparency—and outline how they are exemplified in examples of Big Data consumer applications, for example, in terms of informational self-determination, non-discrimination, or free opinion formation. Based on use cases such as personalized advertising, individual pricing, or credit risk management we discuss the process of balancing such values in order to identify legitimate, questionable, and unacceptable Big Data applications from an ethics point of view. We close with recommendations on how practitioners working in applied data science can deal with ethical issues of Big Data….(More)”.

The war to free science


Brian Resnick and Julia Belluz at Vox: “The 27,500 scientists who work for the University of California generate 10 percent of all the academic research papers published in the United States.

Their university recently put them in a strange position: Sometime this year, these scientists will not be able to directly access much of the world’s published research that they’re not involved in.

That’s because in February, the UC system — one of the country’s largest academic institutions, encompassing Berkeley, Los Angeles, Davis, and several other campuses — dropped its nearly $11 million annual subscription to Elsevier, the world’s largest publisher of academic journals.

On the face of it, this seemed like an odd move. Why cut off students and researchers from academic research?

In fact, it was a principled stance that may herald a revolution in the way science is shared around the world.

The University of California decided it doesn’t want scientific knowledge locked behind paywalls, and thinks the cost of academic publishing has gotten out of control.

Elsevier owns around 3,000 academic journals, and its articles account for some 18 percent of all the world’s research output. “They’re a monopolist, and they act like a monopolist,” says Jeffrey MacKie-Mason, head of the campus libraries at UC Berkeley and co-chair of the team that negotiated with the publisher. Elsevier makes huge profits on its journals, generating billions of dollars a year for its parent company RELX.

This is a story about more than subscription fees. It’s about how a private industry has come to dominate the institutions of science, and how librarians, academics, and even pirates are trying to regain control.

The University of California is not the only institution fighting back. “There are thousands of Davids in this story,” says University of California Davis librarian MacKenzie Smith, who, like so many other librarians around the world, has been pushing for more open access to science. “But only a few big Goliaths.”…(More)”.

Virtuous and vicious circles in the data life-cycle


Paper by Elizabeth Yakel, Ixchel M. Faniel, and Zachary J. Maiorana: “In June 2014, ‘Data sharing reveals complexity in the westward spread of domestic animals across Neolithic Turkey’, was published in PLoS One (Arbuckle et al. 2014). In this article, twenty-three authors, all zooarchaeologists, representing seventeen different archaeological sites in Turkey investigated the domestication of animals across Neolithic southwest Asia, a pivotal era of change in the region’s economy. The PLoS One article originated in a unique data sharing, curation, and reuse project in which a majority of the authors agreed to share their data and perform analyses across the aggregated datasets. The extent of data sharing and the breadth of data reuse and collaboration were previously unprecedented in archaeology. In the present article, we conduct a case study of the collaboration leading to the development of the PLoS One article. In particular, we focus on the data sharing, data curation, and data reuse practices exercised during the project in order to investigate how different phases in the data life-cycle affected each other.

Studies of data practices have generally engaged issues from the singular perspective of data producers, sharers, curators, or reusers. Furthermore, past studies have tended to focus on one aspect of the life-cycle (production, sharing, curation, reuse, etc.). A notable exception is Carlson and Anderson’s (2007) comparative case study of four research projects which discusses the life-cycle of data from production through sharing with an eye towards reuse. However, that study primarily addresses the process of data sharing. While we see from their research that data producers’ and curators’ decisions and actions regarding data are tightly coupled and have future consequences, those consequences are not fully explicated since the authors do not discuss reuse in depth.

Taking a perspective that captures the trajectory of data, our case study discusses actions and their consequences throughout the data life-cycle. Our research theme explores how different stakeholders and their work practices positively and/or negatively affected other phases of the life-cycle. More specifically, we focus on data production practices and data selection decisions made during data sharing as these have frequent and diverse consequences for other life-cycle phases in our case study. We address the following research questions:

  1. How do different aspects of data production positively and negatively impact other phases in the life-cycle?
  2. How do data selection decisions during sharing positively and negatively impact other phases in the life-cycle?
  3. How can the work of data curators intervene to reinforce positive actions or mitigate negative actions?…(More)”

The New York Times has a course to teach its reporters data skills, and now they’ve open-sourced it


Joshua Benton at Nieman Labs: “The New York Times wants more of its journalists to have those basic data skills, and now it’s releasing the curriculum they’ve built in-house out into the world, where it can be of use to reporters, newsrooms, and lots of other people too.

Here’s Lindsey Rogers Cook, an editor for digital storytelling and training at the Times, and the sort of person who is willing to have “spreadsheets make my heart sing” appear under her byline:

Even with some of the best data and graphics journalists in the business, we identified a challenge: data knowledge wasn’t spread widely among desks in our newsroom and wasn’t filtering into news desks’ daily reporting.

Yet fluency with numbers and data has become more important than ever. While journalists once were fond of joking that they got into the field because of an aversion to math, numbers now comprise the foundation for beats as wide-ranging as education, the stock market, the Census, and criminal justice. More data is released than ever before — there are nearly 250,000 datasets on data.gov alone — and increasingly, government, politicians, and companies try to twist those numbers to back their own agendas…

We wanted to help our reporters better understand the numbers they get from sources and government, and give them the tools to analyze those numbers. We wanted to increase collaboration between traditional and non-traditional journalists…And with more competition than ever, we wanted to empower our reporters to find stories lurking in the hundreds of thousands of databases maintained by governments, academics, and think tanks. We wanted to give our reporters the tools and support necessary to incorporate data into their everyday beat reporting, not just in big and ambitious projects.

….You can access the Times’ training materials here. Some of what you’ll find:

  • An outline of the data skills the course aims to teach. It’s all run on Google Docs and Google Sheets; class starts with the uber-basics (mean! median! sum!), crosses the bridge of pivot tables, and then heads into data cleaning and more advanced formulas. (A rough pandas equivalent of these opening steps is sketched after this list.)
  • The full day-by-day outline of the Times’ three-week course, which of course you’re free to use or reshape to your newsroom’s needs.
  • It’s not just about cells, columns, and rows — the course also includes more journalism-based information around ethical questions, how to use data effectively inside a story’s narrative, and how best to work with colleagues in the graphic department.
  • Cheat sheets! If you don’t have time to dig too deeply, they’ll give a quick hit of information: one, two, three, four, five.
  • Data sets that you use to work through the beginner, intermediate, and advanced stages of the training, including such journalism classics as census data, campaign finance data, and BLS data. But don’t be a dummy and try to write real news stories off these spreadsheets; the Times cautions in bold: “NOTE: We have altered many of these datasets for instructional purposes, so please download the data from the original source if you want to use it in your reporting.”
  • “How Not To Be Wrong,” which seems like a useful thing….(More)”
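For newsrooms that work in code rather than spreadsheets, here is a rough pandas equivalent of the course’s opening steps (summary statistics, a pivot table, and a little cleaning). The course itself runs on Google Sheets, and this dataset is invented purely for illustration:

```python
import pandas as pd

# Hypothetical grants table with the kind of mess reporters actually get:
# amounts stored as strings with thousands separators, plus a blank cell.
grants = pd.DataFrame({
    "agency": ["Education", "Education", "Labor", "Labor", "Labor"],
    "state": ["NY", "CA", "NY", "CA", "CA"],
    "amount": ["1,200", "950", "700", "1,100", None],
})

# Data cleaning: strip separators, coerce to numbers, drop blanks.
grants["amount"] = pd.to_numeric(
    grants["amount"].str.replace(",", ""), errors="coerce"
)
grants = grants.dropna(subset=["amount"])

# The uber-basics: sum, mean, median.
print(grants["amount"].sum(), grants["amount"].mean(), grants["amount"].median())

# A pivot table: total grant dollars by agency and state.
print(pd.pivot_table(grants, values="amount", index="agency",
                     columns="state", aggfunc="sum", fill_value=0))
```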