How Big Data Creates False Confidence


Jesse Dunietz at Nautilus: “…A feverish push for “big data” analysis has swept through biology, linguistics, finance, and every field in between. Although no one can quite agree how to define it, the general idea is to find datasets so enormous that they can reveal patterns invisible to conventional inquiry. The data are often generated by millions of real-world user actions, such as tweets or credit-card purchases, and they can take thousands of computers to collect, store, and analyze. To many companies and researchers, though, the investment is worth it because the patterns can unlock information about anything from genetic disorders to tomorrow’s stock prices.

But there’s a problem: It’s tempting to think that with such an incredible volume of data behind them, studies relying on big data couldn’t be wrong. But the bigness of the data can imbue the results with a false sense of certainty. Many of them are probably bogus—and the reasons why should give us pause about any research that blindly trusts big data.

In the case of language and culture, big data showed up in a big way in 2011, when Google released its Ngrams tool. Announced with fanfare in the journal Science, Google Ngrams allowed users to search for short phrases in Google’s database of scanned books—about 4 percent of all books ever published!—and see how the frequency of those phrases has shifted over time. The paper’s authors heralded the advent of “culturomics,” the study of culture based on reams of data, and since then Google Ngrams has been, well, largely an endless source of entertainment—but also a goldmine for linguists, psychologists, and sociologists. They’ve scoured its millions of books to show that, for instance, yes, Americans are becoming more individualistic; that we’re “forgetting our past faster with each passing year”; and that moral ideals are disappearing from our cultural consciousness.
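
The mechanics behind an Ngrams plot are simple in principle: for each year, count how often a phrase occurs and divide by the total number of n-grams published that year. Here is a minimal sketch of that calculation on a toy corpus (the corpus structure and example phrases are invented for illustration; this is not Google’s pipeline):

```python
from collections import Counter

def ngram_frequency(corpus_by_year, phrase):
    """Relative frequency of a phrase per year: occurrences / total n-grams that year.

    corpus_by_year maps a year to a list of plain-text documents. This is a
    toy illustration of the calculation, not Google's actual Ngrams pipeline.
    """
    n = len(phrase.split())
    freq = {}
    for year, docs in corpus_by_year.items():
        counts, total = Counter(), 0
        for doc in docs:
            tokens = doc.lower().split()
            grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            counts.update(grams)
            total += len(grams)
        freq[year] = counts[phrase.lower()] / total if total else 0.0
    return freq

# Toy usage with a made-up two-year "corpus":
corpus = {
    1950: ["hope springs eternal", "there is hope for the future"],
    2000: ["the market closed lower", "quarterly earnings beat estimates"],
}
print(ngram_frequency(corpus, "hope"))
```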

WE’RE LOSING HOPE: An Ngrams chart for the word “hope,” one of many intriguing plots found by xkcd author Randall Munroe. If Ngrams really does reflect our culture, we may be headed for a dark place.

The problems start with the way the Ngrams corpus was constructed. In a study published last October, three University of Vermont researchers pointed out that, in general, Google Books includes one copy of every book. This makes perfect sense for its original purpose: to expose the contents of those books to Google’s powerful search technology. From the angle of sociological research, though, it makes the corpus dangerously skewed….

Even once you get past the data sources, there’s still the thorny issue of interpretation. Sure, words like “character” and “dignity” might decline over the decades. But does that mean that people care about morality less? Not so fast, cautions Ted Underwood, an English professor at the University of Illinois, Urbana-Champaign. Conceptions of morality at the turn of the last century likely differed sharply from ours, he argues, and “dignity” might have been popular for non-moral reasons. So any conclusions we draw by projecting current associations backward are suspect.

Of course, none of this is news to statisticians and linguists. Data and interpretation are their bread and butter. What’s different about Google Ngrams, though, is the temptation to let the sheer volume of data blind us to the ways we can be misled.

This temptation isn’t unique to Ngrams studies; similar errors undermine all sorts of big data projects. Consider, for instance, the case of Google Flu Trends (GFT). Released in 2008, GFT would count words like “fever” and “cough” in millions of Google search queries, using them to “nowcast” how many people had the flu. With those estimates, public health officials could act two weeks before the Centers for Disease Control could calculate the true numbers from doctors’ reports.

Initially, GFT was claimed to be 97 percent accurate. But as a study out of Northeastern University documents, that accuracy was a fluke. First, GFT completely missed the “swine flu” pandemic in the spring and summer of 2009. (It turned out that GFT was largely predicting winter.) Then, the system began to overestimate flu cases. In fact, it overshot the peak 2013 numbers by a whopping 140 percent. Eventually, Google just retired the program altogether.

So what went wrong? As with Ngrams, people didn’t carefully consider the sources and interpretation of their data. The data source, Google searches, was not a static beast. When Google started auto-completing queries, users started just accepting the suggested keywords, distorting the searches GFT saw. On the interpretation side, GFT’s engineers initially let GFT take the data at face value; almost any search term was treated as a potential flu indicator. With millions of search terms, GFT was practically guaranteed to over-interpret seasonal words like “snow” as evidence of flu.
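
The failure mode is easy to reproduce in miniature. If a model is allowed to treat any query that correlates with past flu rates as a flu signal, purely seasonal terms look just as informative as genuinely flu-related ones. The sketch below uses synthetic data and a plain linear regression (GFT’s actual model was more elaborate) to make the point:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
weeks = np.arange(104)                          # two years of weekly data
season = np.cos(2 * np.pi * weeks / 52)         # peaks in winter

# Synthetic "true" flu rate and two kinds of query volumes.
flu_rate = 2.0 + 1.5 * season + rng.normal(0, 0.2, weeks.size)
q_fever = 0.8 * flu_rate + rng.normal(0, 0.2, weeks.size)       # genuinely related
q_snow = 3.0 + 2.5 * season + rng.normal(0, 0.2, weeks.size)    # seasonal, unrelated

for name, q in [("fever", q_fever), ("snow", q_snow)]:
    model = LinearRegression().fit(q.reshape(-1, 1), flu_rate)
    print(name, "in-sample R^2:", round(model.score(q.reshape(-1, 1), flu_rate), 2))

# Both R^2 values come out high: on historical data, "snow" queries look like
# a flu signal simply because flu is seasonal. A model that rewards such terms
# is predicting winter, not flu -- and will miss an off-season pandemic.
```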

But when big data isn’t seen as a panacea, it can be transformative. Several groups, like Columbia University researcher Jeffrey Shaman’s, for example, have outperformed the flu predictions of both the CDC and GFT by using the former to compensate for the skew of the latter. “Shaman’s team tested their model against actual flu activity that had already occurred during the season,” according to the CDC. By taking the immediate past into consideration, Shaman and his team fine-tuned their mathematical model to better predict the future. All it takes is for teams to critically assess their assumptions about their data….(More)

Accountable Algorithms


Paper by Joshua A. Kroll et al: “Many important decisions historically made by people are now made by computers. Algorithms count votes, approve loan and credit card applications, target citizens or neighborhoods for police scrutiny, select taxpayers for an IRS audit, and grant or deny immigration visas.

The accountability mechanisms and legal standards that govern such decision processes have not kept pace with technology. The tools currently available to policymakers, legislators, and courts were developed to oversee human decision-makers and often fail when applied to computers instead: for example, how do you judge the intent of a piece of software? Additional approaches are needed to make automated decision systems — with their potentially incorrect, unjustified or unfair results — accountable and governable. This Article reveals a new technological toolkit to verify that automated decisions comply with key standards of legal fairness.

We challenge the dominant position in the legal literature that transparency will solve these problems. Disclosure of source code is often neither necessary (because of alternative techniques from computer science) nor sufficient (because of the complexity of code) to demonstrate the fairness of a process. Furthermore, transparency may be undesirable, such as when it permits tax cheats or terrorists to game the systems determining audits or security screening.

The central issue is how to assure the interests of citizens, and society as a whole, in making these processes more accountable. This Article argues that technology is creating new opportunities — more subtle and flexible than total transparency — to design decision-making algorithms so that they better align with legal and policy objectives. Doing so will improve not only the current governance of algorithms, but also — in certain cases — the governance of decision-making in general. The implicit (or explicit) biases of human decision-makers can be difficult to find and root out, but we can peer into the “brain” of an algorithm: computational processes and purpose specifications can be declared prior to use and verified afterwards.

The technological tools introduced in this Article apply widely. They can be used in designing decision-making processes from both the private and public sectors, and they can be tailored to verify different characteristics as desired by decision-makers, regulators, or the public. By forcing a more careful consideration of the effects of decision rules, they also engender policy discussions and closer looks at legal standards. As such, these tools have far-reaching implications throughout law and society.

Part I of this Article provides an accessible and concise introduction to foundational computer science concepts that can be used to verify and demonstrate compliance with key standards of legal fairness for automated decisions without revealing key attributes of the decision or the process by which the decision was reached. Part II then describes how these techniques can assure that decisions are made with the key governance attribute of procedural regularity, meaning that decisions are made under an announced set of rules consistently applied in each case. We demonstrate how this approach could be used to redesign and resolve issues with the State Department’s diversity visa lottery. In Part III, we go further and explore how other computational techniques can assure that automated decisions preserve fidelity to substantive legal and policy choices. We show how these tools may be used to assure that certain kinds of unjust discrimination are avoided and that automated decision processes behave in ways that comport with the social or legal standards that govern the decision. We also show how algorithmic decision-making may even complicate existing doctrines of disparate treatment and disparate impact, and we discuss some recent computer science work on detecting and removing discrimination in algorithms, especially in the context of big data and machine learning. And lastly in Part IV, we propose an agenda to further synergistic collaboration between computer science, law and policy to advance the design of automated decision processes for accountability….(More)”
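
One foundational building block the authors point to is the cryptographic commitment: a decision-maker can publish a fingerprint of its decision rule before applying it, then later reveal the rule so anyone can check that what was announced is what was actually used. The sketch below shows only the bare idea, as a hash-based commitment in Python; it is an illustration of the concept, not the Article’s full toolkit, which also draws on techniques such as zero-knowledge proofs:

```python
import hashlib
import secrets

def commit(policy_text: str) -> tuple[str, str]:
    """Commit to a decision rule without revealing it.

    Returns (commitment, nonce). Publishing the commitment up front binds the
    agency to the rule; the random nonce keeps the rule hidden until reveal.
    """
    nonce = secrets.token_hex(16)
    digest = hashlib.sha256((nonce + policy_text).encode()).hexdigest()
    return digest, nonce

def verify(commitment: str, nonce: str, revealed_policy: str) -> bool:
    """Check that the revealed rule matches the earlier commitment."""
    return hashlib.sha256((nonce + revealed_policy).encode()).hexdigest() == commitment

# Before the lottery runs: publish the commitment (a hypothetical rule for illustration).
policy = "rank applicants by SHA256(application_id + public_random_seed)"
commitment, nonce = commit(policy)

# After decisions are made: reveal the policy and nonce; anyone can verify.
assert verify(commitment, nonce, policy)
assert not verify(commitment, nonce, "a quietly different rule")
```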

Emerging urban digital infomediaries and civic hacking in an era of big data and open data initiatives


Chapter by Thakuriah, P., Dirks, L., and Keita, Y. in Seeing Cities Through Big Data: Research Methods and Applications in Urban Informatics (forthcoming): “This paper assesses non-traditional urban digital infomediaries who are pushing the agenda of urban Big Data and Open Data. Our analysis identified a mix of private, public, non-profit and informal infomediaries, ranging from very large organizations to independent developers. Using a mixed-methods approach, we identified four major groups of organizations within this dynamic and diverse sector: general-purpose ICT providers, urban information service providers, open and civic data infomediaries, and independent and open source developers. A total of nine organizational types are identified within these four groups. We align these nine organizational types along five dimensions that account for their mission and major interests, products and services, as well as the activities they undertake: techno-managerial, scientific, business and commercial, urban engagement, and openness and transparency. We discuss urban ICT entrepreneurs, and the role of informal networks involving independent developers, data scientists and civic hackers, in a domain that historically involved professionals in urban planning and public management. Additionally, we examine convergence in the sector by analyzing overlaps in their activities, as determined by a text mining exercise of organizational webpages. We also consider increasing similarities in the products and services offered by the infomediaries, while highlighting ideological tensions that might arise given the overall complexity of the sector and differences in the backgrounds and end-goals of the participants involved. There is much room for creation of knowledge and value networks in the urban data sector and for improved cross-fertilization among bodies of knowledge….(More)”
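
The convergence analysis rests on a standard text-mining move: represent each organization’s webpage as a term-weight vector and measure how much the vectors overlap. A minimal sketch of that kind of comparison is below, using toy page snippets with TF-IDF and cosine similarity; the authors’ actual procedure may well differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative stand-ins for scraped organizational webpages.
pages = {
    "general_ict_provider": "cloud platforms, analytics services, smart city sensors",
    "civic_data_infomediary": "open data portals, civic hacking, transparency tools",
    "independent_developer": "open source tools built on city open data portals",
}

names = list(pages)
tfidf = TfidfVectorizer(stop_words="english").fit_transform(pages.values())
similarity = cosine_similarity(tfidf)

# Pairwise overlap in activities, as reflected in webpage vocabulary.
for i, a in enumerate(names):
    for j, b in enumerate(names):
        if i < j:
            print(f"{a} vs {b}: {similarity[i, j]:.2f}")
```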

Big data privacy: the datafication of personal information


Jens-Erik Mai in The Information Society: “In the age of big data we need to think differently about privacy. We need to shift our thinking from definitions of privacy (characteristics of privacy) to models of privacy (how privacy works). Moreover, in addition to the existing models of privacy (the surveillance model and the capture model), we need to also consider a new model, the datafication model presented in this paper, wherein new personal information is deduced by employing predictive analytics on already-gathered data. These three models of privacy supplement each other; they are not competing understandings of privacy. This broadened approach will take our thinking beyond the current preoccupation with whether or not individuals’ consent was secured for data collection, to privacy issues arising from the development of new information about individuals’ likely behavior through analysis of already collected data; this new information can violate privacy but does not call for consent….(More)”
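
The datafication model is easiest to see in code: no new data is collected about the individual, yet a new piece of personal information comes into being when a predictive model is run over data already on file. The sketch below, built entirely on synthetic data and a generic classifier purely to illustrate the mechanism Mai describes, infers an attribute the new customer never disclosed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Already-gathered data: counts of purchases in three product categories.
purchases = rng.poisson(lam=[2, 1, 3], size=(500, 3))
# An attribute known for past customers (a synthetic label, for illustration only).
attribute = (purchases[:, 0] + rng.normal(0, 1, 500) > 2).astype(int)

model = LogisticRegression().fit(purchases, attribute)

# A new customer who never disclosed the attribute: the model deduces it anyway.
new_customer = np.array([[4, 0, 2]])
print("inferred probability:", model.predict_proba(new_customer)[0, 1].round(2))
```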

How Big Data Harms Poor Communities


Kaveh Waddell in the Atlantic: “Big data can help solve problems that are too big for one person to wrap their head around. It’s helped businesses cut costs, cities plan new developments, intelligence agencies discover connections between terrorists, health officials predict outbreaks, and police forces get ahead of crime. Decision-makers are increasingly told to “listen to the data,” and make choices informed by the outputs of complex algorithms.

But when the data is about humans—especially those who lack a strong voice—those algorithms can become oppressive rather than liberating. For many poor people in the U.S., the data that’s gathered about them at every turn can obstruct attempts to escape poverty.

Low-income communities are among the most surveilled communities in America. And it’s not just the police that are watching, says Michele Gilman, a law professor at the University of Baltimore and a former civil-rights attorney at the Department of Justice. Public-benefits programs, child-welfare systems, and monitoring programs for domestic-abuse offenders all gather large amounts of data on their users, who are disproportionately poor.

In certain places, in order to qualify for public benefits like food stamps, applicants have to undergo fingerprinting and drug testing. Once people start receiving the benefits, officials regularly monitor them to see how they spend the money, and sometimes check in on them in their homes.

Data gathered from those sources can end up feeding back into police systems, leading to a cycle of surveillance. “It becomes part of these big-data information flows that most people aren’t aware they’re captured in, but that can have really concrete impacts on opportunities,” Gilman says.

Once an arrest crops up on a person’s record, for example, it becomes much more difficult for that person to find a job, secure a loan, or rent a home. And that’s not necessarily because loan officers or hiring managers pass over applicants with arrest records—computer systems that whittle down tall stacks of resumes or loan applications will often weed some out based on run-ins with the police.
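
The weeding-out described here is often nothing more sophisticated than a filter applied at scale. A deliberately simple sketch (with hypothetical fields standing in for whatever a proprietary system actually uses) shows how an applicant is dropped without any human ever weighing the record:

```python
applicants = [
    {"name": "A", "qualified": True, "arrest_record": False},
    {"name": "B", "qualified": True, "arrest_record": True},   # arrested, never convicted
    {"name": "C", "qualified": False, "arrest_record": False},
]

# One line of filtering quietly removes applicant B from consideration,
# regardless of qualifications or whether charges were ever brought.
shortlist = [a for a in applicants if a["qualified"] and not a["arrest_record"]]
print([a["name"] for a in shortlist])   # ['A']
```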

When big-data systems make predictions that cut people off from meaningful opportunities like these, they can violate the legal principle of presumed innocence, according to Ian Kerr, a professor and researcher of ethics, law, and technology at the University of Ottawa.

Outside the court system, “innocent until proven guilty” is upheld by people’s due-process rights, Kerr says: “A right to be heard, a right to participate in one’s hearing, a right to know what information is collected about me, and a right to challenge that information.” But when opaque data-driven decision-making takes over—what Kerr calls “algorithmic justice”—some of those rights begin to erode….(More)”

Big Data in the Public Sector


Chapter by Ricard Munné in New Horizons for a Data-Driven Economy: “The public sector is becoming increasingly aware of the potential value to be gained from big data, as governments generate and collect vast quantities of data through their everyday activities.

The benefits of big data in the public sector can be grouped into three major areas, based on a classification of the types of benefits: advanced analytics, through automated algorithms; improvements in effectiveness, providing greater internal transparency; and improvements in efficiency, where better services can be provided through the personalization of services and by learning from the performance of such services.

The chapter examined several drivers and constraints that can boost or stall the development of big data in the sector, depending on how they are addressed. The findings, after analysing the requirements and the technologies currently available, show that there are open research questions to be addressed in order to develop these technologies to the point where competitive and effective solutions can be built. The main developments are required in the fields of scalability of data analysis, pattern discovery, and real-time applications. Also required are improvements in provenance for the sharing and integration of data from the public sector. It is also extremely important to provide integrated security and privacy mechanisms in big data applications, as the public sector collects vast amounts of sensitive data. Finally, respecting the privacy of citizens is a mandatory obligation in the European Union….(More)”

Data and Humanitarian Response


The GovLab: “As part of an ongoing effort to build a knowledge base for the field of opening governance by organizing and disseminating its learnings, the GovLab Selected Readings series provides an annotated and curated collection of recommended works on key opening governance topics. In this edition, we explore the literature on Data and Humanitarian Response. To suggest additional readings on this or any other topic, please email biblio@thegovlab.org. All our Selected Readings can be found here.

Data and its uses for Governance

Context

Data, when used well in a trusted manner, allows humanitarian organizations to innovate how to respond to emergency events, including better coordination of post-disaster relief efforts, the ability to harness local knowledge to create more targeted relief strategies, and tools to predict and monitor disasters in real time. Consequently, in recent years both multinational groups and community-based advocates have begun to integrate data collection and evaluation strategies into their humanitarian operations, to better and more quickly respond to emergencies. However, this movement poses a number of challenges. Compared to the private sector, humanitarian organizations are often less equipped to successfully analyze and manage big data, which poses a number of risks related to the security of victims’ data. Furthermore, complex power dynamics which exist within humanitarian spaces may be further exacerbated through the introduction of new technologies and big data collection mechanisms. Below, we share:

  • Selected Reading List (summaries and hyperlinks)
  • Annotated Selected Reading List
  • Additional Readings….(More)”

What Should We Do About Big Data Leaks?


Paul Ford at the New Republic: “I have a great fondness for government data, and the government has a great fondness for making more of it. Federal elections financial data, for example, with every contribution identified, connected to a name and address. Or the results of the census. I don’t know if you’ve ever had the experience of downloading census data but it’s pretty exciting. You can hold America on your hard drive! Meditate on the miracles of zip codes, the way the country is held together and addressable by arbitrary sets of digits.

You can download whole books, in PDF format, about the foreign policy of the Reagan Administration as it related to Russia. Negotiations over which door the Soviet ambassador would use to enter a building. Gigabytes and gigabytes of pure joy for the ephemeralist. The government is the greatest creator of ephemera ever.

Consider the Financial Crisis Inquiry Commission, or FCIC, created in 2009 to figure out exactly how the global economic pooch was screwed. The FCIC has made so much data, and has done an admirable job (caveats noted below) of arranging it. So much stuff. There are reams of treasure on a single FCIC web site, hosted at Stanford Law School: Hundreds of MP3 files, for example, with interviews with Jamie Dimon of JPMorgan Chase and Lloyd Blankfein of Goldman Sachs. I am desperate to find time to write some code that automatically extracts random audio snippets from each and puts them on top of a slow ambient drone with plenty of reverb, so that I can relax to the dulcet tones of the financial industry explaining away its failings. (There’s a Paul Krugman interview that I assume is more critical.)

The recordings are just the beginning. They’ve released so many documents, and with the documents, a finding aid that you can download in handy PDF format, which will tell you where to, well, find things, pointing to thousands of documents. That aid alone is 1,439 pages.

Look, it is excellent that this exists, in public, on the web. But it also presents a very contemporary problem: What is transparency in the age of massive database drops? The data is available, but locked in MP3s and PDFs and other documents; it’s not searchable in the way a web page is searchable, not easy to comment on or share.

Consider the WikiLeaks release of State Department cables. They were exhausting, there were so many of them, they were in all caps. Or the trove of data Edward Snowden gathered on a USB drive, or Chelsea Manning on CD. And the Ashley Madison leak, spread across database files and logs of credit card receipts. The massive and sprawling Sony leak, complete with whole email inboxes. And with the just-released Panama Papers, we see two exciting new developments: First, the consortium of media organizations that managed the leak actually came together and collectively, well, branded the papers, down to a hashtag (#panamapapers), informational website, etc. Second, the size of the leak itself—2.5 terabytes!—became a talking point, even though the exact description of what was contained within those terabytes was harder to understand. This, said the consortium of journalists that notably did not include The New York Times, The Washington Post, etc., is the big one. Stay tuned. And we are. But the fact remains: These artifacts are not accessible to any but the most assiduous amateur conspiracist; they’re the domain of professionals with the time and money to deal with them. Who else could be bothered?

If you watched the movie Spotlight, you saw journalists at work, pawing through reams of documents, going through, essentially, phone books. I am an inveterate downloader of such things. I love what they represent. And I’m also comfortable with many-gigabyte corpora spread across web sites. I know how to fetch data, how to consolidate it, and how to search it. I share this skill set with many data journalists, and these capacities have, in some ways, become the sole province of the media. Organs of journalism are among the only remaining cultural institutions that can fund investigations of this size and tease the data apart, identifying linkages and thus constructing informational webs that can, with great effort, be turned into narratives, yielding something like what we call “a story” or “the truth.” 

Spotlight was set around 2001, and it features a lot of people looking at things on paper. The problem has changed greatly since then: The data is everywhere. The media has been forced into a new cultural role, that of the arbiter of the giant and semi-legal database. ProPublica, a nonprofit that does a great deal of data gathering and data journalism and then shares its findings with other media outlets, is one example; it funded a project called DocumentCloud with other media organizations that simplifies the process of searching through giant piles of PDFs (e.g., court records, or the results of Freedom of Information Act requests).
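
The workflow Ford describes (fetch, consolidate, search) can be approximated with very ordinary tooling: extract text from the documents, load it into a full-text index, and query it. A minimal sketch using SQLite’s built-in FTS5 extension is below; the file names and snippets are invented, FTS5 availability depends on how your Python’s SQLite was built, and real leaks arrive in far messier formats than plain text:

```python
import sqlite3

# A stand-in corpus: in practice you would first extract text from PDFs,
# interview transcripts, mailbox exports, and so on.
documents = {
    "fcic_interview_001.txt": "discussion of mortgage-backed securities and risk",
    "fcic_interview_002.txt": "questions about leverage at the investment banks",
    "finding_aid_p0433.txt": "index of exhibits relating to credit default swaps",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(name, body)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", documents.items())

# Full-text query: which documents mention leverage or swaps?
for name, in conn.execute("SELECT name FROM docs WHERE docs MATCH 'leverage OR swaps'"):
    print(name)
```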

At some level the sheer boredom and drudgery of managing these large data leaks make them immune to casual interest; even the Ashley Madison leak, which I downloaded, was basically an opaque pile of data and really quite boring unless you had some motive to poke around.

If this is the age of the citizen journalist, or at least the citizen opinion columnist, it’s also the age of the data journalist, with the news media acting as product managers of data leaks, making the information usable, browsable, attractive. There is an uneasy partnership between leakers and the media, just as there is an uneasy partnership between the press and the government, which would like some credit for its efforts, thank you very much, and wouldn’t mind if you gave it some points for transparency while you’re at it.

Pause for a second. There’s a glut of data, but most of it comes to us in ugly formats. What would happen if the things released in the interest of transparency were released in actual transparent formats?…(More)”

New Orleans Gamifies the City Budget


Kelsey E. Thomas at Next City: “New Orleanians can try their hand at being “mayor for a day” with a new interactive website released by the Committee for a Better New Orleans Wednesday.

The Big Easy Budget Game uses open data from the city to allow players to create their own version of an operating budget. Players are given a digital $602 million, and have to balance the budget — keeping in mind the government’s responsibilities, previous year’s spending and their personal priorities.

Each department in the game has a minimum funding level (players can’t just quit funding public schools if they feel like it), and restricted funding, such as state or federal dollars, is off limits.
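
Under the hood, a game like this is largely a constraint check: a player’s allocations must respect each department’s floor, leave restricted state and federal funds untouched, and add up to the available total. A minimal sketch of that validation logic, with hypothetical department names and floor amounts rather than the game’s real data, might look like this:

```python
TOTAL_BUDGET = 602_000_000  # the game's digital $602 million

# Hypothetical floors; restricted (state/federal) funds are simply not playable.
minimum_funding = {
    "public_schools": 150_000_000,
    "police": 120_000_000,
    "sanitation": 40_000_000,
    "parks": 10_000_000,
}

def validate(allocation: dict[str, int]) -> list[str]:
    """Return a list of rule violations for a player's proposed budget."""
    problems = []
    for dept, floor in minimum_funding.items():
        if allocation.get(dept, 0) < floor:
            problems.append(f"{dept} is below its minimum of ${floor:,}")
    if sum(allocation.values()) != TOTAL_BUDGET:
        problems.append("allocations must add up to exactly the available total")
    return problems

player_budget = {
    "public_schools": 160_000_000,
    "police": 130_000_000,
    "sanitation": 45_000_000,
    "parks": 267_000_000,
}
print(validate(player_budget) or "balanced budget accepted")
```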

CBNO hopes to attract 600 players this year, and plans to compile the data from each player into a crowdsourced meta-budget called “The People’s Budget.” Next fall, the People’s Budget will be released along with the city’s proposed 2017 budget.

Along with the budgeting game, CBNO released a more detailed website, also using the city’s open data, that breaks down the city’s budgeted versus actual spending from 2007 to now and is filterable. The goal is to allow users without big data experience to easily research funding relevant to their neighborhoods.

Many cities have been releasing interactive websites to make their data more accessible to residents. Checkbook NYC updates more than $70 billion in city expenses daily and breaks them down by transaction. Fiscal Focus Pittsburgh is an online visualization tool that outlines revenues and expenses in the city’s budget….(More)”

Open data and the API economy: when it makes sense to give away data


At ZDNet: “Open data is one of those refreshing trends that flows in the opposite direction of the culture of fear that has developed around data security. Instead of putting data under lock and key, surrounded by firewalls and sandboxes, some organizations see value in making data available to all comers — especially developers.

The GovLab.org, a nonprofit advocacy group, published an overview of the benefits governments and organizations are realizing from open data, as well as some of the challenges. The group defines open data as “publicly available data that can be universally and readily accessed, used and redistributed free of charge. It is structured for usability and computability.”…

For enterprises, an open-data stance may be the fuel to build a vibrant ecosystem of developers and business partners. Scott Feinberg, API architect for The New York Times, is one of the people helping to lead the charge to open-data ecosystems. In a recent CXOTalk interview with ZDNet colleague Michael Krigsman, he explains how, through the NYT APIs program, developers can sign up for access to 165 years’ worth of content.

But it requires a lot more than simply throwing some APIs out into the market. Establishing such a comprehensive effort across APIs requires a change in mindset that many organizations may not be ready for, Feinberg cautions. “You can’t be stingy,” he says. “You have to just give it out. When we launched our developer portal there’s a lot of questions like, are people going to be stealing our data, questions like that. Just give it away. You don’t have to give it all but don’t be stingy, and you will find that first off not that many people are going to use it at first. You’re going to find that out, but the people who do, you’re going to find those passionate people who are really interested in using your data in new ways.”

Feinberg clarifies that the NYT’s APIs are not giving out articles for free. Rather, he explains, “what we give is everything but article content. You can search for articles. You can find out what’s trending. You can almost do anything you want with our data through our APIs with the exception of actually reading all of the content. It’s really about giving people the opportunity to really interact with your content in ways that you’ve never thought of, and empowering your community to figure out what they want. You know while we don’t give our actual article text away, we give pretty much everything else and people build a lot of really cool stuff on top of that.”
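
In practice that access runs through the developer portal Feinberg mentions: register for a key, then call the public endpoints. A hedged example against the Article Search API is below; the endpoint and response fields follow the commonly documented v2 interface, but check the current NYT developer documentation before relying on them:

```python
import requests

API_KEY = "your-key-from-developer.nytimes.com"  # placeholder

def search_articles(query: str):
    """Query the NYT Article Search API for headlines matching a term."""
    resp = requests.get(
        "https://api.nytimes.com/svc/search/v2/articlesearch.json",
        params={"q": query, "api-key": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    for doc in resp.json().get("response", {}).get("docs", []):
        # Metadata only -- headline, date, URL -- not the article text itself.
        print(doc.get("pub_date", "")[:10], doc.get("headline", {}).get("main"))

search_articles("open data")
```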

Open data sets, of course, have to be worthy of the APIs that offer them. In his post, Borne outlines the seven qualities open data needs to have to be of value to developers and consumers. (Yes, they’re also “Vs” like big data.)

  1. Validity: It’s “critical to pay attention to these data validity concerns when your organization’s data are exposed to scrutiny and inspection by others,” Borne states.
  2. Value: The data needs to be the font of new ideas, new businesses, and innovations.
  3. Variety: Exposing the wide variety of data available can be “a scary proposition for any data scientist,” Borne observes, but nonetheless is essential.
  4. Voice: Remember that “your open data becomes the voice of your organization to your stakeholders.”
  5. Vocabulary: “The semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use,” says Borne. “Search, discovery, and proper reuse of data all require good metadata, descriptions, and data modeling.”
  6. Vulnerability: Accept that open data, because it is so open, will be subjected to “misuse, abuse, manipulation, or alteration.”
  7. proVenance: This is the governance requirement behind open data offerings. “Provenance includes ownership, origin, chain of custody, transformations that have been made to it, processing that has been applied to it (including which versions of processing software were used), the data’s uses and their context, and more,” says Borne….(More)”