The Declassification Engine

Wired: “The CIA offers an electronic search engine that lets you mine about 11 million agency documents that have been declassified over the years. It’s called CREST, short for CIA Records Search Tool. But this represents only a portion the CIA’s declassified materials, and if you want unfettered access to the search engine, you’ll have to physically visit the National Archives at College Park, Maryland….
a new project launched by a team of historians, mathematicians, and computer scientists at Columbia University in New York City. Led by Matthew Connelly — a Columbia professor trained in diplomatic history — the project is known as The Declassification Engine, and it seeks to provide a single online database for declassified documents from across the federal government, including the CIA, the State Department, and potentially any other agency.
The project is still in the early stages, but the team has already assembled a database of documents that stretches back to the 1940s, and it has begun building new tools for analyzing these materials. In aggregating all documents into a single database, the researchers hope to not only provide quicker access to declassified materials, but to glean far more information from these documents than we otherwise could.
In the parlance of the day, the project is tackling these documents with the help of Big Data. If you put enough of this declassified information in a single place, Connelly believes, you can begin to predict what government information is still being withheld”

Deepbills project

Cato Institute: “The Deepbills project takes the raw XML of Congressional bills (available at FDsys and Thomas) and adds additional semantic information to them in inside the text.

You can download the continuously-updated data at

Congress already produces machine-readable XML of almost every bill it proposes, but that XML is designed primarily for formatting a paper copy, not for extracting information. For example, it’s not currently possible to find every mention of an Agency, every legal reference, or even every spending authorization in a bill without having a human being read it….
Currently the following information is tagged:

  • Legal citations…
  • Budget Authorities (both Authorizations of Appropriations and Appropriations)…
  • Agencies, bureaus, and subunits of the federal government.
  • Congressional committees
  • Federal elective officeholders (Congressmen)”

Intel Fuels a Rebellion Around Your Data

we the dataAntonio Regalado and Jessica Leber in MIT Technology Review:”Intel Labs, the company’s R&D arm, is launching an initiative around what it calls the “data economy”—how consumers might capture more of the value of their personal information, like digital records of their their location or work history. To make this possible, Intel is funding hackathons to urge developers to explore novel uses of personal data. It has also paid for a rebellious-sounding website called We the Data, featuring raised fists and stories comparing Facebook to Exxon Mobil.
Intel’s effort to stir a debate around “your data” is just one example of how some companies—and society more broadly—are grappling with a basic economic asymmetry of the big data age: they’ve got the data, and we don’t.

Data Edge

Steven Weber, professor in the School of Information and Political Science department at UC Berkeley, in Policy by the Numbers“It’s commonly said that most people overestimate the impact of technology in the short term, and underestimate its impact over the longer term.
Where is Big Data in 2013? Starting to get very real, in our view, and right on the cusp of underestimation in the long term. The short term hype cycle is (thankfully) burning itself out, and the profound changes that data science can and will bring to human life are just now coming into focus. It may be that Data Science is right now about where the Internet itself was in 1993 or so. That’s roughly when it became clear that the World Wide Web was a wind that would blow across just about every sector of the modern economy while transforming foundational things we thought were locked in about human relationships, politics, and social change. It’s becoming a reasonable bet that Data Science is set to do the same—again, and perhaps even more profoundly—over the next decade. Just possibly, more quickly than that….
Can data, no matter how big, change the world for the better? It may be the case that in some fields of human endeavor and behavior, the scientific analysis of big data by itself will create such powerful insights that change will simply have to happen, that businesses will deftly re-organize, that health care will remake itself for efficiency and better outcomes, that people will adopt new behaviors that make them happier, healthier, more prosperous and peaceful. Maybe. But almost everything we know about technology and society across human history argues that it won’t be so straightforward.
…join senior industry and academic leaders at DataEDGE at UC Berkeley on May 30-31 to engage in what will be a lively and important conversation aimed at answering today’s questions about the data science revolution—and formulating tomorrow’s.

Wikipedia Recent Changes Map


The Verge: “By watching a new visualization, known plainly as the Wikipedia Recent Changes Map, viewers can see the location of every unregistered Wikipedia user who makes a change to the open encyclopedia. It provides a voyeuristic look at the rate that knowledge is contributed to the website, giving you the faintest impression of the Spaniard interested in the television show Jackass or the Brazilian who defaced the page on the Jersey Devil to feature a photograph of the new pope. Though the visualization moves quickly, it’s only displaying about one-fifth of the edits being made: Wikipedia doesn’t reveal location data for registered users, and unregistered users make up just 15 to 20 percent of all contribution, according to studies of the website.”

Social networks as evolutionary game theory

in the Financial Times: “FT Alphaville has been taking a closer look at the collaborative economy, and noting the stellar growth this mysterious sector has been experiencing of late.
An important question to consider, however, is to what degree is this growth being driven by a genuine rise in reciprocity and altruism in the economy — or to what degree is this just the result of natural opportunism…
Which begs the question why should anyone put a free good out there for the taking anyway? And why is it that in most collaborative models there are very few examples of people abusing the system?
With respects to the free issue, internet pioneer Jaron Lanier believes this is because there isn’t really any such thing as free at all. What appears free is usually a veiled reciprocity or exploitation in disguise….
Lanier controversially believes users should be paid for that contribution. But in doing so we would argue that he forgets that the relationship Facebook has with its users is in fact much more reciprocal than exploitative. Users get a free platform, Facebook gets their data.
What’s more, as the BBC’s tech expert Bill Thompson has commented before, user content doesn’t really have much value on its own. It is only when that data is pooled together on a massive scale which allows the economies of scale to make sense. At least in a way that “the system” feels keen to reward. It is not independent data that has value, it is networked data that the system is demanding. Consequently, there is possibly some form of social benefit associated with contributing data to the platform, which is yet to be recognised….
A rise in collaboration, however, suggests there is more chance of personal survival if everyone collaborates together (and does not cheat the system). There is less incentive to cheat the system. In the current human economy context then, has collaboration ended up being the best pay-off for all ?
And in that context has social media, big data and the rise of networked communities simply encouraged participants in the universal survival game of prisoner’s dilemma to take the option that’s best for all?
We obviously have no idea if that’s the case, but it seems a useful thought experiment for us all to run through.”

Global Internet Policy Observatory (GIPO)

European Commission Press Release: “The Commission today unveiled plans for the Global Internet Policy Observatory (GIPO), an online platform to improve knowledge of and participation of all stakeholders across the world in debates and decisions on Internet policies. GIPO will be developed by the Commission and a core alliance of countries and Non Governmental Organisations involved in Internet governance. Brazil, the African Union, Switzerland, the Association for Progressive Communication, Diplo Foundation and the Internet Society have agreed to cooperate or have expressed their interest to be involved in the project.
The Global Internet Policy Observatory will act as a clearinghouse for monitoring Internet policy, regulatory and technological developments across the world.
It will:

  • automatically monitor Internet-related policy developments at the global level, making full use of “big data” technologies;
  • identify links between different fora and discussions, with the objective to overcome “policy silos”;
  • help contextualise information, for example by collecting existing academic information on a specific topic, highlighting the historical and current position of the main actors on a particular issue, identifying the interests of different actors in various policy fields;
  • identify policy trends, via quantitative and qualitative methods such as semantic and sentiment analysis;
  • provide easy-to-use briefings and reports by incorporating modern visualisation techniques;”

The Commodification of Patient Opinion: the Digital Patient Experience Economy in the Age of Big Data

Paper by Lupton, Deborah, from the Sydney Unversity’s Department of Sociology and Social Policy . Abstract: “As part of the digital health phenomenon, a plethora of interactive digital platforms have been established in recent years to elicit lay people’s experiences of illness and healthcare. The function of these platforms, as expressed on the main pages of their websites, is to provide the tools and forums whereby patients and caregivers, and in some cases medical practitioners, can share their experiences with others, benefit from the support and knowledge of other contributors and contribute to large aggregated data archives as part of developing better medical treatments and services and conducting medical research.
However what may not always be readily apparent to the users of these platforms are the growing commercial uses by many of the platforms’ owners of the archives of the data they contribute. This article examines this phenomenon of what I term ‘the digital patient experience economy’. In so doing I discuss such aspects as prosumption, the phenomena of big data and metric assemblages, the discourse and ethic of sharing and the commercialisation of affective labour via such platforms. I argue that via these online platforms patients’ opinions and experiences may be expressed in more diverse and accessible forums than ever before, but simultaneously they have become exploited in novel ways.”

Bringing the deep, dark world of public data to light

public_img03Venturebeat: “The realm of public data is like a vast cave. It is technically open to all, but it contains many secrets and obstacles within its walls.
Enigma launched out of beta today to shed light on this hidden world. This “big data” startup focuses on data in the public domain, such as those published by governments, NGOs, and the media….
The company describes itself as “Google for public data.” Using a combination of automated web crawlers and directly reaching out to government agencies, Engima’s database contains billions of public records across more than 100,000 datasets. Pulling them all together breaks down the barriers that exist between various local, state, federal, and institutional search portals. On top of this information is an “entity graph” which searches through the data to discover relevant results. Furthermore, once the information is broken out of the silos, users can filter, reshape, and connect various datasets to find correlations….
The technology has a wide range of applications, including professional services, finance, news media, big data, and academia. Engima has formed strategic partnerships in each of these verticals with Deloitte, Gerson Lehrman Group, The New York Times, S&P Capital IQ, and Harvard Business School, respectively.”

The Big Data Debate: Correlation vs. Causation

Gil Press: “In the first quarter of 2013, the stock of big data has experienced sudden declines followed by sporadic bouts of enthusiasm. The volatility—a new big data “V”—continues and Ted Cuzzillo summed up the recent negative sentiment in “Big data, big hype, big danger” on SmartDataCollective:
“A remarkable thing happened in Big Data last week. One of Big Data’s best friends poked fun at one of its cornerstones: the Three V’s. The well-networked and alert observer Shawn Rogers, vice president of research at Enterprise Management Associates, tweeted his eight V’s: ‘…Vast, Volumes of Vigorously, Verified, Vexingly Variable Verbose yet Valuable Visualized high Velocity Data.’ He was quick to explain to me that this is no comment on Gartner analyst Doug Laney’s three-V definition. Shawn’s just tired of people getting stuck on V’s.”…
Cuzzillo is joined by a growing chorus of critics that challenge some of the breathless pronouncements of big data enthusiasts. Specifically, it looks like the backlash theme-of-the-month is correlation vs. causation, possibly in reaction to the success of Viktor Mayer-Schönberger and Kenneth Cukier’s recent big data book in which they argued for dispensing “with a reliance on causation in favor of correlation”…
In “Steamrolled by Big Data,” The New Yorker’s Gary Marcus declares that “Big Data isn’t nearly the boundless miracle that many people seem to think it is.”…
Matti Keltanen at The Guardian agrees, explaining “Why ‘lean data’ beats big data.” Writes Keltanen: “…the lightest, simplest way to achieve your data analysis goals is the best one…The dirty secret of big data is that no algorithm can tell you what’s significant, or what it means. Data then becomes another problem for you to solve. A lean data approach suggests starting with questions relevant to your business and finding ways to answer them through data, rather than sifting through countless data sets. Furthermore, purely algorithmic extraction of rules from data is prone to creating spurious connections, such as false correlations… today’s big data hype seems more concerned with indiscriminate hoarding than helping businesses make the right decisions.”
In “Data Skepticism,” O’Reilly Radar’s Mike Loukides adds this gem to the discussion: “The idea that there are limitations to data, even very big data, doesn’t contradict Google’s mantra that more data is better than smarter algorithms; it does mean that even when you have unlimited data, you have to be very careful about the conclusions you draw from that data. It is in conflict with the all-too-common idea that, if you have lots and lots of data, correlation is as good as causation.”
Isn’t more-data-is-better the same as correlation-is-as-good-as-causation? Or, in the words of Chris Andersen, “with enough data, the numbers speak for themselves.”
“Can numbers actually speak for themselves?” non-believer Kate Crawford asks in “The Hidden Biases in Big Data” on the Harvard Business Review blog and answers: “Sadly, they can’t. Data and data sets are not objective; they are creations of human design…
And David Brooks in The New York Times, while probing the limits of “the big data revolution,” takes the discussion to yet another level: “One limit is that correlations are actually not all that clear. A zillion things can correlate with each other, depending on how you structure the data and what you compare. To discern meaningful correlations from meaningless ones, you often have to rely on some causal hypothesis about what is leading to what. You wind up back in the land of human theorizing…”