The Big Data Debate: Correlation vs. Causation


Gil Press: “In the first quarter of 2013, the stock of big data has experienced sudden declines followed by sporadic bouts of enthusiasm. The volatility—a new big data “V”—continues and Ted Cuzzillo summed up the recent negative sentiment in “Big data, big hype, big danger” on SmartDataCollective:
“A remarkable thing happened in Big Data last week. One of Big Data’s best friends poked fun at one of its cornerstones: the Three V’s. The well-networked and alert observer Shawn Rogers, vice president of research at Enterprise Management Associates, tweeted his eight V’s: ‘…Vast, Volumes of Vigorously, Verified, Vexingly Variable Verbose yet Valuable Visualized high Velocity Data.’ He was quick to explain to me that this is no comment on Gartner analyst Doug Laney’s three-V definition. Shawn’s just tired of people getting stuck on V’s.”…
Cuzzillo is joined by a growing chorus of critics that challenge some of the breathless pronouncements of big data enthusiasts. Specifically, it looks like the backlash theme-of-the-month is correlation vs. causation, possibly in reaction to the success of Viktor Mayer-Schönberger and Kenneth Cukier’s recent big data book in which they argued for dispensing “with a reliance on causation in favor of correlation”…
In “Steamrolled by Big Data,” The New Yorker’s Gary Marcus declares that “Big Data isn’t nearly the boundless miracle that many people seem to think it is.”…
Matti Keltanen at The Guardian agrees, explaining “Why ‘lean data’ beats big data.” Writes Keltanen: “…the lightest, simplest way to achieve your data analysis goals is the best one…The dirty secret of big data is that no algorithm can tell you what’s significant, or what it means. Data then becomes another problem for you to solve. A lean data approach suggests starting with questions relevant to your business and finding ways to answer them through data, rather than sifting through countless data sets. Furthermore, purely algorithmic extraction of rules from data is prone to creating spurious connections, such as false correlations… today’s big data hype seems more concerned with indiscriminate hoarding than helping businesses make the right decisions.”
In “Data Skepticism,” O’Reilly Radar’s Mike Loukides adds this gem to the discussion: “The idea that there are limitations to data, even very big data, doesn’t contradict Google’s mantra that more data is better than smarter algorithms; it does mean that even when you have unlimited data, you have to be very careful about the conclusions you draw from that data. It is in conflict with the all-too-common idea that, if you have lots and lots of data, correlation is as good as causation.”
Isn’t more-data-is-better the same as correlation-is-as-good-as-causation? Or, in the words of Chris Anderson, “with enough data, the numbers speak for themselves.”
“Can numbers actually speak for themselves?” non-believer Kate Crawford asks in “The Hidden Biases in Big Data” on the Harvard Business Review blog and answers: “Sadly, they can’t. Data and data sets are not objective; they are creations of human design…
And David Brooks in The New York Times, while probing the limits of “the big data revolution,” takes the discussion to yet another level: “One limit is that correlations are actually not all that clear. A zillion things can correlate with each other, depending on how you structure the data and what you compare. To discern meaningful correlations from meaningless ones, you often have to rely on some causal hypothesis about what is leading to what. You wind up back in the land of human theorizing…”
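Brooks's point that "a zillion things can correlate with each other" can be illustrated with a small simulation. The sketch below is not drawn from any of the quoted articles; the function names, the number of series, and the series length are illustrative assumptions. It generates independent random series, which by construction have no causal link, and reports the strongest pairwise correlation found among them.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(num_series=200, length=10, seed=0):
    """Largest absolute correlation among independent random series.

    The parameters are hypothetical defaults chosen for illustration:
    many short series, each drawn independently from a standard normal.
    """
    rng = random.Random(seed)
    series = [
        [rng.gauss(0, 1) for _ in range(length)]
        for _ in range(num_series)
    ]
    # Compare every pair of series and keep the strongest relationship.
    return max(
        abs(pearson(a, b))
        for a, b in itertools.combinations(series, 2)
    )


if __name__ == "__main__":
    best = strongest_chance_correlation()
    print(f"Strongest correlation among unrelated series: {best:.2f}")
```

With roughly twenty thousand pairs to compare, some pair of short series will tend to look strongly related purely by chance, which is why a causal hypothesis is needed to separate meaningful correlations from meaningless ones.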

The Next Great Internet Disruption: Authority and Governance


An essay by David Bollier and John Clippinger as part of their ongoing work at ID3, the Institute for Data-Driven Design: As the Internet and digital technologies have proliferated over the past twenty years, incumbent enterprises nearly always resist open network dynamics with fierce determination, a narrow ingenuity and resistance…. But the inevitable rearguard actions to defend old forms are invariably overwhelmed by the new, network-based ones. The old business models, organizational structures, professional sinecures, cultural norms, etc., ultimately yield to open platforms.
When we look back on the past twenty years of Internet history, we can more fully appreciate the prescience of David P. Reed’s seminal 1999 paper on “Group Forming Networks” (GFNs). “Reed’s Law” posits that value in networks increases exponentially as interactions move from a broadcasting model that offers “best content” (in which value is described by n, the number of consumers) to a network of peer-to-peer transactions (where the network’s value is based on “most members” and mathematically described by n^2). But by far the most valuable networks are based on those that facilitate group affiliations, Reed concluded. When users have tools for “free and responsible association for common purposes,” he found, the value of the network soars exponentially to 2^n, a fantastically large number. This is the Group Forming Network. Reed predicted that “the dominant value in a typical network tends to shift from one category to another as the scale of the network increases.…”
What is really interesting about Reed’s analysis is that today’s world of GFNs, as embodied by Facebook, Twitter, Wikipedia and other Web 2.0 technologies, remains highly rudimentary. It is based on proprietary platforms (as opposed to open source, user-controlled platforms), and therefore provides only limited tools for members of groups to develop trust and confidence in each other. This suggests a huge, unmet opportunity to actualize greater value from open networks. Citing Francis Fukuyama’s book Trust, Reed points out that “there is a strong correlation between the prosperity of national economies and social capital, which [Fukuyama] defines culturally as the ease with which people in a particular culture can form new associations.”
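To make the three scaling regimes concrete, here is a minimal sketch that tabulates them for a few network sizes. It uses the simplified forms given in the essay (n for broadcast, n squared for peer-to-peer, 2 to the power n for group-forming); the function name and the sample sizes are illustrative assumptions, not part of Reed's paper.

```python
def network_values(n):
    """Return the three value estimates for a network of n members.

    Uses the simplified forms quoted in the essay:
    broadcast ~ n, peer-to-peer ~ n^2, group-forming ~ 2^n.
    """
    return {
        "broadcast": n,
        "peer_to_peer": n ** 2,
        "group_forming": 2 ** n,
    }


if __name__ == "__main__":
    # Sample sizes are arbitrary; they only show how quickly 2^n pulls away.
    for members in (5, 10, 20, 30):
        values = network_values(members)
        print(
            f"n={members:>2}  broadcast={values['broadcast']:>3}  "
            f"peer-to-peer={values['peer_to_peer']:>4}  "
            f"group-forming={values['group_forming']:>13,}"
        )
```

Even at a few dozen members the group-forming term dwarfs the other two, which is the shift in "dominant value" that Reed describes.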

An API for "We the People"


The White House Blog: “We can’t talk about We the People without getting into the numbers — more than 8 million users, more than 200,000 petitions, more than 13 million signatures. The sheer volume of participation is, to us, a sign of success.
And there’s a lot we can learn from a set of data that rich and complex, but we shouldn’t be the only people drawing from its lessons.
So starting today, we’re making it easier for anyone to do their own analysis or build their own apps on top of the We the People platform. We’re introducing the first version of our API, and we’re inviting you to use it.
Get started here: petitions.whitehouse.gov/developers
This API provides read-only access to data on all petitions that passed the 150 signature threshold required to become publicly-available on the We the People site. For those who don’t need real-time data, we plan to add the option of a bulk data download in the near future. Until that’s ready, an incomplete sample data set is available for download here.”

Frameworks for a Location–Enabled Society


Annual CGA Conference: “Location-enabled devices are weaving “smart grids” and building “smart cities;” they allow people to discover a friend in a shopping mall, catch a bus at its next stop, check surrounding air quality while walking down a street, or avoid a rain storm on a tourist route – now or in the near future. And increasingly they allow those who provide services to track, whether we are walking past stores on the street or seeking help in a natural disaster.
The Centre for Spatial Law and Policy based in Washington, DC, the Center for Geographic Analysis, the Belfer Center for Science and International Affairs and the Berkman Center for Internet and Society at Harvard University are co-hosting a two-day program examining the legal and policy issues that will impact geospatial technologies and the development of location-enabled societies. The event will take place at Harvard University on May 2-3, 2013…The goal is to explore the different dimensions of policy and legal concerns in geospatial technology applications, and to begin in creating a policy and legal framework for a location-enabled society. Download the conference program brochure.

Cities and Data


The Economist: “Many cities around the country find themselves in a similar position: they are accumulating data faster than they know what to do with. One approach is to give them to the public. For example, San Francisco, New York, Philadelphia, Boston and Chicago are or soon will be sharing the grades that health inspectors give to restaurants with an online restaurant directory.
Another way of doing it is simply to publish the raw data and hope that others will figure out how to use them. This has been particularly successful in Chicago, where computer nerds have used open data to create many entirely new services. Applications are now available that show which streets have been cleared after a snowfall, what time a bus or train will arrive and how requests to fix potholes are progressing.
New York and Chicago are bringing together data from departments across their respective cities in order to improve decision-making. When a city holds a parade it can combine data on street closures, bus routes, weather patterns, rubbish trucks and emergency calls in real time.”

Open Data and Civil Society


Nick Hurd, UK Minister for Civil Society, on the potential of open data for the third sector in The Guardian:

“Part of the value of civil society is holding power to account, and if this can be underpinned by good quality data, we will have a very powerful tool indeed…. The UK is absolutely at the vanguard of the global open data movement, and NGOs have a great sense that this is something they want to play a part in. There is potential to help them do more of what they do, and to do it better, but they’re going to need a lot of help in terms of information and access to events where they can exchange ideas and best practice.”

Also in the article: “The competitive marketplace and bilateral nature of funding awards make this issue perhaps even more significant in the charity sector, and it is in changing attitudes and encouraging this warts-and-all approach that movement leadership bodies such as the Open Data Institute (ODI) will play their biggest role….Joining the ODI in driving and overseeing wider adoption of these practices is the Open Knowledge Foundation (OKFN). One of its first projects was a partnership with an organisation called Publish What You Fund, the aim of which was to release data on the breakdown of funding to sectors and departments in Uganda according to source – government or aid.
…Open data can often take the form of complex databases that need to be interrogated by a data specialist, and many charities simply do not have these technical resources sitting untapped. OKFN is foremost among a number of organisations looking to bridge this gap by training members of the public in data mining and analysis techniques….
“We’re all familiar with the phrase ‘knowledge is power’, and in this case knowledge means insight gained from this newly available data. But data doesn’t turn into insight or knowledge magically. It takes people, it takes skills, it takes tools to become knowledge, data and change.
“We set up the School of Data in partnership with Peer 2 Peer University just over a year and a half ago with the aim of enabling citizens to carry out this process, and what we really want to do is empower charities to use data in the same way”, said Pollock.”

The Value of Open Data – Don’t Measure Growth, Measure Destruction


David Eaves: “…And that is my main point. The real impact of open data will likely not be in the economic wealth it generates, but rather in its destructive power. I think the real impact of open data is going to be in the value it destroys and so in the capital it frees up to do other things. Much like Red Hat is a fraction of the size of Microsoft, Open Data is going to enable new players to disrupt established data players.

What do I mean by this?
Take SeeClickFix. Here is a company that, leveraging the Open311 standard, is able to provide many cities with a 311 solution that works pretty much out of the box. 20 years ago, this was a $10 million+ problem for a major city to solve, and wasn’t even something a small city could consider adopting – it was just prohibitively expensive. Today, SeeClickFix takes what was a 7 or 8 digit problem, and makes it a 5 or 6 digit problem. Indeed, I suspect SeeClickFix almost works better in a small to mid-sized government that doesn’t have complex work order software and so can just use SeeClickFix as a general solution. For this part of the market, it has crushed the cost out of implementing a solution.
Another example. And one I’m most excited about. Look at CKAN and Socrata. Most people believe these are open data portal solutions. That is a mistake. These are data management companies that happen to have simply made “sharing” (or “open”) a core design feature. You know who does data management? SAP. What Socrata and CKAN offer is a way to store, access, share and engage with data previously gathered and held by companies like SAP at a fraction of the cost. A SAP implementation is a 7 or 8 (or god forbid, 9) digit problem. And many city IT managers complain that doing anything with data stored in SAP takes time and it takes money. CKAN and Socrata may have only a fraction of the features, but they are dead simple to use, and make it dead simple to extract and share data. More importantly they make these costly 7 and 8 digit problems potentially become cheap 5 or 6 digit problems.
On the analysis side, again, I do hope there will be big wins – but what I really think open data is going to do is lower the costs of creating lots of small wins – crazy numbers of tiny efficiencies….
Don’t look for the big bang, and don’t measure the growth in spending or new jobs. Rather let’s try to measure the destruction and cumulative impact of a thousand tiny wins. Cause that is where I think we’ll see it most.”

Two-Way Citizen Engagement


Government Technology: “A couple of years ago, a conversation was brewing among city leaders in the Sacramento, Calif., suburb of Elk Grove — the city realized it could no longer afford to limit interactions with an increasingly smartphone-equipped population to between the hours of 8 a.m. and 5 p.m… The city considered several options, including a vendor-built mobile app tailor-made to meet its specific needs. And during this process, the city discovered civic engagement startup PublicStuff. Founded by Forbes’ 30 Under 30 honoree Lily Liu, the company offers a service request platform that lets users report issues of concern to the city.

Liu, who previously held positions with both New York City and Long Beach, Calif., realized that many cities couldn’t afford a full-blown 311 call center system to handle citizen requests. Many need a less expensive way of providing responsive customer service to the community. PublicStuff now fills that need for more than 200 cities across the country.”

Cybersecurity Issues in Social Media and Crowdsourcing


Wilson Center: “The Commons Lab today released a new policy memo exploring the vulnerabilities facing the widespread use and acceptance of social media and crowdsourcing. This is the second publication in the project’s policy memo series.
Using real-world examples, security expert George Chamales describes the most-pressing cybersecurity vulnerabilities in this space and calls for the development of best practices to address these vulnerabilities, ultimately concluding that it is possible for institutions to develop trust in the emerging technologies. From the memo’s executive summary:
Individuals and organizations interested in using social media and crowdsourcing currently lack two key sets of information: a systematic assessment of the vulnerabilities in these technologies and a comprehensive set of best practices describing how to address those vulnerabilities. Identifying those vulnerabilities and developing those best practices are necessary to address a growing number of incidents ranging from innocent mistakes to targeted attacks that have claimed lives and cost millions of dollars.
Click here to read the full memo on Scribd.

Open Data for Agriculture


USDA News Release: “Agriculture Secretary Tom Vilsack, along with Bill Gates, and U.S. Chief Technology Officer Todd Park, today kicked off a two-day international open data conference, saying that data “is among the most important commodities in agriculture” and sharing it openly increases its value.
Secretary Vilsack, as head of the U.S. Government delegation to the conference, announced the launch of a new “virtual community” as part of a suite of actions, including the release of new data, that the United States is taking to give farmers and ranchers, scientists, policy makers and other members of the public easy access to publicly funded data to help increase food security and nutrition.
“The digital revolution fueled by open data is starting to do for the modern world of agriculture what the industrial revolution did for agricultural productivity over the past century,” said Vilsack. “Open access to data will help combat food insecurity today while laying the groundwork for a sustainable agricultural system to feed a population that is projected to be more than nine billion by 2050.”
The virtual Food, Agriculture, and Rural data community launched today on Data.gov, the U.S. Government’s data sharing website, to catalogue America’s publicly available agricultural data and increase the ability of the public to find, download, and use datasets that are generated and held by the Federal Government. The data community features a collection of more than 300 newly cataloged datasets, databases, and raw data sources related to food, agriculture, and rural issues from agencies across the U.S. Government. In addition to the data catalog, the virtual community shares a number of applications, maps and tools designed to help farmers, scientists and policymakers improve global food security and nutrition….
The conference and the U.S. actions supporting open agricultural data fulfill the Open Data for Agriculture commitment made as part of the New Alliance for Food Security and Nutrition, which was launched by President Obama and G-8 partners at the 2012 G-8 Leaders Summit last year at Camp David, Maryland.”

G-8 Open Data for Agriculture Conference Aims to Help Feed a Growing Population and Fulfill New Alliance for Food Security and Nutrition Commitment
Secretary Vilsack Announces Launch of a Virtual Community to Give Increased Public Access to Food, Agriculture, and Rural Data