Five myths about big data


Samuel Arbesman, senior scholar at the Ewing Marion Kauffman Foundation and the author of “The Half-Life of Facts” in the Washington Post: “Big data holds the promise of harnessing huge amounts of information to help us better understand the world. But when talking about big data, there’s a tendency to fall into hyperbole. It is what compels contrarians to write such tweets as “Big Data, n.: the belief that any sufficiently large pile of s— contains a pony.” Let’s deflate the hype.
1. “Big data” has a clear definition.
The term “big data” has been in circulation since at least the 1990s, when it is believed to have originated in Silicon Valley. IBM offers a seemingly simple definition: Big data is characterized by the four V’s of volume, variety, velocity and veracity. But the term is thrown around so often, in so many contexts — science, marketing, politics, sports — that its meaning has become vague and ambiguous….
2. Big data is new.
By many accounts, big data exploded onto the scene quite recently. “If wonks were fashionistas, big data would be this season’s hot new color,” a Reuters report quipped last year. In a May 2011 report, the McKinsey Global Institute declared big data “the next frontier for innovation, competition, and productivity.”
It’s true that today we can mine massive amounts of data — textual, social, scientific and otherwise — using complex algorithms and computer power. But big data has been around for a long time. It’s just that exhaustive datasets were more exhausting to compile and study in the days when “computer” meant a person who performed calculations….
3. Big data is revolutionary.
In their new book, “Big Data: A Revolution That Will Transform How We Live, Work, and Think,” Viktor Mayer-Schönberger and Kenneth Cukier compare “the current data deluge” to the transformation brought about by the Gutenberg printing press.
If you want more precise advertising directed toward you, then yes, big data is revolutionary. Generally, though, it’s likely to have a modest and gradual impact on our lives….
4. Bigger data is better.
In science, some admittedly mind-blowing big-data analyses are being done. In business, companies are being told to “embrace big data before your competitors do.” But big data is not automatically better.
Really big datasets can be a mess. Unless researchers and analysts can reduce the number of variables and make the data more manageable, they get quantity without a whole lot of quality. Give me some quality medium data over bad big data any day…
5. Big data means the end of scientific theories.
Chris Anderson argued in a 2008 Wired essay that big data renders the scientific method obsolete: Throw enough data at an advanced machine-learning technique, and all the correlations and relationships will simply jump out. We’ll understand everything.
But you can’t just go fishing for correlations and hope they will explain the world. If you’re not careful, you’ll end up with spurious correlations. Even more important, to contend with the “why” of things, we still need ideas, hypotheses and theories. If you don’t have good questions, your results can be silly and meaningless.
Having more data won’t substitute for thinking hard, recognizing anomalies and exploring deep truths.”
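
The warning about fishing for correlations can be illustrated with a small simulation. The sketch below is not from the article: it assumes a toy setup (10 observations, 1,000 unrelated random variables, standard-normal noise), and the function names are made up for illustration. It compares one random "target" series against many other random series and reports the strongest correlation found purely by chance.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_obs=10, n_vars=1000, seed=0):
    """Largest |r| between a random target and n_vars unrelated random series.

    Every series is independent noise, so any strong correlation found here
    is spurious by construction. All sizes are illustrative assumptions.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_vars):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    print(f"strongest chance correlation: {strongest_chance_correlation():.2f}")
```

With so few observations and so many candidate variables, some of the noise series will almost always look strongly related to the target, which is why a hypothesis is needed before the search rather than after it.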

Announcing Project Open Data from Cloudant Labs


Yuriy Dybskiy from Cloudant: “There has been an emerging pattern over the last few years of more and more government datasets becoming available for public access. Earlier this year, the White House announced official policy on such data – Project Open Data.

Available resources

Here are four resources on the topic:

  1. Tim Berners-Lee: Open, Linked Data for a Global Community – [10 min video]
  2. Rufus Pollock: Open Data – How We Got Here and Where We’re Going – [24 min video]
  3. Open Knowledge Foundation Datasets – http://data.okfn.org/data
  4. Max Ogden: Project dat – collaborative data – [GitHub repo]

One of the main challenges is access to the datasets. If only there were a database that had easy access to its data baked right in it.
Luckily, there are CouchDB and Cloudant, which share the same APIs to access data via HTTP. This makes for a really great option to store interesting datasets.

Cloudant Open Data

Today we are happy to announce a Cloudant Labs project – Cloudant Open Data!
Several datasets are available at the moment, for example, businesses_sf – data regarding businesses registered in San Francisco and sf_pd_incidents – a collection of incident reports (criminal and non-criminal) made by the San Francisco Police Department.
We’ll add more, but if you have one you’d like us to add faster – drop us a line at open-data@cloudant.com.
Create an account and play with these datasets yourself.”
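
As a rough sketch of what "access data via HTTP" can look like in practice, the Python below reads a few documents through the CouchDB-style `_all_docs` endpoint with `include_docs=true`. The announcement does not give the account URL, so `BASE_URL` is a placeholder assumption, and the helper names are hypothetical rather than part of any Cloudant library. Only the dataset name `businesses_sf` comes from the announcement.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Placeholder host (assumption): the announcement does not state the real URL.
BASE_URL = "https://example-account.cloudant.com"


def all_docs_url(base_url, db, limit=5):
    """Build a CouchDB-style _all_docs URL that also returns document bodies."""
    query = urlencode({"include_docs": "true", "limit": limit})
    return f"{base_url.rstrip('/')}/{quote(db, safe='')}/_all_docs?{query}"


def extract_docs(payload):
    """Pull the document bodies out of a parsed _all_docs response."""
    return [row["doc"] for row in payload.get("rows", []) if "doc" in row]


def fetch_docs(base_url, db, limit=5):
    """Fetch up to `limit` documents from a database with a plain HTTP GET."""
    with urlopen(all_docs_url(base_url, db, limit), timeout=10) as resp:
        return extract_docs(json.load(resp))
```

Given a real account host, a call such as `fetch_docs(BASE_URL, "businesses_sf")` would return a list of dictionaries, one per document. No client library is needed, since the database itself speaks HTTP and JSON.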

From Machinery to Mobility: Government and Democracy in a Participative Age


New book by Jeffrey Roy: “The Westminster-stylized model of Parliamentary democratic politics and public service accountability is increasingly out of step with the realities of today’s digitally and socially networked era. This book explores the reconfiguration of democratic and managerial governance within democratic societies due to the advent of technological mobility. More specifically, the traditional public sector prism of organization and accountability, denoted as ‘machinery of government’, is increasingly strained in an era characterized by smart devices, social media, and cloud computing. This book examines the roots and implications of the tensions between machinery and mobility and the sorts of investments and initiatives that have been undertaken by governments around the world as well as their appropriateness and relative impacts. This book also examines the prospects for holistic adaptation of democratic and managerial systems going forward, identifying the most crucial directions and determinants for improving public sector performance in terms of outcomes, accountability, and agility. Accordingly, the ultimate aim of this initiative is to contribute to the formation of intellectual foundations for more systemic reforms of public sector governance in Canada and elsewhere, and to offer forward-looking trajectories for government adaptation in shifting from a traditional prism of ‘machinery’ to new organizational and institutional arrangements better suited for an era of ‘mobility’.”

Defense Against National Vulnerabilities in Public Data


DOD/DARPA Notice (See also Foreign Policy article): “OBJECTIVE: Investigate the national security threat posed by public data available either for purchase or through open sources. Based on principles of data science, develop tools to characterize and assess the nature, persistence, and quality of the data. Develop tools for the rapid anonymization and de-anonymization of data sources. Develop framework and tools to measure the national security impact of public data and to defend against the malicious use of public data against national interests.
DESCRIPTION: The vulnerabilities to individuals from a data compromise are well known and documented now as “identity theft.” These include regular stories published in the news and research journals documenting the loss of personally identifiable information by corporations and governments around the world. Current trends in social media and commerce, with voluntary disclosure of personal information, create other potential vulnerabilities for individuals participating heavily in the digital world. The Netflix Challenge in 2009 was launched with the goal of creating better customer pick prediction algorithms for the movie service [1]. An unintended consequence of the Netflix Challenge was the discovery that it was possible to de-anonymize the entire contest data set with very little additional data. This de-anonymization led to a federal lawsuit and the cancellation of the sequel challenge [2]. The purpose of this topic is to understand the national level vulnerabilities that may be exploited through the use of public data available in the open or for purchase.
Could a modestly funded group deliver nation-state type effects using only public data?…”
The official link for this solicitation is: www.acq.osd.mil/osbp/sbir/solicitations/sbir20133.
 

Listen to Wikipedia


“Listen to Wikipedia’s recent changes feed. The sounds indicate addition to (bells) or subtraction from (strings) a Wikipedia article, and the pitch changes according to the size of the edit. Green circles show edits from unregistered contributors, and purple circles mark edits performed by automated bots. You may see announcements for new users as they join the site; you can welcome them by adding a note on their talk page.
This project is built using D3 and HowlerJS. It is based on Listen to Bitcoin by Maximillian Laumeister. Our source is available on GitHub, and you can read more about this project.
Built by Stephen LaPorte and Mahmoud Hashemi.”

Behold: A Digital Bill of Rights for the Internet, by the Internet


Mashable: “The digital rights conversation was thrust into the mainstream spotlight after news of ongoing, widespread mass surveillance programs leaked to the public. Always a hot topic, these revelations sparked a strong online debate among the Internet community.
It also made us here at Mashable reflect on the digital freedoms and protections we feel each user should be guaranteed as a citizen of the Internet. To highlight some of the great conversations taking place about digital rights online, we asked the digital community to collaborate with us on the creation of a crowdsourced Digital Bill of Rights.
After six weeks of public discussions, document updates and changes, as well as incorporating input from digital rights experts, Mashable is pleased to unveil its first-ever Digital Bill of Rights, made for the Internet, by the Internet.”
 

Hackers Called Into Civic Duty


Wall Street Journal: “Cash-strapped cities are turning to an unusual source to improve their online services on the cheap: helpful hackers, who use city data to create tools tracking everything from real-time subway delays to where to get a free flu shot near your home and information about a contentious school-closing plan.
Hackers have been popularly portrayed as giving fits to national-security officials and credit-card companies, but the term also refers to people who like to write their own computer programs and help solve a variety of problems. Recently, hackers have begun working with cities to find ways of building applications, or apps, that make use of data (which gets stripped of personally identifiable information) that municipalities are collecting anyway in the regular course of governance…. Last year, Chicago Mayor Rahm Emanuel signed an executive order mandating the city make available all data not protected by privacy laws. Today, the city has nearly 950 data sets publicly available, the most of any U.S. city, according to Code for America, a nonprofit that promotes openness in government.”

Guidelines for Open Data Policies


“The Sunlight Foundation created this living document to present a broad vision of the kinds of challenges that open data policies can actively address.
A few general notes: Although some provisions may carry more importance or heft than others, these Guidelines are not ranked in order of priority, but organized to help define What Data Should be Public, How to Make Data Public, and How to Implement Policy — three key elements of any legislation, executive order, or other policy seeking to include language about open data. Further, it’s worth repeating that these provisions are only a guide. As such, they do not address every question one should consider in preparing a policy. Instead, these provisions attempt to answer the specific question: What can or should an open data policy do?”

Data is Inert — It’s What You Do With It That Counts


Kevin Merritt, CEO and Founder, Socrata, in NextGov: “In its infancy, the open data movement was mostly about offering catalogs of government data online that concerned citizens and civic activists could download. But now, a wide variety of external stakeholders are using open data to deliver new applications and services. At the same time, governments themselves are harnessing open data to drive better decision-making.
In a relatively short period of time, open data has evolved from serving as fodder for data publishing to fuel for open innovation.
One of the keys to making this transformation truly work, however, is our ability to re-instrument or re-tool underlying business systems and processes so managers can receive open data in consumable forms on a regular, continuous basis in real-time….”

I Flirt and Tweet. Follow Me at #Socialbot.


The New York Times: “From the earliest days of the Internet, robotic programs, or bots, have been trying to pass themselves off as human. Chatbots greet users when they enter an online chat room, for example, or kick them out when they get obnoxious….

Now come socialbots. These automated charlatans are programmed to tweet and retweet. They have quirks, life histories and the gift of gab. Many of them have built-in databases of current events, so they can piece together phrases that seem relevant to their target audience. They have sleep-wake cycles so their fakery is more convincing, making them less prone to repetitive patterns that flag them as mere programs. Some have even been souped up by so-called persona management software, which makes them seem more real by adding matching Facebook, Reddit or Foursquare accounts, giving them an online footprint over time as they amass friends and like-minded followers.

Researchers say this new breed of bots is being designed not just with greater sophistication but also with grander goals: to sway elections, to influence the stock market, to attack governments, even to flirt with people and one another.

…Socialbots are tapping into an ever-expanding universe of social media. Last year, the number of Twitter accounts topped 500 million. Some researchers estimate that only 35 percent of the average Twitter user’s followers are real people. In fact, more than half of Internet traffic already comes from nonhuman sources like bots or other types of algorithms. Within two years, about 10 percent of the activity occurring on social online networks will be masquerading bots, according to technology researchers….

Much of the social media remains unregulated by campaign finance and transparency laws. So far, the Federal Election Commission has been reluctant to venture into this realm.

But the bots are likely to venture into ours, said Tim Hwang, chief scientist at the Pacific Social Architecting Corporation, which creates bots and technologies that can shape social behavior. “Our vision is that in the near future automatons will eventually be able to rally crowds, open up bank accounts, write letters,” he said, “all through human surrogates.”