Bringing the deep, dark world of public data to light


VentureBeat: “The realm of public data is like a vast cave. It is technically open to all, but it contains many secrets and obstacles within its walls.
Enigma launched out of beta today to shed light on this hidden world. This “big data” startup focuses on data in the public domain, such as those published by governments, NGOs, and the media….
The company describes itself as “Google for public data.” Using a combination of automated web crawlers and directly reaching out to government agencies, Enigma’s database contains billions of public records across more than 100,000 datasets. Pulling them all together breaks down the barriers that exist between various local, state, federal, and institutional search portals. On top of this information is an “entity graph” which searches through the data to discover relevant results. Furthermore, once the information is broken out of the silos, users can filter, reshape, and connect various datasets to find correlations….
The technology has a wide range of applications, including professional services, finance, news media, big data, and academia. Enigma has formed strategic partnerships in each of these verticals with Deloitte, Gerson Lehrman Group, The New York Times, S&P Capital IQ, and Harvard Business School, respectively.”

The Big Data Debate: Correlation vs. Causation


Gil Press: “In the first quarter of 2013, the stock of big data has experienced sudden declines followed by sporadic bouts of enthusiasm. The volatility—a new big data “V”—continues and Ted Cuzzillo summed up the recent negative sentiment in “Big data, big hype, big danger” on SmartDataCollective:
“A remarkable thing happened in Big Data last week. One of Big Data’s best friends poked fun at one of its cornerstones: the Three V’s. The well-networked and alert observer Shawn Rogers, vice president of research at Enterprise Management Associates, tweeted his eight V’s: ‘…Vast, Volumes of Vigorously, Verified, Vexingly Variable Verbose yet Valuable Visualized high Velocity Data.’ He was quick to explain to me that this is no comment on Gartner analyst Doug Laney’s three-V definition. Shawn’s just tired of people getting stuck on V’s.”…
Cuzzillo is joined by a growing chorus of critics that challenge some of the breathless pronouncements of big data enthusiasts. Specifically, it looks like the backlash theme-of-the-month is correlation vs. causation, possibly in reaction to the success of Viktor Mayer-Schönberger and Kenneth Cukier’s recent big data book in which they argued for dispensing “with a reliance on causation in favor of correlation”…
In “Steamrolled by Big Data,” The New Yorker’s Gary Marcus declares that “Big Data isn’t nearly the boundless miracle that many people seem to think it is.”…
Matti Keltanen at The Guardian agrees, explaining “Why ‘lean data’ beats big data.” Writes Keltanen: “…the lightest, simplest way to achieve your data analysis goals is the best one…The dirty secret of big data is that no algorithm can tell you what’s significant, or what it means. Data then becomes another problem for you to solve. A lean data approach suggests starting with questions relevant to your business and finding ways to answer them through data, rather than sifting through countless data sets. Furthermore, purely algorithmic extraction of rules from data is prone to creating spurious connections, such as false correlations… today’s big data hype seems more concerned with indiscriminate hoarding than helping businesses make the right decisions.”
In “Data Skepticism,” O’Reilly Radar’s Mike Loukides adds this gem to the discussion: “The idea that there are limitations to data, even very big data, doesn’t contradict Google’s mantra that more data is better than smarter algorithms; it does mean that even when you have unlimited data, you have to be very careful about the conclusions you draw from that data. It is in conflict with the all-too-common idea that, if you have lots and lots of data, correlation is as good as causation.”
Isn’t more-data-is-better the same as correlation-is-as-good-as-causation? Or, in the words of Chris Anderson, “with enough data, the numbers speak for themselves.”
“Can numbers actually speak for themselves?” non-believer Kate Crawford asks in “The Hidden Biases in Big Data” on the Harvard Business Review blog and answers: “Sadly, they can’t. Data and data sets are not objective; they are creations of human design…”
And David Brooks in The New York Times, while probing the limits of “the big data revolution,” takes the discussion to yet another level: “One limit is that correlations are actually not all that clear. A zillion things can correlate with each other, depending on how you structure the data and what you compare. To discern meaningful correlations from meaningless ones, you often have to rely on some causal hypothesis about what is leading to what. You wind up back in the land of human theorizing…”
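The point Keltanen and Brooks make about spurious correlations can be illustrated with a small simulation. This is a minimal sketch, not drawn from any of the articles quoted above: it generates series of pure random noise, so no pair is causally related, and reports the strongest correlation found among them. The function names, the series counts, and the sample size of 20 are all assumptions chosen for illustration.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_series, n_points, seed=0):
    """Build independent random series and return the largest |r| over all pairs."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    series = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]
    return max(abs(pearson(a, b)) for a, b in itertools.combinations(series, 2))


# Hypothetical sizes: a handful of variables versus a few hundred.
few = strongest_chance_correlation(n_series=5, n_points=20)
many = strongest_chance_correlation(n_series=200, n_points=20)
print(f"strongest |r| among 5 unrelated series:   {few:.2f}")
print(f"strongest |r| among 200 unrelated series: {many:.2f}")
```

With 200 series there are 19,900 pairs to compare, so some pair will look strongly related by chance alone. Telling such a pair apart from a meaningful one takes a causal hypothesis, which is the step Brooks says puts you back in the land of human theorizing.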

The Rise of Big Data


Kenneth Neil Cukier and Viktor Mayer-Schoenberger in Foreign Affairs: “Everyone knows that the Internet has changed how businesses operate, governments function, and people live. But a new, less visible technological trend is just as transformative: “big data.” Big data starts with the fact that there is a lot more information floating around these days than ever before, and it is being put to extraordinary new uses. Big data is distinct from the Internet, although the Web makes it much easier to collect and share data. Big data is about more than just communication: the idea is that we can learn from a large body of information things that we could not comprehend when we used only smaller amounts.”
Gideon Rose, editor of Foreign Affairs, sits down with Kenneth Cukier, data editor of The Economist (video).

White House: Unleashing the Power of Big Data


Tom Kalil, Deputy Director for Technology and Innovation at OSTP: “As we enter the second year of the Big Data Initiative, the Obama Administration is encouraging multiple stakeholders, including federal agencies, private industry, academia, state and local government, non-profits, and foundations to develop and participate in Big Data initiatives across the country. Of particular interest are partnerships designed to advance core Big Data technologies; harness the power of Big Data to advance national goals such as economic growth, education, health, and clean energy; use competitions and challenges; and foster regional innovation.
The National Science Foundation has issued a request for information encouraging stakeholders to identify Big Data projects they would be willing to support to achieve these goals. And, later this year, OSTP, NSF, and other partner agencies in the Networking and Information Technology R&D (NITRD) program plan to convene an event that highlights high-impact collaborations and identifies areas for expanded collaboration between the public and private sectors.”

Work-force Science and Big Data


Steve Lohr from the New York Times: “Work-force science, in short, is what happens when Big Data meets H.R….Today, every e-mail, instant message, phone call, line of written code and mouse-click leaves a digital signal. These patterns can now be inexpensively collected and mined for insights into how people work and communicate, potentially opening doors to more efficiency and innovation within companies.

Digital technology also makes it possible to conduct and aggregate personality-based assessments, often using online quizzes or games, in far greater detail and numbers than ever before. In the past, studies of worker behavior were typically based on observing a few hundred people at most. Today, studies can include thousands or hundreds of thousands of workers, an exponential leap ahead.

“The heart of science is measurement,” says Erik Brynjolfsson, director of the Center for Digital Business at the Sloan School of Management at M.I.T. “We’re seeing a revolution in measurement, and it will revolutionize organizational economics and personnel economics.”

The data-gathering technology, to be sure, raises questions about the limits of worker surveillance. “The larger problem here is that all these workplace metrics are being collected when you as a worker are essentially behind a one-way mirror,” says Marc Rotenberg, executive director of the Electronic Privacy Information Center, an advocacy group. “You don’t know what data is being collected and how it is used.”

David Brooks on Big Data


David Brooks in NYT: “Over the past few centuries, there have been many efforts to come up with methods to help predict human behavior — what Leon Wieseltier of The New Republic calls mathematizing the subjective. The current one is the effort to understand the world by using big data.

Other efforts to predict behavior were based on models of human nature. The people using big data don’t presume to peer deeply into people’s souls. They don’t try to explain why people are doing things. They just want to observe what they are doing. The theory of big data is to have no theory, at least about human nature. You just gather huge amounts of information, observe the patterns and estimate probabilities about how people will act in the future….

One of my take-aways is that big data is really good at telling you what to pay attention to. It can tell you what sort of student is likely to fall behind. But then to actually intervene to help that student, you have to get back in the world of causality, back into the world of responsibility, back in the world of advising someone to do x because it will cause y.”

Big Data, Big Brains


“This report on Big Data is the first MeriTalk Beacon, a new series of reports designed to shed light and provide direction on far reaching issues in government and technology. Since Beacons are designed to tackle broad concepts, each Beacon report relies on insight from a small number of big thinkers in the topic area. Less data. More insight. Real knowledge…Mankind created 150 exabytes (billion gigabytes) of data in 2005, and 1,800 exabytes in 2011; growth that only continues to accelerate. Every minute, users: Upload 48 hours of video to YouTube; Send 204 million emails; Spend $207,000 via the web; Create 571 new websites. Within the Federal government; U.S. drone aircraft sent back 24 years worth of video footage in just 2009. Every 24 hours, NASA’s Curiosity rover can send nearly three gigabytes of data, collecting in mere days the equivalent of all human knowledge through the death of Augustus Caesar – from Mars.”
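To put the two volume figures in the excerpt on a common footing, the sketch below converts them into an average yearly growth rate. It assumes smooth compound growth between 2005 and 2011, which the report itself does not claim, and the helper name is hypothetical.

```python
def annual_growth_rate(start, end, years):
    """Average compound growth rate per year between two values."""
    return (end / start) ** (1 / years) - 1


# Figures quoted in the report: 150 exabytes (2005) and 1,800 exabytes (2011).
rate = annual_growth_rate(start=150, end=1800, years=2011 - 2005)
print(f"implied average growth: {rate:.1%} per year")
```

Under that assumption, a twelvefold increase over six years works out to roughly 51 percent growth per year.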

Big Data Challenge to Transform Health Care Delivery


BPC Press Release: “Today, the Bipartisan Policy Center (BPC), Heritage Provider Network (HPN), and The Advisory Board Company launched the Care Transformation Prize Series, a national contest to address the most daunting data problems U.S. health care organizations face as they implement new delivery system and payment reforms.
The goal of this big data challenge is to help health care organizations more effectively use data to drive improvements in health care cost and quality. The series was announced at a BPC-hosted event today that featured a forward-thinking discussion on the strategies that providers, health plans, and states are using to harness data to help Americans lower their health care costs and improve care.”

Big Data can help keep the peace


NextGov story: “Some of the same social media analyses that have helped Google and the Centers for Disease Control and Prevention spot warning signs of a flu outbreak could be used to detect the rumblings of violent conflict before it begins, scholars said in a paper released this week.
Kenyan officials used essentially this system to track hate speech on Facebook, blogs and Twitter in advance of that nation’s 2013 presidential election, which brought Uhuru Kenyatta to power.
Similar efforts to track Syrian social media have been able to identify ceasefire violations within 15 minutes of when they occur, according to the paper on New Technology and the Prevention of Violence and Conflict prepared by the United States Agency for International Development, the United Nations Development Programme and the International Peace Institute and presented at the United States Institute of Peace Friday.”