Eight (No, Nine!) Problems With Big Data


Gary Marcus and Ernest Davis in the New York Times: “Big data is suddenly everywhere. Everyone seems to be collecting it, analyzing it, making money from it and celebrating (or fearing) its powers. Whether we’re talking about analyzing zillions of Google search queries to predict flu outbreaks, or zillions of phone records to detect signs of terrorist activity, or zillions of airline stats to find the best time to buy plane tickets, big data is on the case. By combining the power of modern computing with the plentiful data of the digital era, it promises to solve virtually any problem — crime, public health, the evolution of grammar, the perils of dating — just by crunching the numbers.

Or so its champions allege. “In the next two decades,” the journalist Patrick Tucker writes in the latest big data manifesto, “The Naked Future,” “we will be able to predict huge areas of the future with far greater accuracy than ever before in human history, including events long thought to be beyond the realm of human inference.” Statistical correlations have never sounded so good.

Is big data really all it’s cracked up to be? There is no doubt that big data is a valuable tool that has already had a critical impact in certain areas. For instance, almost every successful artificial intelligence computer program in the last 20 years, from Google’s search engine to the I.B.M. “Jeopardy!” champion Watson, has involved the substantial crunching of large bodies of data. But precisely because of its newfound popularity and growing use, we need to be levelheaded about what big data can — and can’t — do.

The first thing to note is that although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful. A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two. Likewise, from 1998 to 2007 the number of new cases of autism diagnosed was extremely well correlated with sales of organic food (both went up sharply), but identifying the correlation won’t by itself tell us whether diet has anything to do with autism.

Second, big data can work well as an adjunct to scientific inquiry but rarely succeeds as a wholesale replacement. Molecular biologists, for example, would very much like to be able to infer the three-dimensional structure of proteins from their underlying DNA sequence, and scientists working on the problem use big data as one tool among many. But no scientist thinks you can solve this problem by crunching data alone, no matter how powerful the statistical analysis; you will always need to start with an analysis that relies on an understanding of physics and biochemistry.

Third, many tools that are based on big data can be easily gamed. For example, big data programs for grading student essays often rely on measures like sentence length and word sophistication, which are found to correlate well with the scores given by human graders. But once students figure out how such a program works, they start writing long sentences and using obscure words, rather than learning how to actually formulate and write clear, coherent text. Even Google’s celebrated search engine, rightly seen as a big data success story, is not immune to “Google bombing” and “spamdexing,” wily techniques for artificially elevating website search placement.
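
To make the gaming concrete, here is a minimal sketch of the kind of surface features such grading programs are said to rely on. The feature names, the 8-letter "sophistication" threshold and the sample sentences are illustrative assumptions, not taken from any actual essay-scoring product.

```python
import re

def surface_features(essay):
    """Shallow features of the sort automated essay scorers reportedly use.
    The choice of features and the 8-letter threshold are illustrative only."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[A-Za-z'-]+", essay)
    avg_sentence_len = len(words) / max(len(sentences), 1)
    # Crude proxy for "word sophistication": share of words with 8+ letters.
    long_word_share = sum(len(w) >= 8 for w in words) / max(len(words), 1)
    return {"avg_sentence_length": round(avg_sentence_len, 1),
            "long_word_share": round(long_word_share, 2)}

print(surface_features("Short words. Clear ideas. Good point."))
print(surface_features("Notwithstanding multitudinous considerations, the phenomenon "
                       "nevertheless perseverates interminably across innumerable contexts."))
```

Padding an essay with longer sentences and obscure vocabulary inflates both numbers without making the writing any clearer, which is exactly the gaming the authors describe.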

Fourth, even when the results of a big data analysis aren’t intentionally gamed, they often turn out to be less robust than they initially seem. Consider Google Flu Trends, once the poster child for big data. In 2009, Google reported — to considerable fanfare — that by analyzing flu-related search queries, it had been able to detect the spread of the flu as accurately as, and more quickly than, the Centers for Disease Control and Prevention. A few years later, though, Google Flu Trends began to falter; for the last two years it has made more bad predictions than good ones.

As a recent article in the journal Science explained, one major contributing cause of the failures of Google Flu Trends may have been that the Google search engine itself constantly changes, such that patterns in data collected at one time do not necessarily apply to data collected at another time. As the statistician Kaiser Fung has noted, collections of big data that rely on web hits often merge data that was collected in different ways and with different purposes — sometimes to ill effect. It can be risky to draw conclusions from data sets of this kind.

A fifth concern might be called the echo-chamber effect, which also stems from the fact that much of big data comes from the web. Whenever the source of information for a big data analysis is itself a product of big data, opportunities for vicious cycles abound. Consider translation programs like Google Translate, which draw on many pairs of parallel texts from different languages — for example, the same Wikipedia entry in two different languages — to discern the patterns of translation between those languages. This is a perfectly reasonable strategy, except for the fact that with some of the less common languages, many of the Wikipedia articles themselves may have been written using Google Translate. In those cases, any initial errors in Google Translate infect Wikipedia, which is fed back into Google Translate, reinforcing the error.
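
As a toy illustration of the underlying idea, the sketch below guesses word translations by counting which words co-occur across aligned sentence pairs. The four English-German pairs and the raw counting scheme are invented for illustration; production systems train on vastly larger corpora with proper alignment models.

```python
from collections import Counter, defaultdict

# Tiny invented English-German "parallel corpus"; real systems use millions of pairs.
parallel = [
    ("red house", "rotes haus"),
    ("big house", "grosses haus"),
    ("red car", "rotes auto"),
    ("blue car", "blaues auto"),
]

cooc = defaultdict(Counter)
for en_sent, de_sent in parallel:
    for en_word in en_sent.split():
        for de_word in de_sent.split():
            cooc[en_word][de_word] += 1  # count co-occurrences across aligned pairs

# Take the most frequently co-occurring word as a crude translation candidate.
for en_word in ("house", "red", "car"):
    candidate, _ = cooc[en_word].most_common(1)[0]
    print(en_word, "->", candidate)
```

The feedback loop follows directly: if some of those "parallel" sentences were themselves produced by the translator, its earlier mistakes get counted right back in as evidence.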

A sixth worry is the risk of too many correlations. If you look 100 times for correlations between two variables, you risk finding, purely by chance, about five bogus correlations that appear statistically significant — even though there is no actual meaningful connection between the variables. Absent careful supervision, the magnitudes of big data can greatly amplify such errors.
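
The arithmetic behind "about five" is just the conventional 5 percent significance threshold applied 100 times. A quick simulation on pure noise, sketched below with standard-library Python and the textbook critical value for 30 observations, shows roughly how many spurious "significant" correlations turn up.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

n = 30              # observations per variable
trials = 100        # independent pairs of variables tested
critical_r = 0.361  # for n = 30, |r| above this is "significant" at p < 0.05 (two-sided)

false_positives = 0
for _ in range(trials):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [random.gauss(0, 1) for _ in range(n)]  # independent of x by construction
    if abs(pearson_r(x, y)) > critical_r:
        false_positives += 1

# Typically around five of the 100 noise-only pairs clear the threshold.
print(f"Spurious 'significant' correlations: {false_positives} out of {trials}")
```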

Seventh, big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions. In the past few months, for instance, there have been two separate attempts to rank people in terms of their “historical importance” or “cultural contributions,” based on data drawn from Wikipedia. One is the book “Who’s Bigger? Where Historical Figures Really Rank,” by the computer scientist Steven Skiena and the engineer Charles Ward. The other is an M.I.T. Media Lab project called Pantheon.

Both efforts get many things right — Jesus, Lincoln and Shakespeare were surely important people — but both also make some egregious errors. “Who’s Bigger?” claims that Francis Scott Key was the 19th most important poet in history; Pantheon has claimed that Nostradamus was the 20th most important writer in history, well ahead of Jane Austen (78th) and George Eliot (380th). Worse, both projects suggest a misleading degree of scientific precision with evaluations that are inherently vague, or even meaningless. Big data can reduce anything to a single number, but you shouldn’t be fooled by the appearance of exactitude.

Finally, big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like “in a row”). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.
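
Counting trigrams is trivial; the problem is the long tail of trigrams that no corpus, however large, has ever seen. A minimal sketch (the sample text is invented):

```python
from collections import Counter

def trigrams(text):
    """Yield every run of three consecutive words."""
    words = text.lower().split()
    for i in range(len(words) - 2):
        yield tuple(words[i:i + 3])

# Stand-in for the enormous corpora real search and translation systems index.
corpus = "reliable statistics can be compiled about common trigrams because they appear frequently"
counts = Counter(trigrams(corpus))

print(counts[("can", "be", "compiled")])            # seen in the corpus: count 1
print(counts[("dumbed-down", "escapist", "fare")])  # never seen: count 0, no statistics to lean on
```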

To select an example more or less at random, a book review that the actor Rob Lowe recently wrote for this newspaper contained nine trigrams such as “dumbed-down escapist fare” that had never before appeared anywhere in all the petabytes of text indexed by Google. To witness the limitations that big data can have with novelty, Google-translate “dumbed-down escapist fare” into German and then back into English: out comes the incoherent “scaled-flight fare.” That is a long way from what Mr. Lowe intended — and from big data’s aspirations for translation.

Wait, we almost forgot one last problem: the hype….”

Using Social Media to Measure Labor Market Flows


Paper by Dolan Antenucci, Michael Cafarella, Margaret C. Levenstein, Christopher Ré, and Matthew D. Shapiro: “Social media enable promising new approaches to measuring economic activity and analyzing economic behavior at high frequency and in real time using information independent from standard survey and administrative sources. This paper uses data from Twitter to create indexes of job loss, job search, and job posting. Signals are derived by counting job-related phrases in Tweets such as “lost my job.” The social media indexes are constructed from the principal components of these signals. The University of Michigan Social Media Job Loss Index tracks initial claims for unemployment insurance at medium and high frequencies and predicts 15 to 20 percent of the variance of the prediction error of the consensus forecast for initial claims. The social media indexes provide real-time indicators of events such as Hurricane Sandy and the 2013 government shutdown. Comparing the job loss index with the search and posting indexes indicates that the Beveridge Curve has been shifting inward since 2011.
The University of Michigan Social Media Job Loss Index is updated weekly and is available at http://econprediction.eecs.umich.edu/.”
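
As a rough sketch of the index construction the abstract describes, one could standardize weekly counts of job-related phrases and take their first principal component as a single index. The phrases and the weekly counts below are invented, and the paper's actual pipeline differs in detail.

```python
import numpy as np

# Invented weekly counts of job-related phrases in tweets (rows = weeks).
# Columns stand for phrases such as "lost my job", "laid off", "fired today", "job opening".
weekly_counts = np.array([
    [120,  80, 30, 200],
    [150,  95, 40, 190],
    [300, 160, 90, 170],   # a week with a spike in job-loss language
    [140,  90, 35, 210],
    [130,  85, 33, 205],
], dtype=float)

# Standardize each phrase series, then project onto the first principal component.
z = (weekly_counts - weekly_counts.mean(axis=0)) / weekly_counts.std(axis=0)
_, _, vt = np.linalg.svd(z, full_matrices=False)
index = z @ vt[0]          # one index value per week (sign is arbitrary)

print(np.round(index, 2))  # the spike week stands out from the rest
```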

Smart cities are here today — and getting smarter


Computer World: “Smart cities aren’t a science fiction, far-off-in-the-future concept. They’re here today, with municipal governments already using technologies that include wireless networks, big data/analytics, mobile applications, Web portals, social media, sensors/tracking products and other tools.
These smart city efforts have lofty goals: Enhancing the quality of life for citizens, improving government processes and reducing energy consumption, among others. Indeed, cities are already seeing some tangible benefits.
But creating a smart city comes with daunting challenges, including the need to provide effective data security and privacy, and to ensure that myriad departments work in harmony.

The global urban population is expected to grow approximately 1.5% per year between 2025 and 2030, mostly in developing countries, according to the World Health Organization.

What makes a city smart? As with any buzz term, the definition varies. But in general, it refers to using information and communications technologies to deliver sustainable economic development and a higher quality of life, while engaging citizens and effectively managing natural resources.
Making cities smarter will become increasingly important. For the first time ever, the majority of the world’s population resides in a city, and this proportion continues to grow, according to the World Health Organization, the coordinating authority for health within the United Nations.
A hundred years ago, two out of every 10 people lived in an urban area, the organization says. As recently as 1990, less than 40% of the global population lived in a city — but by 2010 more than half of all people lived in an urban area. By 2050, the proportion of city dwellers is expected to rise to 70%.
As many city populations continue to grow, here’s what five U.S. cities are doing to help manage it all:

Scottsdale, Ariz.

The city of Scottsdale, Ariz., has several initiatives underway.
One is MyScottsdale, a mobile application the city deployed in the summer of 2013 that allows citizens to report cracked sidewalks, broken street lights and traffic lights, road and sewer issues, graffiti and other problems in the community….”

Visualizing Health IT: A holistic overview


Andy Oram in O’Reilly Data: “There is no dearth of health reformers offering their visions for patient engagement, information exchange, better public health, and disruptive change to health industries. But they often accept too freely the promise of technology, without grasping how difficult the technical implementations of their reforms would be. Furthermore, no document I have found pulls together the various trends in technology and explores their interrelationships.
I have tried to fill this gap with a recently released report: The Information Technology Fix for Health: Barriers and Pathways to the Use of Information Technology for Better Health Care. This posting describes some of the issues it covers.
Take a basic example: fitness devices. Lots of health reformers would love to see these pulled into treatment plans to help people overcome hypertension and other serious conditions. It’s hard to understand the factors that make doctors reluctant to do so–blind conservatism is not the problem, but actual technical factors. To become part of treatment plans, the accuracy of devices would have to be validated, they would need to produce data in formats and units that are universally recognized, and electronic records would have to undergo major upgrades to store and process the data.
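To make the "formats and units" barrier concrete, here is a purely hypothetical sketch of the normalization an electronic record would need before readings from two differently behaved home blood-pressure devices could be stored side by side; every field name, unit and value below is invented for illustration.

```python
from datetime import datetime, timezone

# Two hypothetical devices reporting the "same" kind of measurement differently.
reading_a = {"bp": "142/91", "taken": "2014-04-01 08:30", "units": "mmHg"}
reading_b = {"systolic": 17.5, "diastolic": 11.2, "unit": "kPa", "ts": 1396341000}

KPA_TO_MMHG = 7.50062  # unit conversion factor

def normalize(reading):
    """Map heterogeneous device output onto one agreed-upon record shape."""
    if "bp" in reading:   # device A: "systolic/diastolic" string in mmHg, local time string
        sys_, dia = (int(x) for x in reading["bp"].split("/"))
        ts = datetime.strptime(reading["taken"], "%Y-%m-%d %H:%M")
    else:                 # device B: kPa values with a Unix timestamp
        sys_ = round(reading["systolic"] * KPA_TO_MMHG)
        dia = round(reading["diastolic"] * KPA_TO_MMHG)
        ts = datetime.fromtimestamp(reading["ts"], tz=timezone.utc)
    return {"systolic_mmHg": sys_, "diastolic_mmHg": dia, "timestamp": ts.isoformat()}

print(normalize(reading_a))
print(normalize(reading_b))
```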
Another example is patient engagement, which doctors and hospitals are furiously pursuing. Not only are patients becoming choosier and rating their institutions publicly in Yelp-like fashion, but the clinicians have come to realize that engaged patients are more likely to participate in developing effective treatment plans, not to mention following through on them.
Engaging patients to improve their own outcomes directly affects the institutions’ bottom lines as insurers and the government move from paying for each procedure to pay-per-value (a fixed sum for handling a group of patients that share a health condition). But what data do we need to make pay-per-value fair and accurate? How do we get that data from one place to another, and–much more difficult–out of one ungainly proprietary format and possibly into others? The answer emerging among activists to these questions is: leave the data under the control of the patients, and let them share it as they find appropriate.
Collaboration may be touted even more than patient engagement as the way to better health. And who wouldn’t want his cardiologist to be consulting with his oncologist, nutritionist, and physical therapist? It doesn’t happen as much as it should, and while picking up the phone may be critical sometimes to making the right decisions, electronic media can also be of crucial value. Once again, we have to overcome technical barriers.
The Information Technology Fix for Health report divides these issues into four umbrella categories:

  • Devices, sensors, and patient monitoring
  • Using data: records, public data sets, and research
  • Coordinated care: teams and telehealth
  • Patient empowerment

Underlying all these as a kind of vast subterranean network of interconnected roots are electronic health records (EHRs). These must function well in order for devices to send output to the interested observers, researchers to collect data, and teams to coordinate care. The article delves into the messy and often ugly area of formats and information exchange, along with issues of privacy. I extol once again the virtue of patient control over records and suggest how we could overcome all barriers to make that happen.”

Public interest labs to test open governance solutions


Kathleen Hickey in GCN: “The Governance Lab at New York University (GovLab) and the MacArthur Foundation Research Network have formed a new network, Open Governance, to study how to enhance collaboration and decision-making in the public interest.
The MacArthur Foundation provided a three-year grant of $5 million for the project; Google’s philanthropic arm, Google.org, also contributed. Google.org’s technology will be used to develop platforms to solve problems more openly and to run agile, real-world experiments with governments and NGOs to discover ways to enhance decision-making in the public interest, according to the GovLab announcement.
Network members include 12 experts in computer science, political science, policy informatics, social psychology and philosophy, law, and communications. This group is supported by an advisory network of academics, technologists, and current and former government officials. The network will assess existing government programs and experiment with ways to improve decision-making at the local, national and international government levels.
The Network’s efforts focus on three areas that members say have the potential to make governance more effective and legitimate: getting expertise in, pushing data out and distributing responsibility.
Through smarter governance, they say, institutions can seek input from lay and expert citizens via expert networking, crowdsourcing or challenges.  With open data governance, institutions can publish machine-readable data so that citizens can easily analyze and use this information to detect and solve problems. And by shared governance, institutions can help citizens develop solutions through participatory budgeting, peer production or digital commons.
“Recognizing that we cannot solve today’s challenges with yesterday’s tools, this interdisciplinary group will bring fresh thinking to questions about how our governing institutions operate and how they can develop better ways to help address seemingly intractable social problems for the common good,” said MacArthur Foundation President Robert Gallucci.
GovLab’s mission is to study and launch “experimental, technology-enabled solutions that advance a collaborative, networked approach to re-invent existing institutions and processes of governance to improve people’s lives.” Earlier this year GovLab released a preview of its Open Data 500 study of 500 companies using open government data as a key business resource.”

Open Data: What Is It and Why Should You Care?


Jason Shueh at Government Technology: “Though the debate about open data in government is an evolving one, it is indisputably here to stay — it can be heard in both houses of Congress, in state legislatures, and in city halls around the nation.
Already, 39 states and 46 localities provide data sets to data.gov, the federal government’s online open data repository. And 30 jurisdictions, including the federal government, have taken the additional step of institutionalizing their practices in formal open data policies.
Though the term “open data” is spoken of frequently — and has been since President Obama took office in 2009 — what it is and why it’s important isn’t always clear. That’s understandable, perhaps, given that open data lacks a unified definition.
“People tend to conflate it with big data,” said Emily Shaw, the national policy manager at the Sunlight Foundation, “and I think it’s useful to think about how it’s different from big data in the sense that open data is the idea that public information should be accessible to the public online.”
Shaw said the foundation, a Washington, D.C., non-profit advocacy group promoting open and transparent government, believes the term open data can be applied to a variety of information created or collected by public entities. Among the benefits of open data are improved measurement of policies, better government efficiency, deeper analytical insights, greater citizen participation, and a boost to local companies by way of products and services that use government data (think civic apps and software programs).
“The way I personally think of open data,” Shaw said, “is that it is a manifestation of the idea of open government.”

What Makes Data Open

For governments hoping to adopt open data in policy and in practice, simply making data available to the public isn’t enough to make that data useful. Open data, though straightforward in principle, requires a specific approach based on the agency or organization releasing it, the kind of data being released and, perhaps most importantly, its targeted audience.
According to the foundation’s California Open Data Handbook, published in collaboration with Stewards of Change Institute, a national group supporting innovation in human services, data must first be both “technically open” and “legally open.” The guide defines the terms in this way:
  • Technically open: [data] available in a machine-readable standard format, which means it can be retrieved and meaningfully processed by a computer application
  • Legally open: [data] explicitly licensed in a way that permits commercial and non-commercial use and re-use without restrictions.
Technically open means that data is easily accessible to its intended audience. If the intended users are developers and programmers, Shaw said, the data should be presented within an application programming interface (API); if it’s intended for researchers in academia, data might be structured in a bulk download; and if it’s aimed at the average citizen, data should be available without requiring software purchases.
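As a hedged illustration of what "technically open" buys a developer, machine-readable data can be filtered and counted in a few lines, something no scanned PDF or formatted report allows. The records below are invented stand-ins for a portal's CSV export.

```python
import csv
import io

# In practice this CSV would come from an open data portal's bulk download or API;
# here a small inline sample of invented records stands in for it.
raw_csv = """permit_id,issue_date,neighborhood
1001,2013-02-14,Downtown
1002,2013-07-09,Midtown
1003,2014-01-22,Downtown
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Because the format is machine-readable, filtering and counting take one line each.
issued_2013 = [r for r in rows if r["issue_date"].startswith("2013")]
print(f"{len(issued_2013)} of {len(rows)} permits were issued in 2013")
```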
….

4 Steps to Open Data

Creating open data isn’t without its complexities. There are many tasks that need to happen before an open data project ever begins. A full endorsement from leadership is paramount. Adding the project into the workflow is another. And allaying fears and misunderstandings is expected with any government project.
After the basic table stakes are placed, the handbook prescribes four steps: choosing a set of data, attaching an open license, making it available through a proper format and ensuring the data is discoverable.
1. Choose a Data Set
Choosing a data set can appear daunting, but it doesn’t have to be. Shaw said ample resources are available from the foundation and others on how to get started with this — see our list of open data resources for more information. In the case of selecting a data set, or sets, she referred to the foundation’s recently updated guidelines that urge identifying data sets based on goals and the demand from citizen feedback.
2. Attach an Open License
Open licenses dispel ambiguity and encourage use. However, they need to be proactive, and this means users should not be forced to request the information in order to use it — a common symptom of data accessed through the Freedom of Information Act. Tips for reference can be found at Opendefinition.org, a site that has a list of examples and links to open licenses that meet the definition of open use.
3. Format the Data to Your Audience
As previously stated, Shaw recommends tailoring the format of data to the audience, with the ideal being that data is packaged in formats that can be digested by all users: developers, civic hackers, department staff, researchers and citizens. This could mean it’s put into APIs, spreadsheet docs, text and zip files, FTP servers and torrent networking systems (a way to download files from different sources). The file type and the system for download all depends on the audience.
“Part of learning about what formats government should offer data in is to engage with the prospective users,” Shaw said.
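A small sketch of what packaging the same data for several audiences can look like in practice; the records and file names are invented, and real portals automate this at much larger scale.

```python
import csv
import json

# Invented sample records standing in for a government data set.
records = [
    {"facility": "Main Library", "visits_2013": 180432, "zip": "95814"},
    {"facility": "North Branch", "visits_2013": 75210, "zip": "95815"},
]

# CSV for spreadsheet users and bulk downloads.
with open("library_visits.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)

# JSON for developers consuming the data programmatically, e.g. behind an API.
with open("library_visits.json", "w") as f:
    json.dump(records, f, indent=2)
```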
4. Make it Discoverable
If open data is strewn across multiple download links and wedged into various nooks and crannies of a website, it probably won’t be found. Shaw recommends a centralized hub that acts as a one-stop shop for all open data downloads. In many jurisdictions, these Web pages and websites have been called “portals;” they are the online repositories for a jurisdiction’s open data publishing.
“It is important for thinking about how people can become aware of what their governments hold. If the government doesn’t make it easy for people to know what kinds of data is publicly available on the website, it doesn’t matter what format it’s in,” Shaw said. She pointed to public participation — a recurring theme in open data development — to incorporate into the process to improve accessibility.
 
Examples of portals can be found in numerous cities across the U.S., such as San Francisco, New York, Los Angeles, Chicago and Sacramento, Calif.”

HarassMap: Using Crowdsourced Data to Map Sexual Harassment in Egypt


Chelsea Young in Technology Innovation Management Review: “Through a case study of HarassMap, an advocacy, prevention, and response tool that uses crowdsourced data to map incidents of sexual harassment in Egypt, this article examines the application of crowdsourcing technology to drive innovation in the field of social policy. This article applies a framework that explores the potential, limitations, and future applications of crowdsourcing technology in this sector to reveal how crowdsourcing technology can be applied to overcome cultural and environmental constraints that have traditionally impeded the collection of data. Many of the lessons emerging from this case study hold relevance beyond the field of social policy. Applied to specific problems, this technology can be used to improve the efficiency and effectiveness of mitigation strategies, while facilitating rapid and informed decision making based on “good enough” data. However, this case also illustrates a number of challenges arising from the integrity of crowdsourced data and the potential for ethical conflict when using this data to inform policy formulation.”

The Potential of Crowdsourcing to Improve Patient-Centered Care


Michael Weiner in the Journal The Patient – Patient-Centered Outcomes Research: “Crowdsourcing (CS) is the outsourcing of a problem or task to a crowd. Although patient-centered care (PCC) may aim to be tailored to an individual’s needs, the uses of CS for generating ideas, identifying values, solving problems, facilitating research, and educating an audience represent powerful roles that can shape both allocation of shared resources and delivery of personalized care and treatment. CS can often be conducted quickly and at relatively low cost. Pitfalls include bias, risks of research ethics, inadequate quality of data, inadequate metrics, and observer-expectancy effect. Health professionals and consumers in the US should increase their attention to CS for the benefit of PCC. Patients’ participation in CS to shape health policy and decisions is one way to pursue PCC itself and may help to improve clinical outcomes through a better understanding of patients’ perspectives. CS should especially be used to traverse the quality-cost curve, or decrease costs while preserving or improving quality of care.”

Infomediary Business Models for Connecting Open Data Providers and Users


Paper by Marijn Janssen and Anneke Zuiderwijk in Social Science Computer Review: “Many public organizations are opening their data to the general public and embracing social media in order to stimulate innovation. These developments have resulted in the rise of new, infomediary business models, positioned between open data providers and users. Yet the variation among types of infomediary business models is little understood. The aim of this article is to contribute to the understanding of the diversity of existing infomediary business models that are driven by open data and social media. Cases presenting different modes of open data utilization in the Netherlands are investigated and compared. Six types of business models are identified: single-purpose apps, interactive apps, information aggregators, comparison models, open data repositories, and service platforms. The investigated cases differ in their levels of access to raw data and in how much they stimulate dialogue between different stakeholders involved in open data publication and use. Apps often are easy to use and provide predefined views on data, whereas service platforms provide comprehensive functionality but are more difficult to use. In the various business models, social media is sometimes used for rating and discussion purposes, but it is rarely used for stimulating dialogue or as input to policy making. Hybrid business models were identified in which both public and private organizations contribute to value creation. Distinguishing between different types of open data users was found to be critical in explaining different business models.”

Ten Innovations to Compete for Global Innovation Award


Making All Voices Count: “The Global Innovation Competition was launched at the Open Government Partnership Summit in November, 2013 and set out to scout the globe for fresh ideas to enhance government accountability and boost citizen engagement. The call was worldwide and in response, nearly 200 innovative ideas were submitted. After a process of public voting and peer review, these have been reduced to ten.
Below, we highlight the innovations that will now compete for a prize of £65,000 plus six months mentorship at the Global Innovation Week March 31 – April 4, 2014 in Kenya.
The first seven emerged from a process of peer review and the following three were selected by the Global Innovation Jury.

An SMS gateway, connected to local hospitals and the web, to channel citizens’ requests for pregnancy services. At-risk women, in need of information such as hospital locations and general advice, will receive relevant and targeted updates utilising both an SMS and a GIS-based system. The aim is to reduce maternal mortality by targeting at-risk women in poorer communities in Indonesia.

“One of the causes of high maternal mortality rate in Indonesia is late response in childbirth treatment and lack of pregnancy care information.”

This project, led by a civil servant, aims to engage citizens in Pakistan in service delivery governance. The project aims to enable and motivate citizens to collect, analyze and disseminate service delivery performance data in order to drive performance and help effective decision making.

“BSDU will serve as a model of better management aided by the citizens, for the citizens.”

A Geographic Information System that gives Indonesian citizens access to information regarding government funded projects. The idea is to enable and motivate citizens to compare a project’s information with its real-world implementation and to provide feedback on this. The ultimate aim is to fight corruption in the public sector by making it easier for citizens to monitor, and provide feedback on, government-funded projects.

“On-the-map information about government-funded projects, where citizens are able to submit their opinions, should became a global standard in budget transparency!”

A digital payment system in South Africa that rewards citizens who participate in activities such as waste separation and community gardening. Citizens are able to ‘spend’ rewards on airtime, pre-paid electricity and groceries. By rewarding social volunteers this project aims to boost citizen engagement, build trust and establish the link between government and citizen actors.

“GEM offers a direct channel for communication and rewards between governments and citizens.”

An app created by a team of software developers to provide Ghanaian citizens with information about the oil and gas industry, with the aim of raising awareness of the revenue generated and to spark debate about how this could be used to improve national development.

“The idea is to bring citizens, the oil and gas companies and the government all onto one platform.”

Ghana Petrol Watch seeks to deliver basic facts and figures associated with oil and gas exploration to the average Ghanaian. The solution employs mobile technology to deliver this information. The audience can voice their concerns as comments on the issue via replies to the SMS. These would then be published on the web portal for further exposure and publicity.

“The information on the petroleum industry is publicly available, but not readily accessible and often does not reach the grassroots community in an easily comprehensible manner.”

A common platform to be implemented in Khulna City, Bangladesh, where citizens and elected officials will interact on budget, expenditure and information.

“The concept of citizen engagement for the fulfillment of pre-election commitment is an innovation in establishing governance.”

The aim of this project is an increase in child engagement in governmental budgeting and policy formulation in Mwanza City, Tanzania. This project was selected as a wildcard by the Global Innovation Jury.

“In many projects I have seen, children are always the perceived beneficiaries, rarely do you see innovations where children are active participants in achieving a goal in their society. It was great to see children as active contributors to their own discourse.” – Jury Member, Shikoh Gitau.

A ‘watchdog’ newsletter in Kenya focusing on monitoring the actions of officials with the aim of educating, empowering and motivating citizens to hold their leaders to account. This project was selected as a wildcard by the Global Innovation Jury.

“We endeavor to bridge the information gap in northern Kenya by giving voice to the voiceless and also highlighting their challenges. The aim is an increase in the educational level of the people through information.”

Citizen Desk is an open-source tool that combines the ability of citizens to share eyewitness reports with the public need for verified information in real time. Citizen Desk lets citizen journalists file reports via SMS or social media, with no need for technical training. This project was selected as a wildcard by the Global Innovation Jury.

“It has become evident for some time now that good technical innovation must rest on a strong bedrock of social and political activity, on the ground, deeply in touch with local conditions, and sometimes in the face of power and privilege.” – Jury Member Bright Simons.”