Bad Data


Bad Data is a site providing real-world examples of how not to prepare or provide data. It showcases the poorly structured, the mis-formatted, or the just plain ugly. Its primary purpose is to educate – though there may also be some aspect of entertainment.
As a side-product it also provides a source of good practice material for budding data wranglers (the repo in fact began as a place to keep practice data for Data Explorer).
New examples wanted and welcome – submit them here »

Examples

Garbage In, Garbage Out… Or, How to Lie with Bad Data


Medium: For everyone who slept through Stats 101, Charles Wheelan’s Naked Statistics is a lifesaver. From batting averages and political polls to Schlitz ads and medical research, Wheelan “illustrates exactly why even the most reluctant mathophobe is well advised to achieve a personal understanding of the statistical underpinnings of life” (New York Times). What follows is adapted from the book, out now in paperback.
Behind every important study there are good data that made the analysis possible. And behind every bad study . . . well, read on. People often speak about “lying with statistics.” I would argue that some of the most egregious statistical mistakes involve lying with data; the statistical analysis is fine, but the data on which the calculations are performed are bogus or inappropriate. Here are some common examples of “garbage in, garbage out.”

Selection Bias

….Selection bias can be introduced in many other ways. A survey of consumers in an airport is going to be biased by the fact that people who fly are likely to be wealthier than the general public; a survey at a rest stop on Interstate 90 may have the opposite problem. Both surveys are likely to be biased by the fact that people who are willing to answer a survey in a public place are different from people who would prefer not to be bothered. If you ask 100 people in a public place to complete a short survey, and 60 are willing to answer your questions, those 60 are likely to be different in significant ways from the 40 who walked by without making eye contact.

Publication Bias

Positive findings are more likely to be published than negative findings, which can skew the results that we see. Suppose you have just conducted a rigorous, longitudinal study in which you find conclusively that playing video games does not prevent colon cancer. You’ve followed a representative sample of 100,000 Americans for twenty years; those participants who spend hours playing video games have roughly the same incidence of colon cancer as the participants who do not play video games at all. We’ll assume your methodology is impeccable. Which prestigious medical journal is going to publish your results?

None, for two reasons. First, there is no strong scientific reason to believe that playing video games has any impact on colon cancer, so it is not obvious why you were doing this study. Second, and more relevant here, the fact that something does not prevent cancer is not a particularly interesting finding. After all, most things don’t prevent cancer. Negative findings are not especially sexy, in medicine or elsewhere.
The net effect is to distort the research that we see, or do not see. Suppose that one of your graduate school classmates has conducted a different longitudinal study. She finds that people who spend a lot of time playing video games do have a lower incidence of colon cancer. Now that is interesting! That is exactly the kind of finding that would catch the attention of a medical journal, the popular press, bloggers, and video game makers (who would slap labels on their products extolling the health benefits of their products). It wouldn’t be long before Tiger Moms all over the country were “protecting” their children from cancer by snatching books out of their hands and forcing them to play video games instead.
Of course, one important recurring idea in statistics is that unusual things happen every once in a while, just as a matter of chance. If you conduct 100 studies, one of them is likely to turn up results that are pure nonsense—like a statistical association between playing video games and a lower incidence of colon cancer. Here is the problem: The 99 studies that find no link between video games and colon cancer will not get published, because they are not very interesting. The one study that does find a statistical link will make it into print and get loads of follow-on attention. The source of the bias stems not from the studies themselves but from the skewed information that actually reaches the public. Someone reading the scientific literature on video games and cancer would find only a single study, and that single study will suggest that playing video games can prevent cancer. In fact, 99 studies out of 100 would have found no such link.
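The arithmetic behind this paragraph can be made concrete with a small simulation. The sketch below is an illustration, not anything taken from the book: it assumes a 5 percent significance threshold, a two-proportion z-test, and invented group sizes and incidence (the passage names none of these). Every simulated study compares two groups with the same true cancer risk, so any "significant" result is a chance finding.

```python
import math
import random


def two_proportion_p_value(cases_a, n_a, cases_b, n_b):
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    pooled = (cases_a + cases_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (cases_a / n_a - cases_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_studies(n_studies, group_size, incidence, rng):
    """Run studies in which gamers and non-gamers share the same true risk."""
    p_values = []
    for _ in range(n_studies):
        gamers = sum(rng.random() < incidence for _ in range(group_size))
        others = sum(rng.random() < incidence for _ in range(group_size))
        p_values.append(
            two_proportion_p_value(gamers, group_size, others, group_size)
        )
    return p_values


# Hypothetical parameters: 100 studies, 500 people per group, 5% incidence.
rng = random.Random(0)
p_values = simulate_null_studies(100, 500, 0.05, rng)

# Publication filter (assumed): only results with p < 0.05 reach print.
published = [p for p in p_values if p < 0.05]
print(f"{len(published)} of {len(p_values)} null studies look significant")
```

With a 0.05 threshold, roughly five null studies in a hundred would be expected to clear the bar by chance. If only those are published, a reader of the literature sees a link that none of the underlying data support.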

Recall Bias

Memory is a fascinating thing—though not always a great source of good data. We have a natural human impulse to understand the present as a logical consequence of things that happened in the past—cause and effect. The problem is that our memories turn out to be “systematically fragile” when we are trying to explain some particularly good or bad outcome in the present. Consider a study looking at the relationship between diet and cancer. In 1993, a Harvard researcher compiled a data set comprising a group of women with breast cancer and an age-matched group of women who had not been diagnosed with cancer. Women in both groups were asked about their dietary habits earlier in life. The study produced clear results: The women with breast cancer were significantly more likely to have had diets that were high in fat when they were younger.
Ah, but this wasn’t actually a study of how diet affects the likelihood of getting cancer. This was a study of how getting cancer affects a woman’s memory of her diet earlier in life. All of the women in the study had completed a dietary survey years earlier, before any of them had been diagnosed with cancer. The striking finding was that women with breast cancer recalled a diet that was much higher in fat than what they actually consumed; the women with no cancer did not.

The New York Times Magazine described the insidious nature of this recall bias:

The diagnosis of breast cancer had not just changed a woman’s present and the future; it had altered her past. Women with breast cancer had (unconsciously) decided that a higher-fat diet was a likely predisposition for their disease and (unconsciously) recalled a high-fat diet. It was a pattern poignantly familiar to anyone who knows the history of this stigmatized illness: these women, like thousands of women before them, had searched their own memories for a cause and then summoned that cause into memory.

Recall bias is one reason that longitudinal studies are often preferred to cross-sectional studies. In a longitudinal study the data are collected contemporaneously. At age five, a participant can be asked about his attitudes toward school. Then, thirteen years later, we can revisit that same participant and determine whether he has dropped out of high school. In a cross-sectional study, in which all the data are collected at one point in time, we must ask an eighteen-year-old high school dropout how he or she felt about school at age five, which is inherently less reliable.

Survivorship Bias

Suppose a high school principal reports that test scores for a particular cohort of students have risen steadily for four years. The sophomore scores for this class were better than their freshman scores. The scores from junior year were better still, and the senior year scores were best of all. We’ll stipulate that there is no cheating going on, and not even any creative use of descriptive statistics. Every year this cohort of students has done better than it did the preceding year, by every possible measure: mean, median, percentage of students at grade level, and so on. Would you (a) nominate this school leader for “principal of the year” or (b) demand more data?

I say “b.” I smell survivorship bias, which occurs when some or many of the observations are falling out of the sample, changing the composition of the observations that are left and therefore affecting the results of any analysis. Let’s suppose that our principal is truly awful. The students in his school are learning nothing; each year half of them drop out. Well, that could do very nice things for the school’s test scores—without any individual student testing better. If we make the reasonable assumption that the worst students (with the lowest test scores) are the most likely to drop out, then the average test scores of those students left behind will go up steadily as more and more students drop out. (If you have a room of people with varying heights, forcing the short people to leave will raise the average height in the room, but it doesn’t make anyone taller.)
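The principal's rising scores can be reproduced with a few lines of arithmetic. This is a minimal sketch with invented scores; the function name and the rule that exactly the lowest-scoring half leaves each year are assumptions made for illustration. No student's score ever changes.

```python
from statistics import mean


def yearly_means(scores, years, dropout_rate):
    """Mean score per year when the lowest scorers leave and nobody improves."""
    cohort = sorted(scores)
    means = []
    for _ in range(years):
        means.append(mean(cohort))
        n_leaving = int(len(cohort) * dropout_rate)
        cohort = cohort[n_leaving:]  # the lowest scorers drop out
    return means


# Hypothetical freshman class: 16 students scoring 40, 44, ..., 100.
freshman_scores = list(range(40, 104, 4))

# Half the cohort drops out every year, as in the passage.
print(yearly_means(freshman_scores, years=4, dropout_rate=0.5))
```

The yearly averages climb from 70 to 98 even though every remaining student scores exactly what they scored as a freshman; only the composition of the cohort changes.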

Healthy User Bias

People who take vitamins regularly are likely to be healthy—because they are the kind of people who take vitamins regularly! Whether the vitamins have any impact is a separate issue. Consider the following thought experiment. Suppose public health officials promulgate a theory that all new parents should put their children to bed only in purple pajamas, because that helps stimulate brain development. Twenty years later, longitudinal research confirms that having worn purple pajamas as a child does have an overwhelmingly large positive association with success in life. We find, for example, that 98 percent of entering Harvard freshmen wore purple pajamas as children (and many still do) compared with only 3 percent of inmates in the Massachusetts state prison system.

Of course, the purple pajamas do not matter; but having the kind of parents who put their children in purple pajamas does matter. Even when we try to control for factors like parental education, we are still going to be left with unobservable differences between those parents who obsess about putting their children in purple pajamas and those who don’t. As New York Times health writer Gary Taubes explains, “At its simplest, the problem is that people who faithfully engage in activities that are good for them—taking a drug as prescribed, for instance, or eating what they believe is a healthy diet—are fundamentally different from those who don’t.” This effect can potentially confound any study trying to evaluate the real effect of activities perceived to be healthful, such as exercising regularly or eating kale. We think we are comparing the health effects of two diets: kale versus no kale. In fact, if the treatment and control groups are not randomly assigned, we are comparing two diets that are being eaten by two different kinds of people. We have a treatment group that is different from the control group in two respects, rather than just one.
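The "two respects rather than one" point can be shown with a small table of counts. The numbers below are invented for illustration, and the "careful person" label is a hypothetical stand-in for the differences the passage describes. By construction, health depends only on being a careful person and never on kale.

```python
# Invented counts. Key: (is_careful, eats_kale) -> (people, healthy people).
# Careful people are healthy 80% of the time, others 40%, kale or no kale.
GROUPS = {
    (True, True): (800, 640),
    (True, False): (200, 160),
    (False, True): (200, 80),
    (False, False): (800, 320),
}


def healthy_rate(groups, eats_kale, careful=None):
    """Share of healthy people among kale eaters or non-eaters.

    If `careful` is given, only that kind of person is counted.
    """
    people = healthy = 0
    for (is_careful, kale), (n, n_healthy) in groups.items():
        if kale != eats_kale:
            continue
        if careful is not None and is_careful != careful:
            continue
        people += n
        healthy += n_healthy
    return healthy / people


# Naive comparison: kale eaters look much healthier.
print(healthy_rate(GROUPS, eats_kale=True), healthy_rate(GROUPS, eats_kale=False))

# Comparing like with like: the apparent kale effect disappears.
for careful in (True, False):
    print(careful,
          healthy_rate(GROUPS, True, careful),
          healthy_rate(GROUPS, False, careful))
```

In this toy table the trait can be observed and controlled for. In real studies it usually cannot, which is why the passage stresses random assignment to treatment and control groups.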

If statistics is detective work, then the data are the clues. My wife spent a year teaching high school students in rural New Hampshire. One of her students was arrested for breaking into a hardware store and stealing some tools. The police were able to crack the case because (1) it had just snowed and there were tracks in the snow leading from the hardware store to the student’s home; and (2) the stolen tools were found inside. Good clues help.
Like good data. But first you have to get good data, and that is a lot harder than it seems.

Public Open Sensor Data: Revolutionizing Smart Cities


New paper in Technology and Society Magazine, IEEE (Volume: 32, Issue: 4): “Local governments have decided to take advantage of the presence of wireless sensor networks (WSNs) in their cities to efficiently manage several applications in their daily responsibilities. The enormous amount of information collected by sensor devices allows the automation of several real-time services to improve city management by using intelligent traffic-light patterns during rush hour, reducing water consumption in parks, or efficiently routing garbage collection trucks throughout the city [1]. The sensor information required by these examples is mostly self-consumed by city-designed applications and managers.”

Imagining Data Without Division


Thomas Lin in Quanta Magazine: “As science dives into an ocean of data, the demands of large-scale interdisciplinary collaborations are growing increasingly acute…Seven years ago, when David Schimel was asked to design an ambitious data project called the National Ecological Observatory Network, it was little more than a National Science Foundation grant. There was no formal organization, no employees, no detailed science plan. Emboldened by advances in remote sensing, data storage and computing power, NEON sought answers to the biggest question in ecology: How do global climate change, land use and biodiversity influence natural and managed ecosystems and the biosphere as a whole?…
For projects like NEON, interpreting the data is a complicated business. Early on, the team realized that its data, while mid-size compared with the largest physics and biology projects, would be big in complexity. “NEON’s contribution to big data is not in its volume,” said Steve Berukoff, the project’s assistant director for data products. “It’s in the heterogeneity and spatial and temporal distribution of data.”
Unlike the roughly 20 critical measurements in climate science or the vast but relatively structured data in particle physics, NEON will have more than 500 quantities to keep track of, from temperature, soil and water measurements to insect, bird, mammal and microbial samples to remote sensing and aerial imaging. Much of the data is highly unstructured and difficult to parse — for example, taxonomic names and behavioral observations, which are sometimes subject to debate and revision.
And, as daunting as the looming data crush appears from a technical perspective, some of the greatest challenges are wholly nontechnical. Many researchers say the big science projects and analytical tools of the future can succeed only with the right mix of science, statistics, computer science, pure mathematics and deft leadership. In the big data age of distributed computing — in which enormously complex tasks are divided across a network of computers — the question remains: How should distributed science be conducted across a network of researchers?
Part of the adjustment involves embracing “open science” practices, including open-source platforms and data analysis tools, data sharing and open access to scientific publications, said Chris Mattmann, 32, who helped develop a precursor to Hadoop, a popular open-source data analysis framework that is used by tech giants like Yahoo, Amazon and Apple and that NEON is exploring. Without developing shared tools to analyze big, messy data sets, Mattmann said, each new project or lab will squander precious time and resources reinventing the same tools. Likewise, sharing data and published results will obviate redundant research.
To this end, international representatives from the newly formed Research Data Alliance met this month in Washington to map out their plans for a global open data infrastructure.”

User-Generated Content Is Here to Stay


In the Huffington Post: “The way media are transmitted has changed dramatically over the last 10 years. User-generated content (UGC) has completely changed the landscape of social interaction, media outreach, consumer understanding, and everything in between. Today, UGC is media generated by the consumer instead of the traditional journalists and reporters. This is a movement defying and redefining traditional norms at the same time. Current events are largely publicized on Twitter and Facebook by the average person, and not by a photojournalist hired by a news organization. In the past, these large news corporations dominated the headlines (literally) and owned the monopoly on public media. Yet with the advent of smartphones and spread of social media, everything has changed. The entire industry has been replaced; smartphones have transformed how information is collected, packaged, edited, and conveyed for mass distribution. UGC allows for raw and unfiltered movement of content at lightning speed. With the way that the world works today, it is the most reliable way to get information out. One thing that is for certain is that UGC is here to stay whether we like it or not, and it is driving much more of modern journalistic content than the average person realizes.
Think about recent natural disasters where images are captured by citizen journalists using their iPhones. During Hurricane Sandy, 800,000 photos were uploaded onto Instagram with “#Sandy.” Time magazine even hired five iPhoneographers to photograph the wreckage for its Instagram page. During the May 2013 Oklahoma City tornadoes, the first photo released was actually captured by a smartphone. This real-time footage brings environmental chaos to your doorstep in a chillingly personal way, especially considering the photographer of the first tornado photos ultimately died because of the tornado. UGC has been monumental for criminal investigations and man-made catastrophes. Most notably, the Boston Marathon bombing was covered by UGC in the most unforgettable way. Dozens of images poured in identifying possible Boston bombers, to both the detriment and benefit of public officials and investigators. Though these images inflicted considerable damage to innocent bystanders sporting suspicious backpacks, ultimately it was also smartphone images that highlighted the presence of the Tsarnaev brothers. This phenomenon isn’t limited to America. Would the so-called Arab Spring have happened without social media and UGC? Syrians, Egyptians, and citizens from numerous nations facing protests can easily publicize controversial images and statements to be shared worldwide….
This trend is not temporary but will only expand. The first iPhone launched in 2007, and the world has never been the same. New smartphones are released each month with better cameras and faster processors than computers had even just a few years ago….”

Introducing Socrata’s Open Data Magazine: Open Innovation


“Socrata is dedicated to telling the story of open data as it evolves, which is why we have launched a quarterly magazine, “Open Innovation.”
As innovators push the open data movement forward, they are transforming government and public engagement at every level. With thousands of innovators all over the world – each with their own successes, advice, and ideas – there is a tremendous amount of story for us to tell.
The new magazine features articles, advice, infographics, and more dedicated exclusively to the open data movement. The first issue, Fall 2013, will cover topics such as:

  • What is a Chief Data Officer?
  • Who should be on your open data team?
  • How do you publish your first open data set?

It will also include four Socrata case studies and opinion pieces from some of the industry’s leading innovators…
The magazine is currently free to download or read online through the Socrata website. It is optimized for viewing on tablets and smart phones, with plans in the works to make the magazine available through the Kindle Fire and iTunes magazine stores.
Check out the first issue of Open Innovation at www.socrata.com/magazine.”

Government Is a Good Venture Capitalist


Wall Street Journal: “In a knowledge-intensive economy, innovation drives growth. But what drives innovation? In the U.S., most conservatives believe that economically significant new ideas originate in the private sector, through either the research-and-development investments of large firms with deep pockets or the inspiration of obsessive inventors haunting shabby garages. In this view, the role of government is to secure the basic conditions for honest and efficient commerce—and then get out of the way. Anything more is bound to be “wasteful” and “burdensome.”
The real story is more complex and surprising. For more than four decades, R&D magazine has recognized the top innovations—100 each year—that have moved past the conceptual stage into commercial production and sales. Economic sociologists Fred Block and Matthew Keller decided to ask a simple question: Where did these award-winning innovations come from?
The data indicated seven kinds of originating entities: Fortune 500 companies; small and medium enterprises (including startups); collaborations among private entities; government laboratories; universities; spinoffs started by researchers at government labs or universities; and a grab bag of other public and nonprofit agencies.
Messrs. Block and Keller randomly selected three years in each of the past four decades and analyzed the resulting 1,200 innovations. About 10% originated in foreign entities; the sociologists focused on the domestic innovations, more than 1,050.
Two of their findings stand out. First, the number of award winners originating in Fortune 500 companies—either working alone or in collaboration with others—has declined steadily and sharply, from an annual average of 44 in the 1970s to only nine in the first decade of this century.
Second, the number of top innovations originating in federal laboratories, universities or firms formed by former researchers in those entities rose dramatically, from 18 in the 1970s to 37 in the 1980s and 55 in the 1990s before falling slightly to 49 in the 2000s. Without the research conducted in federal labs and universities (much of it federally funded), commercial innovation would have been far less robust…”

Is making stories touchable the next big thing for journalism?


At Gigaom: “The best way to explain fracking is to let people do it, believes former LA Times reporter David Sarno, which is why he started to build interactive storytelling experiences based on game design tools….
It seems like a simple enough concept: We experience storytelling through our senses. So the more senses you add to an experience, the more immersive it can be — a concept that’s the root of Lighthaus, a new start-up founded by former journalist David Sarno.
Sarno spent eight years reporting on technology for the Los Angeles Times, but thanks to a Stanford fellowship, is now focusing on a new venture that applies game design principles to create touchable interactive graphics — graphics which can help bring important stories to life.

As demoed above, Sarno and a team of artists and designers have built an interactive experience illustrating the realities of fracking — a “touchable story” created, Sarno says, “in less than a month for a few thousand dollars.” The goal, Sarno told me in a Skype interview, is to get faster and cheaper.
While relatively new, Lighthaus already has a few clients: One is the Stanford Medicine magazine — Sarno is designing a guide to the condition placenta accreta as part of an issue focusing on childbirth.”

Create a Crowd Competition That Works


Ahmad Ashkar in HBR Blog Network: “It’s no secret that people in business are turning to the crowd to solve their toughest challenges. Well-known sites like Kickstarter and Indiegogo allow people to raise money for new projects. Design platforms like Crowdspring and 99designs give people the tools needed to crowdsource graphic design ideas and feedback.
At the Hult Prize — a start-up accelerator that challenges Millennials to develop innovative social enterprises to solve our world’s most pressing issues (and rewards the top team with $1,000,000 in start-up capital) — we’ve learned that the crowd can also offer an unorthodox solution in developing innovative and disruptive ideas, particularly ones focused on tackling complex, large-scale social issues.
But to effectively harness the power of the crowd, you have to engage it carefully. Over the past four years, we’ve developed a well-defined set of principles that guide our annual “challenge,” (lauded by Bill Clinton in TIME magazine as one of the top five initiatives changing the world for the better) that produces original and actionable ideas to solve social issues.
Companies like Netflix, General Electric, and Procter & Gamble have also started “challenging the crowd” and employing many of these principles to tackle their own business roadblocks. If you’re looking to spark disruptive and powerful ideas that benefit your company, follow these guidelines to launch an engaging competition:
1. Define the boundaries
2. Identify a specific and bold stretch target. …
3. Insist on low barriers to entry. …
4. Encourage teams and networks. …
5. Provide a toolkit. Once interested parties become participants in your challenge, provide tools to set them up for success. If you are working on a social problem, you can use IDEO’s human-centered design toolkit. If you have a private-sector challenge, consider posting it on an existing innovation platform. As an organizer, you don’t have to spend time recreating the wheel — use one of the many existing platforms and borrow materials from those willing to share.”

Targeting Transparency


New paper by David Weil, Mary Graham, and Archon Fung in Science Magazine: “When rules, taxes, or subsidies prove impractical as policy tools, governments increasingly employ “targeted transparency,” compelling disclosure of information as an alternative means of achieving specific objectives. For example, the U.S. Affordable Care Act of 2010 requires calories be posted on menus to enlist both restaurants and patrons in the effort to reduce obesity. It is crucial to understand when and how such targeted transparency works, as well as when it is inappropriate. Research about its use and effectiveness has begun to take shape, drawing on social and behavioral scientists, economists, and legal scholars. We explore questions central to the performance of targeted transparency policies.

Targeted transparency differs from broader “right-to-know” and “open-government” policies that span from the 1966 Freedom of Information Act to the Obama Administration’s “open-government” initiative encouraging officials to make existing data sets readily available and easy to parse as an end in itself (1, 2). Targeted transparency offers a more focused approach often used to introduce new scientific evidence of public risks into market choices. Government compels companies or agencies to disclose information in standardized formats to reduce specific risks, to ameliorate externalities arising from a failure of consumers or producers to fully consider social costs associated with a product, or to improve provision of public goods and services. Such policies are more light-handed than conventional regulation, relying on the power of information rather than on enforcement of rules and standards or financial inducements….”

See also the Transparency Policy Project at http://transparencypolicy.net/