Selected Readings on Big Data


The Living Library’s Selected Readings series seeks to build a knowledge base on innovative approaches for improving the effectiveness and legitimacy of governance. This curated and annotated collection of recommended works on the topic of big data was originally published in 2014.

Big Data refers to the wide-scale collection, aggregation, storage, analysis and use of data. Government is increasingly in control of a massive amount of raw data that, when analyzed and put to use, can lead to new insights on everything from public opinion to environmental concerns. The burgeoning literature on Big Data argues that it generates value by: creating transparency; enabling experimentation to discover needs, expose variability, and improve performance; segmenting populations to customize actions; replacing/supporting human decision making with automated algorithms; and innovating new business models, products and services. The insights drawn from data analysis can also be visualized in a manner that passes along relevant information, even to those without the tech savvy to understand the data on its own terms (see The GovLab Selected Readings on Data Visualization).

Selected Reading List (in alphabetical order)

Annotated Selected Reading List (in alphabetical order)

Australian Government Information Management Office. The Australian Public Service Big Data Strategy: Improved Understanding through Enhanced Data-analytics Capability Strategy Report. August 2013. http://bit.ly/17hs2xY.

  • This Big Data Strategy produced for Australian Government senior executives with responsibility for delivering services and developing policy is aimed at ingraining in government officials that the key to increasing the value of big data held by government is the effective use of analytics. Essentially, “the value of big data lies in [our] ability to extract insights and make better decisions.”
  • This positions big data as a national asset that can be used to “streamline service delivery, create opportunities for innovation, identify new service and policy approaches as well as supporting the effective delivery of existing programs across a broad range of government operations.”

Bollier, David. The Promise and Peril of Big Data. The Aspen Institute, Communications and Society Program, 2010. http://bit.ly/1a3hBIA.

  • This report captures insights from the 2009 Roundtable exploring uses of Big Data within a number of important consumer behavior and policy implication contexts.
  • The report concludes that, “Big Data presents many exciting opportunities to improve modern society. There are incalculable opportunities to make scientific research more productive, and to accelerate discovery and innovation. People can use new tools to help improve their health and well-being, and medical care can be made more efficient and effective. Government, too, has a great stake in using large databases to improve the delivery of government services and to monitor for threats to national security.
  • However, “Big Data also presents many formidable challenges to government and citizens precisely because data technologies are becoming so pervasive, intrusive and difficult to understand. How shall society protect itself against those who would misuse or abuse large databases? What new regulatory systems, private-law innovations or social practices will be capable of controlling anti-social behaviors–and how should we even define what is socially and legally acceptable when the practices enabled by Big Data are so novel and often arcane?”

Boyd, Danah and Kate Crawford. “Six Provocations for Big Data.” A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society. September 2011http://bit.ly/1jJstmz.

  • In this paper, Boyd and Crawford raise challenges to unchecked assumptions and biases regarding big data. The paper makes a number of assertions about the “computational culture” of big data and pushes back against those who consider big data to be a panacea.
  • The authors’ provocations for big data are:
    • Automating Research Changes the Definition of Knowledge
    • Claims to Objectivity and Accuracy are Misleading
    • Big Data is not always Better Data
    • Not all Data is Equivalent
    • Just Because it is accessible doesn’t make it ethical
    • Limited Access to Big Data creates New Digital Divide

The Economist Intelligence Unit. Big Data and the Democratisation of Decisions. October 2012. http://bit.ly/17MpH8L.

  • This report from the Economist Intelligence Unit focuses on the positive impact of big data adoption in the private sector, but its insights can also be applied to the use of big data in governance.
  • The report argues that innovation can be spurred by democratizing access to data, allowing a diversity of stakeholders to “tap data, draw lessons and make business decisions,” which in turn helps companies and institutions respond to new trends and intelligence at varying levels of decision-making power.

Manyika, James, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers. Big Data: The Next Frontier for Innovation, Competition, and Productivity.  McKinsey & Company. May 2011. http://bit.ly/18Q5CSl.

  • This report argues that big data “will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus, and that “leaders in every sector will have to grapple with the implications of big data.” 
  • The report offers five broad ways in which using big data can create value:
    • First, big data can unlock significant value by making information transparent and usable at much higher frequency.
    • Second, as organizations create and store more transactional data in digital form, they can collect more accurate and detailed performance information on everything from product inventories to sick days, and therefore expose variability and boost performance.
    • Third, big data allows ever-narrower segmentation of customers and therefore much more precisely tailored products or services.
    • Fourth, big sophisticated analytics can substantially improve decision-making.
    • Finally, big data can be used to improve the development of the next generation of products and services.

The Partnership for Public Service and the IBM Center for The Business of Government. “From Data to Decisions II: Building an Analytics Culture.” October 17, 2012. https://bit.ly/2EbBTMg.

  • This report discusses strategies for better leveraging data analysis to aid decision-making. The authors argue that, “Organizations that are successful at launching or expanding analytics program…systematically examine their processes and activities to ensure that everything they do clearly connects to what they set out to achieve, and they use that examination to pinpoint weaknesses or areas for improvement.”
  • While the report features many strategies for government decisions-makers, the central recommendation is that, “leaders incorporate analytics as a way of doing business, making data-driven decisions transparent and a fundamental approach to day-to-day management. When an analytics culture is built openly, and the lessons are applied routinely and shared widely, an agency can embed valuable management practices in its DNA, to the mutual benet of the agency and the public it serves.”

TechAmerica Foundation’s Federal Big Data Commission. “Demystifying Big Data: A Practical Guide to Transforming the Business of Government.” 2013. http://bit.ly/1aalUrs.

  • This report presents key big data imperatives that government agencies must address, the challenges and the opportunities posed by the growing volume of data and the value Big Data can provide. The discussion touches on the value of big data to businesses and organizational mission, presents case study examples of big data applications, technical underpinnings and public policy applications.
  • The authors argue that new digital information, “effectively captured, managed and analyzed, has the power to change every industry including cyber security, healthcare, transportation, education, and the sciences.” To ensure that this opportunity is realized, the report proposes a detailed big data strategy framework with the following steps: define, assess, plan, execute and review.

World Economic Forum. “Big Data, Big Impact: New Possibilities for International Development.” 2012. http://bit.ly/17hrTKW.

  • This report examines the potential for channeling the “flood of data created every day by the interactions of billions of people using computers, GPS devices, cell phones, and medical devices” into “actionable information that can be used to identify needs, provide services, and predict and prevent crises for the benefit of low-income populations”
  • The report argues that, “To realise the mutual benefits of creating an environment for sharing mobile-generated data, all ecosystem actors must commit to active and open participation. Governments can take the lead in setting policy and legal frameworks that protect individuals and require contractors to make their data public. Development organisations can continue supporting governments and demonstrating both the public good and the business value that data philanthropy can deliver. And the private sector can move faster to create mechanisms for the sharing data that can benefit the public.”

How Government Can Make Open Data Work


Joel Gurin in Information Week: “At the GovLab at New York University, where I am senior adviser, we’re taking a different approach than McKinsey’s to understand the evolving value of government open data: We’re studying open data companies from the ground up. I’m now leading the GovLab’s Open Data 500 project, funded by the John S. and James L. Knight Foundation, to identify and examine 500 American companies that use government open data as a key business resource.
Our preliminary results show that government open data is fueling companies both large and small, across the country, and in many sectors of the economy, including health, finance, education, energy, and more. But it’s not always easy to use this resource. Companies that use government open data tell us it is often incomplete, inaccurate, or trapped in hard-to-use systems and formats.
It will take a thorough and extended effort to make government data truly useful. Based on what we are hearing and the research I did for my book, here are some of the most important steps the federal government can take, starting now, to make it easier for companies to add economic value to the government’s data.
1. Improve data quality
The Open Data Policy not only directs federal agencies to release more open data; it also requires them to release information about data quality. Agencies will have to begin improving the quality of their data simply to avoid public embarrassment. We can hope and expect that they will do some data cleanup themselves, demand better data from the businesses they regulate, or use creative solutions like turning to crowdsourcing for help, as USAID did to improve geospatial data on its grantees.
 
 

2. Keep improving open data resources
The government has steadily made Data.gov, the central repository of federal open data, more accessible and useful, including a significant relaunch last week. To the agency’s credit, the GSA, which administers Data.gov, plans to keep working to make this key website still better. As part of implementing the Open Data Policy, the administration has also set up Project Open Data on GitHub, the world’s largest community for open-source software. These resources will be helpful for anyone working with open data either inside or outside of government. They need to be maintained and continually improved.
3. Pass DATA
The Digital Accountability and Transparency Act would bring transparency to federal government spending at an unprecedented level of detail. The Act has strong bipartisan support. It passed the House with only one dissenting vote and was unanimously approved by a Senate committee, but still needs full Senate approval and the President’s signature to become law. DATA is also supported by technology companies who see it as a source of new open data they can use in their businesses. Congress should move forward and pass DATA as the logical next step in the work that the Obama administration’s Open Data Policy has begun.
4. Reform the Freedom of Information Act
Since it was passed in 1966, the federal Freedom of Information Act has gone through two major revisions, both of which strengthened citizens’ ability to access many kinds of government data. It’s time for another step forward. Current legislative proposals would establish a centralized web portal for all federal FOIA requests, strengthen the FOIA ombudsman’s office, and require agencies to post more high-interest information online before they receive formal requests for it. These changes could make more information from FOIA requests available as open data.
5. Engage stakeholders in a genuine way
Up to now, the government’s release of open data has largely been a one-way affair: Agencies publish datasets that they hope will be useful without consulting the organizations and companies that want to use it. Other countries, including the UK, France, and Mexico, are building in feedback loops from data users to government data providers, and the US should, too. The Open Data Policy calls for agencies to establish points of contact for public feedback. At the GovLab, we hope that the Open Data 500 will help move that process forward. Our research will provide a basis for new, productive dialogue between government agencies and the businesses that rely on them.
6. Keep using federal challenges to encourage innovation
The federal Challenge.gov website applies the best principles of crowdsourcing and collective intelligence. Agencies should use this approach extensively, and should pose challenges using the government’s open data resources to solve business, social, or scientific problems. Other approaches to citizen engagement, including federally sponsored hackathons and the White House Champions of Change program, can play a similar role.
Through the Open Data Policy and other initiatives, the Obama administration has set the right goals. Now it’s time to implement and move toward what US CTO Todd Park calls “data liberation.” Thousands of companies, organizations, and individuals will benefit.”

New Open Data Tool Helps Countries Compare Progress on Education


World Bank Group: “The World Bank Group today launched a new open data tool that provides in-depth, comparative, and easily accessible data on education policies around the world. The Systems Approach for Better Education Results (SABER) web tool helps countries collect and analyze information on their education policies, benchmark themselves against other countries, and prioritize areas for reform, with the goal of ensuring that all children and youth go to school and learn….
To date, the Bank Group, through SABER, has analyzed more than 100 countries to guide more effective reforms and investments in education at all levels, from pre-primary to tertiary education and workforce development.
Through SABER, the Bank Group aims to improve education quality by supplying policymakers, civil society, school administrators, teachers, parents, and students with more, and more meaningful, data about key education policy areas, including early childhood development, student assessment, teachers, school autonomy and accountability, and workforce development, among others.
SABER helps countries improve their education systems in three ways:

  1. Providing new data on policies and institutions. SABER collects comparable country data on education policies and institutions that are publicly available at: http://worldbank.org/education/saber, allowing governments, researchers, and other stakeholders to measure and monitor progress.
  2. Benchmarking education policies and institutions. Each policy area is rated on a four-point scale, from “Latent” to “Emerging” to “Established” and “Advanced.” These ratings highlight a country’s areas of strength and weakness while promoting cross-country learning.
  3. Highlighting key policy choices. SABER data collection and analysis produce an objective snapshot of how well a country’s education system is performing in relation to global good practice. This helps highlight the most important policy choices to spur learning.”

Brazil let its citizens make decisions about city budgets. Here’s what happened.


Brian Wampler and Mike Touchton in the Washington Post: “Over the past 20 years, “participatory institutions” have spread around the world. Participatory institutions delegate decision-making authority directly to citizens, often in local politics, and have attracted widespread support.  International organizations, such as the World Bank and USAID, promote citizen participation in hopes that it will generate more accountable governments, strengthen social networks, improve public services, and inform voters. Elected officials often support citizen participation because it provides them the legitimacy necessary to alter spending patterns, develop new programs, mobilize citizens, or open murky policymaking processes to greater public scrutiny. Civil society organizations and citizens support participating institution because they get unprecedented access to policymaking venues, public budgets and government officials.
But do participatory institutions actually achieve any of these beneficial outcomes?  In a new study of participatory institutions in Brazil, we find that they do.  In particular, we find that municipalities with participatory programs improve the lives of their citizens.
Brazil is a leading innovator in participatory institutions. Brazilian municipal governments can voluntarily adopt a program known as Participatory Budgeting. This program directly incorporates citizens into public meetings where citizens decide how to allocate public funds. The funding amounts can represent up to 100 percent of all new capital spending projects and generally fall between 5 and 15 percent of the total municipal budget.  This is not enough to radically change how cities spend limited resources, but it is enough to generate meaningful change. For example, the Brazilian cities of Belo Horizonte and Porto Alegre have each spent hundreds of millions of U.S. dollars over the past two decades on projects that citizens selected. Moreover, many Participatory Budgeting programs have an outsize impact because they focus resources on areas that have lower incomes and fewer public services.
Between 1990 and 2008, over 120 of Brazil’s largest 250 cities adopted Participatory Budgeting. In order to assess whether PB had an impact, we compared the number of cities that adopted Participatory Budgeting during each mayoral period to cities that did not adopt it, and accounted for a range of other factors that might distinguish these two groups of cities.
The results are promising. Municipal governments that adopted Participatory Budgeting spent more on education and sanitation and saw infant mortality decrease as well. We estimate cities without PB to have infant mortality levels similar to Brazil’s mean. However, infant mortality drops by almost 20 percent for municipalities that have used PB for more than eight years — again, after accounting for other political and economic factors that might also influence infant mortality.  The evidence strongly suggests that the investment in these programs is paying important dividends. We are not alone in this conclusion: Sónia Gonçalves has reached similar conclusions about Participatory Budgeting in Brazil….
Our results also show that Participatory Budgeting’s influence strengthens over time, which indicates that its benefits do not merely result from governments making easy policy changes. Instead, Participatory Budgeting’s increasing impact indicates that governments, citizens, and civil society organizations are building new institutions that produce better forms of governance. These cities incorporate citizens at multiple moments of the policy process, allowing community leaders and public officials to exchange better information. The cities are also retraining policy experts and civil servants to better work with poor communities. Finally, public deliberation about spending priorities makes these city governments more transparent, which decreases corruption…”

Citizen roles in civic problem-solving and innovation


Satish Nambisan: “Can citizens be fruitfully engaged in solving civic problems? Recent initiatives in cities such as Boston (Citizens Connect), Chicago (Smart Chicago Collaborative), San Francisco (ImproveSF) and New York (NYC BigApps) indicate that citizens can be involved in not just identifying and reporting civic problems but in conceptualizing, designing and developing, and implementing solutions as well.
The availability of new technologies (e.g. social media) has radically lowered the cost of collaboration and the “distance” between government agencies and the citizens they serve. Further involving citizens — who are often closest to and possess unique knowledge about the problems they face — makes a lot of sense given the increasing complexity of the problems that need to be addressed.
A recent research report that I wrote highlights four distinct roles that citizens can play in civic innovation and problem-solving.
As explorer, citizens can identify and report emerging and existing civic problems. For example, Boston’s Citizen Connect initiative enables citizens to use specially built smartphone apps to report minor and major civic problems (from potholes and graffiti to water/air pollution). Closer to home, both Wisconsin and Minnesota have engaged thousands of citizen volunteers in collecting data on the quality of water in their neighborhood streams, lakes and rivers (the data thus gathered are analyzed by the state pollution control agency). Citizens also can be engaged in data analysis. The N.Y.-based Datakind initiative involves citizen volunteers using their data analysis skills to mine public data in health, education, environment, etc., to identify important civic issues and problems.
As “ideator,”citizens can conceptualize novel solutions to well-defined problems in public services. For example, the federal government’s Challenge.gov initiative employs online contests and competitions to solicit innovative ideas from citizens to solve important civic problems. Such “crowdsourcing” initiatives also have been launched at the county, city and state levels (e.g. Prize2theFuture competition in Birmingham, Ala.; ImproveSF in San Francisco).
As designer, citizens can design and/or develop implementable solutions to well-defined civic problems. For example, as part of initiatives such as NYC Big Apps and Apps for California, citizens have designed mobile apps to address specific issues such as public parking availability, public transport delays, etc. Similarly, the City Repair project in Portland, Ore., focuses on engaging citizens in co-designing and creatively transforming public places into sustainable community-oriented urban spaces.
As diffuser,citizens can play the role of a change agent and directly support the widespread adoption of civic innovations and solutions. For example, in recent years, physicians interacting with peer physicians in dedicated online communities have assisted federal and state government agencies in diffusing health technology innovations such as electronic medical record systems (EMRs).
In the private sector, companies across industries have benefited much from engaging with their customers in innovation. Evidence so far suggests that the benefits from citizen engagement in civic problem-solving are equally tangible, valuable and varied. However, the challenges associated with organizing such citizen co-creation initiatives are also many and imply the need for government agencies to adopt an intentional, well-thought-out approach….”

The Eight Key Issues of Digital Government


Andrea Di Maio (Gartner): “…Now, to set the record straight, I do believe digital government is profoundly different from e-government as well as from government 2.0 (although in some jurisdictions the latter terms still looks more relevant than “digital”). Whereas there are many differences as far as technologies and what they make possible,political will, and evolving citizen demand, my contention is that the single most fundamental difference is in the relevance of data and how new and unforeseen uses of data can truly transform the way governments deliver their services and perform their operations.
This is not at all just about government as a platform or open government, where government is primarily a provider of data that constituents – be they citizens, business or intermediaries – use and mash up in new ways. It is also about government themselves inventing new ways to user their own as well as constituents’ data. It is only by striking the right balance between being a data provider and being a data broker and consumer that governments will find the right path to being truly digital.
During the Gartner Symposia I attended last fall, I had numerous interesting conversations with people who are exploring very innovative ways of using its own data, such as:

  • tax authorities contemplating to use up-to-date financial information about taxpayers to proactively suggest investments that may provide tax breaks,
  • education institutions leveraging data about student location from their original purpose (giving parents information about students’ whereabouts) to providing new tools for teachers to understand behavioral patterns and relate those to more personalized learning
  • immigration authorities leveraging data coming from video analysis, whose role is to flag suspicious immigrants for secondary inspection, to inform public safety authorities or the hospitality sector about specific issues and opportunities with tourists.

In the second half of 2013, Gartner government analysts focused on distilling the fundamental components of a digital government initiative, in order to be able to shape our research and advice in ways that hit the most important issues that client face. The new government research agenda has just been published (see Agenda Overview for Government, 2014) and eight key issues, grouped in three distinct areas,  that need to be addressed to successfully transform into a digital government organization.
Engaging Citizens

  • Service Delivery Innovation: How will governments use technology to support innovative services that produce better results for society?
  • Open Government: How will governments create and sustain a digital ecosystem that citizens can trust and want to participate in?

Connecting Agencies

  • New Digital Business Models: What data-driven business models will emerge to meet the growing needs for adequate and sustainable public services?
  • Joint Governance: How will governance coordinate IT and service decisions across independent public and private organizations?
  • Scalable Interoperability: How much interoperability is needed to support connected government services and at what cost?

Resourcing Government

  • Workforce Innovation: How will the IT organization and role transform to support government workforce innovation?
  • Adaptive Sourcing: How will government IT organizations expand their sourcing strategies to take advantage of competitive cloud-based and consumer-grade solutions?
  • Sustainable Financing: How will government IT organizations obtain and manage the financial resources required to connect government and engage citizens?”

Garbage In, Garbage Out… Or, How to Lie with Bad Data


Medium: For everyone who slept through Stats 101, Charles Wheelan’s Naked Statistics is a lifesaver. From batting averages and political polls to Schlitz ads and medical research, Wheelan “illustrates exactly why even the most reluctant mathophobe is well advised to achieve a personal understanding of the statistical underpinnings of life” (New York Times). What follows is adapted from the book, out now in paperback.
Behind every important study there are good data that made the analysis possible. And behind every bad study . . . well, read on. People often speak about “lying with statistics.” I would argue that some of the most egregious statistical mistakes involve lying with data; the statistical analysis is fine, but the data on which the calculations are performed are bogus or inappropriate. Here are some common examples of “garbage in, garbage out.”

Selection Bias

….Selection bias can be introduced in many other ways. A survey of consumers in an airport is going to be biased by the fact that people who fly are likely to be wealthier than the general public; a survey at a rest stop on Interstate 90 may have the opposite problem. Both surveys are likely to be biased by the fact that people who are willing to answer a survey in a public place are different from people who would prefer not to be bothered. If you ask 100 people in a public place to complete a short survey, and 60 are willing to answer your questions, those 60 are likely to be different in significant ways from the 40 who walked by without making eye contact.

Publication Bias

Positive findings are more likely to be published than negative findings, which can skew the results that we see. Suppose you have just conducted a rigorous, longitudinal study in which you find conclusively that playing video games does not prevent colon cancer. You’ve followed a representative sample of 100,000 Americans for twenty years; those participants who spend hours playing video games have roughly the same incidence of colon cancer as the participants who do not play video games at all. We’ll assume your methodology is impeccable. Which prestigious medical journal is going to publish your results?

Most things don’t prevent cancer.

None, for two reasons. First, there is no strong scientific reason to believe that playing video games has any impact on colon cancer, so it is not obvious why you were doing this study. Second, and more relevant here, the fact that something does not prevent cancer is not a particularly interesting finding. After all, most things don’t prevent cancer. Negative findings are not especially sexy, in medicine or elsewhere.
The net effect is to distort the research that we see, or do not see. Suppose that one of your graduate school classmates has conducted a different longitudinal study. She finds that people who spend a lot of time playing video games do have a lower incidence of colon cancer. Now that is interesting! That is exactly the kind of finding that would catch the attention of a medical journal, the popular press, bloggers, and video game makers (who would slap labels on their products extolling the health benefits of their products). It wouldn’t be long before Tiger Moms all over the country were “protecting” their children from cancer by snatching books out of their hands and forcing them to play video games instead.
Of course, one important recurring idea in statistics is that unusual things happen every once in a while, just as a matter of chance. If you conduct 100 studies, one of them is likely to turn up results that are pure nonsense—like a statistical association between playing video games and a lower incidence of colon cancer. Here is the problem: The 99 studies that find no link between video games and colon cancer will not get published, because they are not very interesting. The one study that does find a statistical link will make it into print and get loads of follow-on attention. The source of the bias stems not from the studies themselves but from the skewed information that actually reaches the public. Someone reading the scientific literature on video games and cancer would find only a single study, and that single study will suggest that playing video games can prevent cancer. In fact, 99 studies out of 100 would have found no such link.

Recall Bias

Memory is a fascinating thing—though not always a great source of good data. We have a natural human impulse to understand the present as a logical consequence of things that happened in the past—cause and effect. The problem is that our memories turn out to be “systematically fragile” when we are trying to explain some particularly good or bad outcome in the present. Consider a study looking at the relationship between diet and cancer. In 1993, a Harvard researcher compiled a data set comprising a group of women with breast cancer and an age-matched group of women who had not been diagnosed with cancer. Women in both groups were asked about their dietary habits earlier in life. The study produced clear results: The women with breast cancer were significantly more likely to have had diets that were high in fat when they were younger.
Ah, but this wasn’t actually a study of how diet affects the likelihood of getting cancer. This was a study of how getting cancer affects a woman’s memory of her diet earlier in life. All of the women in the study had completed a dietary survey years earlier, before any of them had been diagnosed with cancer. The striking finding was that women with breast cancer recalled a diet that was much higher in fat than what they actually consumed; the women with no cancer did not.

Women with breast cancer recalled a diet that was much higher in fat than what they actually consumed; the women with no cancer did not.

The New York Times Magazine described the insidious nature of this recall bias:

The diagnosis of breast cancer had not just changed a woman’s present and the future; it had altered her past. Women with breast cancer had (unconsciously) decided that a higher-fat diet was a likely predisposition for their disease and (unconsciously) recalled a high-fat diet. It was a pattern poignantly familiar to anyone who knows the history of this stigmatized illness: these women, like thousands of women before them, had searched their own memories for a cause and then summoned that cause into memory.

Recall bias is one reason that longitudinal studies are often preferred to cross-sectional studies. In a longitudinal study the data are collected contemporaneously. At age five, a participant can be asked about his attitudes toward school. Then, thirteen years later, we can revisit that same participant and determine whether he has dropped out of high school. In a cross-sectional study, in which all the data are collected at one point in time, we must ask an eighteen-year-old high school dropout how he or she felt about school at age five, which is inherently less reliable.

Survivorship Bias

Suppose a high school principal reports that test scores for a particular cohort of students has risen steadily for four years. The sophomore scores for this class were better than their freshman scores. The scores from junior year were better still, and the senior year scores were best of all. We’ll stipulate that there is no cheating going on, and not even any creative use of descriptive statistics. Every year this cohort of students has done better than it did the preceding year, by every possible measure: mean, median, percentage of students at grade level, and so on. Would you (a) nominate this school leader for “principal of the year” or (b) demand more data?

If you have a room of people with varying heights, forcing the short people to leave will raise the average height in the room, but it doesn’t make anyone taller.

I say “b.” I smell survivorship bias, which occurs when some or many of the observations are falling out of the sample, changing the composition of the observations that are left and therefore affecting the results of any analysis. Let’s suppose that our principal is truly awful. The students in his school are learning nothing; each year half of them drop out. Well, that could do very nice things for the school’s test scores—without any individual student testing better. If we make the reasonable assumption that the worst students (with the lowest test scores) are the most likely to drop out, then the average test scores of those students left behind will go up steadily as more and more students drop out. (If you have a room of people with varying heights, forcing the short people to leave will raise the average height in the room, but it doesn’t make anyone taller.)

Healthy User Bias

People who take vitamins regularly are likely to be healthy—because they are the kind of people who take vitamins regularly! Whether the vitamins have any impact is a separate issue. Consider the following thought experiment. Suppose public health officials promulgate a theory that all new parents should put their children to bed only in purple pajamas, because that helps stimulate brain development. Twenty years later, longitudinal research confirms that having worn purple pajamas as a child does have an overwhelmingly large positive association with success in life. We find, for example, that 98 percent of entering Harvard freshmen wore purple pajamas as children (and many still do) compared with only 3 percent of inmates in the Massachusetts state prison system.

The purple pajamas do not matter.

Of course, the purple pajamas do not matter; but having the kind of parents who put their children in purple pajamas does matter. Even when we try to control for factors like parental education, we are still going to be left with unobservable differences between those parents who obsess about putting their children in purple pajamas and those who don’t. As New York Times health writer Gary Taubes explains, “At its simplest, the problem is that people who faithfully engage in activities that are good for them—taking a drug as prescribed, for instance, or eating what they believe is a healthy diet—are fundamentally different from those who don’t.” This effect can potentially confound any study trying to evaluate the real effect of activities perceived to be healthful, such as exercising regularly or eating kale. We think we are comparing the health effects of two diets: kale versus no kale. In fact, if the treatment and control groups are not randomly assigned, we are comparing two diets that are being eaten by two different kinds of people. We have a treatment group that is different from the control group in two respects, rather than just one.

If statistics is detective work, then the data are the clues. My wife spent a year teaching high school students in rural New Hampshire. One of her students was arrested for breaking into a hardware store and stealing some tools. The police were able to crack the case because (1) it had just snowed and there were tracks in the snow leading from the hardware store to the student’s home; and (2) the stolen tools were found inside. Good clues help.
Like good data. But first you have to get good data, and that is a lot harder than it seems.

Open Development (Networked Innovations in International Development)


New book edited by Matthew L. Smith and Katherine M. A. Reilly (Foreword by Yochai Benkler) : “The emergence of open networked models made possible by digital technology has the potential to transform international development. Open network structures allow people to come together to share information, organize, and collaborate. Open development harnesses this power, to create new organizational forms and improve people’s lives; it is not only an agenda for research and practice but also a statement about how to approach international development. In this volume, experts explore a variety of applications of openness, addressing challenges as well as opportunities.
Open development requires new theoretical tools that focus on real world problems, consider a variety of solutions, and recognize the complexity of local contexts. After exploring the new theoretical terrain, the book describes a range of cases in which open models address such specific development issues as biotechnology research, improving education, and access to scholarly publications. Contributors then examine tensions between open models and existing structures, including struggles over privacy, intellectual property, and implementation. Finally, contributors offer broader conceptual perspectives, considering processes of social construction, knowledge management, and the role of individual intent in the development and outcomes of social models.”

When Lean Startup Arrives in a Trojan Horse–Innovation in Extreme Bureaucracy


Steven Hodas @ The Lean Startup Conference 2013 –…Steven runs an procurement-innovation program in one of the world’s most notorious bureaucracies: the New York City Department of Education. In a fear-driven atmosphere, with lots of incentive to not be embarrassed, he’ll talk about the challenges he’s faced and progress he’s made testing new ideas.

Can a Better Taxonomy Help Behavioral Energy Efficiency?


Article at GreenTechEfficiency: “Hundreds of behavioral energy efficiency programs have sprung up across the U.S. in the past five years, but the effectiveness of the programs — both in terms of cost savings and reduced energy use — can be difficult to gauge.
Of nearly 300 programs, a new report from the American Council for an Energy-Efficient Economy was able to accurately calculate the cost of saved energy from only ten programs….
To help utilities and regulators better define and measure behavioral programs, ACEEE offers a new taxonomy of utility-run behavior programs that breaks them into three major categories:
Cognition: Programs that focus on delivering information to consumers.  (This includes general communication efforts, enhanced billing and bill inserts, social media and classroom-based education.)
Calculus: Programs that rely on consumers making economically rational decisions. (This includes real-time and asynchronous feedback, dynamic pricing, games, incentives and rebates and home energy audits.)
Social interaction: Programs whose key drivers are social interaction and belonging. (This includes community-based social marketing, peer champions, online forums and incentive-based gifts.)
….
While the report was mostly preliminary, it also offered four steps forward for utilities that want to make the most of behavioral programs.
Stack. The types of programs might fit into three broad categories, but judiciously blending cues based on emotion, reason and social interaction into programs is key, according to ACEEE. Even though the report recommends stacked programs that have a multi-modal approach, the authors acknowledge, “This hypothesis will remain untested until we see more stacked programs in the marketplace.”
Track. Just like other areas of grid modernization, utilities need to rethink how they collect, analyze and report the data coming out of behavioral programs. This should include metrics that go beyond just energy savings.
Share. As with other utility programs, behavior-based energy efficiency programs can be improved upon if utilities share results and if reporting is standardized across the country instead of varying by state.
Coordinate. Sharing is only the first step. Programs that merge water, gas and electricity efficiency can often gain better results than siloed programs. That approach, however, requires a coordinated effort by regional utilities and a change to how programs are funded and evaluated by regulators.”