How Government Can Make Open Data Work

Joel Gurin in Information Week: “At the GovLab at New York University, where I am senior adviser, we’re taking a different approach than McKinsey’s to understand the evolving value of government open data: We’re studying open data companies from the ground up. I’m now leading the GovLab’s Open Data 500 project, funded by the John S. and James L. Knight Foundation, to identify and examine 500 American companies that use government open data as a key business resource.
Our preliminary results show that government open data is fueling companies both large and small, across the country, and in many sectors of the economy, including health, finance, education, energy, and more. But it’s not always easy to use this resource. Companies that use government open data tell us it is often incomplete, inaccurate, or trapped in hard-to-use systems and formats.
It will take a thorough and extended effort to make government data truly useful. Based on what we are hearing and the research I did for my book, here are some of the most important steps the federal government can take, starting now, to make it easier for companies to add economic value to the government’s data.
1. Improve data quality
The Open Data Policy not only directs federal agencies to release more open data; it also requires them to release information about data quality. Agencies will have to begin improving the quality of their data simply to avoid public embarrassment. We can hope and expect that they will do some data cleanup themselves, demand better data from the businesses they regulate, or use creative solutions like turning to crowdsourcing for help, as USAID did to improve geospatial data on its grantees.

2. Keep improving open data resources
The government has steadily made the central repository of federal open data more accessible and useful, including a significant relaunch last week. To the agency’s credit, the GSA, which administers the site, plans to keep working to make this key website still better. As part of implementing the Open Data Policy, the administration has also set up Project Open Data on GitHub, the world’s largest community for open-source software. These resources will be helpful for anyone working with open data either inside or outside of government. They need to be maintained and continually improved.
3. Pass DATA
The Digital Accountability and Transparency Act would bring transparency to federal government spending at an unprecedented level of detail. The Act has strong bipartisan support. It passed the House with only one dissenting vote and was unanimously approved by a Senate committee, but still needs full Senate approval and the President’s signature to become law. DATA is also supported by technology companies who see it as a source of new open data they can use in their businesses. Congress should move forward and pass DATA as the logical next step in the work that the Obama administration’s Open Data Policy has begun.
4. Reform the Freedom of Information Act
Since it was passed in 1966, the federal Freedom of Information Act has gone through two major revisions, both of which strengthened citizens’ ability to access many kinds of government data. It’s time for another step forward. Current legislative proposals would establish a centralized web portal for all federal FOIA requests, strengthen the FOIA ombudsman’s office, and require agencies to post more high-interest information online before they receive formal requests for it. These changes could make more information from FOIA requests available as open data.
5. Engage stakeholders in a genuine way
Up to now, the government’s release of open data has largely been a one-way affair: Agencies publish datasets that they hope will be useful without consulting the organizations and companies that want to use them. Other countries, including the UK, France, and Mexico, are building in feedback loops from data users to government data providers, and the US should, too. The Open Data Policy calls for agencies to establish points of contact for public feedback. At the GovLab, we hope that the Open Data 500 will help move that process forward. Our research will provide a basis for new, productive dialogue between government agencies and the businesses that rely on them.
6. Keep using federal challenges to encourage innovation
The federal challenge website applies the best principles of crowdsourcing and collective intelligence. Agencies should use this approach extensively, and should pose challenges using the government’s open data resources to solve business, social, or scientific problems. Other approaches to citizen engagement, including federally sponsored hackathons and the White House Champions of Change program, can play a similar role.
Through the Open Data Policy and other initiatives, the Obama administration has set the right goals. Now it’s time to implement and move toward what US CTO Todd Park calls “data liberation.” Thousands of companies, organizations, and individuals will benefit.”

New Open Data Tool Helps Countries Compare Progress on Education

World Bank Group: “The World Bank Group today launched a new open data tool that provides in-depth, comparative, and easily accessible data on education policies around the world. The Systems Approach for Better Education Results (SABER) web tool helps countries collect and analyze information on their education policies, benchmark themselves against other countries, and prioritize areas for reform, with the goal of ensuring that all children and youth go to school and learn….
To date, the Bank Group, through SABER, has analyzed more than 100 countries to guide more effective reforms and investments in education at all levels, from pre-primary to tertiary education and workforce development.
Through SABER, the Bank Group aims to improve education quality by supplying policymakers, civil society, school administrators, teachers, parents, and students with more, and more meaningful, data about key education policy areas, including early childhood development, student assessment, teachers, school autonomy and accountability, and workforce development, among others.
SABER helps countries improve their education systems in three ways:

  1. Providing new data on policies and institutions. SABER collects comparable country data on education policies and institutions and makes them publicly available, allowing governments, researchers, and other stakeholders to measure and monitor progress.
  2. Benchmarking education policies and institutions. Each policy area is rated on a four-point scale, from “Latent” to “Emerging” to “Established” and “Advanced.” These ratings highlight a country’s areas of strength and weakness while promoting cross-country learning.
  3. Highlighting key policy choices. SABER data collection and analysis produce an objective snapshot of how well a country’s education system is performing in relation to global good practice. This helps highlight the most important policy choices to spur learning.”

Use big data and crowdsourcing to detect nuclear proliferation, says DSB

FierceGovernmentIT: “A changing set of counter-nuclear proliferation problems requires a paradigm shift in monitoring that should include big data analytics and crowdsourcing, says a report from the Defense Science Board.
Much has changed since the Cold War when it comes to ensuring that nuclear weapons are subject to international controls, meaning that monitoring in support of treaties covering declared capabilities should be only one part of overall U.S. monitoring efforts, says the board in a January report (.pdf).
There are challenges related to covert operations, such as testing calibrated to fall below detection thresholds, and non-traditional technologies that present ambiguous threat signatures. Knowledge about how to make nuclear weapons is widespread and in the hands of actors who will give the United States or its allies limited or no access….
The report recommends using a slew of technologies including radiation sensors, but also exploitation of digital sources of information.
“Data gathered from the cyber domain establishes a rich and exploitable source for determining activities of individuals, groups and organizations needed to participate in either the procurement or development of a nuclear device,” it says.
Big data analytics could be used to take advantage of the proliferation of potential data sources including commercial satellite imaging, social media and other online sources.
The report notes that the proliferation of readily available commercial satellite imagery has created concerns about the introduction of more noise than genuine signal. “On balance, however, it is the judgment from the task force that more information from remote sensing systems, both commercial and dedicated national assets, is better than less information,” it says.
In fact, the ready availability of commercial imagery should be an impetus of governmental ability to find weak signals “even within the most cluttered and noisy environments.”
Crowdsourcing also holds potential, although the report again notes that nuclear proliferation analysis by non-governmental entities “will constrain the ability of the United States to keep its options open in dealing with potential violations.” The distinction between gathering information and making political judgments “will erode.”
An effort by Georgetown University students (reported in the Washington Post in 2011) to use open source data to analyze the network of tunnels used in China to hide its missile and nuclear arsenal provides a proof-of-concept on how crowdsourcing can be used to augment limited analytical capacity, the report says – despite debate on the students’ work, which concluded that China’s arsenal could be many times larger than conventionally accepted…
For more:
download the DSB report, “Assessment of Nuclear Monitoring and Verification Technologies” (.pdf)
read the WaPo article on the Georgetown University crowdsourcing effort”

The Power to Decide

Special Report by Antonio Regalado in MIT Technology Review: “Back in 1956, an engineer and a mathematician, William Fair and Earl Isaac, pooled $800 to start a company. Their idea: a score to handicap whether a borrower would repay a loan.
It was all done with pen and paper. Income, gender, and occupation produced numbers that amounted to a prediction about a person’s behavior. By the 1980s the three-digit scores were calculated on computers and instead took account of a person’s actual credit history. Today, Fair Isaac Corp., or FICO, generates about 10 billion credit scores annually, calculating 50 times a year for many Americans.
This machinery hums in the background of our financial lives, so it’s easy to forget that the choice of whether to lend used to be made by a bank manager who knew a man by his handshake. Fair and Isaac understood that all this could change, and that their company didn’t merely sell numbers. “We sell a radically different way of making decisions that flies in the face of tradition,” Fair once said.
This anecdote suggests a way of understanding the era of “big data”—terabytes of information from sensors or social networks, new computer architectures, and clever software. But even supercharged data needs a job to do, and that job is always about a decision.
In this business report, MIT Technology Review explores a big question: how are data and the analytical tools to manipulate it changing decision making today? On Nasdaq, trading bots exchange a billion shares a day. Online, advertisers bid on hundreds of thousands of keywords a minute, in deals greased by heuristic solutions and optimization models rather than two-martini lunches. The number of variables and the speed and volume of transactions are just too much for human decision makers.
When there’s a person in the loop, technology takes a softer approach (see “Software That Augments Human Thinking”). Think of recommendation engines on the Web that suggest products to buy or friends to catch up with. This works because Internet companies maintain statistical models of each of us, our likes and habits, and use them to decide what we see. In this report, we check in with LinkedIn, which maintains the world’s largest database of résumés—more than 200 million of them. One of its newest offerings is University Pages, which crunches résumé data to offer students predictions about where they’ll end up working depending on what college they go to (see “LinkedIn Offers College Choices by the Numbers”).
These smart systems, and their impact, are prosaic next to what’s planned. Take IBM. The company is pouring $1 billion into its Watson computer system, the one that answered questions correctly on the game show Jeopardy! IBM now imagines computers that can carry on intelligent phone calls with customers, or provide expert recommendations after digesting doctors’ notes. IBM wants to provide “cognitive services”—computers that think, or seem to (see “Facing Doubters, IBM Expands Plans for Watson”).
Andrew Jennings, chief analytics officer for FICO, says automating human decisions is only half the story. Credit scores had another major impact. They gave lenders a new way to measure the state of their portfolios—and to adjust them by balancing riskier loan recipients with safer ones. Now, as other industries get exposed to predictive data, their approach to business strategy is changing, too. In this report, we look at one technique that’s spreading on the Web, called A/B testing. It’s a simple tactic—put up two versions of a Web page and see which one performs better (see “Seeking Edge, Websites Turn to Experiments” and “Startups Embrace a Way to Fail Fast”).
Until recently, such optimization was practiced only by the largest Internet companies. Now, nearly any website can do it. Jennings calls this phenomenon “systematic experimentation” and says it will be a feature of the smartest companies. They will have teams constantly probing the world, trying to learn its shifting rules and deciding on strategies to adapt. “Winners and losers in analytic battles will not be determined simply by which organization has access to more data or which organization has more money,” Jennings has said.

Of course, there’s danger in letting the data decide too much. In this report, Duncan Watts, a Microsoft researcher specializing in social networks, outlines an approach to decision making that avoids the dangers of gut instinct as well as the pitfalls of slavishly obeying data. In short, Watts argues, businesses need to adopt the scientific method (see “Scientific Thinking in Business”).
To do that, they have been hiring a highly trained breed of business skeptics called data scientists. These are the people who create the databases, build the models, reveal the trends, and, increasingly, author the products. And their influence is growing in business. This could be why data science has been called “the sexiest job of the 21st century.” It’s not because mathematics or spreadsheets are particularly attractive. It’s because making decisions is powerful…”
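The A/B testing tactic described in the excerpt above (put up two versions of a Web page and see which one performs better) can be sketched as a two-proportion z-test. This is only a minimal illustration: the function name `ab_test` and the visitor and conversion counts are hypothetical and do not come from the report.

```python
import math


def ab_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of page versions A and B with a two-proportion z-test.

    conv_* are conversion counts and n_* are visitor counts (hypothetical inputs).
    Returns (rate_a, rate_b, z, two_sided_p_value).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Illustrative numbers only: 5,000 visitors per version.
rate_a, rate_b, z, p_value = ab_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"A: {rate_a:.3f}  B: {rate_b:.3f}  z = {z:.2f}  p = {p_value:.4f}")
```

A small p-value suggests the difference between the two versions is unlikely to be chance alone; real experiments also need a sample size and stopping rule fixed in advance.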

Citizen roles in civic problem-solving and innovation

Satish Nambisan: “Can citizens be fruitfully engaged in solving civic problems? Recent initiatives in cities such as Boston (Citizens Connect), Chicago (Smart Chicago Collaborative), San Francisco (ImproveSF) and New York (NYC BigApps) indicate that citizens can be involved in not just identifying and reporting civic problems but in conceptualizing, designing and developing, and implementing solutions as well.
The availability of new technologies (e.g. social media) has radically lowered the cost of collaboration and the “distance” between government agencies and the citizens they serve. Further involving citizens — who are often closest to and possess unique knowledge about the problems they face — makes a lot of sense given the increasing complexity of the problems that need to be addressed.
A recent research report that I wrote highlights four distinct roles that citizens can play in civic innovation and problem-solving.
As explorer, citizens can identify and report emerging and existing civic problems. For example, Boston’s Citizens Connect initiative enables citizens to use specially built smartphone apps to report minor and major civic problems (from potholes and graffiti to water/air pollution). Closer to home, both Wisconsin and Minnesota have engaged thousands of citizen volunteers in collecting data on the quality of water in their neighborhood streams, lakes and rivers (the data thus gathered are analyzed by the state pollution control agency). Citizens also can be engaged in data analysis. The N.Y.-based Datakind initiative involves citizen volunteers using their data analysis skills to mine public data in health, education, environment, etc., to identify important civic issues and problems.
As “ideator,” citizens can conceptualize novel solutions to well-defined problems in public services. For example, a federal government initiative employs online contests and competitions to solicit innovative ideas from citizens to solve important civic problems. Such “crowdsourcing” initiatives also have been launched at the county, city and state levels (e.g. Prize2theFuture competition in Birmingham, Ala.; ImproveSF in San Francisco).
As designer, citizens can design and/or develop implementable solutions to well-defined civic problems. For example, as part of initiatives such as NYC Big Apps and Apps for California, citizens have designed mobile apps to address specific issues such as public parking availability, public transport delays, etc. Similarly, the City Repair project in Portland, Ore., focuses on engaging citizens in co-designing and creatively transforming public places into sustainable community-oriented urban spaces.
As diffuser, citizens can play the role of a change agent and directly support the widespread adoption of civic innovations and solutions. For example, in recent years, physicians interacting with peer physicians in dedicated online communities have assisted federal and state government agencies in diffusing health technology innovations such as electronic medical record systems (EMRs).
In the private sector, companies across industries have benefited much from engaging with their customers in innovation. Evidence so far suggests that the benefits from citizen engagement in civic problem-solving are equally tangible, valuable and varied. However, the challenges associated with organizing such citizen co-creation initiatives are also many and imply the need for government agencies to adopt an intentional, well-thought-out approach….”

Opening up open data: An interview with Tim O’Reilly

McKinsey: “The tech entrepreneur, author, and investor looks at how open data is becoming a critical tool for business and government, as well as what needs to be done for it to be more effective.

We’re increasingly living in a world of black boxes. We don’t understand the way things work. And open-source software, open data are critical tools. We see this in the field of computer security. People say, “Well, we have to keep this secret.” Well, it turns out that the strongest security protocols are those that are secure even when people know how they work.

It seems to me that almost every great advance is a platform advance. When we have common standards, so much more happens.
And you think about the standardization of railroad gauges, the standardization of communications, protocols. Think about the standardization of roads, how fundamental those are to our society. And that’s actually kind of a bridge for my work on open government, because I’ve been thinking a lot about the notion of government as a platform.

We should define a little bit what we mean by “open,” because there’s open as in it’s open source. Anybody can take it and reuse it in whatever way they want. And I’m not sure that’s always necessary. There’s a pragmatic open and there’s an ideological open. And the pragmatic open is that it’s available. It’s available in a timely way, in a nonpreferential way, so that some people don’t get better access than others.
And if you look at so many of our apps now on the web, because they are ad-supported and free, we get a lot of the benefits of open. When the cost is low enough, it does in fact create many of the same conditions as a commons. That being said, that requires great restraint, as I said earlier, on the part of companies, because it becomes easy for them to say, “Well, actually we just need to take a little bit more of the value for ourselves. And oh, we just need a bit more of that.” And before long, it really isn’t open at all.

Eric Ries, of Lean Startup fame, talks about a start-up as a machine for learning under conditions of extreme uncertainty.
He said it doesn’t have to do with being a small company, being anything new. He says it’s just whenever you’re trying to do something new, where you don’t know the answers, you have to experiment. You have to have a mechanism for measuring. You have to have mechanisms for changing what you do based on the response to that measurement…
That’s one of the biggest problems, I think, in our government today, that we put out programs. Somebody has a theory about what’s going to work and what the benefit will be. We don’t measure it. We don’t actually see if it did what we thought it was going to do. And we keep doing it. And then it doesn’t work, so we do something else. And then we layer on program after program that doesn’t actually meet its objectives. And if we actually brought in the mind-set that said, “No, actually we’re going to figure out if we actually accomplish what we set out to accomplish; and if we don’t, we’re going to change it,” that would be huge.”

Social Media: A Critical Introduction

New book: “Now more than ever, we need to understand social media – the good as well as the bad. We need critical knowledge that helps us to navigate the controversies and contradictions of this complex digital media landscape. Only then can we make informed judgements about what’s happening in our media world, and why.
Showing the reader how to ask the right kinds of questions about social media, Christian Fuchs takes us on a journey across social media, delving deep into case studies on Google, Facebook, Twitter, WikiLeaks and Wikipedia. The result lays bare the structures and power relations at the heart of our media landscape.
This book is the essential, critical guide for understanding social media and for all students of media studies and sociology. Readers will never look at social media the same way again.”
Sample chapter:
Twitter and Democracy: A New Public Sphere?
Introduction: What is a Critical Introduction to Social Media?

How Internet surveillance predicts disease outbreak before WHO

Kurzweil News: “Have you ever Googled for an online diagnosis before visiting a doctor? If so, you may have helped provide early warning of an infectious disease epidemic.
In a new study published in Lancet Infectious Diseases, Internet-based surveillance has been found to detect infectious diseases such as Dengue Fever and Influenza up to two weeks earlier than traditional surveillance methods, according to Queensland University of Technology (QUT) research fellow and senior author of the paper Wenbiao Hu.
Hu, based at the Institute for Health and Biomedical Innovation, said there was often a lag time of two weeks before traditional surveillance methods could detect an emerging infectious disease.
“This is because traditional surveillance relies on the patient recognizing the symptoms and seeking treatment before diagnosis, along with the time taken for health professionals to alert authorities through their health networks. In contrast, digital surveillance can provide real-time detection of epidemics.”
Hu said the study used search engine algorithms such as Google Trends and Google Insights. It found that detecting the 2005–06 avian influenza outbreak “Bird Flu” would have been possible between one and two weeks earlier than official surveillance reports.
“In another example, a digital data collection network was found to be able to detect the SARS outbreak more than two months before the first publications by the World Health Organization (WHO),” Hu said.
[Figure caption: According to this week’s CDC FluView report published Jan. 17, 2014, influenza activity in the United States remains high overall, with 3,745 laboratory-confirmed influenza-associated hospitalizations reported since October 1, 2013 (credit: CDC)]
“Early detection means early warning and that can help reduce or contain an epidemic, as well as alert public health authorities to ensure risk management strategies such as the provision of adequate medication are implemented.”
Hu said the study found that social media including Twitter and Facebook and microblogs could also be effective in detecting disease outbreaks. “The next step would be to combine the approaches currently available such as social media, aggregator websites, and search engines, along with other factors such as climate and temperature, and develop a real-time infectious disease predictor.”
“The international nature of emerging infectious diseases combined with the globalization of travel and trade, have increased the interconnectedness of all countries and that means detecting, monitoring and controlling these diseases is a global concern.”
The other authors of the paper were Gabriel Milinovich (first author), Gail Williams and Archie Clements from the University of Queensland School of Population, Health and State.
Another powerful tool is Supramap, a web application that synthesizes large, diverse datasets so that researchers can better understand the spread of infectious diseases across hosts and geography by integrating genetic, evolutionary, geospatial, and temporal data. It is now open-source — create your own maps here.
Associate Professor Daniel Janies, Ph.D., an expert in computational genomics at the Wexner Medical Center at The Ohio State University (OSU), worked with software engineers at the Ohio Supercomputer Center (OSC) to allow researchers and public safety officials to develop other front-end applications that draw on the logic and computing resources of Supramap.
It was originally developed in 2007 to track the spread and evolution of pandemic (H1N1) and avian influenza (H5N1).
“Using SUPRAMAP, we initially developed maps that illustrated the spread of drug-resistant influenza and host shifts in H1N1 and H5N1 influenza and in coronaviruses, such as SARS,” said Janies. “SUPRAMAP allows the user to track strains carrying key mutations in a geospatial browser such as Google Earth. Our software allows public health scientists to update and view maps on the evolution and spread of pathogens.”
Grant funding through the U.S. Army Research Laboratory and Office supports this Innovation Group on Global Infectious Disease Research project. Support for the computational requirements of the project comes from the American Museum of Natural History (AMNH) and OSC. Ohio State’s Wexner Medical Center, Department of Biomedical Informatics and offices of Academic Affairs and Research provide additional support.”
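The claim in the excerpt above that Internet-based signals can lead official surveillance by about two weeks can be illustrated with a lagged-correlation check. This is a minimal sketch on synthetic weekly counts, not the method or data of the Lancet Infectious Diseases study; the helper names `pearson` and `best_lead` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_lead(search, official, max_lead):
    """Find the lead (in weeks) at which search[t] best matches official[t + lead]."""
    best_weeks, best_r = 0, -1.0
    for lead in range(max_lead + 1):
        # Pair each search value with the official count `lead` weeks later.
        r = pearson(search[: len(search) - lead], official[lead:])
        if r > best_r:
            best_weeks, best_r = lead, r
    return best_weeks, best_r


# Synthetic data: official counts follow the search signal two weeks later.
official = [0, 0, 1, 2, 5, 9, 14, 9, 5, 2, 1, 0, 0, 0]
search = official[2:] + [0, 0]

lead_weeks, correlation = best_lead(search, official, max_lead=4)
print(f"search signal leads official reports by {lead_weeks} week(s), r = {correlation:.2f}")
```

Real search data is far noisier than this toy series, so an estimated lead time would need validation against several outbreaks before being trusted.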

How should we analyse our lives?

Gillian Tett in the Financial Times on the challenge of using the new form of data science: “A few years ago, Alex “Sandy” Pentland, a professor of computational social sciences at MIT Media Lab, conducted a curious experiment at a Bank of America call centre in Rhode Island. He fitted 80 employees with biometric devices to track all their movements, physical conversations and email interactions for six weeks, and then used a computer to analyse “some 10 gigabytes of behaviour data”, as he recalls.
The results showed that the workers were isolated from each other, partly because at this call centre, like others of its ilk, the staff took their breaks in rotation so that the phones were constantly manned. In response, Bank of America decided to change its system to enable staff to hang out together over coffee and swap ideas in an unstructured way. Almost immediately there was a dramatic improvement in performance. “The average call-handle time decreased sharply, which means that the employees were much more productive,” Pentland writes in his forthcoming book Social Physics. “[So] the call centre management staff converted the break structure of all their call centres to this new system and forecast a $15m per year productivity increase.”
When I first heard Pentland relate this tale, I was tempted to give a loud cheer on behalf of all long-suffering call centre staff and corporate drones. Pentland’s data essentially give credibility to a point that many people know instinctively: that it is horribly dispiriting – and unproductive – to have to toil in a tiny isolated cubicle by yourself all day. Bank of America deserves credit both for letting Pentland’s team engage in this people-watching – and for changing its coffee-break schedule in response.
But there is a bigger issue at stake here too: namely how academics such as Pentland analyse our lives. We have known for centuries that cultural and social dynamics influence how we behave but until now academics could usually only measure this by looking at micro-level data, which were often subjective. Anthropology (a discipline I know well) is a case in point: anthropologists typically study cultures by painstakingly observing small groups of people and then extrapolating this in a subjective manner.

Pentland and others like him are now convinced that the great academic divide between “hard” and “soft” sciences is set to disappear, since researchers these days can gather massive volumes of data about human behaviour with precision. Sometimes this information is volunteered by individuals, on sites such as Facebook; sometimes it can be gathered from the electronic traces – the “digital breadcrumbs” – that we all deposit (when we use a mobile phone, say) or deliberately collected with biometric devices like the ones used at Bank of America. Either way, it can enable academics to monitor and forecast social interaction in a manner we could never have dreamed of before. “Social physics helps us understand how ideas flow from person to person . . . and ends up shaping the norms, productivity and creative output of our companies, cities and societies,” writes Pentland. “Just as the goal of traditional physics is to understand how the flow of energy translates into change in motion, social physics seeks to understand how the flow of ideas and information translates into changes in behaviour….”

But perhaps the most important point is this: whether you love or hate this new form of data science, the genie cannot be put back in the bottle. The experiments that Pentland and many others are conducting at call centres, offices and other institutions across America are simply the leading edge of a trend.

The only question now is whether these powerful new tools will be mostly used for good (to predict traffic queues or flu epidemics) or for more malevolent ends (to enable companies to flog needless goods, say, or for government control). Sadly, “social physics” and data crunching don’t offer any prediction on this issue, even though it is one of the dominant questions of our age.”

Mapping the Data Shadows of Hurricane Sandy: Uncovering the Sociospatial Dimensions of ‘Big Data’

New Paper by Shelton, T., Poorthuis, A., Graham, M., and Zook, M.: “Digital social data are now practically ubiquitous, with increasingly large and interconnected databases leading researchers, politicians, and the private sector to focus on how such ‘big data’ can allow potentially unprecedented insights into our world. This paper investigates Twitter activity in the wake of Hurricane Sandy in order to demonstrate the complex relationship between the material world and its digital representations. Through documenting the various spatial patterns of Sandy-related tweeting both within the New York metropolitan region and across the United States, we make a series of broader conceptual and methodological interventions into the nascent geographic literature on big data. Rather than focus on how these massive databases are causing necessary and irreversible shifts in the ways that knowledge is produced, we instead find it more productive to ask how small subsets of big data, especially georeferenced social media information scraped from the internet, can reveal the geographies of a range of social processes and practices. Utilizing both qualitative and quantitative methods, we can uncover broad spatial patterns within this data, as well as understand how this data reflects the lived experiences of the people creating it. We also seek to fill a conceptual lacuna in studies of user-generated geographic information, which have often avoided any explicit theorizing of sociospatial relations, by employing Jessop et al.’s TPSN framework. Through these interventions, we demonstrate that any analysis of user-generated geographic information must take into account the existence of more complex spatialities than the relatively simple spatial ontology implied by latitude and longitude coordinates.”