Using Big Data to Ask Big Questions

Chase Davis in the SOURCE: “First, let’s dispense with the buzzwords. Big Data isn’t what you think it is: Every federal campaign contribution over the last 30-plus years amounts to several tens of millions of records. That’s not Big. Neither is a dataset of 50 million Medicare records. Or even 260 gigabytes of files related to offshore tax havens—at least not when Google counts its data in exabytes. No, the stuff we analyze in pursuit of journalism and app-building is downright tiny by comparison.
But you know what? That’s ok. Because while super-smart Silicon Valley PhDs are busy helping Facebook crunch through petabytes of user data, they’re also throwing off intellectual exhaust that we can benefit from in the journalism and civic data communities. Most notably: the ability to ask Big Questions.
Most of us who analyze public data for fun and profit are familiar with small questions. They’re focused, incisive, and often have the kind of black-and-white, definitive answers that end up in news stories: How much money did Barack Obama raise in 2012? Is the murder rate in my town going up or down?
Big Questions, on the other hand, are speculative, exploratory, and systemic. As the name implies, they are also answered at scale: Rather than distilling a small slice of a dataset into a concrete answer, Big Questions look at entire datasets and reveal small questions you wouldn’t have thought to ask.
Can we track individual campaign donor behavior over decades, and what does that tell us about their influence in politics? Which neighborhoods in my city are experiencing spikes in crime this week, and are police changing patrols accordingly?
Or, by way of example, how often do interest groups propose cookie-cutter bills in state legislatures?

Looking at Legislation

Even if you don’t follow politics, you probably won’t be shocked to learn that lawmakers don’t always write their own bills. In fact, interest groups sometimes write them word-for-word.
Sometimes those groups even try to push their bills in multiple states. The conservative American Legislative Exchange Council has gotten some press, but liberal groups, social and business interests, and even sororities and fraternities have done it too.
On its face, something about elected officials signing their names to cookie-cutter bills runs head-first against people’s ideal of deliberative Democracy—hence, it tends to make news. Those can be great stories, but they’re often limited in scope to a particular bill, politician, or interest group. They’re based on small questions.
Data science lets us expand our scope. Rather than focusing on one bill, or one interest group, or one state, why not ask: How many model bills were introduced in all 50 states, period, by anyone, during the last legislative session? No matter what they’re about. No matter who introduced them. No matter where they were introduced.
Now that’s a Big Question. And with some basic data science, it’s not particularly hard to answer—at least at a superficial level.

Analyze All the Things!

Just for kicks, I tried building a system to answer this question earlier this year. It was intended as an example, so I tried to choose methods that would make intuitive sense. But it also makes liberal use of techniques applied often to Big Data analysis: k-means clustering, matrices, graphs, and the like.
If you want to follow along, the code is here….
To make exploration a little easier, my code represents similar bills in graph space, shown at the top of this article. Each dot (known as a node) represents a bill. And a line connecting two bills (known as an edge) means they were sufficiently similar, according to my criteria (a cosine similarity of 0.75 or above). Thrown into visualization software like Gephi, it’s easy to click around the clusters and see what pops out. So what do we find?
There are 375 clusters in total. Because of the limitations of our data, many of them represent vague, subject-specific bills that just happen to have similar titles even though the legislation itself is probably very different (think things like “Budget Bill” and “Campaign Finance Reform”). This is where having full bill text would come in handy.
But mixed in with those bills are a handful of interesting nuggets. Several bills that appear to be modeled after legislation by the National Conference of Insurance Legislators appear in multiple states, among them: a bill related to limited lines travel insurance; another related to unclaimed insurance benefits; and one related to certificates of insurance.”
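
The graph construction described above (each bill a node, with an edge wherever cosine similarity is 0.75 or above) might be sketched roughly as follows. This is a minimal illustration, not the author's code: the bag-of-words vectors built from titles, the use of connected components as "clusters", the helper names, and the sample titles are all assumptions made for the example. Only the 0.75 cutoff comes from the excerpt.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations

THRESHOLD = 0.75  # similarity cutoff quoted in the excerpt


def vectorize(title):
    """Turn a bill title into a bag-of-words term-count vector (assumed scheme)."""
    return Counter(re.findall(r"[a-z]+", title.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(titles, threshold=THRESHOLD):
    """Return index pairs (i, j) whose titles meet the similarity threshold."""
    vectors = [vectorize(t) for t in titles]
    return [
        (i, j)
        for i, j in combinations(range(len(titles)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


def clusters(n_nodes, edges):
    """Group connected nodes together; bills with no edges are left out."""
    neighbors = defaultdict(set)
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    seen, result = set(), []
    for start in range(n_nodes):
        if start in seen or start not in neighbors:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


# Hypothetical titles, invented purely for illustration.
titles = [
    "Limited Lines Travel Insurance Act",
    "An Act Relating to Limited Lines Travel Insurance",
    "Unclaimed Life Insurance Benefits Act",
    "Budget Bill",
]
edges = build_edges(titles)
groups = clusters(len(titles), edges)
```

With these sample titles, only the first two bills are linked. Short, generic titles would match each other just as easily, which is the limitation of title-only data that the excerpt points out.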

The Shutdown’s Data Blackout

Opinion piece by Katherine G. Abraham and John Haltiwanger in The New York Times: “Today, for the first time since 1996 and only the second time in modern memory, the Bureau of Labor Statistics will not issue its monthly jobs report, as a result of the shutdown of nonessential government services. This raises an important question: Are the B.L.S. report and other economic data that the government provides “nonessential”?

If we’re trying to understand how much damage the shutdown or sequestration cuts are doing to jobs or the fragile economic recovery, they are definitely essential. Without robust economic data from the federal government, we can speculate, but we won’t really know.

In the last two shutdowns, in 1995 and 1996, the Congressional Budget Office estimated the economic damage at around 0.5 percent of the gross domestic product. This time, Moody’s estimates that a three-to-four-week shutdown might subtract 1.4 percent (annualized) from gross domestic product growth this quarter and take $55 billion out of the economy. Democrats tend to play up such projections; Republicans tend to play them down. If the shutdown continues, though, we’ll all be less able to tell what impact it is having, because more reports like the B.L.S. jobs report will be delayed, while others may never be issued.

In fact, sequestration cuts that affected 2013 budgets are already leading federal statistics agencies to defer or discontinue dozens of reports on everything from income to overseas labor costs. The economic data these agencies produce are key to tracking G.D.P., earnings and jobs, and to informing the Federal Reserve, the executive branch and Congress on the state of the economy and the impact of economic policies. The data are also critical for decisions made by state and local policy makers, businesses and households.

The combined budget for all the federal statistics agencies totals less than 0.1 percent of the federal budget. Yet the same across-the-board-cut mentality that led to sequester and shutdown has shortsightedly cut statistics agencies, too, as if there were something “nonessential” about spending money on accurately assessing the economic effects of government actions and inactions. As a result, as we move through the shutdown, the debt-ceiling fight and beyond, reliable, essential data on the impact of policy decisions will be harder to come by.

Unless the sequester cuts are reversed, funding for economic data will shrink further in 2014, on top of a string of lean budget years. More data reports will be eliminated at the B.L.S., the Census Bureau, the Bureau of Economic Analysis and other agencies. Even more insidious damage will come from compromising the methods for producing the reports that still are paid for and from failing to prepare for the future.

To save money, survey sample sizes will be cut, reducing the reliability of national data and undermining local statistics. Fewer resources will be devoted to maintaining the listings used to draw business survey samples, running the risk that surveys based on those listings won’t do as good a job of capturing actual economic conditions. Hiring and training will be curtailed. Over time, the availability and quality of economic indicators will diminish.

That would be especially paradoxical and backward at a time when economic statistics can and should be advancing through technological innovation instead of being marched backward by politics. Integrating survey data, administrative data and commercial data collected with scanners and other digital technologies could produce richer, more useful information with less of a burden on businesses and households.

Now more than ever, framing sound economic policy depends on timely and accurate information about the economy. Bad or ill-targeted data can lead to bad or ill-targeted decisions about taxes and spending. The tighter the budget and the more contentious the political debate around it, the more compelling the argument for investing in federal data that accurately show how government policies are affecting the economy, so we can target the most effective cuts or spending or other policies, and make ourselves accountable for their results. That’s why Congress should restore funding to the federal statistical agencies at a level that allows them to carry out their critical work.”

Commons at the Intersection of Peer Production, Citizen Science, and Big Data: Galaxy Zoo

New paper by Michael J. Madison: “The knowledge commons research framework is applied to a case of commons governance grounded in research in modern astronomy. The case, Galaxy Zoo, is a leading example of at least three different contemporary phenomena. In the first place Galaxy Zoo is a global citizen science project, in which volunteer non-scientists have been recruited to participate in large-scale data analysis via the Internet. In the second place Galaxy Zoo is a highly successful example of peer production, sometimes known colloquially as crowdsourcing, by which data are gathered, supplied, and/or analyzed by very large numbers of anonymous and pseudonymous contributors to an enterprise that is centrally coordinated or managed. In the third place Galaxy Zoo is a highly visible example of data-intensive science, sometimes referred to as e-science or Big Data science, by which scientific researchers develop methods to grapple with the massive volumes of digital data now available to them via modern sensing and imaging technologies. This chapter synthesizes these three perspectives on Galaxy Zoo via the knowledge commons framework.”

Are Some Tweets More Interesting Than Others? #HardQuestion

New paper by Microsoft Research (Omar Alonso, Catherine C. Marshall, and Marc Najork): “Twitter has evolved into a significant communication nexus, coupling personal and highly contextual utterances with local news, memes, celebrity gossip, headlines, and other microblogging subgenres. If we take Twitter as a large and varied dynamic collection, how can we predict which tweets will be interesting to a broad audience in advance of lagging social indicators of interest such as retweets? The telegraphic form of tweets, coupled with the subjective notion of interestingness, makes it difficult for human judges to agree on which tweets are indeed interesting.
In this paper, we address two questions: Can we develop a reliable strategy that results in high-quality labels for a collection of tweets, and can we use this labeled collection to predict a tweet’s interestingness?
To answer the first question, we performed a series of studies using crowdsourcing to reach a diverse set of workers who served as a proxy for an audience with variable interests and perspectives. This method allowed us to explore different labeling strategies, including varying the judges, the labels they applied, the datasets, and other aspects of the task.
To address the second question, we used crowdsourcing to assemble a set of tweets rated as interesting or not; we scored these tweets using textual and contextual features; and we used these scores as inputs to a binary classifier. We were able to achieve moderate agreement (kappa = 0.52) between the best classifier and the human assessments, a figure which reflects the challenges of the judgment task.”
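
For readers unfamiliar with the agreement figure quoted above, the sketch below shows how a kappa statistic can be computed for two sets of binary labels. It assumes Cohen's kappa, since the excerpt does not say which variant the authors used. The function name and the sample labels are hypothetical and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels (1 = interesting, 0 = not), invented for illustration.
human = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
classifier = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(human, classifier)
```

In this made-up example the two label sets match on 8 of 10 items, chance agreement is 0.5, and kappa comes out at about 0.6. Raw agreement of 80 percent therefore overstates how well the classifier tracks the human judges.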

Technology Can Expose Government Sins, But You Need Humans to Fix Them

Lorelei Kelly: “We can’t bring accountability to the NSA unless we figure out how to give the whole legislative branch modern methods for policy oversight. Those modern methods can include technology, but the primary requirement is figuring out how to supply Congress with unbiased subject matter experts—not just industry lobbyists or partisan think tank analysts. Why? Because trusted and available expertise inside the process of policymaking is what is missing today.
According to calculations by the Sunlight Foundation, today’s Congress is operating with about 40 percent less staff than in 1979. According to the Congressional Management Foundation, it’s also contending with at least 800 percent more incoming communications. Yet, instead of helping Congress gain insight in new ways, instead of helping it sort and filter, curate and authenticate, technology has mostly created disorganized information overload. And the information Congress receives is often sentiment, not substance. Elected leaders should pay attention to both, but need the latter for policymaking.
The result? Congress defaults to what it knows. And that means slapping a “national security” label on policy questions that instead deserve to be treated as broad public conversations about the evolution of American democracy. This is a Congress that categorizes questions about our freedoms on the Internet as “cyber security.”
What can we do? First, recognize that Congress is an obsolete and incapacitated system, and treat it as such. Technology and transparency can help modernize our legislature, but they can’t fix the system of governance.
Activists, even tech-savvy ones, need to talk directly with Congressional members and staff at home. Hackers, you should invite your representatives to wherever you do your hacking. And then offer your skills to help them in any way possible. You may create some great data maps and visualization tools, but the real point is to make friends in Congress. There’s no substitute for repeated conversations, and long-haul engagement. In politics, relationships will leverage the technology. All technology can do is help you find one another.
Without our help and our knowledge, our elected leaders and governing institutions won’t have the bandwidth to cope with our complex world. This will be a steep climb. But, like nearly every good outcome in politics, the climb starts with an outstretched hand, not one that’s poised at a keyboard, ready to tweet.”

How to Make All Apps More Civic

Nick Grossman in Idea Lab: “The big idea in all of this is that through open data and standards and API-based interoperability, it’s possible not just to build more “civic apps,” but to make all apps more civic:
So in a perfect world, I’d not only be able to get my transit information from anywhere (say, Citymapper), I’d be able to read restaurant inspection data from anywhere (say, Foursquare), be able to submit a 311 request from anywhere (say, Twitter), etc.
These examples only scratch the surface of how apps can “become more civic” (i.e., integrate with government/civic information and services). And that’s only really describing one direction: apps tapping into government information and services.
Another, even more powerful direction is the reverse: helping governments tap into the people-power in web networks. In fact, I heard an amazing stat earlier this year:
It’s incredible to think about how web-enabled networks can extend the reach and increase the leverage of public-interest programs and government services, even when (perhaps especially when) that is not their primary function — i.e., Waze is a traffic avoidance app, not a “civic” app. Other examples include the Airbnb community coming together to provide emergency housing after Sandy, and the Etsy community helping to “craft a comeback” in Rockford, Ill.
In other words, helping all apps “be more civic,” rather than just building more civic apps. I think there is a ton of leverage there, and it’s a direction that has just barely begun to be explored.”

Defining Open Data

Open Knowledge Foundation Blog: “Open data is data that can be freely used, shared and built-on by anyone, anywhere, for any purpose. This is the summary of the full Open Definition which the Open Knowledge Foundation created in 2005 to provide both a succinct explanation and a detailed definition of open data.
As the open data movement grows, and even more governments and organisations sign up to open data, it becomes ever more important that there is a clear and agreed definition for what “open data” means if we are to realise the full benefits of openness, and avoid the risks of creating incompatibility between projects and splintering the community.

Open can apply to information from any source and about any topic. Anyone can release their data under an open licence for free use by and benefit to the public. Although we may think mostly about government and public sector bodies releasing public information such as budgets or maps, or researchers sharing their results data and publications, any organisation can open information (corporations, universities, NGOs, startups, charities, community groups and individuals).

Read more about different kinds of data in our one-page introduction to open data.
There is open information in transport, science, products, education, sustainability, maps, legislation, libraries, economics, culture, development, business, design, finance …. So the explanation of what open means applies to all of these information sources and types. Open may also apply both to data – big data and small data – or to content, like images, text and music!
So here we set out clearly what open means, and why this agreed definition is vital for us to collaborate, share and scale as open data and open content grow and reach new communities.

What is Open?

The full Open Definition provides a precise definition of what open data is. There are 2 important elements to openness:

  • Legal openness: you must be allowed to get the data legally, to build on it, and to share it. Legal openness is usually provided by applying an appropriate (open) license which allows for free access to and reuse of the data, or by placing data into the public domain.
  • Technical openness: there should be no technical barriers to using that data. For example, providing data as printouts on paper (or as tables in PDF documents) makes the information extremely difficult to work with. So the Open Definition has various requirements for “technical openness,” such as requiring that data be machine readable and available in bulk.”…
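
To make the "machine readable" requirement concrete, here is a small sketch of what working with data published as CSV can look like. The dataset, column names, and function name are hypothetical and invented for illustration. The point is only that a structured format can be loaded and analysed in a few lines, whereas the same table on paper or in a PDF would first have to be retyped or extracted.

```python
import csv
import io

# Hypothetical open dataset, invented for illustration only.
RAW_CSV = """department,year,budget_usd
Parks,2013,1200000
Libraries,2013,850000
Transit,2013,4300000
"""


def total_budget(csv_text):
    """Sum the budget_usd column of a CSV document supplied as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row["budget_usd"]) for row in reader)


total = total_budget(RAW_CSV)
```

In practice the text would come from a bulk download rather than an inline string, but the parsing step would look much the same.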

New crowdsourcing platform links tech-skilled volunteers with charities

Charity Digital News: “The Atlassian Foundation today previewed its innovative crowdsourcing platform, which will allow nonprofits to coordinate with technically-skilled volunteers who want to help convert ideas into successful projects…
Once vetted, nonprofits will be able to list their volunteer jobs on the site. Skilled volunteers such as developers, designers, business analysts and project managers will then be able to go online and quickly search the site for opportunities relevant and convenient to them.
Atlassian Foundation manager, Melissa Beaumont Lee, said: “We started hearing from nonprofits that what they valued even more than donations was access to Atlassian’s technology expertise. Similarly, we had lots of employees who were keen to volunteer, but didn’t know how to get involved; coordinating volunteers for all these amazing projects was just not scalable. Thus, [the platform] was born to benefit both nonprofits and volunteers. We wanted to reduce the friction in coordinating efforts so more time can be spent doing really meaningful work.”

The Science Behind Using Online Communities To Change Behavior

Sean D. Young in TechCrunch: “Although social media and online communities might have been developed for people to connect and share information, recent research shows that these technologies are really helpful in changing behaviors. My colleagues and I in the medical school, for instance, created online communities designed to improve health by getting people to do things, such as test for HIV, stop using methamphetamines, and just de-stress and relax. We don’t handpick people to join because we think they’ll love the technology; that’s not how science works. We invite them because the technology is relevant to them — they’re engaging in drugs, sex and other behaviors that might put themselves and others at risk. It’s our job to create the communities in a way that engages them enough to want to stay and participate. Yes, we do offer to pay them $30 to complete an hour-long survey, but then they are free to collect their money and never talk to us again. But for some reason, they stay in the group and decide to be actively engaged with strangers.
So how do we create online communities that keep people engaged and change their behaviors? Our starting point is to understand and address their psychological needs….
Throughout our research, we find that newly created online communities can change people’s behaviors by addressing the following psychological needs:
The Need to Trust. Sharing our thoughts, experiences, and difficulties with others makes us feel closer to others and increases our trust. When we trust people, we’re more open-minded, more willing to learn, and more willing to change our behavior. In our studies, we found that sharing personal information (even something as small as describing what you did today) can help increase trust and change behavior.
The Need to Fit In. Most of us inherently strive to fit in. Social norms, or other people’s attitudes and behaviors, heavily influence our own attitudes and behaviors. Each time a new online community or group forms, it creates its own set of social norms and expectations for how people should behave. Most people are willing to change their attitudes and/or behavior to fit these group norms and fit in with the community.
The Need for Self-Worth. When people feel good about themselves, they are more open to change and feel empowered to be able to change their behavior. When an online community is designed to have people support and care for each other, they can help to increase self-esteem.
The Need to Be Rewarded for Good Behavior. Anyone who has trained a puppy knows that you can get him to keep sitting as long as you keep the treats flowing to reward him, but if you want to wean him off the treats and really train him then you’ll need to begin spacing out the treats to make them less predictable. Well, people aren’t that different from animals in that way and can be trained with reinforcements too. For example, “liking” people’s communications when they immediately join a network, and then progressively spacing out the time that their posts are liked (psychologists call this variable reinforcement) can be incorporated onto social network platforms to encourage them to keep posting content. Eventually, these behaviors become habits.
The Need to Feel Empowered. While increasing self-esteem makes people feel good about themselves, increasing empowerment helps them know they have the ability to change. Creating a sense of empowerment is one of the most powerful predictors of whether people will change their behavior. Belonging to a network of people who are changing their own behaviors, support our needs, and are confident in our changing our behavior empowers us and gives us the ability to change our behavior.”

Best Practices for Government Crowdsourcing Programs

Anton Root: “Crowdsourcing helps communities connect and organize, so it makes sense that governments are increasingly making use of crowd-powered technologies and processes.
Just recently, for instance, we wrote about the Malaysian government’s initiative to crowdsource the national budget. Closer to home, we’ve seen government agencies from USAID to NASA make use of the crowd.
Daren Brabham, professor at the University of Southern California, recently published a report titled “Using Crowdsourcing In Government” that introduces readers to the basics of crowdsourcing, highlights effective use cases, and establishes best practices when it comes to governments opening up to the crowd. Below, we take a look at a few of the suggestions Brabham makes to those considering crowdsourcing.
Brabham splits up his ten best practices into three phases: planning, implementation, and post-implementation. The first suggestion in the planning phase he makes may be the most critical of all: “Clearly define the problem and solution parameters.” If the community isn’t absolutely clear on what the problem is, the ideas and solutions that users submit will be equally vague and largely useless.
This applies not only to government agencies, but also to SMEs and large enterprises making use of crowdsourcing. At Massolution NYC 2013, for instance, we heard again and again the importance of meticulously defining a problem. And open innovation platform InnoCentive’s CEO Andy Zynga stressed the big role his company plays in helping organizations do away with the “curse of knowledge.”
Brabham also has advice for projects in their implementation phase, the key bit being: “Launch a promotional plan and a plan to grow and sustain the community.” Simply put, crowdsourcing cannot work without a crowd, so it’s important to build up the community before launching a campaign. It does take some balance, however, as a community that’s too large by the time a campaign launches can turn off newcomers who “may not feel welcome or may be unsure how to become initiated into the group or taken seriously.”
Brabham’s key advice for the post-implementation phase is: “Assess the project from many angles.” The author suggests tracking website traffic patterns, asking users to volunteer information about themselves when registering, and doing original research through surveys and interviews. The results of follow-up research can help to better understand the responses submitted, and also make it easier to show the successes of the crowdsourcing campaign. This is especially important for organizations partaking in ongoing crowdsourcing efforts.”