Using Big Data to Ask Big Questions


Chase Davis in the SOURCE: “First, let’s dispense with the buzzwords. Big Data isn’t what you think it is: Every federal campaign contribution over the last 30-plus years amounts to several tens of millions of records. That’s not Big. Neither is a dataset of 50 million Medicare records. Or even 260 gigabytes of files related to offshore tax havens—at least not when Google counts its data in exabytes. No, the stuff we analyze in pursuit of journalism and app-building is downright tiny by comparison.
But you know what? That’s ok. Because while super-smart Silicon Valley PhDs are busy helping Facebook crunch through petabytes of user data, they’re also throwing off intellectual exhaust that we can benefit from in the journalism and civic data communities. Most notably: the ability to ask Big Questions.
Most of us who analyze public data for fun and profit are familiar with small questions. They’re focused, incisive, and often have the kind of black-and-white, definitive answers that end up in news stories: How much money did Barack Obama raise in 2012? Is the murder rate in my town going up or down?
Big Questions, on the other hand, are speculative, exploratory, and systemic. As the name implies, they are also answered at scale: Rather than distilling a small slice of a dataset into a concrete answer, Big Questions look at entire datasets and reveal small questions you wouldn’t have thought to ask.
Can we track individual campaign donor behavior over decades, and what does that tell us about their influence in politics? Which neighborhoods in my city are experiencing spikes in crime this week, and are police changing patrols accordingly?
Or, by way of example, how often do interest groups propose cookie-cutter bills in state legislatures?

Looking at Legislation

Even if you don’t follow politics, you probably won’t be shocked to learn that lawmakers don’t always write their own bills. In fact, interest groups sometimes write them word-for-word.
Sometimes those groups even try to push their bills in multiple states. The conservative American Legislative Exchange Council has gotten some press, but liberal groups, social and business interests, and even sororities and fraternities have done it too.
On its face, something about elected officials signing their names to cookie-cutter bills runs head-first against people’s ideal of deliberative Democracy—hence, it tends to make news. Those can be great stories, but they’re often limited in scope to a particular bill, politician, or interest group. They’re based on small questions.
Data science lets us expand our scope. Rather than focusing on one bill, or one interest group, or one state, why not ask: How many model bills were introduced in all 50 states, period, by anyone, during the last legislative session? No matter what they’re about. No matter who introduced them. No matter where they were introduced.
Now that’s a Big Question. And with some basic data science, it’s not particularly hard to answer—at least at a superficial level.

Analyze All the Things!

Just for kicks, I tried building a system to answer this question earlier this year. It was intended as an example, so I tried to choose methods that would make intuitive sense. But it also makes liberal use of techniques applied often to Big Data analysis: k-means clustering, matrices, graphs, and the like.
If you want to follow along, the code is here….
To make exploration a little easier, my code represents similar bills in graph space, shown at the top of this article. Each dot (known as a node) represents a bill. And a line connecting two bills (known as an edge) means they were sufficiently similar, according to my criteria (a cosine similarity of 0.75 or above). Thrown into a visualization software like Gephi, it’s easy to click around the clusters and see what pops out. So what do we find?
There are 375 clusters in total. Because of the limitations of our data, many of them represent vague, subject-specific bills that just happen to have similar titles even though the legislation itself is probably very different (think things like “Budget Bill” and “Campaign Finance Reform”). This is where having full bill text would come in handy.
But mixed in with those bills are a handful of interesting nuggets. Several bills that appear to be modeled after legislation by the National Conference of Insurance Legislators appear in multiple states, among them: a bill related to limited lines travel insurance; another related to unclaimed insurance benefits; and one related to certificates of insurance.”
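The graph described in the excerpt (bills as nodes, an edge wherever cosine similarity is 0.75 or above) can be sketched in a few lines. This is a minimal illustration, not Davis's code: the bill IDs and titles are invented, the vectors are plain word counts from titles, and clusters are taken to be connected components of the graph, which is an assumption rather than the method the article used.

```python
import math
from collections import Counter
from itertools import combinations


def tf_vector(title):
    """Bag-of-words term counts for a lowercased title."""
    return Counter(title.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_edges(titles, threshold=0.75):
    """Return (bill, bill) pairs whose titles meet the similarity threshold."""
    vectors = {bill_id: tf_vector(text) for bill_id, text in titles.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(vectors), 2)
        if cosine(vectors[a], vectors[b]) >= threshold
    ]


def clusters(nodes, edges):
    """Connected components containing at least two bills (assumed cluster definition)."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) > 1:
            result.append(sorted(component))
    return result


# Hypothetical bill titles, made up for illustration only.
TITLES = {
    "AZ-101": "An act relating to limited lines travel insurance",
    "TX-22": "An act relating to limited lines travel insurance producers",
    "OH-7": "Budget bill",
    "NY-300": "An act concerning unclaimed life insurance benefits",
}

edges = similarity_edges(TITLES)
print(edges)
print(clusters(TITLES, edges))
```

Because the vectors come from titles alone, two unrelated bills with the same generic title would also be linked, which is the limitation the excerpt points out.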

Commons at the Intersection of Peer Production, Citizen Science, and Big Data: Galaxy Zoo


New paper by Michael J. Madison: “The knowledge commons research framework is applied to a case of commons governance grounded in research in modern astronomy. The case, Galaxy Zoo, is a leading example of at least three different contemporary phenomena. In the first place Galaxy Zoo is a global citizen science project, in which volunteer non-scientists have been recruited to participate in large-scale data analysis via the Internet. In the second place Galaxy Zoo is a highly successful example of peer production, sometimes known colloquially as crowdsourcing, by which data are gathered, supplied, and/or analyzed by very large numbers of anonymous and pseudonymous contributors to an enterprise that is centrally coordinated or managed. In the third place Galaxy Zoo is a highly visible example of data-intensive science, sometimes referred to as e-science or Big Data science, by which scientific researchers develop methods to grapple with the massive volumes of digital data now available to them via modern sensing and imaging technologies. This chapter synthesizes these three perspectives on Galaxy Zoo via the knowledge commons framework.”

Defense Against National Vulnerabilities in Public Data


DOD/DARPA Notice (See also Foreign Policy article): “OBJECTIVE: Investigate the national security threat posed by public data available either for purchase or through open sources. Based on principles of data science, develop tools to characterize and assess the nature, persistence, and quality of the data. Develop tools for the rapid anonymization and de-anonymization of data sources. Develop framework and tools to measure the national security impact of public data and to defend against the malicious use of public data against national interests.
DESCRIPTION: The vulnerabilities to individuals from a data compromise are well known and documented now as “identity theft.” These include regular stories published in the news and research journals documenting the loss of personally identifiable information by corporations and governments around the world. Current trends in social media and commerce, with voluntary disclosure of personal information, create other potential vulnerabilities for individuals participating heavily in the digital world. The Netflix Challenge in 2009 was launched with the goal of creating better customer pick prediction algorithms for the movie service [1]. An unintended consequence of the Netflix Challenge was the discovery that it was possible to de-anonymize the entire contest data set with very little additional data. This de-anonymization led to a federal lawsuit and the cancellation of the sequel challenge [2]. The purpose of this topic is to understand the national level vulnerabilities that may be exploited through the use of public data available in the open or for purchase.
Could a modestly funded group deliver nation-state type effects using only public data?…”
The official link for this solicitation is: www.acq.osd.mil/osbp/sbir/solicitations/sbir20133.
 

Data Science for Social Good


Data Science for Social Good: “By analyzing data from police reports to website clicks to sensor signals, governments are starting to spot problems in real-time and design programs to maximize impact. More nonprofits are measuring whether or not they’re helping people, and experimenting to find interventions that work.
None of this is inevitable, however.
We’re just realizing the potential of using data for social impact and face several hurdles to its widespread adoption:

  • Most governments and nonprofits simply don’t know what’s possible yet. They have data – but often not enough and maybe not the right kind.
  • There are too few data scientists out there – and too many spending their days optimizing ads instead of bettering lives.

To make an impact, we need to show social good organizations the power of data and analytics. We need to work on analytics projects that have high social impact. And we need to expose data scientists to the problems that really matter.

The fellowship

That’s exactly why we’re doing the Eric and Wendy Schmidt Data Science for Social Good summer fellowship at the University of Chicago.
We want to bring three dozen aspiring data scientists to Chicago, and have them work on data science projects with social impact.
Working closely with governments and nonprofits, fellows will take on real-world problems in education, health, energy, transportation, and more.
Over the next three months, they’ll apply their coding, machine learning, and quantitative skills, collaborate in a fast-paced atmosphere, and learn from mentors in industry, academia, and the Obama campaign.
The program is led by a strong interdisciplinary team from the Computation Institute and the Harris School of Public Policy at the University of Chicago.”

Analyzing the Analyzers


An Introspective Survey of Data Scientists and Their Work, by Harlan Harris, Sean Murphy, Marck Vaisman: “There has been intense excitement in recent years around activities labeled “data science,” “big data,” and “analytics.” However, the lack of clarity around these terms and, particularly, around the skill sets and capabilities of their practitioners has led to inefficient communication between “data scientists” and the organizations requiring their services. This lack of clarity has frequently led to missed opportunities. To address this issue, we surveyed several hundred practitioners via the Web to explore the varieties of skills, experiences, and viewpoints in the emerging data science community.

We used dimensionality reduction techniques to divide potential data scientists into five categories based on their self-ranked skill sets (Statistics, Math/Operations Research, Business, Programming, and Machine Learning/Big Data), and four categories based on their self-identification (Data Researchers, Data Businesspeople, Data Engineers, and Data Creatives). Further examining the respondents based on their division into these categories provided additional insights into the types of professional activities, educational background, and even scale of data used by different types of Data Scientists.
In this report, we combine our results with insights and data from others to provide a better understanding of the diversity of practitioners, and to argue for the value of clearer communication around roles, teams, and careers.”

First, they gave us targeted ads. Now, data scientists think they can change the world


in Gigaom: “The best minds of my generation are thinking about how to make people click ads … That sucks.” – Jeff Hammerbacher, co-founder and chief scientist, Cloudera
Well, something has to pay the bills. Thankfully, there’s also a sweeping trend in the data science world right now around bringing those skills to bear on some really meaningful problems, …
We’ve already covered some of these efforts, including the SumAll Foundation’s work on modern-day slavery and future work on child pornography. Closely related is the effort — led by Google.org’s deep pockets — to create an international hotline network for reporting human trafficking and collecting data. Microsoft, in particular Microsoft Research’s danah boyd, has been active in helping fight child exploitation using technology.
This week, I came across two new efforts on different ends of the spectrum. One is ActivityInfo, which describes itself on its website as “an online humanitarian project monitoring tool” — developed by Unicef and a consulting firm called BeDataDriven — that “helps humanitarian organizations to collect, manage, map and analyze indicators….
The other effort I came across is DataKind, specifically its work helping the New York City Department of Parks and Recreation, or NYC Parks, quantify the benefits of a strategic tree-pruning program. Founded by renowned data scientists Drew Conway and Jake Porway (who’s also the host of the National Geographic channel’s The Numbers Game), DataKind exists for the sole purpose of helping non-profit organizations and small government agencies solve their most-pressing data problems.”

Data Edge


Steven Weber, professor in the School of Information and Political Science department at UC Berkeley, in Policy by the Numbers: “It’s commonly said that most people overestimate the impact of technology in the short term, and underestimate its impact over the longer term.
Where is Big Data in 2013? Starting to get very real, in our view, and right on the cusp of underestimation in the long term. The short term hype cycle is (thankfully) burning itself out, and the profound changes that data science can and will bring to human life are just now coming into focus. It may be that Data Science is right now about where the Internet itself was in 1993 or so. That’s roughly when it became clear that the World Wide Web was a wind that would blow across just about every sector of the modern economy while transforming foundational things we thought were locked in about human relationships, politics, and social change. It’s becoming a reasonable bet that Data Science is set to do the same—again, and perhaps even more profoundly—over the next decade. Just possibly, more quickly than that….
Can data, no matter how big, change the world for the better? It may be the case that in some fields of human endeavor and behavior, the scientific analysis of big data by itself will create such powerful insights that change will simply have to happen, that businesses will deftly re-organize, that health care will remake itself for efficiency and better outcomes, that people will adopt new behaviors that make them happier, healthier, more prosperous and peaceful. Maybe. But almost everything we know about technology and society across human history argues that it won’t be so straightforward.
…join senior industry and academic leaders at DataEDGE at UC Berkeley on May 30-31 to engage in what will be a lively and important conversation aimed at answering today’s questions about the data science revolution—and formulating tomorrow’s.”

Is Privacy Algorithmically Impossible?


MIT Technology Review: “In 1995, the European Union introduced privacy legislation that defined “personal data” as any information that could identify a person, directly or indirectly. The legislators were apparently thinking of things like documents with an identification number, and they wanted them protected just as if they carried your name.
Today, that definition encompasses far more information than those European legislators could ever have imagined—easily more than all the bits and bytes in the entire world when they wrote their law 18 years ago.
Here’s what happened. First, the amount of data created each year has grown exponentially (see figure)…
Much of this data is invisible to people and seems impersonal. But it’s not. What modern data science is finding is that nearly any type of data can be used, much like a fingerprint, to identify the person who created it: your choice of movies on Netflix, the location signals emitted by your cell phone, even your pattern of walking as recorded by a surveillance camera. In effect, the more data there is, the less any of it can be said to be private. We are coming to the point that if the commercial incentives to mine the data are in place, anonymity of any kind may be “algorithmically impossible,” says Princeton University computer scientist Arvind Narayanan.”
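The "fingerprint" idea in the excerpt can be illustrated with a toy linkage sketch: an anonymized record is matched to a named public profile when the two share enough items. This is a simplified illustration under stated assumptions, not the method used in any published de-anonymization study. The Jaccard overlap score, the 0.5 threshold, the function names, and all records below are hypothetical.

```python
def overlap_score(anon_items, public_items):
    """Jaccard similarity between two sets of items (assumed scoring rule)."""
    union = anon_items | public_items
    if not union:
        return 0.0
    return len(anon_items & public_items) / len(union)


def link_records(anonymous, public, min_score=0.5):
    """Map each anonymized ID to a public name when one match clearly stands out.

    A link is made only if the best score reaches min_score and strictly
    beats the runner-up, so ambiguous records stay unlinked.
    """
    links = {}
    for anon_id, items in anonymous.items():
        scored = sorted(
            ((overlap_score(items, pub_items), name)
             for name, pub_items in public.items()),
            reverse=True,
        )
        if not scored:
            continue
        best_score, best_name = scored[0]
        runner_up = scored[1][0] if len(scored) > 1 else 0.0
        if best_score >= min_score and best_score > runner_up:
            links[anon_id] = best_name
    return links


# Made-up data: anonymized viewing histories and public reviews under real names.
ANONYMOUS = {
    "user_0417": {"Brazil", "Heat", "Solaris", "Alien"},
    "user_0932": {"Heat", "Titanic"},
}
PUBLIC = {
    "alice": {"Brazil", "Solaris", "Alien"},
    "bob": {"Titanic", "Grease", "Big"},
}

print(link_records(ANONYMOUS, PUBLIC))
```

Even in this toy case, three shared titles are enough to single out one record, while a record with little overlap stays unlinked. More auxiliary data makes more records linkable.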