Core Concepts: Computational social science


Adam Mann at PNAS: “Cell phone tower data predicts which parts of London can expect a spike in crime (1). Google searches for polling place information on the day of an election reveal the consequences of different voter registration laws (2). Mathematical models explain how interactions among financial investors produce better yields, and even how they generate economic bubbles (3).


Using cell-phone and taxi GPS data, researchers classified people in San Francisco into “tribal networks,” clustering them according to their behavioral patterns. Students, tourists, and businesspeople all travel through the city in various ways, congregating and socializing in different neighborhoods. Image courtesy of Alex Pentland (Massachusetts Institute of Technology, Cambridge, MA).


Where people hail from in the Mexico City area, here indicated by different colors, feeds into a crime-prediction model devised by Alex Pentland and colleagues (6). Image courtesy of Alex Pentland (Massachusetts Institute of Technology, Cambridge, MA).

 These are just a few examples of how a suite of technologies is helping bring sociology, political science, and economics into the digital age. Such social science fields have historically relied on interviews and survey data, as well as censuses and other government databases, to answer important questions about human behavior. These tools often produce results based on individuals—showing, for example, that a wealthy, well-educated, white person is statistically more likely to vote (4)—but struggle to deal with complex situations involving the interactions of many different people.

 

A growing field called “computational social science” is now using digital tools to analyze the rich and interactive lives we lead. The discipline uses powerful computer simulations of networks, data collected from cell phones and online social networks, and online experiments involving hundreds of thousands of individuals to answer questions that were previously impossible to investigate. Humans are fundamentally social creatures and these new tools and huge datasets are giving social scientists insights into exactly how connections among people create societal trends or heretofore undetected patterns, related to everything from crime to economic fortunes to political persuasions. Although the field provides powerful ways to study the world, it’s an ongoing challenge to ensure that researchers collect and store the requisite information safely, and that they and others use that information ethically….(More)”

Democracy Dashboard


The Brookings Democracy Dashboard is a collection of data designed to help users evaluate political system and governmental performance in the United States. The Democracy Dashboard displays trends in democracy and governance in seven key areas: elections administration; democratic participation and voting; public opinion; institutional functioning in the executive, legislative, and judicial branches; and media capacity.

The dashboard—and accompanying analyses on the FixGov blog—provide information that can help efforts to strengthen democracy and improve governance in the U.S.

Data will be released on a rolling basis during 2016 and expanded in future election years. Scroll through the interactive charts below to explore data points and trends in key areas for midterm and presidential elections and/or download the data in Excel format here »….(More)”

 

Yahoo Releases the Largest-ever Machine Learning Dataset for Researchers


Suju Rajan at Yahoo Labs: “Data is the lifeblood of research in machine learning. However, access to truly large-scale datasets is a privilege that has been traditionally reserved for machine learning researchers and data scientists working at large companies – and out of reach for most academic researchers.

Research scientists at Yahoo Labs have long enjoyed working on large-scale machine learning problems inspired by consumer-facing products. This has enabled us to advance the thinking in areas such as search ranking, computational advertising, information retrieval, and core machine learning. A key aspect of interest to the external research community has been the application of new algorithms and methodologies to production traffic and to large-scale datasets gathered from real products.

Today, we are proud to announce the public release of the largest-ever machine learning dataset to the research community. The dataset stands at a massive ~110B events (13.5TB uncompressed) of anonymized user-news item interaction data, collected by recording the user-news item interactions of about 20M users from February 2015 to May 2015.

The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate.

Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research. The dataset is available as part of the Yahoo Labs Webscope data-sharing program, which is a reference library of scientifically-useful datasets comprising anonymized user data for non-commercial use.

In addition to the interaction data, we are providing categorized demographic information (age range, gender, and generalized geographic data) for a subset of the anonymized users. On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article. The interaction data is timestamped with the relevant local time and also contains partial information about the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining….(More)”
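For researchers planning to work with the release, a streaming pattern is the natural fit for a 13.5TB file. The sketch below is illustrative only — the file name, file layout (JSON lines), and field names are assumptions, not the actual Webscope schema:

```python
# Sketch of iterating over a large interaction log like the one described above.
# Filename, layout, and field names are assumptions for illustration; consult
# the actual Webscope documentation for the real schema.
import gzip
import json
from collections import Counter

def iter_events(path):
    """Stream events one at a time so a multi-terabyte file never sits in memory."""
    with gzip.open(path, "rt") as f:
        for line in f:
            yield json.loads(line)

clicks_by_property = Counter()
for event in iter_events("news_feed_sample.json.gz"):  # hypothetical filename
    # Hypothetical fields: 'property' (e.g. 'news', 'sports'), 'timestamp', 'user_id'
    clicks_by_property[event.get("property", "unknown")] += 1

print(clicks_by_property.most_common(5))
```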

7 Ways Local Governments Are Getting Creative with Data Mapping


Ben Miller at GovTech:  “As government data collection expands, and as more of that data becomes publicly available, more people are looking to maps as a means of expressing the information.

And depending on the type of application, a map can be useful for both the government and its constituents. Many maps help government servants operate more efficiently and save money, while others will answer residents’ questions so they don’t have to call a government worker for the answer…..

Here are seven examples of state and local governments using maps to help themselves and the people they serve.

1. DISTRICT OF COLUMBIA, IOWA GET LOCAL AND CURRENT WITH THE WEATHER

Washington, D.C. snow plow map

As Winter Storm Jonas was busy dropping nearly 30 inches of snow on the nation’s capital, officials in D.C. were working to clear it. And thanks to a mapping application they launched, citizens could see exactly how the city was going about that business.

The District of Columbia’s snow map lets users enter an address, and then shows what snow plows did near that address within a given range of days. The map also shows where the city received 311 requests for snow removal and gives users a chance to look at recent photos from road cameras showing driving conditions…..

2. LOS ANGELES MAPS EL NIÑO RESOURCES, TRENDS

El Niño Watch map

Throughout the winter, weather monitoring experts warned the public time and again that an El Niño system was brewing in the Pacific Ocean that looked to be one of the largest, if not the largest, ever. That would mean torrents of rain for a parched state that’s seen mudslides and flooding during storms in the past.

So to prepare its residents, the city of Los Angeles published a map in January that lets users see both decision-informing trends and the location of resources. Using the application, users can toggle layers showing what the weather is doing around the city, where traffic is backed up, where the power is out, where they can find sand bags to prevent flood damage and more….

3. CALIFORNIA DIVES DEEP INTO AIR POLLUTION RISKS

CalEnviroScreen

….So, faced with a legislative mandate to identify disadvantaged communities, the California Office of Environmental Health Hazard Assessment decided that it wouldn’t just examine smog levels — it would also take a look at the prevalence of at-risk people across the state.

The result is a series of three maps, the first two examining each factor separately and the third combining them. That allows the state and its residents to see where air pollution poses the biggest problem for the people most at risk from it….

4. STREAMLINING RESIDENT SERVICE INFORMATION

Manassas curbside pickup map

The city of Manassas, Va., relied on an outdated paper map and a long-time, well-versed staffer to answer questions about municipal curbside pickup services until they launched this map in 2014. The map allows users to enter their address, and then gives them easy-to-read information about when to put out various things on their curb for pickup.

That’s useful because the city’s fall leaf collection schedule changes every year. So the map not only acts as a benefit to residents who want information, but to city staff who don’t have to deal with as many calls.

The map also shows users the locations of resources they can use and gives them city phone numbers in case they still have questions, and displays it all in a popup pane at the bottom of the map.

5. PLACING TOOLS IN THE HANDS OF THE PUBLIC

A lot of cities and counties have started publishing online maps showing city services and releasing government data.

But Chicago, Boston and Philadelphia stand out as examples of maps that take the idea one step further — because each one offers a staggering amount of choices for users.

Chicago’s new OpenGrid map, just launched in January, is a versatile map that lets users search for certain data like food inspection reports, street closures, potholes and more. That’s enough to answer a lot of questions, but what adds even more utility is the map’s various narrowing tools. Users can narrow searches to a zip code, or they can draw a shape on the map and only see results within that shape. They can perform sub-searches within results and they can choose how they’d like to see the data displayed.
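The draw-a-shape filter is conceptually simple: a point-in-polygon test against the user’s sketch. Below is a minimal illustration — the pothole coordinates and polygon are invented, and this is a generic ray-casting approach, not OpenGrid’s actual implementation:

```python
# A sketch of the "draw a shape, filter the points" interaction OpenGrid offers.
# The ray-casting test below is standard; the polygon and points are made up.
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) falls inside polygon, given as [(x0, y0), ...]."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edge crossings of a horizontal ray extending to the right.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical pothole reports as (lon, lat) pairs.
reports = [(-87.63, 41.88), (-87.70, 41.95), (-87.62, 41.89)]
drawn_shape = [(-87.65, 41.87), (-87.60, 41.87), (-87.60, 41.90), (-87.65, 41.90)]

visible = [p for p in reports if point_in_polygon(p[0], p[1], drawn_shape)]
print(visible)  # only the reports inside the user's shape
```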

Philadelphia’s platform makes use of buttons, icons and categories to help users sift through the spatially-enabled data available to them. Options include future lane closures, bicycle paths, flu shots, city resources, parks and more.

Boston’s platform is open for users to submit their own maps. And submit they have. The city portal offers everything from maps of bus stops to traffic data pulled from the Waze app.

6. HOUSTON TRANSFORMS SERVICE REQUEST DATA

Houston 311 service request map

A 311 service functions as a means of bringing problems to city staff’s attention. But the data itself only goes so far — it needs interpretation.

Houston’s 311 service request map helps users easily analyze the data so as to spot trends. The tool offers lots of ways to narrow the data down, and can isolate many different kinds of requests so users can see whether one problem is reported more often in certain areas.
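A minimal sketch of that kind of narrowing, assuming a hypothetical CSV export with neighborhood, request type, and creation date columns (not Houston’s actual schema):

```python
# Count 311 request types by neighborhood within a date range.
# File name and column names are assumptions for illustration.
import pandas as pd

requests = pd.read_csv("houston_311.csv", parse_dates=["created_at"])

recent = requests[requests["created_at"] >= "2016-01-01"]
by_area = (
    recent.groupby(["neighborhood", "request_type"])
          .size()
          .reset_index(name="count")
          .sort_values("count", ascending=False)
)

# Isolate one kind of request to see where it clusters.
print(by_area[by_area["request_type"] == "Pothole"].head(10))
```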

7. GUIDING BUSINESS GROWTH

For the last several years, the city of Rancho Cucamonga, Calif., has been designing all sorts of maps through its Rancho Enterprise Geographic Information Systems (REGIS) project. Many of them have served specific city purposes, such as tracking code enforcement violations and offering police a command system tool for special events.

The utilitarian foundation of REGIS extends to its public-facing applications as well. One example is INsideRancho, a map built with economic development efforts in mind. The map lets users search and browse available buildings to suit business needs, narrowing results by square footage, zoning and building type. Users can also find businesses by name or address, and look at property exteriors via an embedded connection with Google Street View….(More)”

The Crusade Against Multiple Regression Analysis


Richard Nisbett at the Edge: (VIDEO) “…The thing I’m most interested in right now has become a kind of crusade against correlational statistical analysis—in particular, what’s called multiple regression analysis. Say you want to find out whether taking Vitamin E is associated with lower prostate cancer risk. You look at the correlational evidence and indeed it turns out that men who take Vitamin E have lower risk for prostate cancer. Then someone says, “Well, let’s see if we do the actual experiment, what happens.” And what happens when you do the experiment is that Vitamin E contributes to the likelihood of prostate cancer. How could there be differences? These happen a lot. The correlational—the observational—evidence tells you one thing, the experimental evidence tells you something completely different.

In the case of health data, the big problem is something that’s come to be called the healthy user bias, because the guy who’s taking Vitamin E is also doing everything else right. A doctor or an article has told him to take Vitamin E, so he does that, but he’s also the guy who’s watching his weight and his cholesterol, gets plenty of exercise, drinks alcohol in moderation, doesn’t smoke, has a high level of education, and a high income. All of these things are likely to make you live longer, to make you less subject to morbidity and mortality risks of all kinds. You pull one thing out of that correlate and it’s going to look like Vitamin E is terrific because it’s dragging all these other good things along with it.
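A small simulation makes the healthy user bias concrete. Nothing below comes from a real study; it simply builds in a latent “health-consciousness” trait that drives both the decision to take Vitamin E and the other protective behaviors, then compares the observational contrast with a randomized one:

```python
# Minimal simulation of the "healthy user bias" Nisbett describes.
# Illustrative sketch only: all effect sizes are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A latent "health-consciousness" trait drives both behaviors and outcomes.
health_conscious = rng.normal(size=n)

# Health-conscious people are more likely to take Vitamin E...
takes_vitamin_e = rng.random(n) < 1 / (1 + np.exp(-health_conscious))

# ...and independently have lower disease risk (exercise, diet, income, etc.).
# Here Vitamin E itself has a small *harmful* true effect (+0.2).
risk = 1.0 - 0.8 * health_conscious + 0.2 * takes_vitamin_e + rng.normal(size=n)
disease = risk > 1.5

# Observational comparison: Vitamin E takers look healthier.
obs_diff = disease[takes_vitamin_e].mean() - disease[~takes_vitamin_e].mean()

# Randomized "experiment": assignment is independent of the latent trait.
assigned = rng.random(n) < 0.5
risk_rct = 1.0 - 0.8 * health_conscious + 0.2 * assigned + rng.normal(size=n)
disease_rct = risk_rct > 1.5
rct_diff = disease_rct[assigned].mean() - disease_rct[~assigned].mean()

print(f"Observational difference in disease rate: {obs_diff:+.3f}")  # negative: looks protective
print(f"Randomized difference in disease rate:   {rct_diff:+.3f}")  # positive: truly harmful
```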

This is not, by any means, limited to health issues. A while back, I read a government report in The New York Times on the safety of automobiles. The measure that they used was the deaths per million drivers of each of these autos. It turns out that, for example, there are enormously more deaths per million drivers who drive Ford F150 pickups than for people who drive Volvo station wagons. Most people’s reaction, and certainly my initial reaction to it was, “Well, it sort of figures—everybody knows that Volvos are safe.”

Let’s describe two people and you tell me who you think is more likely to be driving the Volvo and who is more likely to be driving the pickup: a suburban matron in the New York area and a twenty-five-year-old cowboy in Oklahoma. It’s obvious that people are not assigned their cars. We don’t say, “Billy, you’ll be driving a powder blue Volvo station wagon.” Because of this self-selection problem, you simply can’t interpret data like that. You know virtually nothing about the relative safety of cars based on that study.

I saw in The New York Times recently an article by a respected writer reporting that people who have elaborate weddings tend to have marriages that last longer. How would that be? Maybe it’s just all the darned expense and bother—you don’t want to get divorced. It’s a cognitive dissonance thing.

Let’s think about who makes elaborate plans for expensive weddings: people who are better off financially, which is by itself a good prognosis for marriage; people who are more educated, also a better prognosis; people who are richer; people who are older—the later you get married, the more likelihood that the marriage will last, and so on.

The truth is you’ve learned nothing. It’s like saying men who are a Somebody III or IV have longer-lasting marriages. Is it because of the suffix there? No, it’s because those people are the types who have a good prognosis for a lengthy marriage.

A huge range of science projects are done with multiple regression analysis. The results are often somewhere between meaningless and quite damaging….(More)”

What World Are We Building?


danah boyd at Points: “….Knowing how to use data isn’t easy. One of my colleagues at Microsoft Research — Eric Horvitz — can predict with startling accuracy whether someone will be hospitalized based on what they search for. What should he do with that information? Reach out to people? That’s pretty creepy. Do nothing? Is that ethical? No matter how good our predictions are, figuring out how to use them is a complex social and cultural issue that technology doesn’t solve for us. In fact, as it stands, technology is just making it harder for us to have a reasonable conversation about agency and dignity, responsibility and ethics.

Data is power. Increasingly we’re seeing data being used to assert power over people. It doesn’t have to be this way, but one of the things that I’ve learned is that, unchecked, new tools are almost always empowering to the privileged at the expense of those who are not.

For most media activists, unfettered Internet access is at the center of the conversation, and that is critically important. Today we’re standing on a new precipice, and we need to think a few steps ahead of the current fight.

We are moving into a world of prediction. A world where more people are going to be able to make judgments about others based on data. Data analysis that can mark the value of people as worthy workers, parents, borrowers, learners, and citizens. Data analysis that has been underway for decades but is increasingly salient in decision-making across numerous sectors. Data analysis that most people don’t understand.

Many activists will be looking to fight the ecosystem of prediction — and to regulate when and where prediction can be used. This is all fine and well when we’re talking about how these technologies are designed to do harm. But more often than not, these tools will be designed to be helpful, to increase efficiency, to identify people who need help. Their positive uses will exist alongside uses that are terrifying. What do we do?

One of the most obvious issues is the limited diversity of people who are building and using these tools to imagine our future. Statistical and technical literacy isn’t even part of the curriculum in most American schools. In our society where technology jobs are high-paying and technical literacy is needed for citizenry, less than 5% of high schools offer AP computer science courses. Needless to say, black and brown youth are much less likely to have access, let alone opportunities. If people don’t understand what these systems are doing, how do we expect people to challenge them?

We must learn how to ask hard questions of technology and of those making decisions based on data-driven tech. And opening the black box isn’t enough. Transparency of data, algorithms, and technology isn’t enough. We need to build assessment into any system that we roll out. You can’t just put millions of dollars of surveillance equipment into the hands of the police in the hope of creating police accountability, yet, with police body-worn cameras, that’s exactly what we’re doing. And we’re not even trying to assess the implications. This is probably the fastest roll-out of a technology out of hope, and it won’t be the last. How do we get people to look beyond their hopes and fears and actively interrogate the trade-offs?

Technology plays a central role — more and more — in every sector, every community, every interaction. It’s easy to screech in fear or dream of a world in which every problem magically gets solved. To make the world a better place, we need to start paying attention to the different tools that are emerging and learn to frame hard questions about how they should be put to use to improve the lives of everyday people.

We need those who are thinking about social justice to understand technology and those who understand technology to commit to social justice….(More)”

Methods of Estimating the Total Cost of Regulations


Maeve P. Carey for the Congressional Research Service: “Federal agencies issue thousands of regulations each year under delegated authority from Congress. Over the past 70 years, Congress and various Presidents have created a set of procedures agencies must follow to issue these regulations, some of which contain requirements for the calculation and consideration of costs, benefits, and other economic effects of regulations. In recent years, many Members of Congress have expressed an interest in various regulatory reform efforts that would change the current set of rulemaking requirements, including requirements to estimate costs and benefits of regulations. As part of this debate, it has become common for supporters of regulatory reform to comment on the total cost of federal regulation. Estimating the total cost of regulations is inherently difficult. Current estimates of the cost of regulation should be viewed with a great deal of caution. Scholars and governmental entities estimating the total cost of regulation use one of two methods, which are referred to as the “bottom-up” and the “top-down” approach.

The bottom-up approach aggregates individual cost and benefit estimates produced by agencies, arriving at a governmentwide total. In 2014, the annual report to Congress from the Office of Management and Budget estimated the total cost of federal regulations to range between $68.5 and $101.8 billion and the total benefits to be between $261.7 billion and $1,042.1 billion. The top-down approach estimates the total cost of regulation by looking at the relationship of certain macroeconomic factors, including the size of a country’s economy and a proxy measure of how much regulation the country has. This method estimates the economic effect that a hypothetical change in the amount of regulation in the United States might have, considering that economic effect to represent the cost of regulation. One frequently cited study estimated the total cost of regulation in 2014 to be $2.028 trillion, $1.439 trillion of which was calculated using this top-down approach. Each approach has inherent advantages and disadvantages.

The bottom-up approach relies on agency estimates of the effects of specific regulations and can also be used to estimate benefits, because agencies typically estimate both costs and benefits under current requirements so that they may be compared and evaluated against alternatives. The bottom-up approach does not, however, include estimates of costs and benefits of all rules, nor does it include costs and benefits of regulations that are not monetized—meaning that the bottom-up approach is likely an underestimate of the total cost of regulation. Furthermore, the individual estimates produced by agencies and used in the bottom-up approach may not always be accurate.

The top-down approach can be used to estimate effects of rules that are not captured by the bottom-up approach—such as indirect costs and costs of rules issued by independent regulatory agencies—thus theoretically capturing the whole universe of regulatory costs. Its results, however, hinge on a number of methodological challenges that are difficult, if not impossible, to overcome. The biggest challenge may be finding a valid proxy measure for regulation: proxy measures of the total amount of regulation in a country are inherently imprecise and cannot be reliably used to estimate macroeconomic outcomes. Because of this difficulty in identifying a suitable proxy measure of regulation, even if the total cost of regulation is substantial, it cannot be estimated with any precision. The top-down method is intended to measure only costs; measuring costs without also considering benefits does not provide the complete context for evaluating the appropriateness of a country’s amount of regulation.
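A toy calculation, with entirely invented numbers, illustrates both the bottom-up undercount and the proxy problem the report describes:

```python
# Illustrative-only contrast of the two approaches; all numbers are invented.
import numpy as np

# Bottom-up: sum agency estimates. Unmonetized rules contribute nothing,
# so the total is a floor rather than a complete figure.
agency_costs_billions = [12.4, 30.1, 8.7, None, 17.3, None]  # None = not monetized
bottom_up = sum(c for c in agency_costs_billions if c is not None)
print(f"Bottom-up total: ${bottom_up:.1f}B (excludes unmonetized rules)")

# Top-down: regress output on a proxy for "amount of regulation" (e.g. page
# counts) across countries. The proxy's imprecision drives the estimate.
rng = np.random.default_rng(1)
true_regulation = rng.uniform(0, 1, 50)
gdp = 100 - 20 * true_regulation + rng.normal(0, 5, 50)
proxy = true_regulation + rng.normal(0, 0.5, 50)  # very noisy proxy

slope = np.polyfit(proxy, gdp, 1)[0]
print(f"True effect: -20.0, estimated from proxy: {slope:.1f}")  # badly attenuated
```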

For these and other reasons, both approaches to estimating the total cost of regulation have inherent—and potentially insurmountable—flaws….(More)”

Can We Use Data to Stop Deadly Car Crashes?


Allison Shapiro in Pacific Standard Magazine: “In 2014, New York City Mayor Bill de Blasio decided to adopt Vision Zero, a multi-national initiative dedicated to eliminating traffic-related deaths. Under Vision Zero, city services, including the Department of Transportation, began an engineering and public relations plan to make the streets safer for drivers, pedestrians, and cyclists. The plan included street re-designs, improved accessibility measures, and media campaigns on safer driving.

The goal may be an old one, but the approach is innovative: When New York City officials wanted to reduce traffic deaths, they crowdsourced and used data.

Many cities in the United States—from Washington, D.C., all the way to Los Angeles—have adopted some version of Vision Zero, which began in Sweden in 1997. It’s part of a growing trend to make cities “smart” by integrating data collection into things like infrastructure and policing.

Map of high crash corridors in Portland, Oregon. (Map: Portland Bureau of Transportation)

Cities have access to an unprecedented amount of data about traffic patterns, driving violations, and pedestrian concerns. Although advocacy groups say Vision Zero is moving too slowly, de Blasio has invested another $115 million in this data-driven approach.

Interactive safety map. (Map: District Department of Transportation)

De Blasio may have been vindicated. A 2015 year-end report released by the city last week analyzes the successes and shortfalls of data-driven city life, and the early results look promising. In 2015, fewer New Yorkers lost their lives in traffic accidents than in any year since 1910, according to the report, despite the fact that the population has almost doubled in those 105 years.

Below are some of the project highlights.

New Yorkers were invited to add to this public dialogue map, where they could list information ranging from “not enough time to cross” to “red light running.” The Department of Transportation ended up with over 10,000 comments, which led to 80 safety projects in 2015, including the creation of protected bike lanes, the introduction of leading pedestrian intervals, and the simplifying of complex intersections….

Data collected from the public dialogue map, town hall meetings, and past traffic accidents led to “changes to signals, street geometry and markings and regulations that govern actions like turning and parking. These projects simplify driving, walking and bicycling, increase predictability, improve visibility and reduce conflicts,” according to Vision Zero in NYC….(More)”

Don’t let transparency damage science


Stephan Lewandowsky and Dorothy Bishop explain in Nature “how the research community should protect its members from harassment, while encouraging the openness that has become essential to science:…

Transparency has hit the headlines. In the wake of evidence that many research findings are not reproducible, the scientific community has launched initiatives to increase data sharing, transparency and open critique. As with any new development, there are unintended consequences. Many measures that can improve science — shared data, post-publication peer review and public engagement on social media — can be turned against scientists. Endless information requests, complaints to researchers’ universities, online harassment, distortion of scientific findings and even threats of violence: these were all recurring experiences shared by researchers from a broad range of disciplines at a Royal Society-sponsored meeting last year that we organized to explore this topic. Orchestrated and well-funded harassment campaigns against researchers working in climate change and tobacco control are well documented. Some hard-line opponents to other research, such as that on nuclear fallout, vaccination, chronic fatigue syndrome or genetically modified organisms, although less resourced, have employed identical strategies….(More)”

 

Iowa fights snow with data


Patrick Marshall at GCN: “Most residents of the Mid-Atlantic states, now digging out from the recent record-setting snowstorm, probably don’t know how soon their streets will be clear.  If they lived in Iowa, however, they could simply go to the state’s Track a Plow website to see in near real time where snow plows are and in what direction they’re heading.

In fact, the Track a Plow site — the first iteration of which launched three years ago — shows much more than just the location and direction of the state’s more than 900 plows. Because they are equipped with geolocation equipment and a variety of sensors,  the plows also provide information on road conditions, road closures and whether trucks are applying liquid or solid materials to counter snow and ice.  That data is regularly uploaded to Track a Plow, which also offers near-real-time video and photos of conditions.
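The article doesn’t detail the data format, but a telemetry record along these lines would carry everything Track a Plow displays. The field names, values, and function below are assumptions for illustration, not Iowa DOT’s actual interface:

```python
# Sketch of the kind of record each plow might upload to headquarters.
# All field names and values are hypothetical.
import json
import time

def build_plow_report(plow_id, lat, lon, heading_deg, material, camera_url):
    return {
        "plow_id": plow_id,
        "timestamp": int(time.time()),
        "position": {"lat": lat, "lon": lon},
        "heading_deg": heading_deg,  # direction of travel
        "material": material,        # e.g. "solid", "liquid", "none"
        "camera_frame": camera_url,  # latest dashboard photo
    }

report = build_plow_report("IA-0417", 41.5868, -93.6250, 270,
                           material="solid",
                           camera_url="https://example.invalid/frames/IA-0417.jpg")
print(json.dumps(report, indent=2))
# A collector service would ingest these records and push them to the public map feed.
```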

Track a Plow screenshot

According to Eric Abrams, geospatial manager at the Iowa Department of Transportation, the service is very popular and is being used for a variety of purposes.  “It’s been one of the greatest public interface things that DOT has ever done,” he said.  In addition to citizens considering travel, Abrams said, the site’s heavy users include news stations, freight companies routing vehicles and school districts determining whether to delay opening or cancel classes.

How it works

While Track a Plow launched with just location information, it has been frequently enhanced over the past two years, beginning with the installation of video cameras.  “The challenge was to find a cost-effective way to put cams in the plows and then get those images not just to supervisors but to the public,” Abrams said.  The solution he arrived at was dashboard-mounted iPhones that transmit time and location data in addition to images.  These were especially cost-effective because they were free with the department’s Verizon data plan. “Our IT division built a custom iPhone app that is configurable for how often it sends pictures back to headquarters here, where we process them and get them out to the feed,” he explained….(More)”