Scientists can now figure out detailed, accurate neighborhood demographics using Google Street View photos


Christopher Ingraham at the Washington Post: “A team of computer scientists has derived accurate, neighborhood-level estimates of the racial, economic and political characteristics of 200 U.S. cities using an unlikely data source — Google Street View images of people’s cars.

Published this week in the Proceedings of the National Academy of Sciences, the report details how the scientists extracted 50 million photographs of street scenes captured by Google’s Street View cars in 2013 and 2014. They then trained a computer algorithm to identify the make, model and year of 22 million automobiles appearing in neighborhoods in those images, parked outside homes or driving down the street.

The vehicles seen in Street View images are often small or blurry, making precise identification a challenge. So the researchers had human experts identify a small subsample of the vehicles and compare those to the results churned out by their algorithm. They that the algorithm correctly identified whether a vehicle was U.S.- or foreign-made roughly 88 percent of the time, got the manufacturer right 66 percent of the time and nailed the exact model 52 percent of the time.

While far from perfect, the sheer size of the vehicle database means those numbers are still useful for real-world statistical applications, like drawing connections between vehicle preferences and demographic data. The 22 million vehicles in the database comprise roughly 8 percent of all vehicles in the United States. By comparison, the U.S. Census Bureau’s massive American Community Survey reaches only about 1.6 percent of American householdseach year, while the typical 1,000-person opinion poll includes just 0.0004 of American adults.

To test what this data set could be capable of, the researchers first paired the Zip code-level vehicle data with numbers on race, income and education from the American Community Survey. They did this for a random 15 percent of the Zip codes in their data set to create a “training set.” They then created another algorithm to go through the training set to see how vehicle characteristics correlated with neighborhood characteristics: What kinds of vehicles are disproportionately likely to appear in white neighborhoods, or black ones? Low-income vs. high-income? Highly-educated areas vs. less-educated ones?

That yielded a number of reliable correlations….(More)”.