Accelerating the Sharing of Data Across Sectors to Advance the Common Good


Paper by Robert M. Groves and Adam Neufeld: “The public pays for and provides an incredible amount of data to governments and companies. Yet much of the value of this data is being wasted, remaining in silos rather than being shared to enhance the common good—whether it’s helping governments to stop opioid addiction or helping companies predict and meet the demand for electric or autonomous vehicles.

  • Many companies and governments are interested in sharing more of their data with each other; however, right now the process of sharing is very time-consuming and can pose great risks, since it often involves sharing full data sets with another entity.
  • We need intermediaries to design safe environments to facilitate data sharing in the low-trust and politically sensitive context of companies and governments. These safe environments would exist outside the government, be transparent to the public, and use modern technologies and techniques to allow only statistical uses of data through temporary linkages in order to minimize the risk to individuals’ privacy.
  • Governments must lead the way in sharing more data by re-evaluating laws that limit sharing of data, and must embrace new technologies that could allow the private sector to receive at least some value from many sensitive data sets. By decreasing the cost and risks of sharing data, more data will be freed from their silos, and we will move closer to what we deserve—that our data are used for the greatest societal benefit….(More)”.
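
The "only statistical uses of data through temporary linkages" idea in the bullets above can be made concrete with a small sketch. The example below is purely illustrative and assumes a hypothetical intermediary that links a government dataset and a company dataset on a shared identifier inside a safe environment, releases only a noise-protected aggregate (a standard Laplace-mechanism count), and discards the linkage; none of the names or parameters come from the paper.

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via the inverse-CDF transform."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def linked_statistical_query(gov_records, company_records, key, predicate, epsilon=0.5):
    """Temporarily link two datasets on `key` and release only a noisy count.

    gov_records, company_records: lists of dicts sharing an identifier field.
    predicate: function over a merged record that defines the statistic.
    The linked records never leave this function; only the noisy aggregate does.
    """
    company_by_key = {r[key]: r for r in company_records}
    linked = [{**g, **company_by_key[g[key]]}
              for g in gov_records if g[key] in company_by_key]
    true_count = sum(1 for r in linked if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)  # sensitivity of a count is 1

# Hypothetical usage: a health agency and a pharmacy chain learn how many people
# appear in both datasets with a given flag, without exchanging the raw records.
# result = linked_statistical_query(health_records, pharmacy_records,
#                                   key="person_id",
#                                   predicate=lambda r: r["flagged"])
```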

Scientists can now figure out detailed, accurate neighborhood demographics using Google Street View photos


Christopher Ingraham at the Washington Post: “A team of computer scientists has derived accurate, neighborhood-level estimates of the racial, economic and political characteristics of 200 U.S. cities using an unlikely data source — Google Street View images of people’s cars.

Published this week in the Proceedings of the National Academy of Sciences, the report details how the scientists extracted 50 million photographs of street scenes captured by Google’s Street View cars in 2013 and 2014. They then trained a computer algorithm to identify the make, model and year of 22 million automobiles appearing in neighborhoods in those images, parked outside homes or driving down the street.

The vehicles seen in Street View images are often small or blurry, making precise identification a challenge. So the researchers had human experts identify a small subsample of the vehicles and compare those to the results churned out by their algorithm. They found that the algorithm correctly identified whether a vehicle was U.S.- or foreign-made roughly 88 percent of the time, got the manufacturer right 66 percent of the time and nailed the exact model 52 percent of the time.
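
The validation step described above amounts to checking agreement between human labels and algorithm output at several levels of granularity. Here is a minimal sketch of that check, assuming paired human and predicted labels with hypothetical field names; the paper's actual validation code is not published in this excerpt.

```python
def accuracy(pairs, field):
    """Fraction of (human_label, predicted_label) pairs that agree on `field`."""
    matches = sum(1 for human, pred in pairs if human[field] == pred[field])
    return matches / len(pairs)

# Each pair holds the human coding and the algorithm's prediction for the same
# vehicle, e.g. ({"origin": "foreign", "make": "Toyota", "model": "Corolla"}, {...}).
# for field in ("origin", "make", "model"):
#     print(field, accuracy(labeled_subsample, field))
```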

While far from perfect, the sheer size of the vehicle database means those numbers are still useful for real-world statistical applications, like drawing connections between vehicle preferences and demographic data. The 22 million vehicles in the database comprise roughly 8 percent of all vehicles in the United States. By comparison, the U.S. Census Bureau’s massive American Community Survey reaches only about 1.6 percent of American households each year, while the typical 1,000-person opinion poll includes just 0.0004 percent of American adults.
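
As a quick back-of-envelope check on those coverage figures, the shares quoted above imply the following denominators (a rough arithmetic sketch using only the numbers in the excerpt):

```python
# ~275 million vehicles nationwide, if 22 million is roughly 8 percent of them
print(22_000_000 / 0.08)

# ~250 million adults, if a 1,000-person poll covers about 0.0004 percent of them
print(1_000 / 0.000004)   # 0.0004 percent expressed as a fraction
```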

To test what this data set could be capable of, the researchers first paired the Zip code-level vehicle data with numbers on race, income and education from the American Community Survey. They did this for a random 15 percent of the Zip codes in their data set to create a “training set.” They then created another algorithm to go through the training set to see how vehicle characteristics correlated with neighborhood characteristics: What kinds of vehicles are disproportionately likely to appear in white neighborhoods, or black ones? Low-income vs. high-income? Highly-educated areas vs. less-educated ones?
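
That training step, which pairs ZIP-code-level vehicle features with ACS demographics for a random 15 percent of ZIP codes and learns how the two correlate, can be sketched as a standard supervised-learning fit. The snippet below is a stand-in rather than the researchers' model: the feature matrix, the ridge regression, and the column meanings are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical design matrix: one row per ZIP code, columns are vehicle features
# (share of pickups, share of foreign makes, median model year, ...), and the
# target is an ACS variable such as median household income.
X = np.random.rand(2000, 10)    # placeholder vehicle-feature matrix
y = np.random.rand(2000)        # placeholder ACS outcome per ZIP code

# Hold out 15 percent of ZIP codes as the training set, mirroring the split above,
# and predict demographics for the remaining 85 percent.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.15, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)
print("R^2 on held-out ZIP codes:", model.score(X_rest, y_rest))
```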

That yielded a number of reliable correlations….(More)”.

Solving Public Problems with Data


Dinorah Cantú-Pedraza and Sam DeJohn at The GovLab: “….To serve the goal of more data-driven and evidence-based governing, The GovLab at NYU Tandon School of Engineering this week launched “Solving Public Problems with Data,” a new online course developed with support from the Laura and John Arnold Foundation.

This online lecture series helps those working for the public sector, or simply in the public interest, learn to use data to improve decision-making. Through real-world examples and case studies — captured in 10 video lectures from leading experts in the field — the new course outlines the fundamental principles of data science and explores ways practitioners can develop a data analytical mindset. Lectures in the series include:

  1. Introduction to evidence-based decision-making  (Quentin Palfrey, formerly of MIT)
  2. Data analytical thinking and methods, Part I (Julia Lane, NYU)
  3. Machine learning (Gideon Mann, Bloomberg LP)
  4. Discovering and collecting data (Carter Hewgley, Johns Hopkins University)
  5. Platforms and where to store data (Arnaud Sahuguet, Cornell Tech)
  6. Data analytical thinking and methods, Part II (Daniel Goroff, Alfred P. Sloan Foundation)
  7. Barriers to building a data practice (Beth Blauer, Johns Hopkins University and GovEx)
  8. Data collaboratives (Stefaan G. Verhulst, The GovLab)
  9. Strengthening a data analytic culture (Amen Ra Mashariki, ESRI)
  10. Data governance and sharing (Beth Simone Noveck, NYU Tandon/The GovLab)

The goal of the lecture series is to enable participants to define and leverage the value of data to achieve improved outcomes and equity, reduced cost and increased efficiency in how public policies and services are created. No prior experience with computer science or statistics is necessary or assumed. In fact, the course is designed precisely to serve public professionals seeking an introduction to data science….(More)”.

GovEx Launches First International Open Data Standards Directory


GT Magazine: “…A nonprofit gov tech group has created an international open data standards directory, aspiring to give cities a singular resource for guidance on formatting data they release to the public…The nature of municipal data is nuanced and diverse, and the format in which it is released often varies depending on subject matter. In other words, a format that works well for public safety data is not necessarily the same that works for info about building permits, transit or budgets. Not having a coordinated and agreed-upon resource to identify the best standards for these different types of info, Nicklin said, creates problems.

One such problem is that it can be time-consuming and challenging for city government data workers to research and identify ideal formats for data. Another is that the lack of info leads to discord between different jurisdictions, meaning one city might format a data set about economic development in an entirely different way than another, making collaboration and comparisons problematic.

What the directory does is provide a list of standards that are in use within municipal governments, as well as an evaluation based on how frequent that use is, whether the format is machine-readable, and whether users have to pay to license it, among other factors.
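
As a rough illustration of what one entry in such a directory captures, the record below models the evaluation criteria named above (adoption, machine-readability, licensing). The field names and example values are invented for the sketch, not GovEx's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class OpenDataStandard:
    """Illustrative record for one entry in an open data standards directory."""
    name: str
    domain: str                     # e.g. "transit", "building permits", "budgets"
    machine_readable: bool
    license_fee_required: bool
    adoption_count: int             # how many municipal governments use it
    languages: list = field(default_factory=list)

example = OpenDataStandard(
    name="General Transit Feed Specification",
    domain="transit",
    machine_readable=True,
    license_fee_required=False,
    adoption_count=1000,            # placeholder figure, not a directory value
    languages=["en"],
)
```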

The directory currently contains 60 standards, some of which are in Spanish, and those involved with the project say they hope to expand their efforts to include more languages. There is also a crowdsourcing component to the directory, in that users are encouraged to make additions and updates….(More)”

The Hidden Pitfall of Innovation Prizes


Reto Hofstetter, John Zhang and Andreas Herrmann at Harvard Business Review: “…it is not so easy to get people to submit their ideas to online innovation platforms. Our data from an online panel reveal that 65% of the contributors do not come back more than twice, and that most of the rest quit after a few tries. This kind of user churn is endemic to online social platforms — on Twitter, for example, a majority of users become inactive over time — and crowdsourcing is no exception. In a way, this turnover is even worse than ordinary customer churn: When a customer defects, a firm knows the value of what it’s lost, but there is no telling how valuable the ideas not submitted might have been….

It is surprising, then, that crowdsourcing on popular platforms is typically designed in a way that amplifies churn. Right now, in typical innovation contests, rewards are granted to winners only and the rest get no return on their participation. This design choice is often motivated by the greater effort participants exert when there is a top prize much more valuable than the rest. Often, the structure is something like the Wimbledon Tennis Championship, where the winning player wins twice as much as the runner-up and four times as much as the semifinalists — with the rest eventually leaving empty-handed.

This winner-take-most prize spread increases the incentive to win and thus individual efforts. With only one winner, however, the others are left with nothing to show for their effort, which may significantly reduce their motivation to enter again.

An experiment we recently ran confirmed how entrants respond to this kind of winner-take-all prize structure. …

In line with the above reasoning, we found that winner-take-all contests yielded significantly better ideas compared to multiple prizes in the first round. Importantly, however, this result flipped when we invited the same cohort of innovators to participate again in a second, subsequent contest. While 50% of participants in the multiple-prize contest chose to participate again, only 37% did so when the winner took all in their first contest. Moreover, innovators who had received no reward in the first contest showed significantly lower effort in the second contest and generated fewer ideas. In the second contest, multiple prizes generated better ideas than the second round of the winner-take-all contest….

Other non-monetary positive feedback, such as encouraging comments or ratings, can have similar effects. These techniques are important, because alleviating innovator churn helps companies interested in longer-term success of their crowdsourcing activities….(More)”.

Nearly All of Wikipedia Is Written By Just 1 Percent of Its Editors


Daniel Oberhaus at Motherboard: “…Sixteen years later, the free encyclopedia and fifth most popular website in the world is well on its way to this goal. Today, Wikipedia is home to 43 million articles in 285 languages and all of these articles are written and edited by an autonomous group of international volunteers.

Although the non-profit Wikimedia Foundation diligently keeps track of how editors and users interact with the site, until recently it was unclear how content production on Wikipedia was distributed among editors. According to the results of a recent study that looked at the 250 million edits made on Wikipedia during its first ten years, only about 1 percent of Wikipedia’s editors have generated 77 percent of the site’s content.

“Wikipedia is both an organization and a social movement,” Sorin Matei, the director of the Purdue University Data Storytelling Network and lead author of the study, told me on the phone. “The assumption is that it’s a creation of the crowd, but this couldn’t be further from the truth. Wikipedia wouldn’t have been possible without a dedicated leadership.”

At the time of writing, there are roughly 132,000 registered editors who have been active on Wikipedia in the last month (there are also an unknown number of unregistered Wikipedians who contribute to the site). So statistically speaking, only about 1,300 people are creating over three-quarters of the 600 new articles posted to Wikipedia every day.

Of course, these “1 percenters” have changed over the last decade and a half. According to Matei, roughly 40 percent of the top 1 percent of editors bow out about every five weeks. In the early days, when there were only a few hundred thousand people collaborating on Wikipedia, Matei said the content production was significantly more equitable. But as the encyclopedia grew, and the number of collaborators grew with it, a cadre of die-hard editors emerged that have accounted for the bulk of Wikipedia’s growth ever since.

Matei and his colleague Brian Britt, an assistant professor of journalism at South Dakota State University, used a machine learning algorithm to crawl the quarter of a billion publicly available edit logs from Wikipedia’s first decade of existence. The results of this research, published in September as a book, suggest that for all of Wikipedia’s pretension to being a site produced by a network of freely collaborating peers, “some peers are more equal than others,” according to Matei.
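
The concentration statistic at the heart of the study (what share of total contribution comes from the top 1 percent of editors) can be computed from an edit log in a few lines. This is a minimal sketch of the measure itself, not the authors' pipeline, and it leaves the contribution metric (edit counts versus text added) abstract.

```python
from collections import Counter

def top_percent_share(edit_log, pct=0.01):
    """Share of total contribution produced by the top `pct` of editors.

    edit_log: iterable of (editor_id, contribution) pairs, where contribution
    could be 1 per edit or the amount of text an edit added.
    """
    totals = Counter()
    for editor, contribution in edit_log:
        totals[editor] += contribution
    ranked = sorted(totals.values(), reverse=True)
    k = max(1, int(len(ranked) * pct))
    return sum(ranked[:k]) / sum(ranked)

# A result of roughly 0.77 would correspond to the 77 percent figure cited above.
```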

Matei and Britt argue that rather than being a decentralized, spontaneously evolving organization, Wikipedia is better described as an “adhocracy”—a stable hierarchical power structure which nevertheless allows for a high degree of individual mobility within that hierarchy….(More)”.

More Machine Learning About Congress’ Priorities


ProPublica: “We keep training machine learning models on Congress. Find out what this one learned about lawmakers’ top issues…

Speaker of the House Paul Ryan is a tax wonk ― and most observers of Congress know that. But knowing what interests the other 434 members of Congress is harder.

To make it easier to know what issues each lawmaker really focuses on, we’re launching a new feature in our Represent database called Policy Priorities. We had two goals in creating it: to help researchers and journalists understand what drives particular members of Congress, and to enable regular citizens to compare their representatives’ priorities to their own and their communities’.

We created Policy Priorities using some sophisticated computer algorithms (more on this in a second) to calculate interest based on what each congressperson talks ― and brags ― about in their press releases.

Voting and drafting legislation aren’t the only things members of Congress do with their time, but they’re often the main way we analyze congressional data, in part because they’re easily measured. But the job of a member of Congress goes well past voting. They go to committee meetings, discuss policy on the floor and in caucuses, raise funds and ― important for our purposes ― communicate with their constituents and journalists back home. They use press releases to talk about what they’ve accomplished and to demonstrate their commitment to their political ideals.

We’ve been gathering these press releases for a few years, and have a body of some 86,000 that we used for a kind of analysis called machine learning….(More)”.
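
The excerpt does not spell out ProPublica's algorithms, but the general shape of the analysis (turning a large corpus of press releases into per-member issue estimates) can be sketched with a standard topic model. The library, the topic count, and the toy corpus below are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny placeholder corpus; the real analysis would run over the ~86,000 releases.
press_releases = [
    "Representative announces new tax reform bill for small businesses",
    "Senator secures funding for veterans health care clinic",
    "Statement on immigration enforcement and border security policy",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(press_releases)

# A handful of topics for the toy corpus; a real run would use many more.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_weights = lda.fit_transform(X)   # rows: releases, columns: topic shares

# Averaging each member's rows of topic_weights yields a per-member issue profile
# that can be compared with other members or with district-level concerns.
```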

The frontiers of data interoperability for sustainable development


Report from the Joined-Up Data Standards [JUDS] project: “…explores where progress has been made, what challenges still remain, and how the new Collaborative on SDG Data Interoperability will play a critical role in moving forward the agenda for interoperability policy.

There is an ever-growing need for a more holistic picture of development processes worldwide and interoperability solutions that can be scaled, driven by global development agendas such as the 2030 Agenda and the Open Data movement. This requires the ability to join up data across multiple data sources and standards to create actionable information.
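
A hedged sketch of what "joining up" data across sources and standards looks like in practice: two sources describe the same districts with different code lists and field names, so a small crosswalk maps them onto a shared key before merging. The codes, columns, and figures below are invented for illustration.

```python
import pandas as pd

# Source A: health facility counts keyed by an ISO-style district code.
health = pd.DataFrame({"district_iso": ["KE-01", "KE-02"], "clinics": [14, 9]})

# Source B: budget allocations keyed by a national statistics-office code.
budget = pd.DataFrame({"nso_code": ["101", "102"], "health_budget_usd": [250000, 180000]})

# Crosswalk between the two code lists -- the piece that interoperability work supplies.
crosswalk = pd.DataFrame({"district_iso": ["KE-01", "KE-02"], "nso_code": ["101", "102"]})

joined = (health.merge(crosswalk, on="district_iso")
                .merge(budget, on="nso_code"))
print(joined)   # one row per district, with clinic counts and budgets side by side
```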

Solutions that create value for front-line decision makers — health centre managers, local school authorities or water and sanitation committees, for example, and those engaged in government accountability — will be crucial to meet the data needs of the SDGs, and do so in an internationally comparable way. While progress has been made at both a national and international level, moving from principle to practice by embedding interoperability into day-to-day work continues to present challenges.

Based on research and learning generated by the JUDS project team at Development Initiatives and Publish What You Fund, as well as inputs from interviews with key stakeholders, this report aims to provide an overview of the different definitions and components of interoperability and why it is important, and an outline of the current policy landscape.

We offer a set of guiding principles that we consider essential to implementing interoperability, and contextualise the five frontiers of interoperability for sustainable development that we have identified. The report also offers recommendations on what the role of the Collaborative could be in this fast-evolving landscape….(More)”.

Leveraging the disruptive power of artificial intelligence for fairer opportunities


Makada Henry-Nickie at Brookings: “According to President Obama’s Council of Economic Advisers (CEA), approximately 3.1 million jobs will be rendered obsolete or permanently altered as a consequence of artificial intelligence technologies. Artificial intelligence (AI) will, for the foreseeable future, have a significant disruptive impact on jobs. That said, this disruption can create new opportunities if policymakers choose to harness them—including some with the potential to help address long-standing social inequities. Investing in quality training programs that deliver premium skills, such as computational analysis and cognitive thinking, provides a real opportunity to leverage AI’s disruptive power.

AI’s disruption presents a clear challenge: competition to traditional skilled workers arising from the cross-relevance of data scientists and code engineers, who can adapt quickly to new contexts. Data analytics has become an indispensable feature of successful companies across all industries. ….

Investing in high-quality education and training programs is one way that policymakers proactively attempt to address the workforce challenges presented by artificial intelligence. It is essential that we make affirmative, inclusive choices to ensure that marginalized communities participate equitably in these opportunities.

Policymakers should prioritize understanding the demographics of those most likely to lose jobs in the short-run. As opposed to obsessively assembling case studies, we need to proactively identify policy entrepreneurs who can conceive of training policies that equip workers with technical skills of “long-game” relevance. As IBM points out, “[d]ata democratization impacts every career path, so academia must strive to make data literacy an option, if not a requirement, for every student in any field of study.”

Machines are an equal opportunity displacer, blind to color and socioeconomic status. Effective policy responses require collaborative data collection and coordination among key stakeholders—policymakers, employers, and educational institutions—to identify at-risk worker groups and to inform workforce development strategies. Machine substitution is purely an efficiency game in which workers overwhelmingly lose. Nevertheless, we can blunt these effects by identifying critical leverage points….

Policymakers can choose to harness AI’s disruptive power to address workforce challenges and redesign fair access to opportunity simultaneously. We should train our collective energies on identifying practical policies that update our current agrarian-based education model, which unfairly disadvantages children from economically segregated neighborhoods…(More)”

Democracy is dead: long live democracy!


Helen Margetts in OpenDemocracy: “In the course of the World Forum for Democracy 2017, and in political commentary more generally, social media are blamed for almost everything that is wrong with democracy. They are held responsible for pollution of the democratic environment through fake news, junk science, computational propaganda and aggressive micro-targeting. In turn, these phenomena have been blamed for the rise of populism, political polarization, far-right extremism and radicalisation, waves of hate against women and minorities, post-truth, the end of representative democracy, fake democracy and ultimately, the death of democracy. It feels like the tirade of relatives of the deceased at the trial of the murderer. It is extraordinary how much of this litany is taken almost as given, the most gloomy prognoses as certain visions of the future.

Yet actually we know rather little about the relationship between social media and democracy. Because ten years of the internet and social media have challenged everything we thought we knew. They have injected volatility and instability into political systems, bringing a continual cast of unpredictable events. They bring into question normative models of democracy – by which we might understand the macro-level shifts at work – seeming to make possible the highest hopes and worst fears of republicanism and pluralism.

They have transformed the ecology of interest groups and mobilizations. They have challenged élites and ruling institutions, bringing regulatory decay and policy sclerosis. They create undercurrents of political life that burst to the surface in seemingly random ways, making fools of opinion polls and pollsters. And although the platforms themselves generate new sources of real-time transactional data that might be used to understand and shape this changed environment, most of this data is proprietary and inaccessible to researchers, meaning that the revolution in big data and data science has passed by democracy research.

What do we know? The value of tiny acts

Certainly digital media are entwined with every democratic institution and the daily lives of citizens. When deciding whether to vote, to support, to campaign, to demonstrate, to complain – digital media are with us at every step, shaping our information environment and extending our social networks by creating hundreds or thousands of ‘weak ties’, particularly for users of social media platforms such as Facebook or Instagram….(More)”.