Filling Public Data Gaps

Report by Judah Axelrod, Karolina Ramos, and Rebecca Bullied: “Data are central to understanding the lived experiences of different people and communities and can serve as a powerful force for promoting racial equity. Although public data, including foundational sources for policymaking such as the US Census Bureau’s American Community Survey (ACS), offer accessible information on a range of topics, challenges of timeliness, granularity, representativeness, and degrees of disaggregation can limit those data’s utility for real-time analysis. Private data—data produced by private-sector organizations either through standard business operations or to market as an asset for purchase—can serve as a richer, more granular, and higher-frequency supplement or alternative to public data sources. This raises questions about how well private data assets can offer race-disaggregated insights that can inform policymaking.

In this report, we explore the current landscape of public-private data sharing partnerships that address topic areas where racial equity research faces data gaps: wealth and assets, financial well-being and income, and employment and job quality. We held 20 semistructured interviews with current producers and users of private-sector data and subject matter experts in the areas of data-sharing models and ethical data usage. Our findings are divided into five key themes:

  • Incentives and disincentives, benefits, and risks to public-private data sharing
    Agreements with prestigious public partners can bolster credibility for private firms and broaden their customer base, while public partners benefit from access to real-time, granular, rich data sources. But data sharing is often time- and labor-intensive, and firms can be concerned about conflicting business interests or about diluting the value of proprietary data assets.
  • Availability of race-disaggregated data sources
    We found no examples in our interviews of race-disaggregated data sources related to our thematic focus areas that are available externally. However, there are promising methods for data imputation, linkage, and augmentation through internal surveys.
  • Data collaboratives in practice
    Most public-private data sharing agreements we learned about are between two parties and entail free or “freemium” access. However, we found promising examples of multilateral agreements that diversify the data-sharing landscape.
  • From data champions to data stewards
    We found many examples of informal data champions who bear responsibility for relationship-building and securing data partnerships. This role has yet to mature into an institutionalized data-steward position within the private firms we interviewed, which can make data sharing a fragile process.
  • Considerations for ethical data usage
    Data privacy and transparency about how data are accessed and used are prominent concerns among prospective data users. Interviewees also stressed the importance of not privileging existing quantitative data above qualitative insights in cases where communities have offered long-standing feedback and narratives about their own experiences facing racial inequities, and cautioned that policymakers should not use the need to collect more data as an excuse for delaying policy action.

Our research yielded several recommendations for data producers and users that engage in data sharing, and for funders seeking to advance data-sharing efforts and promote racial equity…(More)”

The 15-Minute City Quantified Using Mobility Data

Paper by Timur Abbiasov et al.: “Americans travel 7 to 9 miles on average for shopping and recreational activities, which is far longer than the 15-minute (walking) city advocated by ecologically-oriented urban planners. This paper provides a comprehensive analysis of local trip behavior in US cities using GPS data on individual trips from 40 million mobile devices. We define local usage as the share of trips made within a 15-minute walking distance of home, and find that the median US city resident makes only 12% of their daily trips within such a short distance. We find that differences in access to local services can explain 80 percent of the variation in 15-minute usage across metropolitan areas and 74 percent of the variation in usage within metropolitan areas. Differences in historic zoning permissiveness within New York suggest a causal link between access and usage and indicate that less restrictive zoning rules, such as permitting more mixed-use development, would lead to shorter travel times. Finally, we document a strong correlation between local usage and experienced segregation for poorer, but not richer, urbanites, which suggests that 15-minute cities may also exacerbate the social isolation of marginalized communities…(More)”.
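The paper's headline metric lends itself to a small sketch. The Python snippet below computes a toy version of "local usage," the share of an individual's trips ending within a 15-minute walk of home. The walking-speed assumption (about 5 km/h, so roughly 1.25 km in 15 minutes), the function names, and the sample coordinates are ours for illustration; this is not the authors' actual pipeline, which works at the scale of 40 million devices.

```python
# Toy "local usage" metric: share of trips ending within a 15-minute
# walk of home, computed from (lat, lon) pairs. Assumes a walking
# speed of ~5 km/h, so a 15-minute walk covers ~1.25 km.
from math import radians, sin, cos, asin, sqrt

WALK_RADIUS_KM = 5.0 * (15 / 60)  # ~1.25 km in 15 minutes

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # Earth radius ~6371 km

def local_usage(home, trip_destinations):
    """Share of trips ending within WALK_RADIUS_KM of home."""
    local = sum(
        haversine_km(*home, *dest) <= WALK_RADIUS_KM
        for dest in trip_destinations
    )
    return local / len(trip_destinations)

# Illustrative example: home in midtown Manhattan, three destinations.
home = (40.754, -73.984)
trips = [(40.758, -73.985),  # a few blocks away -> local
         (40.641, -73.778),  # JFK airport -> not local
         (40.712, -74.013)]  # downtown -> not local
print(local_usage(home, trips))  # -> 0.3333...
```

The haversine formula is used here because GPS traces come as latitude/longitude pairs; a real analysis would use walking-network distance rather than straight-line distance, which the crude radius only approximates.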

Is bigger better? A study of the effect of group size on collective intelligence in online groups

Paper by Nada Hashmi, G. Shankaranarayanan and Thomas W. Malone: “What is the optimal size for online groups that use electronic communication and collaboration tools? Previous research typically suggested optimal group sizes of about 5 to 7 members, but this research predominantly examined in-person groups. Here we investigate online groups whose members communicate with each other using two electronic collaboration tools: text chat and shared editing. Unlike previous research that studied groups performing a single task, here we measure group performance using a test of collective intelligence (CI) that includes a combination of tasks specifically chosen to predict performance on a wide range of other tasks [72]. Our findings suggest that there is a curvilinear relationship between group size and performance and that the optimal group size in online groups is between 25 and 35. This, in turn, suggests that online groups may now allow more people to be productively involved in group decision-making than was possible with in-person groups in the past…(More)”.

Code for What? Computer Science for Storytelling and Social Justice

Book by Clifford Lee and Elisabeth Soep: “Educators are urged to teach “code for all”—to make a specialized field accessible for students usually excluded from it. In this book, Clifford Lee and Elisabeth Soep instead ask the question, “Code for what?” What if coding were a justice-driven medium for storytelling rather than a narrow technical skill? What if “democratizing” computer science went beyond the usual one-off workshop and empowered youth to create digital products for social impact? Lee and Soep answer these questions with stories of a diverse group of young people in Oakland, California, who combine journalism, data, design, and code to create media that makes a difference.

These teenage and young adult producers created interactive projects that explored gendered and racialized dress code policies in schools; designed tools for LGBTQ+ youth experiencing discrimination; investigated facial recognition software and what can be done about it; and developed a mobile app to promote mental health through self-awareness and outreach for support, among other projects, for distribution to audiences that could reach into the millions. Working with educators and media professionals at YR Media, an award-winning organization that helps young people from underserved communities build skills in media, journalism, and the arts, these teens found their own vibrant answers to “why code?” They code for insight, connection and community, accountability, creative expression, joy, and hope…(More)”.

ResearchDataGov

“ResearchDataGov is a product of the federal statistical agencies and units, created in response to the Foundations of Evidence-based Policymaking Act of 2018. The site is the single portal for discovery of restricted data in the federal statistical system. The agencies have provided detailed descriptions of each data asset. Users can search for data by topic, agency, and keywords. Questions related to the data should be directed to the owning agency, using the contact information on the page that describes the data. In late 2022, users will be able to apply for access to these data using a single-application process built into ResearchDataGov. The site is built by and hosted at ICPSR at the University of Michigan, under contract and guidance from the National Center for Science and Engineering Statistics within the National Science Foundation.

The data described in ResearchDataGov are owned by and accessed through the agencies and units of the federal statistical system. Data access is determined by the owning or distributing agency and is limited to specific physical or virtual data enclaves. Even though all data assets are listed in a single inventory, they are not necessarily available for use in the same location(s). Multiple data assets accessed in the same location may not be usable together due to disclosure risk and other requirements. Please note the access modality of the data in which you are interested and seek guidance from the owning agency about whether assets can be linked or otherwise used together…(More)”.

All Eyes on Them: A Field Experiment on Citizen Oversight and Electoral Integrity

Paper by Natalia Garbiras-Díaz and Mateo Montenegro: “Can information and communication technologies help citizens monitor their elections? We analyze a large-scale field experiment designed to answer this question in Colombia. We leveraged Facebook advertisements sent to over 4 million potential voters to encourage citizen reporting of electoral irregularities. We also cross-randomized whether candidates were informed about the campaign in a subset of municipalities. Both total reports and evidence-backed reports increased substantially. Across a wide array of measures, electoral irregularities decreased. Finally, the reporting campaign reduced the vote share of candidates dependent on irregularities. This light-touch intervention is more cost-effective than monitoring efforts traditionally used by policymakers…(More)”.

Virtual Public Involvement: Lessons from the COVID-19 Pandemic

Report by the National Academies: “During the COVID-19 pandemic, transportation agencies’ most used public-engagement tools were virtual public meetings, social media, dedicated project websites or webpages, email blasts, and electronic surveys. As the pandemic subsides, virtual and hybrid models continue to present both opportunities and challenges.

The TRB National Cooperative Highway Research Program’s NCHRP Web-Only Document 349: Virtual Public Involvement: Lessons from the COVID-19 Pandemic discusses gaps that need to be addressed so that transportation agencies can better use virtual tools and techniques to facilitate two-way communication with the public…(More)”.

Smart OCR – Advancing the Use of Artificial Intelligence with Open Data

Article by Parth Jain, Abhinay Mannepalli, Raj Parikh, and Jim Samuel: “Optical character recognition (OCR) is growing at a projected compounded annual growth rate (CAGR) of 16%, and is expected to have a value of 39.7 billion USD by 2030, as estimated by Straits Research. There has been a growing interest in OCR technologies over the past decade. Optical character recognition is the technological process for transforming images of typed, handwritten, scanned, or printed texts into machine-encoded and machine-readable texts (Tappert et al., 1990). OCR can be used with a broad range of image or scan formats – for example, these could be in the form of a scanned document such as a .pdf file, a picture of a piece of paper in .png or .jpeg format, or images with embedded text, such as characters on a coffee cup, the title on the cover page of a book, license numbers on vehicle plates, and images of code on websites. OCR has proven to be a valuable technological process for tackling the important challenge of transforming non-machine-readable data into machine-readable data. This enables the use of natural language processing and computational methods on information-rich data that were previously largely non-processable. Given the broad array of scanned and image documents in open government data and other open data sources, OCR holds tremendous promise for value generation with open data.

Open data has been defined as “being data that is made freely available for open consumption, at no direct cost to the public, which can be efficiently located, filtered, downloaded, processed, shared, and reused without any significant restrictions on associated derivatives, use, and reuse” (Chidipothu et al., 2022). Large segments of open data contain images, visuals, scans, and other non-machine-readable content. The size and complexity associated with the manual analysis of such content is prohibitive. The most efficient approach would be to establish standardized processes for transforming documents into their OCR output versions. Such machine-readable text could then be analyzed using a range of NLP methods. Artificial intelligence (AI) can be viewed as being a “set of technologies that mimic the functions and expressions of human intelligence, specifically cognition and logic” (Samuel, 2021). OCR was one of the earliest AI technologies implemented. The first ever optical reader to identify handwritten numerals was the advanced reading machine “IBM 1287,” presented at the World Fair in New York in 1965 (Mori et al., 1990). The value of open data is well established – however, the extent of usefulness of open data is dependent on “accessibility, machine readability, quality” and the degree to which data can be processed by using analytical and NLP methods (John et al., 2022)…(More)”
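To make the core OCR idea concrete, here is a deliberately tiny, self-contained sketch of the classic recognition principle: classifying a glyph by nearest-template matching. The character set, the 5x5 bitmaps, and the function names are all invented for illustration; production engines such as Tesseract use far more sophisticated segmentation and recognition models, but the pipeline shape is the same: image in, features compared, best-matching character out.

```python
# Toy OCR by nearest-template matching: each character is a 5x5 binary
# bitmap, and an input glyph is classified as whichever template it
# differs from in the fewest pixels (Hamming distance).

TEMPLATES = {
    "0": ".###."
         "#...#"
         "#...#"
         "#...#"
         ".###.",
    "1": "..#.."
         ".##.."
         "..#.."
         "..#.."
         ".###.",
    "7": "#####"
         "....#"
         "...#."
         "..#.."
         "..#..",
}

def hamming(a: str, b: str) -> int:
    """Count positions where two equal-length bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))

def recognize(glyph: str) -> str:
    """Return the template character closest to the input glyph."""
    return min(TEMPLATES, key=lambda ch: hamming(TEMPLATES[ch], glyph))

if __name__ == "__main__":
    noisy_one = ("..#.."
                 ".##.."
                 "..#.."
                 ".##.."   # one pixel corrupted in this row
                 ".###.")
    print(recognize(noisy_one))  # -> 1
```

The same nearest-match logic tolerates small amounts of noise, which is why even this naive scheme recognizes the corrupted "1" above; modern systems replace pixel templates with learned features to handle fonts, handwriting, and distortion.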

Leveraging Data to Improve Racial Equity in Fair Housing

Report by Temilola Afolabi: “Residential segregation is related to inequalities in education, job opportunities, political power, access to credit, access to health care, and more. Steering, redlining, mortgage lending discrimination, and other historic policies have all played a role in creating this state of affairs.

Over time, federal efforts including the Fair Housing Act and Home Mortgage Disclosure Act have been designed to improve housing equity in the United States. While these laws have not been entirely effective, they have made new kinds of data available—data that can shed light on some of the historic drivers of housing inequity and help inform tailored solutions to their ongoing impact.

This report explores a number of current opportunities to strengthen longstanding data-driven tools to address housing equity. The report also shows how the effects of mortgage lending discrimination and other historic practices are still being felt today. At the same time, it outlines opportunities to apply data to increase equity in many areas related to the homeownership gap, including negative impacts on health and well-being, socioeconomic disparities, and housing insecurity….(More)”.

Closing the gap between user experience and policy design 

Article by Cecilia Muñoz & Nikki Zeichner: “…Ask the average American to use a government system, whether it’s for a simple task like replacing a Social Security Card or a complicated process like filing taxes, and you’re likely to be met with groans of dismay. We all know that government processes are cumbersome and frustrating; we have grown used to the government struggling to deliver even basic services. 

Unacceptable as the situation is, fixing government processes is a difficult task. Behind every exhausting government application form or eligibility screener lurks a complex policy that ultimately leads to what Atlantic staff writer Anne Lowrey calls the time tax, “a levy of paperwork, aggravation, and mental effort imposed on citizens in exchange for benefits that putatively exist to help them.” 

Policies are complex, in part because they each represent many voices. The people we call policymakers are key actors in governments and elected officials at every level, from city councils to the U.S. Congress. As they seek to solve public problems like child poverty or to improve economic mobility, they consult with experts at government agencies, researchers in academia, and advocates working directly with affected communities. They also hear from lobbyists from affected industries. They consider current events and public sentiment. All of these voices and variables, representing different and sometimes conflicting interests, contribute to the policies that become law. And as a result, laws reflect a complex mix of objectives. After a new law is in place, the relevant government agencies are responsible for implementing it by creating new programs and services to carry it out. Complex policies then get translated into complex processes and experiences for members of the public. They become long application forms, unclear directions, and, too often, barriers that keep people from accessing a benefit. 

Policymakers and advocates typically declare victory when a new policy is signed into law; if they think about the implementation details at all, that work mostly happens after the ink is dry. While these policy actors may have deep expertise in a given issue area, or deep understanding of affected communities, they often lack experience designing services in a way that will be easy for the public to navigate…(More)”.