What 40 Million Devices Can Teach Us About Digital Literacy in America


Blog by Juan M. Lavista Ferres: “…For the first time, Microsoft is releasing a privacy-protected dataset that provides new insights into digital engagement across the United States. This dataset, built from anonymized usage data from 40 million Windows devices, offers the most comprehensive view ever assembled of how digital tools are being used across the country. It goes beyond surveys and self-reported data to provide a real-world look at software application usage across 28,000 ZIP codes, creating a more detailed and nuanced understanding of digital engagement than any existing commercial or government study.

In collaboration with leading researchers at Harvard University and the University of Pennsylvania, we analyzed this dataset and developed two key indices to measure digital literacy:

  • Media & Information Composite Index (MCI): This index captures general computing activity, including media consumption, information gathering, and usage of productivity applications like word processing, spreadsheets, and presentations.
  • Content Creation & Computation Index (CCI): This index measures engagement with more specialized digital applications, such as content creation tools like Photoshop and software development environments.

By combining these indices with demographic data, several important insights emerge:

Urban-Rural Disparities Exist, But the Gaps Are Uneven

While rural areas often lag in digital engagement, disparities within urban areas are just as pronounced. Some city neighborhoods have digital activity levels on par with major tech hubs, while others fall significantly behind, revealing a more complex digital divide than previously understood.

Income and Education Are Key Drivers of Digital Engagement

Higher-income and higher-education areas show significantly greater engagement in content creation and computational tasks. This suggests that digital skills, not just access, are critical in shaping economic mobility and opportunity. Even in places where broadband availability is the same, digital usage patterns vary widely, demonstrating that access alone is not enough.

Infrastructure Alone Won’t Close the Digital Divide

Providing broadband connectivity is essential, but it is not a sufficient solution to the challenges of digital literacy. Our findings show that even in well-connected regions, significant skill gaps persist. This means that policies and interventions must go beyond infrastructure investments to include comprehensive digital education, skills training, and workforce development initiatives…(More)”.

AI crawler wars threaten to make the web more closed for everyone


Article by Shayne Longpre: “We often take the internet for granted. It’s an ocean of information at our fingertips—and it simply works. But this system relies on swarms of “crawlers”—bots that roam the web, visit millions of websites every day, and report what they see. This is how Google powers its search engines, how Amazon sets competitive prices, and how Kayak aggregates travel listings. Beyond the world of commerce, crawlers are essential for monitoring web security, enabling accessibility tools, and preserving historical archives. Academics, journalists, and civil societies also rely on them to conduct crucial investigative research.  

Crawlers are endemic. Now representing half of all internet traffic, they will soon outpace human traffic. This unseen subway of the web ferries information from site to site, day and night. And as of late, they serve one more purpose: Companies such as OpenAI use web-crawled data to train their artificial intelligence systems, like ChatGPT. 

Understandably, websites are now fighting back for fear that this invasive species, AI crawlers, will help displace them. But there’s a problem: This pushback is also threatening the transparency and open borders of the web that allow non-AI applications to flourish. Unless we are thoughtful about how we fix this, the web will increasingly be fortified with logins, paywalls, and access tolls that inhibit not just AI but the biodiversity of real users and useful crawlers…(More)”.

Recommendations for Better Sharing of Climate Data


Creative Commons: “…the culmination of a nine-month research initiative from our Open Climate Data project. These guidelines are a result of collaboration between Creative Commons, government agencies and intergovernmental organizations. They mark a significant milestone in our ongoing effort to enhance the accessibility, sharing, and reuse of open climate data to address the climate crisis. Our goal is to share strategies that align with existing data sharing principles and pave the way for a more interconnected and accessible future for climate data.

Our recommendations offer practical steps and best practices, crafted in collaboration with key stakeholders and organizations dedicated to advancing open practices in climate data. We provide recommendations for 1) legal and licensing terms, 2) using metadata values for attribution and provenance, and 3) management and governance for better sharing.

Opening climate data requires an examination of the public’s legal rights to access and use the climate data, often dictated by copyright and licensing. This legal detail is sometimes missing from climate data sharing and legal interoperability conversations. Our recommendations suggest two options: Option A: CC0 + Attribution Request, in order to maximize reuse by dedicating climate data to the public domain, plus a request for attribution; and Option B: CC BY 4.0, for retaining data ownership and legal enforcement of attribution. We address how to navigate license stacking and attribution stacking for climate data hosts and for users working with multiple climate data sources.

We also propose standardized human- and machine-readable metadata values that enhance transparency, reduce guesswork, and ensure broader accessibility to climate data. We built upon existing model metadata schemas and standards, including those that address license and attribution information. These recommendations address a gap and provide metadata schema that standardize the inclusion of upfront, clear values related to attribution, licensing and provenance.

Lastly, we highlight four key aspects of effective climate data management: designating a dedicated technical managing steward, designating a legal and/or policy steward, encouraging collaborative data sharing, and regularly revisiting and updating data sharing policies in accordance with parallel open data policies and standards…(More)”.

Cities, health, and the big data revolution


Blog by Harvard Public Health: “Cities influence our health in unexpected ways. From sidewalks to crosswalks, the built environment affects how much we move, impacting our risk for diseases like obesity and diabetes. A recent New York City study underscores that focusing solely on infrastructure, without understanding how people use it, can lead to ineffective interventions. Researchers analyzed over two million Google Street View images, combining them with health and demographic data to reveal these dynamics. Harvard Public Health spoke with Rumi Chunara, director of New York University’s Center for Health Data Science and lead author of the study.

Why study this topic?

We’re seeing an explosion of new data sources, like street-view imagery, being used to make decisions. But there’s often a disconnect—people using these tools don’t always have the public health knowledge to interpret the data correctly. We wanted to highlight the importance of combining data science and domain expertise to ensure interventions are accurate and impactful.

What did you find?

We discovered that the relationship between built environment features and health outcomes isn’t straightforward. It’s not just about having sidewalks; it’s about how often people are using them. Improving physical activity levels in a community could have a far greater impact on health outcomes than simply adding more infrastructure.

It also revealed the importance of understanding the local context. For instance, Google Street View data sometimes misclassifies sidewalks, particularly near highways or bridges, leading to inaccurate conclusions. Relying solely on this data, without accounting for these nuances, could result in less effective interventions…(More)”.

Establish data collaboratives to foster meaningful public involvement


Article by Gwen Ottinger: “…Data Collaboratives would move public participation and community engagement upstream in the policy process by creating opportunities for community members to contribute their lived experience to the assessment of data and the framing of policy problems. This would in turn foster two-way communication and trusting relationships between government and the public. Data Collaboratives would also help ensure that data and their uses in federal government are equitable, by inviting a broader range of perspectives on how data analysis can promote equity and where relevant data are missing. Finally, Data Collaboratives would be one vehicle for enabling individuals to participate in science, technology, engineering, math, and medicine activities throughout their lives, increasing the quality of American science and the competitiveness of American industry…(More)”.

Big data for decision-making in public transport management: A comparison of different data sources


Paper by Valeria Maria Urbano, Marika Arena, and Giovanni Azzone: “The conventional data used to support public transport management have inherent constraints related to scalability, cost, and the potential to capture space and time variability. These limitations underscore the importance of exploring innovative data sources to complement more traditional ones.

For public transport operators, who are tasked with making pivotal decisions spanning planning, operation, and performance measurement, innovative data sources are a frontier that is still largely unexplored. To fill this gap, this study first establishes a framework for evaluating innovative data sources, highlighting the specific characteristics that data should have to support decision-making in the context of transportation management. Second, a comparative analysis is conducted, using empirical data collected from primary public transport operators in the Lombardy region, with the aim of understanding whether and to what extent different data sources meet the above requirements.

The findings of this study support transport operators in selecting data sources aligned with different decision-making domains, highlighting related benefits and challenges. This underscores the importance of integrating different data sources to exploit their complementarities…(More)”.

Overcoming challenges associated with broad sharing of human genomic data


Paper by Jonathan E. LoTempio Jr & Jonathan D. Moreno: “Since the Human Genome Project, the consensus position in genomics has been that data should be shared widely to achieve the greatest societal benefit. This position relies on imprecise definitions of the concept of ‘broad data sharing’. Accordingly, the implementation of data sharing varies among landmark genomic studies. In this Perspective, we identify definitions of broad that have been used interchangeably, despite their distinct implications. We further offer a framework with clarified concepts for genomic data sharing and probe six examples in genomics that produced public data. Finally, we articulate three challenges. First, we explore the need to reinterpret the limits of general research use data. Second, we consider the governance of public data deposition from extant samples. Third, we ask whether, in light of changing concepts of broad, participants should be encouraged to share their status as participants publicly or not. Each of these challenges is followed with recommendations…(More)”.

Digitalizing sewage: The politics of producing, sharing, and operationalizing data from wastewater-based surveillance


Paper by Josie Wittmer, Carolyn Prouse, and Mohammed Rafi Arefin: “Expanded during the COVID-19 pandemic, Wastewater-Based Surveillance (WBS) is now heralded by scientists and policy makers alike as the future of monitoring and governing urban health. The expansion of WBS reflects larger neoliberal governance trends whereby digitalizing states increasingly rely on producing big data as a ‘best practice’ to surveil various aspects of everyday life. With a focus on three South Asian cities, our paper investigates the transnational pathways through which WBS data is produced, made known, and operationalized in ‘evidence-based’ decision-making in a time of crisis. We argue that in South Asia, wastewater surveillance data is actively produced through fragile but power-laden networks of transnational and local knowledge, funding, and practices. Using mixed qualitative methods, we found these networks produced artifacts like dashboards to communicate data to the public in ways that enabled claims to objectivity, ethical interventions, and transparency. Interrogating these representations, we demonstrate how these artifacts open up messy spaces of translation that trouble linear notions of objective data informing accountable, transparent, and evidence-based decision-making for diverse urban actors. By thinking through the production of precarious biosurveillance infrastructures, we respond to calls for more robust ethical and legal frameworks for the field and suggest that the fragility of WBS infrastructures has important implications for the long-term trajectories of urban public health governance in the global South…(More)”.

Behaviour-based dependency networks between places shape urban economic resilience


Paper by Takahiro Yabe et al.: “Disruptions, such as closures of businesses during pandemics, not only affect businesses and amenities directly but also influence how people move, spreading the impact to other businesses and increasing the overall economic shock. However, it is unclear how much businesses depend on each other during disruptions. Leveraging human mobility data and same-day visits in five US cities, we quantify dependencies between points of interest encompassing businesses, stores and amenities. We find that dependency networks computed from human mobility exhibit significantly higher rates of long-distance connections and biases towards specific pairs of point-of-interest categories. We show that using behaviour-based dependency relationships improves the predictability of business resilience during shocks by around 40% compared with distance-based models, and that neglecting behaviour-based dependencies can lead to underestimation of the spatial cascades of disruptions. Our findings underscore the importance of measuring complex relationships in patterns of human mobility to foster urban economic resilience to shocks…(More)”.

Data solidarity: Operationalising public value through a digital tool


Paper by Seliem El-Sayed, Ilona Kickbusch & Barbara Prainsack: “Most data governance frameworks are designed to protect the individuals from whom data originates. However, the impacts of digital practices extend to a broader population and are embedded in significant power asymmetries within and across nations. Further, inequities in digital societies impact everyone, not just those directly involved. Addressing these challenges requires an approach which moves beyond individual data control and is grounded in the values of equity and a just contribution of benefits and risks from data use. Solidarity-based data governance (in short: data solidarity) suggests prioritising data uses over data type and proposes that data uses that generate public value should be actively facilitated, those that generate significant risks and harms should be prohibited or strictly regulated, and those that generate private benefits with little or no public value should be ‘taxed’ so that profits generated by corporate data users are reinvested in the public domain. In the context of global health data governance, the public value generated by data use is crucial. This contribution clarifies the meaning, importance, and potential of public value within data solidarity and outlines methods for its operationalisation through the PLUTO tool, specifically designed to assess the public value of data uses…(More)”.