Data Sandboxes: Managing the Open Data Spectrum


Primer by Uma Kalkar, Sampriti Saxena, and Stefaan Verhulst: “Opening up data offers opportunities to enhance governance, elevate public and private services, empower individuals, and bolster public well-being. However, achieving the delicate balance between open data access and the responsible use of sensitive and valuable information presents complex challenges. Data sandboxes are an emerging approach to balancing these needs.

In this white paper, The GovLab seeks to answer the following questions surrounding data sandboxes: What are data sandboxes? How can data sandboxes empower decision-makers to unlock the potential of open data while maintaining the necessary safeguards for data privacy and security? Can data sandboxes help decision-makers overcome barriers to data access and promote purposeful, informed data (re-)use?

The six characteristics of a data sandbox. Image by The GovLab.

After evaluating a series of case studies, we identified the following key findings:

  • Data sandboxes present six unique characteristics that make them a strong tool for facilitating open data and data re-use. These six characteristics are: controlled; secure; multi-sectoral and collaborative; high computing environments; temporal in nature; and adaptable and scalable.
  • Data sandboxes can be used for: pre-engagement assessment, data mesh enablement, rapid prototyping, familiarization, quality and privacy assurance, experimentation and ideation, white labeling and minimization, and maturing data insights.
  • There are many benefits to implementing data sandboxes. We found ten value propositions, such as: decreasing risk in accessing more sensitive data; enhancing data capacity; and fostering greater experimentation and innovation, to name a few.
  • When looking to implement a data sandbox, decision-makers should consider how they will attract and obtain high-quality, relevant data, keep the data fresh for accurate re-use, manage risks of data (re-)use, and translate and scale up sandbox solutions in real markets.
  • Advances in the use of the Internet of Things and Privacy Enhancing Technologies could help improve the creation, preparation, analysis, and security of data in a data sandbox. The development of these technologies, in parallel with European legislative measures such as the Digital Markets Act, the Data Act and the Data Governance Act, can improve the way data is unlocked in a data sandbox, improving trust and encouraging data (re-)use initiatives…(More)” (FULL PRIMER)

Data Dysphoria: The Governance Challenge Posed by Large Learning Models


Paper by Susan Ariel Aaronson: “Only 8 months have passed since ChatGPT and the large learning model underpinning it took the world by storm. This article focuses on the data supply chain—the data collected and then utilized to train large language models—and the governance challenge it presents to policymakers. These challenges include:

• How web scraping may affect individuals and firms which hold copyrights.
• How web scraping may affect individuals and groups who are supposed to be protected under privacy and personal data protection laws.
• How web scraping revealed the lack of protections for content creators and content providers on open access web sites; and
• How the debate over open and closed source LLMs reveals the lack of clear and universal rules to ensure the quality and validity of datasets. As the US National Institute of Standards and Technology explained, many LLMs depend on “large-scale datasets, which can lead to data quality and validity concerns. The difficulty of finding the ‘right’ data may lead AI actors to select datasets based more on accessibility and availability than on suitability… Such decisions could contribute to an environment where the data used in processes is not fully representative of the populations or phenomena that are being modeled, introducing downstream risks”; in short, problems of quality and validity…(More)”.

International Definitions of Artificial Intelligence


Report by IAPP: “Computer scientist John McCarthy coined the term artificial intelligence in 1955, defining it as “the science and engineering of making intelligent machines.” He organized the Dartmouth Summer Research Project on Artificial Intelligence a year later — an event that many consider the birthplace of the field.

In today’s world, the definition of AI has been in continuous evolution, its contours and constraints changing to align with current and perhaps future technological progress and cultural contexts. In fact, most papers and articles are quick to point out the lack of common consensus around the definition of AI. As a resource from British research organization the Ada Lovelace Institute states, “We recognise that the terminology in this area is contested. This is a fast-moving topic, and we expect that terminology will evolve quickly.” The difficulty in defining AI is illustrated by what AI historian Pamela McCorduck called the “odd paradox,” referring to the idea that, as computer scientists find new and innovative solutions, computational techniques once considered AI lose the title as they become common and repetitive.

The indeterminate nature of the term poses particular challenges in the regulatory space. Indeed, in 2017 a New York City Council task force downgraded its mission to regulate the city’s use of automated decision-making systems to just defining the types of systems subject to regulation because it could not agree on a workable, legal definition of AI.

With this understanding, the following chart provides a snapshot of some of the definitions of AI from various global and sectoral (government, civil society and industry) perspectives. The chart is not an exhaustive list. It allows for cross-contextual comparisons from key players in the AI ecosystem…(More)”

Can Google Trends predict asylum-seekers’ destination choices?


Paper by Haodong Qi & Tuba Bircan: “Google Trends (GT) collates the volumes of search keywords over time and by geographical location. Such data could, in theory, provide insights into people’s ex ante intentions to migrate, and hence be useful for predictive analysis of future migration. Empirically, however, the predictive power of GT is sensitive: it may vary depending on geographical context, the search keywords selected for analysis, as well as Google’s market share and its users’ characteristics and search behavior, among others. Unlike most previous studies attempting to demonstrate the benefit of using GT for forecasting migration flows, this article addresses a critical but less discussed issue: when GT cannot enhance the performance of migration models. Using EUROSTAT statistics on first-time asylum applications and a set of push-pull indicators gathered from various data sources, we train three classes of gravity models that are commonly used in the migration literature, and examine how the inclusion of GT may affect models’ abilities to predict refugees’ destination choices. The results suggest that the effects of including GT are highly contingent on the complexity of different models. Specifically, GT can only improve the performance of relatively simple models, but not of those augmented by flow Fixed-Effects or by Auto-Regressive effects. These findings call for a more comprehensive analysis of the strengths and limitations of using GT, as well as other digital trace data, in the context of modeling and forecasting migration. It is our hope that this nuanced perspective can spur further innovations in the field, and ultimately bring us closer to a comprehensive modeling framework of human migration…(More)”.
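The core idea of the paper — adding a search-volume covariate to a gravity-style regression and checking whether predictive error falls — can be sketched in a few lines. The sketch below is purely illustrative, not the authors' models or data: the flows, the covariates (`log_pop_o`, `log_pop_d`, `log_dist`, `gt_index`), and the coefficients are all synthetic, and a plain OLS fit stands in for the gravity-model classes used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic toy data: log asylum flows for origin-destination pairs,
# driven by population sizes and distance (a classic gravity structure),
# plus a Google-Trends-style standardized search-intensity index.
n = 200
log_pop_o = rng.normal(15, 1, n)    # log origin population (hypothetical)
log_pop_d = rng.normal(15, 1, n)    # log destination population (hypothetical)
log_dist = rng.normal(7, 0.5, n)    # log origin-destination distance
gt_index = rng.normal(0, 1, n)      # standardized search-volume index
log_flow = (0.8 * log_pop_o + 0.6 * log_pop_d - 1.2 * log_dist
            + 0.3 * gt_index + rng.normal(0, 0.5, n))

def fit_rmse(X, y):
    """Fit OLS with an intercept and return in-sample RMSE."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(np.sqrt(np.mean(resid ** 2)))

base = fit_rmse(np.column_stack([log_pop_o, log_pop_d, log_dist]), log_flow)
with_gt = fit_rmse(
    np.column_stack([log_pop_o, log_pop_d, log_dist, gt_index]), log_flow)
print(f"RMSE without GT: {base:.3f}")
print(f"RMSE with GT:    {with_gt:.3f}")
```

In this toy setup the GT term carries genuine signal, so adding it shrinks the error; the paper's point is that in richer specifications (flow fixed effects, auto-regressive terms) the incremental value of GT can vanish, which an out-of-sample version of this comparison would expose.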

AI and Big Data: Disruptive Regulation


Book by Mark Findlay, Josephine Seah, and Willow Wong: “This provocative and timely book identifies and disrupts the conventional regulation and governance discourses concerning AI and big data. It suggests that, instead of being used as tools for exclusionist commercial markets, AI and big data can be employed in governing digital transformation for social good.

Analysing the ways in which global technology companies have colonised data access, the book reveals how trust, ethics, and digital self-determination can be reconsidered and engaged to promote the interests of marginalised stakeholders in data arrangement. Chapters examine the regulation of labour engagement in digital economies, the landscape of AI ethics, and a multitude of questions regarding participation, costs, and sustainability. Presenting several informative case studies, the book challenges some of the accepted qualifiers of frontier tech and data use and proposes innovative ways of actioning the more conventional regulatory components of big data.

Scholars and students in information and media law, regulation and governance, and law and politics will find this book to be critical reading. It will also be of interest to policymakers and the AI and data science community…(More)”.

We, the Data


Book by Wendy H. Wong: “Our data-intensive world is here to stay, but does that come at the cost of our humanity in terms of autonomy, community, dignity, and equality? In We, the Data, Wendy H. Wong argues that we cannot allow that to happen. Exploring the pervasiveness of data collection and tracking, Wong reminds us that we are all stakeholders in this digital world, who are currently being left out of the most pressing conversations around technology, ethics, and policy. This book clarifies the nature of datafication and calls for an extension of human rights to recognize how data complicate what it means to safeguard and encourage human potential.

As we go about our lives, we are co-creating data through what we do. We must embrace that these data are a part of who we are, Wong explains, even as current policies do not yet reflect the extent to which human experiences have changed. This means we are more than mere “subjects” or “sources” of data “by-products” that can be harvested and used by technology companies and governments. By exploring data rights, facial recognition technology, our posthumous rights, and our need for a right to data literacy, Wong has crafted a compelling case for engaging as stakeholders to hold data collectors accountable. Just as the Universal Declaration of Human Rights laid the global groundwork for human rights, We, the Data gives us a foundation upon which we claim human rights in the age of data…(More)”.

Demographic Parity: Mitigating Biases in Real-World Data


Paper by Orestis Loukas and Ho-Ryun Chung: “Computer-based decision systems are widely used to automate decisions in many aspects of everyday life, which include sensitive areas like hiring, lending and even criminal sentencing. A decision pipeline heavily relies on large volumes of historical real-world data for training its models. However, historical training data often contains gender, racial or other biases which are propagated to the trained models influencing computer-based decisions. In this work, we propose a robust methodology that guarantees the removal of unwanted biases while maximally preserving classification utility. Our approach can always achieve this in a model-independent way by deriving from real-world data the asymptotic dataset that uniquely encodes demographic parity and realism. As a proof-of-principle, we deduce from public census records such an asymptotic dataset from which synthetic samples can be generated to train well-established classifiers. Benchmarking the generalization capability of these classifiers trained on our synthetic data, we confirm the absence of any explicit or implicit bias in the computer-aided decision…(More)”.
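The demographic-parity criterion the paper builds on is easy to state concretely: a decision rule satisfies it when positive-decision rates are (near) equal across groups defined by a protected attribute. The sketch below only illustrates that criterion on synthetic data; it does not reproduce the authors' asymptotic-dataset construction, and all names and numbers are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic population: a binary protected attribute and a score whose
# distribution is shifted for one group (an encoded historical bias).
n = 10_000
group = rng.integers(0, 2, n)                 # protected attribute, 0 or 1
score = rng.normal(0, 1, n) + 0.5 * group     # biased scores favor group 1
biased_decision = (score > 0).astype(int)

def parity_gap(decision, group):
    """Absolute difference in positive-decision rates between groups.
    Demographic parity holds when this gap is (near) zero."""
    return abs(decision[group == 1].mean() - decision[group == 0].mean())

# A debiased rule: here we can simply subtract the known group shift;
# the paper instead removes bias at the dataset level, model-independently.
fair_decision = ((score - 0.5 * group) > 0).astype(int)

print(f"biased gap: {parity_gap(biased_decision, group):.3f}")  # far from 0
print(f"fair gap:   {parity_gap(fair_decision, group):.3f}")    # near 0
```

In real data the bias is not a known additive shift, which is exactly why the paper derives a debiased training distribution from the data itself rather than patching individual decision rules.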

Five types of urban digital twins


Blog by Darrel Ronald: “The definition for urban digital twins is too vague — so it is important to create a clearer picture of the types of urban digital twins that are available. Not all digital twins are the same and each one comes with features and capabilities, strengths and weaknesses, as well as appropriate and inappropriate use cases….

Urban Twin taxonomy, Source: Darrel Ronald, Spatiomatics

As shown in my proposed Urban Digital Twin Taxonomy above, I propose that we classify these products first based on their Main Functionality (the Use Case), then based on their Technology Platform. I highlight some of the main products within the different categories and their product scope. Next, I detail the different types of twins and offer some brief strengths and weaknesses for each type. This taxonomy could apply to other industries such as architecture or manufacturing, but it is specifically applied to cities and urban development projects.

The main functionalities can be grouped by:

  • Modelling Twin
  • Computational Twin
  • Scenario Twin
  • Operational Twin
  • Experiential Twin

The technology platforms can be grouped by:

  • Computer Aided Design (CAD)
  • Web GIS
  • Geographic Information System (GIS)
  • Gaming…(More)”.

From Happiness Data to Economic Conclusions


Paper by Daniel J. Benjamin, Kristen Cooper, Ori Heffetz & Miles S. Kimball: “Happiness data—survey respondents’ self-reported well-being (SWB)—have become increasingly common in economics research, with recent calls to use them in policymaking. Researchers have used SWB data in novel ways, for example to learn about welfare or preferences when choice data are unavailable or difficult to interpret. Focusing on leading examples of this pioneering research, the first part of this review uses a simple theoretical framework to reverse-engineer some of the crucial assumptions that underlie existing applications. The second part discusses evidence bearing on these assumptions and provides practical advice to the agencies and institutions that generate SWB data, the researchers who use them, and the policymakers who may use the resulting research. While we advocate creative uses of SWB data in economics, we caution that their use in policy will likely require both additional data collection and further research to better understand the data…(More)”.

Evidence 2.0: The Next Era of Evidence-Based Policymaking


Interview with Nick Hart & Jason Saul: “One of the great—if largely unsung—bipartisan congressional acts of recent history was the passage in 2018 of the Foundations for Evidence-Based Policymaking Act. In essence, the “Evidence Act” codified the goal of using solid, consistent evidence as the basis for funding decisions on trillions of dollars of public money. Agencies use this data to decide on the most effective and most promising solutions for a vast array of issues, from early-childhood education to environmental protection.

Five years later, while most federal agencies have created fairly robust evidence bases, unlocking that evidence for practical use by decision makers remains challenging. One might argue that if Evidence 1.0 was focused on the production of evidence, then the next five years—let’s call it Evidence 2.0—will be focused on the effective use of that evidence. Now that evidence is readily available to policymakers, the question is, how can that data be standardized, aggregated, derived, applied, and used for predictive decision-making?…(More)”.