Synthetic Data and Social Science Research


Paper by Jordan C. Stanley & Evan S. Totty: “Synthetic microdata – data retaining the structure of original microdata while replacing original values with modeled values for the sake of privacy – presents an opportunity to increase access to useful microdata for data users while meeting the privacy and confidentiality requirements for data providers. Synthetic data could be sufficient for many purposes, but lingering accuracy concerns could be addressed with a validation system through which the data providers run the external researcher’s code on the internal data and share cleared output with the researcher. The U.S. Census Bureau has experience running such systems. In this chapter, we first describe the role of synthetic data within a tiered data access system and the importance of synthetic data accuracy in achieving a viable synthetic data product. Next, we review results from a recent set of empirical analyses we conducted to assess accuracy in the Survey of Income & Program Participation (SIPP) Synthetic Beta (SSB), a Census Bureau product that made linked survey-administrative data publicly available. Given this analysis and our experience working on the SSB project, we conclude with thoughts and questions regarding future implementations of synthetic data with validation…(More)”
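
The validation idea in the abstract above (run the same analysis on the synthetic file and on the internal file, then compare) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the made-up income data, the deliberately simple normal model in the hypothetical `synthesize` function, and the hypothetical `validate` helper. It is not the method used for the SSB, which the abstract does not describe at this level of detail.

```python
import random
import statistics


def synthesize(values, rng):
    """Replace every original value with a draw from a normal model
    fitted to the originals (a deliberately crude, assumed synthesizer)."""
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in values]


def validate(analysis, synthetic, confidential):
    """Run one analysis on both files and return
    (synthetic result, internal result, absolute gap)."""
    synth_result = analysis(synthetic)
    internal_result = analysis(confidential)
    return synth_result, internal_result, abs(synth_result - internal_result)


rng = random.Random(0)

# Made-up, right-skewed "confidential" incomes; not real data.
confidential = [rng.lognormvariate(10, 0.5) for _ in range(5000)]
synthetic = synthesize(confidential, rng)

# The normal model preserves the mean but not the skew,
# so the median is reproduced less accurately than the mean.
mean_gap = validate(statistics.fmean, synthetic, confidential)
median_gap = validate(statistics.median, synthetic, confidential)

print("mean:   synthetic=%.0f internal=%.0f gap=%.0f" % mean_gap)
print("median: synthetic=%.0f internal=%.0f gap=%.0f" % median_gap)
```

In this toy setup the synthetic file is adequate for one statistic and misleading for another, which is the kind of accuracy concern a validation step is meant to surface.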

Artificial Intelligence as a Catalyzer for Open Government Data Ecosystems: A Typological Theory Approach


Paper by Anthony Simonofski et al: “Artificial Intelligence (AI) within digital government has witnessed growing interest as it can improve governance processes and stimulate citizen engagement. Despite the rise of Generative AI, discussions on AI fusion with Open Government Data (OGD) remain limited to specific implementations and scattered across disciplines. Drawing from the synthesis of the literature through a systematic review, this study examines and structures how AI can enrich OGD initiatives. Employing a typological approach, ideal profiles of AI application within the OGD lifecycle are formalized, capturing varied roles across the portal and ecosystems perspectives. The resulting conceptual framework identifies eight ideal types of AI applications for OGD: AI as Portal Curator, Explorer, Linker, and Monitor, and AI as Ecosystem Data Retriever, Connecter, Value Developer and Engager. This theoretical foundation shows the under-investigation of some types and will inform policymakers, practitioners, and researchers in leveraging AI to cultivate OGD ecosystems…(More)”.

Second-Order Agency


Paper by Cass Sunstein: “Many people prize agency; they want to make their own choices. Many people also prize second-order agency, by which they decide whether and when to exercise first-order agency. First-order agency can be an extraordinary benefit or an immense burden. When it is an extraordinary benefit, people might reject any kind of interference, or might welcome a nudge, or might seek some kind of boost, designed to increase their capacities. When first-order agency is an immense burden, people might also welcome a nudge or might make some kind of delegation (say, to an employer, a doctor, an algorithm, or a regulator). These points suggest that the line between active choosing and paternalism can be illusory. When private or public institutions override people’s desire not to exercise first-order agency, and thus reject people’s exercise of second-order agency, they are behaving paternalistically, through a form of choice-requiring paternalism. Choice-requiring paternalism may compromise second-order agency. It might not be very nice to do that…(More)”.

Data Privacy for Record Linkage and Beyond


Paper by Shurong Lin & Eric Kolaczyk: “In a data-driven world, two prominent research problems are record linkage and data privacy, among others. Record linkage is essential for improving decision-making by integrating information of the same entities from different sources. On the other hand, data privacy research seeks to balance the need to extract accurate insights from data with the imperative to protect the privacy of the entities involved. Inevitably, data privacy issues arise in the context of record linkage. This article identifies two complementary aspects at the intersection of these two fields: (1) how to ensure privacy during record linkage and (2) how to mitigate privacy risks when releasing the analysis results after record linkage. We specifically discuss privacy-preserving record linkage, differentially private regression, and related topics…(More)”.

Hopes over fears: Can democratic deliberation increase positive emotions concerning the future?


Paper by Mikko Leino and Katariina Kulha: “Deliberative mini-publics have often been considered to be a potential way to promote future-oriented thinking. Still, thinking about the future can be hard as it can evoke negative emotions such as stress and anxiety. This article establishes why a more positive outlook towards the future can benefit long-term decision-making. Then, it explores whether and to what extent deliberative mini-publics can facilitate thinking about the future by moderating negative emotions and encouraging positive emotions. We analyzed an online mini-public held in the region of Satakunta, Finland, organized to involve the public in the drafting process of a regional plan extending until the year 2050. In addition to the standard practices related to mini-publics, the Citizens’ Assembly included an imaginary time travel exercise, Future Design, carried out with half of the participants. Our analysis makes use of both survey and qualitative data. We found that democratic deliberation can promote positive emotions, like hopefulness and compassion, and lessen negative emotions, such as fear and confusion, related to the future. There were, however, differences in how emotions developed in the various small groups. Interviews with participants shed further light onto how participants felt during the event and how their sentiments concerning the future changed…(More)”

Utilizing big data without domain knowledge impacts public health decision-making


Paper by Miao Zhang, Salman Rahman, Vishwali Mhasawade and Rumi Chunara: “…New data sources and AI methods for extracting information are increasingly abundant and relevant to decision-making across societal applications. A notable example is street view imagery, available in over 100 countries, and purported to inform built environment interventions (e.g., adding sidewalks) for community health outcomes. However, biases can arise when decision-making does not account for data robustness or relies on spurious correlations. To investigate this risk, we analyzed 2.02 million Google Street View (GSV) images alongside health, demographic, and socioeconomic data from New York City. Findings demonstrate robustness challenges; built environment characteristics inferred from GSV labels at the intracity level often do not align with ground truth. Moreover, as average individual-level behavior of physical inactivity significantly mediates the impact of built environment features by census tract, intervention on features measured by GSV would be misestimated without proper model specification and consideration of this mediation mechanism. Using a causal framework accounting for these mediators, we determined that intervening by improving 10% of samples in the two lowest tertiles of physical inactivity would lead to a 4.17 (95% CI 3.84–4.55) or 17.2 (95% CI 14.4–21.3) times greater decrease in the prevalence of obesity or diabetes, respectively, compared to the same proportional intervention on the number of crosswalks by census tract. This study highlights critical issues of robustness and model specification in using emergent data sources, showing the data may not measure what is intended, and ignoring mediators can result in biased intervention effect estimates…(More)”

Augmenting the availability of historical GDP per capita estimates through machine learning


Paper by Philipp Koch, Viktor Stojkoski, and César A. Hidalgo: “Can we use data on the biographies of historical figures to estimate the GDP per capita of countries and regions? Here, we introduce a machine learning method to estimate the GDP per capita of dozens of countries and hundreds of regions in Europe and North America for the past seven centuries starting from data on the places of birth, death, and occupations of hundreds of thousands of historical figures. We build an elastic net regression model to perform feature selection and generate out-of-sample estimates that explain 90% of the variance in known historical income levels. We use this model to generate GDP per capita estimates for countries, regions, and time periods for which these data are not available and externally validate our estimates by comparing them with four proxies of economic output: urbanization rates in the past 500 y, body height in the 18th century, well-being in 1850, and church building activity in the 14th and 15th century. Additionally, we show our estimates reproduce the well-known reversal of fortune between southwestern and northwestern Europe between 1300 and 1800 and find this is largely driven by countries and regions engaged in Atlantic trade. These findings validate the use of fine-grained biographical data as a method to augment historical GDP per capita estimates. We publish our estimates with CI together with all collected source data in a comprehensive dataset…(More)”.
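
For readers unfamiliar with the elastic net regression mentioned in the abstract above, its standard textbook objective is sketched below. The notation is an assumption: $y$ is the vector of known income levels, $X$ the matrix of biographical features, $n$ the number of observations, and $\lambda \ge 0$ and $\alpha \in [0, 1]$ are tuning parameters. The abstract does not state which parametrization the authors use.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  \;+\; \lambda \left( \alpha\,\lVert \beta \rVert_1
  \;+\; \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right)
```

The $\ell_1$ term can set coefficients exactly to zero, which is how the model performs feature selection, while the $\ell_2$ term stabilizes the estimates when features are correlated.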

Place identity: a generative AI’s perspective


Paper by Kee Moon Jang et al: “Do cities have a collective identity? The latest advancements in generative artificial intelligence (AI) models have enabled the creation of realistic representations learned from vast amounts of data. In this study, we test the potential of generative AI as the source of textual and visual information in capturing the place identity of cities assessed by filtered descriptions and images. We asked questions on the place identity of 64 global cities to two generative AI models, ChatGPT and DALL·E2. Furthermore, given the ethical concerns surrounding the trustworthiness of generative AI, we examined whether the results were consistent with real urban settings. In particular, we measured similarity between text and image outputs with Wikipedia data and images searched from Google, respectively, and compared across cases to identify how unique the generated outputs were for each city. Our results indicate that generative models have the potential to capture the salient characteristics of cities that make them distinguishable. This study is among the first attempts to explore the capabilities of generative AI in simulating the built environment in regard to place-specific meanings. It contributes to urban design and geography literature by fostering research opportunities with generative AI and discussing potential limitations for future studies…(More)”.

The future of agricultural data-sharing policy in Europe: stakeholder insights on the EU Code of Conduct


Paper by Mark Ryan, Can Atik, Kelly Rijswijk, Marc-Jeroen Bogaardt, Eva Maes & Ella Deroo: “In 2018, the EU Code of Conduct of Agricultural Data Sharing by Contractual Agreement (EUCC) was published. This voluntary initiative is considered a basis for rights and responsibilities for data sharing in the agri-food sector, with a specific farmer orientation. While the involved industry associations agreed on its content, there are limited insights into how and to what extent the EUCC has been received and implemented within the sector. In 2024, the Data Act was introduced, a horizontal legal framework that aims to enforce specific legal requirements for data sharing across sectors. Yet, it remains to be seen if it will be the ultimate solution for the agricultural sector, as some significant agricultural data access issues remain. It is thus essential to determine if the EUCC may still play a significant role to address sector-specific issues in line with the horizontal rules of the Data Act. During six workshops across Europe with 89 stakeholders, we identified how the EUCC has been (1) received by stakeholders, (2) implemented, and (3) its future use (particularly in response to the Data Act). Based on the workshop results and continued engagements with researchers and stakeholders, we conclude that the EUCC is still an important document for the agricultural sector but should be updated in response to the content of the Data Act. Hence we propose the following improvements to the EUCC: 1. Provide clear, practical examples for applying the EUCC combined with the Data Act; 2. Generate model contractual terms based on the EUCC provisions; 3. Clarify GDPR-centric concepts like anonymisation and pseudonymisation in the agricultural data-sharing setting; 4. Develop a functional enforcement and implementation framework; and 5. Play a role in increasing interoperability and trust among stakeholders…(More)”

It’s time we put agency into Behavioural Public Policy


Article by Sanchayan Banerjee et al: “Promoting agency – people’s ability to form intentions and to act on them freely – must become a primary objective for Behavioural Public Policy (BPP). Contemporary BPPs do not directly pursue this objective, which is problematic for many reasons. From an ethical perspective, goals like personal autonomy and individual freedom cannot be realised without nurturing citizens’ agency. From an efficacy standpoint, BPPs that override agency – for example, by activating automatic psychological processes – leave citizens ‘in the dark’, incapable of internalising and owning the process of behaviour change. This may contribute to non-persistent treatment effects, compensatory negative spillovers or psychological reactance and backfiring effects. In this paper, we argue agency-enhancing BPPs can alleviate these ethical and efficacy limitations to longer-lasting and meaningful behaviour change. We set out philosophical arguments to help us understand and conceptualise agency. Then, we review three alternative agency-enhancing behavioural frameworks: (1) boosts to enhance people’s competences to make better decisions; (2) debiasing to encourage people to reduce the tendency for automatic, impulsive responses; and (3) nudge+ to enable citizens to think alongside nudges and evaluate them transparently. Using a multi-dimensional framework, we highlight differences in their workings, which offer comparative insights and complementarities in their use. We discuss limitations of agency-enhancing BPPs and map out future research directions…(More)”.