Innovating with Non-Traditional Data: Recent Use Cases for Unlocking Public Value


Article by Stefaan Verhulst and Adam Zable, who define Non-Traditional Data (NTD) as "data that is digitally captured (e.g. mobile phone records), mediated (e.g. social media), or observed (e.g. satellite imagery), using new instrumentation mechanisms, often privately held."

Digitalization and the resulting datafication have introduced a new category of data that, when re-used responsibly, can complement traditional data in addressing public interest questions—from public health to environmental conservation. Unlocking these often privately held datasets through data collaboratives is a key focus of what we have called The Third Wave of Open Data.

To help bridge this gap, we have curated below recent examples of the use of NTD for research and decision-making, published over the past few months. They are organized into five categories:

  • Health and Well-being;
  • Humanitarian Aid;
  • Environment and Climate;
  • Urban Systems and Mobility; and
  • Economic and Labor Dynamics…(More)”.

It Was the Best of Times, It Was the Worst of Times: The Dual Realities of Data Access in the Age of Generative AI


Article by Stefaan Verhulst: “It was the best of times, it was the worst of times… It was the spring of hope, it was the winter of despair.” –Charles Dickens, A Tale of Two Cities

Charles Dickens’s famous line captures the contradictions of the present moment in the world of data. On the one hand, data has become central to addressing humanity’s most pressing challenges — climate change, healthcare, economic development, public policy, and scientific discovery. On the other hand, despite the unprecedented quantity of data being generated, significant obstacles remain to accessing and reusing it. As our digital ecosystems evolve, including the rapid advances in artificial intelligence, we find ourselves both on the verge of a golden era of open data and at risk of slipping deeper into a restrictive “data winter.”

A Tale of Two Cities by Charles Dickens (1902)

These two realities are concurrent. The article examines the challenges posed by growing restrictions on data reuse, alongside the countervailing potential brought by advancements in privacy-enhancing technologies (PETs), synthetic data, and data commons approaches. It argues that while current trends toward closed data ecosystems threaten innovation, new technologies and frameworks could lead to a "Fourth Wave of Open Data," potentially ushering in a new era of data accessibility and collaboration…(More)" (First published in the Industry Data for Society Partnership's (IDSP) 2024 Year in Review).
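The synthetic-data approach mentioned above can be illustrated with a minimal sketch: fit a simple statistical model to a "real" dataset, then sample new records that preserve aggregate patterns without reproducing any individual row. Everything below—the columns, values, and the Gaussian model—is a hypothetical illustration, not a description of any specific PET product.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "real" dataset: 500 records with two correlated
# numeric attributes (say, age and some health score).
real = rng.multivariate_normal(mean=[40.0, 60.0],
                               cov=[[25.0, 12.0], [12.0, 36.0]],
                               size=500)

# Fit a simple Gaussian model to the real data's aggregate statistics...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...then sample synthetic records that mimic those statistics
# without copying any individual row.
synthetic = rng.multivariate_normal(mean, cov, size=500)
```

Real synthetic-data pipelines use far richer generative models and add formal privacy guarantees (e.g. differential privacy); the point here is only the two-step shape: fit aggregate structure, then sample.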

Space, Satellites, and Democracy: Implications of the New Space Age for Democratic Processes and Recommendations for Action


NDI Report: “The dawn of a new space age is upon us, marked by unprecedented engagement from both state and private actors. Driven by technological innovations such as reusable rockets and miniaturized satellites, this era presents a double-edged sword for global democracy. On one side, democratized access to space offers powerful tools for enhancing civic processes. Satellite technology now enables real-time election monitoring, improved communication in remote areas, and more effective public infrastructure planning. It also equips democratic actors with means to document human rights abuses and circumvent authoritarian internet restrictions.

However, the accessibility of these technologies also raises significant concerns. The potential for privacy infringements and misuse by authoritarian regimes or malicious actors casts a shadow over these advancements.

This report discusses the opportunities and risks that space and satellite technologies pose to democracy, human rights, and civic processes globally. It examines the current regulatory and normative frameworks governing space activities and highlights key considerations for stakeholders navigating this increasingly competitive domain.

It is essential that the global democracy community be familiar with emerging trends in space and satellite technology and their implications for the future. Failure to do so will leave the community unprepared to harness the opportunities or address the challenges that space capabilities present. It would also cede influence over the development of global norms and standards in this arena to states and private sector interests alone and, in turn, ensure those standards are not rooted in democratic norms and human rights, but rather in principles such as state sovereignty and profit maximization…(More)”.

The AI revolution is running out of data. What can researchers do?


Article by Nicola Jones: “The Internet is a vast ocean of human knowledge, but it isn’t infinite. And artificial intelligence (AI) researchers have nearly sucked it dry.

The past decade of explosive improvement in AI has been driven in large part by making neural networks bigger and training them on ever-more data. This scaling has proved surprisingly effective at making large language models (LLMs) — such as those that power the chatbot ChatGPT — more capable both of replicating conversational language and of developing emergent properties such as reasoning. But some specialists say that we are now approaching the limits of scaling. That’s in part because of the ballooning energy requirements for computing. But it’s also because LLM developers are running out of the conventional data sets used to train their models.

A prominent study made headlines this year by putting a number on this problem: researchers at Epoch AI, a virtual research institute, projected that, by around 2028, the typical size of data set used to train an AI model will reach the same size as the total estimated stock of public online text. In other words, AI is likely to run out of training data in about four years’ time (see ‘Running out of data’). At the same time, data owners — such as newspaper publishers — are starting to crack down on how their content can be used, tightening access even more. That’s causing a crisis in the size of the ‘data commons’, says Shayne Longpre, an AI researcher at the Massachusetts Institute of Technology in Cambridge who leads the Data Provenance Initiative, a grass-roots organization that conducts audits of AI data sets.
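The Epoch AI projection is, at its core, a simple compound-growth calculation: if training sets grow by a roughly constant multiplier each year while the stock of public text stays roughly fixed, the crossover year falls out of a logarithm. The figures below are hypothetical placeholders chosen only to land in the article's ballpark; they are not Epoch AI's actual estimates.

```python
import math

# Hypothetical placeholder figures (in tokens), not Epoch AI's estimates.
stock_tokens = 3e14   # assumed total stock of public online text
dataset_2024 = 2e13   # assumed typical training-set size in 2024
growth_rate = 2.0     # assumed yearly multiplier of training-set size

# Solve dataset_2024 * growth_rate**t = stock_tokens for t (years):
years = math.log(stock_tokens / dataset_2024, growth_rate)
crossover = 2024 + math.ceil(years)
print(crossover)  # -> 2028 under these assumptions
```

With these made-up inputs the crossover lands in 2028, matching the article's ballpark; changing any assumption shifts the year, which is why such projections carry wide uncertainty.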

The imminent bottleneck in training data could be starting to pinch. “I strongly suspect that’s already happening,” says Longpre…(More)”

Running out of data: Chart showing projections of the amount of text data used to train large language models and the amount of available text on the Internet, suggesting that by 2028, developers will be using data sets that match the total amount of text that is available.

My Voice, Your Voice, Our Voice: Attitudes Towards Collective Governance of a Choral AI Dataset


Paper by Jennifer Ding, Eva Jäger, Victoria Ivanova, and Mercedes Bunz: “Data grows in value when joined and combined; likewise the power of voice grows in ensemble. With 15 UK choirs, we explore opportunities for bottom-up data governance of a jointly created Choral AI Dataset. Guided by a survey of chorister attitudes towards generative AI models trained using their data, we explore opportunities to create empowering governance structures that go beyond opt-in and opt-out. We test the development of novel mechanisms such as a Trusted Data Intermediary (TDI) to enable governance of the dataset amongst the choirs and AI developers. We hope our findings can contribute to growing efforts to advance collective data governance practices and shape a more creative, empowering future for arts communities in the generative AI ecosystem…(More)”.

Towards Civic Digital Twins: Co-Design the Citizen-Centric Future of Bologna


Paper by Massimiliano Luca et al: “We introduce Civic Digital Twin (CDT), an evolution of Urban Digital Twins designed to support a citizen-centric transformative approach to urban planning and governance. CDT is being developed in the scope of the Bologna Digital Twin initiative, launched one year ago by the city of Bologna, to fulfill the city’s political and strategic goal of adopting innovative digital tools to support decision-making and civic engagement. The CDT, in addition to its capability of sensing the city through spatial, temporal, and social data, must be able to model and simulate social dynamics in a city: the behaviors, attitudes, and preferences of citizens and collectives, and how they impact city life and urban transformation processes. Another distinctive feature of CDT is that it must be able to engage citizens (individuals, collectives, and organized civil society) and other civic stakeholders (utilities, economic actors, third sector) interested in co-designing the future of the city. In this paper, we discuss the motivations that led to the definition of the CDT, define its modeling aspects and key research challenges, and illustrate its intended use with two use cases in urban mobility and urban development…(More)”.

Can the world’s most successful index get back up the rankings?


Article by James Watson: “You know your ranking model is influential when national governments change policies with the explicit goal of boosting their position on your index. That was the power of the Ease of Doing Business Index (also known as Doing Business) until 2021.

However, the index’s success became its downfall. Some governments set up dedicated teams with an explicit goal of improving the country’s performance on the index. If those teams’ activity was solely focussed on positive policy reform, that would be great; unfortunately, in at least some cases, they were simply trying to game the results.

World Bank’s Business Ready Index

Index ranking optimisation (aka gaming the results)

To give an example of how that could happen, we need to take a brief detour into the world of qualitative indicators. Bear with me. In many indexes grappling with complex topics, there is a perennial problem of data availability. Imagine you want to measure the number of days it takes to set up a new business (this was one of the indicators in Doing Business). You will find that most of the time the data either doesn’t exist or is rarely updated by governments. Instead, put very simplistically, you’d need to ask a few experts or businesses for their views, and use those to create a numerical score for your index.

This is a valid approach, and it’s used in a lot of studies. Take Transparency International’s long-running Corruption Perceptions Index (CPI). Transparency International goes to great lengths to use robust and comparable data across countries, but measuring actual corruption is not viable — for obvious reasons. So the CPI does something different, and the clue is in the name: it measures people’s perceptions of corruption. It asks local businesses and experts whether they think there’s much bribery, nepotism and other forms of corruption in their country. This foundational input is then bolstered with other data points. The data doesn’t aim to measure corruption; instead, it’s about assessing which countries are more, or less, corrupt. 

Transparency International’s Corruption Perceptions Index (CPI)
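The rescaling-and-averaging step described above can be sketched in a few lines: map each source's native scale onto a common 0-100 range, then average per country. This is a simplified illustration of the general technique only; the CPI's actual methodology standardizes sources far more carefully, and every score below is invented.

```python
def rescale(scores, lo, hi):
    """Map raw scores from their native [lo, hi] scale to 0-100."""
    return [100 * (s - lo) / (hi - lo) for s in scores]

# Invented raw scores for three countries from two sources
# that use different native scales.
source_a = rescale([3.2, 7.8, 5.1], lo=0, hi=10)   # expert survey, 0-10
source_b = rescale([41, 88, 60], lo=0, hi=100)     # business poll, 0-100

# Per-country index: the mean of the rescaled sources.
index = [round((a + b) / 2, 1) for a, b in zip(source_a, source_b)]
print(index)  # -> [36.5, 83.0, 55.5]
```

The fragility the article describes comes from exactly this pipeline: if a government can influence which experts are polled or how a source is rescaled, it can nudge its country's final number without any real-world reform.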

This technique can work well, but it got a bit shaky as Doing Business’s fame grew. Some governments that were anxious to move up the rankings started urging the World Bank to tweak the methodology used to assess their ratings, or to use the views of specific experts. The analysts responsible for assessing a country’s scores and data points were put under significant pressure, often facing strong criticism from governments that didn’t agree with their assessments. In the end, an internal review showed that a number of countries’ scores had been improperly manipulated… The criticism must have stung, because the team behind the World Bank’s new Business Ready report has spent three years trying to address those issues. The new methodology handbook lands with a thump at 704 pages…(More)”.

Synthetic content and its implications for AI policy: a primer


UNESCO Paper: “The deployment of advanced Artificial Intelligence (AI) models, particularly generative AI, has sparked discussions regarding the creation and use of synthetic content – i.e. AI-generated or modified outputs, including text, images, sounds, and combinations thereof – and its impact on individuals, societies, and economies. This note explores the different ways in which synthetic content can be generated and used and proposes a taxonomy that encompasses synthetic media and deepfakes, among others. The taxonomy aims to systematize key characteristics, enhancing understanding and informing policy discussions. Key findings highlight both the potential benefits and concerns associated with synthetic content in fields like data analytics, environmental sustainability, education, creativity, and mis/disinformation, and point to the need to frame them ethically, in line with the principles and values of the UNESCO Recommendation on the Ethics of Artificial Intelligence. Finally, the note brings to the fore critical questions that policymakers and experts alike need to address to ensure that the development of AI technologies aligns with human rights, human dignity, and fundamental freedoms…(More)”.

Synthetic Data, Synthetic Media, and Surveillance


Paper by Aaron Martin and Bryce Newell: “Public and scholarly interest in the related concepts of synthetic data and synthetic media has exploded in recent years. From issues raised by the generation of synthetic datasets to train machine learning models to the public-facing, consumer availability of artificial intelligence (AI) powered image manipulation and creation apps and the associated increase in synthetic (or “deepfake”) media, these technologies have shifted from being niche curiosities of the computer science community to become topics of significant public, corporate, and regulatory import. They are emblematic of a “data-generation revolution” (Gal and Lynskey 2024: 1091) that is already raising pressing questions for the academic surveillance studies community. Within surveillance studies scholarship, Fussey (2022: 348) has argued that synthetic media is one of several “issues of urgent societal and planetary concern” and that it has “arguably never been more important” for surveillance studies “researchers to understand these dynamics and complex processes, evidence their implications, and translate esoteric knowledge to produce meaningful analysis.” Yet, while fields adjacent to surveillance studies have begun to explore the ethical risks of synthetic data, we currently perceive a lack of attention to the surveillance implications of synthetic data and synthetic media in published literature within our field. In response, this Dialogue is designed to help promote thinking and discussion about the links and disconnections between synthetic data, synthetic media, and surveillance…(More)”

Social licence for health data


Evidence Brief by NSW Government: “Social licence, otherwise referred to as social licence to operate, refers to approval or consensus from members of society or the community for users—whether public or private enterprises or individuals—to use their health data as desired or as accepted under certain conditions. Social licence is a dynamic and fluid concept and is subject to change over time, often influenced by societal and contextual factors.
Social licence is usually indicated through ongoing engagement and negotiation with the public; it is not a contract with strict terms and conditions. It is, rather, a moral and ethical responsibility assumed by data users based on trust and legitimacy. It supplements techno-legal mechanisms that regulate the use of data.
For example, through public engagement, certain values and principles can emerge as pertinent to public support for using their data. Similarly, the public may view certain activities relating to the use of their data as acceptable and beneficial, implying their permission for certain activities or use-case scenarios. Internationally, although not always explicitly referred to as a social licence, the most common approach to establishing public trust and support, and to identifying common ground or agreement on acceptable practices for the use of data, is through public engagement. Engagement methods and mechanisms for gaining public perspectives vary across countries (Table 1).
− Canada – Health Data Research Network Canada reports on social licence for uses of health data, based on deliberative discussions with 20 experienced public and patient advisors. The output is a list of agreements and disagreements on what uses and users of health data have social licence.
− New Zealand – In 2022, the Ministry of Health commissioned a survey on public perceptions on use of personal health information. This report identified conditions under which the public supports the re-use of their data…(More)”.