The AI revolution is running out of data. What can researchers do?


Article by Nicola Jones: “The Internet is a vast ocean of human knowledge, but it isn’t infinite. And artificial intelligence (AI) researchers have nearly sucked it dry.

The past decade of explosive improvement in AI has been driven in large part by making neural networks bigger and training them on ever-more data. This scaling has proved surprisingly effective at making large language models (LLMs) — such as those that power the chatbot ChatGPT — more capable of both replicating conversational language and developing emergent properties such as reasoning. But some specialists say that we are now approaching the limits of scaling. That’s in part because of the ballooning energy requirements of computing. But it’s also because LLM developers are running out of the conventional data sets used to train their models.

A prominent study made headlines this year by putting a number on this problem: researchers at Epoch AI, a virtual research institute, projected that, by around 2028, the typical data set used to train an AI model will match the total estimated stock of public online text. In other words, AI is likely to run out of training data in about four years’ time (see ‘Running out of data’). At the same time, data owners — such as newspaper publishers — are starting to crack down on how their content can be used, tightening access even more. That’s causing a crisis in the size of the ‘data commons’, says Shayne Longpre, an AI researcher at the Massachusetts Institute of Technology in Cambridge who leads the Data Provenance Initiative, a grass-roots organization that conducts audits of AI data sets.

The imminent bottleneck in training data could be starting to pinch. “I strongly suspect that’s already happening,” says Longpre…(More)”

Running out of data: Chart showing projections of the amount of text data used to train large language models and the amount of available text on the Internet, suggesting that by 2028, developers will be using data sets that match the total amount of text that is available.
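The crossover the chart describes can be sketched as a toy calculation: training-set size grows exponentially while the stock of public text stays roughly fixed, so the two curves meet. All numbers below (stock of text, 2024 training-set size, growth factor) are illustrative assumptions, not Epoch AI’s actual estimates.

```python
# Toy version of an Epoch AI-style projection: an exponentially growing
# training-set size crosses a roughly fixed stock of public online text.
# Every constant here is an assumption chosen for illustration only.

STOCK_TOKENS = 3e14      # assumed total stock of public online text (tokens)
TRAIN_2024 = 1.5e13      # assumed typical training-set size in 2024 (tokens)
GROWTH_PER_YEAR = 2.2    # assumed annual growth factor of training sets

year, tokens = 2024, TRAIN_2024
while tokens < STOCK_TOKENS:
    year += 1
    tokens *= GROWTH_PER_YEAR

print(year)  # crossover year under these assumptions → 2028
```

Under these invented parameters the curves cross in 2028, matching the shape of the projection; the real study’s conclusion of course rests on its own measured growth rates and stock estimates.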

Can the world’s most successful index get back up the rankings?


Article by James Watson: “You know your ranking model is influential when national governments change policies with the explicit goal of boosting their position on your index. That was the power of the Ease of Doing Business Index (also known as Doing Business) until 2021.

However, the index’s success became its downfall. Some governments set up dedicated teams with an explicit goal of improving the country’s performance on the index. If those teams’ activity was solely focussed on positive policy reform, that would be great; unfortunately, in at least some cases, they were simply trying to game the results.

World Bank’s Business Ready Index

Index ranking optimisation (aka gaming the results)

To give an example of how that could happen, we need to take a brief detour into the world of qualitative indicators. Bear with me. In many indexes grappling with complex topics, there is a perennial problem of data availability. Imagine you want to measure the number of days it takes to set up a new business (this was one of the indicators in Doing Business). You will find that most of the time the data either doesn’t exist or is rarely updated by governments. Instead, put very simplistically, you’d need to ask a few experts or businesses for their views, and use those to create a numerical score for your index.

This is a valid approach, and it’s used in a lot of studies. Take Transparency International’s long-running Corruption Perceptions Index (CPI). Transparency International goes to great lengths to use robust and comparable data across countries, but measuring actual corruption is not viable — for obvious reasons. So the CPI does something different, and the clue is in the name: it measures people’s perceptions of corruption. It asks local businesses and experts whether they think there’s much bribery, nepotism and other forms of corruption in their country. This foundational input is then bolstered with other data points. The data doesn’t aim to measure corruption directly; instead, it assesses which countries are perceived as more, or less, corrupt.

Transparency International’s Corruption Perceptions Index (CPI)

This technique can work well, but it got a bit shaky as Doing Business’s fame grew. Some governments that were anxious to move up the rankings started urging the World Bank to tweak the methodology used to assess their ratings, or to use the views of specific experts. The analysts responsible for assessing a country’s scores and data points were put under significant pressure, often facing strong criticism from governments that didn’t agree with their assessments. In the end, an internal review showed that a number of countries’ scores had been improperly manipulated… The criticism must have stung, because the team behind the World Bank’s new Business Ready report has spent three years trying to address those issues. The new methodology handbook lands with a thump at 704 pages…(More)”.
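To make the scoring detour above concrete, here is a minimal, hypothetical sketch of how expert ratings collected on different native scales might be rescaled to a common range and averaged into a single country score. The sources, scales and equal weighting are invented for illustration; this is not the actual methodology of Doing Business or the CPI.

```python
# Hypothetical perceptions-based scoring: expert ratings on different native
# scales are mapped onto 0-100 and averaged into one country score.
# All source names and scales below are illustrative assumptions.

def rescale(value, lo, hi):
    """Map a rating from its native [lo, hi] scale onto 0-100."""
    return 100.0 * (value - lo) / (hi - lo)

def country_score(ratings):
    """Average a list of (value, scale_lo, scale_hi) expert ratings."""
    rescaled = [rescale(v, lo, hi) for v, lo, hi in ratings]
    return sum(rescaled) / len(rescaled)

# Three hypothetical expert sources rating the same country:
ratings = [
    (7, 0, 10),     # survey A: 7 out of 10
    (3.5, 1, 5),    # survey B: 3.5 on a 1-to-5 scale
    (60, 0, 100),   # survey C: 60 out of 100
]

print(round(country_score(ratings), 1))  # → 64.2
```

Even this toy version shows where the gaming pressure lands: nudge which experts are surveyed, or how their scales are mapped, and the headline score moves without any underlying reform.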

Synthetic content and its implications for AI policy: a primer


UNESCO Paper: “The deployment of advanced Artificial Intelligence (AI) models, particularly generative AI, has sparked discussions regarding the creation and use of synthetic content – i.e. AI-generated or modified outputs, including text, images, sounds, and combinations thereof – and its impact on individuals, societies, and economies. This note explores the different ways in which synthetic content can be generated and used and proposes a taxonomy that encompasses synthetic media and deepfakes, among others. The taxonomy aims to systematize key characteristics, enhancing understanding and informing policy discussions. Key findings highlight both the potential benefits and concerns associated with synthetic content in fields like data analytics, environmental sustainability, education, creativity, and mis/disinformation, and point to the need to frame them ethically, in line with the principles and values of the UNESCO Recommendation on the Ethics of Artificial Intelligence. Finally, the note brings to the fore critical questions that policymakers and experts alike need to address to ensure that the development of AI technologies aligns with human rights, human dignity, and fundamental freedoms…(More)”.

Synthetic Data, Synthetic Media, and Surveillance


Paper by Aaron Martin and Bryce Newell: “Public and scholarly interest in the related concepts of synthetic data and synthetic media has exploded in recent years. From issues raised by the generation of synthetic datasets to train machine learning models to the public-facing, consumer availability of artificial intelligence (AI) powered image manipulation and creation apps and the associated increase in synthetic (or “deepfake”) media, these technologies have shifted from being niche curiosities of the computer science community to become topics of significant public, corporate, and regulatory import. They are emblematic of a “data-generation revolution” (Gal and Lynskey 2024: 1091) that is already raising pressing questions for the academic surveillance studies community. Within surveillance studies scholarship, Fussey (2022: 348) has argued that synthetic media is one of several “issues of urgent societal and planetary concern” and that it has “arguably never been more important” for surveillance studies “researchers to understand these dynamics and complex processes, evidence their implications, and translate esoteric knowledge to produce meaningful analysis.” Yet, while fields adjacent to surveillance studies have begun to explore the ethical risks of synthetic data, we currently perceive a lack of attention to the surveillance implications of synthetic data and synthetic media in published literature within our field. In response, this Dialogue is designed to help promote thinking and discussion about the links and disconnections between synthetic data, synthetic media, and surveillance…(More)”

Social licence for health data


Evidence Brief by NSW Government: “Social licence, otherwise referred to as social licence to operate, refers to an approval or consensus granted by members of society or the community to users — whether public or private enterprises or individuals — to use their health data as desired or under accepted conditions. Social licence is a dynamic and fluid concept and is subject to change over time, often influenced by societal and contextual factors.
Social licence is usually indicated through ongoing engagement and negotiation with the public and is not a contract with strict terms and conditions. It is, rather, a moral and ethical responsibility assumed by data users, based on trust and legitimacy, that supplements the techno-legal mechanisms regulating the use of data.
For example, through public engagement, certain values and principles can emerge as pertinent to public support for the use of their data. Similarly, the public may view certain activities relating to the use of their data as acceptable and beneficial, implying permission for certain activities or use-case scenarios. Internationally, although not always explicitly referred to as a social licence, the most common approach to establishing public trust and support, and to identifying common ground or agreement on acceptable data practices, is public engagement. Engagement methods and mechanisms for gaining public perspectives vary across countries (Table 1).
− Canada – Health Data Research Network Canada reports on social licence for uses of health data, based on deliberative discussions with 20 experienced public and patient advisors. The output is a list of agreements and disagreements on what uses and users of health data have social licence.
− New Zealand – In 2022, the Ministry of Health commissioned a survey on public perceptions on use of personal health information. This report identified conditions under which the public supports the re-use of their data…(More)”.

AI could help scale humanitarian responses. But it could also have big downsides


Article by Thalia Beaty: “As the International Rescue Committee copes with dramatic increases in displaced people in recent years, the refugee aid organization has looked for efficiencies wherever it can — including using artificial intelligence.

Since 2015, the IRC has invested in Signpost — a portfolio of mobile apps and social media channels that answer questions in different languages for people in dangerous situations. The Signpost project, which includes many other organizations, has reached 18 million people so far, but the IRC wants to significantly increase its reach by using AI tools — if it can do so safely.

Conflict, climate emergencies and economic hardship have driven up demand for humanitarian assistance, with more than 117 million people forcibly displaced in 2024, according to the United Nations refugee agency. The turn to artificial intelligence technologies is in part driven by the massive gap between needs and resources.

To meet its goal of reaching half of displaced people within three years, the IRC is testing a network of AI chatbots to see if they can increase the capacity of their humanitarian officers and the local organizations that directly serve people through Signpost. For now, the pilot project operates in El Salvador, Kenya, Greece and Italy and responds in 11 languages. It draws on a combination of large language models from some of the biggest technology companies, including OpenAI, Anthropic and Google.

The chatbot response system also uses customer service software from Zendesk and receives other support from Google and Cisco Systems.

If they decide the tools work, the IRC wants to extend the technical infrastructure to other nonprofit humanitarian organizations at no cost. They hope to create shared technology resources that less technically focused organizations could use without having to negotiate directly with tech companies or manage the risks of deployment…(More)”.

Rethinking the Measurement of Resilience for Food and Nutrition Security


Paper by John M. Ulimwengu: “This paper presents a novel framework for assessing resilience in food systems, focusing on three dynamic metrics: return time, magnitude of deviation, and recovery rate. Traditional resilience measures have often relied on static and composite indicators, creating gaps in understanding the complex responses of food systems to shocks. This framework addresses these gaps, providing a more nuanced assessment of resilience in agrifood sectors. It highlights how integrating dynamic metrics enables policymakers to design tailored, sector-specific interventions that enhance resilience. Recognizing the data intensity required for these metrics, the paper indicates how emerging satellite imagery and advancements in artificial intelligence (AI) can make data collection both high-frequency and location-specific, at a fraction of the cost of traditional methods. These technologies facilitate a scalable approach to resilience measurement, enhancing the accuracy, timeliness, and accessibility of resilience data. The paper concludes with recommendations for refining resilience tools and adapting policy frameworks to better respond to the increasing challenges faced by food systems across the world…(More)”.
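A minimal sketch of how the three dynamic metrics named above might be computed from a time series of a food-system indicator (e.g. a production index) with a known baseline and shock time. The definitions below are one plausible reading of the metrics, not the paper’s exact formulas.

```python
# Toy computation of three dynamic resilience metrics for a shocked
# indicator series: return time, magnitude of deviation, recovery rate.
# These operational definitions are illustrative assumptions.

def resilience_metrics(series, baseline, shock_t):
    """Return (return_time, max_deviation, recovery_rate) for a shock at shock_t."""
    post = series[shock_t:]
    # Magnitude of deviation: largest shortfall from baseline after the shock.
    max_dev = max(baseline - v for v in post)
    # Return time: periods until the indicator first regains its baseline
    # (None if it never recovers within the observed window).
    return_time = next((i for i, v in enumerate(post) if v >= baseline), None)
    # Recovery rate: deviation recovered per period, if recovery occurred.
    rate = max_dev / return_time if return_time else None
    return return_time, max_dev, rate

# A hypothetical indicator: stable at 100, shocked at t=3, recovered by t=7.
series = [100, 100, 100, 70, 80, 90, 95, 100, 101]
print(resilience_metrics(series, baseline=100, shock_t=3))  # → (4, 30, 7.5)
```

The data-intensity point in the paper follows directly: each metric needs a frequently observed series around the shock, which is exactly what satellite-derived, location-specific indicators could supply more cheaply than field surveys.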

The Collaboration Playbook: A leader’s guide to cross-sector collaboration


Playbook by Ian Taylor and Nigel Ball: “The challenges facing our societies and economies today are so large and complex that, in many cases, cross-sector collaboration is not a choice, but an imperative. Yet collaboration remains elusive for many, often being put into the ‘too hard’ category. This playbook offers guidance on how we can seize collaboration opportunities successfully and rise to the challenges.

The recommendations in the playbook were informed by academic literature and practitioner experience. Rather than offer a procedural, step-by-step guide, the playbook poses provoking questions and frameworks that apply to different situations and objectives. While formal aspects such as contracts and procedures are well understood, the authors found that what was most needed was guidance on the intangible elements, sometimes referred to as ‘positive chemistry’. Aspects like leadership, trust, culture, learning and power can be game-changers for productive cross-sector collaborations but are hard to get right.

Structured around these five key themes, the playbook presents 18 discrete ‘plays’ for effective collaboration. The plays allow the reader to delve into specific areas of interest and gain a deeper understanding of what each means for their collaborative work.

The intention of the playbook is to provide a resource that informs and guides cross-sector leaders. It will be especially relevant for those working in, and partnering with, central and local government in an effort to improve social outcomes…(More)”.

Predictability, AI, And Judicial Futurism: Why Robots Will Run The Law And Textualists Will Like It


Paper by Jack Kieffaber: “The question isn’t whether machines are going to replace judges and lawyers—they are. The question is whether that’s a good thing. If you’re a textualist, you have to answer yes. But you won’t—which means you’re not a textualist. Sorry.

Hypothetical: The year is 2030.  AI has far eclipsed the median federal jurist as a textual interpreter. A new country is founded; it’s a democratic republic that uses human legislators to write laws and programs a state-sponsored Large Language Model called “Judge.AI” to apply those laws to facts. The model makes judicial decisions as to conduct on the back end, but can also provide advisory opinions on the front end; if a citizen types in his desired action and hits “enter,” Judge.AI will tell him, ex ante, exactly what it would decide ex post if the citizen were to perform the action and be prosecuted. The primary result is perfect predictability; secondary results include the abolition of case law, the death of common law, and the replacement of all judges—indeed, all lawyers—by a single machine. Don’t fight the hypothetical, assume it works. This article poses the question:  Is that a utopia or a dystopia?

If you answer dystopia, you cannot be a textualist. Part I of this article establishes why:  Because predictability is textualism’s only lodestar, and Judge.AI is substantially more predictable than any regime operating today. Part II-A dispatches rebuttals premised on positive nuances of the American system; such rebuttals forget that my hypothetical presumes a new nation and take for granted how much of our nation’s founding was premised on mitigating exactly the kinds of human error that Judge.AI would eliminate. And Part II-B dispatches normative rebuttals, which ultimately amount to moral arguments about objective good—which are none of the textualist’s business. 

When the dust clears, you have only two choices: You’re a moralist, or you’re a formalist. If you’re the former, you’ll need a complete account of the objective good—which has evaded man for his entire existence. If you’re the latter, you should relish the fast-approaching day when all laws and all lawyers are usurped by a tin box.  But you’re going to say you’re something in between. And you’re not…(More)”

The Next Phase of the Data Economy: Economic & Technological Perspectives


Paper by Jad Esber et al: “The data economy is poised to evolve toward a model centered on individual agency and control, moving us toward a world where data is more liquid across platforms and applications. In this future, products will either utilize existing personal data stores or create them when they don’t yet exist, empowering individuals to fully leverage their own data for various use cases.

The analysis begins by establishing a foundation for understanding data as an economic good and the dynamics of data markets. The article then investigates the concept of personal data stores, analyzing the historical challenges that have limited their widespread adoption. Building on this foundation, the article considers how recent shifts in regulation, technology, consumer behavior, and market forces are converging to create new opportunities for a user-centric data economy. The article concludes by discussing potential frameworks for value creation and capture within this evolving paradigm, summarizing key insights and potential future directions for research, development, and policy.

We hope this article can help shape the thinking of scholars, policymakers, investors, and entrepreneurs, as new data ownership and privacy technologies emerge, and regulatory bodies around the world mandate open flows of data and new terms of service intended to empower users as well as small-to-medium–sized businesses…(More)”.