Will we run out of data? Limits of LLM scaling based on human-generated data


Paper by Pablo Villalobos et al: “We investigate the potential constraints on LLM scaling posed by the availability of public human-generated text data. We forecast the growing demand for training data based on current trends and estimate the total stock of public human text data. Our findings indicate that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained. We explore how progress in language modeling can continue when human-generated text datasets cannot be scaled any further. We argue that synthetic data generation, transfer learning from data-rich domains, and data efficiency improvements might support further progress…(More)”.
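As a rough illustration of the kind of projection the paper describes, the sketch below intersects an exponentially growing training-dataset size with a fixed stock of public text. All numbers are invented for illustration; they are not the paper's estimates.

```python
import math

# Back-of-the-envelope projection: if training datasets grow at a fixed
# multiplicative rate, the year they exhaust a fixed stock of text
# follows from a logarithm. All figures below are assumptions.
dataset_tokens = 15e12   # assumed tokens used by a frontier model today
stock_tokens = 300e12    # assumed total stock of public human text
annual_growth = 2.0      # assumed: dataset size doubles each year

years_left = math.log(stock_tokens / dataset_tokens, annual_growth)
print(f"Stock exhausted in ~{years_left:.1f} years")  # ~4.3 years here
```

Overtraining (using more tokens per parameter than is compute-optimal) steepens the demand curve, which is why the abstract notes it would pull the crossover date earlier.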

What does it mean to be good? The normative and metaethical problem with ‘AI for good’


Article by Tom Stenson: “Using AI for good is an imperative for its development and regulation, but what exactly does it mean? This article contends that ‘AI for good’ is a powerful normative concept and is problematic for the ethics of AI because it oversimplifies the complex philosophical questions involved in defining good and assumes a level of moral knowledge and certainty that may not be justified. ‘AI for good’ expresses a value judgement on what AI should be and its role in society, thereby functioning as a normative concept in AI ethics. As a moral statement, ‘AI for good’ makes two claims implicit: (i) that we know what a good outcome is, and (ii) that we know the process by which to achieve it. By examining these two claims, this article will articulate the thesis that ‘AI for good’ should be examined as a normative and metaethical problem for AI ethics. Furthermore, it argues that we need to pay more attention to our relationship with normativity and how it guides what we believe the ‘work’ of ethical AI should be…(More)”.

Uganda’s Sweeping Surveillance State Is Built on National ID Cards


Article by Olivia Solon: “Uganda has spent hundreds of millions of dollars in the past decade on biometric tools that document a person’s unique physical characteristics, such as their face, fingerprints and irises, to form the basis of a comprehensive identification system. While the system is central to many of the state’s everyday functions, as President Yoweri Museveni has grown increasingly authoritarian over nearly four decades in power, it has also become a powerful mechanism for surveilling politicians, journalists, human rights advocates and ordinary citizens, according to dozens of interviews and hundreds of pages of documents obtained and analyzed by Bloomberg and nonprofit investigative newsroom Lighthouse Reports.

It’s a cautionary tale for any country considering establishing a biometric identity system without rigorous checks and balances and input from civil society. Dozens of Global South countries have adopted this approach as part of an effort to meet the UN’s Sustainable Development Goals; the UN considers having a legal identity to be a fundamental human right. But, despite billions of dollars of investment, with backing from organizations including the World Bank, those identity systems haven’t always lived up to expectations. In many cases, the key problem is the failure to register large swathes of the population, leading to exclusion from public services. But in other places, like Uganda, inclusion in the system has been weaponized for surveillance purposes.

A year-long investigation by Bloomberg and Lighthouse Reports sheds new light on the ways in which Museveni’s regime has built and deployed this system to target opponents and consolidate power. It shows how the underlying software and data sets are easily accessed by individuals at all levels of law enforcement, despite official claims to the contrary. It also highlights, in some cases for the first time, how senior government and law enforcement officials have used these tools to target individuals deemed to pose a political threat…(More)”.

Using ChatGPT for analytics


Paper by Aleksei Turobov et al: “The utilisation of AI-driven tools, notably ChatGPT (Generative Pre-trained Transformer), within academic research is increasingly debated from several perspectives, including ease of implementation and potential enhancements in research efficiency, as against ethical concerns and risks such as biases and unexplained AI operations. This paper explores the use of the GPT model for initial coding in qualitative thematic analysis, using a sample of United Nations (UN) policy documents. The primary aim of this study is to contribute to the methodological discussion regarding the integration of AI tools, offering a practical guide to validating the use of GPT as a collaborative research assistant. The paper outlines the advantages and limitations of this methodology and suggests strategies to mitigate risks. Emphasising the importance of transparency and reliability in employing GPT within research methodologies, this paper argues for a balanced use of AI to support thematic analysis, highlighting its potential to elevate research efficacy and outcomes…(More)”.
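For readers unfamiliar with the mechanics, the sketch below shows the general shape of LLM-assisted initial coding. It is a minimal illustration, not the authors' protocol: it assumes the OpenAI Python client, and the model name and prompt are placeholders.

```python
# Minimal sketch of LLM-assisted initial coding for thematic analysis.
# Assumes the OpenAI Python client (reads OPENAI_API_KEY from the env).
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You assist with qualitative thematic analysis. Read the policy "
    "excerpt and propose up to five short initial codes, one per line, "
    "each with a brief supporting quote."
)

def initial_codes(excerpt: str) -> str:
    """Return candidate codes for one document excerpt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": excerpt},
        ],
        temperature=0,  # favour reproducibility, which validation requires
    )
    return response.choices[0].message.content
```

Codes produced this way can then be compared across repeated runs and against human coders to estimate reliability before any theme is accepted, which is the kind of validation step the paper emphasises.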

Unmasking and Quantifying Power Structures: How Network Analysis Enhances Peace and State-Building Efforts


Blog by Issa Luna Pla: “Critiques of peace and state-building efforts have pointed out the inadequate grasp of the origins of conflict, political unrest, and the intricate dynamics of criminal and illicit networks (Holt and Bouch, 2009; Cockayne and Lupel, 2011). This limited understanding has failed to sufficiently weaken their economic and political influence or effectively curb their activities and objectives. A recent study highlights that although punitive approaches may have temporarily diminished the power of these networks, the absence of robust analytical tools has made it difficult to assess the enduring impact of these strategies.

1. Application of Network Analytics in State-Building

The importance of analytics in international peace and state-building operations is becoming increasingly recognized (O’Brien, 2010; Gnanguenon, 2021; Rød et al., 2023). Analytics, particularly network analysis, plays a crucial role in dissecting and dismantling complex power structures that often undermine peace initiatives and governance reforms. This analytical approach is essential for revealing and disrupting the entrenched networks that sustain ongoing conflicts or obstruct peace processes. From the experiences in Guatemala, three significant lessons have been learned regarding the need to align analytics with regional and thematic priorities in such operations (Waxenecker, 2019). These insights are vital for understanding how to tailor analytical strategies to address specific challenges in conflict-affected areas.

  1. The effectiveness of the International Commission against Impunity in Guatemala (CICIG) in dismantling criminal networks was constrained by its lack of advanced analytical tools. This limitation prevented a deeper exploration of the conflicts’ roots and hindered the assessment of the long-term impacts of its strategies. While the CICIG had a systematic approach to understanding criminal networks from a contextual and legal perspective, its action plans lacked comprehensive statistical analysis methodologies, leading to missed opportunities to target key strategic players within these networks (see the sketch after this list). High-level arrests were based on available evidence and charges that prosecutors could substantiate, rather than on a strategic analysis of actors’ roles and influence within the networks’ dynamics.
  2. Furthermore, the extent of network dismantlement and the lasting effects of imprisonment and of financial controls on the illicit groups’ assets remain unclear, highlighting the need for predictive analytics to anticipate conflict and assess the sustainability of these outcomes. Such tools could enable operations to forecast potential disruptions or stability, allowing for data-driven proactive measures to prevent violence or bolster peace.
  3. Lastly, insights derived from network analysis suggest that efforts should focus on enhancing diplomatic negotiations, promoting economic development and social capital, and balancing punitive measures with strategic interventions. By understanding the dynamics and modeling group behavior in conflict zones, negotiations can be better informed by a deep and holistic comprehension of the underlying power structures and motivations. This approach could also help in forecasting recidivism, assessing risks of network reorganization, and evaluating the potential for increased armament, workforce, or empowerment, thereby facilitating more effective and sustainable peacebuilding initiatives.
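To make the list above concrete, the toy sketch below ranks actors in an invented illicit network by betweenness centrality, a common proxy for the brokerage power that point 1 says arrest strategies overlooked. It uses the networkx library; the graph and actor names are wholly hypothetical.

```python
import networkx as nx

# Hypothetical co-offending / financial-flow ties between actors.
edges = [
    ("A", "B"), ("A", "C"), ("B", "C"),  # a dense criminal cell
    ("C", "D"),                          # D brokers access to...
    ("D", "E"), ("D", "F"), ("E", "F"),  # ...a second cell
]
g = nx.Graph(edges)

# Betweenness centrality: how often an actor lies on shortest paths
# between other actors -- high scores flag strategic brokers.
for actor, score in sorted(
    nx.betweenness_centrality(g).items(), key=lambda kv: -kv[1]
):
    print(f"{actor}: {score:.2f}")
# C and D score highest here: arrests that remove peripheral but
# evidence-rich actors would leave this network's structure intact.
```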

2. Advancing Legal and Institutional Reforms

Utilizing data science in conflict-affected environments offers unique insights into the behavior of illicit networks and their interactions within the public and private sectors (Morselli et al., 2007; Leuprecht and Hall, 2014; Campedelli et al., 2019). This systematic approach, grounded in the analysis of years of illicit activities in Guatemala, highlights the necessity of rethinking traditional legal and institutional frameworks…(More)”.

Scraping the demos. Digitalization, web scraping and the democratic project


Paper by Lena Ulbricht: “Scientific, political and bureaucratic elites use epistemic practices like “big data analysis” and “web scraping” to create representations of the citizenry and to legitimize policymaking. I develop the concept of “demos scraping” for these practices of gaining information about citizens (the “demos”) through automated analysis of digital trace data which are re-purposed for political means. This article critically engages with the discourse advocating demos scraping and provides a conceptual analysis of its democratic implications. It engages with the promise, made by advocates of demos scraping, to reduce the gap between political elites and citizens, and highlights how demos scraping is presented as a superior means of accessing the “will of the people” and increasing democratic legitimacy. This leads me to critically discuss the implications of demos scraping for political representation and participation. In its current form, demos scraping is technocratic and de-politicizing; and the larger political and economic context in which it takes place makes it unlikely that it will reduce the gap between elites and citizens. From the analytic perspective of a post-democratic turn, demos scraping is an attempt by late-modern, digitalized societies to address the democratic paradox of increasing citizen expectations coupled with a deep legitimation crisis…(More)”.

How this mental health care app is using generative AI to improve its chatbot


Interview by Daniela Dib: “Andrea Campos struggled with depression for years before founding Yana, a mental health care app, in 2017. The app’s chatbot provides users with emotional companionship in Spanish. Although she was reluctant at first, Campos began using generative artificial intelligence for the Yana chatbot after ChatGPT launched in 2022. Yana, which recently launched its English-language version, has 15 million users and is available in Latin America and the U.S.

This interview has been edited for clarity and brevity.

How has your product evolved since you introduced generative AI to it?

At first, we didn’t use generative AI because we believed it was far from ready for mental health support. We designed and guardrailed our chatbot’s responses with decision trees. But when ChatGPT launched and we saw what it could do, it wasn’t a question of whether to use generative AI or not, but how soon — we’d fall behind otherwise. It’s been a challenge because everyone quickly began developing with generative AI, but our advantage was that, having operated our chatbot for a while, we had gathered over 2 billion data points that have been invaluable for our app’s fine-tuning. One thing is clear: It’s crucial to have a model tailored to the specific needs of our product…(More)”.
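The "decision tree" guardrails Campos describes can be pictured as a deterministic routing layer in front of the generative model. The sketch below is an invented, simplified illustration of that pattern, not Yana's implementation.

```python
# Toy guardrail router: risky messages get a fixed, vetted response;
# only low-risk turns reach the generative model. All details invented.
CRISIS_TERMS = {"suicide", "self-harm", "hurt myself"}

SAFE_RESPONSE = (
    "I'm really glad you told me. I can't provide crisis care, but "
    "please contact a local emergency line right now."
)

def respond(user_message: str, generate) -> str:
    """`generate` is any callable that wraps a generative model."""
    if any(term in user_message.lower() for term in CRISIS_TERMS):
        return SAFE_RESPONSE        # deterministic branch, never generated
    return generate(user_message)   # generative branch, guardrails passed
```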

Japan’s push to make all research open access is taking shape


Article by Dalmeet Singh Chawla: “The Japanese government is pushing ahead with a plan to make Japan’s publicly funded research output free to read. In June, the science ministry will assign funding to universities to build the infrastructure needed to make research papers free to read on a national scale. The move follows the ministry’s announcement in February that researchers who receive government funding will be required to make their papers freely available to read in institutional repositories from April 2025.

The Japanese plan “is expected to enhance the long-term traceability of research information, facilitate secondary research and promote collaboration”, says Kazuki Ide, a health-sciences and public-policy scholar at Osaka University in Suita, Japan, who has written about open access in Japan.

Japan is one of the first Asian countries to make notable advances towards open access (OA), and among the first in the world to forge a nationwide OA plan.

The plan follows in the footsteps of the influential Plan S, introduced six years ago by a group of research funders in the United States and Europe known as cOAlition S, to accelerate the move to OA publishing. The United States also implemented an OA mandate in 2022 that requires all research funded by US taxpayers to be freely available from 2026…(More)”.

Seeing Like a Data Structure


Essay by Barath Raghavan and Bruce Schneier: “Technology was once simply a tool—and a small one at that—used to amplify human intent and capacity. That was the story of the industrial revolution: we could control nature and build large, complex human societies, and the more we employed and mastered technology, the better things got. We don’t live in that world anymore. Not only has technology become entangled with the structure of society, but we also can no longer see the world around us without it. The separation is gone, and the control we thought we once had has revealed itself as a mirage. We’re in a transitional period of history right now.

We tell ourselves stories about technology and society every day. Those stories shape how we use and develop new technologies, as well as the new stories and uses that will come with them. They determine who’s in charge, who benefits, who’s to blame, and what it all means.

Some people are excited about the emerging technologies poised to remake society. Others are hoping for us to see this as folly and adopt simpler, less tech-centric ways of living. And many feel that they have little understanding of what is happening and even less say in the matter.

But we never had total control of technology in the first place, nor is there a pretechnological golden age to which we can return. The truth is that our data-centric way of seeing the world isn’t serving us well. We need to tease out a third option. To do so, we first need to understand how we got here…(More)”.

“The Death of Wikipedia?” — Exploring the Impact of ChatGPT on Wikipedia Engagement


Paper by Neal Reeves, Wenjie Yin, and Elena Simperl: “Wikipedia is one of the most popular websites in the world, serving as a major source of information and learning resource for millions of users worldwide. While motivations for its usage vary, prior research suggests shallow information gathering — looking up facts and information or answering questions — dominates over more in-depth usage. On the 30th of November 2022, ChatGPT was released to the public and has quickly become a popular source of information, serving as an effective question-answering and knowledge-gathering resource. Early indications have suggested that it may be drawing users away from traditional question-answering services such as Stack Overflow, raising the question of how it may have impacted Wikipedia. In this paper, we explore Wikipedia user metrics across four areas: page views, unique visitor numbers, edit counts and editor numbers within twelve language instances of Wikipedia. We perform pairwise comparisons of these metrics before and after the release of ChatGPT and implement a panel regression model to observe and quantify longer-term trends. We find no evidence of a fall in engagement across any of the four metrics, instead observing that page views and visitor numbers increased in the period following ChatGPT’s launch. However, we observe a lower increase in languages where ChatGPT was available than in languages where it was not, which may suggest ChatGPT’s availability limited growth in those languages. Our results contribute to the understanding of how emerging generative AI tools are disrupting the Web ecosystem…(More)”. See also: Are we entering a Data Winter? On the urgent need to preserve data access for the public interest.
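The paper's design can be sketched as a difference-in-differences panel regression: compare the post-launch change in engagement between languages where ChatGPT was available and languages where it was not, with language and month fixed effects. The code below is a schematic of that specification using statsmodels; the file, column names, and variables are illustrative assumptions, not the authors' code or data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel: one row per (language, month) with columns
#   language, month, page_views,
#   post      (1 for months after ChatGPT's release),
#   available (1 for languages where ChatGPT was accessible).
df = pd.read_csv("wikipedia_panel.csv")  # placeholder file name
df["log_views"] = np.log(df["page_views"])

# Language and month fixed effects absorb the main effects of
# `available` and `post`; the interaction identifies the differential
# post-launch change in engagement where ChatGPT was available.
model = smf.ols(
    "log_views ~ post:available + C(language) + C(month)", data=df
).fit(cov_type="cluster", cov_kwds={"groups": df["language"]})
print(model.params["post:available"])
```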