Towards Best Practices for Open Datasets for LLM Training


Paper by Stefan Baack et al: “Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: the EU and Japan, for example, allow it under certain restrictions, while in the United States the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend towards minimizing the information shared about training datasets by both corporate and public interest actors. This trend of limiting information about training data causes harm by hindering transparency, accountability, and innovation in the broader ecosystem, and by denying researchers, auditors, and impacted individuals access to the information needed to understand AI models.
While this could be mitigated by training language models on open access and public domain data, at the time of writing, there are no such models (trained at a meaningful scale) due to the substantial technical and sociological challenges in assembling the necessary corpus. These challenges include incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the diverse set of legal and technical skills required to ensure relevance and responsibility in a quickly changing landscape. Building towards a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness…(More)”.
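
The metadata problem the authors flag is concrete: a corpus builder cannot safely use a document whose license status is unknown, so incomplete metadata directly shrinks the usable corpus. As a minimal sketch of that filtering step (our illustration, not tooling from the paper; the record fields and license tags are hypothetical):

```python
# Toy license filter for corpus assembly (illustrative only; the
# "license" field and tag values are assumptions, not a real standard).
OPEN_LICENSES = {"cc0", "cc-by", "cc-by-sa", "public-domain"}

def select_open_documents(records):
    """Keep records whose license metadata is present and openly licensed."""
    kept, dropped = [], 0
    for rec in records:
        tag = (rec.get("license") or "").strip().lower()
        if tag in OPEN_LICENSES:
            kept.append(rec)
        else:
            dropped += 1  # missing or unrecognized license: unusable
    return kept, dropped

corpus = [
    {"id": 1, "license": "CC-BY"},
    {"id": 2, "license": None},                   # incomplete metadata
    {"id": 3, "license": "all-rights-reserved"},  # not open
]
kept, dropped = select_open_documents(corpus)
print(len(kept), dropped)  # -> 1 2
```

Under a conservative rule like this, every record with incomplete metadata is lost, which is why unreliable metadata is a first-order obstacle to assembling an openly licensed corpus rather than a cleanup detail.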

Beware the Intention Economy: Collection and Commodification of Intent via Large Language Models


Article by Yaqub Chaudhary and Jonnie Penn: “The rapid proliferation of large language models (LLMs) invites the possibility of a new marketplace for behavioral and psychological data that signals intent. This brief article introduces some initial features of that emerging marketplace. We survey recent efforts by tech executives to position the capture, manipulation, and commodification of human intentionality as a lucrative parallel to—and viable extension of—the now-dominant attention economy, which has bent consumer, civic, and media norms around users’ finite attention spans since the 1990s. We call this follow-on the intention economy. We characterize it in two ways. First, as a competition, initially, between established tech players armed with the infrastructural and data capacities needed to vie for first-mover advantage on a new frontier of persuasive technologies. Second, as a commodification of hitherto unreachable levels of explicit and implicit data that signal intent, namely those signals borne of combining (a) hyper-personalized manipulation via LLM-based sycophancy, ingratiation, and emotional infiltration and (b) increasingly detailed categorization of online activity elicited through natural language.

This new dimension of automated persuasion draws on the unique capabilities of LLMs and generative AI more broadly, which intervene not only on what users want, but also, to cite Williams, “what they want to want” (Williams, 2018, p. 122). We demonstrate through a close reading of recent technical and critical literature (including unpublished papers from arXiv) that such tools are already being explored to elicit, infer, collect, record, understand, forecast, and ultimately manipulate, modulate, and commodify human plans and purposes, both mundane (e.g., selecting a hotel) and profound (e.g., selecting a political candidate)…(More)”.

How and When to Involve Crowds in Scientific Research


Book by Marion K. Poetz and Henry Sauermann: “This book explores how millions of people can significantly contribute to scientific research with their effort and experience, even if they are not working at scientific institutions and may not have formal scientific training. 

Drawing on a strong foundation of scholarship on crowd involvement, this book helps researchers recognize and understand the benefits and challenges of crowd involvement across key stages of the scientific process. Designed as a practical toolkit, it enables scientists to critically assess the potential of crowd participation, determine when it can be most effective, and implement it to achieve meaningful scientific and societal outcomes.

The book also discusses how recent developments in artificial intelligence (AI) shape the role of crowds in scientific research and can enhance the effectiveness of crowd science projects…(More)”.

Governing artificial intelligence means governing data: (re)setting the agenda for data justice


Paper by Linnet Taylor, Siddharth Peter de Souza, Aaron Martin, and Joan López Solano: “The field of data justice has been evolving to take into account the role of data in powering the field of artificial intelligence (AI). In this paper we review the main conceptual bases for governing data and AI: the market-based approach, the personal–non-personal data distinction and strategic sovereignty. We then analyse how these are being operationalised into practical models for governance, including public data trusts, data cooperatives, personal data sovereignty, data collaboratives, data commons approaches and Indigenous data sovereignty. We interrogate these models’ potential for just governance based on four benchmarks which we propose as a reformulation of the Data Justice governance agenda identified by Taylor in her 2017 framework. Re-situating data justice at the intersection of data and AI, these benchmarks focus on preserving and strengthening public infrastructures and public goods; inclusiveness; contestability and accountability; and global responsibility. We demonstrate how they can be used to test whether a governance approach will succeed in redistributing power, engaging with public concerns and creating a plural politics of AI…(More)”.

Artificial Intelligence Narratives


A Global Voices Report: “…Framing AI systems as intelligent is further complicated and intertwined with neighboring narratives. In the US, AI narratives often revolve around opposing themes such as hope and fear, bridging two strong emotions: existential fears and economic aspirations. In either case, they propose that the technology is powerful. These narratives contribute to the hype surrounding AI tools and their potential impact on society.

Many of these framings often present AI as an unstoppable and accelerating force. While this narrative can generate excitement and investment in AI research, it can also contribute to a sense of technological determinism and a lack of critical engagement with the consequences of widespread AI adoption. Counter-narratives are many and expand on the motifs of surveillance, erosions of trust, bias, job impacts, exploitation of labor, high-risk uses, the concentration of power, and environmental impacts, among others.

These narrative frames, combined with the metaphorical language and imagery used to describe AI, contribute to the confusion and lack of public knowledge about the technology. By positioning AI as a transformative, inevitable, and necessary tool for national success, these narratives can shape public opinion and policy decisions, often in ways that prioritize rapid adoption and commercialization…(More)”.

Information Ecosystems and Troubled Democracy


Report by the Observatory on Information and Democracy: “This inaugural meta-analysis provides a critical assessment of the role of information ecosystems in the Global North and Global Majority World, focusing on their relationship with information integrity (the quality of public discourse), the fairness of political processes, the protection of media freedoms, and the resilience of public institutions.

The report addresses three thematic areas with a cross-cutting theme of mis- and disinformation:

  • Media, Politics and Trust;
  • Artificial Intelligence, Information Ecosystems and Democracy;
  • Data Governance and Democracy.

The analysis is based mainly on academic publications, supplemented by reports and other materials from different disciplines and regions (1,664 citations selected from a total corpus of more than 2,700 aggregated resources). The report showcases what we can learn from landmark research on the often intractable challenges posed by rapid changes in information and communication spaces…(More)”.

What’s a Fact, Anyway?


Essay by Fergus McIntosh: “…For journalists, as for anyone, there are certain shortcuts to trustworthiness, including reputation, expertise, and transparency—the sharing of sources, for example, or the prompt correction of errors. Some of these shortcuts are more perilous than others. Various outfits, positioning themselves as neutral guides to the marketplace of ideas, now tout evaluations of news organizations’ trustworthiness, but relying on these requires trusting in the quality and objectivity of the evaluation. Official data is often taken at face value, but numbers can conceal motives: think of the dispute over how to count casualties in recent conflicts. Governments, meanwhile, may use their powers over information to suppress unfavorable narratives: laws originally aimed at misinformation, many enacted during the COVID-19 pandemic, can hinder free expression. The spectre of this phenomenon is fuelling a growing backlash in America and elsewhere.

Although some categories of information may come to be considered inherently trustworthy, these, too, are in flux. For decades, the technical difficulty of editing photographs and videos allowed them to be treated, by most people, as essentially incontrovertible. With the advent of A.I.-based editing software, footage and imagery have swiftly become much harder to credit. Similar tools are already used to spoof voices based on only seconds of recorded audio. For anyone, this might manifest in scams (your grandmother calls, but it’s not Grandma on the other end), but for a journalist it also puts source calls into question. Technologies of deception tend to be accompanied by ones of detection or verification—a battery of companies, for example, already promise that they can spot A.I.-manipulated imagery—but they’re often locked in an arms race, and they never achieve total accuracy. Though chatbots and A.I.-enabled search engines promise to help us with research (when a colleague “interviewed” ChatGPT, it told him, “I aim to provide information that is as neutral and unbiased as possible”), their inability to provide sourcing, and their tendency to hallucinate, look more like a shortcut to nowhere, at least for now. The resulting problems extend far beyond media: election campaigns, in which subtle impressions can lead to big differences in voting behavior, feel increasingly vulnerable to deepfakes and other manipulations by inscrutable algorithms. Like everyone else, journalists have only just begun to grapple with the implications.

In such circumstances, it becomes difficult to know what is true, and, consequently, to make decisions. Good journalism offers a way through, but only if readers are willing to follow: trust and naïveté can feel uncomfortably close. Gaining and holding that trust is hard. But failure—the end point of the story of generational decay, of gold exchanged for dross—is not inevitable. Fact checking of the sort practiced at The New Yorker is highly specific and resource-intensive, and it’s only one potential solution. But any solution must acknowledge the messiness of truth, the requirements of attention, the way we squint to see more clearly. It must tell you to say what you mean, and know that you mean it…(More)”.

Governance of Indigenous data in open earth systems science


Paper by Lydia Jennings et al: “In the age of big data and open science, what processes are needed to follow open science protocols while upholding Indigenous Peoples’ rights? The Earth Data Relations Working Group (EDRWG) convened to address this question and to envision a research landscape that acknowledges the legacy of extractive practices and embraces new norms across Earth science institutions and open science research. Using the National Ecological Observatory Network (NEON) as an example, the EDRWG recommends actions, applicable across all phases of the data lifecycle, that recognize the sovereign rights of Indigenous Peoples and support better research across all Earth sciences…(More)”.

Facing & mitigating common challenges when working with real-world data: The Data Learning Paradigm


Paper by Jake Lever et al: “The rapid growth of data-driven applications is ubiquitous across virtually all scientific domains and has led to an increasing demand for effective methods to handle data deficiencies and mitigate the effects of imperfect data. This paper presents a guide for researchers encountering real-world data-driven applications and the challenges associated with them. It proposes the concept of the Data Learning Paradigm, which combines the principles of machine learning, data science, and data assimilation to tackle real-world challenges in data-driven applications. Models are a product of the data upon which they are trained, and no data collected from real-world scenarios is perfect due to natural limitations of sensing and collection. Thus, computational modelling of real-world systems is intrinsically limited by the various deficiencies encountered in real data. The Data Learning Paradigm aims to leverage the strengths of data improvement to enhance the accuracy, reliability, and interpretability of data-driven models. We outline a range of machine learning and data science methods currently being implemented in the field of Data Learning, and discuss how these mitigate the various problems associated with data-driven models, illustrating improved results in a multitude of real-world applications. We highlight examples where these methods have led to significant advancements in fields such as environmental monitoring, planetary exploration, healthcare analytics, linguistic analysis, social networks, and smart manufacturing. We offer a guide to how these methods may be implemented to deal with general types of limitations in data, alongside their current and potential applications…(More)”.
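
To make the core idea tangible, here is a minimal sketch (our illustration, not the authors' pipeline), assuming NumPy and scikit-learn are available: real-world inputs arrive with gaps, and a repair step is fitted together with the model so that training proceeds on completed data:

```python
# Minimal "repair then model" sketch: simulated sensor data with missing
# readings is imputed inside the same pipeline that fits the model.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Simulate 200 observations of 3 sensors, with ~20% of readings missing.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_deficient = X.copy()
X_deficient[rng.random(X.shape) < 0.2] = np.nan

# Impute the gaps, then fit: the model trains on repaired, complete data.
model = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
model.fit(X_deficient, y)
print(model.score(X_deficient, y))
```

Mean imputation is the crudest member of this family; the same pipeline slot could hold interpolation, learned imputers, or data-assimilation updates, which is precisely the flexibility the Data Learning framing emphasizes.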

Sortition: Past and Present


Introduction to the Journal of Sortition: “Since ancient times sortition (random selection by lot) has been used both to distribute political office and as a general prophylactic against factionalism and corruption in societies as diverse as classical-era Athens and the Most Serene Republic of Venice. Lotteries have also been employed for the allocation of scarce goods such as social housing and school places to eliminate bias and ensure just distribution, along with drawing lots in circumstances where unpopular tasks or tragic choices are involved (as some situations are beyond rational human decision-making). More recently, developments in public opinion polling using random sampling have led to the proliferation of citizens’ assemblies selected by lot. Some activists have even proposed such bodies as an alternative to elected representatives. The Journal of Sortition benefits from an editorial board with a wide range of expertise and perspectives in this area. In this introduction to the first issue, we have invited our editors to explain why they are interested in sortition, and to outline the benefits (and pitfalls) of the recent explosion of interest in the topic…(More)”.
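
For readers new to the mechanism itself, part of sortition's appeal is that a fair draw is computationally trivial. A toy sketch (ours, not the journal's) of a uniform lottery for a citizens' panel:

```python
import random

def draw_panel(pool, k, seed=None):
    """Select k members uniformly at random from a candidate pool.

    Every candidate has the same probability of selection, the core
    fairness property that sortition relies on.
    """
    rng = random.Random(seed)
    return rng.sample(pool, k)

# Hypothetical example: a 12-person panel drawn from 1,000 volunteers.
volunteers = [f"citizen_{i:04d}" for i in range(1000)]
print(draw_panel(volunteers, k=12, seed=7))
```

Real citizens' assemblies typically layer stratification (by age, region, and so on) on top of such a uniform draw so that the panel mirrors the wider population, which connects the practice back to the random-sampling tradition in opinion polling that the introduction mentions.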