AI firms must play fair when they use academic data in training


Nature Editorial: “But others are worried about principles such as attribution, the currency by which science operates. Fair attribution is a condition of reuse under CC BY, a commonly used open-access copyright license. In jurisdictions such as the European Union and Japan, there are exemptions to copyright rules that cover factors such as attribution — for example, for text and data mining in research, which uses automated analysis of sources to find patterns. Some scientists see data-scraping to train proprietary LLMs as going well beyond what these exemptions were intended to achieve.

In any case, attribution is impossible when a large commercial LLM uses millions of sources to generate a given output. But when developers create AI tools for use in science, a method known as retrieval-augmented generation could help. This technique doesn’t apportion credit to the data that trained the LLM, but does allow the model to cite papers that are relevant to its output, says Lucy Lu Wang, an AI researcher at the University of Washington in Seattle.

Giving researchers the ability to opt out of having their work used in LLM training could also ease their worries. Creators have this right under EU law, but it is tough to enforce in practice, says Yaniv Benhamou, who studies digital law and copyright at the University of Geneva. Firms are devising innovative ways to make it easier. Spawning, a start-up company in Minneapolis, Minnesota, has developed tools to allow creators to opt out of data scraping. Some developers are also getting on board: OpenAI’s Media Manager tool, for example, allows creators to specify how their works can be used by machine-learning algorithms…(More)”.
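
Retrieval-augmented generation, mentioned above as a way to let a model cite the papers relevant to its output, can be sketched in a few lines. The following is a minimal, hypothetical illustration: the corpus, the embedding function, and the prompt format are placeholders, not any particular vendor's API and not the specific approach Wang describes.

```python
# Minimal retrieval-augmented generation (RAG) sketch. Illustrative only:
# the embedding function is a stand-in for a real embedding model, and the
# final generation call to an LLM is omitted.
from dataclasses import dataclass

import numpy as np


@dataclass
class Paper:
    title: str
    doi: str
    abstract: str


def embed(text: str) -> np.ndarray:
    """Placeholder embedding; a real system would call an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(384)


def retrieve(query: str, corpus: list[Paper], k: int = 3) -> list[Paper]:
    """Rank papers by cosine similarity to the query and return the top k."""
    q = embed(query)

    def cosine(v: np.ndarray) -> float:
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))

    sims = [cosine(embed(p.abstract)) for p in corpus]
    top = np.argsort(sims)[::-1][:k]
    return [corpus[i] for i in top]


def build_cited_prompt(query: str, corpus: list[Paper]) -> str:
    """Assemble a prompt that asks the model to answer and cite retrieved DOIs."""
    sources = retrieve(query, corpus)
    context = "\n".join(f"[{p.doi}] {p.title}: {p.abstract}" for p in sources)
    return (
        "Answer the question using only the sources below, citing their DOIs.\n"
        f"{context}\n\nQuestion: {query}"
    )
```

Because the retrieved papers are placed in the prompt with their identifiers, the model can point back to specific sources, which is the sense in which this technique allows citation without apportioning credit to the original training data.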

The Imperial Origins of Big Data


Blog and book by Asheesh Kapur Siddique: “We live in a moment of massive transformation in the nature of information. In 2020, according to one report, users of the Internet created 64.2 zettabytes of data, a quantity greater than the “number of detectable stars in the cosmos,” a colossal increase whose origins can be traced to the emergence of the World Wide Web in 1993. Facilitated by technologies like satellites, smartphones, and artificial intelligence, the scale and speed of data creation seem likely only to balloon over the rest of our lifetimes—and with it, the problem of how to govern ourselves in relation to the inequalities and opportunities that the explosion of data creates.

But while much about our era of big data is indeed revolutionary, the political questions that it raises—How should information be used? Who should control it? And how should it be preserved?—are ones with which societies have long grappled. These questions attained particular importance in Europe from the eleventh century due to a technological change no less significant than the ones we are witnessing today: the introduction of paper into Europe. Invented in China, paper travelled to Europe through the Islamic world, arriving around the eleventh century via Moorish Spain. Over the twelfth, thirteenth, and fourteenth centuries, paper emerged as the fundamental substrate on which politicians, merchants, and scholars relied to record and circulate information in governance, commerce, and learning. At the same time, governing institutions sought to preserve and control the spread of written information through the creation of archives: repositories where they collected, organized, and stored documents.

The expansion of European polities overseas from the late fifteenth century onward saw governments massively scale up their use of paper—and confront the challenge of controlling its dissemination across thousands of miles of ocean and land. These pressures were felt particularly acutely in what eventually became the largest empire in world history, the British Empire. As people from the British Isles fought, traded, and settled their way to power in the Atlantic world and South Asia from the early seventeenth century onward, administrators faced the problem of how to govern both their emigrating subjects and the non-British peoples with whom they interacted. This meant collecting information about their behavior through the technology of paper. Just as we struggle to organize, search, and control our email inboxes, text messages, and app notifications, so too did these early moderns have to develop practices of collection and storage to manage the resulting information overload. And despite the best efforts of states and companies to control information, it constantly escaped their grasp, falling into the hands of their opponents and rivals who deployed it to challenge and contest ruling powers.

The history of the early modern information state offers no simple or straightforward answers to the questions that data raises for us today. But it does remind us of a crucial truth, all too readily obscured by the deluge of popular narratives glorifying technological innovation: that questions of data are inherently questions about politics—about who gets to collect, control, and use information, and the ends to which information should be put. We should resist any effort to insulate data governance from democratic processes—and having an informed perspective on the politics of data requires that we attend not just to its present, but also to its past…(More)”.

When A.I.’s Output Is a Threat to A.I. Itself


Article by Aatish Bhatia: “The internet is becoming awash in words and images generated by artificial intelligence.

Sam Altman, OpenAI’s chief executive, wrote in February that the company generated about 100 billion words per day — a million novels’ worth of text, every day, an unknown share of which finds its way onto the internet.

A.I.-generated text may show up as a restaurant review, a dating profile or a social media post. And it may show up as a news article, too: NewsGuard, a group that tracks online misinformation, recently identified over a thousand websites that churn out error-prone A.I.-generated news articles.

In reality, with no foolproof methods to detect this kind of content, much will simply remain undetected.

All this A.I.-generated information can make it harder for us to know what’s real. And it also poses a problem for A.I. companies. As they trawl the web for new data to train their next models on — an increasingly challenging task — they’re likely to ingest some of their own A.I.-generated content, creating an unintentional feedback loop in which what was once the output from one A.I. becomes the input for another.

In the long run, this cycle may pose a threat to A.I. itself. Research has shown that when generative A.I. is trained on a lot of its own output, it can get a lot worse.

Here’s a simple illustration of what happens when an A.I. system is trained on its own output, over and over again. Starting from a data set of 60,000 handwritten digits, we trained an A.I. to mimic those digits; a second A.I. was then trained on the first one’s output, and so on. After 20 generations of training new A.I.s on their predecessors’ output, the digits blur and start to erode. After 30 generations, they converge into a single shape.

While this is a simplified example, it illustrates a problem on the horizon.

Imagine a medical-advice chatbot that lists fewer diseases that match your symptoms, because it was trained on a narrower spectrum of medical knowledge generated by previous chatbots. Or an A.I. history tutor that ingests A.I.-generated propaganda and can no longer separate fact from fiction…(More)”.
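
The degradation the article describes can be reproduced in miniature with almost any generative model: fit it to data, sample from it, refit on the samples, and repeat. The sketch below is a deliberately crude one-dimensional caricature of that feedback loop, not the digit model used in the article.

```python
# Toy illustration of recursive training on synthetic output ("model collapse").
# A 1-D caricature of the digit experiment: fit a Gaussian to data, sample from
# the fit, refit on the samples, and repeat.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)  # the original "real" data

for generation in range(1, 31):
    mu, sigma = data.mean(), data.std()       # fit a simple generative model
    data = rng.normal(mu, sigma, size=200)    # next generation sees only synthetic data
    if generation % 10 == 0:
        print(f"generation {generation:2d}: estimated spread = {sigma:.3f}")

# With finite samples, estimation error compounds each round: the estimated
# spread follows a downward-biased random walk, so diversity tends to erode,
# echoing how the generated digits blurred and finally converged to one shape.
```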

Policy for responsible use of AI in government


Policy by the Australian Government: “The Policy for the responsible use of AI in government ensures that government plays a leadership role in embracing AI for the benefit of Australians while ensuring its safe, ethical and responsible use, in line with community expectations. The policy:

  • provides a unified approach for government to engage with AI confidently, safely and responsibly, and realise its benefits
  • aims to strengthen public trust in government’s use of AI by providing enhanced transparency, governance and risk assurance
  • aims to embed a forward-leaning, adaptive approach for government’s use of AI that is designed to evolve and develop over time…(More)”.

Data Protection Law and Emotion


Book by Damian Clifford: “Data protection law is often positioned as a regulatory solution to the risks posed by computational systems. Despite the widespread adoption of data protection laws, however, there are those who remain sceptical as to their capacity to engender change. Much of this criticism focuses on our role as ‘data subjects’. It has been demonstrated repeatedly that we lack the capacity to act in our own best interests and, what is more, that our decisions have negative impacts on others. Our decision-making limitations seem to be the inevitable by-product of the technological, social, and economic reality. Data protection law bakes in these limitations by providing frameworks for notions such as consent and subjective control rights and by relying on those who process our data to do so fairly.

Despite these valid concerns, Data Protection Law and Emotion argues that the (in)effectiveness of these laws is often more difficult to discern than the critical literature would suggest, while also emphasizing the importance of the conceptual value of subjective control. These points are explored (and indeed, exposed) by investigating data protection law through the lens of the insights provided by law and emotion scholarship and demonstrating the role emotions play in our decision-making. The book uses the development of Emotional Artificial Intelligence, a particularly controversial technology, as a case study to analyse these issues.

Original and insightful, Data Protection Law and Emotion offers a unique contribution to a contentious debate that will appeal to students and academics in data protection and privacy, policymakers, practitioners, and regulators…(More)”.

Relational ethics in health care automation


Paper by Frances Shaw and Anthony McCosker: “Despite the transformative potential of automation and clinical decision support technology in health care, there is growing urgency for more nuanced approaches to ethics. Relational ethics is an approach that can guide the responsible use of a range of automated decision-making systems, including generative artificial intelligence and large language models, as they affect health care relationships.

There is an urgent need for sector-wide training and scrutiny regarding the effects of automation using relational ethics touchstones, such as patient-centred health care, informed consent, patient autonomy, shared decision-making, empathy and the politics of care.

The purpose of this review is to offer a provocation for health care practitioners, managers and policy makers to consider the use of automated tools in practice settings and examine how these tools might affect relationships and hence care outcomes…(More)”.

Modeling Cities and Regions as Complex Systems


Book by Roger White, Guy Engelen and Inge Uljee: “Cities and regions grow (or occasionally decline), and continuously transform themselves as they do so. This book describes the theory and practice of modeling the spatial dynamics of urban growth and transformation. As cities are complex, adaptive, self-organizing systems, the most appropriate modeling framework is one based on the theory of self-organizing systems—an approach already used in such fields as physics and ecology. The book presents a series of models, most of them developed using cellular automata (CA), which are inherently spatial and computationally efficient. It also provides discussions of the theoretical, methodological, and philosophical issues that arise from the models. A case study illustrates the use of these models in urban and regional planning. Finally, the book presents a new, dynamic theory of urban spatial structure that emerges from the models and their applications.

The models are primarily land use models, but the more advanced ones also show the dynamics of population and economic activities, and are integrated with models in other domains such as economics, demography, and transportation. The result is a rich and realistic representation of the spatial dynamics of a variety of urban phenomena. The book is unique in its coverage of both the general issues associated with complex self-organizing systems and the specifics of designing and implementing models of such systems…(More)”.
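
For readers unfamiliar with cellular automata, the core mechanic is simple: a grid of cells whose land-use states update based on the states of their neighbors. The toy sketch below shows only that mechanic; the models in the book add calibrated transition rules, suitability and accessibility layers, and links to regional demand, none of which are represented here.

```python
# Toy cellular-automaton land-use model (illustrative only; the book's models
# use calibrated neighborhood rules, suitability maps, and regional constraints).
import numpy as np

rng = np.random.default_rng(42)
VACANT, URBAN = 0, 1
grid = (rng.random((50, 50)) < 0.02).astype(int)  # a few initial urban seeds


def step(grid: np.ndarray) -> np.ndarray:
    """One transition: vacant cells near urban cells may urbanize."""
    n, m = grid.shape
    padded = np.pad(grid, 1)
    # Count urban neighbors in the 3x3 Moore neighborhood via shifted slices.
    neighbors = sum(
        padded[1 + di : 1 + di + n, 1 + dj : 1 + dj + m]
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        if (di, dj) != (0, 0)
    )
    p_convert = 0.05 * neighbors  # conversion chance grows with urbanized neighbors
    new_urban = (grid == VACANT) & (rng.random(grid.shape) < p_convert)
    return np.where(new_urban, URBAN, grid)


for year in range(20):
    grid = step(grid)
print(f"urban cells after 20 steps: {int(grid.sum())} of {grid.size}")
```

Even this stripped-down rule produces clustered, self-organizing growth around the initial seeds, which is the qualitative behavior the book's far richer models build on.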

Constructing Valid Geospatial Tools for Environmental Justice


Report from the National Academies of Sciences, Engineering, and Medicine: “Decades of research have shown that the most disadvantaged communities exist at the intersection of high levels of hazard exposure, racial and ethnic marginalization, and poverty.

Mapping and geographic information systems have been crucial for analyzing the environmental burdens of marginalized communities, and several federal and state geospatial tools have emerged to help address environmental justice concerns — such as the Climate and Economic Justice Screening Tool, developed in 2022 in response to the Biden administration’s Justice40 Initiative.

Constructing Valid Geospatial Tools for Environmental Justice, a new report from the National Academies of Sciences, Engineering, and Medicine, offers recommendations for developing environmental justice tools that reflect the experiences of the communities they measure.

The report recommends data strategies focused on community engagement, validation, and documentation. It emphasizes using a structured development process and offers guidance for selecting and assessing indicators, integrating indicators, and incorporating cumulative impact scoring. Tool developers should choose measures of economic burden beyond the federal poverty level that account for additional dimensions of wealth and geographic variations in cost of living. They should also use indicators that measure the impacts of racism in policies and practices that have led to current disparities…(More)”.
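
A common pattern in such screening tools is to put each indicator on a comparable scale (often a percentile rank) before combining them into a cumulative score. The sketch below shows that generic pattern only; the indicators, aggregation rule, and cutoff are invented for illustration and are not the report's recommendations or any existing tool's method.

```python
# Illustrative cumulative-impact scoring for a hypothetical screening tool.
# Indicator names, values, and the threshold are made up; real tools define
# their own indicators, burden thresholds, and flagging rules.
import pandas as pd

tracts = pd.DataFrame({
    "tract_id": ["A", "B", "C", "D"],
    "pm25": [9.1, 12.4, 7.8, 11.0],                    # air-quality burden
    "flood_risk": [0.10, 0.45, 0.05, 0.30],            # share of area at risk
    "housing_cost_burden": [0.28, 0.51, 0.22, 0.40],   # share of cost-burdened households
})

indicators = ["pm25", "flood_risk", "housing_cost_burden"]

# Convert each indicator to a percentile rank so different units are comparable.
for col in indicators:
    tracts[f"{col}_pct"] = tracts[col].rank(pct=True)

# Cumulative impact here is simply the mean percentile across indicators;
# many tools instead count how many indicators exceed a burden threshold.
tracts["cumulative_score"] = tracts[[f"{c}_pct" for c in indicators]].mean(axis=1)
tracts["flagged"] = tracts["cumulative_score"] >= 0.75  # hypothetical cutoff

print(tracts[["tract_id", "cumulative_score", "flagged"]])
```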

Governing mediation in the data ecosystem: lessons from media governance for overcoming data asymmetries


Chapter by Stefaan Verhulst in Handbook of Media and Communication Governance edited by Manuel Puppis, Robin Mansell, and Hilde Van den Bulck: “The internet and the accompanying datafication were heralded as ushering in a golden era of disintermediation. Instead, the modern data ecology witnessed a process of remediation, or ‘hyper-mediation’, resulting in governance challenges, many of which underlie broader socioeconomic difficulties. In particular, the rise of data asymmetries and silos creates new forms of scarcity and dominance, with deleterious political, economic and cultural consequences. Responding to these challenges requires a new data governance framework, focused on unlocking data and developing a more pluralistic data ecosystem. We argue for regulation and policy focused on promoting data collaboratives, an emerging form of cross-sectoral partnership; and on the establishment of data stewards, individuals or groups tasked with managing and responsibly sharing organizations’ data assets. Some regulatory steps are discussed, along with the various ways in which these two emerging stakeholders can help alleviate data scarcities and their associated problems…(More)”

Using AI to Map Urban Change


Brief by Tianyuan Huang, Zejia Wu, Jiajun Wu, Jackelyn Hwang, Ram Rajagopal: “Cities are constantly evolving, and better understanding those changes facilitates better urban planning and infrastructure assessments and leads to more sustainable social and environmental interventions. Researchers currently use data such as satellite imagery to study changing urban environments and what those changes mean for public policy and urban design. But flaws in the current approaches, such as inadequately granular data, limit their scalability and their potential to inform public policy across social, political, economic, and environmental issues.

Street-level images offer an alternative source of insights. These images are frequently updated and high-resolution. They also directly capture what’s happening on a street level in a neighborhood or across a city. Analyzing street-level images has already proven useful to researchers studying socioeconomic attributes and neighborhood gentrification, both of which are essential pieces of information in urban design, sustainability efforts, and public policy decision-making for cities. Yet, much like other data sources, street-level images present challenges: accessibility limits, shadow and lighting issues, and difficulties scaling up analysis.

To address these challenges, our paper “CityPulse: Fine-Grained Assessment of Urban Change with Street View Time Series” introduces a multicity dataset of labeled street-view images and proposes a novel artificial intelligence (AI) model to detect urban changes such as gentrification. We demonstrate the change-detection model’s effectiveness by testing it on images from Seattle, Washington, and show that it can provide important insights into urban changes over time and at scale. Our data-driven approach has the potential to allow researchers and public policy analysts to automate and scale up their analysis of neighborhood and citywide socioeconomic change…(More)”.
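
Change detection of this kind is typically framed as comparing two street-view images of the same location captured at different times. The sketch below shows one generic way to do that with a pretrained image encoder and a small classification head over the paired embeddings; it is a simplified stand-in using assumed components, not the CityPulse model itself.

```python
# Generic street-view change detection: score whether two images of the same
# location, taken years apart, show a changed scene. Simplified illustration,
# not the CityPulse architecture.
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

encoder = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
encoder.fc = nn.Identity()   # keep the 512-d embedding, drop the classifier
encoder.eval()

change_head = nn.Sequential(  # would be trained separately on labeled image pairs
    nn.Linear(512 * 2, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)


def change_score(img_before: Image.Image, img_after: Image.Image) -> float:
    """Return a probability-like score that the scene has changed."""
    with torch.no_grad():
        a = encoder(preprocess(img_before).unsqueeze(0))
        b = encoder(preprocess(img_after).unsqueeze(0))
        logit = change_head(torch.cat([a, b], dim=1))
    return torch.sigmoid(logit).item()
```

Run over a time series of images for each location, scores like this can flag when and where a neighborhood changed, which is the kind of signal the paper then relates to gentrification and other socioeconomic shifts.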