AI firms must play fair when they use academic data in training


Nature Editorial: “But others are worried about principles such as attribution, the currency by which science operates. Fair attribution is a condition of reuse under CC BY, a commonly used open-access copyright license. In jurisdictions such as the European Union and Japan, there are exemptions to copyright rules that cover factors such as attribution — for text and data mining in research using automated analysis of sources to find patterns, for example. Some scientists see LLM data-scraping for proprietary LLMs as going well beyond what these exemptions were intended to achieve.

In any case, attribution is impossible when a large commercial LLM uses millions of sources to generate a given output. But when developers create AI tools for use in science, a method known as retrieval-augmented generation could help. This technique doesn’t apportion credit to the data that trained the LLM, but does allow the model to cite papers that are relevant to its output, says Lucy Lu Wang, an AI researcher at the University of Washington in Seattle.

Giving researchers the ability to opt out of having their work used in LLM training could also ease their worries. Creators have this right under EU law, but it is tough to enforce in practice, says Yaniv Benhamou, who studies digital law and copyright at the University of Geneva. Firms are devising innovative ways to make it easier. Spawning, a start-up company in Minneapolis, Minnesota, has developed tools to allow creators to opt out of data scraping. Some developers are also getting on board: OpenAI’s Media Manager tool, for example, allows creators to specify how their works can be used by machine-learning algorithms…(More)”.
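
The retrieval-augmented generation idea mentioned in this excerpt can be illustrated with a minimal sketch. It is not the method of any particular tool: the three-paper corpus, the identifiers, the bag-of-words cosine scoring and the function names are all assumptions made for illustration, and the generation step is replaced by a stub. The point is only that the cited sources come from the retrieval step, not from the data that trained the model.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus; identifiers and texts are invented for illustration.
PAPERS = [
    {"id": "paper-A", "text": "Open access licenses such as CC BY require attribution on reuse."},
    {"id": "paper-B", "text": "Text and data mining exemptions in copyright law for research."},
    {"id": "paper-C", "text": "Handwritten digit recognition with neural networks."},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, papers, k=2):
    """Return up to k papers that share at least one term with the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p["text"]))), p) for p in papers]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def answer_with_citations(query, papers, k=2):
    """Stub for the generation step: a real system would pass the retrieved
    text to a language model; here we only return the context and citations."""
    sources = retrieve(query, papers, k)
    return {
        "context": " ".join(p["text"] for p in sources),
        "citations": [p["id"] for p in sources],
    }


result = answer_with_citations("attribution under CC BY licenses", PAPERS)
print(result["citations"])
```

A production system would likely use learned embeddings and a vector index instead of word counts, but the citation logic would have the same shape.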

The Imperial Origins of Big Data


Blog and book by Asheesh Kapur Siddique: “We live in a moment of massive transformation in the nature of information. In 2020, according to one report, users of the Internet created 64.2 zettabytes of data, a quantity greater than the “number of detectable stars in the cosmos,” a colossal increase whose origins can be traced to the emergence of the World Wide Web in 1993. Facilitated by technologies like satellites, smartphones, and artificial intelligence, the scale and speed of data creation seems like it may only balloon over the rest of our lifetimes, and with it, the problem of how to govern ourselves in relation to the inequalities and opportunities that the explosion of data creates.

But while much about our era of big data is indeed revolutionary, the political questions that it raises—How should information be used? Who should control it? And how should it be preserved?—are ones with which societies have long grappled. These questions attained a particular importance in Europe from the eleventh century due to a technological change no less significant than the ones we are witnessing today: the introduction of paper into Europe. Initially invented in China, paper travelled to Europe via the conduit of Islam around the eleventh century after the Moors conquered Spain. Over the twelfth, thirteenth, and fourteenth centuries, paper emerged as the fundamental substrate which politicians, merchants, and scholars relied on to record and circulate information in governance, commerce, and learning. At the same time, governing institutions sought to preserve and control the spread of written information through the creation of archives: repositories where they collected, organized, and stored documents.

The expansion of European polities overseas from the late fifteenth century onward saw governments massively scale up their use of paper and confront the challenge of controlling its dissemination across thousands of miles of ocean and land. These pressures were felt particularly acutely in what eventually became the largest empire in world history, the British Empire. As people from the British Isles from the early seventeenth century fought, traded, and settled their way to power in the Atlantic world and South Asia, administrators faced the problem of how to govern both their emigrating subjects and the non-British peoples with whom they interacted. This meant collecting information about their behavior through the technology of paper. Just as we struggle to organize, search, and control our email boxes, text messages, and app notifications, so too did these early moderns confront the attendant challenges of developing practices of collection and storage to manage the resulting information overload. And despite the best efforts of states and companies to control information, it constantly escaped their grasp, falling into the hands of their opponents and rivals who deployed it to challenge and contest ruling powers.

The history of the early modern information state offers no simple or straightforward answers to the questions that data raises for us today. But it does remind us of a crucial truth, all too readily obscured by the deluge of popular narratives glorifying technological innovation: that questions of data are inherently questions about politics—about who gets to collect, control, and use information, and the ends to which information should be put. We should resist any effort to insulate data governance from democratic processes—and having an informed perspective on the politics of data requires that we attend not just to its present, but also to its past…(More)”.

Even laypeople use legalese


Paper by Eric Martínez, Francis Mollica and Edward Gibson: “Whereas principles of communicative efficiency and legal doctrine dictate that laws be comprehensible to the common world, empirical evidence suggests legal documents are largely incomprehensible to lawyers and laypeople alike. Here, a corpus analysis (n = 59 million words) first replicated and extended prior work revealing laws to contain strikingly higher rates of complex syntactic structures relative to six baseline genres of English. Next, two preregistered text generation experiments (n = 286) tested two leading hypotheses regarding how these complex structures enter into legal documents in the first place. In line with the magic spell hypothesis, we found people tasked with writing official laws wrote in a more convoluted manner than when tasked with writing unofficial legal texts of equivalent conceptual complexity. Contrary to the copy-and-edit hypothesis, we did not find evidence that people editing a legal document wrote in a more convoluted manner than when writing the same document from scratch. From a cognitive perspective, these results suggest law to be a rare exception to the general tendency in human language toward communicative efficiency. In particular, these findings indicate law’s complexity to be derived from its performativity, whereby low-frequency structures may be inserted to signal law’s authoritative, world-state-altering nature, at the cost of increased processing demands on readers. From a law and policy perspective, these results suggest that the tension between the ubiquity and impenetrability of the law is not an inherent one, and that laws can be simplified without a loss or distortion of communicative content…(More)”.

When A.I.’s Output Is a Threat to A.I. Itself


Article by Aatish Bhatia: “The internet is becoming awash in words and images generated by artificial intelligence.

Sam Altman, OpenAI’s chief executive, wrote in February that the company generated about 100 billion words per day — a million novels’ worth of text, every day, an unknown share of which finds its way onto the internet.

A.I.-generated text may show up as a restaurant review, a dating profile or a social media post. And it may show up as a news article, too: NewsGuard, a group that tracks online misinformation, recently identified over a thousand websites that churn out error-prone A.I.-generated news articles.

In reality, with no foolproof methods to detect this kind of content, much will simply remain undetected.

All this A.I.-generated information can make it harder for us to know what’s real. And it also poses a problem for A.I. companies. As they trawl the web for new data to train their next models on — an increasingly challenging task — they’re likely to ingest some of their own A.I.-generated content, creating an unintentional feedback loop in which what was once the output from one A.I. becomes the input for another.

In the long run, this cycle may pose a threat to A.I. itself. Research has shown that when generative A.I. is trained on a lot of its own output, it can get a lot worse.

Here’s a simple illustration of what happens when an A.I. system is trained on its own output, over and over again:

This is part of a data set of 60,000 handwritten digits.

When we trained an A.I. to mimic those digits, its output looked like this.

This new set was made by an A.I. trained on the previous A.I.-generated digits. What happens if this process continues?

After 20 generations of training new A.I.s on their predecessors’ output, the digits blur and start to erode.

After 30 generations, they converge into a single shape.

While this is a simplified example, it illustrates a problem on the horizon.

Imagine a medical-advice chatbot that lists fewer diseases that match your symptoms, because it was trained on a narrower spectrum of medical knowledge generated by previous chatbots. Or an A.I. history tutor that ingests A.I.-generated propaganda and can no longer separate fact from fiction…(More)”.
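
The feedback loop described in this excerpt can be mimicked with a toy simulation. This is not the handwritten-digit experiment from the article. It is a much simpler stand-in in which the "model" is a normal distribution fitted to a small sample, and each generation is trained only on samples drawn from the previous generation's fit. The function name, sample size, generation count and seed are all illustrative assumptions.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly fit a normal distribution to data sampled from the
    previous fit, and record the fitted spread at each generation."""
    rng = random.Random(seed)
    # Generation 0: "real" data drawn from a standard normal distribution.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spreads = []
    for _ in range(generations + 1):
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)  # fitted spread of this generation
        spreads.append(sigma)
        # The next generation only ever sees output of the current model.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return spreads


spreads = simulate_collapse()
print(f"spread fitted to real data: {spreads[0]:.3f}")
print(f"spread after {len(spreads) - 1} generations: {spreads[-1]:.3e}")
```

With small samples, each refit tends to slightly underestimate the spread, and those losses compound, so the fitted distribution narrows toward a single point. That is a rough analogue of the digits converging into one shape.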

The Power of Supercitizens


Blog by Brian Klaas: “Lurking among us, there are a group of hidden heroes, people who routinely devote significant amounts of their time, energy, and talent to making our communities better. These are the devoted, do-gooding, elite one percent. Most, but not all, are volunteers. All are selfless altruists. They, the supercitizens, provide some of the stickiness in the social glue that holds us together.

What if I told you that there’s this little trick you can do that makes your community stronger, helps other people, and makes you happier and live longer? Well, it exists, there’s ample evidence it works, and best of all, it’s free.

Recently published research showcases a convincing causal link between these supercitizens—devoted, regular volunteers—and social cohesion. While such an umbrella term means a million different things, these researchers focused on two UK-based surveys that analyzed three facets of social cohesion, measured through eight questions (respondents answered on a five point scale, ranging from strongly disagree to strongly agree). They were:


Neighboring

  • ‘If I needed advice about something I could go to someone in my neighborhood’;
  • ‘I borrow things and exchange favors with my neighbors’; and
  • ‘I regularly stop and talk with people in my neighborhood’

Psychological sense of community

  • ‘I feel like I belong to this neighborhood’;
  • ‘The friendships and associations I have with other people in my neighborhood mean a lot to me’;
  • ‘I would be willing to work together with others on something to improve my neighborhood’; and
  • ‘I think of myself as similar to the people that live in this neighborhood’

Attraction to the neighborhood

  • ‘I plan to remain a resident of this neighborhood for a number of years’

While these questions only tap into some specific components of social cohesion, high levels of these ingredients are likely to produce a reliable recipe for a healthy local community. (Social cohesion differs from social capital, popularized by Robert Putnam and his book, Bowling Alone. Social capital tends to focus on links between individuals and groups—are you a joiner or more of a loner?—whereas cohesion refers to a more diffuse sense of community, belonging, and neighborliness)…(More)”.

Policy for responsible use of AI in government


Policy by the Australian Government: “The Policy for the responsible use of AI in government ensures that government plays a leadership role in embracing AI for the benefit of Australians while ensuring its safe, ethical and responsible use, in line with community expectations. The policy:

  • provides a unified approach for government to engage with AI confidently, safely and responsibly, and realise its benefits
  • aims to strengthen public trust in government’s use of AI by providing enhanced transparency, governance and risk assurance
  • aims to embed a forward leaning, adaptive approach for government’s use of AI that is designed to evolve and develop over time…(More)”.

Policy Fit for the Future


Primer by the Australian Government: “The Futures Primer is part of the “Policy Fit for the Future” project, building Australian Public Service capability to use futures techniques in policymaking through horizon scanning, visioning and scenario planning. These tools help anticipate and navigate future risks and opportunities.

The Futures Primer offers a range of flexible tools and advice that can be adapted to any policy challenge. It reflects the views of global experts in futures and strategic foresight, both within and outside the APS…(More)”.

The Power of Volunteers: Remote Mapping Gaza and Strategies in Conflict Areas


Blog by Jessica Pechmann: “…In Gaza, increased conflict since October 2023 has caused a prolonged humanitarian crisis. Understanding the impact of the conflict on buildings has been challenging, since pre-existing datasets from artificial intelligence and machine learning (AI/ML) models and OSM were not accurate enough to create a full building footprint baseline. The area’s buildings were too dense, and information on the ground was impossible to collect safely. In these hard-to-reach areas, HOT’s remote and crowdsourced mapping methodology was a good fit for collecting detailed information visible on aerial imagery.

In February 2024, after consultation with humanitarian and UN actors working in Gaza, HOT decided to create a pre-conflict dataset of all building footprints in the area in OSM. HOT’s community of OpenStreetMap volunteers did all the data work, coordinating through HOT’s Tasking Manager. The volunteers made meticulous edits to add missing data and to improve existing data. Due to protection and data quality concerns, only expert volunteer teams were assigned to map and validate the area. As in other areas that are hard to reach due to conflict, HOT balanced the data needs with responsible data practices based on the context.

Comparing AI/ML with human-verified OSM building datasets in conflict zones

AI/ML is becoming an increasingly common and quick way to obtain building footprints across large areas. Sources for automated building footprints range from worldwide datasets by Microsoft or Google to smaller-scale open community-managed tools such as HOT’s new application, fAIr.

Now that HOT volunteers have completely updated and validated all OSM buildings in visible imagery pre-conflict, OSM has 18% more individual buildings in the Gaza Strip than Microsoft’s ML buildings dataset (estimated 330,079 buildings vs 280,112 buildings). However, in contexts where there has not been a coordinated update effort in OSM, the numbers may differ. For example, in Sudan, where there has not been a large organized editing campaign, there are just under 1,500,000 buildings in OSM, compared to over 5,820,000 buildings in Microsoft’s ML data. It is important to note that the ML datasets have not been human-verified and their accuracy is not known. Google Open Buildings has over 26 million building features in Sudan, but on visual inspection, many of these features are noise in the data that the model incorrectly identified as buildings in the uninhabited desert…(More)”.
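
The Gaza comparison quoted above can be checked with a few lines of arithmetic. The helper name is an assumption. The Sudan figures are given in the excerpt only as rounded bounds ("just under" and "over"), so the ratio is approximate.

```python
def percent_more(a, b):
    """Return how much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100


# Gaza: estimated building counts quoted in the excerpt.
gaza_osm, gaza_microsoft = 330_079, 280_112
print(f"OSM has {percent_more(gaza_osm, gaza_microsoft):.1f}% more buildings")

# Sudan: rounded bounds from the excerpt, so this ratio is approximate.
sudan_osm, sudan_microsoft = 1_500_000, 5_820_000
print(f"Microsoft lists roughly {sudan_microsoft / sudan_osm:.1f}x as many buildings")
```

The first figure rounds to the 18% stated in the text.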

Under which conditions can civic monitoring be admitted as a source of evidence in courts?


Blog by Anna Berti Suman: “The ‘Sensing for Justice’ (SensJus) research project – running between 2020 and 2023 – explored how people use monitoring technologies or just their senses to gather evidence of environmental issues and claim environmental justice in a variety of fora. Among the other research lines, we looked at successful and failed cases of civic-gathered data introduced in courts. The guiding question was: what are the enabling factors and/or barriers for the introduction of civic evidence in environmental litigation?

Civic environmental monitoring is the use by ordinary people of monitoring devices (e.g., a sensor) or their bare senses (e.g., smell, hearing) to detect environmental issues. It can be regarded as a form of reaction to environmental injustices, a form of political contestation through data and even as a form of collective care. The practice is fast growing, especially thanks to the widespread availability of audio and video-recording devices in the hands of diverse publics, but also due to the increase in public literacy and concern about environmental matters.

Civic monitoring can be a powerful source of evidence for law enforcement, especially when it sheds light on official informational gaps associated with the shortages of public agencies’ resources to detect environmental wrongdoings. Both legal scholars and practitioners as well as civil society organizations and institutional actors should look at the practice and its potential applications with attention.

Among the cases explored for the SensJus project, the Formosa case, Texas, United States, stands out as it sets a key precedent: issued in June 2019, the landmark ruling found a Taiwanese petrochemical company liable for violating the US Clean Water Act, mostly on the basis of citizen-collected evidence involving volunteer observations of plastic contamination over years. The contamination could not be proven through existing data held by competent authorities because the company never filed any record of pollution. Our analysis of the case highlights some key determinants of the case’s success…(More)”.

Data Protection Law and Emotion


Book by Damian Clifford: “Data protection law is often positioned as a regulatory solution to the risks posed by computational systems. Despite the widespread adoption of data protection laws, however, there are those who remain sceptical as to their capacity to engender change. Much of this criticism focuses on our role as ‘data subjects’. It has been demonstrated repeatedly that we lack the capacity to act in our own best interests and, what is more, that our decisions have negative impacts on others. Our decision-making limitations seem to be the inevitable by-product of the technological, social, and economic reality. Data protection law bakes in these limitations by providing frameworks for notions such as consent and subjective control rights and by relying on those who process our data to do so fairly.

Despite these valid concerns, Data Protection Law and Emotion argues that the (in)effectiveness of these laws is often more difficult to discern than the critical literature would suggest, while also emphasizing the importance of the conceptual value of subjective control. These points are explored (and indeed, exposed) by investigating data protection law through the lens of the insights provided by law and emotion scholarship and demonstrating the role emotions play in our decision-making. The book uses the development of Emotional Artificial Intelligence, a particularly controversial technology, as a case study to analyse these issues.

Original and insightful, Data Protection Law and Emotion offers a unique contribution to a contentious debate that will appeal to students and academics in data protection and privacy, policymakers, practitioners, and regulators…(More)”.