A typology of artificial intelligence data work


Article by James Muldoon et al: “This article provides a new typology for understanding human labour integrated into the production of artificial intelligence systems through data preparation and model evaluation. We call these forms of labour ‘AI data work’ and show how they are an important and necessary element of the artificial intelligence production process. We draw on fieldwork with an artificial intelligence data business process outsourcing centre specialising in computer vision data, alongside a decade of fieldwork with microwork platforms, business process outsourcing, and artificial intelligence companies, to help dispel confusion around the multiple concepts and frames that encompass artificial intelligence data work, including ‘ghost work’, ‘microwork’, ‘crowdwork’ and ‘cloudwork’. We argue that these different frames of reference obscure important differences between how this labour is organised in different contexts. The article provides a conceptual division between the different types of artificial intelligence data work institutions and the different stages of what we call the artificial intelligence data pipeline. This article thus contributes to our understanding of how the practices of workers become a valuable commodity integrated into global artificial intelligence production networks…(More)”.

AI-enhanced Collective Intelligence: The State of the Art and Prospects


Paper by Hao Cui and Taha Yasseri: “Current societal challenges exceed the capacity of individual or collective human effort alone. As AI evolves, its role within human collectives is poised to vary from an assistive tool to a participatory member. Humans and AI possess complementary capabilities that, when synergized, can achieve a level of collective intelligence that surpasses the collective capabilities of either humans or AI in isolation. However, the interactions in human-AI systems are inherently complex, involving intricate processes and interdependencies. This review incorporates perspectives from network science to conceptualize a multilayer representation of human-AI collective intelligence, comprising a cognition layer, a physical layer, and an information layer. Within this multilayer network, humans and AI agents exhibit varying characteristics; humans differ in diversity from surface-level to deep-level attributes, while AI agents range in degrees of functionality and anthropomorphism. The interplay among these agents shapes the overall structure and dynamics of the system. We explore how agents’ diversity and interactions influence the system’s collective intelligence. Furthermore, we present an analysis of real-world instances of AI-enhanced collective intelligence. We conclude by addressing the potential challenges in AI-enhanced collective intelligence and offer perspectives on future developments in this field…(More)”.
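
To make the multilayer picture concrete, the following is a minimal sketch, assuming a simple networkx encoding in which each agent is replicated across the cognition, physical and information layers; all agents, attributes and ties are illustrative assumptions, not material from the review.

```python
# A minimal sketch, assuming a simple networkx encoding: humans and AI agents
# are replicated across a cognition, physical and information layer, with
# intra-layer interaction ties and inter-layer couplings for the same agent.
# All agents, attributes and edges below are illustrative, not from the review.
import networkx as nx

LAYERS = ["cognition", "physical", "information"]

# Hypothetical agents: humans differ in surface- vs deep-level attributes,
# AI agents in functionality and anthropomorphism.
agents = {
    "human_1": {"kind": "human", "diversity": "surface-level"},
    "human_2": {"kind": "human", "diversity": "deep-level"},
    "ai_1": {"kind": "ai", "functionality": 0.9, "anthropomorphism": 0.3},
}

G = nx.Graph()
for layer in LAYERS:
    for name, attrs in agents.items():
        G.add_node((layer, name), layer=layer, **attrs)  # node = (layer, agent)

# Intra-layer tie: human_1 and ai_1 exchange information.
G.add_edge(("information", "human_1"), ("information", "ai_1"), kind="intra")

# Inter-layer couplings: link each agent's replicas across adjacent layers.
for name in agents:
    for a, b in zip(LAYERS, LAYERS[1:]):
        G.add_edge((a, name), (b, name), kind="inter")

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```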

Algorithmic attention rents: A theory of digital platform market power


Paper by Tim O’Reilly, Ilan Strauss and Mariana Mazzucato: “We outline a theory of algorithmic attention rents in digital aggregator platforms. We explore how, as platforms grow, they become increasingly capable of extracting rents from a variety of actors in their ecosystems—users, suppliers, and advertisers—through their algorithmic control over user attention. We focus our analysis on advertising business models, in which attention harvested from users is monetized by reselling it to suppliers or other advertisers, though we believe the theory has relevance to other online business models as well. We argue that regulations should mandate the disclosure of the operating metrics that platforms use to allocate user attention and shape the “free” side of their marketplace, as well as details on how that attention is monetized…(More)”.

Making Sense of Citizens’ Input through Artificial Intelligence: A Review of Methods for Computational Text Analysis to Support the Evaluation of Contributions in Public Participation


Paper by Julia Romberg and Tobias Escher: “Public sector institutions that consult citizens to inform decision-making face the challenge of evaluating the contributions made by citizens. This evaluation has important democratic implications but, at the same time, consumes substantial human resources. However, until now the use of artificial intelligence such as computer-supported text analysis has remained an under-studied solution to this problem. We identify three generic tasks in the evaluation process that could benefit from natural language processing (NLP). Based on a systematic literature search in two databases on computational linguistics and digital government, we provide a detailed review of existing methods and their performance. While some promising approaches exist, for instance to group data thematically and to detect arguments and opinions, we show that important challenges remain before these could offer any reliable support in practice. These include the quality of results, the applicability to non-English corpora and making algorithmic models available to practitioners through software. We discuss a number of avenues for future research that can ultimately lead to solutions for practice. The most promising of these bring in the expertise of human evaluators, for example through active learning approaches or interactive topic modeling…(More)”.
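
As a rough illustration of the first of these tasks, thematic grouping, the sketch below clusters a few invented citizen comments with TF-IDF features and k-means; it is an editorial example, not a method taken from the paper, which reviews far richer topic-modelling and argument-mining approaches.

```python
# A minimal sketch (not from the paper) of one task the review covers:
# grouping citizens' contributions thematically, here with TF-IDF features
# and k-means clustering. The comments below are hypothetical examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

contributions = [
    "More protected bike lanes on the main road, please.",
    "Cycling feels unsafe; we need separated bike paths.",
    "The park needs more benches and shade trees.",
    "Plant more trees along the riverside park.",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(contributions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, text in sorted(zip(labels, contributions)):
    print(label, text)
```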

Synthetic Data and the Future of AI


Paper by Peter Lee: “The future of artificial intelligence (AI) is synthetic. Several of the most prominent technical and legal challenges of AI derive from the need to amass huge amounts of real-world data to train machine learning (ML) models. Collecting such real-world data can be highly difficult and can threaten privacy, introduce bias in automated decision making, and infringe copyrights on a massive scale. This Article explores the emergence of a seemingly paradoxical technical creation that can mitigate—though not completely eliminate—these concerns: synthetic data. Increasingly, data scientists are using simulated driving environments, fabricated medical records, fake images, and other forms of synthetic data to train ML models. Artificial data, in other words, is being used to train artificial intelligence. Synthetic data offers a host of technical and legal benefits; it promises to radically decrease the cost of obtaining data, sidestep privacy issues, reduce automated discrimination, and avoid copyright infringement. Alongside such promise, however, synthetic data offers perils as well. Deficiencies in the development and deployment of synthetic data can exacerbate the dangers of AI and cause significant social harm.

In light of the enormous value and importance of synthetic data, this Article sketches the contours of an innovation ecosystem to promote its robust and responsible development. It identifies three objectives that should guide legal and policy measures shaping the creation of synthetic data: provisioning, disclosure, and democratization. Ideally, such an ecosystem should incentivize the generation of high-quality synthetic data, encourage disclosure of both synthetic data and processes for generating it, and promote multiple sources of innovation. This Article then examines a suite of “innovation mechanisms” that can advance these objectives, ranging from open source production to proprietary approaches based on patents, trade secrets, and copyrights. Throughout, it suggests policy and doctrinal reforms to enhance innovation, transparency, and democratic access to synthetic data. Just as AI will have enormous legal implications, law and policy can play a central role in shaping the future of AI…(More)”.
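
For readers unfamiliar with the underlying technique, the toy sketch below samples artificial records from a distribution fitted to a handful of invented records; it only gestures at the idea, since production synthetic-data pipelines rely on far more sophisticated generative models and formal privacy guarantees.

```python
# A toy sketch of the core idea, not a production synthesizer: fit a
# multivariate normal to a handful of hypothetical "real" numeric records
# and sample artificial rows that preserve aggregate structure (means,
# correlations) without reproducing any individual record.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "real" records: (age, systolic blood pressure).
real = np.array([[34, 118], [51, 132], [47, 128], [29, 115], [63, 141]], float)

mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

synthetic = rng.multivariate_normal(mean, cov, size=1000)
print("real mean:     ", mean.round(1))
print("synthetic mean:", synthetic.mean(axis=0).round(1))
```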

Prompting Diverse Ideas: Increasing AI Idea Variance


Paper by Lennart Meincke, Ethan Mollick, and Christian Terwiesch: “Unlike routine tasks where consistency is prized, in creativity and innovation the goal is to create a diverse set of ideas. This paper delves into the burgeoning interest in employing Artificial Intelligence (AI) to enhance the productivity and quality of the idea generation process. While previous studies have found that the average quality of AI ideas is quite high, prior research has also pointed to the inability of AI-based brainstorming to create sufficient dispersion of ideas, which limits novelty and the quality of the overall best idea. Our research investigates methods to increase the dispersion in AI-generated ideas. Using GPT-4, we explore the effect of different prompting methods on Cosine Similarity, the number of unique ideas, and the speed with which the idea space gets exhausted. We do this in the domain of developing a new product for college students, priced under $50. In this context, we find that (1) pools of ideas generated by GPT-4 with various plausible prompts are less diverse than ideas generated by groups of human subjects, (2) the diversity of AI-generated ideas can be substantially improved using prompt engineering, and (3) Chain-of-Thought (CoT) prompting leads to the highest diversity of ideas of all prompts we evaluated and was able to come close to what is achieved by groups of human subjects. It was also capable of generating the highest number of unique ideas of any prompt we studied…(More)”.
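
The paper’s headline diversity measure, mean pairwise cosine similarity across a pool of ideas, can be sketched in a few lines; in the sketch below, TF-IDF vectors stand in for the embedding model and the example ideas are hypothetical.

```python
# A minimal sketch of the paper's central diversity measure: mean pairwise
# cosine similarity across a pool of ideas (lower similarity = more diverse).
# TF-IDF vectors stand in for the embedding model; the ideas are made up.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ideas = [
    "A collapsible laundry hamper that clips onto a dorm bed frame.",
    "A noise-masking clip-on desk fan for shared study spaces.",
    "A reusable smart notebook that syncs handwritten notes to the cloud.",
]

sim = cosine_similarity(TfidfVectorizer().fit_transform(ideas))
off_diagonal = sim[~np.eye(len(ideas), dtype=bool)]  # drop self-similarity of 1.0
print("mean pairwise cosine similarity:", off_diagonal.mean().round(3))
```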

Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy


Paper by Philipp Schoenegger, Indre Tuminauskaite, Peter S. Park, and Philip E. Tetlock: “Human forecasting accuracy in practice relies on the ‘wisdom of the crowd’ effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is not statistically different from the human crowd. In exploratory analyses, we find that these two approaches are equivalent with respect to medium-effect-size equivalence bounds. We also observe an acquiescence effect, with mean model predictions being significantly above 50%, despite an almost even split of positive and negative resolutions. Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models’ forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%, though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments via the simple, practically applicable method of forecast aggregation. This replicates the ‘wisdom of the crowd’ effect for LLMs, and opens up their use for a variety of applications throughout society…(More)”.
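
The aggregation step itself is straightforward; the sketch below, with made-up numbers, takes the median of twelve model forecasts for a single binary question and then blends it with a hypothetical human-crowd median, loosely mirroring the two study designs.

```python
# A minimal sketch of the aggregation described above, with made-up numbers:
# the "silicon crowd" forecast is the median of twelve model probabilities
# for one binary question; Study 2's blend averages machine and human views.
import numpy as np

llm_forecasts = np.array([0.62, 0.70, 0.55, 0.66, 0.58, 0.73,
                          0.61, 0.64, 0.69, 0.57, 0.60, 0.68])  # twelve LLMs

crowd_forecast = np.median(llm_forecasts)

human_median = 0.48  # hypothetical human-crowd median for the same question
blended = (crowd_forecast + human_median) / 2

print(f"LLM crowd: {crowd_forecast:.2f}  blended with humans: {blended:.2f}")
```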

Unconventional data, unprecedented insights: leveraging non-traditional data during a pandemic


Paper by Kaylin Bolt et al: “The COVID-19 pandemic prompted new interest in non-traditional data sources to inform response efforts and mitigate knowledge gaps. While non-traditional data offers some advantages over traditional data, it also raises concerns related to biases, representativity, informed consent and security vulnerabilities. This study focuses on three specific types of non-traditional data: mobility, social media, and participatory surveillance platform data. Qualitative results are presented on the successes, challenges, and recommendations of key informants who used these non-traditional data sources during the COVID-19 pandemic in Spain and Italy….

Non-traditional data proved valuable in providing rapid results and filling data gaps, especially when traditional data faced delays. Increased data access and innovative collaborative efforts across sectors facilitated its use. Challenges included unreliable access and data quality concerns, particularly the lack of comprehensive demographic and geographic information. To further leverage non-traditional data, participants recommended prioritizing data governance, establishing data brokers, and sustaining multi-institutional collaborations. Non-traditional data was perceived as underutilized in public health surveillance, program evaluation and policymaking. Participants saw opportunities to integrate these data sources into public health systems with the necessary investments in data pipelines, infrastructure, and technical capacity…(More)”.

A complexity science approach to law and governance


Introduction to a Special Issue by Pierpaolo Vivo, Daniel M. Katz and J. B. Ruhl: “The premise of this Special Issue is that legal systems are complex adaptive systems, and thus complexity science can be usefully applied to improve understanding of how legal systems operate, perform and change over time. The articles that follow take this proposition as a given and act on it using a variety of methods applied to a broad array of legal system attributes and contexts. Yet not too long ago some prominent legal scholars expressed scepticism that this field of study would produce more than broad generalizations, if even that. To orient readers unfamiliar with this field and its history, here we offer a brief background on how using complexity science to study legal systems has advanced from claims of ‘pseudoscience’ status to a widely adopted mainstream method. We then situate and summarize the articles.

The focus of complexity science is complex adaptive systems (CAS), systems ‘in which large networks of components with no central control and simple rules of operation give rise to complex collective behavior, sophisticated information processing and adaptation via learning or evolution’. It is important to distinguish CAS from systems that are merely complicated, such as a combustion engine, or complex but non-adaptive, such as a hurricane. A forest or coastal ecosystem, for example, is a complicated network of diverse physical and biological components, which, under no central rules of control, is highly adaptive over time…(More)”.
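
For readers new to the concept, the toy example below, an editorial illustration rather than material from the special issue, shows the quoted definition in action: many components following one simple local rule, with no central controller, produce non-trivial collective behaviour.

```python
# A toy, editorial illustration of the CAS definition quoted above: many cells
# follow one simple local rule with no central controller, yet the collective
# pattern is non-trivial. Elementary cellular automaton rule 110 is used here.
RULE_110 = {  # maps each 3-cell neighbourhood to the cell's next state
    (1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
    (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0,
}

cells = [0] * 40 + [1] + [0] * 40  # a single active cell in the middle
for _ in range(20):
    print("".join("#" if c else "." for c in cells))
    cells = [RULE_110[(cells[i - 1], cells[i], cells[(i + 1) % len(cells)])]
             for i in range(len(cells))]
```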

Blockchain and public service delivery: a lifetime cross-referenced model for e-government


Paper by Maxat Kassen: “The article presents the results of field studies, analysing the perspectives of blockchain developers on decentralised service delivery and elaborating on unique algorithms for lifetime ledgers to reliably and safely record e-government transactions in an intrinsically cross-referenced manner. Interesting new technological niches of service delivery and emerging models of related data management in the industry were proposed and further elaborated, such as the generation of unique lifetime personal data profiles, blockchain-driven cross-referencing of e-government metadata, parallel maintenance of serviceable ledgers for data identifiers, and the phenomenon of blockchain ‘black holes’ to ensure reliable protection of important public, corporate and civic information…(More)”.
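
A minimal sketch of the general idea, hash-chained records in a citizen’s lifetime ledger carrying cross-references to earlier entries, appears below; it is an editorial approximation of the cross-referenced record-keeping the paper describes, not the author’s algorithm.

```python
# A minimal sketch, not the author's model: each transaction in a citizen's
# lifetime ledger is hash-chained to the previous entry and can carry
# cross-references to related records, approximating the "intrinsically
# cross-referenced" record-keeping described in the paper.
import hashlib
import json

def add_record(ledger, payload, cross_refs=()):
    """Append a record whose hash covers the payload, refs and previous hash."""
    prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
    body = {"payload": payload, "cross_refs": list(cross_refs), "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    ledger.append({**body, "hash": digest})
    return digest

citizen_ledger = []
birth = add_record(citizen_ledger, {"event": "birth registration"})
add_record(citizen_ledger, {"event": "passport issued"}, cross_refs=[birth])

print(len(citizen_ledger), "records; head hash:", citizen_ledger[-1]["hash"][:12])
```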